[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160126T0000). Please do the needful. [00:00:04] James_F AaronSchulz ebernhardson jgirault jan_drewniak bd808: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:19] here o/ [00:00:21] mega ping [00:00:33] * Krenair grumbles [00:00:35] 10 patches [00:00:54] there is an *8* patch limit [00:00:56] I'm happy to skip mine for space. [00:01:01] Even if mine were there first [00:01:03] not that I have time to do it today anyway, but still [00:01:48] I can slide mine to tomorrow AM with no harm [00:01:50] * bd808 does so [00:02:59] There is already 8 I think tomorrow AM [00:03:47] they are entirely config patches, so it's an easy deploy [00:03:49] i can just do them all [00:04:16] that's the spirit ;) [00:04:26] okay [00:05:21] (03CR) 10EBernhardson: [C: 032] VisualEditor: Provide framework for enabling an A/B test for IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258404 (owner: 10Jforrester) [00:05:25] (03CR) 10EBernhardson: [C: 032] VisualEditor: Don't set ShowBetaWelcome, now set in repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258405 (owner: 10Jforrester) [00:05:40] ebernhardson: mine can stay or go as you see fit. I would have done it this afternoon but I was busy with other stuffs [00:05:56] James_F: actually, you had a -1 on the second patch. But I'm assuming since you asked for deploy the dependency has been shipped out? [00:06:15] ebernhardson: Yeah, the C-1 is now removed. 
[00:06:17] (03Merged) 10jenkins-bot: VisualEditor: Provide framework for enabling an A/B test for IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258404 (owner: 10Jforrester) [00:06:43] (03Merged) 10jenkins-bot: VisualEditor: Don't set ShowBetaWelcome, now set in repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258405 (owner: 10Jforrester) [00:08:52] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT James_F (duration: 01m 35s) [00:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:59] ebernhardson: Nothing seems broken here. [00:10:25] (03CR) 10EBernhardson: [C: 032] filebackend: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [00:10:28] "SWAT James F"...sounds dastardly [00:10:30] !log ebernhardson@mira Synchronized wmf-config/CommonSettings.php: SWAT James_F (duration: 01m 26s) [00:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:33] (03CR) 10EBernhardson: [C: 032] Set $wgCentralAuthUseSlaves for loginwiki, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265178 (https://phabricator.wikimedia.org/T119689) (owner: 10Aaron Schulz) [00:10:36] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename cbk-zamwiki to cbkwiki - https://phabricator.wikimedia.org/T124657#1964245 (10Liuxinyu970226) [00:10:47] James_F: well, not broken is a success i suppose ;) [00:11:02] ebernhardson: :-) [00:11:13] ebernhardson: https://en.wikipedia.org/wiki/Swatting :) [00:11:16] (03Merged) 10jenkins-bot: filebackend: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [00:11:25] lol :) [00:11:39] (03Merged) 10jenkins-bot: Set $wgCentralAuthUseSlaves for loginwiki, mw.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/265178 (https://phabricator.wikimedia.org/T119689) (owner: 10Aaron Schulz) [00:11:54] 6operations, 10Wikimedia-Site-Requests: Rename cbk-zamwiki to cbkwiki - https://phabricator.wikimedia.org/T124657#1964250 (10Liuxinyu970226) [00:12:50] 6operations, 10Wikimedia-Site-Requests: Rename cbk-zamwiki to cbkwiki - https://phabricator.wikimedia.org/T124657#1961729 (10Liuxinyu970226) Sorry [00:13:32] !log ebernhardson@mira Synchronized wmf-config/filebackend-production.php: SWAT AaronSchulz (duration: 01m 26s) [00:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:14:00] AaronSchulz: first out, second is syncing now [00:14:50] (03CR) 10EBernhardson: [C: 032] Adjust cirrus titlesuggest index shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261287 (https://phabricator.wikimedia.org/T124332) (owner: 10EBernhardson) [00:14:56] (03CR) 10EBernhardson: [C: 032] Remove variables for unused experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260176 (owner: 10EBernhardson) [00:15:01] ok [00:15:05] (03CR) 10EBernhardson: [C: 032] Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [00:15:13] (03CR) 10EBernhardson: [C: 032] Add popularity_score field to cirrussearch indices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265927 (owner: 10EBernhardson) [00:15:17] !log ebernhardson@mira Synchronized wmf-config/CommonSettings.php: SWAT AaronSchulz (duration: 01m 26s) [00:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:15:31] (03Merged) 10jenkins-bot: Adjust cirrus titlesuggest index shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261287 (https://phabricator.wikimedia.org/T124332) (owner: 10EBernhardson) [00:15:56] (03Merged) 10jenkins-bot: Remove variables for unused experiment 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/260176 (owner: 10EBernhardson) [00:15:58] (03CR) 10jenkins-bot: [V: 04-1] Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [00:16:01] :S [00:16:24] oh, merge conflict with another patch i just merged...sec [00:16:52] (03Merged) 10jenkins-bot: Add popularity_score field to cirrussearch indices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265927 (owner: 10EBernhardson) [00:19:28] (03PS4) 10EBernhardson: Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) [00:19:50] (03CR) 10jenkins-bot: [V: 04-1] Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [00:20:09] * ebernhardson hates merge conflicts.. [00:21:05] (03PS5) 10EBernhardson: Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) [00:22:21] (03CR) 10EBernhardson: [C: 032] Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [00:22:58] (03Merged) 10jenkins-bot: Change CirrusSearch sharding values for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [00:25:19] !log ebernhardson@mira Synchronized wmf-config/CommonSettings.php: SWAT ebernhardson (duration: 01m 27s) [00:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:10] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. 
- https://phabricator.wikimedia.org/T124627#1964292 (10Tfinc) If we have lost a DC then we should not be doing maintenance on a node. Stability is key then. I'm us... [00:27:19] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: SWAT ebernhardson (duration: 01m 26s) [00:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:04] jgirault: you're up next [00:28:13] o/ [00:28:28] (03CR) 10EBernhardson: [C: 032] Bump portals to master (remove A/B/C test from production) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266278 (https://phabricator.wikimedia.org/T124245) (owner: 10JGirault) [00:28:54] are you cleaning varnish? [00:29:02] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT ebernhardson (duration: 01m 26s) [00:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:36] (03Merged) 10jenkins-bot: Bump portals to master (remove A/B/C test from production) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266278 (https://phabricator.wikimedia.org/T124245) (owner: 10JGirault) [00:30:57] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1964309 (10RobH) [00:32:04] !log ebernhardson@mira Synchronized portals/: SWAT jgirault (duration: 01m 28s) [00:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:32:10] jgirault: yours is out, please check [00:32:24] jgirault: might need to avoid cache headers somehow, or i can purge [00:32:42] ebernhardson: can you purge? [00:32:50] jgirault: just the main url, or assets too? [00:33:00] actually it works, just tried [00:33:15] does that mean caching is broken? :) [00:33:31] ebernhardson: I tried https://www.wikipedia.org/?abc123#abtest2 [00:33:41] ahh, ok [00:33:51] ebernhardson: we’re good ! 
[00:34:10] 6operations: Metrics not reaching Graphite - https://phabricator.wikimedia.org/T124639#1964356 (10Krinkle) 5Open>3Resolved Monitoring (mid-long term) statsv is {T117994}. Restart (one-time) has been down. Closing task (assuming that's all for now). [00:34:12] (03CR) 10EBernhardson: [C: 032] Only send warning and higher session logs to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266307 (owner: 10BryanDavis) [00:34:16] bd808: last up is you [00:34:27] 6operations, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1788765 (10Krinkle) [00:34:46] 6operations, 6Performance-Team, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1788765 (10Krinkle) [00:34:53] (03Merged) 10jenkins-bot: Only send warning and higher session logs to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266307 (owner: 10BryanDavis) [00:35:19] sweet [00:37:43] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT bd808 (duration: 01m 34s) [00:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:37:49] bd808: all synced out [00:38:38] minor jump in kibana fatalmonitor, but looks to be related to earlier patch [00:38:56] looks like sync order issue perhaps... [00:39:04] ebernhardson: looks good. Log volume for the session channel dropped as expected [00:40:33] !log ebernhardson@mira Synchronized wmf-config/CommonSettings.php: (no message) (duration: 01m 25s) [00:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:41:13] hoo: yt? 
[00:41:43] looks to have stopped with the resync, i think with that SWAT is complete [00:42:10] ori: Yes [00:43:01] (03PS4) 10Nuria: Removing code that generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) [00:44:16] hoo: I proposed a change to how test.wikipedia.org is configured in Varnish in ; bblack pointed out (correctly) that it would impact test.wikidata.org as well, so I wanted to check if that would be OK. [00:44:55] I saw that… I don't think we do anything that is impacted by that [00:45:17] someone remind me what the distinction between engineering and wikitech-l is now? [00:45:31] !log mobileapps deploying c2318b6 [00:45:33] Krenair: engineering gets read more often [00:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:45:41] ebernhardson: heh [00:45:46] Krenair: less noise on engineering [00:45:54] really? :| wow [00:46:19] Krenair: I sent it to engineering@ because I figured the change would only meaningfully impact people with shell access [00:46:41] did you announce to shell users that engineering@ was opened to subscription? [00:48:24] yeah, I have wikitech-l set to digest and engineering not set to digest [00:48:24] Krenair: No; I wasn't sure if I should. When I proposed to close the list, several people said they wanted to have a low-volume list that was scoped (in subject-matter, if not visibility) to things that pertain to staff [00:48:48] shell access is not limited to staff [00:48:58] shell access has never been limited to staff [00:49:23] what does engineering@ have to do with shell access? [00:52:37] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1964464 (10EBernhardson) Moving this into the ops column, as this is almost entirely backend infrastructure. Switching the connection from http to https between ap... 
[00:57:58] ori, ^ [00:59:17] jesus, I don't know [00:59:28] forward it to anyone you think might care [01:00:10] it is insanely exhausting to be grilled about a single e-mail like that, especially now that the list is in principle accessible to anyone [01:00:52] ori: BUT YOU MUST EXPLAIN YOURSELF! [01:02:08] (03PS1) 10Tim Landscheidt: Tools: Allow proxymanager to add and remove proxy forward entries [puppet] - 10https://gerrit.wikimedia.org/r/266448 [01:02:47] that kind of scrutiny feels hostile to me, and I think that it is ultimately counterproductive, in that it makes people more likely to avoid public lists altogether. [01:03:03] (03CR) 10Tim Landscheidt: "Tests:" [puppet] - 10https://gerrit.wikimedia.org/r/266448 (owner: 10Tim Landscheidt) [01:09:03] ori, it's about being inclusive of non-staff, I'm not trying to personally attack you [01:10:26] Clearly what we need...is another list :D [01:10:28] the reasons for keeping engineering@ don't seem particularly strong to me [01:10:33] shell-users-l :D [01:13:42] (03PS1) 10Andrew Bogott: Create /etc/mediawiki/WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/266451 (https://phabricator.wikimedia.org/T124732) [01:13:44] (03PS1) 10Andrew Bogott: Remove puppet classes and files associated with /srv/mediawiki/private/WikitechPrivateLdapSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/266452 (https://phabricator.wikimedia.org/T124732) [01:15:44] (03PS1) 10Andrew Bogott: Get wikitech private settings from a new location: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266453 (https://phabricator.wikimedia.org/T124732) [01:16:19] (03CR) 10Alex Monk: Create /etc/mediawiki/WikitechPrivateSettings.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266451 (https://phabricator.wikimedia.org/T124732) (owner: 10Andrew Bogott) [01:17:05] (03CR) 10Alex Monk: [C: 031] Remove puppet classes and files associated with /srv/mediawiki/private/WikitechPrivateLdapSettings.php [puppet] - 
10https://gerrit.wikimedia.org/r/266452 (https://phabricator.wikimedia.org/T124732) (owner: 10Andrew Bogott) [01:19:41] (03PS2) 10Andrew Bogott: Create /etc/mediawiki/WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/266451 (https://phabricator.wikimedia.org/T124732) [01:19:43] (03PS2) 10Andrew Bogott: Remove puppet classes and files associated with /srv/mediawiki/private/WikitechPrivateLdapSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/266452 (https://phabricator.wikimedia.org/T124732) [01:20:34] the Wikimedia movement has a great many communication channels, with quite a lot of overlap in terms of subject matter. I don't think that it is important that discussions like these (i.e., changes which are reversible and which carry no real political implications) reach anyone who is conceivably interested. Information gets around; public information doubly so. [01:20:40] It's more important that anyone conceivably be affected be able to retrace the thinking behind the change and respond to it if they object. [01:21:04] *anyone conceivably affected [01:22:18] I have no problem whatsoever with you forwarding it to wikitech-l if you think it would be of interest to readers of that list [01:22:38] I tried to forward it but I suppose it didn't work because I'm not subscribed from that address [01:22:54] would you like me to forward it? [01:24:21] the problem is that someone is sending something relevant to shell users to a list about staff stuff [01:24:44] people shouldn't really be doing that [01:25:32] (03PS1) 10Cenarium: Move account creation throttle to ping limiter and remove noratelimit from account creators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266454 (https://phabricator.wikimedia.org/T85538) [01:25:52] * YuviPanda rakes ori over the coals some more [01:25:53] it doesn't matter whether it's you or someone else [01:25:59] IT DID NOT EVEN HAVE A PGP SIGNATURE! 
[01:26:06] YuviPanda, you are not being helpful [01:26:12] neither are you, Krenair [01:26:13] Krenair: I agree that having engineering@ and wikitech-l@ is stupid. [01:26:16] I said so. [01:26:18] But shrug. [01:26:33] It's also dumb that we have #wikimedia-tech and #wikimedia-dev. [01:26:39] (and #mediawiki) [01:27:11] And #mediawiki-core and #wikimedia-devtools and... [01:27:18] Such fragmentation. Oh well. [01:32:00] now everyone goes back quietly to status quo, and one less person will attempt to even try anything. [01:32:12] * YuviPanda goes back to finding things to eat [01:34:21] Are we still beating Ori up? I'm curious about the HTML attachments. [01:36:01] (03CR) 10Andrew Bogott: [C: 032] Create /etc/mediawiki/WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/266451 (https://phabricator.wikimedia.org/T124732) (owner: 10Andrew Bogott) [01:38:31] the complaining is wildly unhelpful. Whenever anyone tries to do something a few people just bicker and whine as if they should be in charge of everything... [01:41:07] Krenair: I’ve verified that the .php files in the before and after of https://gerrit.wikimedia.org/r/#/c/266453/ are the same. Willing to merge that patch? [01:42:21] (03CR) 10Alex Monk: [C: 032] Get wikitech private settings from a new location: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266453 (https://phabricator.wikimedia.org/T124732) (owner: 10Andrew Bogott) [01:42:30] thank you! [01:42:54] (03Merged) 10jenkins-bot: Get wikitech private settings from a new location: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266453 (https://phabricator.wikimedia.org/T124732) (owner: 10Andrew Bogott) [01:43:00] woops, almost did it from tin :) [01:43:09] I should use the deployment.(eqiad|codfw).wmnet thing [01:44:37] mw1019 HHVM unhappy? 
[01:45:00] !log krenair@mira Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/266453/ (duration: 01m 27s) [01:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:46:09] andrewbogott, LGTM [01:47:08] yep, looks like wikitech survived. Thanks. [01:47:20] we’ll see what labtestwiki is doing these days... [01:48:36] andrewbogott, not much, it seems? [01:49:01] oh, wait, HSTS [01:49:21] nope, http 500 [01:49:31] Krenair: it’s still trying to load the old config file which is not there for some reason… [01:49:40] I’m syncing and doing a puppet run and we’ll see [01:49:41] oh [01:49:55] I need to run a command [01:49:56] PHP Fatal error: require_once(): Failed opening required '/srv/mediawiki/private/WikitechPrivateLdapSettings.php' [01:49:59] seems probably related :) [01:50:02] because we never added it to the scap list [01:50:13] I’m doing sync-common on labtestweb2001 right now [01:50:17] so am I [01:50:21] great :) [01:50:39] I think I’m good with it not getting pushes from scap, since that could clobber development work in progress [01:50:56] it loads now [01:51:01] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1964744 (10Smalyshev) OK then, I would then suggest imaging a server in codfw, and once it is complete we can proceed t... [01:51:41] oh dear [01:51:49] sync-common was really not happy [01:51:51] hm, [b162446a] 2016-01-26 01:51:38: Fatal exception of type "PasswordError" [01:51:53] that’s a new one [01:52:08] Krenair: you think doing two at once broke things? 
[01:52:11] rsync: rename failed for "/srv/mediawiki/extract2.php" (from .~tmp~/extract2.php): No such file or directory (2) [01:52:11] rsync: rename failed for "/srv/mediawiki/mobilelanding.php" (from .~tmp~/mobilelanding.php): No such file or directory (2) [01:52:11] rsync: rename failed for "/srv/mediawiki/wikiversions.json" (from .~tmp~/wikiversions.json): No such file or directory (2) [01:52:11] rsync: rename failed for "/srv/mediawiki/wikiversions.php" (from .~tmp~/wikiversions.php): No such file or directory (2) [01:52:13] etc. [01:52:23] I will stop syncing and let you do another one [01:52:28] I ran it again and it finished in 3 seconds [01:53:05] ok. [01:53:16] where did you get that exception? [01:53:27] creating an account. [01:53:31] I’m going to try again [01:53:45] mysql> select user_name from user; [01:53:46] Empty set (0.00 sec) [01:53:48] yeah [01:54:26] ok, same results [01:54:35] so, we’re somewhere new at least :) [01:55:04] so what was the rest of the exception? [01:55:19] tragically, that’s all it says [01:55:30] (03PS6) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [01:55:46] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1964746 (10JanZerebecki) I added it so I can look at the things related to this ticket in one graph (queue s... 
[01:57:11] andrewbogott, fluorine:/a/mw-log/exception.log has more details [01:57:32] 'Failed to add user because LDAPSetCreationValues returned false' [01:57:33] 2016-01-26 01:51:38 labtestweb2001 labtestwiki exception ERROR: [b162446a] /w/index.php?title=Special:UserLogin&action=submitlogin&type=signup&returnto=Main+Page PasswordError from line 2389 of /srv/mediawiki/php-1.27.0-wmf.10/includes/user/User.php: There was either an authentication database error or you are not allowed to update your external account. {"exception_id":"b162446a"} [01:57:33] [Exception PasswordError] (/srv/mediawiki/php-1.27.0-wmf.10/includes/user/User.php:2389) There was either an authentication database error or you are not allowed to update your external account. [01:57:33] #0 /srv/mediawiki/php-1.27.0-wmf.10/includes/specials/SpecialUserlogin.php(676): User->setPassword(string) [01:57:36] that’s what ldap says at least [01:57:51] from the ldap extension debug log? [01:58:04] oh, wait... [01:58:06] LdapAuthentication.php: $this->printDebug( "Failed to add user because LDAPSetCreationValues returned false", NONSENSITIVE ); [01:58:13] from /a/mw-log/ldap.log [01:58:24] but that exception is from your attempt, not from mine... [01:58:29] in your case the user already existed in ldap [01:59:25] andrewbogott, weren't we using a separate ldap server? 
[01:59:27] oh, right [01:59:31] because I tried to create my account already [01:59:40] and it failed and broke everything and gave me a user account with an IP as a name [01:59:45] also the new ldap server has an almost-complete import of the old one [01:59:50] right [02:00:24] So just before 'Failed to add user because LDAPSetCreationValues returned false' there should be something else [02:00:29] there is [02:00:37] One of these: [02:00:38] $auth->printDebug( "Unable to allocate a UID", NONSENSITIVE ); [02:00:41] $auth->printDebug( "Invalid shell name $shellaccountname", NONSENSITIVE ); [02:00:42] just a minute, though, I’m going to try to do a fresh test [02:00:45] $auth->printDebug( "$shellaccountname is not a creatable name.", NONSENSITIVE ); [02:00:46] with a new username [02:00:52] $auth->printDebug( "User $shellaccountname already exists.", NONSENSITIVE ); [02:00:52] ok [02:02:23] here’s everything: https://dpaste.de/AT9u [02:02:25] not very helpful [02:03:48] 2016-01-26 02:01:21 labtestweb2001 labtestwiki ldap INFO: 2.1.0 Failed to bind as cn=proxyagent,ou=profile,dc=wikimedia,dc=org [02:03:52] that can't be right [02:03:58] Successfully added user, and then later… Failed to modify the user's password [02:04:06] oh? I missed that, that’s something [02:04:28] hm [02:06:16] well, sure enough, I can’t ldapsearch with that cn and the password from /etc/mediawiki/ [02:06:19] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1964774 (10BBlack) Yeah but the rate increase we're looking at is actually in the htmlCacheUpdate job insert... [02:06:43] andrewbogott, I get ldap_bind: Invalid credentials (49) from terbium [02:06:54] with that password and '-h labtestservices2001.wikimedia.org' [02:06:58] yeah, me too [02:07:15] ldapsearch didn't exist for me on labtestweb2001? 
[02:07:34] yeah, ok, I just never updated that password. Stay tuned :) [02:07:46] I was about to say in the security channel that it's using the public password still, but ok :P [02:09:35] ok, password changed [02:09:42] so, once more, I will create an account [02:10:25] the password seems to be set correctly now [02:10:43] and account creation works! Or at least reports that it works [02:11:18] yeah, I can log in a second time with that account [02:11:32] andrewbogott, btw, what happened with TLS certificates? [02:11:39] I made a new one [02:11:42] did you find some way to get it signed in a trusted way? [02:12:13] we have a home-made CA for internal services. That’s what moritz uses, so I signed with the same authority for this. [02:12:30] aha, so there was an internal CA I wasn't aware of :) [02:12:35] yeah [02:12:39] I think I suggested making one without considering there might already be one [02:12:54] makes sense for internal-facing stuff like this [02:12:59] Well, unfortunately, there are multiples. So I was waiting on moritz to figure out which one to use [02:13:04] haha [02:13:08] what's the other one? [02:13:18] I think there’s one that’s the official “from now on only use this one” ca [02:13:23] but I couldn’t tell which was which [02:13:43] all that I know is here: https://phabricator.wikimedia.org/T124374 [02:13:47] (which isn’t much) [02:14:02] 6operations, 5Patch-For-Review: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1964775 (10Andrew) 5Open>3Resolved [02:14:21] so… next, I guess is to make a real account and figure out how to make it a cloud-admin... [02:15:25] Krenair: want me to delete your existing ldap from labtest user so you can make a fresh one? [02:15:39] yes please [02:15:47] I can promote a user to cloudadmin [02:17:14] andrewbogott, would you be comfortable merging https://gerrit.wikimedia.org/r/#/c/265907/ later? 
[02:19:28] sure [02:19:41] ok, I didn’t find a user named ‘krenair’ but I deleted a bunch of test accounts [02:20:10] uid=krenair, cn is Alex Monk [02:20:44] I managed to sign up [02:21:06] and if you want to promote me, I’m labtestwikitech [02:22:20] you mean labtestandrew, andrewbogott? [02:22:41] yes [02:23:06] copied the wrong field from my password vault, fortunately not the password [02:23:11] haha [02:23:58] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 36s) [02:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:51] > var_dump( User::newFromName( 'labtestandrew' )->addGroup( 'cloudadmin' ) ); [02:24:51] bool(true) [02:25:27] ^^ please don't use that sort of thing on normal production wikis, stewards should usually be doing such things there... but this is (lab test) wikitech and the DB/network rules would prevent it [02:25:49] (03CR) 10Andrew Bogott: [C: 032] "Verified that this is from the same krenair who can log in to bast1001 :)" [puppet] - 10https://gerrit.wikimedia.org/r/265907 (owner: 10Alex Monk) [02:25:55] (03PS2) 10Andrew Bogott: admin: Replace my prod yubikey SSH key [puppet] - 10https://gerrit.wikimedia.org/r/265907 (owner: 10Alex Monk) [02:26:09] thanks [02:26:18] Have to rebase before I can merge [02:27:06] Krenair: we also need ‘admin’ or bureaucrat or something in order to bestow rights to other accounts, right? [02:27:25] (03CR) 10Andrew Bogott: [C: 032] admin: Replace my prod yubikey SSH key [puppet] - 10https://gerrit.wikimedia.org/r/265907 (owner: 10Alex Monk) [02:28:05] Special:ListGroupRights is the page which shows you which groups can grant/take away what [02:29:10] Huh. [02:29:18] it doesn't appear to be set up correctly. 
[02:29:56] oh no wait [02:30:00] andrewbogott, so cloudadmin gets userrights [02:30:07] 'userrights' [02:30:15] the right in itself which lets you give/remove any local group [02:30:44] so labtestandrew can let anyone do anything on labtestwikitech now, careful with it :p [02:30:48] well, 'anything' [02:30:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 26 02:30:58 UTC 2016 (duration 7m 0s) [02:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:31:20] great [02:31:34] ok, next step is for me to set up an admin project in keystone, I think... [02:31:38] and roles and such [03:00:43] meh... this yubikey is not quite working as expected for anything except slot 9a :( [03:08:57] Krenair: yeah apparently it has trouble with switching modes. kinda driver/OS -dependent how easily it resets [03:09:29] if you insert it and use it for just 2FA, everything's fine. In most cases I've heard of, you can then also use it for 9a ssh stuff without issue. [03:09:42] but once you touch 9a ssh stuff, 2FA button pushes are dead [03:10:07] that's not the issue I've found [03:10:18] some software can reset that state easily. e.g. if I remove->reinsert key and launch LastPass and have it query yubi 2FA, it resets fine. [03:10:46] but remove->reinsert and go try Google 2FA first, and it fails to do anything useful :/ [03:11:01] I can use SSH on slot 9a and I can use it for Google 2FA [03:11:11] but I can't use a separate SSH key on a different slot [03:11:21] I thought I had this working at one point [03:11:23] oh I wasn't aware you can do more than one ssh key in a single yubi [03:11:30] I don't think that's possible, but I don't know [03:12:24] Krenair: I created an instance and can see the console! It works in project ‘labtestproject’ but not in ‘testlabs’ for some reason. 
[03:13:46] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:14:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:15:12] bblack, do you know what happens when you delete-certificate and reimport cert.pem? [03:15:30] Krenair: no idea [03:16:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:17:52] Actually I'm beginning to wonder whether I just made this key wrongly somehow [03:18:04] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:28:09] Well, this new one seems like it works against my VPS [03:31:48] bblack, andrewbogott: would either of you mind helping me try again? [03:31:59] Krenair: new key, you mean? [03:32:01] yes [03:32:04] sure [03:33:29] (03PS1) 10Alex Monk: admin: Replace my prod yubikey SSH key (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/266465 [03:35:24] (03PS2) 10Andrew Bogott: admin: Replace my prod yubikey SSH key (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/266465 (owner: 10Alex Monk) [03:36:51] (03CR) 10Andrew Bogott: [C: 032] admin: Replace my prod yubikey SSH key (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/266465 (owner: 10Alex Monk) [03:37:24] Krenair: want me to speed up the roll-out of that anyplace? bast1001? [03:37:37] one of the bastions would be enough for me to test it, thanks [03:37:47] ok, refreshing puppet on 1001 [03:37:57] I can just wait though, I don't want to waste your time [03:38:06] no worries [03:38:22] I’m about to head out though… hopefully this one takes :) [03:38:53] andrewbogott, it worked [03:38:53] ok, bast1001 should have the new key now. [03:38:56] cool [03:39:24] ok, I’m off. I’m excited to have labtestwikitech up and running now — thanks for all your help with that. 
[03:39:53] you're welcome [03:52:27] (03CR) 10Legoktm: [C: 031] Get rid of $wg = $wmg for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [03:53:42] (03CR) 10Legoktm: Get rid of $wg = $wmg for Graph (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [04:27:29] !log restarted resetGlobalUserTokens.php after it lost mysql connection again [04:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:29:52] (03PS2) 10Dereckson: Use extension registration for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) [04:36:20] (03PS1) 10Dereckson: Get rid of $wg = $wmg for BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) [06:05:54] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: puppet fail [06:31:25] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:46] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:06] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:15] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures 
[06:56:35] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:15] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:57:16] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:55] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:58:35] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:59:05] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:15:35] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:15:55] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:16:44] PROBLEM - puppet last run on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:16:44] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:16:56] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:16:56] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:17:04] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:17:14] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:18:04] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient [08:18:55] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212 [08:18:55] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:18:56] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [08:19:14] RECOVERY - DPKG on mw1161 is OK: All packages OK [08:19:35] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: puppet fail [08:19:45] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up [08:22:21] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1172, mw1178,mw1217, mw1228, mw1257 are unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1965069 (10Joe) [08:23:10] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1172, mw1178,mw1217, mw1257 are unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1965071 (10Joe) [08:24:06] ACKNOWLEDGEMENT - Host mw1257 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124642 [08:24:06] ACKNOWLEDGEMENT - Host mw1217 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124642 [08:24:06] ACKNOWLEDGEMENT - Host mw1178 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124642 [08:24:06] ACKNOWLEDGEMENT - Host mw1172 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124642 [08:25:14] ACKNOWLEDGEMENT - Host mw1228 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T122005 [08:25:24] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:35] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:25] RECOVERY - Disk space on mw1161 is OK: DISK OK [08:29:54] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:29:54] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:31:09] (03PS4) 10Elukey: Add the moving average function to the event logging's insert rate alarming metric. Bug: T124204 [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) [08:32:05] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212 [08:34:05] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:38:35] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:24] RECOVERY - Disk space on mw1161 is OK: DISK OK [08:42:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Add the moving average function to the event logging's insert rate alarming metric. Bug: T124204 [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) (owner: 10Elukey) [08:42:24] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed [08:42:44] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:42:44] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212 [08:47:14] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:47:24] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:48:44] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:49:45] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:04] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:51:24] RECOVERY - DPKG on mw1161 is OK: All packages OK [08:51:54] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up [08:53:06] RECOVERY - Disk space on mw1161 is OK: DISK OK [08:59:45] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:59:45] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:01:18] (03PS1) 10Ema: esams: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266475 (https://phabricator.wikimedia.org/T109286) [09:01:55] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:02:16] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:02:25] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: Puppet has 1 failures [09:02:55] (03Abandoned) 10Giuseppe Lavagetto: neodymium: add role::deployment::salt_masters [puppet] - 10https://gerrit.wikimedia.org/r/266218 (owner: 10Giuseppe Lavagetto) [09:06:15] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212 [09:06:35] RECOVERY - DPKG on mw1161 is OK: All packages OK [09:07:31] 6operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1965105 (10Joe) All memcached hosts have both memcached and the session-related redis. So reinstalling them has a small but non-trivial effect: when a server goes down, we lose 1/18th of the current user sessio... [09:08:06] RECOVERY - Disk space on mw1161 is OK: DISK OK [09:12:14] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed [09:12:35] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [09:12:44] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:20:45] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:21:14] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:21:14] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:21:42] (03PS2) 10Muehlenhoff: Remove debdeploy::master from palladium [puppet] - 10https://gerrit.wikimedia.org/r/266219 [09:21:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove debdeploy::master from palladium [puppet] - 10https://gerrit.wikimedia.org/r/266219 (owner: 10Muehlenhoff) [09:21:54] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:15] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:23:15] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212 [09:23:16] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:35] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:24:54] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed [09:25:15] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [09:25:25] RECOVERY - DPKG on mw1161 is OK: All packages OK [09:25:55] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up [09:28:04] (03CR) 10Hoo man: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [09:28:50] <_joe_> !log finishing reboots of appservers in eqiad [09:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:29:04] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:04] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:53] !log restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/ [09:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:36:17] PROBLEM - HHVM rendering on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:36:26] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:16] RECOVERY - HHVM rendering on mw1048 is OK: HTTP OK: HTTP/1.1 200 OK - 64744 bytes in 0.113 second response time [09:38:17] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [09:38:56] PROBLEM - Host mw1171 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:36] RECOVERY - Host mw1171 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [09:42:37] PROBLEM - Host mw1111 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:56] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: Puppet has 1 failures [09:43:36] RECOVERY - Host mw1111 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [09:58:46] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: 
CRITICAL: 33.33% of data above the critical threshold [1000.0] [09:59:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [10:06:03] <_joe_> uh let's see this [10:06:21] 6operations, 10Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1965213 (10ArielGlenn) Forgot to mention, this is actually an issue with the pillar refresh after the grain is set. [10:06:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:06:46] <_joe_> ema, elukey when you see those alarms about 5xx reqs/min you should look at https://grafana-admin.wikimedia.org/dashboard/db/varnish-http-errors [10:07:24] <_joe_> as you can see from the 4th graph (HTTP 5xx Responses) [10:07:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:07:33] <_joe_> it was just a spike [10:07:43] <_joe_> if it wasn't, it's worth investigating more [10:08:36] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:09:37] hmm, for a moment i couldn't get to mediawiki.org [10:09:51] too bad i refreshed and didn't copy the bottom error message [10:10:03] <_joe_> yurik: was it a 503? 
[10:10:07] i think so [10:13:27] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures [10:14:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:14:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:14:49] (03PS1) 10ArielGlenn: make default log rotation for apache be 30 days [puppet] - 10https://gerrit.wikimedia.org/r/266480 [10:15:16] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:15:28] (03Abandoned) 10ArielGlenn: apache: keep two weeks' worth of logs, rather than 1yr [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [10:16:14] (03Abandoned) 10Muehlenhoff: Move debdeploy::master off palladium [puppet] - 10https://gerrit.wikimedia.org/r/266215 (owner: 10Muehlenhoff) [10:18:53] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1965246 (10ArielGlenn) git-deploy moved to neodymium yesterday, debdeploy was moved by moritz today. Giving a couple of days for any problems to shake out, on Thursday palladium will be remove... [10:19:27] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:20:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:20:54] _joe_: similar spike in esams, what could be the cause? 
[10:21:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:22:47] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1965250 (10Joe) Please note that we will need to find a way to allow salt key signing during the reimaging/imaging of a server; this isn't a blocker for the decommission of palladium, but I'd k... [10:30:31] 7Puppet, 6operations, 10Salt: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1965260 (10Joe) 3NEW a:3ArielGlenn [10:31:10] 7Puppet, 6operations, 10Salt: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1965260 (10Joe) a:5ArielGlenn>3Joe [10:32:35] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1965272 (10Joe) [10:35:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 782 [10:37:18] (03PS1) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) [10:39:07] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:40:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 586915 Threads: 2 Questions: 4505529 Slow queries: 3909 Opens: 1610 Flush tables: 2 Open tables: 417 Queries per second avg: 7.676 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:40:51] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1965282 (10faidon) >>! 
In T124444#1964147, @EBernhardson wrote: > I realizes it's a ton more work, hardware, and I honestly don't even know what would be involved.... [10:43:03] (03CR) 10Giuseppe Lavagetto: [C: 031] esams: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266475 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [10:44:03] (03CR) 10BBlack: [C: 031] esams: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266475 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [10:46:00] (03PS2) 10Ema: esams: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266475 (https://phabricator.wikimedia.org/T109286) [10:46:18] (03CR) 10Ema: [C: 032 V: 032] esams: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266475 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [10:50:51] !log Starting migration of mobile traffic to text cluster in esams https://phabricator.wikimedia.org/T109286 [10:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:35] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1965334 (10Addshore) As far as I can tell in Wikibase.... - WikiPgaeUpdater::scheduleRefereshLinks creat... 
[11:23:26] PROBLEM - Host multatuli is DOWN: PING CRITICAL - Packet loss = 100% [11:24:07] RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 85.94 ms [11:41:05] multatuli was me (reboot I forgot to ack in icinga) [11:46:07] !log rebooting bromine for kernel update [11:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:50:52] !log rebooting etherpad1001 for kernel update [11:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:11] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1965406 (10BBlack) Does that imply that **nothing** should be using the hostnames `cxserver.wikimedia.org` and/or `cxserver.eqiad.wikimedia.org`, which map... [12:05:41] 6operations, 10Graphoid, 6Services, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1965407 (10BBlack) Are things still using the hostnames `graphoid.wikimedia.org` and/or `graphoid.eqiad.wikimedia.org`, which map to the cache_parsoid cluster rather than through res... [12:06:20] 6operations, 10Citoid, 6Services, 10Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1965408 (10BBlack) Are things still using the hostnames `citoid.wikimedia.org` and/or `citoid.eqiad.wikimedia.org`, which map to the cache_parsoid cluster rather than through restbase? [12:06:58] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1965410 (10BBlack) Are things still using the hostnames `rest.wikimedia.org` and/or `restbase.wikimedia.org` and/or `restbase.eqiad.wikimedia.org`, which map to the cache_parsoid clu... 
[12:08:55] (03PS1) 10BBlack: VCL: do not use illegal "trusted" XFF values for XCIP [puppet] - 10https://gerrit.wikimedia.org/r/266486 (https://phabricator.wikimedia.org/T120121) [12:10:11] !log rebooting mx2001/mx1001 (with a delay in between) for kernel update [12:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:06] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: Puppet has 3 failures [12:29:16] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [12:31:13] (03PS1) 10BBlack: make cache_parsoid LVS IPs slightly more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/266488 [12:31:15] (03PS1) 10BBlack: cache_parsoid: use local backends in codfw [puppet] - 10https://gerrit.wikimedia.org/r/266489 [12:34:31] (03CR) 10BBlack: [C: 032] make cache_parsoid LVS IPs slightly more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/266488 (owner: 10BBlack) [12:39:38] !log rolling reboot of ganeti200{1,2,3,4,5,6}.codfw.wmnet for kernel upgrade [12:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:48] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail [12:44:26] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 7e-06 secs [12:48:19] 6operations, 10Graphoid, 6Services, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1965474 (10Yurik) Not to my knowledge. I sometimes use it for debugging, eg when restbase has a bad day, but I can ssh directly [12:54:36] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:54:36] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[12:54:56] ^ me [12:56:46] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:56:46] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:03:57] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1965507 (10jcrespo) The database has been exported, the 3 databases are being imported now into m5-master. [13:05:17] (03PS1) 10Jcrespo: Depool pc1002 for maintenance (clone to pc1005) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266490 (https://phabricator.wikimedia.org/T121888) [13:10:58] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:11:38] (03CR) 10Jcrespo: [C: 032] Depool pc1002 for maintenance (clone to pc1005) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266490 (https://phabricator.wikimedia.org/T121888) (owner: 10Jcrespo) [13:14:23] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1002 for maintenance (clone to pc1005) (duration: 01m 39s) [13:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:47] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [13:23:45] I am probably affecting some job working on terbium if mediawiki doesn't know how to connect, but I really need to bring down pc1002, and I cannot wait [13:25:57] (03PS2) 10Bmansurov: Add sampling rates for mobile web language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) [13:30:04] PROBLEM - mysqld processes on pc1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [13:30:52] PROBLEM - mysqld processes on pc1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [13:31:00] <_joe_> jynus: is this expected? 
[13:31:04] <_joe_> (1005) [13:31:09] <_joe_> if so, disregard [13:31:23] both of them at the same time ? [13:31:31] 1002 is him [13:31:53] clone to pc1005 [13:31:58] I see. I bet they are both him [13:32:27] ok [13:32:40] !log rebooted nescio/maerlant for kernel update [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:40] ACKNOWLEDGEMENT - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] cpettet yes I see [13:38:49] oh s*** [13:39:41] that must be "oh sure" as in everything's fine, right? :P [13:39:46] :-D [13:42:13] yes, everything is fine [13:43:04] there are 2 factors here, I am getting older [13:43:21] and I have too many things going on at the same time [13:44:32] the third is- where is my expert-system controlled alter system? [13:44:53] don't even try that 'getting older' thing on me, young whipper-snapper! [13:45:26] PROBLEM - Host alsafi is DOWN: PING CRITICAL - Packet loss = 100% [13:47:53] at this point I am happy with bringing down the right hosts, and not the wrong ones [13:48:04] should I get concerned? 
[13:48:38] no, it was regular maintenance that was not downtimed properly [13:48:42] sorry [13:49:29] pc1002 is the depooled host, pc1005 is the new host, I am cloning them [13:49:47] before decommissioning pc1002 and pooling pc1005 [13:50:24] there is a better solution- paging based on service, not on servers [13:50:26] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [13:51:17] PROBLEM - DPKG on cp4019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:51:42] to be clear here, there was no user notice at all [13:53:36] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [13:55:08] I think I inherited a page config useful for working in a different timezone [13:55:27] probably [13:55:33] easy to change that [13:55:57] so I'm trying root@ganeti2001:~# gnt-instance console alsafi.wikimedia.org [13:55:57] given that ssh in fails [13:56:04] and it's not doing much. any ideas? [13:56:20] * apergos eyes akosiaris [13:58:50] hmm [13:59:59] I saw a note in SAL from mutante yesterday that he 'logged in as though it were hibernating' and it came back up [14:01:19] if it's that, some VMs need to be rebooted to get the KVM disk_aio setting applied and alsafi is probably one of them [14:01:48] oh [14:02:06] but it doesn't look like that though [14:02:19] I just migrated it though, lemme check [14:02:23] I might be the cause [14:02:58] ok [14:03:10] I gave up on the console ting, it hung forever [14:03:12] *thing [14:03:36] alsafi login: [14:03:36] Debian GNU/Linux 8 alsafi ttyS0 [14:03:40] nope it did not [14:03:43] I just got a console [14:03:49] well I sure did not. meh [14:03:54] so this is probably network related misconfiguration on my side [14:04:19] I was on ganeti2001 at the time, was that a mistake? [14:04:30] nope [14:04:34] huh [14:04:49] did you press a couple of enters though ? 
[14:04:55] no [14:05:04] one but not two [14:05:16] one should have been enough [14:05:36] ok definitely network related [14:05:45] I'm repeating the experiment with the same results [14:05:56] oh, I am attached to it right now [14:06:00] ah :-D [14:06:01] nm then [14:08:20] ACKNOWLEDGEMENT - Host mw2173 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124408 [14:08:46] PROBLEM - Host mx2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] that's the exact same issue [14:11:05] ^ [14:11:12] fixing both as we speak [14:11:18] awesome [14:12:26] RECOVERY - Host mx2001 is UP: PING WARNING - Packet loss = 86%, RTA = 384.85 ms [14:13:07] RECOVERY - Host alsafi is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [14:13:45] \o/ [14:14:11] !log migrate alsafi,mx2001 back from ganeti2004 to fix a network misconfiguration [14:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:34] wrong assumption on my part btw [14:14:47] oh? [14:15:18] I never put into auto lo in e/n/i eth0.2002. I assumed vlan2002 which depends on it would bring the slave interface up as well [14:15:40] ouch [14:16:17] yeah, that part needs some better puppetization. And it is actually possible these days in jessie [14:16:24] using /etc/network/interfaces.d/ [14:16:36] I 'll start concocting something up to handle these things better [14:16:37] another impetus to get stuff moved [14:17:34] the thing daniel talked about is evident here btw https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=alsafi.wikimedia.org&m=cpu_report&s=descending&mc=2&g=load_report&c=Miscellaneous+codfw [14:17:46] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: puppet fail [14:17:47] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: puppet fail [14:18:14] you get to see a huge load increase for no apparent reason. 
Then https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=alsafi.wikimedia.org&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+codfw [14:18:19] and you see the IOwait [14:18:50] so that seems to be fixed by disk_aio=native which I 've applied throughout the clusters but we still need a reboot in some VMs [14:18:59] after that we will hopefully not see it ever again [14:21:10] !log migrating alsafi,mx2001 back to 2004 for testing [14:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:31] what on earth was it doing during that time to have that high load [14:21:37] nothing [14:21:41] it's a kvm/qemu bug [14:21:46] ahahaha [14:22:09] and the native setting is the workaround then [14:22:13] yes [14:23:04] which vms still need a reboot, can they be on a list to be scheduled? [14:24:12] quite a few (more than 70%), but yeah, I 'll create one. I 've been testing the workaround and I think it works fine, so... what better time to do it than now ? [14:24:22] +1 [14:26:41] debian gurus? how do you *really* force apt-get to *never* ask questions at all (even if that means it has to just fail) [14:27:06] I tried: [14:27:10] apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade (and it's running via salt, so it has no terminal anyways) [14:27:35] and it's still hanging on asking a question that a package wants answered, which has an appropriate default answer and just needs enter pressed :P [14:28:31] DEBIAN_FRONTEND [14:28:37] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:28:39] maybe? 
:) [14:28:54] no idea :) [14:29:00] export DEBIAN_FRONTEND='noninteractive' or something like that [14:29:08] oh yeah it's a setting like that [14:29:14] used in docker containers a lot [14:29:14] ok [14:30:02] * bblack grumbles something about how that should be a simple flag like "--non-interactive", which should maybe activate itself when there's no terminal attached to the initial command [14:30:04] that is exactly the right var and setting [14:30:22] * bblack and wonders why apt-get actually creates a fake terminal for the dpkg it invokes... [14:32:17] I got no answer for that [14:32:42] I wonder how much stuff in software is "oh yeah we never removed that workaround, it's obsolete" [14:34:37] I think I now remember I left a screen session on some mysql server doing an important task [14:34:57] RECOVERY - DPKG on cp4019 is OK: All packages OK [14:35:10] I only now have to find which one [14:35:22] this was before vacations [14:39:12] (03PS1) 10Alex Monk: Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778) [14:39:34] !log upgrading packages (incl kernel) on all ulsfo caches (cp4xxx) [14:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:37] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:42:12] (03PS4) 10Rush: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [14:42:28] jynus: it's not on db* because none of them have screen running on them :-P [14:43:41] (03CR) 10Rush: [C: 032] Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [14:44:31] apergos, that is false, I am running 1 right now [14:44:36] where? 
[14:44:52] jynus: [14:46:11] https://phabricator.wikimedia.org/P2527 [14:47:40] been there for 2 hours at least [14:48:03] ugh can't help it if debian / ubuntu calls the process SCREEN instead of screen :-/ [14:48:06] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:48:06] !log RPS on eth0 on labstores [14:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:11] lol [14:48:39] huh some of these are running as springle :-D [14:49:10] yes, do not touch them if you do not want all db infrastructure to collapse [14:49:37] "only" 22 hosts to check, good luck... [14:50:27] you have a paste so I do not have to rerun it? [14:50:41] I suppose it will be on the log, doesn't matter [14:53:06] yes I have the info [14:53:16] where would you like a paste? [14:53:42] file on salt master? [14:54:09] (03PS1) 10Ema: esams: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266499 (https://phabricator.wikimedia.org/T109286) [14:54:16] /root/dbscreens.txt [14:54:40] I think some of those springle screens can go but that's a task for another time [14:54:59] on neodymium is the file, of course [14:55:05] of course [14:55:41] no need to search "SCREEN -S partitioning" [14:56:24] only ten hosts left then :-D [14:56:53] <_joe_> uhm just got a page? [14:57:03] !log Finished migration of mobile traffic to text cluster in esams https://phabricator.wikimedia.org/T109286 [14:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:27] er? [14:57:31] I got nothin [14:57:33] _joe_, I didn't [14:57:39] what was it? 
[14:58:17] <_joe_> nope, an old one got delivered again [14:58:24] ah whew [14:59:00] it happened to me constantly when in the US, I received the same page 20 times [15:02:11] bblack, hi, re mobile merge - i saw you commented about merging ip ranges - weren't we trying to stabilize those ranges so that some of our zero partners can identify mobile traffic via ips? [15:04:29] (03CR) 10BBlack: [C: 031] esams: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266499 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [15:04:38] yurik: yes, kinda [15:05:06] (03CR) 10Ema: [C: 032 V: 032] esams: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266499 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [15:08:05] (03PS1) 10Jcrespo: m4-master is now the eventlogging master (pointed by dbproxy1004) [software] - 10https://gerrit.wikimedia.org/r/266500 [15:08:53] (03CR) 10Ottomata: [C: 031] m4-master is now the eventlogging master (pointed by dbproxy1004) [software] - 10https://gerrit.wikimedia.org/r/266500 (owner: 10Jcrespo) [15:08:53] :) [15:09:19] ? [15:09:25] are you a wizard? [15:09:45] (03CR) 10Jcrespo: [C: 032 V: 032] m4-master is now the eventlogging master (pointed by dbproxy1004) [software] - 10https://gerrit.wikimedia.org/r/266500 (owner: 10Jcrespo) [15:10:07] FYI, it has been running for a few ours already [15:10:10] *hours [15:11:49] jynus: hi ja [15:11:52] ah ok cool [15:12:01] how's it look? worired it might be super slow with just one process [15:12:17] (sorry the post office line was very long this morning :/ ) [15:12:28] it is not slow, but it looks we were missing 20% of events [15:12:41] over the weekend? [15:12:43] whatcha mean? [15:12:48] overally [15:12:51] ? [15:13:11] like forever? [15:13:20] after I finish with it, I will run it from 1 Jan to see how many differences we get [15:13:34] ha, crazy ok, like the sync.sh script has been lazy? 
[15:13:45] I do not know, really [15:13:47] hm [15:13:54] maybe it is just false positives [15:15:30] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1965732 (10dcausse) Yes it's extremely hard to guess, the morelike problem makes it hard to evaluate. Cluster wide: Without serving morelike queries tp95 starts to move at 1200qps (prefix)... [15:16:15] I will know more when it finishes [15:16:53] ok [15:16:56] thanks [15:19:27] PROBLEM - NTP on cygnus is CRITICAL: NTP CRITICAL: Offset -2.074615479 secs [15:22:06] (03CR) 10Dereckson: "PS2: load extension through wfLoadExtension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:28:56] PROBLEM - NTP on pollux is CRITICAL: NTP CRITICAL: Offset -9.967888832 secs [15:31:53] (03PS3) 10Ottomata: Refactor MirrorMaker puppetization [puppet/kafka] - 10https://gerrit.wikimedia.org/r/265789 (https://phabricator.wikimedia.org/T124077) [15:32:07] (03PS4) 10Ottomata: Refactor MirrorMaker puppetization [puppet/kafka] - 10https://gerrit.wikimedia.org/r/265789 (https://phabricator.wikimedia.org/T124077) [15:40:38] (03PS1) 10Ema: eqiad: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266503 (https://phabricator.wikimedia.org/T109286) [15:41:24] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1965764 (10dcausse) Side note: I think we can really optimize server usage by splitting cluster by feature. From what I understand in this paper[1]: parallelization of slow queries can real... [15:46:39] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1965793 (10matmarex) >>! In T119944#1960512, @Aklapper wrote: >>>! 
In T119944#1950162, @matmarex wrote: >> empty up #Wikimedia-Media-Storage, moving the reports in it to #swift or else... [15:46:49] (03PS1) 10Jcrespo: Updated partitioning for s1 and s4 [software] - 10https://gerrit.wikimedia.org/r/266504 (https://phabricator.wikimedia.org/T120513) [15:47:01] (03PS5) 10Ottomata: Refactor MirrorMaker puppetization [puppet/kafka] - 10https://gerrit.wikimedia.org/r/265789 (https://phabricator.wikimedia.org/T124077) [15:48:30] (03CR) 10Jcrespo: [V: 032] Updated partitioning for s1 and s4 [software] - 10https://gerrit.wikimedia.org/r/266504 (https://phabricator.wikimedia.org/T120513) (owner: 10Jcrespo) [15:48:39] (03CR) 10Jcrespo: [C: 032] Updated partitioning for s1 and s4 [software] - 10https://gerrit.wikimedia.org/r/266504 (https://phabricator.wikimedia.org/T120513) (owner: 10Jcrespo) [15:51:06] (03CR) 10Ottomata: [C: 032] Refactor MirrorMaker puppetization [puppet/kafka] - 10https://gerrit.wikimedia.org/r/265789 (https://phabricator.wikimedia.org/T124077) (owner: 10Ottomata) [15:54:04] 6operations, 6Commons, 7Swift: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1965815 (10matmarex) [15:54:07] 6operations, 6Commons, 5MW-1.27-release-notes, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1965816 (10matmarex) [15:57:17] (03PS2) 10Hashar: contint: stop cloning mediawiki/tools/codesniffer.git [puppet] - 10https://gerrit.wikimedia.org/r/260018 (https://phabricator.wikimedia.org/T66371) [15:57:49] (03CR) 10Hashar: [C: 031 V: 032] "Simple rebase. Still cherry-picked on integration puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/260018 (https://phabricator.wikimedia.org/T66371) (owner: 10Hashar) [15:58:16] 6operations, 6Commons, 7Monitoring, 7Swift: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1965841 (10matmarex) [15:58:19] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1965843 (10GWicke) @bblack, there are still users for rest.wikimedia.org. I sent a reminder and announced a shut-down date for March. If we set up a redirect (or rewrite) for the dom... [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160126T1600). [16:00:05] Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:10] Hello. [16:01:05] RECOVERY - mysqld processes on pc1005 is OK: PROCS OK: 1 process with command name mysqld [16:01:11] Dereckson: Hiya, I can SWAT for you this morning. [16:01:22] Okay. 
[16:01:39] (03PS1) 10Ottomata: Rotate kafka-mirror GC logs too [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266508 [16:02:45] RECOVERY - mysqld processes on pc1002 is OK: PROCS OK: 1 process with command name mysqld [16:02:47] (03CR) 10Ottomata: [C: 032 V: 032] Pass flake8 and add it to tox envlist [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265252 (owner: 10Hashar) [16:03:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265893 (https://phabricator.wikimedia.org/T124389) (owner: 10Dereckson) [16:03:25] (03CR) 10Ottomata: [C: 032 V: 032] Add .gitreview [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265250 (owner: 10Hashar) [16:03:54] (03CR) 10Ottomata: [C: 032 V: 032] Introduce tox as a test entry point [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265251 (owner: 10Hashar) [16:04:16] (03CR) 10Ottomata: [C: 032 V: 032] Add .gitreview [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264010 (owner: 10Hashar) [16:04:33] (03CR) 10Ottomata: [C: 032 V: 032] Introduce tox as a test entry point [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264011 (owner: 10Hashar) [16:04:41] thcipriani: 265893 depends of 265892, which depends of 265891 [16:04:57] (03CR) 10Ottomata: [C: 032 V: 032] Pass flake8 and add it to tox envlist [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264012 (owner: 10Hashar) [16:05:31] (03CR) 10Ottomata: [C: 032 V: 032] Rotate kafka-mirror GC logs too [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266508 (owner: 10Ottomata) [16:05:57] (03PS2) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) [16:05:59] (03PS1) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [16:06:01] (03PS1) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [16:06:03] (03PS1) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) [16:06:05] (03PS1) 10Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [16:06:07] (03PS1) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [16:06:09] (03PS1) 10Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 [16:06:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265891 (https://phabricator.wikimedia.org/T124389) (owner: 10Dereckson) [16:07:46] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:07:54] (03CR) 10jenkins-bot: [V: 04-1] Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:08:06] (03CR) 10jenkins-bot: [V: 04-1] Add references to wmfServices for Cirrusearch. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:08:08] hmm, zuul still being slow about picking up that change... [16:08:17] (03CR) 10jenkins-bot: [V: 04-1] Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:08:27] (03CR) 10jenkins-bot: [V: 04-1] Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 (owner: 10Giuseppe Lavagetto) [16:08:32] <_joe_> ugh, and ofc [16:08:55] (03Merged) 10jenkins-bot: Namespace configuration for wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265891 (https://phabricator.wikimedia.org/T124389) (owner: 10Dereckson) [16:08:58] It's taking 265891 [16:12:02] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration for wuu.wikipedia [[gerrit:265891]] (duration: 01m 29s) [16:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:08] ^ Dereckson check please [16:12:28] Testing. [16:12:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265892 (https://phabricator.wikimedia.org/T124389) (owner: 10Dereckson) [16:13:24] (03PS1) 10Jcrespo: Pool new parsercache pc1005 after cloning it from pc1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266516 (https://phabricator.wikimedia.org/T121888) [16:13:46] Tested. Works fine. 
[16:14:16] PROBLEM - NTP on serpens is CRITICAL: NTP CRITICAL: Offset unknown [16:15:35] (03Merged) 10jenkins-bot: Remove Tranwiki namespace on wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265892 (https://phabricator.wikimedia.org/T124389) (owner: 10Dereckson) [16:16:03] (03Merged) 10jenkins-bot: Add Portal namespace on wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265893 (https://phabricator.wikimedia.org/T124389) (owner: 10Dereckson) [16:18:13] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: image magick stripping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#1965913 (10matmarex) [16:18:37] RECOVERY - NTP on serpens is OK: NTP OK: Offset 0.005597949028 secs [16:19:16] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove Tranwiki namespace on wuu.wikipedia [[gerrit:265892]] and Add Portal namespace on wuu.wikipedia [[gerrit:265893]] (duration: 01m 27s) [16:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:21] ^ Dereckson check plase [16:19:24] *please [16:19:41] Testing. [16:21:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265896 (https://phabricator.wikimedia.org/T122175) (owner: 10Dereckson) [16:21:44] (03CR) 10Ema: "This one should be merged *after* the pybal+etcd setup is done in eqiad." [puppet] - 10https://gerrit.wikimedia.org/r/266503 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [16:21:59] (03Merged) 10jenkins-bot: Namespaces configuration on sk.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265896 (https://phabricator.wikimedia.org/T122175) (owner: 10Dereckson) [16:22:06] 265892 and 265893 Verified. 
[16:22:07] PROBLEM - HHVM rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:53] Dereckson: thank you [16:23:27] PROBLEM - Apache HTTP on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:35] (03CR) 10Hashar: "recheck" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264012 (owner: 10Hashar) [16:23:38] (03CR) 10Hashar: "recheck" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265252 (owner: 10Hashar) [16:23:46] (03CR) 10RobH: [C: 031] "This is slated for puppetswat today, and looks acceptable for merge at that time (pending rebase)." [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson) [16:24:39] (03CR) 10RobH: [C: 031] "Looks good for puppetswat later today." [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson) [16:25:11] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespaces configuration on sk.wikipedia [[gerrit:265896]] (duration: 01m 27s) [16:26:04] ^ Dereckson check please [16:26:04] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [16:26:04] (03PS1) 10Ottomata: Add nagios_servicegroup parameter to kafka::mirror::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266517 [16:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:44] (03CR) 10Hashar: "CI is enabled :-}" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265252 (owner: 10Hashar) [16:26:46] (03CR) 10Ottomata: [C: 032] Add nagios_servicegroup parameter to kafka::mirror::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266517 (owner: 10Ottomata) [16:26:51] (03CR) 10Hashar: "CI is enabled :-}" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264012 (owner: 10Hashar) [16:27:03] (03CR) 10RobH: [C: 031] "slated for puppetswat shortly." 
[puppet] - 10https://gerrit.wikimedia.org/r/266299 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [16:27:08] thcipriani: Tested [16:27:14] (03CR) 10Ottomata: [V: 032] Add nagios_servicegroup parameter to kafka::mirror::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266517 (owner: 10Ottomata) [16:27:27] PROBLEM - Host alsafi is DOWN: PING CRITICAL - Packet loss = 100% [16:27:57] Dereckson: I'll circle back to the category collation and do that one at the end, since it requires running a script [16:28:20] Okay. [16:28:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265666 (https://phabricator.wikimedia.org/T124167) (owner: 10Dereckson) [16:29:18] (03Merged) 10jenkins-bot: Enable SandboxLink on nl.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265666 (https://phabricator.wikimedia.org/T124167) (owner: 10Dereckson) [16:29:29] (03PS1) 10Ottomata: Puppetize Kafka MirrorMaker on analytics1021 mirroring from main-eqiad to analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/266518 [16:30:46] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1965960 (10BBlack) Quick update, I did a small re-check on just a single text node in esams (mobile + desktop text traffic, random subsample of IPs, mostly in Europe) for 5 minutes: | Protocol | Percentage... 
[16:31:33] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable SandboxLink on nl.wikiquote [[gerrit:265666]] (duration: 01m 26s) [16:31:36] ^ Dereckson check please [16:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:47] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: puppet fail [16:31:47] RECOVERY - NTP on cygnus is OK: NTP OK: Offset 0.0002664327621 secs [16:32:12] thcipriani: works [16:32:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265623 (https://phabricator.wikimedia.org/T124154) (owner: 10Dereckson) [16:32:42] (03CR) 10Ottomata: [C: 032] Puppetize Kafka MirrorMaker on analytics1021 mirroring from main-eqiad to analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/266518 (owner: 10Ottomata) [16:33:17] (03Merged) 10jenkins-bot: Update et.wikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265623 (https://phabricator.wikimedia.org/T124154) (owner: 10Dereckson) [16:35:02] (03PS1) 10Ottomata: Remove invalid parameter jmx_port from kafka::mirror::monitoring use [puppet] - 10https://gerrit.wikimedia.org/r/266519 [16:35:15] (03CR) 10Ottomata: [C: 032 V: 032] Remove invalid parameter jmx_port from kafka::mirror::monitoring use [puppet] - 10https://gerrit.wikimedia.org/r/266519 (owner: 10Ottomata) [16:36:38] !log thcipriani@mira Synchronized w/static/images/project-logos/etwikiquote.png: SWAT: Update et.wikiquote logo [[gerrit:265623]] (duration: 01m 27s) [16:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:45] ^ Dereckson check please [16:38:05] (03PS1) 10Ottomata: Fix dependency for nrpe::monitor_service { "kafka-mirror-${title}" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266521 [16:39:09] (03CR) 10Ottomata: [C: 032 V: 032] Fix dependency for nrpe::monitor_service { "kafka-mirror-${title}" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266521 
(owner: 10Ottomata) [16:39:27] https://et.wikiquote.org/w/static/images/project-logos/etwikiquote.png is live and okay, but not yet https://et.wikiquote.org/static/images/project-logos/etwikiquote.png [16:39:44] (03Merged) 10jenkins-bot: Namespace configuration on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265888 (https://phabricator.wikimedia.org/T122045) (owner: 10Dereckson) [16:40:31] Dereckson: hmm, I see both updated [16:41:15] oh wait, in incognito I see the non-w link as not updated, too [16:41:28] With wget, I've the former version too. [16:41:28] ...doublechecking [16:45:30] (03PS1) 10Ottomata: Use brokers_string instead of brokers_array so each hostname is suffixed with broker port [puppet] - 10https://gerrit.wikimedia.org/r/266525 [16:45:49] (03CR) 10Ottomata: [C: 032 V: 032] Use brokers_string instead of brokers_array so each hostname is suffixed with broker port [puppet] - 10https://gerrit.wikimedia.org/r/266525 (owner: 10Ottomata) [16:45:51] Dereckson: hmm with ?debug=true I get the correct logo [16:46:21] cache issue so I imagine [16:48:31] Indeed. I tried purgeList, didn't seem to have an effect. [16:48:43] We wait to see if it takes the new files later and revisit the issue if not? [16:49:05] thcipriani: what would purgeList do exactly in this case? [16:49:06] RECOVERY - Host alsafi is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms [16:49:11] Dereckson: yeah, I'm syncing your urwiki change now. [16:53:33] I don't know what interfaces you have for purging, but /static/ assets are all virtually under the same hostname for caching purposes [16:53:33] they're all https://www.wikimedia.org/static/.... regardless of what wiki they're referenced from, if you're purging [16:53:33] apergos: are you still the person whom I should email if I want to mirror XML dumps? 
[16:53:33] bblack: kk, yeah, I saw X-Cache headers, cache busting worked, that was the rational behind purgeList [16:53:33] vvv: yep I'm the one [16:53:33] (03CR) 10Aklapper: [C: 031] "fine with me" [puppet] - 10https://gerrit.wikimedia.org/r/266316 (https://phabricator.wikimedia.org/T123581) (owner: 10Dzahn) [16:53:33] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:53:33] PROBLEM - puppet last run on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:53:33] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:53:33] ^ sync-proxies hung up on mw1161 as well :( [16:53:33] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:33] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:07] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: puppet fail [16:55:37] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:47] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration on ur.wikipedia [[gerrit:265888]] (duration: 07m 10s) [16:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:53] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1966063 (10matmarex) #wikimedia-media-storage has no more open tasks and it's been archived. #swift has twenty or so new ones :), and #mediawiki-file-management has also gained a few p... 
[16:55:58] ^ Dereckson check please [16:56:26] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed [16:56:27] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures [16:56:45] Testing. [16:56:57] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up [16:57:26] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [16:57:26] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient [16:57:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266427 (https://phabricator.wikimedia.org/T123627) (owner: 10Dereckson) [16:57:37] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:58:12] (03Merged) 10jenkins-bot: Set category collation to uca-lt on lt.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266427 (https://phabricator.wikimedia.org/T123627) (owner: 10Dereckson) [16:58:52] Tested. [16:59:27] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:57] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:05] RobH cmjohnson1: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160126T1700). [17:00:05] Krenair ebernhardson urandom: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:28] cmjohnson1: you about? [17:00:40] yes [17:01:29] apergos: email sent [17:01:32] thanks! [17:01:36] ok puppet swat time. 
so first thing is all these apache patches [17:01:39] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Set category collation to uca-lt on lt.wikipedia [[gerrit:266427]] (duration: 01m 33s) [17:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:01:48] apache config patches that is. [17:02:01] cmjohnson1: so, it used to be that https://wikitech.wikimedia.org/wiki/Application_servers was accurate [17:02:08] there's only 3 now [17:02:18] !log running updateCollation on ltwiki [17:02:21] but i found last time that manually pushing puppet for apache was painful. so best to do the testing as the page says [17:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:24] Krenair: cool [17:02:37] (03PS1) 10Ottomata: Move analytics-eqiad kafka mirror test to kafka1001 [puppet] - 10https://gerrit.wikimedia.org/r/266528 [17:03:01] cmjohnson1: so review https://wikitech.wikimedia.org/wiki/Application_servers#Deploying_config [17:03:35] you should be able to walk through each change via those steps (for Krenair's config changes) [17:03:51] (so you'll be disabling puppet on app servers during this run) [17:03:58] via salt on neodymium [17:04:19] hmm well that's not good, tried: mwscript updateCollation.php --wiki=ltwiki --previous-colation=uppercase: "processing...Database is read-only: Brief Database Maintenance in progress, please try again in 3 minutes" [17:04:29] (03CR) 10Ottomata: [C: 032] Move analytics-eqiad kafka mirror test to kafka1001 [puppet] - 10https://gerrit.wikimedia.org/r/266528 (owner: 10Ottomata) [17:04:44] cmjohnson1: oh i forgot the most important part, before we touch shit we need to ensure the previous swat window is done [17:04:48] surely those two beta patches don't need prod app server puppets disabling? 
[17:04:57] (they dont appear to be ;) [17:05:26] robh: just trying to run the maintenance script, then I'll be out of your way :P [17:05:31] no worries =] [17:05:48] Since swat is intentionally regular, minor disruptions are no big deal. [17:06:24] Krenair: ever run into the "Database is read-only" message above? Never seen it before running this script... [17:06:24] (plus im just shocked to have puppet swat patches) [17:07:32] jynus, ^ [17:07:45] thcipriani, I've certainly seen DBs go read-only, not sure about that particular maintenance message [17:07:46] in theory that happens when there is general lag, but that is not the case [17:08:17] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [17:08:19] jynus: hmm, lemme try again. [17:08:38] it's s3 [17:09:06] yes, but I do not see it [17:09:11] that script only uses master, so db1038 [17:09:21] (03PS2) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [17:10:04] well I got a slow query message: https://phabricator.wikimedia.org/P2528 [17:10:06] PROBLEM - NTP on fermium is CRITICAL: NTP CRITICAL: Offset 28.98543322 secs [17:10:38] thcipriani: heads up, i'm going to start merging backports for wmf.11 on mira [17:10:58] PROBLEM - NTP on krypton is CRITICAL: NTP CRITICAL: Offset 19.68465185 secs [17:11:17] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [17:11:37] are you deploying marxarelli? 
[17:11:40] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [17:11:46] Krenair: not yet [17:12:00] just wanted to get wmf.11 ready for when swat is finished [17:12:13] puppet swat is immediately following normal swat [17:12:59] Krenair: should i wait? the tentative plan was to sync up to mw1017 for targeting of wmf.11 patches [17:13:12] *targeted testing* [17:13:30] i can hold off if that's a problem [17:13:58] dont interrupt puppet swat with unplanned swatting ;] [17:14:04] :) [17:14:06] this week we actually have patches! [17:15:07] thcipriani: we're still standing by until you give us the all clear to proceed =] [17:15:14] robh: kk. i'll just rebase wmf.11 to get it ready but hold off on syncing [17:15:17] (03PS1) 10Ottomata: Run kafka mirror on both kafka1001 and kafka1002 [puppet] - 10https://gerrit.wikimedia.org/r/266530 [17:15:22] (not rushing you just ensuring you know we arent going to start pushing on top of ya!) [17:16:01] The HHVM process on mw1019 is dying every 5-7 minutes like a clockwork. 1743 HHVM errors from that host in logstash for the last hour. [17:16:08] thcipriani, when static changes are made like https://gerrit.wikimedia.org/r/#/c/265623/ please send the URL (with www.wikimedia.org hostname) to purgeList.php [17:16:21] like this: echo 'https://www.wikimedia.org/static/images/project-logos/etwikiquote.png' | mwscript purgeList.php [17:16:33] Is mw1019 the same host I was whining about yesterday? [17:16:42] I did it this time [17:16:44] Krenair: gotcha, thanks. 
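A small sketch of the purge step described just above: bblack noted that /static/ assets are cached under the www.wikimedia.org hostname regardless of which wiki references them, and Krenair gave the purgeList.php invocation. The helper name below is hypothetical; it only rewrites the hostname of a /static/ URL and leaves anything else untouched. The commented usage line reuses the command form quoted at 17:16:21 and assumes a deploy host where mwscript is available.

```shell
# Hypothetical helper: rewrite a per-wiki /static/ URL to the shared
# www.wikimedia.org hostname that the caches reportedly key on.
canonical_static_url() {
  printf '%s\n' "$1" |
    sed -E 's#^https?://[^/]+/static/#https://www.wikimedia.org/static/#'
}

# Usage on a host with mwscript (command form taken from the log):
#   canonical_static_url 'https://et.wikiquote.org/static/images/project-logos/etwikiquote.png' \
#     | mwscript purgeList.php
```

URLs whose path does not start with /static/ pass through unchanged, so the helper is safe to apply to a mixed list before piping it to the purge script.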
[17:16:58] bd808, I was whining about that one yesterday [17:17:05] (03CR) 10Ottomata: [C: 032] Run kafka mirror on both kafka1001 and kafka1002 [puppet] - 10https://gerrit.wikimedia.org/r/266530 (owner: 10Ottomata) [17:17:11] (03PS3) 10Cmjohnson: beta: Remove deployment.wmflabs.org VHost that doesn't actually resolve [puppet] - 10https://gerrit.wikimedia.org/r/265548 (owner: 10Alex Monk) [17:17:39] bd808, anomie, tgr: merging wmf.11 backports but holding off on deploying to mw1017 until after puppet swat [17:17:43] hmm, so I can connect with db1038 via the sql script from mira, but I can't run the updateCollation script... [17:17:57] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: puppet fail [17:18:04] yeah, the sql script doesn't take that sort of thing into account AFAIK [17:18:06] Krenair: confirmed that we were both complaining about it. J.oe said not to worry about it for now. [17:18:23] is it trying to run it on the wrong host? db1038 is not in read only [17:18:29] marxarelli: sweet. Have fun with our pal Jenkins [17:18:35] maybe it is in read only at mediawiki level? [17:18:36] (03CR) 10Cmjohnson: [C: 032] beta: Remove deployment.wmflabs.org VHost that doesn't actually resolve [puppet] - 10https://gerrit.wikimedia.org/r/265548 (owner: 10Alex Monk) [17:18:45] jynus, I think MW has its own 'read-only' status [17:18:47] yes [17:18:50] cmjohnson: So depending on how long it takes for us to get into our swat window, we may end up rolling some of the proposed patches from today to thursday. [17:19:41] thcipriani / Krenair > et.wikiquote logo now works [17:20:06] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:20:18] Dereckson: cool. Krenair ran the right purgeList command :) [17:20:33] !log disabling puppet on mw cluster [17:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:06] robh: go ahead with puppet swat. 
not sure what's going on with updateCollation, probably take a few to figure it out. [17:21:23] I'm looking into updateCollation [17:22:16] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [24.0] [17:22:57] ok, cmjohnson1 is handling the apache updates and im assisting today, you can see chris is now proceeding =] [17:23:32] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1966143 (10Dzahn) a:3Dzahn [17:24:37] cmjohnson1: While you are working on the apache changes, I'll start on the search ones [17:24:49] okay [17:25:01] ebernhardson: Do either one of your patches require me to restart search service? [17:25:14] they seem like they will simply puppet change into place and roll, but I want to be certain [17:25:43] (03PS2) 10RobH: [cirrus maint] redirect stderr to log and use full mwscript path [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson) [17:26:20] robh: well, one of them does eventually (minimum cluster nodes), but we can't just restart the search service. It requires a 3 day rolling restart across the cluster [17:26:24] robh: me and dcausse will work that out [17:26:28] (03PS1) 10Dzahn: releases: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/266531 (https://phabricator.wikimedia.org/T124261) [17:26:31] (we both have root in elastic*) [17:26:43] ebernhardson: ok, so as long as I see it roll in puppet sucessfully then the puppet swat portion is done for these? 
[17:26:49] robh: yes [17:26:51] (the two search ones, the icinga one i understand ;) [17:26:58] awesome, I'm rebasing and merging them for you now [17:27:13] (03PS3) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [17:27:40] (03CR) 10Dzahn: [C: 032] releases: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/266531 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [17:27:51] (03PS1) 10Ottomata: Set group_prefix on kafka-mirror jmxtrans metrics [puppet] - 10https://gerrit.wikimedia.org/r/266532 [17:27:56] are you executing that from mira? [17:28:09] (03PS3) 10Cmjohnson: mediawiki: Move www.wikimedia.org portal into wwwportals [puppet] - 10https://gerrit.wikimedia.org/r/265642 (owner: 10Alex Monk) [17:28:11] (03CR) 10RobH: [C: 032] [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson) [17:28:12] can you try from terbium? [17:28:27] jynus: the updateCollation script? yes. [17:28:32] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [17:28:34] to both? [17:28:38] So that string is from wikimedia's config [17:28:41] But it's all commented out [17:28:50] # This key must exist for the master switch script to work [17:28:50] 'readOnlyBySection' => array( [17:28:51] # 'DEFAULT' => 'Brief Database Maintenance in progress, please try again in 3 minutes', #s3 [17:28:51] etc. 
[17:28:55] (03PS2) 10Ottomata: Set group_prefix on kafka-mirror jmxtrans metrics [puppet] - 10https://gerrit.wikimedia.org/r/266532 [17:29:08] (03PS3) 10Ottomata: Set group_prefix on kafka-mirror jmxtrans metrics [puppet] - 10https://gerrit.wikimedia.org/r/266532 [17:29:25] (03CR) 10Ottomata: [C: 032 V: 032] Set group_prefix on kafka-mirror jmxtrans metrics [puppet] - 10https://gerrit.wikimedia.org/r/266532 (owner: 10Ottomata) [17:29:39] (03PS2) 10Dzahn: releases: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/266531 (https://phabricator.wikimedia.org/T124261) [17:29:55] Krenair: commented in db-eqiad.php, but uncommented in db-codfw.php [17:30:02] (03CR) 10Dzahn: [V: 032] releases: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/266531 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [17:30:16] yep [17:30:21] I was going to say that [17:30:22] running from terbium seems to work [17:30:27] since, because, eqiad. [17:30:32] Ohhh. [17:30:33] Yep [17:30:38] <_joe_> thcipriani: what's the issue? [17:30:48] not server-related, _joe_ [17:30:53] mediawiki-config [17:31:01] although I am unsure how to fix [17:31:01] <_joe_> I am interested anyways [17:31:10] <_joe_> what is the issue specifically? [17:31:13] because effectively, codfw is read only [17:31:16] <_joe_> with databases? 
[17:31:16] (03PS4) 10Cmjohnson: mediawiki: Move www.wikimedia.org portal into wwwportals [puppet] - 10https://gerrit.wikimedia.org/r/265642 (owner: 10Alex Monk) [17:31:25] <_joe_> heh, we should pair up on those [17:31:39] <_joe_> I have ideas on how to configure appservers [17:32:25] but mediawiki tries to write to the local master that is now the eqiad master [17:32:36] ACKNOWLEDGEMENT - puppet last run on bromine is CRITICAL: CRITICAL: puppet fail daniel_zahn fixing puppet roles [17:32:39] but mediawiki won't allow [17:32:57] even if it is rw in reality, from the point of view of the main master [17:33:05] *main datacenter [17:33:26] (03PS1) 10Ottomata: Lint fixes for role::kafka::analytics::mirror [puppet] - 10https://gerrit.wikimedia.org/r/266533 [17:33:36] I actually thought about this before db-eqiad.php and db-codfw.php makes no sense [17:33:54] I thought db-codfw points to the eqiad masters? [17:33:54] (03CR) 10Cmjohnson: [C: 032] mediawiki: Move www.wikimedia.org portal into wwwportals [puppet] - 10https://gerrit.wikimedia.org/r/265642 (owner: 10Alex Monk) [17:34:03] hmm. more NTP issues than usual in icinga [17:34:13] but restarting daemons usually fixes them [17:34:14] because right now db-codfw's master is in eqiad [17:34:31] (03PS3) 10RobH: [cirrus maint] redirect stderr to log and use full mwscript path [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson) [17:34:40] so we can comment out the read-only part? 
[17:34:45] (03PS2) 10Ottomata: Lint fixes for role::kafka::analytics::mirror [puppet] - 10https://gerrit.wikimedia.org/r/266533 [17:34:54] (03CR) 10RobH: [C: 032] [cirrus maint] redirect stderr to log and use full mwscript path [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson) [17:35:00] !log mw1258 - restart hhvm [17:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:16] no, we do not want to accidentally write to eqiad from codfw, right? [17:35:19] icinga config is broken [17:35:47] RECOVERY - Apache HTTP on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.025 second response time [17:36:06] Error: Contact group 'analytics_eqiad' specified in service 'Kafka MirrorMaker analytics-eqiad' for host 'kafka1001' is not defined anywhere! [17:36:38] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 64840 bytes in 0.084 second response time [17:36:46] actually, probably yes, because only some shards are read-only [17:36:50] ? [17:36:53] oook [17:37:01] unsure of how that stuff works, with you shortly [17:37:17] are contact groups not available to use everywhere? [17:37:18] but that is a security concern [17:37:35] ottomata: is it maybe just a - vs _ or so? 
[17:37:43] looks [17:37:47] because we do not want cross-wiki queries until we set up SSL there [17:37:57] ok, rolling the maint script update onto elastic1001 (the rest will get on normal call in, this is a paranoid post merge puppet run) [17:38:42] ottomata: so analytics_eqiad is a service group, defined and ready to use, but the error says it looks for the same thing as a contact group [17:38:48] like group of people to notify [17:39:11] I am unsure on how to proceed, Krenair, _joe_ I accept suggestions [17:39:56] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [17:40:13] maybe set the master to a local master, but make sure it is read-only at mysql side? [17:40:17] RECOVERY - NTP on pollux is OK: NTP OK: Offset 0.001584768295 secs [17:41:06] (03PS1) 10Dereckson: Document db-codfw readOnlyBySection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266534 [17:41:17] jynus: so should we add some comment like this one? ^ [17:41:50] well, first decide what to do :-) [17:42:09] jynus, db-codfw's master DB lines point to eqiad... therefore, shouldn't it be safe to remove read-only mode? [17:42:23] ottomata: i think it's this modules/role/manifests/graphite/alerts.pp: group => 'analytics_eqiad', [17:42:32] Krenair, I explained why not- we do not want cross-datacenter writes [17:42:45] that kind of group is probably a contact group, because they are alerts [17:42:51] well [17:43:00] !log ltwiki collation updated 503623 rows processed [17:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:10] let's just leave it as is and have people run maint scripts from terbium instead? [17:43:13] that is in general- we do not want cross-datacenter queries [17:43:27] if they need to write [17:43:35] ^ Dereckson ltwiki is updated. [17:43:42] jynus: Krenair thank you for the help! 
[17:43:44] we should also change the read only reason to explain this [17:44:04] Thank you for the deploy. [17:44:20] we should fix it, both with the comment and I think it is not fully read-only [17:44:28] mutante: what's the difference between contact group and nagios service group? [17:44:38] only s3 (twice), s1 and s5 [17:44:41] (03PS1) 10Dzahn: graphite alerts, fix analytics monitoring group name [puppet] - 10https://gerrit.wikimedia.org/r/266535 [17:45:02] oh [17:45:04] i should just change it ot analytics [17:45:05] hm [17:45:12] or really not [17:45:13] ottomata: contact group is a group of people, only used for notifications. service group is a group of services, groups service checks together in the web ui [17:45:13] is there a way to check that s2 is not read-only? [17:45:14] at all [17:45:14] ok. [17:45:15] nm [17:45:21] ottomata: did just that [17:45:22] oh [17:45:29] cmjohnson1, robh: got a bit distracted by the DB stuff, how's swat going? [17:45:39] hmmm [17:45:46] mutante: i think that is not what is causing the error though [17:45:50] chris has tow of the three pushed to the test apache (the rest have puppet halted) [17:45:57] and is doing the tests now, so its moving along =] [17:46:05] group is actually monitoring::service group [17:46:06] let me open a ticket, even if this is trivial, because it is important for the future failover [17:46:07] cool [17:46:16] which is a service group [17:46:19] mutante: i'm on it... [17:46:52] ottomata: i think it ends up being a contact group because that is a graphite alert [17:47:00] ottomata: ok, cool [17:47:42] the problem is a recent commit of mine for kafka mirror stuff [17:47:51] hm, why does nrpe::monitor_service not take a service group param!? 
[17:48:42] (03PS1) 10Ottomata: Remove incorrect nagios_servicegroup param for kafka::mirror::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266536 [17:49:09] (03PS4) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [17:49:55] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [17:50:10] ottomata: it probably should, our use of service groups could be improved, we have some but a lot is missing [17:50:16] (03PS3) 10Cmjohnson: beta: Move login and bits apache configs into wikimedia.conf, like prod [puppet] - 10https://gerrit.wikimedia.org/r/265659 (owner: 10Alex Monk) [17:50:39] (03CR) 10Ottomata: [C: 032] Remove incorrect nagios_servicegroup param for kafka::mirror::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266536 (owner: 10Ottomata) [17:51:05] 6operations, 10DBA, 10MediaWiki-Configuration, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1966220 (10jcrespo) 3NEW [17:51:13] ^ [17:51:32] (03PS1) 10Ottomata: Update kafka submodule and remove incorrect use of nagios_servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/266538 [17:51:43] (03PS3) 10Ottomata: Lint fixes for role::kafka::analytics::mirror [puppet] - 10https://gerrit.wikimedia.org/r/266533 [17:51:45] (03PS2) 10Ottomata: Update kafka submodule and remove incorrect use of nagios_servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/266538 [17:51:47] (03PS2) 10Dzahn: graphite alerts, fix analytics monitoring group name [puppet] - 10https://gerrit.wikimedia.org/r/266535 [17:52:01] (03CR) 10Cmjohnson: [C: 032] beta: Move login and bits apache configs into wikimedia.conf, like prod 
[puppet] - 10https://gerrit.wikimedia.org/r/265659 (owner: 10Alex Monk) [17:52:03] (03CR) 10Ottomata: [C: 032 V: 032] Lint fixes for role::kafka::analytics::mirror [puppet] - 10https://gerrit.wikimedia.org/r/266533 (owner: 10Ottomata) [17:52:09] (03PS4) 10Ottomata: Lint fixes for role::kafka::analytics::mirror [puppet] - 10https://gerrit.wikimedia.org/r/266533 [17:52:17] (03CR) 10Ottomata: [V: 032] Lint fixes for role::kafka::analytics::mirror [puppet] - 10https://gerrit.wikimedia.org/r/266533 (owner: 10Ottomata) [17:52:35] (03PS3) 10Ottomata: Update kafka submodule and remove incorrect use of nagios_servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/266538 [17:52:37] (03PS5) 10RobH: [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson) [17:52:49] 6operations, 10DBA, 10MediaWiki-Configuration, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1966245 (10jcrespo) My personal recommendation is to make it 100% read only, point to local masters, and force maintenance f... [17:52:54] (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule and remove incorrect use of nagios_servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/266538 (owner: 10Ottomata) [17:53:37] 6operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 2 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1966247 (10Krenair) phab doesn't auto-add projects like that anymore [17:53:47] (03CR) 10RobH: [C: 032] [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson) [17:53:51] (03PS6) 10RobH: [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson) [17:54:07] Krenair, let me wish! 
[17:54:20] jynus: on Phabricator, you can cc team projects in the subscribers field by the way [17:54:39] yeah, but editing is too much work [17:54:52] you can CC any project in subscribers [17:54:56] that doesn't mean you should though [17:55:06] yep [17:55:17] although it wouldn't affect me much, others would probably not like to see it abused :p [17:55:23] jynus: you have a 'add CCs' section (will be 'Change subscribers' on next update) [17:55:55] I know, was a mistake [17:56:01] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1966254 (10Legoktm) It's still running :/ [17:56:28] can we focus on the task itself? [17:56:45] I don't have anything useful to add to it [17:57:09] as in you agree with my suggestion? [17:57:15] I'd be happy with fixing the read-only message in the config to be useful [17:57:41] question is that I think the read only config is also wrong [17:58:23] 6operations, 10Citoid, 6Services, 10Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1966263 (10mobrovac) >>! In T110476#1965408, @BBlack wrote: > Are things still using the hostnames `citoid.wikimedia.org` and/or `citoid.eqiad.wikimedia.org`, which map to the cache_pars... [17:59:33] (03CR) 10Jcrespo: [C: 04-1] "We need to fix the config first before freezing it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266534 (owner: 10Dereckson) [18:02:59] (03PS1) 10Cscott: Add missing `.deployment-prep` to redis server hostname. 
[puppet] - 10https://gerrit.wikimedia.org/r/266539 [18:03:59] let me focus on our primary infrastructure first [18:04:26] PROBLEM - NTP on mendelevium is CRITICAL: NTP CRITICAL: Offset 5.305729508 secs [18:04:33] (03PS2) 10Jcrespo: Pool new parsercache pc1005 after cloning it from pc1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266516 (https://phabricator.wikimedia.org/T121888) [18:04:57] PROBLEM - NTP on technetium is CRITICAL: NTP CRITICAL: Offset 21.0652746 secs [18:05:20] (03CR) 10Jcrespo: [C: 032] Pool new parsercache pc1005 after cloning it from pc1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266516 (https://phabricator.wikimedia.org/T121888) (owner: 10Jcrespo) [18:05:26] (03PS5) 10RobH: Add alert for elasticsearch 50th percentile prefix search time [puppet] - 10https://gerrit.wikimedia.org/r/265942 (https://phabricator.wikimedia.org/T124542) (owner: 10EBernhardson) [18:06:11] (03CR) 10RobH: [C: 032] Add alert for elasticsearch 50th percentile prefix search time [puppet] - 10https://gerrit.wikimedia.org/r/265942 (https://phabricator.wikimedia.org/T124542) (owner: 10EBernhardson) [18:08:38] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Pool new parsercache pc1005 after cloning it from pc1002 (duration: 01m 28s) [18:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:18] ebernhardson: all of your swat patches are merged, the neon one for icinga alerts is going live now. [18:09:42] (03PS1) 10Dereckson: Set WikidataPageBanner namespaces on fr.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266541 (https://phabricator.wikimedia.org/T123084) [18:12:55] 6operations, 10Graphoid, 6Services, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1966309 (10mobrovac) AFAIK, `graphoid.(eqiad.)wikimedia.org` can be safely removed. 
[18:14:00] !log i broke icinga, fixing [18:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:26] !log starting puppet on mw cluster [18:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:30] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1966329 (10jcrespo) 5Open>3Resolved a:3jcrespo The 3 databases have been successfully imported into m5-master. Use T124704 to request access and puppetizing it. [18:15:42] 10Ops-Access-Requests, 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1966370 (10jcrespo) [18:16:59] (03PS1) 10RobH: Revert "Add alert for elasticsearch 50th percentile prefix search time" [puppet] - 10https://gerrit.wikimedia.org/r/266543 [18:17:17] rolling back my change to unbreak icinga =P [18:17:36] (03CR) 10RobH: [C: 032] Revert "Add alert for elasticsearch 50th percentile prefix search time" [puppet] - 10https://gerrit.wikimedia.org/r/266543 (owner: 10RobH) [18:18:14] !log running mwscript updateArticleCount.php --wiki=jawiki --update=1 [18:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:42] robh: it was broken before i think [18:18:50] ..... [18:18:55] someone left it in a broken state live? [18:19:12] mutante: if so then my rollback wont resurrect icinga [18:19:43] robh: see backlog from about 9:38 [18:19:57] https://gerrit.wikimedia.org/r/#/c/266535/ [18:20:06] ok... [18:20:18] mutante: the backlog for me is very cluttered, can you summarize? [18:20:44] ottomata: so did you break it before i got to it? 
[18:20:55] (03PS1) 10TheDJ: Raise file upload limit to 2,5 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) [18:21:46] !log icinga is broken, it seems it was from a change before mine, but my forced reload broke it [18:21:47] robh: i saw the icinga check for icinga config itself reported it as broken, then i ran icinga -v and saw it was looking for a contact group called "analytics_eqiad", there is a service group called analytics_eqiad but not a contact group [18:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:21:51] i haven't checked, but i think i fixed it [18:21:53] no? [18:22:03] https://gerrit.wikimedia.org/r/#/c/266538/ [18:22:07] i pushed a change and it broke it again then [18:22:50] ok, back. [18:22:59] my change also broke it, sorry for the bad ping ottomata [18:23:09] it just took a few moments to catch up =P [18:24:19] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:24:20] I'm getting "404 File Not Found" when creating pages [18:24:30] ah, unrelated things then and fixed, cool ! 
[18:24:41] RECOVERY - NTP on krypton is OK: NTP OK: Offset -0.004741430283 secs [18:25:01] ebernhardson: i take it back, i had to roll the icinga patch back cuz something broke [18:25:05] (03Abandoned) 10Dzahn: graphite alerts, fix analytics monitoring group name [puppet] - 10https://gerrit.wikimedia.org/r/266535 (owner: 10Dzahn) [18:25:10] but i've taken the liberty to move it to thursday's puppetswat [18:25:16] and i'm going to poke at it before then [18:25:34] urandom: we didn't get to yours today but it's now on thursday [18:25:40] robh: kk [18:25:55] (03PS1) 10Dzahn: bugzilla-static: ensure_resource to fix duplicates [puppet] - 10https://gerrit.wikimedia.org/r/266546 [18:26:08] and cmjohnson1 is still pushing his apache changes to the rest of the cluster (so we aren't out of the window quite yet) [18:26:32] (03PS2) 10Dzahn: bugzilla-static: ensure_resource to fix duplicates [puppet] - 10https://gerrit.wikimedia.org/r/266546 [18:27:07] Houston, we've got a big problem [18:27:27] There is no user by the name "Vituzzu". Check your spelling. <-- while trying to log in to meta [18:27:33] Does anyone know why https://meta.wikimedia.org/wiki/Special:BlankPage redirects to https://wikimediafoundation.org/wiki/Special:BlankPage ? [18:27:53] Hm meta seems broken yes [18:27:54] !log i broke icinga, but then i fixed it, icinga back to normal. [18:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:28:06] and commons. Can't save things. get 404's [18:28:12] cmjohnson1: we have an issue. [18:28:15] with logging in [18:28:33] also got redirected to wikimediafoundation-wiki when trying to do so sometimes [18:29:18] cmjohnson1: i think one of those changes broke things [18:29:20] Hi. Meta-Wiki seems broken. Known issue? [18:29:22] robh: kk [18:29:24] https://phabricator.wikimedia.org/T124804 [18:29:33] Vito: It's not a CA-thing. [18:29:37] robh: was it the prior brokenness i saw mentioned, or was it mine as well? 
[18:29:43] I think so sjoerddebruin [18:29:47] robh: that seems bad and related to that change that moved wwwportals [18:29:50] PROBLEM - NTP on planet1001 is CRITICAL: NTP CRITICAL: Offset 1.700246453 secs [18:29:51] i'd revert that [18:29:54] ebernhardson: not sure but rolling it back fixed it [18:30:00] ok [18:30:00] and now we're in outage condition [18:30:02] Could someone please rv group0? [18:30:07] cmjohnson1: revert https://gerrit.wikimedia.org/r/#/c/265659/ [18:30:14] its breakign things [18:30:15] robh: did you or should i [18:30:24] i did not, please do so [18:30:36] (03PS1) 10Cmjohnson: Revert "beta: Move login and bits apache configs into wikimedia.conf, like prod" [puppet] - 10https://gerrit.wikimedia.org/r/266549 [18:30:54] jynus, thanks for the quick response in provisioning the erstwhile-ruthenium-dbs. what is involved in getting access to them and puppetizing it (T124704)? [18:31:11] Since its login related I don't think we have to do anything to caching due to this. [18:31:22] but im not certain. [18:31:29] We're 301ing meta.wikimedia.org to wikimediafoundation.org currently. [18:31:35] subbu, please wait some seconds, there are some issues going on [18:31:44] So there might be bad cache at some level. [18:31:46] revert this https://gerrit.wikimedia.org/r/#/c/265642/4/modules/mediawiki/files/apache/sites/wikimedia.conf [18:31:49] cmjohnson1: [18:32:11] robh: it's reverting modules/role/manifests/elasticsearch/alerts.pp [18:32:14] i think that because it touched wikimedia.org docroot [18:32:30] wait [18:32:31] fuck [18:32:33] that would be in respose to Leah's comment [18:32:35] mutante: which one to revert? [18:32:39] oh [18:32:49] mutante: so all of the changes broke things? [18:33:03] so multiple issues [18:33:05] i don't know if more than one broke something, i just suspected that one [18:33:12] it's possible , yes [18:33:13] jynus, k [18:33:53] Reedy: ^ hey are you here [18:33:59] Ugh [18:34:02] we send 301s? 
[18:34:10] PROBLEM - NTP on dubnium is CRITICAL: NTP CRITICAL: Offset unknown [18:34:31] hey all, is officewiki looking broken and otherwise is redirecting to foundationwiki. related? [18:34:41] dr0ptp4kt: Yes. [18:34:46] Leah: thx [18:34:47] same happens for other wikis dr0ptp4kt [18:34:52] Vito: thx [18:34:56] thx all around :) [18:34:57] Everything on wikimedia.org is probably borked currently, including login, office, and meta. [18:35:04] we just need to wait for the tech guys to fix it [18:35:07] yikes [18:35:20] meanwhile we can say silly funny stuffs about this outage [18:35:27] In another channel, sure. :-) [18:35:32] hehehehe [18:36:03] (03PS1) 10Cmjohnson: Revert "mediawiki: Move www.wikimedia.org portal into wwwportals" [puppet] - 10https://gerrit.wikimedia.org/r/266551 [18:36:12] btw I fear many bots/tasks will need to be restarted as soon as the outage ends [18:36:37] (03PS2) 10Cmjohnson: Revert "beta: Move login and bits apache configs into wikimedia.conf, like prod" [puppet] - 10https://gerrit.wikimedia.org/r/266549 [18:38:05] (03PS2) 10Cmjohnson: Revert "mediawiki: Move www.wikimedia.org portal into wwwportals" [puppet] - 10https://gerrit.wikimedia.org/r/266551 [18:38:16] (03CR) 10Cmjohnson: [C: 032] Revert "beta: Move login and bits apache configs into wikimedia.conf, like prod" [puppet] - 10https://gerrit.wikimedia.org/r/266549 (owner: 10Cmjohnson) [18:39:13] (03PS3) 10Cmjohnson: Revert "mediawiki: Move www.wikimedia.org portal into wwwportals" [puppet] - 10https://gerrit.wikimedia.org/r/266551 [18:39:57] greg-g, robh: do you think I could sneak an OCG deploy either before or after the train deploy today? [18:40:06] we're in an outage. [18:40:16] PROBLEM - NTP on bohrium is CRITICAL: NTP CRITICAL: Offset 1.762993813 secs [18:40:17] robh: ah. that's why I ask! 
[18:40:20] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966496 (10Vituzzu) [18:40:27] (03CR) 10Cmjohnson: [C: 032] Revert "mediawiki: Move www.wikimedia.org portal into wwwportals" [puppet] - 10https://gerrit.wikimedia.org/r/266551 (owner: 10Cmjohnson) [18:40:52] changes reverted [18:41:31] :) [18:41:32] salt run puppet? [18:41:53] robh: i'll check back after the train deploy then, hopefully things will not be on fire. [18:42:22] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966502 (10Aklapper) @Vituzzu: Thanks for reporting this. https://gerrit.wikimedia.org/r/#/c/266551/ got reverted so things should be back to normal. Can you confirm (by bypas... [18:42:48] <_joe_> hoo: I'm on it [18:43:12] :) [18:43:32] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966507 (10I_JethroBT) Agreed, meta.wikimedia.org has been completely replaced with a broken-ish landing page for Wikimedia projects: {F3283563} [18:43:43] Meh, and https://commons.wikimedia.org/wiki/Commons:Village_pump seems to redirect me to wmf: [18:43:57] yes it does [18:44:05] meta is down? https://meta.wikimedia.org/ [18:44:11] AndyRussG: Yes, known. [18:44:18] Also, yes. https://phabricator.wikimedia.org/T124804 [18:44:20] Leah: ah K thx :) [18:44:21] AndyRussG: changes causing it are being reveted [18:44:33] K thx! 
[18:44:49] <_joe_> !log running salt --batch-size=20 -C 'G@luster:appserver and G@site:eqiad' cmd.run 'puppet agent -t --tags mw-apache-config' [18:44:52] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: wikimedia.org seems to be gone - https://phabricator.wikimedia.org/T124804#1966511 (10Aklapper) and https://commons.wikimedia.org/wiki/Commons:Village_pump redirects me to wmf: [18:45:03] it affects everything in wikimedia.org, but not wikipedia.org [18:45:05] Leah: Wiki13: Can u tell me when this happened about? It would affect FR [18:45:12] wikitech seems fine. [18:45:23] i think about 15 mins ago [18:45:35] according to the log of this channel [18:45:39] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: Meta and Commons seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966518 (10Aklapper) [18:45:40] Or I should say, might affect fr (checking) [18:45:45] Wiki13: thx! [18:45:50] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: Meta and Commons seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966520 (10I_JethroBT) @Aklapper after bypassing my cache, meta.wikimedia.org is still gone. [18:45:51] nah it won't affect FR [18:45:58] only *.wikimedia.org sites [18:46:02] AndyRussG: doesn't affect wikipedia.org [18:46:03] like meta and commons [18:46:04] Wiki13: FR uses CentralNotice which depends on banners from meta [18:46:05] <_joe_> it's running, it will take some time to be applied though [18:46:11] <_joe_> about 10 minutes at least [18:46:17] then it's borked AndyRussG [18:46:24] ;p [18:46:25] AndyRussG: I don't see an exact culprit at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:25] AndyRussG Wiki13 : yeah, donate.wikimedia.org is redirecting too. and CN is inaccessible [18:46:26] _joe_: What about varnish? 
These were 301s [18:46:46] <_joe_> hoo: I'm first fixing the app layer [18:46:49] we have to wait until the techs here fix it [18:46:50] the-wub: what campaigns are (were) up? [18:46:52] (for those tuning in now, this is https://phabricator.wikimedia.org/T124804) [18:47:03] Sure, but that should go on its own now... hopefully [18:47:21] Leah: AndyRussG: it was the puppetswat [18:47:28] Leah: probably https://gerrit.wikimedia.org/r/#/c/265642 it was in the puppet swat [18:47:31] AndyRussG: only low level banners. let's take FR discussion to our channel [18:47:35] Thanks to folks for fixing the problem. : ) [18:47:45] the-wub: yep, thx! [18:47:52] whichhhh i suppose isn't on SAL, that kind of sucks, eh? [18:47:57] MatmaRex: Right. [18:48:00] <_joe_> hoo: it will take some time to recover from though [18:48:14] <_joe_> it's a _ton_ of stuff to figure out [18:49:59] ehm, yes, is Meta down, right? [18:50:01] _joe_ Are we talking a matter of hours? [18:50:06] mafk - Correct. [18:50:26] i_jethrobot: thanks [18:50:37] <_joe_> i_jethrobot: 20 minutes tops [18:50:39] <_joe_> I hope [18:51:28] _joe_ OK. Good time for some lunch, then. : ) [18:52:37] RECOVERY - NTP on dubnium is OK: NTP OK: Offset -0.02100622654 secs [18:52:52] 6operations, 10MediaWiki-extensions-CentralAuth, 10netops: Meta and Commons seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966553 (10Vituzzu) @Aklapper still doesn't work for me. I'm currently served by Amsterdam's cluster btw. [18:52:54] jynus: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=db1047 [18:53:00] PROCS CRITICAL: 2 processes with UID = 0 (root), args '/bin/bash /usr/local/bin/eventlogging_sync.sh'  [18:53:00] ok? 
[18:53:05] operations, MediaWiki-extensions-CentralAuth, netops: Meta, Commons, Wikispecies seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966554 (OhanaUnited) [18:53:26] ottomata, not a concern now [18:53:32] operations, MediaWiki-extensions-CentralAuth, netops: Meta, Commons, Wikispecies seem to redirect to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966462 (OhanaUnited) Wikispecies also has the same issue [18:53:41] k [18:54:47] operations, MediaWiki-extensions-CentralAuth, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966572 (matmarex) [18:55:41] operations, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966580 (matmarex) [18:55:57] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966462 (matmarex) [18:55:59] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966592 (Dzahn) The remaining issues are because a tagged puppet run is now executed on all appservers, which... [18:56:02] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966593 (Tbayer) Office.wikimedia.org is affected too, just for the record.
[18:56:39] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966597 (Dzahn) >>! In T124804#1966593, @Tbayer wrote: > Office.wikimedia.org is affected too, just for the r... [18:57:04] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966603 (MZMcBride) This issue is definitely going to require incident documentation ( (PS1) EBernhardson: Put more like query load back on eqiad for codfw load testing [mediawiki-config] - https://gerrit.wikimedia.org/r/266559 [18:58:15] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966615 (Izno) >>! In T124804#1966597, @Dzahn wrote: > > everything under .wikimedia.org is affected but not... [18:58:19] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966618 (Mike_Peel) >>! In T124804#1966597, @Dzahn wrote: > > everything under .wikimedia.org is affected bu... [18:59:06] RECOVERY - NTP on planet1001 is OK: NTP OK: Offset -0.009496450424 secs [19:00:04] marxarelli: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160126T1900). [19:00:07] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966649 (Dzahn) >>! In T124804#1966618, @Mike_Peel wrote: >>>! In T124804#1966597, @Dzahn wrote: >> >> every...
[19:00:19] (CR) Aaron Schulz: Use the logical redis definition for GettingStarted. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: Giuseppe Lavagetto) [19:02:18] !log backports to wmf.11 ready on mira but delaying train due to wikimedia.org outage [19:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:35] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966668 (Mike_Peel) This kind of outage should probably appear on http://status.wikimedia.org/ ... (unless th... [19:06:45] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966674 (Pine) @Mike_Peel agreed. [19:08:26] RECOVERY - NTP on fermium is OK: NTP OK: Offset 0.01764667034 secs [19:08:56] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966694 (Pine) Update: Commons is working now, but not Meta. [19:11:06] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [19:11:11] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966709 (matmarex) >>! In T124804#1966668, @Mike_Peel wrote: > This kind of outage should probably appear on...
[19:12:24] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966713 (Pine) Commons is down again for me. [19:12:38] RECOVERY - NTP on bohrium is OK: NTP OK: Offset -0.009258747101 secs [19:13:43] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966722 (Vituzzu) "There is no user by the name "Vituzzu". Check your spelling." again at meta. [19:13:59] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966723 (Aklapper) >>! In T124804#1966713, @Pine wrote: > Commons is down again for me. Please see T124804#1... [19:14:41] operations, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966725 (Dzahn) NEW [19:14:42] !log issuing a varnish ban on all eqiad backend varnish for req.http.host .*wikimedia.org [19:14:43] operations, Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1966735 (TJones) David mentioned this ticket, and I had to take a peek. > If my math is right, a 100% increase in 12 months extrapolated to 18 months gives > > current capacity = 1 > in...
[19:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:06] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966736 (matmarex) [19:15:15] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966738 (Dzahn) >>! In T124804#1966668, @Mike_Peel wrote: > This kind of outage should probably ap... [19:16:18] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966744 (RobH) Operations is still working on this issue. At this time the underlying issue has b... [19:17:37] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966746 (Pine) @Dzahn @RobH thank you. [19:18:55] operations, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966753 (Dzahn) [19:18:57] operations, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966755 (Mike_Peel) Checking for specific strings would make sense - standard HTTP tokens or headers perhaps? But beyond that, the user expectation...
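The content check proposed in T124812 (look for an expected string on a wikimedia.org site instead of trusting a 2xx/3xx status) could be sketched roughly as below. This is only an illustration under stated assumptions, not an existing icinga or watchmouse plugin: the helper name `classify_response`, its parameters, and the marker string are hypothetical, and the HTTP fetch itself is left out so the logic can be read on its own.

```python
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_response(requested_url, status, location, body, marker):
    """Hypothetical helper: map one HTTP response to a monitoring state.

    requested_url -- URL that was fetched (redirects NOT followed)
    status        -- HTTP status code of the response
    location      -- value of the Location header, or None
    body          -- response body as text
    marker        -- string the real page is expected to contain (assumption)
    """
    requested_host = urlsplit(requested_url).hostname
    if status in REDIRECT_CODES:
        target_host = urlsplit(location or "").hostname
        # A redirect to another host is what this outage looked like
        # (commons.wikimedia.org -> wikimediafoundation.org).
        if target_host and target_host != requested_host:
            return "CRITICAL"
        # Same-host or relative redirect: unusual, but not the outage pattern.
        return "WARNING"
    if status != 200:
        return "CRITICAL"
    # A 200 alone is not enough; require the expected content.
    return "OK" if marker in body else "CRITICAL"
```

A caller would fetch the page without following redirects and pass the pieces in, for example `classify_response(url, resp_status, resp_location, resp_text, "Wikimedia Commons")`. Keeping the decision separate from the fetch is a design choice made here so the rules can be tested without network access.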
[19:19:31] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966757 (jcrespo) At 7:36 PM, for reasons operations team has not yet investigated, a wrong config... [19:20:11] (hmm, is loginwiki actually down?) [19:21:39] I can access that MatmaRex [19:21:45] and meta [19:21:52] hmm, actually, things might just be fixed now [19:22:07] i'm just idly wondering if loginwiki was actually affected [19:22:14] anyway. not important. [19:22:39] login, meta and commons ok on brazil. [19:22:54] <_joe_> can anyone still having issues please state so? [19:22:58] MatmaRex: there was a report in the wm-l thread about login to mw.org not working [19:23:06] <_joe_> because the problems should be fixed at least on desktop [19:23:30] _joe_: I still can't get Meta or Commons on desktop [19:23:30] <_joe_> Pine: as of now? [19:23:31] Yes [19:23:35] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966774 (Mike_Peel) >>! In T124804#1966757, @jcrespo wrote: > @Mike_Peel That panel is not handled... [19:23:38] <_joe_> can you open a new browser window to test? [19:23:49] _joe_: Works for me [19:23:52] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966776 (jcrespo) Correction, it was 19:23 UTC.
[19:23:57] <_joe_> because browsers tend to cache pages [19:24:37] _joe_: I just refreshed, they're both working now for me [19:24:50] <_joe_> Pine: :) glad to hear [19:24:51] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966779 (jeblad) ...and from Oslo, 10 points for well-done cleanup! :) [19:25:06] _joe_: Yup, all clear in Chrome incognito [19:25:27] To stay up, or not to stay up, that is the question. [19:25:28] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966782 (jcrespo) > The webpage showing the status of operations isn't handled by the operations t... [19:25:52] operations, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966785 (Grendelkhan) Additionally, are there presubmit/integration checks that would have caught this? The builds looked green on push. [19:26:00] Commons and Meta do not redirect for me anymore. [19:26:27] PROBLEM - NTP on rutherfordium is CRITICAL: NTP CRITICAL: Offset 12.7241199 secs [19:27:25] !log issuing a varnish ban on all eqiad frontend varnish for req.http.host .*wikimedia.org [19:27:35] !log issuing a varnish ban on all codfw backend varnish for req.http.host .*wikimedia.org [19:27:54] !log issuing a varnish ban on all codfw frontend varnish for req.http.host .*wikimedia.org [19:28:00] _joe_: is it ok to proceed with the train to mw1017 and group0?
[19:28:07] !log issuing a varnish ban on all ulsfo backend varnish for req.http.host .*wikimedia.org [19:28:15] !log issuing a varnish ban on all ulsfo frontend varnish for req.http.host .*wikimedia.org [19:28:20] marxarelli: no, please not yet [19:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:22] following the fun invalidation stuff [19:28:25] <_joe_> marxarelli: please not yet [19:28:28] ack [19:28:32] !log issuing a varnish ban on all ulsfo backend varnish for req.http.host .*wikimedia.org [19:28:38] !log issuing a varnish ban on all esams backend varnish for req.http.host .*wikimedia.org [19:28:51] <_joe_> (we're just back logging, we already did all of that) [19:28:51] !log issuing a varnish ban on all esams frontend varnish for req.http.host .*wikimedia.org [19:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:55] !log all of the above already done, back logging [19:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:49] _joe_, akosiaris: Thanks for the fast response. 
:-) [19:32:16] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [19:33:28] (PS1) Eevans: Enable EventBus on remaining (applicable) wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/266564 (https://phabricator.wikimedia.org/T116786) [19:34:09] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [19:34:20] (CR) Jhobs: [C: 1] "Assuming that 0.01 -> 0.1 change was intentional, then this LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov) [19:35:33] !log all of the above referred to cache_text [19:36:13] !log issuing a varnish ban on all eqiad mobile backend varnish for req.http.host .*wikimedia.org [19:36:20] !log issuing a varnish ban on all eqiad mobile frontend varnish for req.http.host .*wikimedia.org [19:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:54] !log issuing a varnish ban on all codfw mobile backend varnish for req.http.host .*wikimedia.org [19:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:23] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966833 (MZMcBride) >>! In T124804#1966722, @Vituzzu wrote: > "There is no user by the name "Vituz... [19:38:28] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966835 (Vituzzu) >>!
In T124804#1966833, @MZMcBride wrote: >>>! In T124804#1966722, @Vituzzu wrot... [19:38:34] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966836 (jcrespo) Update: while we believe most issues have been solved now, the caching purge has... [19:39:21] operations, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966839 (jayvdb) [19:39:38] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 772570 bytes in 6.513 second response time [19:39:40] operations, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1966845 (Harej) [19:40:51] operations, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966853 (Dzahn) There is a script called apache-fast-test. (modules/apache/files/apache-fast-test) but it's not run automatically by integration. I...
[19:43:23] !log issuing a varnish ban on all codfw mobile frontend varnish for req.http.host .*wikimedia.org [19:43:42] !log issuing a varnish ban on all ulsfo mobile backend varnish for req.http.host .*wikimedia.org [19:43:48] !log issuing a varnish ban on all ulsfo mobile frontend varnish for req.http.host .*wikimedia.org [19:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:52] !log issuing a varnish ban on all esams mobile backend varnish for req.http.host .*wikimedia.org [19:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:00] !log issuing a varnish ban on all esams mobile frontend varnish for req.http.host .*wikimedia.org [19:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:02] (PS1) EBernhardson: Keep daily graphite data for 5 years [puppet] - https://gerrit.wikimedia.org/r/266567 [19:49:20] we don't have any S3 buckets, do we? [19:49:58] bgerstle: What's your real question? :-) [19:50:09] if we have any S3 buckets i can use [19:50:12] marxarelli|brb: red alert is done. You can proceed with deploy [19:50:14] to upload build artifacts from travis [19:50:39] https://docs.travis-ci.com/user/uploading-artifacts/ [19:50:47] (CR) EBernhardson: "I was looking over the data stored in graphite to get some ideas for capacity planning for elasticsearch, and was a bit disappointed to on" [puppet] - https://gerrit.wikimedia.org/r/266567 (owner: EBernhardson) [19:50:59] bgerstle: why not use github's "releases" feature? [19:51:07] cscott: because this is specifically for test runs [19:51:17] cscott: i.e.
to see failed visual test images [19:51:20] RECOVERY - NTP on rutherfordium is OK: NTP OK: Offset -1.537799835e-05 secs [19:51:52] bgerstle: you could commit them to a repo and git push [19:56:19] cscott: others have already asked about uploading directly to GitHub, but apparently github upload API is deprecated [19:56:20] cscott: don't think i'd have push access from Travis VM [19:56:20] or Travis environment in general (not sure if OS X is actually a VM) [19:57:15] cscott: and unfortunately, these tests are only failing in travis :-( [19:57:18] well, if it helps https://github.com/cscott/node-icu-bidi/blob/master/scripts/publish.js is my script to push release binaries to github from travis. [19:57:19] i'm pretty sure you can push to git via github's api as well, you just need an OAuth token. which you'd store in a secure env variable in travis. [19:57:19] cscott: i see, you're using an access token (which i assume has that privilege) [19:57:19] (PS1) Ottomata: Make all kafka broker metrics prefixed with kafka.cluster.$cluster_name [puppet] - https://gerrit.wikimedia.org/r/266568 (https://phabricator.wikimedia.org/T121643) [19:57:19] yeah, or i can just use build artifacts that upload to S3 [19:57:19] (PS2) Ottomata: Make all kafka broker metrics prefixed with kafka.cluster.$cluster_name [puppet] - https://gerrit.wikimedia.org/r/266568 (https://phabricator.wikimedia.org/T121643) [19:57:19] bgerstle: yeah. it just seems a shame to pay for storage when there's so much free storage floating around. [19:57:20] but developer time probably costs more than S3 does [19:57:20] and people have already people cool tools that visualize the images in S3: https://github.com/ashfurrow/second_curtain [19:57:20] you could upload to commons. ;) [19:57:20] cscott: e.g.
getting this when a visual test fails would be _really_ nice https://eigen-ci.s3.amazonaws.com/snapshots/2014-08-04--15-47/index.html [19:57:21] just need an AWS bucket :-) [19:57:26] Leah: so whaddya say? :-) [19:59:00] I don't know if the Wikimedia Foundation has any S3 buckets currently. [19:59:09] I guess file a ticket in Phabricator and mark it operations? [20:00:32] Leah: sure, was hoping i could get a quick answer, but the absence of a "hell no" to S3 is enough to keep me going for now :-) [20:00:51] (PS3) Chad: Update debian package for gerrit [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 [20:01:10] bgerstle: I can't really tell if the Travis builds just need storage space to live on or if they're really S3-specific. [20:01:21] We have the former, of course. [20:01:49] The whole Travis/mobile mess won't really be any worse by incorporating Amazon hosting, I don't think. [20:01:59] !log proceeding with train deploy. wmf.11 to mw1017, then group0 [20:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:23] Leah: unfortunately it's S3 specific at the moment :-( [20:02:37] there's no mention of an adapter to different storage repos [20:02:39] tgr, bd808, anomie: ^ [20:02:55] All right. [20:03:09] Leah: i was specifically wondering about any existing S3 buckets [20:03:29] (since it's S3 specific) [20:03:43] I haven't heard of the Wikimedia Foundation having any S3 buckets. [20:03:47] Leah: sorry, have another meeting, bbl [20:03:53] marxarelli: cool beans. anomie and I are in a meeting but he can jump out to test things when you are ready [20:03:53] No worries, bye. [20:05:33] Leah: thanks!
[20:07:09] (PS3) Ottomata: Make all kafka broker metrics prefixed with kafka.cluster.$cluster_name [puppet] - https://gerrit.wikimedia.org/r/266568 (https://phabricator.wikimedia.org/T121643) [20:09:28] (CR) Ottomata: [C: 2] Make all kafka broker metrics prefixed with kafka.cluster.$cluster_name [puppet] - https://gerrit.wikimedia.org/r/266568 (https://phabricator.wikimedia.org/T121643) (owner: Ottomata) [20:10:11] RECOVERY - NTP on mendelevium is OK: NTP OK: Offset -0.002828121185 secs [20:11:07] (PS1) Ottomata: Remove extra group_prefix assignment [puppet] - https://gerrit.wikimedia.org/r/266571 [20:11:20] (CR) Ottomata: [C: 2 V: 2] Remove extra group_prefix assignment [puppet] - https://gerrit.wikimedia.org/r/266571 (owner: Ottomata) [20:14:57] !log running 'sync-common --verbose deployment.eqiad.wmnet' on mw1017 to sync wmf.11 for initial testing [20:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:11] RECOVERY - NTP on technetium is OK: NTP OK: Offset -0.001200199127 secs [20:18:02] !log locally modified wikiversions.php and wikiversions.json on mw1017 for testing [20:18:19] (CR) Mobrovac: [C: 1] Enable EventBus on remaining (applicable) wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/266564 (https://phabricator.wikimedia.org/T116786) (owner: Eevans) [20:18:26] (PS1) Ottomata: End kafka group_prefix propertly with . [puppet] - https://gerrit.wikimedia.org/r/266575 [20:18:47] (CR) Ottomata: [C: 2 V: 2] End kafka group_prefix propertly with . [puppet] - https://gerrit.wikimedia.org/r/266575 (owner: Ottomata) [20:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:17] er, actually, bd808, anomie|meeting is that the best course?
(locally modifying wikiversions on mw1017) [20:19:38] (PS4) Chad: Update debian package for gerrit [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 [20:20:50] marxarelli: yeah. Just changing wikiversions.php locally on mw1017 is all it takes [20:21:32] (CR) Aaron Schulz: [C: 1] Raise file upload limit to 2,5 GB [mediawiki-config] - https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: TheDJ) [20:22:22] marxarelli: https://www.mediawiki.org/wiki/Special:Version from mw1017 is still showing .10 for me [20:22:39] bd808: whoops. just did testwiki [20:22:55] sec [20:23:04] marxarelli: ah ok [20:24:35] bd808: k. all of group0 should be on wmf.11 (mw1017) now [20:28:31] marxarelli: *nod* [20:28:32] (PS3) Dzahn: bugzilla-static: ensure_resource to fix duplicates [puppet] - https://gerrit.wikimedia.org/r/266546 [20:28:32] (PS5) Chad: Update debian package for gerrit [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 [20:28:32] (CR) Dzahn: [C: 2] bugzilla-static: ensure_resource to fix duplicates [puppet] - https://gerrit.wikimedia.org/r/266546 (owner: Dzahn) [20:28:32] (PS1) Ottomata: Pass group_prefix to analytics-eqiad kafka jmxtrans [puppet] - https://gerrit.wikimedia.org/r/266578 [20:28:32] (PS2) Ottomata: Pass group_prefix to analytics-eqiad kafka jmxtrans [puppet] - https://gerrit.wikimedia.org/r/266578 [20:28:32] (CR) Ottomata: [C: 2 V: 2] Pass group_prefix to analytics-eqiad kafka jmxtrans [puppet] - https://gerrit.wikimedia.org/r/266578 (owner: Ottomata) [20:28:32] marxarelli: for really testing things we need to make all wikis on mw1017 to .11. anomie will be done in this meeting in a few minutes and can help [20:28:32] tgr: group0 via mw1017 has .11 again now [20:28:40] bd808: ah, ok.
will do [20:43:37] !log modified wikiversions.php locally on mw1017 to promote all wikis to wmf.11 for initial testing [20:43:37] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is inactive [20:43:38] (PS1) Dzahn: releases: use ensure_resource, avoid duplicate defs [puppet] - https://gerrit.wikimedia.org/r/266581 [20:43:38] !log stopping nfs on labstore1001 [20:43:38] (PS2) Dzahn: releases: use ensure_resource, avoid duplicate defs [puppet] - https://gerrit.wikimedia.org/r/266581 [20:43:38] marxarelli: mw1017 interactions are looking good to me. logout/login worked, edit worked [20:43:38] (CR) Dzahn: [C: 2] releases: use ensure_resource, avoid duplicate defs [puppet] - https://gerrit.wikimedia.org/r/266581 (owner: Dzahn) [20:43:39] login status after account creation seems unreliable [20:43:40] I'll try to come up with reproduction steps [20:43:40] bd808, tgr: k. will wait for y'all to give the green light before continuing with group0 promotion [20:43:42] tgr: would that be a group0 blocker or just a group1 blocker? [20:43:42] mostly meaning how long do we need to keep marxarelli hanging on the line [20:43:42] group1, I tested with loginwiki on mw1017 [20:43:43] !log starting nfsd on labstore1001 [20:48:54] bd808, tgr: k.
i'll proceed with group0 then [20:50:14] (PS3) Andrew Bogott: Remove puppet classes and files associated with /srv/mediawiki/private/WikitechPrivateLdapSettings.php [puppet] - https://gerrit.wikimedia.org/r/266452 (https://phabricator.wikimedia.org/T124732) [20:51:40] (CR) Andrew Bogott: [C: 2] Remove puppet classes and files associated with /srv/mediawiki/private/WikitechPrivateLdapSettings.php [puppet] - https://gerrit.wikimedia.org/r/266452 (https://phabricator.wikimedia.org/T124732) (owner: Andrew Bogott) [20:53:15] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review, Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#1967239 (Krenair) [20:53:58] (PS3) Andrew Bogott: Don't send puppet nags to the novaadmin user. [puppet] - https://gerrit.wikimedia.org/r/266192 (https://phabricator.wikimedia.org/T124516) [20:56:24] I'm hotwiring mw1017 to not throttle account creations from my ip [20:58:34] (CR) Andrew Bogott: [C: 2] Don't send puppet nags to the novaadmin user. [puppet] - https://gerrit.wikimedia.org/r/266192 (https://phabricator.wikimedia.org/T124516) (owner: Andrew Bogott) [20:58:34] !log drop labstore1001 nfs threads down to 192 [20:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:08] operations, Continuous-Integration-Infrastructure, Patch-For-Review, WorkType-NewFunctionality: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#1967267 (hashar) The job `operations-puppet-tox-pep8-jessie` is always triggered. It will... [20:59:23] tgr: when marxarelli syncs the wikiversions bump you'll have to set it back up probably [20:59:37] yeah [20:59:50] !log getting 'Lost parent, LightProcess exiting' when running sync-dir [20:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:23] marxarelli: from which servers?
[21:00:30] bd808: mira [21:00:47] whole msg is '[Tue Jan 26 20:59:06 2016] [hphp] [5613:7f79e535fd00:0:000001] [] Lost parent, LightProcess exiting' [21:00:58] o_O locally hhvm is puking and dying? [21:01:16] known issue, ignore it for now, it is not affecting sync [21:01:17] i can't tell if it's remote execution or not [21:01:34] well, it is [21:01:37] for all servers [21:02:15] jynus: rgr that. i'll resume it then [21:02:32] !log resuming sync-dir and ignoring error as a known issue [21:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:56] I can tell you because I personally tested it [21:07:23] hi again ops! a quick search on wikitech revealed "WebPageTest" which appears to use S3: https://wikitech.wikimedia.org/wiki/WebPageTest#Setup_S3 [21:07:29] * marxarelli is glad he upgraded his dsl service, helps with the error stream [21:08:06] bgerstle: and? [21:08:08] operations, Labs, Labs-Infrastructure, Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1967301 (Andrew) Open>Resolved [21:08:22] i'd like to talk to the people involved to see if i can use S3 too [21:08:48] sorry gotta go... meeting multitasking doesn't work [21:08:56] ping phedenskog or me sometime [21:08:59] operations, RESTBase, Patch-For-Review: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1967306 (Eevans) This didn't go out in today's Puppet SWAT, and has been rescheduled for Thursday. [21:09:04] ori will do, thanks! [21:09:50] marxarelli: you're working on the train deploy? [21:09:57] cscott: yep [21:10:24] marxarelli: could you ping me when it's done? i'd like to sneak in an OCG deploy afterwards, if there's time.
[21:10:36] cscott: sure thing [21:11:22] !log sync-dir php linting failed [21:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:49] ah, it was local b0rking [21:12:09] ok, that other error is related to the lint failure [21:15:04] bd808: so i have no idea if sync-dir was successful because of the hhvm `php -l` madness [21:15:32] marxarelli: if lint failed then it never got around to tyring to sync [21:16:17] well, i can hack scap to use php5 -l for now [21:16:34] that might be worth a shot [21:16:45] or manually lint and temporary remove check_valid_syntax [21:17:00] k. i'll try option 1 [21:18:59] alright. hack worked but the lint still failed. might be real [21:19:02] * marxarelli checks [21:21:22] !log lint error found when running sync-dir 'Errors parsing /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/Echo/includes/iterator/CallbackFilterIterator.php' [21:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:33] !log Fatal error: Cannot redeclare class CallbackFilterIterator in /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/Echo/includes/iterator/CallbackFilterIterator.php on line 24 [21:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:09] That's not good [21:24:12] bd808: who can i ping about that? ^ Roan? [21:24:21] Roans on holiday AFAIK [21:24:30] might only be a php 5.4 issue [21:24:35] marxarelli: legoktm? [21:24:39] 5.4? :P [21:24:46] 5.5 [21:24:47] * This class is implemented as part of SPL starting at PHP5.4. This [21:24:47] * re-implementation provides backwards compatibility to mediawiki [21:24:47] * running on PHP5.3. [21:24:56] huh [21:25:01] oh fun [21:25:02] i'm linting with 5.5 on mira [21:25:11] but CallbackFilterIterator was introduced in 5.4 [21:25:34] Shouldn't it be wrapped in an if !exists? 
[21:25:45] yeah, seems like [21:27:15] Wonder how well that works in an autoloader [21:27:27] strange that this didn't occur last week, but perhaps that's because the linter was running via hhvm [21:27:40] or php 5.3? [21:27:45] Did tin have hhvm? [21:27:56] whatever the debian default php is on tin [21:27:57] nope, tin was on 5.3 [21:27:59] *was* [21:28:02] which explains a lot [21:28:21] tin was 5.3 definitely [21:28:40] * marxarelli sighs [21:28:48] that is a reason we still have a CI job running Zend 5.3 (there are more reasons) [21:28:51] Needs a task filing at least [21:29:05] nope [21:29:13] we should just go 5.5 only! [21:29:25] so I am really wondering how the hell a Zend 5.4+ method has been introduced in the code base [21:29:47] it's a shim for poor old 5.3 [21:29:58] heh [21:30:05] autoloader never gets called for it, so no fatals [21:30:13] but linter just lints all .php [21:30:15] Considering, it's only tin on WMF servers... [21:30:22] i'll file a task unless someone else is already on it [21:30:29] Yeah, so an if !class_exists() should fix it? [21:30:36] new bug: nuke tin from the orbit [21:30:49] MaxSem: I think _joe_ is getting on with it [21:30:58] MaxSem: Then we have the community to deal with [21:33:16] tin is no longer the main deployment host [21:33:22] it is still receiving deployments though [21:33:34] we still need a Zend 5.3 for CI regardless, since we run tests for release branches [21:33:40] arguably we still need 5.3 support [21:33:57] Until tin is reinstalled... [21:34:35] then, if/when we bump core, we only need 5.3 for < 1.27 [21:35:06] oh [21:39:52] jynus, pinging again about https://phabricator.wikimedia.org/T124704 to see what is involved in getting this done.
[21:41:05] !log filed https://phabricator.wikimedia.org/T124828 for fatal in extensions/Echo [21:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:14] well done scap :) [21:45:37] legoktm, matt_flaschen: anyone around to take care of ^ ? [21:48:26] 6operations, 10Traffic, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1967490 (10RobH) 3NEW a:3BBlack [21:48:58] (03PS1) 10Ottomata: Fix alert for eventlogging raw - valid rate [puppet] - 10https://gerrit.wikimedia.org/r/266597 [21:49:02] bd808: fwiw, this is a long-standing bug that doesn't occur with hhvm -l [21:49:18] marxarelli: heh. their tests exclude that file from parallel-lint [21:49:22] (03CR) 10Ottomata: [C: 032 V: 032] Fix alert for eventlogging raw - valid rate [puppet] - 10https://gerrit.wikimedia.org/r/266597 (owner: 10Ottomata) [21:49:23] 6operations, 10Traffic, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1967498 (10RobH) I initially assigned this to @bblack, but it can be ac... 
[21:51:57] (03PS1) 10Ori.livneh: New set of speed experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266599 [21:52:57] (03CR) 10Ori.livneh: [C: 032] New set of speed experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266599 (owner: 10Ori.livneh) [21:53:29] (03Merged) 10jenkins-bot: New set of speed experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266599 (owner: 10Ori.livneh) [21:55:38] !log ori@mira Synchronized docroot and w: I9b054d847a: New set of speed experiments (duration: 01m 29s) [21:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:06] robh: Did you mean to put the date twice in the page title of https://wikitech.wikimedia.org/wiki/Incident_documentation/20160126-20160126-WikimediaDomainRedirection ? [22:01:59] obviously that was an oversight, I moved it [22:02:00] Leah: nope, fixed! [22:02:03] uhh [22:02:06] i moved it already... [22:02:11] uhhh [22:02:15] looks like we both moved it [22:02:21] we both did? whatevs as long as its right [22:02:21] heh [22:02:35] and it is [22:02:50] interesting, I assumed MW prevented that sort of weird conflict [22:02:50] ok [22:02:57] It's... supposed to. [22:03:24] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 46 below the confidence bounds [22:03:27] Thanks. I was looking at https://wikitech.wikimedia.org/wiki/Category:Incident_documentation and we actually do use date ranges sometimes, I guess. that's why I asked. [22:03:32] ! 
[22:03:41] i think that's just because i changed the metrics ^^ [22:03:50] although it did tell me it suppressed a redirect, which I don't remember pressing the button for, and it didn't log [22:04:00] It looks like OAuth is broken on test.wikipedia.org [22:04:03] anomaly will take a while to go away [22:04:15] tgr, anomie: ^ [22:04:19] i left a redirect in place [22:04:27] odd mw behavior [22:04:27] I believe testwiki went to wmf.11 again today [22:04:46] it's about to, after this linter business gets cleared up [22:04:53] oooh [22:05:00] * marxarelli is the king of slow trains [22:05:02] I have a cross-wiki notification [22:05:22] nice, forgot that was enabled [22:05:36] ragesoss: https://tools.wmflabs.org/oauth-hello-world/ works for me [22:06:00] tgr: that doesn't auth to test.wikipedia.org [22:06:20] tgr try https://dashboard-testing.wikiedu.org/ [22:06:38] When I click 'Allow' there, it redirects back to the same page (with all the params stripped) [22:06:53] that seems to work too [22:07:03] hm... [22:07:19] 6operations, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1967560 (10MZMcBride) >>! In T124804#1966836, @jcrespo wrote: > Followup will be on this ticket and... [22:07:47] ragesoss: did you use a newly created account? [22:07:53] tgr: no. [22:08:01] I'm just trying to log in with my usual account. [22:08:47] ragesoss: there was an outage for all of *.wikimedia.org. if there's a 301 in there, the Location: could be cached, maybe? [22:09:03] ragesoss: try in a clear browser session [22:09:16] it worked after I logged out and back in. [22:09:20] sorry for the false alarm. [22:09:21] thanks!
[22:09:43] ragesoss: thank you (things have been odd lately, so better safe than sorry ;) ) [22:10:48] 6operations, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1967564 (10RobH) 5Open>3Resolved a:3RobH resolving as I've sent the outage notification to the... [22:11:04] (03PS1) 10Aaron Schulz: Set $wgCentralAuthUseSlaves on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266603 [22:11:24] ok, everything is linting ok now. proceeding with the train [22:11:46] * bd808 wipes the lint off of marxarelli's back [22:12:13] 6operations, 10vm-requests, 5Patch-For-Review: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1967569 (10Dzahn) [22:12:18] bd808: is that a hairy man joke? ;P [22:12:53] ragesoss: was that with User:Ragesoss? [22:12:58] If it is then I'm throwing bricks from inside my house of unwanted hair [22:13:07] one hair can stop an entire train [22:13:16] tgr: It was not working with "User:Sage (Wiki Ed)" [22:13:37] it's a very little train [22:14:08] yay, we're syncing finally [22:14:23] jouncebot: next release [22:14:23] In 1 hour(s) and 45 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T0000) [22:15:40] i wanted: In 17 days ... 
Mediawiki 1.27 , heh [22:15:43] !log dduvall@mira Synchronized php-1.27.0-wmf.11: syncing wmf.11 backports of session fixes (duration: 03m 55s) [22:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:41] is there a way to estimate how much longer until somebody needs to upload new mediawiki files on releases.wikimedia.org [22:17:02] or any release files actually [22:17:19] !log dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.11 [22:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:17:42] whoops ... sec [22:17:50] mutante: ask ostriches and csteipp. They are the folks who cut new tarballs AFAIK [22:18:16] bd808: makes sense, thx [22:18:18] mutante: Hm why do you ask? [22:18:33] mutante: Sometime in the next 2 months we will, I'm pretty sure. [22:18:36] ostriches: i am moving the releases.wm site to a different place [22:19:02] on a virtual machine.. and we save one physical server [22:19:09] and get rid of ubuntu [22:19:17] mmk [22:20:11] (03PS1) 10Dduvall: Group0 to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266606 [22:21:11] oh boy, that's a big diff [22:21:22] i'll also copy your home dirs. do you do other stuff there (caesium) besides uploading? gpg ? [22:21:26] i guess hhvm on mira pretty prints json [22:21:37] marxarelli: yes! [22:22:27] (03CR) 10Dduvall: [C: 032] "Diff is larger than expected due to pretty printed JSON on mira." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266606 (owner: 10Dduvall) [22:22:52] (03Merged) 10jenkins-bot: Group0 to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266606 (owner: 10Dduvall) [22:22:57] I like it so much better! [22:23:07] * greg-g is kinda back in the realm of awareness [22:23:13] nice [22:24:07] and i just found python -m json.tool ... 
:) [22:24:19] (03CR) 10BryanDavis: "So glad to see a version built with a modern PHP runtime that knows how to pretty print JSON!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266606 (owner: 10Dduvall) [22:25:03] !log dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.11, for real this time [22:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:26:19] 6operations, 10vm-requests, 5Patch-For-Review: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1967740 (10Dzahn) @akosiaris It does mean that all shell users who are in "releasers-mediawiki" or "releasers-mobile" now get access to a machine wi... [22:26:41] bd808, anomie, tgr: group0 has been promoted finally! [22:26:57] oh [22:27:01] for some reason I thought that was done earlier [22:27:14] Krenair: we just did it on mw1017 before [22:27:19] ahhh [22:27:21] that explains it [22:27:33] which gets testwiki by accident mostly [22:27:40] thcipriani, ostriches: fyi, there's a local modification to scap on mira to make `php -l` function without spewing errors [22:27:41] yeah, I think I saw it on testwiki [22:27:49] bd808: the OAuth channel should be in logstash, right? [22:27:51] just changed php -l to php5 -l [22:27:55] even debug level events? [22:28:11] tgr: I don't think we send debug [22:29:09] not even when $wmgMonologChannels['OAuth'] === 'debug' ? [22:29:40] my first deploy week(s) have felt a bit like The Last Crusade [22:29:45] is there a local log or do I have to filter through fluorine? 
[22:30:12] "don't look now, Indy, Jehovah is spelt with an i" [22:31:16] tgr: logstash only gets debug when it is explicitly configured with "logstash=>debug" -- https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/logging.php#L148 [22:37:14] 6operations, 10Wikimedia-Apache-configuration, 10incident-20160126-WikimediaDomainRedirection, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1967786 (10greg) [22:37:26] (03PS1) 10Dzahn: releases: setup rsyncd to copy release files [puppet] - 10https://gerrit.wikimedia.org/r/266608 (https://phabricator.wikimedia.org/T124261) [22:37:33] 6operations, 10incident-20160126-WikimediaDomainRedirection, 7Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1967793 (10greg) [22:38:29] 6operations, 10Continuous-Integration-Config, 10incident-20160126-WikimediaDomainRedirection, 7Regression: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#1967805 (10greg) [22:38:50] 6operations, 10Traffic, 10incident-20160126-WikimediaDomainRedirection, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1967813 (10greg) [22:39:12] (03PS2) 10Dzahn: releases: setup rsyncd to copy release files [puppet] - 10https://gerrit.wikimedia.org/r/266608 (https://phabricator.wikimedia.org/T124261) [22:39:25] (03CR) 10Dzahn: [C: 032] releases: setup rsyncd to copy release files [puppet] - 10https://gerrit.wikimedia.org/r/266608 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [22:39:56] tgr: I think you have to dig in fluorine's logs.
Testwiki does log full debug there [22:41:04] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:44:13] (03PS1) 10Aaron Schulz: Enable deferred writes to codfw swift cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869) [22:44:53] icinga-wm: on tin? [22:47:37] if we're using mira, then yeah. mira's usually complaining a tad about it. [22:47:55] (during the time when you've fetched locally, but haven't sync'd yet so the other one hasn't caught up) [22:49:51] (03PS1) 10Dzahn: releases: add ferm rule to allow rsync to bromine [puppet] - 10https://gerrit.wikimedia.org/r/266613 (https://phabricator.wikimedia.org/T124261) [22:50:26] (03PS2) 10Dzahn: releases: add ferm rule to allow rsync to bromine [puppet] - 10https://gerrit.wikimedia.org/r/266613 (https://phabricator.wikimedia.org/T124261) [22:51:00] (03CR) 10Dzahn: [C: 032] releases: add ferm rule to allow rsync to bromine [puppet] - 10https://gerrit.wikimedia.org/r/266613 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [22:52:05] (03CR) 10QChris: [C: 04-2] "CR-2 since it seems the comment from 2016-01-24T21:18 is getting" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [22:56:16] 6operations, 10vm-requests, 5Patch-For-Review: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1967914 (10Dzahn) setup rsync, copying the release files over to bromine now ... running in screen .. [23:00:20] bd808: tgr anomie: good start on https://etherpad.wikimedia.org/p/SessionManagerRolloutFailure, can we get that cleaned up and put on wikitech, please? [23:00:46] greg-g: yeah. 
that's on my todo list [23:00:52] word, thank you [23:11:27] 6operations, 10netops: Peer with SFMIX at ULSFO with 200 Paul - https://phabricator.wikimedia.org/T124843#1967980 (10Reedy) 3NEW [23:16:41] 6operations, 10netops: Peer with SFMIX at ULSFO with 200 Paul - https://phabricator.wikimedia.org/T124843#1968012 (10Dzahn) {meme, src=votecat} let me know if you need smart hands at ulsfo for this [23:23:30] Krenair, ostriches: it's possible that https://phabricator.wikimedia.org/T124828 will need a backport to wmf.10 for the evening swat [23:23:50] s/possible/probable/ [23:25:02] lgtm [23:25:53] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1968040 (10TheDJ) >>! In T124804#1966782, @jcrespo... [23:28:02] (03PS1) 10Dzahn: releases: also rsync /home dirs with user tools [puppet] - 10https://gerrit.wikimedia.org/r/266616 (https://phabricator.wikimedia.org/T124261) [23:28:04] marxarelli, ok [23:28:24] (03CR) 10jenkins-bot: [V: 04-1] releases: also rsync /home dirs with user tools [puppet] - 10https://gerrit.wikimedia.org/r/266616 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [23:34:06] (03PS2) 10Dzahn: releases: also rsync /home dirs with user tools [puppet] - 10https://gerrit.wikimedia.org/r/266616 (https://phabricator.wikimedia.org/T124261) [23:38:24] (03CR) 10Dzahn: [C: 032] releases: also rsync /home dirs with user tools [puppet] - 10https://gerrit.wikimedia.org/r/266616 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [23:45:44] 6operations, 10netops: Peer with SFMIX at ULSFO in 200 Paul - https://phabricator.wikimedia.org/T124843#1968096 (10Reedy) [23:46:21] (03PS7) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [23:48:35] (03PS8) 10Andrew Bogott: Keystone: Adopt a 
multi-domain model with ldap users but mysql role assignment [puppet] - 10https://gerrit.wikimedia.org/r/244350 [23:51:21] (03PS1) 10Rush: nfsd: bump threads avail to 192 [puppet] - 10https://gerrit.wikimedia.org/r/266622 [23:51:36] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1968101 (10BBlack) 5Open>3Resolved Fixed up https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging [23:51:40] (03PS2) 10Rush: nfsd: bump threads avail to 192 [puppet] - 10https://gerrit.wikimedia.org/r/266622 [23:51:42] (03CR) 10jenkins-bot: [V: 04-1] nfsd: bump threads avail to 192 [puppet] - 10https://gerrit.wikimedia.org/r/266622 (owner: 10Rush) [23:51:53] (03PS2) 10EBernhardson: Put more like query load back on eqiad for codfw load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266559 [23:53:05] (03CR) 10Rush: [C: 032] nfsd: bump threads avail to 192 [puppet] - 10https://gerrit.wikimedia.org/r/266622 (owner: 10Rush) [23:55:12] 6operations, 10vm-requests, 5Patch-For-Review: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968108 (10Dzahn) @csteipp @demon This is the ticket re: moving the releases server. The purpose is to replace another Ubuntu system (caesium). The... [23:56:13] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [23:59:35] that recovery sounds good, but also something i never saw before, heh [23:59:57] new?