[00:00:02] (CR) Dzahn: [C: 2] "confirmed with racktables, ping/host" [dns] - https://gerrit.wikimedia.org/r/407454 (owner: Papaul)
[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T0000).
[00:00:04] mooeypoo, Amir1, eddiegp, and Jamesofur: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:09] (PS4) Dzahn: DNS: Add production DNS entry for db2093 [dns] - https://gerrit.wikimedia.org/r/407454 (owner: Papaul)
[00:00:18] o/
[00:00:22] my patch is not testable
[00:00:23] o/
[00:00:34] o/
[00:00:41] I'm here with Jamesofur
[00:00:42] \o
[00:00:46] o/
[00:02:22] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987795 (Dzahn) host 10.192.48.91 91.48.192.10.in-addr.arpa domain name pointer db2093.codfw.wmnet. host db2093.codfw.wmnet db2093.codfw.wmnet...
[00:05:35] (PS1) Dzahn: install_server: rename tendril2001 to db2093 [puppet] - https://gerrit.wikimedia.org/r/413068 (https://phabricator.wikimedia.org/T186123)
[00:05:45] who can swat today?
[00:06:36] (CR) Dzahn: [C: 2] install_server: rename tendril2001 to db2093 [puppet] - https://gerrit.wikimedia.org/r/413068 (https://phabricator.wikimedia.org/T186123) (owner: Dzahn)
[00:06:48] I can SWAT
[00:07:18] mooeypoo: looks like no_justification did your already, is that correct?
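The reverse lookup pasted at 00:02:22 follows the usual in-addr.arpa convention: the IPv4 octets are reversed and suffixed with `in-addr.arpa`. A minimal sketch of that derivation, assuming only Python's standard `ipaddress` module; the helper name `ptr_name` is hypothetical and not part of any tooling mentioned in the log:

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the reverse-DNS (PTR) owner name for an IP address string."""
    # reverse_pointer reverses the octets and appends the in-addr.arpa suffix
    # (or ip6.arpa for IPv6 addresses).
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    # Address taken from the log; the PTR record itself points at db2093.codfw.wmnet.
    print(ptr_name("10.192.48.91"))
```

This only computes the name that is queried; it does not perform a DNS lookup.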
[00:07:32] (PS3) Thcipriani: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:07:37] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:07:43] thcipriani, okay omg that explains so much
[00:07:49] :)
[00:07:55] thcipriani, etonkovidova and I are trying to test the bug and we can't find it
[00:07:57] rofl
[00:08:11] "HOW IS IT NOT BROKEN!" was heard in the office several times.
[00:08:16] Thanks ;) i guess it was done
[00:08:19] haha nice
[00:08:54] thcipriani: yes
[00:08:56] hmm... why they can get their fixes sooner? :P
[00:09:02]
[00:09:22] (Merged) jenkins-bot: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:09:32] (CR) jenkins-bot: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:09:42] x-kill looks like some nerve agent
[00:10:08] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987813 (Dzahn) @Papaul prod IP added, renamed in DHCP, partman doesn't have to be changed. you can now go ahead with the OS install
[00:10:35] Hauskatze: ssshhh don't give away the plans
[00:10:54] no_justification: as long as you exempt me from the slaughter...
[00:11:04] otherwise I'm calling the cops
[00:11:10] Amir1: x-kill patch is a thing you have to monitor, can't be tested, correct?
[00:11:10] your choice dear
[00:11:25] yup
[00:11:50] k, going live
[00:12:29] if it blows up, it will happen at least several days from now and I'm constantly monitoring everything, might turn it off for some wikis. built some metrics just to make sure
[00:14:05] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412664|Enable x-kill feature everywhere]] T186714 T184322 (duration: 01m 13s)
[00:14:10] ^ Amir1 live now
[00:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:22] T186714: enable x-kill feature on Commons - https://phabricator.wikimedia.org/T186714
[00:14:22] T184322: Enable fine grained lua tracking gradually in client wikis - https://phabricator.wikimedia.org/T184322
[00:14:41] Thanks
[00:14:46] thcipriani: BTW, jouncebot didn't seem to spot my backport. Did you? :-)
[00:15:14] James_F: https://gerrit.wikimedia.org/r/#/c/411298/ ? Going through the zuul tubes now.
[00:15:25] Ah, yes. Awesome. :-)
[00:15:27] (CR) Huji: [C: -1] Allow CheckUsers and Stewards to access private data from the AbuseLog (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:17:14] (CR) Jalexander: ">" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:17:38] (CR) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:17:51] (PS4) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357)
[00:18:30] heh Jamesofur you de+1'd it :P
[00:18:47] * Jamesofur rolls his eyes a bit
[00:18:56] (CR) Jalexander: [C: 1] Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:19:44] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:21:17] (Merged) jenkins-bot: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:21:21] oh, it's happening
[00:21:28] (CR) jenkins-bot: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:21:46] eddiegp: James_F both of your changes are live on mwdebug1002, check please
[00:22:15] ack
[00:22:55] thcipriani: Yeah, looks good to me.
[00:23:20] James_F: ok, making your change live
[00:24:41] thcipriani: Works.
[00:25:02] (PS1) Dzahn: rename tendril2001.mgmt to db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413070 (https://phabricator.wikimedia.org/T186123)
[00:25:05] So can be deployed as well.
[00:25:21] eddiegp: ok, will deploy after current sync is done
[00:26:19] !log thcipriani@tin Synchronized php-1.31.0-wmf.21/resources/src/mediawiki/mediawiki.ForeignStructuredUpload.js: SWAT: Follow-up I0bb4ed7f7: [[gerrit:411298|Use correct "this"]] T187523 (duration: 01m 13s)
[00:26:23] ^ James_F live
[00:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:33] T187523: Unable to upload images in VisualEditor in both Chrome and Firefox on beta and in production - https://phabricator.wikimedia.org/T187523
[00:26:46] (CR) Dzahn: [C: 2] rename tendril2001.mgmt to db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413070 (https://phabricator.wikimedia.org/T186123) (owner: Dzahn)
[00:28:13] thcipriani: Thank you.
[00:28:30] yw :)
[00:29:13] * eddiegp just realised I've tested a wmf.21 cherry-pick on testwiki, a group.0 wiki (running wmf.22 which already includes the fix)
[00:29:25] But tested it on dewiki now, which also worked :D
[00:29:37] !log thcipriani@tin Synchronized php-1.31.0-wmf.21/includes/page/WikiPage.php: SWAT: [[gerrit:413059|site_stats: Unbreak counting newly created pages]] (duration: 01m 12s)
[00:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:54] ^ eddiegp well it's live now :)
[00:30:07] thcipriani: Thanks!
[00:30:09] yw :)
[00:31:31] (PS1) Dzahn: fix subnet for db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413072
[00:31:43] mutante: hmm it now has a rewrite and an override. It looks like the rewrite command one line higher is now obsolete right?
[00:31:50] Jamesofur: Hauskatze your change is live on mwdebug1002, check please
[00:31:54] Or does it still do something?
[00:31:56] ack, checking
[00:31:59] I guess I can file a task for running the maintenance script then.
[00:32:19] Ah I guess it’s for root url only
[00:32:33] (CR) Dzahn: [C: 2] fix subnet for db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413072 (owner: Dzahn)
[00:32:50] Krinkle: The override is just that single URL and takes precedence over the rewrite, the rewrite is an wildcard.
[00:33:08] (CR) Huji: [C: 1] "Marco's explanation was satisfactory" [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:33:22] Yeah, right.
[00:33:40] thcipriani: wfm; waiting on Jamesofur
[00:33:42] Krinkle: i had the same thought at first.. then i read the comments above.. then i wasnt sure.. then i tested it and could confirm it works :p
[00:34:06] it's created this way by the .dat file
[00:34:25] and what eddie said :)
[00:35:57] tbh I found the dat file confusing too, I've fiddled with the options there and then compiled it a few times until the apache diff looked right to me :D
[00:36:04] it still does this:
[00:36:04] http://techblog.wikimedia.org/foo/
[00:36:05] * 301 Moved Permanently http://blog.wikimedia.org/foo/
[00:36:06] tested that
[00:38:56] (CR) Dzahn: [C: -2] "you can abandon this. it is now called db2093 instead and that is already covered by partman regex" [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[00:39:21] thcipriani: Hauskatze yup yup, sorry I had an emergency come up but back now
[00:39:26] (but WFM)
[00:39:53] no worries. OK, going live.
[00:43:31] !log thcipriani@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:413062|Allow CheckUsers and Stewards to access private data from the AbuseLog]] T160357 (duration: 01m 12s)
[00:43:43] ^ Jamesofur Hauskatze live now
[00:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:45] T160357: Allow those with CheckUser right to access AbuseLog private information on WMF projects - https://phabricator.wikimedia.org/T160357
[00:43:54] \o/
[00:44:34] thcipriani: looking good, thanks :)
[00:44:57] yw, glad to hear it :)
[00:50:09] (PS2) Chad: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369
[00:51:33] Operations, DBA, MediaWiki-General-or-Unknown, MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3987931 (greg) Hi, sorry, my bugmail backlog is woefully long right n...
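The rewrite/override exchange between 00:31:43 and 00:36:06 describes a precedence rule: a single-URL override is consulted first, and the wildcard host rewrite handles everything else. A rough Python sketch of that lookup order, under stated assumptions: the table names, the function `redirect_target`, and the override target `http://example.org/placeholder-target` are all hypothetical (the log never says where the override points), and the real rules live in the Apache config generated from the .dat file:

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical rule tables, for illustration only.
# Wildcard rewrite: every path on the old host moves to the new host.
WILDCARD_REWRITES = {"techblog.wikimedia.org": "blog.wikimedia.org"}
# Exact override: one single URL (here the root), checked before the wildcard.
# The target is a placeholder; the log does not state the real one.
EXACT_OVERRIDES = {
    ("techblog.wikimedia.org", "/"): "http://example.org/placeholder-target",
}


def redirect_target(url: str) -> Optional[str]:
    """Return the redirect target for url, or None if no rule matches."""
    parts = urlsplit(url)
    host = parts.hostname
    path = parts.path or "/"
    # 1. The single-URL override takes precedence.
    exact = EXACT_OVERRIDES.get((host, path))
    if exact is not None:
        return exact
    # 2. Otherwise fall back to the wildcard rewrite, keeping the path.
    new_host = WILDCARD_REWRITES.get(host)
    if new_host is not None:
        return f"{parts.scheme}://{new_host}{path}"
    return None


if __name__ == "__main__":
    print(redirect_target("http://techblog.wikimedia.org/foo/"))
```

With these tables, a non-root path such as `/foo/` still falls through to the wildcard, which matches the 301 tested at 00:36:05, so the rewrite is not obsolete.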
[00:51:51] (PS2) Dzahn: Gerrit: Tweak SSH timeout settings and such [puppet] - https://gerrit.wikimedia.org/r/411397 (owner: Chad)
[00:52:47] (CR) Dzahn: [C: 2] Gerrit: Tweak SSH timeout settings and such [puppet] - https://gerrit.wikimedia.org/r/411397 (owner: Chad)
[00:53:40] (PS3) Dzahn: Gerrit: Also set ldap read timeout [puppet] - https://gerrit.wikimedia.org/r/411394 (owner: Chad)
[00:54:10] (PS3) Papaul: Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123)
[00:54:21] (CR) Dzahn: [C: 2] Gerrit: Also set ldap read timeout [puppet] - https://gerrit.wikimedia.org/r/411394 (owner: Chad)
[00:56:28] !log gerrit2001 - restarted gerrit to test that gerrit:411397 and gerrit:411394 don't break anything - didn't touch cobalt right now to minimize affecting users and their logins
[00:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:23] (PS4) Dzahn: Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[00:59:22] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:59:32] (PS5) Dzahn: Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[01:00:25] (CR) Dzahn: [C: 2] Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[01:14:40] mutante: ty for the merges there
[01:15:07] no_justification: you're welcome
[01:15:50] mutante: I'd also like to revisit the key exchange algorithms & related settings, but low-prio
[01:16:12] (CR) Chad: [C: 2] Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369 (owner: Chad)
[01:16:15] no_justification: if we can be more strict. yes please
[01:16:32] I think we can. We already blacklisted 2 of them, but we could probably do better.
[01:16:47] yea, worth re-checking . *nod*
[01:17:28] (Merged) jenkins-bot: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369 (owner: Chad)
[01:17:39] (CR) jenkins-bot: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369 (owner: Chad)
[01:24:12] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:24:23] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:24:43] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:31:32] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987999 (Papaul)
[01:31:54] Ummmm...can't pull ops/mw-config to db*?
[01:32:42] !log demon@tin Synchronized docroot/: Swapping wikimedia.org docroot for symlink (duration: 01m 27s)
[01:32:42] demon@tin: Failed to log message to wiki. Somebody should check the error logs.
[01:32:48] dafuq?
[01:32:56] That. Failed. Bad.
[01:33:53] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:34:19] Operations, DBA, MediaWiki-General-or-Unknown, MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3988002 (EddieGP) >>! In T176754#3987931, @greg wrote: > Hi, sorry, m...
[01:34:43] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:36:03] !log demon@tin Synchronized docroot/: Swapping wikimedia.org docroot for symlink (second try, old WPFirefoxMobileOS cleanup was still needed) (duration: 01m 12s)
[01:36:05] Ok, so the DB servers do *not* like my merge. But why?
[01:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:18] I have a guess....
[01:39:02] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:39:48] Can someone have a look at labsdb1011:/usr/local/lib/mediawiki-config and tell me if it has docroot/wikimedia.org/WikipediaFirefoxMobileOS
[01:40:05] (if so, delete it, those should've been cleaned up before but deleting submodules is freaky magic)
[01:40:16] or any of the labsdb* ones that are failing?
[01:43:20] bd808: Pinging you cuz labsdb* are "yours" ^
[01:47:42] (PS1) Chad: Gerrit: Improve registration url [puppet] - https://gerrit.wikimedia.org/r/413079
[01:48:17] Meh, I'll revert for now :\
[01:48:49] (PS1) Chad: Revert "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/413080
[01:48:55] (CR) Chad: [V: 2 C: 2] Revert "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/413080 (owner: Chad)
[01:50:09] (CR) jenkins-bot: Revert "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/413080 (owner: Chad)
[01:51:09] !log demon@tin Synchronized docroot/: revert docroot improvements. some servers don't like improvements (duration: 01m 12s)
[01:51:19] * no_justification waits for labsdb* and friends to recover
[01:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:29] * no_justification files task to clean up old instances of WikipediaFirefoxMobileOS
[01:54:12] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:54:43] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:54:47] !log WikipediaMobileFirefoxOS submodule references caused labsdb* (and related) puppet failures. They should recover now (self reverted my docroot changes). Filed T187850
[01:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:01] T187850: Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850
[02:01:50] !log running `initSiteStats.php --update` for all wikis in small.dblist. T187845
[02:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:04] T187845: Run initSiteStats.php for all wikis - https://phabricator.wikimedia.org/T187845
[02:03:56] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:56] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:04:46] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:15:56] !log running `initSiteStats.php --update` for all wikis in medium.dblist. T187845
[02:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:10] T187845: Run initSiteStats.php for medium/large.dblist - https://phabricator.wikimedia.org/T187845
[02:21:11] no_justification: can you file a task to check that? I actually don’t have root there to clean anything up for $reasons
[02:21:17] I did
[02:21:28] T187850
[02:21:28] T187850: Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850
[02:21:31] Awesome
[02:21:41] (also some stuff in beta busted & recovered after I reverted)
[02:31:20] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.21) (duration: 06m 18s)
[02:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 746.87 seconds
[03:31:55] !log andrew@tin Started deploy [horizon/deploy@0e28f49]: updating branded graphics
[03:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:43] !log andrew@tin Finished deploy [horizon/deploy@0e28f49]: updating branded graphics (duration: 02m 49s)
[03:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:34] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3988175 (Papaul)
[04:05:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 235.94 seconds
[04:05:38] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3934359 (Papaul) a:Papaul>Marostegui @Marostegui it is all yours. Installation complete .
[04:10:52] !log andrew@tin Started deploy [horizon/deploy@0e7783d]: updating branded graphics slightly more
[04:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:37] !log andrew@tin Finished deploy [horizon/deploy@0e7783d]: updating branded graphics slightly more (duration: 02m 45s)
[04:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:06] (PS1) BryanDavis: labsdb: Remove obsolete mediawiki-config submodule [puppet] - https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850)
[05:20:23] (CR) BryanDavis: "It might be easier for a root to just rm these files manually on labsdb1009, labsdb1010, and labsdb1011." [puppet] - https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: BryanDavis)
[05:43:48] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:43:48] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:08] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:09] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:18] PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:28] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:28] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:45:58] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:46:08] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:46:58] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:02] (PS4) KartikMistry: Deploy Compact Language Links out of Beta on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677)
[05:47:08] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:18] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:28] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:48] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:11:59] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:12:08] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:12:18] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:12:28] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:12:48] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:13:48] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:13:50] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:14:09] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:14:09] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:14:18] RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:14:29] RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:14:29] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:15:58] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:16:11] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:23:51] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988250 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[06:26:29] (PS1) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089)
[06:27:57] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:29:35] (PS2) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089)
[06:32:48] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:32:56] (PS1) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:33:28] (CR) jerkins-bot: [V: -1] mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[06:34:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:36:22] (PS2) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:36:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105 for alter table (duration: 01m 17s)
[06:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:57] !log Deploy schema change on db1105:3312 - T187089 T185128 T153182
[06:37:04] (CR) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:11] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:37:11] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:37:11] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:43:28] (PS3) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:45:48] (PS4) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:47:29] (PS5) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:48:10] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988266 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` and were **ALL** successful.
[06:51:27] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/10061/" [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[06:51:29] (CR) Marostegui: [C: 2] mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[06:55:59] (PS1) Marostegui: install_server: Reimage db2037 as jessie instead [puppet] - https://gerrit.wikimedia.org/r/413106
[06:57:21] (CR) Marostegui: [C: 2] install_server: Reimage db2037 as jessie instead [puppet] - https://gerrit.wikimedia.org/r/413106 (owner: Marostegui)
[06:59:11] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988268 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[07:20:47] !log Stop Mariadb on db1108 for kernel upgrade
[07:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:12] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[07:23:31] ^ expected
[07:23:31] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[07:25:19] <_joe_> oh noes
[07:25:44] that doesn't page :)
[07:26:07] <_joe_> I find it amaizing that you're taking me seriously here :P
[07:26:14] xddddd
[07:29:30] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0
[07:30:19] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0
[07:36:49] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988286 (Marostegui) I am trying to check what's wrong with db2037, as it is showing: ``` [ 52.315934] blk_update_request: critical...
[07:37:16] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988287 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['db2037.c...
[07:37:44] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988288 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[07:43:34] Operations, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3988289 (elukey) >>! In T187805#3987587, @Dzahn wrote: > also, site.pp already looks like this, where everything is in a role except burrows being the oddball which should...
[07:45:46] (CR) Matthias Mullie: [C: -1] "-1 until deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie)
[07:52:11] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988292 (Marostegui) a:Marostegui>Papaul And ILO isn't working any more, so the PXE cannot be set. ``` root@neodymium:~# ipmito...
[07:53:13] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988294 (Marostegui) HW logs show nothing by the way
[08:01:49] (PS1) Elukey: role::prometheus::ops: correct Kafka Burrow exporter's port [puppet] - https://gerrit.wikimedia.org/r/413107 (https://phabricator.wikimedia.org/T180442)
[08:02:19] (CR) Elukey: [C: 2] role::prometheus::ops: correct Kafka Burrow exporter's port [puppet] - https://gerrit.wikimedia.org/r/413107 (https://phabricator.wikimedia.org/T180442) (owner: Elukey)
[08:11:07] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988323 (Marostegui) I have managed to get the system up after fixing a few i-nodes: ``` root@db2037:~# touch test root@db2037:~# ```...
[08:18:00] (03PS2) 10Gilles: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) [08:18:16] (03PS2) 10Gilles: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) [08:18:41] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988328 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['db2037.c... [08:19:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988329 (10Marostegui) I am unable to reimage the server due to the PXE thing I described at: T187722#3988292 The system looks fine, so f... [08:20:02] !log foreachwikiindblist "% private.dblist" extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --backend=local-multiwrite --private [08:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:07] (03PS1) 1020after4: Phabricator: restart apache every sunday night [puppet] - 10https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) [08:32:24] (03PS2) 10Filippo Giunchedi: Add all private wikis to swift::proxy::private_container_list [puppet] - 10https://gerrit.wikimedia.org/r/412980 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [08:32:31] (03PS1) 10Gilles: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) [08:33:51] (03CR) 10Filippo Giunchedi: [C: 032] Add all private wikis to swift::proxy::private_container_list [puppet] - 10https://gerrit.wikimedia.org/r/412980 (https://phabricator.wikimedia.org/T169144) 
(owner: 10Gilles) [08:35:30] !log roll-restart thumbor in codfw and eqiad to apply https://gerrit.wikimedia.org/r/c/412980 [08:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:31] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: /srv 283618 MB (3% inode=93%) [08:39:58] I am working on --^ [08:41:49] (03PS1) 10Muehlenhoff: Record extended MOU for flemmerich [puppet] - 10https://gerrit.wikimedia.org/r/413116 [08:45:16] (03CR) 10Muehlenhoff: [C: 032] Record extended MOU for flemmerich [puppet] - 10https://gerrit.wikimedia.org/r/413116 (owner: 10Muehlenhoff) [08:56:53] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3988353 (10elukey) Before proceeding, https://phabricator.wikimedia.org/T187022 needs to be closed. [08:57:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#3988354 (10fgiunchedi) [08:59:17] (03PS2) 10Muehlenhoff: Add repository component for tor on stretch [puppet] - 10https://gerrit.wikimedia.org/r/410910 [09:07:47] (03CR) 10Filippo Giunchedi: [C: 031] Add repository component for tor on stretch [puppet] - 10https://gerrit.wikimedia.org/r/410910 (owner: 10Muehlenhoff) [09:11:26] (03PS7) 10Filippo Giunchedi: prometheus: add check prometheus metric script [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) [09:11:35] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add check prometheus metric script [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [09:18:02] (03CR) 10Gehel: [C: 04-1] wdqs: allow configuration of kafka based updates (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [09:18:35] 
(03PS7) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [09:20:36] (03PS1) 10Marostegui: m5.hosts: Add db2037 [software] - 10https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722) [09:21:53] (03PS2) 10Marostegui: m5.hosts: Add db2037 [software] - 10https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722) [09:22:52] (03CR) 10Elukey: wdqs: allow configuration of kafka based updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [09:23:05] (03PS7) 10Filippo Giunchedi: cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [09:23:56] (03CR) 10Marostegui: [C: 032] m5.hosts: Add db2037 [software] - 10https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [09:23:58] !log installing sqlite security updates on stretch [09:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:03] (03Merged) 10jenkins-bot: m5.hosts: Add db2037 [software] - 10https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [09:32:21] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [09:37:55] (03PS1) 10Filippo Giunchedi: cassandra: create data directories only when needed [puppet] - 10https://gerrit.wikimedia.org/r/413121 (https://phabricator.wikimedia.org/T175284) [09:40:01] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: create data directories only when needed [puppet] - 10https://gerrit.wikimedia.org/r/413121 (https://phabricator.wikimedia.org/T175284) (owner: 10Filippo 
Giunchedi) [09:41:35] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988386 (10jcrespo) [09:44:57] apergos: hey do you know offhand if there's a handy archive of the mobile app files that used to be on dumps.wikimedia.org (or possibly download.wikimedia.org)? would've been in dirs like /android and /ios [09:45:14] i had someone looking for a copy of an old app for testing, just curious if we still have them floating around :D [09:45:24] I've probably cleaned those up but let me just double check [09:45:27] thx [09:45:44] i'm not sure if the one they're looking for would've even been there though (blackberry tablet one) [09:46:00] errr that sounds unlikely but let me see what's still around [09:46:04] if not i'll check my old backups at home when i get back :D [09:46:26] !log installing dbus updates from stretch point release [09:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:50] brion: give me a year please? :-D [09:46:54] hehe [09:46:58] no rush ;) [09:47:03] no, seriously [09:47:09] I've got a bunchof stuff from 2012 [09:47:11] oh for source? [09:47:13] yeah 2012 or 2013 [09:47:14] yeah! [09:47:33] jeebus it's 2018 now? ffs [09:47:58] lolol [09:48:14] i thought time ended in 2012. mayans and all [09:48:19] so give me an idea of some string in the name in the blackberry app [09:48:26] would end in ".bar" [09:48:33] (i assume it stood for Blackberry ARchive) [09:48:51] (03PS1) 10Muehlenhoff: Add library hint for dbus [puppet] - 10https://gerrit.wikimedia.org/r/413124 [09:49:09] lots of nice apk and ipa but no bar files [09:49:22] any idea of another directory you might have put them under? 
[09:49:46] if it's not under /blackberry or /playbook it's probably not there [09:50:38] ariel@dataset1001:/data/xmldatadumps/public/other$ ls -l PlayBook/ [09:50:38] total 1900 [09:50:38] -rw-r--r-- 1 dumpsgen wikidev 1943346 Dec 7 2012 Wikipedia-v1.3.3.bar [09:50:41] ta dah!! [09:50:43] aweeeeeesome [09:50:45] thanks :D [09:50:49] man we just never throw anything out do we [09:50:54] lol [09:50:58] so they can direct download that sucka [09:51:18] woot [09:51:20] thanks apergos :D [09:51:23] yw! [09:51:30] thanks for the trip down memory lane [09:51:31] i shoulda known the dir was in CamelCase ;) [09:51:39] hehe [09:52:10] now I just gotta know [09:52:24] is someone seriously still using an old blackberry that this is gonna run on? [09:52:49] this ancient 2012 edition missing all the latest greatest knowledge [09:53:36] there are some die-hard BlackBerry 10 users still ;) [09:53:43] but that version might not install on them [09:53:55] since it was for the slightly earlier tablet [09:54:00] we'll see ;) [09:54:03] (03CR) 10Muehlenhoff: [C: 032] Add library hint for dbus [puppet] - 10https://gerrit.wikimedia.org/r/413124 (owner: 10Muehlenhoff) [09:55:32] lolol well better get working on that, stat! [10:00:04] kart_: It is that lovely time of the day again! You are hereby commanded to deploy Compact Language Links: Dry run for preference migration script. (T187677). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1000). [10:00:04] aharoni: A patch you scheduled for Compact Language Links: Dry run for preference migration script. (T187677) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [10:00:04] T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677 [10:00:33] aharoni: around? 
[10:01:13] around [10:01:23] !log Running CLL preference migration script dry-run on terbium (T187677) [10:01:28] aharoni: cool. [10:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:45] good stash bot [10:02:14] (03PS2) 10Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) [10:02:16] (03PS2) 10Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - 10https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470) [10:02:40] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:02:48] aharoni: I'm logging output in file too. [10:03:28] (03CR) 10Phuedx: [C: 031] Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga) [10:04:26] kart_: good. [10:05:01] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 38.80, 35.29, 32.19 [10:06:37] hmm. ^^ hope not related to script running on terbium. [10:07:13] kart_: unlikely, it is a recurring problem [10:07:18] OK! 
[10:07:24] (03PS1) 10Marostegui: site.pp: Remove db2037 from s4 [puppet] - 10https://gerrit.wikimedia.org/r/413128 (https://phabricator.wikimedia.org/T187722) [10:07:31] jynus: ^ [10:09:55] !log installing openssh bugfix updates from jessie/stretch point releases [10:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:42] (03CR) 10Jcrespo: [C: 031] site.pp: Remove db2037 from s4 [puppet] - 10https://gerrit.wikimedia.org/r/413128 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [10:10:54] (03CR) 10Marostegui: [C: 032] site.pp: Remove db2037 from s4 [puppet] - 10https://gerrit.wikimedia.org/r/413128 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [10:13:23] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3988447 (10akosiaris) What amount of resources (CPU, mem, disk) are we talking about ? From https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-s... [10:13:55] (03PS3) 10Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) [10:13:57] (03PS3) 10Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - 10https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470) [10:14:23] (03CR) 10Marostegui: "This looks good to me. 
Only comment, the commit message says db1111 and db1111 :-)" [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:14:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:15:37] (03CR) 10Marostegui: "db1111 and db1112 have notifications disabled manually on icinga btw" [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:15:50] RECOVERY - Disk space on stat1005 is OK: DISK OK [10:17:17] (03CR) 10Jcrespo: "Yes, the roles should do that without hiera, I will do that on a separate commit. I would also put them on its own "shard" on monitoring. " [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:19:11] (03CR) 10Marostegui: [C: 031] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:19:35] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:21:24] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988479 (10Marostegui) m5 is now replicating on db2037. I will leave notifications disable till we do the tests with @Papaul when he has... 
[10:26:19] !log Remove db2030 from tendril - T187768 [10:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:35] T187768: Decommission db2030 - https://phabricator.wikimedia.org/T187768 [10:28:39] (03PS1) 10Marostegui: dbproxy1005: Replace db2030 with db2037 [puppet] - 10https://gerrit.wikimedia.org/r/413129 (https://phabricator.wikimedia.org/T187722) [10:30:34] (03PS1) 10Jcrespo: Change m5-slave to db2037 [dns] - 10https://gerrit.wikimedia.org/r/413130 (https://phabricator.wikimedia.org/T187722) [10:30:50] (03CR) 10Marostegui: [C: 031] Change m5-slave to db2037 [dns] - 10https://gerrit.wikimedia.org/r/413130 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:31:40] (03CR) 10Jcrespo: [C: 032] Change m5-slave to db2037 [dns] - 10https://gerrit.wikimedia.org/r/413130 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:32:03] (03CR) 10Jcrespo: [C: 031] dbproxy1005: Replace db2030 with db2037 [puppet] - 10https://gerrit.wikimedia.org/r/413129 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [10:32:10] (03CR) 10Marostegui: [C: 032] dbproxy1005: Replace db2030 with db2037 [puppet] - 10https://gerrit.wikimedia.org/r/413129 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [10:33:11] !log Reload haproxy on dbproxy1005 - T187722 [10:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:24] T187722: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722 [10:33:41] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 38.36, 34.73, 32.11 [10:34:00] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [10:38:10] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3988496 (10Marostegui) a:05Marostegui>03RobH Assigning it directly 
to @robh so he can finish up with this (please let me know if you prefer another way of letting... [10:40:52] !log Finished running CLL preference migration script dry-run on terbium (T187677) [10:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:08] T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677 [10:41:11] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [10:43:04] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988511 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on sarin.codfw.wmnet for hosts: ``` ['db2044.codfw.wmnet'] ``` The log can... [10:43:13] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:43:17] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3988524 (10elukey) >>! In T187805#3988447, @akosiaris wrote: > What amount of resources (CPU, mem, disk) are we talking about ? From https://grafana.wikimedia.org/dashboard/... 
[10:43:21] (03PS4) 10Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) [10:43:50] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.93, 33.06, 32.11 [10:43:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:44:30] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - 10https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [10:59:15] PROBLEM - MariaDB Slave Lag: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 50805.68 seconds [11:01:10] that is an expired downtime [11:01:20] from the alter table that got started yesterda [11:01:27] dbstore2001 will complain too I guess [11:01:32] I am going to downtime them now again [11:06:14] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:10:45] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988605 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2044.codfw.wmnet'] ``` and were **ALL** successful. [11:11:44] PROBLEM - puppet last run on lvs5002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:12:54] !log cloning db2011 to db2044 [11:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:07] I think m2 proxies will complain now [11:16:04] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [11:17:04] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [11:17:14] !log installing db5.3 security updates [11:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:54] (03PS4) 10Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - 10https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470) [11:23:26] (03CR) 10Jcrespo: [C: 031] "We finally have the replica back up and a recent backup." [puppet] - 10https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [11:24:29] (03CR) 10Marostegui: [C: 04-1] "Let's wait until we have confirmation from papaul that everything looks good with the new replica, on his end before. 
Should happen today " [puppet] - 10https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [11:32:19] (03CR) 10Jcrespo: [C: 032] dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - 10https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [11:32:21] (03CR) 10Ema: [C: 032] etcd: Introduce reconnectTimeout [debs/pybal] - 10https://gerrit.wikimedia.org/r/411264 (https://phabricator.wikimedia.org/T169765) (owner: 10Ema) [11:36:15] (03PS1) 10Ema: etcd: Introduce reconnectTimeout [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/413141 (https://phabricator.wikimedia.org/T169765) [11:37:55] (03CR) 10Ema: [C: 032] etcd: Introduce reconnectTimeout [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/413141 (https://phabricator.wikimedia.org/T169765) (owner: 10Ema) [11:40:00] (03PS1) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) [11:41:44] RECOVERY - puppet last run on lvs5002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:41:50] (03PS1) 10Elukey: profile::hadoop::master|worker: relax JVM heap size monitors [puppet] - 10https://gerrit.wikimedia.org/r/413143 [11:42:30] (03CR) 10Elukey: [C: 032] profile::hadoop::master|worker: relax JVM heap size monitors [puppet] - 10https://gerrit.wikimedia.org/r/413143 (owner: 10Elukey) [11:44:24] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 7.10, 11.90, 23.20 [11:44:25] (03PS1) 10Ema: 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] - 10https://gerrit.wikimedia.org/r/413145 (https://phabricator.wikimedia.org/T169765) [11:46:13] (03CR) 10Ema: [C: 032] 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] - 10https://gerrit.wikimedia.org/r/413145 (https://phabricator.wikimedia.org/T169765) (owner: 10Ema) 
[11:46:29] (03PS1) 10Ema: 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/413146 (https://phabricator.wikimedia.org/T169765) [11:46:38] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 34.70, 34.45, 32.06 [11:47:33] (03CR) 10Ema: [C: 032] 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/413146 (https://phabricator.wikimedia.org/T169765) (owner: 10Ema) [11:48:44] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) [11:49:32] Hi, is anybody with deploy privs available? I need to deploy last-minute throttle exception rule, see T187870. [11:49:32] T187870: Lift IP account limit on 2018-02-21 - https://phabricator.wikimedia.org/T187870 [11:50:19] zeljkof, hashar, twentyafterfour, no_justification ^^^ [11:51:24] Urbanecm: I'm around [11:51:40] looks like that can be deployed during eu swat, right? [11:52:00] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10065/" [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [11:52:06] the event is in 4 hours, swat is in 2 [11:52:10] Theoretically, but it is almost full and I won't be available (travelling). [11:52:41] just add it to the top of the calendar, I'll make sure to deploy it [11:52:51] Ok, that's great. Thank you! [11:53:09] Urbanecm: no problem! 
:) [11:58:28] (03PS1) 10Elukey: profile::hadoop::master|worker: tune again JVM Heap size monitors [puppet] - 10https://gerrit.wikimedia.org/r/413148 [12:01:15] (03CR) 10Elukey: [C: 032] profile::hadoop::master|worker: tune again JVM Heap size monitors [puppet] - 10https://gerrit.wikimedia.org/r/413148 (owner: 10Elukey) [12:01:19] !log pybal 1.14.4 uploaded to apt.w.o [12:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:08] !log lvs5003: pybal upgraded to 1.14.4 [12:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:23] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 35.55, 33.44, 32.05 [12:10:03] !log uploading retpoline-enabled gcc-4.9 to apt.wikimedia.org / jessie-wikimedia to be able to use it on boron for building Linux (trying to adapt our pbuilder setup to also include security.debian.org ran into a few proxy-related problems and this is really a rare corner case anyway) [12:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:29] (03Abandoned) 10Alexandros Kosiaris: Revert "Revert "Use security mirrors in cowbuilder apt config"" [puppet] - 10https://gerrit.wikimedia.org/r/412930 (owner: 10Alexandros Kosiaris) [12:14:25] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 35.54, 33.31, 32.25 [12:17:25] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 42.12, 41.50, 40.95 [12:21:25] !log restart hhvm on mw1227 - high load, hhvm-dump-debug in /home/elukey/hhvm.23382.bt [12:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:32] (03PS10) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [12:23:35] chasemp: ^^^ [12:24:29] (03PS11) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings 
for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [12:26:44] !log restart hhvm on mw1231 - high load, hhvm-dump-debug in /home/elukey/hhvm.6759.bt [12:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:50] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.05, 11.84, 23.69 [12:29:55] (03PS1) 10Marostegui: db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) [12:32:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [12:33:48] (03Merged) 10jenkins-bot: db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [12:35:09] (03CR) 10Volans: [C: 04-1] "Nice! I think there is a typo, see inline also for a couple of other comments." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [12:35:25] (03PS1) 10Vgutierrez: Improve reactor mocking [debs/pybal] - 10https://gerrit.wikimedia.org/r/413154 (https://phabricator.wikimedia.org/T169765) [12:35:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Clarify that db1067 is now s1 candidate master - T186321 (duration: 01m 13s) [12:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:47] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [12:36:20] PROBLEM - High CPU load on API appserver on mw1348 is CRITICAL: CRITICAL - load average: 112.25, 63.01, 50.56 [12:36:21] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 90.51, 42.44, 30.75 [12:36:21] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 94.34, 48.23, 37.01 [12:36:25] <_joe_> uh [12:36:41] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 118.93, 59.57, 44.09 [12:37:04] (03CR) 10jenkins-bot: db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [12:37:50] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:10] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:10] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:11] PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:11] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:30] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [12:38:30] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:30] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:30] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:33] !log restart hhvm on mw1234 - high load [12:38:41] PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:21] RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.031 second response time [12:39:40] RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 74194 bytes in 2.195 second response time [12:40:11] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:01] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time [12:41:10] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.98, 11.36, 23.16 [12:42:50] PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:43:40] <_joe_> !log rolling restart of hhvm on api servers under high load [12:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:40] RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 74068 bytes in 0.340 second response time [12:46:01] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.083 second response time [12:46:10] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.026 second response time [12:46:40] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 7.74, 14.05, 23.67 
[12:48:30] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time [12:48:30] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [12:48:30] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74068 bytes in 0.122 second response time [12:50:20] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.057 second response time [12:50:20] RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 74068 bytes in 0.352 second response time [12:50:50] RECOVERY - Nginx local proxy to apache on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.068 second response time [12:57:40] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 12.51, 17.44, 29.30 [12:58:35] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 11.18, 15.26, 29.46 [13:02:48] (03PS12) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [13:02:56] (03CR) 10Rush: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [13:09:35] RECOVERY - High CPU load on API appserver on mw1348 is OK: OK - load average: 13.29, 14.64, 35.30 [13:13:36] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 10.51, 12.24, 28.88 [13:14:56] 10Operations: Integrate stretch 9.3 point update - https://phabricator.wikimedia.org/T182655#3989003 (10MoritzMuehlenhoff) 05Open>03Resolved This is completely rolled out. 
[13:15:55] RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 12.85, 13.41, 35.99 [13:17:37] (03PS3) 10Muehlenhoff: Add repository component for tor on stretch [puppet] - 10https://gerrit.wikimedia.org/r/410910 [13:23:15] PROBLEM - Host boron is DOWN: PING CRITICAL - Packet loss = 18%, RTA = 6286.53 ms [13:23:15] PROBLEM - Host ununpentium is DOWN: PING CRITICAL - Packet loss = 100% [13:23:57] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:57] PROBLEM - Host meitnerium is DOWN: PING CRITICAL - Packet loss = 100% [13:23:57] PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:05] PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:15] PROBLEM - Host kubestagetcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:16] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [13:24:31] and again [13:24:41] 1007 this time [13:24:55] PROBLEM - SSH on ganeti1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:22] !log powercycling ganeti1007 [13:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:25] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 1794991 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:27:45] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - apiserver_request_latencies is 13238562 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:27:45] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 2545360 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:27:55] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 17285176 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:28:00] !log Reboot db2092 for a kernel upgrade [13:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:28:55] PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100% [13:29:45] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:29:56] RECOVERY - SSH on ganeti1007 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [13:30:05] RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:30:25] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 1996 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:30:45] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [13:30:45] RECOVERY - Host ununpentium is UP: PING OK - Packet loss = 0%, RTA = 7.83 ms [13:30:45] RECOVERY - Host boron is UP: PING OK - Packet loss = 0%, RTA = 7.20 ms [13:30:45] RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 9.04 ms [13:30:46] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 1657 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:30:55] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 7.45 ms [13:30:55] RECOVERY - Host etcd1004 is UP: PING OK - Packet loss = 0%, RTA = 8.70 ms [13:31:05] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 7.63 ms [13:31:05] RECOVERY - Host meitnerium is UP: PING OK - Packet loss = 0%, RTA = 9.41 ms [13:31:15] RECOVERY - Host kubestagetcd1003 is UP: PING OK - Packet loss = 0%, RTA = 5.57 ms [13:32:12] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3989044 (10MoritzMuehlenhoff) Happened again on ganeti1007, again with page allocation errors. 
[13:33:45] (03PS2) 10Jdrewniak: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) [13:34:15] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:46] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:15] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [13:36:45] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [13:37:36] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [13:37:41] haproxies should recover now [13:38:05] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [13:38:15] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:36] (03PS4) 10Addshore: WIP DNM role and profile for wdcm dashboards [puppet] - 10https://gerrit.wikimedia.org/r/387211 [13:39:14] (03CR) 10jerkins-bot: [V: 04-1] WIP DNM role and profile for wdcm dashboards [puppet] - 10https://gerrit.wikimedia.org/r/387211 (owner: 10Addshore) [13:39:46] RECOVERY - Request latencies on neon is OK: OK - apiserver_request_latencies is 34755 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:40:05] RECOVERY - etcd request latencies on neon is OK: OK - etcd_request_latencies is 24151 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:44:23] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db2044 (m2) [puppet] - 10https://gerrit.wikimedia.org/r/413159 (https://phabricator.wikimedia.org/T183470) [13:45:19] (03CR) 10Ema: [V: 032 C: 032] "Nice, that's a much better approach. Thanks!" 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/413154 (https://phabricator.wikimedia.org/T169765) (owner: 10Vgutierrez) [13:45:55] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:05] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 4871667 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:50:06] RECOVERY - etcd request latencies on neon is OK: OK - etcd_request_latencies is 4992 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:51:55] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [13:52:31] (03PS1) 10Rush: openstack: labs-instance-transport1-b-codfw designations [dns] - 10https://gerrit.wikimedia.org/r/413160 (https://phabricator.wikimedia.org/T184209) [13:53:11] 10Operations, 10Traffic, 10Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3989077 (10BBlack) 05Resolved>03Open TL;DR - Current solution is a fixed 2s load->use delay. I think we should probably do more here at this point, especially in light of eq... [13:53:55] 10Operations, 10Traffic: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3989081 (10BBlack) I was the one arguing for cron, on I think the faulty assumption that a VCL had to go `cold` before it could be `discard`ed. However, apparently that's not the case. You can `discard` a `warm` VC... [13:55:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) (owner: 10Urbanecm) [13:56:37] <_joe_> isn't swat in 10 minutes? 
[13:56:41] <_joe_> err, 5 [13:56:47] _joe_: yes, preparing [13:56:48] <_joe_> jouncebot: next [13:56:48] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1400) [13:56:56] <_joe_> ahah ok [13:57:05] did not start the deployment yet, reviewing and merging the first commit [13:57:07] <_joe_> I thought I missed it by 11 hour [13:57:16] <_joe_> 1* hour even [13:57:31] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) (owner: 10Urbanecm) [13:57:36] <_joe_> zeljkof: yeah I thought I went to lunch when I needed to be here for SWAT [13:57:41] (03CR) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) (owner: 10Urbanecm) [13:57:53] _joe_: all is good, you are back in time :) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1400). [14:00:04] Urbanecm, gilles, raynor, _joe_, and Jhs: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:12] o/ [14:00:14] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#3989098 (10BBlack) Should we do something here? The same crash can exist at remote DCs as well (the frontends would crash if all...
[14:00:23] I can SWAT today [14:00:26] i'm here :) [14:00:31] o/ [14:00:45] starting with 413147 by Urbanecm, it should take just a minute [14:02:05] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [14:02:26] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:413147|Add new throttle rule (T187870)]] (duration: 01m 13s) [14:02:29] gilles, raynor, _joe_, and Jhs: people that are able and willing to deploy their patches have priority - so, do you want to deploy your patch(es)? ;) [14:02:38] (if you can) [14:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:41] T187870: Lift IP account limit on 2018-02-21 - https://phabricator.wikimedia.org/T187870 [14:02:55] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 104.80, 53.15, 40.17 [14:03:26] zeljkof, you mean, do it ourselves? I can't [14:03:26] PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: CRITICAL - load average: 82.41, 47.14, 37.48 [14:03:39] I also can't [14:04:06] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 96.62, 53.94, 38.96 [14:04:10] zeljkof: I've never deployed a config patch before. 
I believe I have the rights for it, though [14:04:25] I'd be happy to learn [14:04:26] <_joe_> zeljkof: I do deploy config patches, but I'll let others do their own [14:04:34] <_joe_> and wait in queue [14:04:45] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:52] <_joe_> uhm wait a sec [14:04:56] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:58] _joe_: you can go first, if you want to, I need some time to review other patches [14:05:03] <_joe_> zeljkof: wait [14:05:06] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:06] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:07] <_joe_> see alerts [14:05:10] something going on? [14:05:15] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:18] <_joe_> seems so [14:05:19] hm [14:05:21] <_joe_> let me check [14:05:35] PROBLEM - HHVM rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:37] the first patch (throttle rule) deployed fine [14:05:59] gilles: this is all I know :) https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [14:06:04] <_joe_> zeljkof: I think the problem is exacerbated by deployments [14:06:12] <_joe_> so hang on a sec [14:06:16] _joe_: sure [14:06:55] <_joe_> !log restarting hhvm on misbehaving api appservers [14:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:32] zeljkof: ok, I'm willing to give those instructions a try, when you guys give me the green light [14:08:06] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:32] <_joe_> gilles: prepare yourself in the meantime [14:08:53] gilles: do you do other deployments?
this should be pretty much the same [14:09:07] I am, I'll +2 the config change since that's time-consuming [14:09:17] zeljkof: if mine goes well, sure [14:09:41] (03CR) 10Gilles: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:10:05] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time [14:10:05] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74115 bytes in 1.190 second response time [14:10:07] gilles: what I wanted to say, deploying config change should not be much different than deploying other things, if you have done that [14:10:20] I haven't done that either [14:10:36] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.068 second response time [14:10:38] gilles: I can deploy the rest of the changes, but feel free to take over swat if you want to practice :) [14:10:55] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [14:11:06] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time [14:11:10] (03Merged) 10jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:11:20] (03CR) 10jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:11:26] RECOVERY - HHVM rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 74114 bytes in 0.419 second response time [14:11:41] <_joe_> gilles: you may proceed [14:12:15] RECOVERY - High CPU load on API appserver on mw1339 is OK: OK - load average: 16.26, 31.03, 35.49 [14:12:37] (03PS5) 10Addshore: 
Role and profile for wdcm dashboards [puppet] - 10https://gerrit.wikimedia.org/r/387211 [14:13:13] (03CR) 10jerkins-bot: [V: 04-1] Role and profile for wdcm dashboards [puppet] - 10https://gerrit.wikimedia.org/r/387211 (owner: 10Addshore) [14:14:24] is it normal for "scap pull" on mwdebug1002 to hang after "14:12:28 Finished rsync common (duration: 00m 03s)"? [14:14:38] gilles: it takes a few minutes the first time you do it [14:14:42] ok [14:14:51] after that, it takes seconds [14:16:56] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [14:17:34] hmmm, seeing a fatal in my tests, but can't see how it's related to my config change [14:17:37] require_once(/srv/mediawiki/docroot/wikimedia.org/w/../multiversion/MWMultiVersion.php): File not found [14:17:50] (when browsing with the debug header) [14:18:04] the scap pull had a bunch of these: [14:18:04] cannot delete non-empty directory: php-1.31.0-wmf.17/cache/l10n [14:18:05] uh oh [14:18:18] is that expected? or something I should clean up? [14:18:23] l10n messages should be fine [14:18:55] cannot delete non-empty directory: php-1.31.0-wmf.17/cache/l10n [14:18:55] cannot delete non-empty directory: php-1.31.0-wmf.17/cache/l10n [14:18:55] cannot delete non-empty directory: php-1.31.0-wmf.17/cache [14:18:57] cannot delete non-empty directory: php-1.31.0-wmf.17/cache [14:18:59] cannot delete non-empty directory: php-1.31.0-wmf.17 [14:19:01] cannot delete non-empty directory: php-1.31.0-wmf.20/cache/l10n [14:19:03] cannot delete non-empty directory: php-1.31.0-wmf.20/cache/l10n [14:19:05] cannot delete non-empty directory: php-1.31.0-wmf.20/cache [14:19:06] looking at it at mwdebug1002 [14:19:09] that's the full list [14:19:19] <_joe_> gilles: mwdebug1001? 
[14:19:25] mwdebug1002 [14:19:26] that happens sometimes, I think it's a scap regression [14:19:43] that used to happen a while ago, was fixed, but apparently happens again [14:19:50] should not cause any problems [14:20:22] I can see the error in logstash [14:20:36] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [14:20:45] Fatal error: require_once(/srv/mediawiki/docroot/wikimedia.org/w/../multiversion/MWMultiVersion.php): File not found in /srv/mediawiki/docroot/wikimedia.org/w/thumb.php on line 2 [14:20:57] <_joe_> wth? [14:21:01] (03CR) 10MarkTraceur: [C: 031] Load 3D extension on other wikis, for display only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: 10Matthias Mullie) [14:21:25] <_joe_> oh, why thumb.php is called on an appserver? [14:21:33] gilles: I would suggest reverting [14:21:35] <_joe_> I would very much expect it not to work, indeed [14:21:46] _joe_: where is that? [14:22:06] <_joe_> "/srv/mediawiki/docroot/wikimedia.org/w/thumb.php" [14:22:11] <_joe_> in your paste [14:22:17] _joe_: I'm working on proxying that to thumbor [14:22:19] ah :) did not notice [14:22:22] that's the whole point of the change [14:22:45] <_joe_> gilles: yeah, I don't expect it would work if called on mwdebug1002, tbh, but maybe I'm wrong [14:22:52] <_joe_> gilles: which url are you testing? [14:22:59] https://commons.wikimedia.org/w/thumb.php?f=Broom%20icon.svg&w=501 [14:23:03] or https://commons.wikimedia.beta.wmflabs.org/w/thumb.php?f=Victoria_memorial.jpg&w=300 [14:23:12] <_joe_> I guess the former [14:23:17] I should have tried with mwdebug1002 before applying the change, for sure [14:23:18] <_joe_> the latter is on the beta cluster [14:23:32] <_joe_> gilles: you can try now, with apache-fast-test on tin [14:23:41] how? 
[14:24:03] <_joe_> echo 'https://commons.wikimedia.org/w/thumb.php?f=Broom%20icon.svg&w=501' > test_thumb [14:24:21] <_joe_> apache-fast-test test_thumb mwdebug1001 mwdebug1002 [14:24:45] <_joe_> ok, maybe try with a normal appserver [14:24:49] <_joe_> like mw1261 [14:25:07] <_joe_> you will see that the request returns 200 OK on mw1261, and 500 on mwdebug1002 [14:25:14] <_joe_> so yeah, I'll suggest reverting [14:25:26] well in practice those will still go to imagescalers [14:25:47] _joe_, gilles: we have other patches to deploy, I would suggest requesting a deploy window for this, so you have more time to debug [14:25:57] and revert now [14:26:04] <_joe_> +1 [14:26:07] sure [14:26:15] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:26:25] <_joe_> I'm not even sure that's the problem we're seeing, btw [14:26:25] gilles: sorry for such a bad first-deploy experience :) [14:26:46] _joe_: want to go next? [14:27:01] so, make a revert commit or just revert it on the deployment machine?
[14:27:06] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [14:27:09] <_joe_> gilles: revert commit [14:27:19] <_joe_> zeljkof: once this is sorted out, sure [14:27:19] gilles: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Reverting [14:27:33] the link is the official docs for urgent reverting [14:27:38] (03PS1) 10Gilles: Revert "Add Thumbor/Mediawiki shared secret" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413165 [14:27:52] (03CR) 10Gilles: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413165 (owner: 10Gilles) [14:27:53] since your commit is not urgent, you can revert it in gerrit, or using the docs, whichever you prefer [14:29:10] (03Merged) 10jenkins-bot: Revert "Add Thumbor/Mediawiki shared secret" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413165 (owner: 10Gilles) [14:29:17] so how do I schedule a deploy window dedicated to this? just make one up and add it to the deployment wiki page? [14:29:23] (03CR) 10jenkins-bot: Revert "Add Thumbor/Mediawiki shared secret" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413165 (owner: 10Gilles) [14:29:39] gilles: you should probably ping greg-g [14:29:48] (03PS1) 10Jon Harald Søby: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) [14:30:05] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 36.62, 34.57, 32.21 [14:30:12] <_joe_> sigh, again? [14:30:18] gilles: but yes, I think you just add a window to the deployments page when there is nothing else going on, just make sure greg-g knows about it [14:30:25] ok I'm done reverting this, you can proceed with the rest of the SWAT [14:30:26] <_joe_> gilles: are you deploying anything, by any chance?
[14:30:30] sorry for taking half of the window [14:30:38] gilles: no problem, happens :) [14:30:43] <_joe_> mwdebug1001 is still broken FWIW [14:30:49] _joe_: I'm not doing anything [14:30:55] <_joe_> oh ok [14:31:04] _joe_: for the record, I am also doing nothing [14:31:06] <_joe_> so I have to pull your change on tin, too? [14:31:06] I've just brought back mwdebug1002 to its original state [14:31:15] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:18] <_joe_> gilles: what about mwdebug1001? [14:31:24] _joe_: never touched it [14:31:36] RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 13.71, 15.64, 34.67 [14:31:36] <_joe_> uhm, something seems very wrong there [14:31:56] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 12.47, 13.70, 29.31 [14:33:28] <_joe_> zeljkof: I repeat, something is very wrong with mwdebug1001-1002 [14:33:35] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [14:34:22] _joe_: uh oh, what is wrong? I never use 1001, only 1002 [14:34:36] <_joe_> zeljkof: with both of them [14:34:46] <_joe_> the thumb.php issue seems specific to those servers [14:34:49] <_joe_> and not going away [14:35:00] <_joe_> anyhow, let me deploy my change [14:35:18] (03PS3) 10Giuseppe Lavagetto: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) [14:35:30] hm, if the revert is merged, you can run `scap pull` on both machines, that should bring them back to pre-swat state [14:35:40] <_joe_> zeljkof: yeah gonna do that right now [14:36:05] I have a second small patch I just committed, do you think we can add that one to today's SWAT? [14:36:16] even though mine is the 8th in the list :) [14:36:30] Jhs: if nothing goes wrong and we do not run out of time...
:) [14:36:40] I am fine with doing more patches [14:36:45] is it urgent? [14:36:58] not at all :) [14:37:00] I could swear that the thumb.php requests through mwdebug1002 worked one or two weeks ago, the last time I deployed a similar config change that enabled proxying of thumb.php requests to thumbor for all public wikis [14:37:08] just adding sitename for Burmese Wiktionary [14:37:12] but maybe I misremember [14:37:23] <_joe_> I'm not going to let us deploy anything until the mwdebug issue is understood and resolved [14:37:26] <_joe_> . [14:37:52] Jhs: add it to the calendar and if things go fine (but probably not, as it stands now) it will be deployed [14:37:57] <_joe_> we can't have a part of the deploy not work reliably [14:37:59] i'll add it to the list, and if we don't have time we don't have time. no problem [14:38:05] <_joe_> !log restarting hhvm on mwdebug1002 [14:38:06] _joe_: you're only seeing the fatals on the thumb.php requests, though, right? w/load.php requests work fine [14:38:12] <_joe_> gilles: still [14:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] yeah I know, it's really bizarre [14:38:25] _joe_: yes, let's make sure no "I broke wikipedia" t-shirts are sent today [14:38:40] <_joe_> restarting hhvm did the trick on mwdebug1002 [14:38:45] (03PS1) 10Jcrespo: mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) [14:38:45] <_joe_> it seems it was in a bad state [14:38:45] -_- [14:38:52] thanks, hhvm [14:39:15] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:39:33] gilles: I am looking for "cannot delete non-empty directory" scap bug in phab, will report it is happening again [14:39:48] <_joe_> ook, let me merge my change then [14:39:58] gilles: I did find this T162207 [14:39:59] T162207: When "scap pull" does a (slow) CDB rebuild, it should tell me that that's what it's doing - https://phabricator.wikimedia.org/T162207 [14:40:13] (03CR) 10Jcrespo: [C: 04-2] "This should be merged when all m1,2,5 hosts have been upgraded. Until then, we can upgrade them and ln tmp -> run, plus change the basedir" [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [14:40:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [14:40:47] <_joe_> zeljkof: deploying my change as soon as it's merged [14:41:11] _joe_: ok, I'm standing by to take over [14:41:37] _joe_: have you restarted hhvm on mwdebug1001 as well? 
[14:41:42] <_joe_> gilles: yes [14:41:56] (03Merged) 10jenkins-bot: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [14:42:01] <_joe_> !log restarted hhvm on mwdebug1001 too [14:42:09] (03CR) 10jenkins-bot: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [14:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:27] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989178 (10zeljkofilipin) 05Resolved>03Open This is happening again. ``` zfilipin@mw... [14:43:37] <_joe_> pulling on mwdebug1002 [14:44:02] FWIW, the non-empty directory thing is a result of a cleanup that did not delete the l10n cache. Since the rsync command excludes the files that remain, but includes a --delete, rsync reports that it is not deleting this directory because it has files in it that rsync is ignoring. 
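(Editor's note: the rsync explanation at 14:44:02 can be illustrated with a toy model. The sketch below is not scap's or rsync's actual code. It only mimics the rule described in the message: files matching an exclude pattern are hidden from the delete pass, so any directory that is gone from the sender but still holds such a file cannot be removed. The function names, the `*.cdb` pattern and the `l10n_cache-en.cdb` file name are assumptions made for illustration; the log does not show the real exclude list.)

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def parent_dirs(path):
    """Return every ancestor directory of a relative file path, deepest first."""
    return [str(p) for p in PurePosixPath(path).parents if str(p) != "."]


def undeletable_dirs(receiver_files, sender_files, exclude_patterns):
    """Toy model: directories a delete pass would report as non-empty.

    A directory is a deletion candidate when it no longer exists on the
    sender. It cannot be removed if it still contains a file matching an
    exclude pattern, because excluded files are skipped by the delete pass.
    """
    # Directories that still exist on the sender are never deletion candidates.
    sender_dirs = {d for f in sender_files for d in parent_dirs(f)}
    blocked = set()
    for f in receiver_files:
        name = PurePosixPath(f).name
        # Non-excluded files are either kept (present on sender) or deleted.
        if not any(fnmatch(name, pat) for pat in exclude_patterns):
            continue
        # Excluded files stay on disk and pin every stale ancestor directory.
        for d in parent_dirs(f):
            if d not in sender_dirs:
                blocked.add(d)
    # Deepest directories first, matching the order of the messages in the log.
    return sorted(blocked, key=lambda d: (-d.count("/"), d))


# Hypothetical trees: wmf.17 was cleaned up on the sender, but a leftover
# l10n cache file (assumed to be excluded from the sync) remains on the host.
receiver = [
    "php-1.31.0-wmf.17/cache/l10n/l10n_cache-en.cdb",
    "php-1.31.0-wmf.17/index.php",
    "php-1.31.0-wmf.21/index.php",
]
sender = ["php-1.31.0-wmf.21/index.php"]

for d in undeletable_dirs(receiver, sender, ["*.cdb"]):
    print("cannot delete non-empty directory:", d)
```

(With these inputs the model lists `cache/l10n`, then `cache`, then the version directory itself, which is the same nesting seen in the scap pull output pasted at 14:18. Removing the leftover cache files by hand, or syncing with excluded files also eligible for deletion, would let the directories go away; which of the two is appropriate here is for the scap maintainers to decide.)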
[14:44:27] gilles: T157030 [14:44:27] T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030 [14:44:30] will do the cleanup [14:44:53] thcipriani|afk: just noticed the command that needs to run at mwdebug1002 [14:45:12] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989184 (10zeljkofilipin) 05Open>03Resolved [14:45:19] 10Operations, 10DBA: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (10jcrespo) p:05Triage>03Normal [14:45:26] zeljkof: maybe hold off until no_j.ustification is around [14:45:43] there must be some change in how clean works because this was fixed [14:46:18] he'd know for sure what's happening. IIRC I saw some changes to clean happen recently that maybe don't work as expected. [14:46:45] thcipriani|afk: should I reopen the bug and assign to him, so we don't forget about it? 
[14:46:48] <_joe_> zeljkof: I'm deploying now [14:46:51] (03PS1) 10Gilles: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) [14:46:53] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db2044 (m2) [puppet] - 10https://gerrit.wikimedia.org/r/413159 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [14:47:01] zeljkof: sounds good [14:47:06] thcipriani|afk: will do [14:47:13] thcipriani|afk: and thanks :) [14:47:30] yw :) [14:47:36] * thcipriani|afk wonders away to make coffee [14:47:45] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:37] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989222 (10zeljkofilipin) 05Resolved>03Open a:05thcipriani>03demon @thcipriani sa... 
[14:48:54] (03PS3) 10Gilles: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) [14:49:03] (03PS2) 10Gilles: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) [14:49:14] !log oblivian@tin Synchronized wmf-config: Serve configuration to mwdebug hosts via etcd (duration: 01m 16s) [14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:26] (03PS1) 10Andrew Bogott: add lvs ip for labweb services [dns] - 10https://gerrit.wikimedia.org/r/413169 (https://phabricator.wikimedia.org/T187506) [14:50:13] <_joe_> zeljkof: you can proceed, [14:50:22] _joe_: thanks, will do [14:50:25] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 114.78, 57.24, 41.08 [14:50:34] <_joe_> and of [14:50:37] <_joe_> course [14:50:38] raynor, Jhs: with 10 minutes left, we have time for 1-2 patches, is any of your patches urgent? [14:51:25] not UBN, but both high priority here. both config changes [14:51:42] raynor: ok, starting with your patches then [14:51:45] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:56] Jhs: sorry, looks like we will not have the time for any of your patches today [14:52:05] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:06] (well, in this window) [14:52:26] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:43] raynor: anything special about your patches? can not be tested at mwdebug1002, needs a lot of time to test, needs a script to run...? 
[14:52:44] zeljkof, mine are not urgent, so i'm fine with waiting [14:52:46] <_joe_> !log rolling restart another 4 api appservers [14:52:59] (03PS2) 10Zfilipin: Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga) [14:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:11] nothing special, no scripts required [14:53:33] first one hides the feedback link, second one disables event logging [14:53:41] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga) [14:53:56] for popups extension, we have enough events for everyone [14:53:58] :) [14:54:23] I can check those on prod [14:55:13] (03Merged) 10jenkins-bot: Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga) [14:55:20] raynor: can both patches be tested at mwdebug1002? 
[14:55:25] yes [14:55:35] ok, the first one will be there in a minute [14:56:24] raynor: 412996 is at mwdebug1002, please test and let me know if I can deploy [14:56:25] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74114 bytes in 0.268 second response time [14:56:36] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time [14:56:45] <_joe_> let me know when SWAT is done, I have to play a bit with the mwdebug hosts [14:56:48] testing [14:56:55] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [14:56:58] _joe_: will do, in 5 minutes or so [14:57:08] (03CR) 10jenkins-bot: Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga) [14:57:11] <_joe_> zeljkof: take your time, I did slow you people down [14:57:50] (03CR) 10Zfilipin: [C: 031] Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak) [14:58:21] _joe_: and let me know when you're done with that, I think I'll just resume my stuff afterwards, with greg-g still asleep right now, haven't got a reply. 
my config changes are very low risk for general traffic, only focused on thumb.php (which is extremely low traffic) [14:58:54] <_joe_> gilles: yeah I am convinced we saw a red herring there [14:58:57] sec, I need to open page with ?debug=true, it takes sec [14:59:08] <_joe_> but you know, better safe than sorry [14:59:23] right, especially since the symptoms might come back when we scap pull again [14:59:35] RECOVERY - MariaDB Slave Lag: s2 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 14.77 seconds [14:59:36] <_joe_> gilles: I doubt that's the case, tbh [14:59:48] (03PS1) 10Ottomata: [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) [15:00:29] hmm, config change is there but I still see events o_O [15:00:37] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [15:00:48] raynor: better than seeing dead people :D [15:01:35] raynor: do you need more time, or should we revert? [15:01:56] give me a minute, I'm checking the code [15:02:10] (03PS2) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) [15:02:15] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 8.35, 14.23, 23.50 [15:02:17] ok, nevermind [15:02:25] if debug=true we always send the event.
stupid me [15:02:35] zeljkof, it's ok, you can push to prod [15:02:35] (03PS1) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) [15:02:42] raynor: deploying [15:02:47] (03CR) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [15:02:58] amazing, thx. The second patch is much easier to test [15:03:07] (03PS2) 10Ottomata: [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) [15:03:18] (03PS2) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) [15:03:49] (03PS3) 10Zfilipin: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak) [15:03:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [15:03:59] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412996|Disable Page Previews EventLogging instrumentation (T185973)]] (duration: 01m 13s) [15:04:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak) [15:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:11] T185973: [Config] Disable Page Previews EventLogging instrumentation - https://phabricator.wikimedia.org/T185973 [15:04:15] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is 
currently enabled, last run 13 seconds ago with 0 failures [15:04:19] raynor: 412996 is deployed, please test [15:04:35] raynor: will let you know when 412983 is at mwdebug1002 [15:04:59] (03PS3) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) [15:05:05] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 64.60, 39.99, 32.81 [15:05:09] (03PS4) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) [15:05:31] (03Merged) 10jenkins-bot: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak) [15:05:33] zeljkof - tested, works [15:05:38] tested on production [15:05:48] great [15:05:52] (03PS3) 10Ottomata: [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) [15:06:14] raynor: the second patch is at mwdebug1002 [15:06:52] zeljkof: tested on mwdebug1002 - works [15:07:01] raynor: deploying [15:07:03] please deploy [15:07:09] (03CR) 10jenkins-bot: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak) [15:08:10] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412983|Removing Mobile beta feedback link (T187712)]] (duration: 01m 12s) [15:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:25] T187712: [Config] Remove feedback link from settings page - https://phabricator.wikimedia.org/T187712 [15:08:28] raynor: deployed, please check and thanks for deploying with #releng ;) [15:08:34] !log EU SWAT finished [15:08:42] _joe_: I'm done [15:08:47] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:51] zeljkof: tested on production. works [15:09:05] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 17.07, 27.81, 29.58 [15:09:07] Jhs: sorry, please reschedule your patches for another swat [15:09:13] we ran out of time today [15:09:15] thanks for deploying everything [15:09:28] raynor: no problemo, that is what I do :D [15:09:36] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog: Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989296 (10Niedzielski) [15:10:04] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog: Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989312 (10Niedzielski) [15:10:08] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989313 (10akosiaris) >>! In T187805#3988524, @elukey wrote: >>>! In T187805#3988447, @akosiaris wrote: >> What amount of resources (CPU, mem, disk) are we talking about ? F... [15:10:29] <_joe_> zeljkof: thanks [15:10:33] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989314 (10akosiaris) Anyway, seems to me fine to go with the VM approach. Wanna file the task under #vm-requests ? 
[15:11:27] 10Operations, 10vm-requests, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989326 (10elukey) [15:11:29] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog: Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989296 (10Niedzielski) [15:11:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989328 (10Cmjohnson) [15:11:56] 10Operations, 10vm-requests, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3986390 (10elukey) >>! In T187805#3989314, @akosiaris wrote: > Anyway, seems to me fine to go with the VM approach. Wanna file the task under #vm-requests ?... [15:12:05] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3925325 (10Cmjohnson) @jcrespo and @Marostegui This is all yours. Please resolve once verified. Thanks! [15:13:14] (03CR) 10Giuseppe Lavagetto: "Overall correct, see the two minor comments." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:14:02] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989347 (10Niedzielski) [15:14:11] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Readers-Web-Kanbanana-Board, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3989349 (10Niedzielski) [15:15:35] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 9.39, 12.50, 29.90 [15:17:03] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991) [15:17:08] 10Operations, 10vm-requests, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989356 (10akosiaris) >>! In T187805#3989326, @elukey wrote: >>>! In T187805#3989314, @akosiaris wrote: >> Anyway, seems to me fine to go with the VM approa... 
[15:17:22] <_joe_> gilles: you can deploy if you want, I'll get a coffee and continue with tests afterwards [15:17:34] (03PS1) 10Jcrespo: mariadb: set db2011 (old codfw:s2) as spare [puppet] - 10https://gerrit.wikimedia.org/r/413174 (https://phabricator.wikimedia.org/T187886) [15:17:34] _joe_: alright, I'm starting [15:17:43] (03CR) 10Gilles: [C: 032] Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:17:45] (03PS12) 10Chico Venancio: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) [15:19:18] (03Merged) 10jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:19:23] !log Thumbor private wiki support deployment [15:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:17] (03CR) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:20:21] (03CR) 10Jcrespo: [C: 032] mariadb: set db2011 (old codfw:s2) as spare [puppet] - 10https://gerrit.wikimedia.org/r/413174 (https://phabricator.wikimedia.org/T187886) (owner: 10Jcrespo) [15:22:53] Fucking messages directory [15:23:00] I hates it [15:23:09] Has nothing to do with scap clean [15:23:23] We've had that error for *years* [15:23:37] Blame l18nupdate [15:23:44] Fix is easy. 
[15:23:52] Ssh everywhere it's complaining and delete [15:24:27] !log gilles@tin Synchronized wmf-config/filebackend.php: Thumbor private wiki support deployment: [[gerrit:413168|Add Thumbor/Mediawiki shared secret (T169144)]] (duration: 01m 12s) [15:24:32] !log reboot labtestservices2002 [15:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:41] T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144 [15:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:21] (03PS5) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) [15:25:42] (03PS6) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) [15:26:37] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989410 (10demon) Has nothing to do with scap clean. We've been fighting this same error... 
[15:27:05] !log gilles@tin Synchronized private/PrivateSettings.php.example: Thumbor private wiki support deployment: [[gerrit:413168|Add Thumbor/Mediawiki shared secret (T169144)]] (duration: 01m 11s) [15:27:06] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [15:27:16] (03CR) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:32] (03CR) 10Andrew Bogott: [C: 032] add lvs ip for labweb services [dns] - 10https://gerrit.wikimedia.org/r/413169 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:27:46] 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3989413 (10Marostegui) Can this be resolved then? [15:28:19] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3989416 (10jcrespo) I would say this is resolved- only pending the actual decommission (tracked on separate tickets), and the setup of extra servers fo... [15:28:58] (03CR) 10Andrew Bogott: [C: 032] labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:29:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3989417 (10Marostegui) [15:29:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3934359 (10Marostegui) 05Open>03Resolved Thanks @papaul! This looks good. 
We can continue the setup at T184704 [15:29:51] (03CR) 10jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:29:54] (03CR) 10Gilles: [C: 032] Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:29:59] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:30:05] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991) [15:30:46] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3989424 (10jcrespo) 05Open>03Resolved a:03jcrespo [15:31:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989429 (10Marostegui) 05Open>03Resolved Thanks @Cmjohnson the host looks good. 
We can continue the service setup at T184704 [15:31:30] (03Merged) 10jenkins-bot: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:31:33] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989434 (10Marostegui) [15:32:16] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989437 (10thcipriani) >>! In T157030#3989410, @demon wrote: > Has nothing to do with sca... [15:32:32] (03PS8) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) [15:32:33] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3989439 (10jcrespo) [15:32:50] (03PS9) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) [15:33:48] (03CR) 10Andrew Bogott: [C: 032] labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:33:56] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo) [15:34:29] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo) [15:34:38] (03PS1) 10Ottomata: Qualify erb template variables in puppet self.conf.erb [puppet] - 
10https://gerrit.wikimedia.org/r/413175 [15:34:58] andrewbogott: ^ look ok to you? [15:35:07] (03CR) 10jerkins-bot: [V: 04-1] Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175 (owner: 10Ottomata) [15:35:18] not really sure how to test other than to merge and try [15:35:24] 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3989446 (10Papaul) Yes we can resolve this [15:35:45] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989447 (10demon) Sorta. From what I can tell, `rsync` won't delete a destination directo... [15:35:51] (03CR) 10jenkins-bot: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:35:52] ottomata: https://gerrit.wikimedia.org/r/#/c/411616/ [15:35:54] 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for db2093 - https://phabricator.wikimedia.org/T186172#3989448 (10Papaul) [15:36:03] (03PS2) 10Ottomata: Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175 [15:36:08] that whole module is unused, I'm going to remove it, possibly today [15:36:09] sorry :) [15:36:13] oh ha! [15:36:28] but, then how am I getting it in cloud? am i applying the wrong class? [15:36:30] looking at docs... 
[15:36:42] removal is blocked by https://phabricator.wikimedia.org/T187622 [15:36:55] ottomata: you should be using somethingsomething:standalone [15:37:03] ah [15:37:05] k [15:37:12] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989462 (10demon) Fwiw, this works just fine: SSH_AUTH_SOCK=/run/keyholder/proxy.sock... [15:37:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3989465 (10Marostegui) [15:37:41] 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for db2093 - https://phabricator.wikimedia.org/T186172#3989463 (10Marostegui) 05Open>03Resolved Thanks! [15:37:44] !log gilles@tin Synchronized wmf-config/filebackend.php: Thumbor private wiki support deployment: [[gerrit:413168|Serve officewiki thumbnails with Thumbor (T169144)]] (duration: 01m 11s) [15:37:55] thanks andrewbogott [15:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:56] T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144 [15:38:05] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989467 (10demon) In the old method, we just did ^^^ and never had a "partial" cleanup li... [15:38:23] ottomata: sorry that that old class is still there as a trap. 
I've been wanting to delete it for a year but only just chased the last user off of it on Saturday [15:38:52] (03Abandoned) 10Ottomata: Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175 (owner: 10Ottomata) [15:39:28] (03CR) 10Gilles: [C: 032] Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:39:37] (03CR) 10jerkins-bot: [V: 04-1] Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:39:52] (03PS3) 10Gilles: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) [15:40:16] (03CR) 10Gilles: [C: 032] Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:41:32] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [15:41:43] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [15:41:46] (03Merged) 10jenkins-bot: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:41:56] (03CR) 10jenkins-bot: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:42:40] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989479 (10demon) I think there's two actionables here! # Make sure we delete these direc... [15:42:54] andrewbogott: fyi,i just applied puppetmaster::standalone role [15:42:56] to a new node [15:43:02] i think a couple of runs will get it [15:43:06] ottomata: cool, I think you'll like it better [15:43:09] but first run fails to install puppet-master [15:43:11] Job for puppet-master.service failed. See 'systemctl status puppet-master.service' and 'journalctl -xn' for details. [15:43:11] invoke-rc.d: initscript puppet-master, action "start" failed. [15:43:11] dpkg: error processing package puppet-master (--configure): [15:43:11] subprocess installed post-installation script returned error exit status 1 [15:43:20] note that that role sets up a puppetmaster but doesn't point anything to use it [15:43:24] also, no idea if it works anywhere but jessie [15:43:25] Errors were encountered while processing: [15:43:25] puppet-master [15:43:27] its jessie [15:43:59] andrewbogott: will it use itself as puppetmaster? 
[15:44:08] no [15:44:11] oh [15:44:11] hmm [15:44:14] not unless you set it as its own puppetmaster [15:44:21] that's all in the docs you're looking at I think :) [15:44:41] k reading... [15:44:43] !log pruned old 1.29.x and 1.30.x versions that somehow stuck around. Also 1.31.0-wmf.* cache/ directories for unused branches. T157030 [15:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:57] T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030 [15:45:05] ah right.. [15:45:27] (03CR) 10Alexandros Kosiaris: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/411616 (https://phabricator.wikimedia.org/T182810) (owner: 10Andrew Bogott) [15:46:09] (03PS1) 10Andrew Bogott: horizon: add a missing arg [puppet] - 10https://gerrit.wikimedia.org/r/413178 (https://phabricator.wikimedia.org/T187506) [15:46:32] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989505 (10demon) I'm curious if `scap clean` is the wrong approach. A daily (or heck, we... [15:46:53] ottomata: sorry, didn't mean to rtfm you — I think it's straightforward, though. 
You make a puppetmaster, and then later you decide who uses that puppetmaster (which may or may not be the puppetmaster host itself) [15:47:09] I usually find it much less disorienting to have the puppetmaster and the client be different VMs [15:47:23] (03CR) 10Andrew Bogott: [C: 032] horizon: add a missing arg [puppet] - 10https://gerrit.wikimedia.org/r/413178 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [15:49:42] PROBLEM - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.40 and port 80: Connection refused [15:49:46] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [15:49:55] <_joe_> andrewbogott: ^^ [15:49:59] paged [15:50:01] <_joe_> did you restart pybal? [15:50:04] yeah, that is actually nice andrewbogott [15:50:11] nice that the standalone pm does not have to pm itself [15:50:19] <_joe_> pybal doesn't pick up changes unless you restart it [15:50:20] great [15:50:21] also, i've never tried this git remote push thing [15:50:22] looks fancy [15:50:32] i always do rsync to /var/lib/git ... [15:50:37] !log gilles@tin Synchronized wmf-config/filebackend.php: Thumbor private wiki support deployment: [[gerrit:413115|Serve private wiki thumbnails with Thumbor (T169144)]] (duration: 01m 12s) [15:50:39] _joe_: I didn't do anything that a plain puppet merge didn't do [15:50:40] will try next time i need a longer lived pm [15:50:42] <_joe_> ottomata: we just got paged, and andrew needs to be the one looking into it :P [15:50:45] is there a by-hand step I'm missing?
[15:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:51] T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144 [15:51:15] <_joe_> andrewbogott: yes, once puppet has run on the load-balancers, you need to manually restart pybal where relevant [15:51:24] <_joe_> in your case, the eqiad low-traffic pair [15:51:31] crap, ok. [15:51:32] <_joe_> so first lvs1006, then 1003 [15:52:11] lvs1006.eqiad.wmnet? I can't connect for some reason [15:52:23] <_joe_> .wm.org [15:52:29] <_joe_> I'm on it andrewbogott [15:52:36] thanks [15:52:58] <_joe_> so puppet still didn't run on lvs1006 [15:53:09] <_joe_> but it did run on einsteinium I guess [15:53:09] meanwhile I should really remove that 'critical' flag, doing now [15:53:10] _joe_: I'm done with my deployment [15:53:18] <_joe_> gilles: ack [15:54:41] <_joe_> andrewbogott: so once you've restarted pybal, you can check your pool is there with [15:54:45] (03PS1) 10Andrew Bogott: labweb lvs: mark critical: false [puppet] - 10https://gerrit.wikimedia.org/r/413179 [15:54:46] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [15:54:46] <_joe_> curl localhost:9090/pools | grep labweb [15:54:55] (03PS2) 10Andrew Bogott: labweb lvs: mark critical: false [puppet] - 10https://gerrit.wikimedia.org/r/413179 [15:55:07] (03PS1) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) [15:55:19] <_joe_> but curl localhost:9090/pools/labweb_80 [15:55:22] <_joe_> seems empty [15:55:23] <_joe_> uhm [15:55:44] (03CR) 10Andrew Bogott: [C: 032] labweb lvs: mark critical: false [puppet] - 10https://gerrit.wikimedia.org/r/413179 (owner: 10Andrew Bogott) [15:57:15] <_joe_> andrewbogott: how did you merge your puppet change? 
[15:57:19] <_joe_> exact command please [15:57:22] <_joe_> :) [15:57:29] <_joe_> the one where you added labweb [15:57:43] 'sudo puppet merge' on puppetmaster1001.wikmedia.org [15:57:46] <_joe_> ah [15:57:52] (03PS2) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) [15:57:58] <_joe_> and you didn't notice errors from conftool-merge? [15:58:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/labweb on labpuppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/labweb [15:58:14] <_joe_> sudo -i puppet merge when you want to merge changes there [15:58:20] <_joe_> anyways [15:58:24] <_joe_> I'm fixing that too [15:58:25] I didn't, but that doesn't mean there weren't any... [15:58:30] thanks, what did I miss? [15:58:43] <_joe_> lemme see, are you about to merge another change? [15:58:56] Just did (critical: false) but I'm clear now [15:59:10] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/labweb on labpuppetmaster1001 is OK: No errors detected [15:59:17] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989547 (10thcipriani) Just to create a small test case that demos what's going wrong: `... [15:59:21] conftool says [15:59:25] https://www.irccloud.com/pastebin/PRoJNZam/ [15:59:30] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [15:59:37] is that because of the lack of -i ?
[15:59:41] <_joe_> yes [15:59:52] ok, will try to retrain my fingers [15:59:53] <_joe_> the credentials are available in the root home dir [16:00:02] <_joe_> this is what I did to fix it https://dpaste.de/7ajq/raw [16:00:22] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#3989548 (10fgiunchedi) Today rsyslogd was "stuck" accepting new connections on lithium and wezen, at about the same time. This is a strace from `check_ssl` on einsteinium:... [16:00:30] !log restart rsyslogd on lithium and wezen - T136312 [16:00:34] ok, makes sense [16:00:41] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [16:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:45] T136312: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312 [16:00:50] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1340 days) [16:01:28] (03PS1) 10Gilles: Stop routing Varnish thumb.php traffic to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) [16:01:45] <_joe_> gilles: \o/ [16:01:55] \o/ indeed [16:02:30] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989556 (10elukey) [16:02:40] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [16:02:54] should recover soon [16:03:00] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1277 days) [16:03:44] (03PS1) 10Andrew Bogott: horizon memcache: fix an issue with erb var resolution [puppet] - 
10https://gerrit.wikimedia.org/r/413186 (https://phabricator.wikimedia.org/T187506) [16:04:18] 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989575 (10thcipriani) >>! In T157030#3989447, @demon wrote: > Sorta. From what I can tel... [16:04:33] (03CR) 10Andrew Bogott: [C: 032] horizon memcache: fix an issue with erb var resolution [puppet] - 10https://gerrit.wikimedia.org/r/413186 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [16:05:53] 10Operations, 10vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3989586 (10elukey) [16:06:07] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989596 (10elukey) done! https://phabricator.wikimedia.org/T187901 [16:06:40] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [16:06:51] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [16:08:00] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [16:10:20] (03PS1) 10Andrew Bogott: labweb: inclued role::lvs::realserver on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413188 (https://phabricator.wikimedia.org/T187506) [16:11:05] (03CR) 10Andrew Bogott: [C: 032] labweb: inclued role::lvs::realserver on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413188 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [16:11:18] andrewbogott: varnish reloads are failing with: [16:11:20] Backend host '"labweb.svc.wikimedia.org"' could not be 
resolved to an IP address [16:12:00] RECOVERY - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 453 bytes in 0.001 second response time [16:12:02] <_joe_> ema: ouch [16:12:14] <_joe_> ema: I didn't notice that in my review [16:12:17] what the heck? That's the one part of this that I /do/ know how to do, I thought... [16:12:46] <_joe_> andrewbogott: s/wikimedia.org/eqiad.wmnet/ [16:12:55] yup [16:13:29] crap, ok, fixing [16:13:56] (03PS1) 10Elukey: role::analytics_cluster::coordinator: enable mon. for oozie|hive [puppet] - 10https://gerrit.wikimedia.org/r/413189 (https://phabricator.wikimedia.org/T184794) [16:14:15] so many moving parts [16:14:18] (03PS1) 10Andrew Bogott: labweb: correct labweb service hostname [puppet] - 10https://gerrit.wikimedia.org/r/413190 [16:14:50] (03CR) 10Andrew Bogott: [C: 032] labweb: correct labweb service hostname [puppet] - 10https://gerrit.wikimedia.org/r/413190 (owner: 10Andrew Bogott) [16:16:36] forced a puppet run on cp1045, all good [16:16:55] <_joe_> ok [16:16:57] <_joe_> cool [16:19:10] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [16:19:30] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:19:40] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:20:37] <_joe_> uh? [16:20:49] it looks up to me [16:21:02] <_joe_> that means there is another typo maybe [16:21:20] probably! [16:21:41] <_joe_> is the server name labweb1001.wikimedia.org?
[16:21:54] <_joe_> yes, so no typo [16:21:58] <_joe_> lemme see what's up [16:22:29] <_joe_> curl localhost:9090/pools/labweb_80 [16:22:29] <_joe_> labweb1002.wikimedia.org: enabled/down/not pooled [16:22:29] <_joe_> labweb1001.wikimedia.org: enabled/down/pooled [16:22:53] <_joe_> same on lvs1003 [16:23:16] <_joe_> [labweb_80 IdleConnection] WARN: labweb1001.wikimedia.org (enabled/down/pooled): Connection to 208.80.154.160:80 failed. [16:23:22] how specifically does it decide if the host is up or down? [16:23:24] <_joe_> ok, this seems like a network issue [16:23:27] oh, ok [16:23:32] <_joe_> andrewbogott: depending on your configs [16:23:45] but in this case, it's just seeing if port 80 is up [16:24:05] <_joe_> andrewbogott: no, you are both doing an IdleConnection check, and a ProxyFetch check [16:24:05] which it is, but maybe I need a firewall fix... [16:24:09] <_joe_> both fail [16:24:12] <_joe_> and yes, it seems so [16:24:22] so which ports should I be opening? [16:24:40] (03PS4) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) [16:24:49] elukey: ^ would you review that? [16:25:12] <_joe_> andrewbogott: port 80 I guess, but I think it's more complicated than you think [16:25:17] (and shouldn't that be handled by role::lvs::realserver?) 
[16:25:25] yeah, might not be just the firewall [16:25:28] (03CR) 10jerkins-bot: [V: 04-1] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [16:26:02] _joe_: there's an icinga warning on lvs1010 for pybal-etcd connections: 58 connections established with conf1001.eqiad.wmnet:2379 (min=59) [16:26:15] I'll restart pybal there too [16:26:21] <_joe_> ema: pybal needs to be restarted there, yes [16:26:23] <_joe_> clearly [16:26:31] _joe_: let me know if/when you need me to back this out so you can get on with your life [16:26:36] <_joe_> ema: can you or someone from your team help andrewbogott? [16:27:11] <_joe_> I think the problem is he's trying to do LVS/DR to a public IP from the labs subnet maybe? [16:27:19] <_joe_> I haven't looked into it [16:27:52] !log lvs1010: restart pybal [16:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:05] (03PS5) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) [16:28:49] (03PS2) 10Elukey: role::analytics_cluster::coordinator: enable mon.
for oozie|hive [puppet] - 10https://gerrit.wikimedia.org/r/413189 (https://phabricator.wikimedia.org/T184794) [16:29:20] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [16:30:00] <_joe_> andrewbogott: either you find someone to help troubleshooting this issue, or you should revert [16:30:07] yep, ok [16:31:31] (03PS5) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 [16:31:46] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [16:31:54] right so the LVSs can't connect to 208.80.154.160:80 [16:32:30] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [16:32:51] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) andrew bogott This is a work in progress, Im looking at it. [16:32:51] ACKNOWLEDGEMENT - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled andrew bogott This is a work in progress, Im looking at it. [16:32:51] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) andrew bogott This is a work in progress, Im looking at it. [16:32:51] ACKNOWLEDGEMENT - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled andrew bogott This is a work in progress, Im looking at it. 
[16:32:51] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) andrew bogott This is a work in progress, Im looking at it. [16:32:51] ACKNOWLEDGEMENT - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled andrew bogott This is a work in progress, Im looking at it. [16:33:46] ema: I'm looking at ferm issues but I suspect we also need to do something on the switch to allow this [16:34:32] (03CR) 10BryanDavis: "> could I just 'rm -R /usr/local/lib/mediawiki-config' and run puppet" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [16:36:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989296 (10thcipriani) Looks like the api is able to render a trivial test or two for me currently: https://en.wikipedia.beta.wmflabs.org/w/api.php?ac... 
[16:37:53] !log ppchelko@tin Started deploy [changeprop/deploy@1be63aa]: Simplify ORES precaching by using the new endpoint T158437 [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:09] T158437: Change ORES rules to send all events to new "/precache" endpoint - https://phabricator.wikimedia.org/T158437 [16:39:26] !log ppchelko@tin Finished deploy [changeprop/deploy@1be63aa]: Simplify ORES precaching by using the new endpoint T158437 (duration: 01m 33s) [16:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:21] (03PS1) 10Andrew Bogott: horizon/labweb: open firewall to internal IPs for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/413194 (https://phabricator.wikimedia.org/T187506) [16:40:43] <_joe_> !log testing various etcd failure scenarios on mwdebug1001, T185078 [16:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:57] T185078: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078 [16:41:24] (03CR) 10Andrew Bogott: [C: 032] horizon/labweb: open firewall to internal IPs for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/413194 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [16:41:30] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:46:15] (03PS6) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 [16:47:21] (03CR) 10EBernhardson: [C: 031] elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [16:49:50] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:52:17] (03PS1) 10Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/413197 
(https://phabricator.wikimedia.org/T182604) [16:53:36] (03CR) 10Rush: [C: 031] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [16:53:44] (03PS13) 10Rush: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [16:54:19] (03PS7) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 [16:54:39] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [16:56:01] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10Andrew) [16:56:15] !log oblivian@puppetmaster1001 conftool action : edit; selector: name=ReadOnly,scope=eqiad [16:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:03] (03CR) 10Chad: "Having a root just clean it up was what I had hoped for last night on IRC ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [17:01:39] (03PS1) 10Arturo Borrero Gonzalez: Revert "toollabs: add apt pinnings for key packages" [puppet] - 10https://gerrit.wikimedia.org/r/413198 [17:02:03] (03CR) 10Chad: "db1102 and db1095 complained too, are they labs replicas but poorly named?" 
[puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [17:02:20] (03CR) 10Paladox: [C: 031] Gerrit: Improve registration url [puppet] - 10https://gerrit.wikimedia.org/r/413079 (owner: 10Chad) [17:02:32] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "toollabs: add apt pinnings for key packages" [puppet] - 10https://gerrit.wikimedia.org/r/413198 (owner: 10Arturo Borrero Gonzalez) [17:02:55] (03CR) 10Rush: "rush@tools-checker-01:~$ sudo facter -p | grep -i os_version" [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [17:04:26] Hi [17:04:30] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [17:04:36] (03CR) 10Rush: "root@tools-worker-1001:~# sudo facter -p | grep lsbdistcodename" [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [17:05:27] (03PS2) 10Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) [17:05:56] I have question. Why is gerrit as to I am blocked again? Sometimes when I want to see my change, it showing as to page is not found. Sometimes I see page without problems [17:06:15] You are not blocked. 
[17:06:28] I know [17:06:47] But it showing sometimes as to I am [17:07:04] <_joe_> !log finished testing on mwdebug1001 for swat [17:07:10] Right now, I had problems with showing https://gerrit.wikimedia.org/r/#/c/412947/ [17:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:28] Zoranzoki21 if you were blocked, it would not let you sign in [17:07:30] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [17:07:43] I think what you're trying to view is either a deleted change or a draft [17:07:51] That ^ [17:07:55] https://gerrit.wikimedia.org/r/#/c/412947/ is draft? [17:08:11] https://gerrit.wikimedia.org/r/#/c/412947/ works for me [17:08:17] Zoranzoki21 what error do you get? [17:08:30] (03PS1) 10Ema: labweb lvs: proxyfetch configuration [puppet] - 10https://gerrit.wikimedia.org/r/413201 [17:08:55] The page you requested was not found, or you do not have permission to view this page. [17:09:11] Hmm, does it say internal error? [17:09:14] (too) [17:09:23] no [17:10:05] That shows up for me, even when not logged in [17:10:10] That's inconsistent with being blocked [17:10:14] Or it being private [17:10:19] So it wants at one point for a few seconds. Well, a couple of times when I refresh, it works smoothly for about an hour or two. [17:10:33] *it working [17:10:57] It's like something blocks it to show, and what I do not know. [17:11:05] First I thought that you blocked me again [17:11:05] That sounds like a bad link or something. That's more of a 404-style error than anything [17:11:22] !log ppchelko@tin Started deploy [changeprop/deploy@e9a6bb0]: Use post for ORES precache rules T158437 [17:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:36] T158437: Change ORES rules to send all events to new "/precache" endpoint - https://phabricator.wikimedia.org/T158437 [17:11:37] Well I've got nothing obvious in the error logs.
[17:11:43] But now working [17:12:05] I will, when next time this happen, send screenshot here [17:12:08] (03PS1) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) [17:12:20] (03PS2) 10Ema: labweb lvs: proxyfetch configuration [puppet] - 10https://gerrit.wikimedia.org/r/413201 [17:12:46] !log ppchelko@tin Finished deploy [changeprop/deploy@e9a6bb0]: Use post for ORES precache rules T158437 (duration: 01m 23s) [17:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:57] (03CR) 10Ema: [C: 032] labweb lvs: proxyfetch configuration [puppet] - 10https://gerrit.wikimedia.org/r/413201 (owner: 10Ema) [17:14:14] (03CR) 10Bstorm: "Note: Puppet is disabled on tools-static-11 and tools-static-10 so that this change doesn't impact the existing setup to allow for smooth " [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [17:17:00] !log eqiad LVSs: bounce pybal for labweb proxfetch config changes [17:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:42] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989938 (10Niedzielski) 05Open>03Resolved a:03thcipriani Fixed! Thank you @thcipriani! [17:18:41] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3975346 (10RStallman-legalteam) Hello all, Just an update that I checked w/ the attorneys and we're going to continue to do NDAs for all LDAP access requests that... 
[17:19:04] 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking), 10Release-Engineering-Team (Kanban): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989962 (10thcipriani) >>! In T187891#3989938, @Niedzielski wrote: > Fixed! Thank you @thcipriani! glad to hea... [17:19:07] (03PS2) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) [17:19:20] (03CR) 10BryanDavis: "> db1102 and db1095 complained too, are they labs replicas but poorly" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [17:19:40] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:20:14] (03CR) 10Rush: [C: 031] "Best hopes!" [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [17:20:30] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:20:35] (03PS3) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) [17:21:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [17:21:40] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [17:23:00] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [17:32:13] (03PS1) 10Rush: toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) [17:32:47] (03CR) 10jerkins-bot: [V: 04-1] toolforge: apply pinning to k8s components [puppet] - 
10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush) [17:34:40] <_joe_> !log resuming tests on mwdebug1001 [17:34:45] (03PS2) 10Rush: toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) [17:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:20] (03CR) 10jerkins-bot: [V: 04-1] toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush) [17:36:51] (03CR) 10Arturo Borrero Gonzalez: [V: 031 C: 031] toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush) [17:37:06] (03CR) 10Rush: [V: 032 C: 032] "Sorry style check we have to move forward." [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush) [17:41:33] (03PS1) 10Ema: icinga: promote check_established_connections alerts to critical [puppet] - 10https://gerrit.wikimedia.org/r/413208 (https://phabricator.wikimedia.org/T170847) [17:42:59] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10mark) You are setting up a publicly accessible web service, right? So you should probably open up port 80 (and/or 443) to the entire world, not just LVS servers. Traffic is "routed" via... 
[17:43:57] !log eqsin LVSs: upgrade pybal to 1.14.4 [17:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:48] (03PS3) 10Zoranzoki21: Added throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [17:50:57] (03CR) 10jerkins-bot: [V: 04-1] Added throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21) [17:53:21] (03PS4) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [17:53:25] (03PS5) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [17:57:35] (03PS1) 10Vgutierrez: Provide an UDP monitor. [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) [17:59:47] (03PS3) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1800). [18:00:05] razesoldier, Zoranzoki21, and Jhs: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] I'm here! 
[18:00:17] I'm here [18:00:17] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [18:00:30] and I [18:00:48] Have the problems from earlier today been solved? [18:01:40] Who will be today our swater? [18:03:47] (03CR) 10Elukey: Refactor kafkatee module to support multi instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [18:04:06] 10Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3990159 (10MoritzMuehlenhoff) 05Open>03Resolved This is fully rolled out. [18:07:54] * Jhs pings zeljkof :) [18:08:23] * Zoranzoki21 trying to ping all swaters:) [18:12:44] <_joe_> !log stopped testing on mwdebug1001 for SWAT window [18:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:39] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3990194 (10Volker_E) @Dzahn Any updates on above? Accomplishing this task is part of end of Q goals and is dependent on the curren... [18:16:55] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3990203 (10Niedzielski) [18:17:05] Jhs, Zoranzoki21: sorry, can not swat [18:17:18] I know zeljkof. But who can? [18:17:51] I can SWAT [18:18:08] oh thank you god. Thank you thcipriani [18:18:31] (03PS3) 10Thcipriani: Set Topic namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦) [18:19:00] razesoldier: are you setup to test ^ on the mwdebug machines? 
[18:19:18] (03PS1) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [18:19:24] Yes, I can test [18:19:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦) [18:19:37] cool :) [18:19:44] (03CR) 10jerkins-bot: [V: 04-1] restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 (owner: 10ArielGlenn) [18:19:50] via browser extension [18:20:33] yay thcipriani :) [18:20:44] (03Merged) 10jenkins-bot: Set Topic namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦) [18:20:55] (03CR) 10jenkins-bot: Set Topic namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦) [18:21:21] razesoldier: your change is live on mwdebug1002, check please [18:22:01] looks good, [18:22:22] Can be redirected to topic namespace [18:22:37] (03PS2) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [18:22:39] great, pushing the change live [18:24:45] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412439|Set Topic namespace alias of zhwiki]] T187546 (duration: 01m 13s) [18:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:59] T187546: Set Topic namespace alias of zhwiki - https://phabricator.wikimedia.org/T187546 [18:25:00] ^ razesoldier your change is live everywhere, thanks for the patch! [18:25:26] can my next? 
easier is [18:25:28] Thanks for your swat :D [18:27:25] (03PS6) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [18:28:59] (03CR) 10Thcipriani: [C: 04-1] "comment inline" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21) [18:29:11] Zoranzoki21: time looks wrong ^ [18:29:46] thcipriani: I will for few seconds fix patch [18:29:54] k [18:30:19] Zoranzoki21, you know you can edit directly in Gerrit right? :) [18:30:22] (03PS7) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [18:30:27] I know [18:30:28] I did it [18:30:45] (03PS4) 10Thcipriani: Add namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby) [18:30:47] thanks [18:30:50] (Y) [18:30:51] :) [18:30:58] asn [18:31:01] thcipriani: Now you can [18:32:39] (03PS1) 10Rush: toolforge: update pin for kubernetes-client [puppet] - 10https://gerrit.wikimedia.org/r/413213 (https://phabricator.wikimedia.org/T187193) [18:32:48] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21) [18:33:13] thcipriani: thanks [18:33:28] Zoranzoki21: you're welcome, thanks for the patch [18:33:38] :) [18:33:39] once it merges, I'll sync it live [18:33:45] thcipriani: ok [18:34:17] (03Merged) 10jenkins-bot: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21) [18:35:42] thcipriani: postmerge afraid me again [18:36:59] (03CR) 10jenkins-bot: 
Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21) [18:37:00] (03CR) 10Rush: "This worked seemingly fine for all 3 labsdb10[09|10|11]" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [18:37:08] Zoranzoki21: it's queued now https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/ but there's another job running on the executor that post-merge uses [18:37:24] thcipriani: ok. postmerge always afraiding me [18:37:51] !log labsdb rm -fR /usr/local/lib/mediawiki-config && puppet agent --test [18:37:56] (03CR) 10Rush: "@marostegui could we run '/usr/local/lib/mediawiki-config && puppet agent --test' on db1102 and db1095 as an easy fix for submodule cleanu" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [18:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:43] (03CR) 10Rush: [C: 032] toolforge: update pin for kubernetes-client [puppet] - 10https://gerrit.wikimedia.org/r/413213 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush) [18:38:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby) [18:39:35] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:412947|Added new throttle rule for Wikipedia Women in Red editathon]] T187803 (duration: 01m 12s) [18:39:46] ^ Zoranzoki21 your change is live now [18:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:49] T187803: Temporary lift of IP cap on en.wikipedia for 2018-08-03 - https://phabricator.wikimedia.org/T187803 [18:40:12] thcipriani: tnx [18:40:13] (03Merged) 10jenkins-bot: Add namespace localization for sdwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby) [18:40:20] yw :) [18:40:24] (03CR) 10jenkins-bot: Add namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby) [18:40:50] thcipriani, i assume you saw my comment about script run for this one? [18:42:20] Jhs: saw the tag, do you need namespacedupes run for this? [18:42:34] Jhs: also, it's live on mwdebug1002, check please [18:44:23] thcipriani, yeah, namespaceDupes to be safe [18:44:29] looks good to me on 1002 [18:45:05] ok, I'll sync live and run namespaceDupes on terbium after [18:45:14] coolio (Y) [18:48:28] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412676|Add namespace localization for sdwiki]] T186943 (duration: 01m 13s) [18:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:42] T186943: Localize & change namespaces on Sindhi Wikipedia (sdwiki) - https://phabricator.wikimedia.org/T186943 [18:49:56] Jhs: ^ live and namespacedupes run: 2132 links to fix, 2132 were resolvable. 
[18:50:19] sweet [18:51:05] (03PS2) 10Thcipriani: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby) [18:51:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby) [18:52:34] (03Merged) 10jenkins-bot: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby) [18:52:48] (03CR) 10jenkins-bot: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby) [18:52:53] (03CR) 10Ottomata: Refactor kafkatee module to support multi instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [18:53:21] (03PS6) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) [18:53:28] (03PS2) 10Ottomata: Remove kafkatee as a submodule and re-add it into ops/puppet preserving history [puppet] - 10https://gerrit.wikimedia.org/r/413056 [18:53:34] Jhs: sitename patch for mywiktionary is live on mwdebug1002, check please [18:53:37] (03CR) 10Ottomata: [V: 032 C: 032] Remove kafkatee as a submodule and re-add it into ops/puppet preserving history [puppet] - 10https://gerrit.wikimedia.org/r/413056 (owner: 10Ottomata) [18:53:44] thcipriani: Run namespaceDupes.php (and then again with --fix) after you finish namespace changes [18:54:02] thcipriani, looks good (Y) [18:54:36] It's a pretty safe script to run, but a dry run never hurts :) [18:55:07] no_justification: this is likely true. 
I'll update docs https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes after swat [18:55:17] Jhs: k, going live. [18:57:17] (03PS1) 10Ottomata: Revert "Remove kafkatee as a submodule and re-add it into ops/puppet preserving history" [puppet] - 10https://gerrit.wikimedia.org/r/413215 [18:57:21] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Remove kafkatee as a submodule and re-add it into ops/puppet preserving history" [puppet] - 10https://gerrit.wikimedia.org/r/413215 (owner: 10Ottomata) [18:59:10] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:413166|Add sitename for Burmese Wiktionary]] T187882 (duration: 01m 06s) [18:59:17] ^ Jhs live now [18:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:24] T187882: Localized sitename for Burmese Wiktionary - https://phabricator.wikimedia.org/T187882 [18:59:32] looks good :) [18:59:51] great, thanks for the patch :) [18:59:55] [19:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1900) [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:45] thanks for swatting thcipriani :) [19:00:55] yw :) [19:02:04] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes updated [19:03:10] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:04:01] Thanks for that, thcipriani. 
[19:05:05] * thcipriani doffs hat [19:05:09] (03PS1) 10Ottomata: Refactor kafkatee module to support multi instance [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413217 (https://phabricator.wikimedia.org/T187890) [19:05:31] (03Abandoned) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [19:06:50] (03PS2) 10Ottomata: Refactor kafkatee module to support multi instance [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413217 (https://phabricator.wikimedia.org/T187890) [19:07:06] (03CR) 10Ottomata: [V: 032 C: 032] Refactor kafkatee module to support multi instance [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413217 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [19:07:33] (03PS1) 10Ottomata: Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) [19:11:44] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3990425 (10mobrovac) Given the requirements, I would be inclined to say Kubernetes, but we don't have any services on it yet. So perhaps G... 
[19:13:58] (03CR) 10Smalyshev: wdqs: allow configuration of kafka based updates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [19:16:04] (03PS2) 10Ottomata: Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) [19:16:30] PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:21] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 74582 bytes in 1.325 second response time [19:20:22] (03PS3) 10Ottomata: Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) [19:21:19] 10Operations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#3990473 (10ayounsi) p:05Triage>03Normal [19:23:41] (03CR) 10Ottomata: "Looks good: https://puppet-compiler.wmflabs.org/compiler02/10071/" [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [19:24:42] !log applying changes to kafkatee module, first rhenium then oxygen. will require manual config fixings [19:24:46] (03CR) 10Ottomata: [C: 032] Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:30] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[19:27:50] going to sneak another patch into SWAT, just bumping a pool counter limit for something cirrus [19:27:57] jouncebot: next [19:27:57] In 0 hour(s) and 32 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T2000) [19:28:13] (03PS2) 10EBernhardson: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 [19:29:21] (03CR) 10EBernhardson: [C: 032] Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 (owner: 10EBernhardson) [19:29:46] (03CR) 10Zhuyifei1999: "Why is caching the response from api.cdnjs.com needed?" [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:30:53] (03Merged) 10jenkins-bot: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 (owner: 10EBernhardson) [19:31:08] (03CR) 10jenkins-bot: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 (owner: 10EBernhardson) [19:34:12] !log ebernhardson@tin Synchronized wmf-config/PoolCounterSettings.php: Increase pool counter workers for cirrus namespace lookup (duration: 01m 13s) [19:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:11] (03PS4) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [19:37:24] (03CR) 10Ayounsi: [C: 032] openstack: labs-instance-transport1-b-codfw designations [dns] - 10https://gerrit.wikimedia.org/r/413160 (https://phabricator.wikimedia.org/T184209) (owner: 10Rush) [19:38:08] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [19:38:23] (03CR) 
10Zhuyifei1999: tools-static: Change to reverse proxy of cdnjs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:39:00] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[kafkatee-webrequest] [19:41:04] (03CR) 10Bstorm: "Caching the response is how the other bit (cdnjs-index) works. It used to consume a packages.json from the checkout. Now, we get the sam" [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:41:19] 10Operations, 10Cloud-Services, 10netops: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933#3990570 (10chasemp) [19:41:28] 10Operations, 10Cloud-Services, 10netops: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933#3990586 (10chasemp) p:05Triage>03Low [19:45:42] (03PS1) 10Cmjohnson: Adding mgmt and production dns [dns] - 10https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073) [19:46:10] (03CR) 10Bstorm: "I will also add that, we must cache the api.cdnjs.com response because it takes forever and can fail without special handling. That can b" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:46:32] (03CR) 10Zhuyifei1999: "What I mean is, does cdnjs-index still have to access tools-static in order to generate the index? Can't it fetch the json from the API di" [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:49:19] (03CR) 10Bstorm: "yes and yes. After this is basically working, I can change how the frontend is generated. 
I think freeing up the disk space can happen b" [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:51:24] (03CR) 10Bstorm: "In fact, I can start setting up the api call in cdnjs-index while this is in review. :) It is needed for this change and the present sta" [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:53:15] (03PS1) 10Ottomata: Fix typo in kafkatee.systemd.erb WantedBy [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413231 (https://phabricator.wikimedia.org/T187890) [19:53:27] (03CR) 10Ottomata: [V: 032 C: 032] Fix typo in kafkatee.systemd.erb WantedBy [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413231 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata) [19:54:07] (03PS1) 10Ottomata: Update kafkatee submodule with systemd typo fix [puppet] - 10https://gerrit.wikimedia.org/r/413232 [19:54:16] (03CR) 10Ottomata: [V: 032 C: 032] Update kafkatee submodule with systemd typo fix [puppet] - 10https://gerrit.wikimedia.org/r/413232 (owner: 10Ottomata) [19:56:13] (03CR) 10Zhuyifei1999: tools-static: Change to reverse proxy of cdnjs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [19:57:33] (03PS3) 10Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) [19:59:00] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:00:04] twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:01:22] (03PS2) 10Krinkle: mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) [20:07:11] (03PS1) 10Brion VIBBER: WIP - gzip .stl files on transfer (application/sla) [puppet] - 10https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) [20:07:28] (03PS1) 10Ottomata: Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) [20:10:10] (03PS1) 10Andrew Bogott: labweb: remove specific memcached port [puppet] - 10https://gerrit.wikimedia.org/r/413239 (https://phabricator.wikimedia.org/T187506) [20:10:34] !log phab2001 - testing phab restart cron [20:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:57] (03CR) 10Dzahn: [C: 032] "tested and compiled http://puppet-compiler.wmflabs.org/10072/" [puppet] - 10https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 1020after4) [20:11:04] (03PS2) 10Dzahn: Phabricator: restart apache every sunday night [puppet] - 10https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 1020after4) [20:12:14] (03CR) 10Andrew Bogott: [C: 032] labweb: remove specific memcached port [puppet] - 10https://gerrit.wikimedia.org/r/413239 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [20:14:03] (03PS2) 10Ottomata: Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) [20:14:50] wow, i cant submit it because it needs rebase but the rebase button also doesnt work.. ok.. 
[20:15:00] that needs special timing :) [20:16:47] (03CR) 10Ottomata: [V: 032 C: 032] "Looks good: https://puppet-compiler.wmflabs.org/compiler02/10074/oxygen.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [20:16:54] (03CR) 10Ottomata: [C: 032] Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [20:16:58] (03PS3) 10Ottomata: Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) [20:17:00] (03CR) 10Ottomata: [V: 032 C: 032] Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [20:21:10] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:21:21] ^ me [20:21:25] just fixed [20:22:10] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [20:22:27] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413240 [20:22:29] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413240 (owner: 1020after4) [20:24:08] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3990747 (10Andrew) This particular service is behind the misc-web varnishes. So port 80 needs to be open to those varnishes and to lvs, but nothing else. $DOMAIN_NETWORKS covers both those sets, s... 
[20:24:17] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413240 (owner: 1020after4) [20:26:07] (03PS3) 1020after4: Phabricator: restart apache every sunday night [puppet] - 10https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) [20:26:50] twentyafterfour: oh, you are already fixing it :) [20:26:57] it got into a special kind of rebase trap :) [20:27:09] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413240 (owner: 1020after4) [20:27:22] i was able to submit now, thx [20:27:33] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.22 [20:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:42] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.22 (duration: 01m 08s) [20:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:45] uhm [20:30:23] Database is read-only: The database has been automatically locked while the replica database servers catch up to the master. [20:31:18] Specific wikis? [20:31:35] I'm guessing mostly s3 wikis cuz group1 [20:31:38] commons [20:31:45] Ah so s4 [20:31:56] also wikidata [20:32:11] Hmmm [20:32:15] happened immediately after group1 [20:32:22] but I don't see how it could be related? 
[20:32:29] I guess I should roll back anyway [20:32:49] heh, also Notice: Array to string conversion in /srv/mediawiki/php-1.31.0-wmf.21/includes/libs/rdbms/database/position/MySQLMasterPos.php on line 41 [20:33:04] but hmm that's the old branch [20:33:17] I saw that one [20:33:23] Roll back [20:33:29] If it goes away....related :p [20:34:25] !log rolling back group1 to wmf.21 [20:34:35] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413241 [20:34:37] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413241 (owner: 1020after4) [20:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:46] Hold the sync for like 30 seconds [20:34:50] Wanna check tendril [20:34:50] (03PS1) 10Ottomata: Set up secondary webrequest camus job to consume from jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413242 [20:34:51] ok [20:34:54] (03CR) 10Dzahn: [C: 032] "Notice: /Stage[main]/Profile::Phabricator::Main/Cron[phab_restart]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 1020after4) [20:35:43] (03CR) 1020after4: "Thanks, dzahn!" 
[puppet] - 10https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 1020after4) [20:36:03] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413241 (owner: 1020after4) [20:36:18] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3990786 (10Dzahn) restart cron has been installed on both servers [20:36:20] no_justification: ok tell me when, I'll sync [20:36:40] Hmm, tendril reports no replag :\ [20:37:03] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413241 (owner: 1020after4) [20:37:12] Go ahead and sync [20:37:41] (03CR) 10Ottomata: "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/413242 (owner: 10Ottomata) [20:37:42] so the code that detects it is broken then? [20:37:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:37:44] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10075/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/413242 (owner: 10Ottomata) [20:37:46] (03CR) 10Ottomata: [C: 032] Set up secondary webrequest camus job to consume from jumbo [puppet] - 10https://gerrit.wikimedia.org/r/413242 (owner: 10Ottomata) [20:37:54] icinga-wm: you're too slow [20:38:08] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.21 [20:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:21] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.21 (duration: 01m 12s) [20:39:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:09] yeah so there is _something_ wrong with that error reporting in wmf.22 [20:40:20] is that code new? [20:40:34] auto-read-only [20:42:49] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3990805 (10Dzahn) @Volker_E I think the ball is in your court. As i said above , it's not a problem as long as you can get your co... [20:43:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:43:45] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3990811 (10Dzahn) please see https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests [20:44:36] (03PS1) 10Ottomata: Produce webrequest_misc logs to Kafka jumbo instead of Kafka analytics [puppet] - 10https://gerrit.wikimedia.org/r/413243 (https://phabricator.wikimedia.org/T185136) [20:45:06] twentyafterfour: Question for AaronSchulz [20:47:04] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3975346 (10Dzahn) I'll also start adding the "wmde" users to our "ldap_only" admins group then to avoid confusion. [20:48:08] 10Operations, 10vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3989586 (10Dzahn) What host names do we want to use? 
[20:48:26] 10Operations, 10vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3990824 (10Dzahn) (because DNS needs to exist before anything else and the VMs can be created) [20:51:52] (03PS2) 10Dzahn: Gerrit: Improve registration url [puppet] - 10https://gerrit.wikimedia.org/r/413079 (owner: 10Chad) [20:53:11] (03CR) 10Dzahn: [C: 032] Gerrit: Improve registration url [puppet] - 10https://gerrit.wikimedia.org/r/413079 (owner: 10Chad) [20:53:21] no_justification: yep, created a task T187942 [20:53:22] T187942: Replication lag detection broken in wmf.22 - https://phabricator.wikimedia.org/T187942 [20:53:39] 10Operations, 10Mathoid, 10Prod-Kubernetes, 10Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3990859 (10mobrovac) [20:53:42] !log MediaWiki Train for 1.31.0-wmf.22 is blocked by T187942 [20:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:55] (03CR) 10Dzahn: [C: 032] "deployed. restarted service on gerrit2001, but not on cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/413079 (owner: 10Chad) [20:55:33] (03PS2) 10Dzahn: Add two exceptions to long-running screen monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/412674 (owner: 10Muehlenhoff) [20:55:41] (03CR) 10Dzahn: [C: 032] Add two exceptions to long-running screen monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/412674 (owner: 10Muehlenhoff) [20:56:54] 10Operations, 10vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3990869 (10Dzahn) How about "kafkamon", akin to "netmon"? 
[21:00:01] (03PS3) 10Giuseppe Lavagetto: mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:01:20] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [21:02:17] nothing for parsoid [21:05:18] nothing for mobileapps today [21:10:11] (03PS3) 10Dzahn: Add two exceptions to long-running screen monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/412674 (owner: 10Muehlenhoff) [21:13:27] I’m pushing a minor ORES service update. 
[21:14:34] !log ppchelko@tin Started deploy [restbase/deploy@56fffcf]: Do not check for article deletion for update requests T181636 [21:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:49] T181636: Content service incorrectly reports article as "deleted" - https://phabricator.wikimedia.org/T181636 [21:15:26] !log awight@tin Started deploy [ores/deploy@7bbf21f]: T187914 on the ores* cluster [21:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:39] T187914: New precache endpoint isn't reporting its metrics correctly - https://phabricator.wikimedia.org/T187914 [21:20:21] (03PS4) 10Dzahn: Add two exceptions to long-running screen monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/412674 (owner: 10Muehlenhoff) [21:23:44] !log restart hhvm on mw1221 - high load alarms [21:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:03] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [21:25:17] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3990946 (10cwdent) [21:25:55] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3990951 (10cwdent) [21:26:53] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [21:27:44] !log restart hhvm on mw1227 - high load alarms [21:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:28] !log awight@tin Finished deploy [ores/deploy@7bbf21f]: T187914 on the ores* cluster (duration: 13m 03s) [21:28:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:40] T187914: New precache endpoint isn't reporting its metrics correctly - https://phabricator.wikimedia.org/T187914 [21:29:05] !log awight@tin Started deploy [ores/deploy@addba9c]: T187914 on the scb* cluster [21:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:31] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3990975 (10cwdent) [21:30:02] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3990976 (10cwdent) [21:30:11] !log restart hhvm on mw1229 - high load alarms [21:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:32] !log ppchelko@tin Finished deploy [restbase/deploy@56fffcf]: Do not check for article deletion for update requests T181636 (duration: 15m 59s) [21:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:45] T181636: Content service incorrectly reports article as "deleted" - https://phabricator.wikimedia.org/T181636 [21:30:48] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3990984 (10cwdent) [21:34:48] !log restart hhvm on mw1232 - high load alarms [21:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:07] !log awight@tin Finished deploy [ores/deploy@addba9c]: T187914 on the scb* cluster (duration: 10m 02s) [21:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:22] T187914: New precache endpoint isn't reporting its metrics correctly - https://phabricator.wikimedia.org/T187914 [21:44:17] (03CR) 10Cdentinger: [C: 04-1] "@cmjohnson I was not educated about the vlans when I chose these IPs, I updated all the task descriptions with better information: T186073" [dns] - 
10https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073) (owner: 10Cmjohnson) [21:44:55] !log restart hhvm on mw1233 - high load alarms [21:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] !log restart hhvm on mw1235 - high load alarms [21:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:59] 10Operations, 10DBA, 10Release-Engineering-Team, 10cloud-services-team, 10wikitech.wikimedia.org: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3896851 (10demon) >>! In T184805#3896861, @jcrespo wrote: > Only adding #releng and #wmcs in case they can think of a reason not to move them... [21:50:39] !log restart hhvm on mw1224 - high load alarms [21:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:29] (03PS4) 10Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) [21:53:59] (03CR) 10Bstorm: "The new version removes the cron to generate packages.json altogether because I've already merged the api call into cdnjs-index. This is " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [21:58:32] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, 10Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3991103 (10greg) [22:04:34] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3991109 (10RStallman-legalteam) Raz's NDA is fully signed. Thanks! [22:31:04] (03CR) 10BBlack: [C: 031] "Looks sane, and doesn't look like any other known mimetypes happen to contain the string "sla"." 
[puppet] - https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) (owner: Brion VIBBER) [22:38:47] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3991212 (mmodell) [22:38:51] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3991211 (mmodell) Open>Resolved [22:52:57] Operations, HHVM, Patch-For-Review, Performance-Team (Radar): HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#3991252 (Joe) we finally tracked this down to `JpegMetadataExtractor::segmentSplitter` where a infinite loop can happen in case the jpeg is broken: https://gith... [23:21:46] (PS1) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [23:22:22] (CR) jerkins-bot: [V: -1] labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott) [23:26:14] (PS2) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [23:26:53] (CR) jerkins-bot: [V: -1] labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott) [23:29:14] (PS3) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [23:33:07] Operations, ops-eqiad, netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#3991379 (ayounsi) p:Triage>Normal 
[23:33:39] (PS2) Gergő Tisza: Enable loginOnly mode for local auth provider on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420) [23:40:44] Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991415 (ayounsi) p:Triage>Normal [23:42:28] Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991441 (ayounsi) [23:57:09] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3991466 (Dzahn) a:MoritzMuehlenhoff>Dzahn