[00:00:02] <wikibugs>	 (CR) Dzahn: [C: 2] "confirmed with racktables, ping/host" [dns] - https://gerrit.wikimedia.org/r/407454 (owner: Papaul)
[00:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 8 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T0000).
[00:00:04] <jouncebot>	 mooeypoo, Amir1, eddiegp, and Jamesofur: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:09] <wikibugs>	 (PS4) Dzahn: DNS: Add production DNS entry for db2093 [dns] - https://gerrit.wikimedia.org/r/407454 (owner: Papaul)
[00:00:18] <Amir1>	 o/ 
[00:00:22] <Amir1>	 my patch is not testable
[00:00:23] <Hauskatze>	 o/
[00:00:34] <eddiegp>	 o/
[00:00:41] <Hauskatze>	 I'm here with Jamesofur 
[00:00:42] <Jamesofur>	 \o
[00:00:46] <mooeypoo>	 o/
[00:02:22] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987795 (Dzahn) host 10.192.48.91 91.48.192.10.in-addr.arpa domain name pointer db2093.codfw.wmnet.  host db2093.codfw.wmnet db2093.codfw.wmnet...
[00:05:35] <wikibugs>	 (PS1) Dzahn: install_server: rename tendril2001 to db2093 [puppet] - https://gerrit.wikimedia.org/r/413068 (https://phabricator.wikimedia.org/T186123)
[00:05:45] <Hauskatze>	 who can swat today?
[00:06:36] <wikibugs>	 (CR) Dzahn: [C: 2] install_server: rename tendril2001 to db2093 [puppet] - https://gerrit.wikimedia.org/r/413068 (https://phabricator.wikimedia.org/T186123) (owner: Dzahn)
[00:06:48] <thcipriani>	 I can SWAT
[00:07:18] <thcipriani>	 mooeypoo: looks like no_justification did yours already, is that correct?
[00:07:32] <wikibugs>	 (PS3) Thcipriani: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:07:37] <wikibugs>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:07:43] <mooeypoo>	 thcipriani, okay omg that explains so much
[00:07:49] <thcipriani>	 :)
[00:07:55] <mooeypoo>	 thcipriani, etonkovidova and I are trying to test the bug and we can't find it
[00:07:57] <mooeypoo>	 rofl
[00:08:11] <mooeypoo>	 "HOW IS IT NOT BROKEN!" was heard in the office several times.
[00:08:16] <mooeypoo>	 Thanks ;) i guess it was done
[00:08:19] <thcipriani>	 haha nice
[00:08:54] <no_justification>	 thcipriani: yes
[00:08:56] <Hauskatze>	 hmm... why can they get their fixes sooner? :P
[00:09:02] <Hauskatze>	 <joke>
[00:09:22] <wikibugs>	 (Merged) jenkins-bot: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:09:32] <wikibugs>	 (CR) jenkins-bot: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) (owner: Ladsgroup)
[00:09:42] <Hauskatze>	 x-kill looks like some nerve agent
[00:10:08] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987813 (Dzahn) @Papaul prod IP added, renamed in DHCP, partman doesn't have to be changed. you can now go ahead with the OS install
[00:10:35] <no_justification>	 Hauskatze: ssshhh don't give away the plans
[00:10:54] <Hauskatze>	 no_justification: as long as you exempt me from the slaughter...
[00:11:04] <Hauskatze>	 otherwise I'm calling the cops
[00:11:10] <thcipriani>	 Amir1: x-kill patch is a thing you have to monitor, can't be tested, correct?
[00:11:10] <Hauskatze>	 your choice dear
[00:11:25] <Amir1>	 yup
[00:11:50] <thcipriani>	 k, going live
[00:12:29] <Amir1>	 if it blows up, it will happen at least several days from now and I'm constantly monitoring everything, might turn it off for some wikis. built some metrics just to make sure
[00:14:05] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412664|Enable x-kill feature everywhere]] T186714 T184322 (duration: 01m 13s)
[00:14:10] <thcipriani>	 ^ Amir1 live now
[00:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:22] <stashbot>	 T186714: enable x-kill feature on Commons - https://phabricator.wikimedia.org/T186714
[00:14:22] <stashbot>	 T184322: Enable fine grained lua tracking gradually in client wikis - https://phabricator.wikimedia.org/T184322
[00:14:41] <Amir1>	 Thanks
[00:14:46] <James_F>	 thcipriani: BTW, jouncebot didn't seem to spot my backport. Did you? :-)
[00:15:14] <thcipriani>	 James_F: https://gerrit.wikimedia.org/r/#/c/411298/ ? Going through the zuul tubes now.
[00:15:25] <James_F>	 Ah, yes. Awesome. :-)
[00:15:27] <wikibugs>	 (CR) Huji: [C: -1] Allow CheckUsers and Stewards to access private data from the AbuseLog (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:17:14] <wikibugs>	 (CR) Jalexander: ">" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:17:38] <wikibugs>	 (CR) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:17:51] <wikibugs>	 (PS4) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357)
[00:18:30] <Hauskatze>	 heh Jamesofur you de+1'd it :P
[00:18:47] * Jamesofur rolls his eyes a bit
[00:18:56] <wikibugs>	 (CR) Jalexander: [C: 1] Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:19:44] <wikibugs>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:21:17] <wikibugs>	 (Merged) jenkins-bot: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:21:21] <Hauskatze>	 oh, it's happening
[00:21:28] <wikibugs>	 (CR) jenkins-bot: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:21:46] <thcipriani>	 eddiegp: James_F both of your changes are live on mwdebug1002, check please
[00:22:15] <eddiegp>	 ack
[00:22:55] <James_F>	 thcipriani: Yeah, looks good to me.
[00:23:20] <thcipriani>	 James_F: ok, making your change live
[00:24:41] <eddiegp>	 thcipriani: Works.
[00:25:02] <wikibugs>	 (PS1) Dzahn: rename tendril2001.mgmt to db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413070 (https://phabricator.wikimedia.org/T186123)
[00:25:05] <eddiegp>	 So can be deployed as well.
[00:25:21] <thcipriani>	 eddiegp: ok, will deploy after current sync is done
[00:26:19] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.21/resources/src/mediawiki/mediawiki.ForeignStructuredUpload.js: SWAT: Follow-up I0bb4ed7f7: [[gerrit:411298|Use correct "this"]] T187523 (duration: 01m 13s)
[00:26:23] <thcipriani>	 ^ James_F live
[00:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:33] <stashbot>	 T187523: Unable to upload images in VisualEditor in both Chrome and Firefox on beta and in production - https://phabricator.wikimedia.org/T187523
[00:26:46] <wikibugs>	 (CR) Dzahn: [C: 2] rename tendril2001.mgmt to db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413070 (https://phabricator.wikimedia.org/T186123) (owner: Dzahn)
[00:28:13] <James_F>	 thcipriani: Thank you.
[00:28:30] <thcipriani>	 yw :)
[00:29:13] * eddiegp just realised I've tested a wmf.21 cherry-pick on testwiki, a group.0 wiki (running wmf.22 which already includes the fix)
[00:29:25] <eddiegp>	 But tested it on dewiki now, which also worked :D
[00:29:37] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.21/includes/page/WikiPage.php: SWAT: [[gerrit:413059|site_stats: Unbreak counting newly created pages]] (duration: 01m 12s)
[00:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:54] <thcipriani>	 ^ eddiegp well it's live now :)
[00:30:07] <eddiegp>	 thcipriani: Thanks!
[00:30:09] <thcipriani>	 yw :)
[00:31:31] <wikibugs>	 (PS1) Dzahn: fix subnet for db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413072
[00:31:43] <Krinkle>	 mutante: hmm it now has a rewrite and an override. It looks like the rewrite command one line higher is now obsolete right?
[00:31:50] <thcipriani>	 Jamesofur: Hauskatze your change is live on mwdebug1002, check please
[00:31:54] <Krinkle>	 Or does it still do something?
[00:31:56] <Hauskatze>	 ack, checking
[00:31:59] <eddiegp>	 I guess I can file a task for running the maintenance script then.
[00:32:19] <Krinkle>	 Ah I guess it’s for root url only
[00:32:33] <wikibugs>	 (CR) Dzahn: [C: 2] fix subnet for db2093.mgmt [dns] - https://gerrit.wikimedia.org/r/413072 (owner: Dzahn)
[00:32:50] <eddiegp>	 Krinkle: The override is just that single URL and takes precedence over the rewrite, the rewrite is a wildcard.
[00:33:08] <wikibugs>	 (CR) Huji: [C: 1] "Marco's explanation was satisfactory" [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:33:22] <eddiegp>	 Yeah, right.
[00:33:40] <Hauskatze>	 thcipriani: wfm; waiting on Jamesofur
[00:33:42] <mutante>	 Krinkle: i had the same thought at first.. then i read the comments above.. then i wasn't sure.. then i tested it and could confirm it works :p
[00:34:06] <mutante>	 it's created this way by the .dat file
[00:34:25] <mutante>	 and what eddie said :)
[00:35:57] <eddiegp>	 tbh I found the dat file confusing too, I've fiddled with the options there and then compiled it a few times until the apache diff looked right to me :D
[00:36:04] <mutante>	 it still does this:
[00:36:04] <mutante>	 http://techblog.wikimedia.org/foo/
[00:36:05] <mutante>	  * 301 Moved Permanently http://blog.wikimedia.org/foo/
[00:36:06] <mutante>	 tested that
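The precedence eddiegp describes above (a single-URL override that wins over a wildcard rewrite) can be modelled in a few lines of Python. This is only an illustrative sketch, not the Apache config generated from the .dat file: `OVERRIDES`, `WILDCARD_REWRITES` and `resolve_redirect` are hypothetical names, and the override target is a placeholder, since the log never says where the root URL redirects to.

```python
from typing import Optional

# Hypothetical rule tables for illustration only.
# Exact-URL overrides: (host, path) -> target URL. The target is a placeholder.
OVERRIDES = {
    ("techblog.wikimedia.org", "/"): "https://example.org/placeholder-target",
}

# Wildcard rewrites: host -> new base URL; the request path is carried over.
WILDCARD_REWRITES = {
    "techblog.wikimedia.org": "http://blog.wikimedia.org",
}


def resolve_redirect(host: str, path: str) -> Optional[str]:
    """Return the redirect target for a request, or None if no rule matches."""
    # The exact-match override is checked first, so it takes precedence.
    override = OVERRIDES.get((host, path))
    if override is not None:
        return override
    # Otherwise fall back to the wildcard rewrite, keeping the path.
    base = WILDCARD_REWRITES.get(host)
    if base is not None:
        return base + path
    return None


if __name__ == "__main__":
    # Mirrors the case mutante tested: /foo/ is handled by the wildcard.
    print(resolve_redirect("techblog.wikimedia.org", "/foo/"))
```

Checking the override table before the wildcard table is what keeps the older rewrite line useful: it still handles every path other than the one overridden URL.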
[00:38:56] <wikibugs>	 (CR) Dzahn: [C: -2] "you can abandon this. it is now called db2093 instead and that is already covered by partman regex" [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[00:39:21] <Jamesofur>	 thcipriani: Hauskatze yup yup, sorry I had an emergency come up but back now
[00:39:26] <Jamesofur>	 (but WFM)
[00:39:53] <thcipriani>	 no worries. OK, going live.
[00:43:31] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:413062|Allow CheckUsers and Stewards to access private data from the AbuseLog]] T160357 (duration: 01m 12s)
[00:43:43] <thcipriani>	 ^ Jamesofur Hauskatze live now
[00:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:45] <stashbot>	 T160357: Allow those with CheckUser right to access AbuseLog private information on WMF projects - https://phabricator.wikimedia.org/T160357
[00:43:54] <Hauskatze>	 \o/
[00:44:34] <Jamesofur>	 thcipriani: looking good, thanks :)
[00:44:57] <thcipriani>	 yw, glad to hear it :)
[00:50:09] <wikibugs>	 (PS2) Chad: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369
[00:51:33] <wikibugs>	 Operations, DBA, MediaWiki-General-or-Unknown, MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3987931 (greg) Hi, sorry, my bugmail backlog is woefully long right n...
[00:51:51] <wikibugs>	 (PS2) Dzahn: Gerrit: Tweak SSH timeout settings and such [puppet] - https://gerrit.wikimedia.org/r/411397 (owner: Chad)
[00:52:47] <wikibugs>	 (CR) Dzahn: [C: 2] Gerrit: Tweak SSH timeout settings and such [puppet] - https://gerrit.wikimedia.org/r/411397 (owner: Chad)
[00:53:40] <wikibugs>	 (PS3) Dzahn: Gerrit: Also set ldap read timeout [puppet] - https://gerrit.wikimedia.org/r/411394 (owner: Chad)
[00:54:10] <wikibugs>	 (PS3) Papaul: Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123)
[00:54:21] <wikibugs>	 (CR) Dzahn: [C: 2] Gerrit: Also set ldap read timeout [puppet] - https://gerrit.wikimedia.org/r/411394 (owner: Chad)
[00:56:28] <mutante>	 !log gerrit2001 - restarted gerrit to test that gerrit:411397 and gerrit:411394 don't break anything - didn't touch cobalt right now to minimize affecting users and their logins
[00:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:23] <wikibugs>	 (PS4) Dzahn: Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[00:59:22] <icinga-wm>	 PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:59:32] <wikibugs>	 (PS5) Dzahn: Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[01:00:25] <wikibugs>	 (CR) Dzahn: [C: 2] Partman: Add db2093 to partman recipe [puppet] - https://gerrit.wikimedia.org/r/408731 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[01:14:40] <no_justification>	 mutante: ty for the merges there
[01:15:07] <mutante>	 no_justification: you're welcome
[01:15:50] <no_justification>	 mutante: I'd also like to revisit the key exchange algorithms & related settings, but low-prio
[01:16:12] <wikibugs>	 (CR) Chad: [C: 2] Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369 (owner: Chad)
[01:16:15] <mutante>	 no_justification: if we can be more strict. yes please
[01:16:32] <no_justification>	 I think we can. We already blacklisted 2 of them, but we could probably do better.
[01:16:47] <mutante>	 yea, worth re-checking . *nod*
[01:17:28] <wikibugs>	 (Merged) jenkins-bot: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369 (owner: Chad)
[01:17:39] <wikibugs>	 (CR) jenkins-bot: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/411369 (owner: Chad)
[01:24:12] <icinga-wm>	 PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:24:23] <icinga-wm>	 RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:24:43] <icinga-wm>	 PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:31:32] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987999 (Papaul)
[01:31:54] <no_justification>	 Ummmm...can't pull ops/mw-config to db*?
[01:32:42] <logmsgbot>	 !log demon@tin Synchronized docroot/: Swapping wikimedia.org docroot for symlink (duration: 01m 27s)
[01:32:42] <stashbot>	 demon@tin: Failed to log message to wiki. Somebody should check the error logs.
[01:32:48] <no_justification>	 dafuq?
[01:32:56] <no_justification>	 That. Failed. Bad.
[01:33:53] <icinga-wm>	 PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:34:19] <wikibugs>	 Operations, DBA, MediaWiki-General-or-Unknown, MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3988002 (EddieGP) >>! In T176754#3987931, @greg wrote: > Hi, sorry, m...
[01:34:43] <icinga-wm>	 PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:36:03] <logmsgbot>	 !log demon@tin Synchronized docroot/: Swapping wikimedia.org docroot for symlink (second try, old WPFirefoxMobileOS cleanup was still needed) (duration: 01m 12s)
[01:36:05] <no_justification>	 Ok, so the DB servers do *not* like my merge. But why?
[01:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:18] <no_justification>	 I have a guess....
[01:39:02] <icinga-wm>	 PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[01:39:48] <no_justification>	 Can someone have a look at labsdb1011:/usr/local/lib/mediawiki-config and tell me if it has docroot/wikimedia.org/WikipediaFirefoxMobileOS
[01:40:05] <no_justification>	 (if so, delete it, those should've been cleaned up before but deleting submodules is freaky magic)
[01:40:16] <no_justification>	 or any of the labsdb* ones that are failing?
[01:43:20] <no_justification>	 bd808: Pinging you cuz labsdb* are "yours" ^
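The check no_justification asks for above (does a leftover submodule directory still exist inside the mediawiki-config checkout?) could be scripted roughly as follows. This is a hedged sketch, not an existing tool: `find_stale_paths` is a hypothetical helper, it only reports and never deletes, and the directory name is copied as spelled in the request, so it may need adjusting to the real name on disk.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_stale_paths(checkout_root: Path, relative_paths: Iterable[str]) -> List[Path]:
    """Return the candidate paths that still exist under the checkout."""
    stale = []
    for rel in relative_paths:
        candidate = checkout_root / rel
        # is_symlink() also catches dangling symlinks, which exists() misses.
        if candidate.exists() or candidate.is_symlink():
            stale.append(candidate)
    return stale


if __name__ == "__main__":
    # Demo against a throwaway directory rather than a real host.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        leftover = "docroot/wikimedia.org/WikipediaFirefoxMobileOS"
        (root / leftover).mkdir(parents=True)
        for path in find_stale_paths(root, [leftover, "docroot/not-there"]):
            print(path.relative_to(root))
```

On a real host the first argument would be the checkout directory named in the request, for example `Path("/usr/local/lib/mediawiki-config")`.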
[01:47:42] <wikibugs>	 (PS1) Chad: Gerrit: Improve registration url [puppet] - https://gerrit.wikimedia.org/r/413079
[01:48:17] <no_justification>	 Meh, I'll revert for now :\
[01:48:49] <wikibugs>	 (PS1) Chad: Revert "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/413080
[01:48:55] <wikibugs>	 (CR) Chad: [V: 2 C: 2] Revert "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/413080 (owner: Chad)
[01:50:09] <wikibugs>	 (CR) jenkins-bot: Revert "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/413080 (owner: Chad)
[01:51:09] <logmsgbot>	 !log demon@tin Synchronized docroot/: revert docroot improvements. some servers don't like improvements (duration: 01m 12s)
[01:51:19] * no_justification waits for labsdb* and friends to recover
[01:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:29] * no_justification files task to clean up old instances of WikipediaFirefoxMobileOS
[01:54:12] <icinga-wm>	 RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:54:43] <icinga-wm>	 RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:54:47] <no_justification>	 !log WikipediaMobileFirefoxOS submodule references caused labsdb* (and related) puppet failures. They should recover now (self reverted my docroot changes). Filed T187850
[01:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:01] <stashbot>	 T187850: Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850
[02:01:50] <no_justification>	 !log running `initSiteStats.php --update` for all wikis in small.dblist. T187845
[02:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:04] <stashbot>	 T187845: Run initSiteStats.php for all wikis - https://phabricator.wikimedia.org/T187845
[02:03:56] <icinga-wm>	 RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:56] <icinga-wm>	 RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:04:46] <icinga-wm>	 RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:15:56] <no_justification>	 !log running `initSiteStats.php --update` for all wikis in medium.dblist. T187845
[02:16:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:10] <stashbot>	 T187845: Run initSiteStats.php for medium/large.dblist - https://phabricator.wikimedia.org/T187845
[02:21:11] <bd808>	 no_justification: can you file a task to check that? I actually don’t have root there to clean anything up for $reasons
[02:21:17] <no_justification>	 I did
[02:21:28] <no_justification>	 T187850
[02:21:28] <stashbot>	 T187850: Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850
[02:21:31] <bd808>	 Awesome 
[02:21:41] <no_justification>	 (also some stuff in beta busted & recovered after I reverted)
[02:31:20] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.21) (duration: 06m 18s)
[02:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:08] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 746.87 seconds
[03:31:55] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@0e28f49]: updating branded graphics
[03:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:43] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@0e28f49]: updating branded graphics (duration: 02m 49s)
[03:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:34] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3988175 (Papaul)
[04:05:18] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 235.94 seconds
[04:05:38] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3934359 (Papaul) a:Papaul>Marostegui @Marostegui  it is all yours. Installation complete .
[04:10:52] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@0e7783d]: updating branded graphics slightly more
[04:11:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:37] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@0e7783d]: updating branded graphics slightly more (duration: 02m 45s)
[04:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:06] <wikibugs>	 (PS1) BryanDavis: labsdb: Remove obsolete mediawiki-config submodule [puppet] - https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850)
[05:20:23] <wikibugs>	 (CR) BryanDavis: "It might be easier for a root to just rm these files manually on labsdb1009, labsdb1010, and labsdb1011." [puppet] - https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: BryanDavis)
[05:43:48] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:43:48] <icinga-wm>	 PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:08] <icinga-wm>	 PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:09] <icinga-wm>	 PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:18] <icinga-wm>	 PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:28] <icinga-wm>	 PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:28] <icinga-wm>	 PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:45:58] <icinga-wm>	 PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:46:08] <icinga-wm>	 PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:46:58] <icinga-wm>	 PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:02] <wikibugs>	 (PS4) KartikMistry: Deploy Compact Language Links out of Beta on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677)
[05:47:08] <icinga-wm>	 PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:18] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:28] <icinga-wm>	 PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:48] <icinga-wm>	 PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:11:59] <icinga-wm>	 RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:12:08] <icinga-wm>	 RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:12:18] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:12:28] <icinga-wm>	 RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:12:48] <icinga-wm>	 RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:13:48] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:13:50] <icinga-wm>	 RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:14:09] <icinga-wm>	 RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:14:09] <icinga-wm>	 RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:14:18] <icinga-wm>	 RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:14:29] <icinga-wm>	 RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:14:29] <icinga-wm>	 RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:15:58] <icinga-wm>	 RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:16:11] <icinga-wm>	 RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
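The PROBLEM and RECOVERY notifications above can be paired per check and host to see how long each alert stayed open. The following is a minimal sketch under stated assumptions: `LINE_RE` and `outage_durations` are hypothetical names, the regex is fitted only to the icinga-wm lines visible in this log, and all timestamps are assumed to fall on the same day (the log carries no dates).

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Iterable

# Fitted to lines such as:
# [05:46:58] <icinga-wm>  PROBLEM - puppet last run on prometheus2003 is CRITICAL: ...
LINE_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] <icinga-wm>\s+"
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
)


def outage_durations(lines: Iterable[str]) -> Dict[str, timedelta]:
    """Pair each PROBLEM with the next RECOVERY for the same check and host."""
    open_problems: Dict[str, datetime] = {}
    durations: Dict[str, timedelta] = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not an icinga-wm alert line
        key = f"{match['check']} on {match['host']}"
        stamp = datetime.strptime(match["time"], "%H:%M:%S")
        if match["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_problems.setdefault(key, stamp)
        elif key in open_problems:
            durations[key] = stamp - open_problems.pop(key)
    return durations


if __name__ == "__main__":
    sample = [
        "[05:46:58] <icinga-wm>\t PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail.",
        "[06:11:59] <icinga-wm>\t RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled",
    ]
    for alert, duration in outage_durations(sample).items():
        print(alert, duration)
```

A RECOVERY with no earlier PROBLEM in the input is ignored, which matters when a log excerpt starts mid-incident.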
[06:23:51] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988250 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[06:26:29] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089)
[06:27:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:29:35] <wikibugs>	 (PS2) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089)
[06:32:48] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:32:56] <wikibugs>	 (PS1) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:33:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[06:34:22] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:36:22] <wikibugs>	 (PS2) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:36:29] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105 for alter table (duration: 01m 17s)
[06:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:57] <marostegui>	 !log Deploy schema change on db1105:3312 - T187089 T185128 T153182
[06:37:04] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/413102 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:11] <stashbot>	 T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:37:11] <stashbot>	 T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:37:11] <stashbot>	 T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:43:28] <wikibugs>	 (PS3) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:45:48] <wikibugs>	 (PS4) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:47:29] <wikibugs>	 (PS5) Marostegui: mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722)
[06:48:10] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988266 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ```  and were **ALL** successful.
[06:51:27] <wikibugs>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/10061/" [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[06:51:29] <wikibugs>	 (CR) Marostegui: [C: 2] mariadb: Move db2037 from s4 role to m5 [puppet] - https://gerrit.wikimedia.org/r/413103 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[06:55:59] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db2037 as jessie instead [puppet] - https://gerrit.wikimedia.org/r/413106
[06:57:21] <wikibugs>	 (CR) Marostegui: [C: 2] install_server: Reimage db2037 as jessie instead [puppet] - https://gerrit.wikimedia.org/r/413106 (owner: Marostegui)
[06:59:11] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988268 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[07:20:47] <marostegui>	 !log Stop Mariadb on db1108 for kernel upgrade
[07:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:12] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[07:23:31] <marostegui>	 ^ expected
[07:23:31] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[07:25:19] <_joe_>	 oh noes
[07:25:44] <marostegui>	 that doesn't page :)
[07:26:07] <_joe_>	 I find it amazing that you're taking me seriously here :P
[07:26:14] <marostegui>	 xddddd
[07:29:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0
[07:30:19] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0
[07:36:49] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988286 (Marostegui) I am trying to check what's wrong with db2037, as it is showing: ```  [   52.315934] blk_update_request: critical...
[07:37:16] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988287 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ```  Of which those **FAILED**: ``` ['db2037.c...
[07:37:44] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988288 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[07:43:34] <wikibugs>	 Operations, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3988289 (elukey) >>! In T187805#3987587, @Dzahn wrote: > also, site.pp already looks like this, where everything is in a role except burrows being the oddball which should...
[07:45:46] <wikibugs>	 (CR) Matthias Mullie: [C: -1] "-1 until deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie)
[07:52:11] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988292 (Marostegui) a:Marostegui>Papaul And ILO isn't working any more, so the PXE cannot be set. ``` root@neodymium:~# ipmito...
[07:53:13] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988294 (Marostegui) HW logs show nothing by the way
[08:01:49] <wikibugs>	 (PS1) Elukey: role::prometheus::ops: correct Kafka Burrow exporter's port [puppet] - https://gerrit.wikimedia.org/r/413107 (https://phabricator.wikimedia.org/T180442)
[08:02:19] <wikibugs>	 (CR) Elukey: [C: 2] role::prometheus::ops: correct Kafka Burrow exporter's port [puppet] - https://gerrit.wikimedia.org/r/413107 (https://phabricator.wikimedia.org/T180442) (owner: Elukey)
[08:11:07] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988323 (Marostegui) I have managed to get the system up after fixing a few i-nodes:  ``` root@db2037:~# touch test root@db2037:~# ```...
[08:18:00] <wikibugs>	 (PS2) Gilles: Add Thumbor/Mediawiki shared secret [mediawiki-config] - https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144)
[08:18:16] <wikibugs>	 (PS2) Gilles: Serve officewiki thumbnails with Thumbor [mediawiki-config] - https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144)
[08:18:41] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988328 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ```  Of which those **FAILED**: ``` ['db2037.c...
[08:19:39] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988329 (Marostegui) I am unable to reimage the server due to the PXE thing I described at: T187722#3988292 The system looks fine, so f...
[08:20:02] <gilles>	 !log foreachwikiindblist "% private.dblist" extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --backend=local-multiwrite --private 
[08:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:07] <wikibugs>	 (PS1) 20after4: Phabricator: restart apache every sunday night [puppet] - https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790)
[08:32:24] <wikibugs>	 (PS2) Filippo Giunchedi: Add all private wikis to swift::proxy::private_container_list [puppet] - https://gerrit.wikimedia.org/r/412980 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[08:32:31] <wikibugs>	 (PS1) Gilles: Serve private wiki thumbnails with Thumbor [mediawiki-config] - https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144)
[08:33:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] Add all private wikis to swift::proxy::private_container_list [puppet] - https://gerrit.wikimedia.org/r/412980 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[08:35:30] <godog>	 !log roll-restart thumbor in codfw and eqiad to apply https://gerrit.wikimedia.org/r/c/412980
[08:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:31] <icinga-wm>	 PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: /srv 283618 MB (3% inode=93%)
[08:39:58] <elukey>	 I am working on --^
[08:41:49] <wikibugs>	 (PS1) Muehlenhoff: Record extended MOU for flemmerich [puppet] - https://gerrit.wikimedia.org/r/413116
[08:45:16] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Record extended MOU for flemmerich [puppet] - https://gerrit.wikimedia.org/r/413116 (owner: Muehlenhoff)
[08:56:53] <wikibugs>	 Operations, Analytics-Kanban, Patch-For-Review, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3988353 (elukey) Before proceeding, https://phabricator.wikimedia.org/T187022 needs to be closed.
[08:57:29] <wikibugs>	 Operations, ops-eqiad, DC-Ops, hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#3988354 (fgiunchedi)
[08:59:17] <wikibugs>	 (PS2) Muehlenhoff: Add repository component for tor on stretch [puppet] - https://gerrit.wikimedia.org/r/410910
[09:07:47] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] Add repository component for tor on stretch [puppet] - https://gerrit.wikimedia.org/r/410910 (owner: Muehlenhoff)
[09:11:26] <wikibugs>	 (PS7) Filippo Giunchedi: prometheus: add check prometheus metric script [puppet] - https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410)
[09:11:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] prometheus: add check prometheus metric script [puppet] - https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) (owner: Filippo Giunchedi)
[09:18:02] <wikibugs>	 (CR) Gehel: [C: -1] wdqs: allow configuration of kafka based updates (4 comments) [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: Gehel)
[09:18:35] <wikibugs>	 (PS7) Gehel: wdqs: allow configuration of kafka based updates [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951)
[09:20:36] <wikibugs>	 (PS1) Marostegui: m5.hosts: Add db2037 [software] - https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722)
[09:21:53] <wikibugs>	 (PS2) Marostegui: m5.hosts: Add db2037 [software] - https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722)
[09:22:52] <wikibugs>	 (CR) Elukey: wdqs: allow configuration of kafka based updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: Gehel)
[09:23:05] <wikibugs>	 (PS7) Filippo Giunchedi: cassandra: create parent data directories with exec [puppet] - https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: Eevans)
[09:23:56] <wikibugs>	 (CR) Marostegui: [C: 2] m5.hosts: Add db2037 [software] - https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[09:23:58] <moritzm>	 !log installing sqlite security updates on stretch
[09:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:03] <wikibugs>	 (Merged) jenkins-bot: m5.hosts: Add db2037 [software] - https://gerrit.wikimedia.org/r/413120 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[09:32:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] cassandra: create parent data directories with exec [puppet] - https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: Eevans)
[09:37:55] <wikibugs>	 (PS1) Filippo Giunchedi: cassandra: create data directories only when needed [puppet] - https://gerrit.wikimedia.org/r/413121 (https://phabricator.wikimedia.org/T175284)
[09:40:01] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] cassandra: create data directories only when needed [puppet] - https://gerrit.wikimedia.org/r/413121 (https://phabricator.wikimedia.org/T175284) (owner: Filippo Giunchedi)
[09:41:35] <wikibugs>	 Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988386 (jcrespo)
[09:44:57] <brion>	 apergos: hey do you know offhand if there's a handy archive of the mobile app files that used to be on dumps.wikimedia.org (or possibly download.wikimedia.org)? would've been in dirs like /android and /ios
[09:45:14] <brion>	 i had someone looking for a copy of an old app for testing, just curious if we still have them floating around :D
[09:45:24] <apergos>	 I've probably cleaned those up but let me just double check
[09:45:27] <brion>	 thx
[09:45:44] <brion>	 i'm not sure if the one they're looking for would've even been there though (blackberry tablet one)
[09:46:00] <apergos>	 errr that sounds unlikely but let me see what's still around
[09:46:04] <brion>	 if not i'll check my old backups at home when i get back :D
[09:46:26] <moritzm>	 !log installing dbus updates from stretch point release
[09:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:50] <apergos>	 brion: give me a year please? :-D
[09:46:54] <brion>	 hehe
[09:46:58] <brion>	 no rush ;)
[09:47:03] <apergos>	 no, seriously
[09:47:09] <apergos>	 I've got a bunch of stuff from 2012
[09:47:11] <brion>	 oh for source?
[09:47:13] <brion>	 yeah 2012 or 2013 
[09:47:14] <apergos>	 yeah!
[09:47:33] <brion>	 jeebus it's 2018 now? ffs
[09:47:58] <apergos>	 lolol
[09:48:14] <brion>	 i thought time ended in 2012. mayans and all
[09:48:19] <apergos>	 so give me an idea of some string in the name in the blackberry app
[09:48:26] <brion>	 would end in ".bar"
[09:48:33] <brion>	 (i assume it stood for Blackberry ARchive)
[09:48:51] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for dbus [puppet] - https://gerrit.wikimedia.org/r/413124
[09:49:09] <apergos>	 lots of nice apk and ipa but no bar files
[09:49:22] <apergos>	 any idea of another directory you might have put them under?
[09:49:46] <brion>	 if it's not under /blackberry or /playbook it's probably not there
[09:50:38] <apergos>	 ariel@dataset1001:/data/xmldatadumps/public/other$ ls -l PlayBook/
[09:50:38] <apergos>	 total 1900
[09:50:38] <apergos>	 -rw-r--r-- 1 dumpsgen wikidev 1943346 Dec  7  2012 Wikipedia-v1.3.3.bar
[09:50:41] <apergos>	 ta dah!!
[09:50:43] <brion>	 aweeeeeesome
[09:50:45] <brion>	 thanks :D
[09:50:49] <apergos>	 man we just never throw anything out do we
[09:50:54] <brion>	 lol
[09:50:58] <apergos>	 so they can direct download that sucka
[09:51:18] <brion>	 woot
[09:51:20] <brion>	 thanks apergos :D
[09:51:23] <apergos>	 yw!
[09:51:30] <apergos>	 thanks for the trip down memory lane
[09:51:31] <brion>	 i shoulda known the dir was in CamelCase ;)
[09:51:39] <apergos>	 hehe
[09:52:10] <apergos>	 now I just gotta know
[09:52:24] <apergos>	 is someone seriously still using an old blackberry that this is gonna run on? 
[09:52:49] <apergos>	 this ancient 2012 edition missing all the latest greatest knowledge
[09:53:36] <brion>	 there are some die-hard BlackBerry 10 users still ;)
[09:53:43] <brion>	 but that version might not install on them
[09:53:55] <brion>	 since it was for the slightly earlier tablet
[09:54:00] <brion>	 we'll see ;)
[09:54:03] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add library hint for dbus [puppet] - https://gerrit.wikimedia.org/r/413124 (owner: Muehlenhoff)
[09:55:32] <apergos>	 lolol  well better get working on that, stat!
[10:00:04] <jouncebot>	 kart_: It is that lovely time of the day again! You are hereby commanded to deploy Compact Language Links: Dry run for preference migration script. (T187677). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1000).
[10:00:04] <jouncebot>	 aharoni: A patch you scheduled for Compact Language Links: Dry run for preference migration script. (T187677) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:04] <stashbot>	 T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677
[10:00:33] <kart_>	 aharoni: around?
[10:01:13] <aharoni>	 around
[10:01:23] <kart_>	 !log Running CLL preference migration script dry-run on terbium (T187677)
[10:01:28] <kart_>	 aharoni: cool.
[10:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:45] <kart_>	 good stash bot
[10:02:14] <wikibugs>	 (PS2) Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470)
[10:02:16] <wikibugs>	 (PS2) Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470)
[10:02:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:02:48] <kart_>	 aharoni: I'm logging output in file too.
[10:03:28] <wikibugs>	 (CR) Phuedx: [C: 1] Disable Page Previews EventLogging instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: Pmiazga)
[10:04:26] <aharoni>	 kart_: good.
[10:05:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 38.80, 35.29, 32.19
[10:06:37] <kart_>	 hmm. ^^ hope not related to script running on terbium.
[10:07:13] <godog>	 kart_: unlikely, it is a recurring problem
[10:07:18] <kart_>	 OK!
[10:07:24] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db2037 from s4 [puppet] - https://gerrit.wikimedia.org/r/413128 (https://phabricator.wikimedia.org/T187722)
[10:07:31] <marostegui>	 jynus: ^
[10:09:55] <moritzm>	 !log installing openssh bugfix updates from jessie/stretch point releases
[10:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:42] <wikibugs>	 (CR) Jcrespo: [C: 1] site.pp: Remove db2037 from s4 [puppet] - https://gerrit.wikimedia.org/r/413128 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[10:10:54] <wikibugs>	 (CR) Marostegui: [C: 2] site.pp: Remove db2037 from s4 [puppet] - https://gerrit.wikimedia.org/r/413128 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[10:13:23] <wikibugs>	 Operations, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3988447 (akosiaris) What amount of resources (CPU, mem, disk) are we talking about ? From https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-s...
[10:13:55] <wikibugs>	 (PS3) Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470)
[10:13:57] <wikibugs>	 (PS3) Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470)
[10:14:23] <wikibugs>	 (CR) Marostegui: "This looks good to me. Only comment, the commit message says db1111 and db1111 :-)" [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:14:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:15:37] <wikibugs>	 (CR) Marostegui: "db1111 and db1112 have notifications disabled manually on icinga btw" [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:15:50] <icinga-wm>	 RECOVERY - Disk space on stat1005 is OK: DISK OK
[10:17:17] <wikibugs>	 (CR) Jcrespo: "Yes, the roles should do that without hiera, I will do that on a separate commit. I would also put them on its own "shard" on monitoring. " [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:19:11] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:19:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:21:24] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988479 (Marostegui) m5 is now replicating on db2037. I will leave notifications disable till we do the tests with @Papaul when he has...
[10:26:19] <marostegui>	 !log Remove db2030 from tendril - T187768
[10:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:35] <stashbot>	 T187768: Decommission db2030 - https://phabricator.wikimedia.org/T187768
[10:28:39] <wikibugs>	 (PS1) Marostegui: dbproxy1005: Replace db2030 with db2037 [puppet] - https://gerrit.wikimedia.org/r/413129 (https://phabricator.wikimedia.org/T187722)
[10:30:34] <wikibugs>	 (PS1) Jcrespo: Change m5-slave to db2037 [dns] - https://gerrit.wikimedia.org/r/413130 (https://phabricator.wikimedia.org/T187722)
[10:30:50] <wikibugs>	 (CR) Marostegui: [C: 1] Change m5-slave to db2037 [dns] - https://gerrit.wikimedia.org/r/413130 (https://phabricator.wikimedia.org/T187722) (owner: Jcrespo)
[10:31:40] <wikibugs>	 (CR) Jcrespo: [C: 2] Change m5-slave to db2037 [dns] - https://gerrit.wikimedia.org/r/413130 (https://phabricator.wikimedia.org/T187722) (owner: Jcrespo)
[10:32:03] <wikibugs>	 (CR) Jcrespo: [C: 1] dbproxy1005: Replace db2030 with db2037 [puppet] - https://gerrit.wikimedia.org/r/413129 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[10:32:10] <wikibugs>	 (CR) Marostegui: [C: 2] dbproxy1005: Replace db2030 with db2037 [puppet] - https://gerrit.wikimedia.org/r/413129 (https://phabricator.wikimedia.org/T187722) (owner: Marostegui)
[10:33:11] <marostegui>	 !log Reload haproxy on dbproxy1005 - T187722
[10:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:24] <stashbot>	 T187722: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722
[10:33:41] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 38.36, 34.73, 32.11
[10:34:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0
[10:38:10] <wikibugs>	 Operations, ops-codfw, DBA, hardware-requests, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3988496 (Marostegui) a:Marostegui>RobH Assigning it directly to @robh so he can finish up with this (please let me know if you prefer another way of letting...
[10:40:52] <kart_>	 !log Finished running CLL preference migration script dry-run on terbium (T187677)
[10:41:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:08] <stashbot>	 T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677
[10:41:11] <icinga-wm>	 PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid]
[10:43:04] <wikibugs>	 Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988511 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on sarin.codfw.wmnet for hosts: ``` ['db2044.codfw.wmnet'] ``` The log can...
[10:43:13] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:43:17] <wikibugs>	 Operations, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3988524 (elukey) >>! In T187805#3988447, @akosiaris wrote: > What amount of resources (CPU, mem, disk) are we talking about ? From https://grafana.wikimedia.org/dashboard/...
[10:43:21] <wikibugs>	 (PS4) Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470)
[10:43:50] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.93, 33.06, 32.11
[10:43:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:44:30] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[10:59:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 50805.68 seconds
[11:01:10] <marostegui>	 that is an expired downtime
[11:01:20] <marostegui>	 from the alter table that got started yesterday
[11:01:27] <marostegui>	 dbstore2001 will complain too I guess
[11:01:32] <marostegui>	 I am going to downtime them now again
[11:06:14] <icinga-wm>	 RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:10:45] <wikibugs>	 Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988605 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2044.codfw.wmnet'] ```  and were **ALL** successful.
[11:11:44] <icinga-wm>	 PROBLEM - puppet last run on lvs5002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:12:54] <jynus>	 !log cloning db2011 to db2044
[11:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:07] <jynus>	 I think m2 proxies will complain now
[11:16:04] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[11:17:04] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[11:17:14] <moritzm>	 !log installing db5.3 security updates
[11:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:54] <wikibugs>	 (PS4) Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470)
[11:23:26] <wikibugs>	 (CR) Jcrespo: [C: 1] "We finally have the replica back up and a recent backup." [puppet] - https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[11:24:29] <wikibugs>	 (CR) Marostegui: [C: -1] "Let's wait until we have confirmation from papaul that everything looks good with the new replica, on his end before. Should happen today " [puppet] - https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[11:32:19] <wikibugs>	 (CR) Jcrespo: [C: 2] dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[11:32:21] <wikibugs>	 (CR) Ema: [C: 2] etcd: Introduce reconnectTimeout [debs/pybal] - https://gerrit.wikimedia.org/r/411264 (https://phabricator.wikimedia.org/T169765) (owner: Ema)
[11:36:15] <wikibugs>	 (PS1) Ema: etcd: Introduce reconnectTimeout [debs/pybal] (1.14) - https://gerrit.wikimedia.org/r/413141 (https://phabricator.wikimedia.org/T169765)
[11:37:55] <wikibugs>	 (CR) Ema: [C: 2] etcd: Introduce reconnectTimeout [debs/pybal] (1.14) - https://gerrit.wikimedia.org/r/413141 (https://phabricator.wikimedia.org/T169765) (owner: Ema)
[11:40:00] <wikibugs>	 (PS1) Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410)
[11:41:44] <icinga-wm>	 RECOVERY - puppet last run on lvs5002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:41:50] <wikibugs>	 (PS1) Elukey: profile::hadoop::master|worker: relax JVM heap size monitors [puppet] - https://gerrit.wikimedia.org/r/413143
[11:42:30] <wikibugs>	 (CR) Elukey: [C: 2] profile::hadoop::master|worker: relax JVM heap size monitors [puppet] - https://gerrit.wikimedia.org/r/413143 (owner: Elukey)
[11:44:24] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 7.10, 11.90, 23.20
[11:44:25] <wikibugs>	 (PS1) Ema: 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] - https://gerrit.wikimedia.org/r/413145 (https://phabricator.wikimedia.org/T169765)
[11:46:13] <wikibugs>	 (CR) Ema: [C: 2] 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] - https://gerrit.wikimedia.org/r/413145 (https://phabricator.wikimedia.org/T169765) (owner: Ema)
[11:46:29] <wikibugs>	 (PS1) Ema: 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] (1.14) - https://gerrit.wikimedia.org/r/413146 (https://phabricator.wikimedia.org/T169765)
[11:46:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 34.70, 34.45, 32.06
[11:47:33] <wikibugs>	 (CR) Ema: [C: 2] 1.14.4: Introduce etcd reconnectTimeout [debs/pybal] (1.14) - https://gerrit.wikimedia.org/r/413146 (https://phabricator.wikimedia.org/T169765) (owner: Ema)
[11:48:44] <wikibugs>	 (PS1) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870)
[11:49:32] <Urbanecm>	 Hi, is anybody with deploy privs available? I need to deploy last-minute throttle exception rule, see T187870.
[11:49:32] <stashbot>	 T187870: Lift IP account limit on 2018-02-21 - https://phabricator.wikimedia.org/T187870
[11:50:19] <Urbanecm>	 zeljkof, hashar, twentyafterfour, no_justification ^^^
[11:51:24] <zeljkof>	 Urbanecm: I'm around
[11:51:40] <zeljkof>	 looks like that can be deployed during eu swat, right?
[11:52:00] <wikibugs>	 (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10065/" [puppet] - https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: Filippo Giunchedi)
[11:52:06] <zeljkof>	 the event is in 4 hours, swat is in 2
[11:52:10] <Urbanecm>	 Theoretically, but it is almost full and I won't be available (travelling). 
[11:52:41] <zeljkof>	 just add it to the top of the calendar, I'll make sure to deploy it
[11:52:51] <Urbanecm>	 Ok, that's great. Thank you!
[11:53:09] <zeljkof>	 Urbanecm: no problem! :)
[11:58:28] <wikibugs>	 (PS1) Elukey: profile::hadoop::master|worker: tune again JVM Heap size monitors [puppet] - https://gerrit.wikimedia.org/r/413148
[12:01:15] <wikibugs>	 (CR) Elukey: [C: 2] profile::hadoop::master|worker: tune again JVM Heap size monitors [puppet] - https://gerrit.wikimedia.org/r/413148 (owner: Elukey)
[12:01:19] <ema>	 !log pybal 1.14.4 uploaded to apt.w.o
[12:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:08] <ema>	 !log lvs5003: pybal upgraded to 1.14.4
[12:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:23] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 35.55, 33.44, 32.05
[12:10:03] <moritzm>	 !log uploading retpoline-enabled gcc-4.9 to apt.wikimedia.org / jessie-wikimedia to be able to use it on boron for building Linux (trying to adapt our pbuilder setup to also include security.debian.org ran into a few proxy-related problems and this is really a rare corner case anyway)
[12:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:29] <wikibugs>	 (Abandoned) Alexandros Kosiaris: Revert "Revert "Use security mirrors in cowbuilder apt config"" [puppet] - https://gerrit.wikimedia.org/r/412930 (owner: Alexandros Kosiaris)
[12:14:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 35.54, 33.31, 32.25
[12:17:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 42.12, 41.50, 40.95
[12:21:25] <elukey>	 !log restart hhvm on mw1227 - high load, hhvm-dump-debug in /home/elukey/hhvm.23382.bt
[12:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:32] <wikibugs>	 (PS10) Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[12:23:35] <arturo>	 chasemp: ^^^
[12:24:29] <wikibugs>	 (PS11) Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[12:26:44] <elukey>	 !log restart hhvm on mw1231 - high load, hhvm-dump-debug in /home/elukey/hhvm.6759.bt
[12:26:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:50] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.05, 11.84, 23.69
[12:29:55] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321)
[12:32:13] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui)
[12:33:48] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui)
[12:35:09] <wikibugs>	 (CR) Volans: [C: -1] "Nice! I think there is a typo, see inline also for a couple of other comments." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: Filippo Giunchedi)
[12:35:25] <wikibugs>	 (PS1) Vgutierrez: Improve reactor mocking [debs/pybal] - https://gerrit.wikimedia.org/r/413154 (https://phabricator.wikimedia.org/T169765)
[12:35:34] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Clarify that db1067 is now s1 candidate master - T186321 (duration: 01m 13s)
[12:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:47] <stashbot>	 T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321
[12:36:20] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1348 is CRITICAL: CRITICAL - load average: 112.25, 63.01, 50.56
[12:36:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 90.51, 42.44, 30.75
[12:36:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 94.34, 48.23, 37.01
[12:36:25] <_joe_>	 uh
[12:36:41] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 118.93, 59.57, 44.09
[12:37:04] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: db1067 is now candidate master in s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/413153 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui)
[12:37:50] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:10] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:11] <icinga-wm>	 PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:30] <icinga-wm>	 PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:30] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:30] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:33] <elukey>	 !log restart hhvm on mw1234 - high load
[12:38:41] <icinga-wm>	 PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:21] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.031 second response time
[12:39:40] <icinga-wm>	 RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 74194 bytes in 2.195 second response time
[12:40:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:41:01] <icinga-wm>	 RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time
[12:41:10] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.98, 11.36, 23.16
[12:42:50] <icinga-wm>	 PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:43:40] <_joe_>	 !log rolling restart of hhvm on api servers under high load
[12:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:40] <icinga-wm>	 RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 74068 bytes in 0.340 second response time
[12:46:01] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.083 second response time
[12:46:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.026 second response time
[12:46:40] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 7.74, 14.05, 23.67
[12:48:30] <icinga-wm>	 RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time
[12:48:30] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time
[12:48:30] <icinga-wm>	 RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74068 bytes in 0.122 second response time
[12:50:20] <icinga-wm>	 RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.057 second response time
[12:50:20] <icinga-wm>	 RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 74068 bytes in 0.352 second response time
[12:50:50] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.068 second response time
[12:57:40] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 12.51, 17.44, 29.30
[12:58:35] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 11.18, 15.26, 29.46
[13:02:48] <wikibugs>	 (PS12) Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[13:02:56] <wikibugs>	 (CR) Rush: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: BryanDavis)
[13:09:35] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1348 is OK: OK - load average: 13.29, 14.64, 35.30
[13:13:36] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 10.51, 12.24, 28.88
[13:14:56] <wikibugs>	 Operations: Integrate stretch 9.3 point update - https://phabricator.wikimedia.org/T182655#3989003 (MoritzMuehlenhoff) Open>Resolved This is completely rolled out.
[13:15:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 12.85, 13.41, 35.99
[13:17:37] <wikibugs>	 (PS3) Muehlenhoff: Add repository component for tor on stretch [puppet] - https://gerrit.wikimedia.org/r/410910
[13:23:15] <icinga-wm>	 PROBLEM - Host boron is DOWN: PING CRITICAL - Packet loss = 18%, RTA = 6286.53 ms
[13:23:15] <icinga-wm>	 PROBLEM - Host ununpentium is DOWN: PING CRITICAL - Packet loss = 100%
[13:23:57] <icinga-wm>	 PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:23:57] <icinga-wm>	 PROBLEM - Host meitnerium is DOWN: PING CRITICAL - Packet loss = 100%
[13:23:57] <icinga-wm>	 PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:05] <icinga-wm>	 PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:15] <icinga-wm>	 PROBLEM - Host kubestagetcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:16] <icinga-wm>	 PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:31] <moritzm>	 and again
[13:24:41] <moritzm>	 1007 this time
[13:24:55] <icinga-wm>	 PROBLEM - SSH on ganeti1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:26:22] <moritzm>	 !log powercycling ganeti1007
[13:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:25] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 1794991 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:27:45] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - apiserver_request_latencies is 13238562 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:27:45] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 2545360 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:27:55] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 17285176 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:28:00] <marostegui>	 !log Reboot db2092 for a kernel upgrade
[13:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:55] <icinga-wm>	 PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100%
[13:29:45] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[13:29:56] <icinga-wm>	 RECOVERY - SSH on ganeti1007 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[13:30:05] <icinga-wm>	 RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:30:25] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 1996 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:30:45] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time
[13:30:45] <icinga-wm>	 RECOVERY - Host ununpentium is UP: PING OK - Packet loss = 0%, RTA = 7.83 ms
[13:30:45] <icinga-wm>	 RECOVERY - Host boron is UP: PING OK - Packet loss = 0%, RTA = 7.20 ms
[13:30:45] <icinga-wm>	 RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 9.04 ms
[13:30:46] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 1657 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:30:55] <icinga-wm>	 RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 7.45 ms
[13:30:55] <icinga-wm>	 RECOVERY - Host etcd1004 is UP: PING OK - Packet loss = 0%, RTA = 8.70 ms
[13:31:05] <icinga-wm>	 RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 7.63 ms
[13:31:05] <icinga-wm>	 RECOVERY - Host meitnerium is UP: PING OK - Packet loss = 0%, RTA = 9.41 ms
[13:31:15] <icinga-wm>	 RECOVERY - Host kubestagetcd1003 is UP: PING OK - Packet loss = 0%, RTA = 5.57 ms
[13:32:12] <wikibugs>	 Operations, ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3989044 (MoritzMuehlenhoff) Happened again on ganeti1007, again with page allocation errors.
[13:33:45] <wikibugs>	 (PS2) Jdrewniak: Removing Mobile beta feedback link [mediawiki-config] - https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712)
[13:34:15] <icinga-wm>	 PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:34:46] <icinga-wm>	 PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:15] <icinga-wm>	 RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[13:36:45] <icinga-wm>	 RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[13:37:36] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0
[13:37:41] <jynus>	 haproxies should recover now
[13:38:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0
[13:38:15] <icinga-wm>	 PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:38:36] <wikibugs>	 (PS4) Addshore: WIP DNM role and profile for wdcm dashboards [puppet] - https://gerrit.wikimedia.org/r/387211
[13:39:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP DNM role and profile for wdcm dashboards [puppet] - https://gerrit.wikimedia.org/r/387211 (owner: Addshore)
[13:39:46] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: OK - apiserver_request_latencies is 34755 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:40:05] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: OK - etcd_request_latencies is 24151 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:44:23] <wikibugs>	 (PS1) Jcrespo: mariadb: Reenable notifications on db2044 (m2) [puppet] - https://gerrit.wikimedia.org/r/413159 (https://phabricator.wikimedia.org/T183470)
[13:45:19] <wikibugs>	 (CR) Ema: [V: 2 C: 2] "Nice, that's a much better approach. Thanks!" [debs/pybal] - https://gerrit.wikimedia.org/r/413154 (https://phabricator.wikimedia.org/T169765) (owner: Vgutierrez)
[13:45:55] <icinga-wm>	 PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:48:05] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 4871667 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:50:06] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: OK - etcd_request_latencies is 4992 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:51:55] <icinga-wm>	 RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[13:52:31] <wikibugs>	 (PS1) Rush: openstack: labs-instance-transport1-b-codfw designations [dns] - https://gerrit.wikimedia.org/r/413160 (https://phabricator.wikimedia.org/T184209)
[13:53:11] <wikibugs>	 Operations, Traffic, Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3989077 (BBlack) Resolved>Open TL;DR - Current solution is a fixed 2s load->use delay.  I think we should probably do more here at this point, especially in light of eq...
[13:53:55] <wikibugs>	 Operations, Traffic: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3989081 (BBlack) I was the one arguing for cron, on I think the faulty assumption that a VCL had to go `cold` before it could be `discard`ed.  However, apparently that's not the case.  You can `discard` a `warm` VC...
[13:55:57] <wikibugs>	 (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) (owner: Urbanecm)
[13:56:37] <_joe_>	 isn't swat in 10 minutes?
[13:56:41] <_joe_>	 err, 5
[13:56:47] <zeljkof>	 _joe_: yes, preparing
[13:56:48] <_joe_>	 jouncebot: next
[13:56:48] <jouncebot>	 In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1400)
[13:56:56] <_joe_>	 ahah ok
[13:57:05] <zeljkof>	 did not start the deployment yet, reviewing and merging the first commit
[13:57:07] <_joe_>	 I thought I missed it by 11 hour
[13:57:16] <_joe_>	 1* hour even
[13:57:31] <wikibugs>	 (Merged) jenkins-bot: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) (owner: Urbanecm)
[13:57:36] <_joe_>	 zeljkof: yeah I thought I went to lunch when I needed to be here for SWAT
[13:57:41] <wikibugs>	 (CR) jenkins-bot: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413147 (https://phabricator.wikimedia.org/T187870) (owner: Urbanecm)
[13:57:53] <zeljkof>	 _joe_: all is good, you are back in time :)
[14:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1400).
[14:00:04] <jouncebot>	 Urbanecm, gilles, raynor, _joe_, and Jhs: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:05] <icinga-wm>	 PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:00:12] <gilles>	 o/
[14:00:14] <wikibugs>	 Operations, Traffic, Wikimedia-Incident: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#3989098 (BBlack) Should we do something here?  The same crash can exist at remote DCs as well (the frontends would crash if all...
[14:00:23] <zeljkof>	 I can SWAT today
[14:00:26] <Jhs>	 i'm here :)
[14:00:31] <raynor>	 o/
[14:00:45] <zeljkof>	 starting with 413147 by Urbanecm, it should take just a minute
[14:02:05] <icinga-wm>	 RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[14:02:26] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:413147|Add new throttle rule (T187870)]] (duration: 01m 13s)
[14:02:29] <zeljkof>	 gilles, raynor, _joe_, and Jhs: people that are able and willing to deploy their patches have priority - so, do you want to deploy your patch(es)? ;)
[14:02:38] <zeljkof>	 (if you can)
[14:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:41] <stashbot>	 T187870: Lift IP account limit on 2018-02-21 - https://phabricator.wikimedia.org/T187870
[14:02:55] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 104.80, 53.15, 40.17
[14:03:26] <Jhs>	 zeljkof, you mean, do it ourselves? I can't
[14:03:26] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: CRITICAL - load average: 82.41, 47.14, 37.48
[14:03:39] <raynor>	 I also can't
[14:04:06] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 96.62, 53.94, 38.96
[14:04:10] <gilles>	 zeljkof: I've never deployed a config patch before. I believe I have the rights for it, though
[14:04:25] <gilles>	 I'd be happy to learn
[14:04:26] <_joe_>	 zeljkof: I do deploy config patches, but I'll let others do their own
[14:04:34] <_joe_>	 and wait in queue
[14:04:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:52] <_joe_>	 uhm wait a sec
[14:04:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:58] <zeljkof>	 _joe_: you can go first, if you want to, I need some time to review other patches
[14:05:03] <_joe_>	 zeljkof: wait
[14:05:06] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:06] <icinga-wm>	 PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:07] <_joe_>	 see alerts
[14:05:10] <zeljkof>	 something going on?
[14:05:15] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:18] <_joe_>	 seems so
[14:05:19] <zeljkof>	 hm
[14:05:21] <_joe_>	 let me check
[14:05:35] <icinga-wm>	 PROBLEM - HHVM rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:37] <zeljkof>	 the first patch (throttle rule) deployed fine
[14:05:59] <zeljkof>	 gilles: this is all I know :) https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers
[14:06:04] <_joe_>	 zeljkof: I think the problem is exacerbated by deployments
[14:06:12] <_joe_>	 so hang on a sec
[14:06:16] <zeljkof>	 _joe_: sure
[14:06:55] <_joe_>	 !log restarting hhvm on misbehaving api appservers
[14:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:32] <gilles>	 zeljkof: ok, I'm willing to give those instructions a try, when you guys give me the green light
[14:08:06] <icinga-wm>	 PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:08:32] <_joe_>	 gilles: prepare yourself in the meanwhile
[14:08:53] <zeljkof>	 gilles: do you do other deployments? this should be pretty much the same
[14:09:07] <gilles>	 I am, I'll +2 the config change since that's time-consuming
[14:09:17] <gilles>	 zeljkof: if mine goes well, sure
[14:09:41] <wikibugs>	 (CR) Gilles: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[14:10:05] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time
[14:10:05] <icinga-wm>	 RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74115 bytes in 1.190 second response time
[14:10:07] <zeljkof>	 gilles: what I wanted to say, deploying config change should not be much different than deploying other things, if you have done that
[14:10:20] <gilles>	 I haven't done that either
[14:10:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.068 second response time
[14:10:38] <zeljkof>	 gilles: I can deploy the rest of the changes, but feel free to take over swat if you want to practice :)
[14:10:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time
[14:11:06] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time
[14:11:10] <wikibugs>	 (Merged) jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[14:11:20] <wikibugs>	 (CR) jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[14:11:26] <icinga-wm>	 RECOVERY - HHVM rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 74114 bytes in 0.419 second response time
[14:11:41] <_joe_>	 gilles: you may proceed
[14:12:15] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1339 is OK: OK - load average: 16.26, 31.03, 35.49
[14:12:37] <wikibugs>	 (PS5) Addshore: Role and profile for wdcm dashboards [puppet] - https://gerrit.wikimedia.org/r/387211
[14:13:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] Role and profile for wdcm dashboards [puppet] - https://gerrit.wikimedia.org/r/387211 (owner: Addshore)
[14:14:24] <gilles>	 is it normal for "scap pull" on mwdebug1002 to hang after "14:12:28 Finished rsync common (duration: 00m 03s)"?
[14:14:38] <zeljkof>	 gilles: it takes a few minutes the first time you do it
[14:14:42] <gilles>	 ok
[14:14:51] <zeljkof>	 after that, it takes seconds
[14:16:56] <icinga-wm>	 RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[14:17:34] <gilles>	 hmmm, seeing a fatal in my tests, but can't see how it's related to my config change
[14:17:37] <gilles>	 require_once(/srv/mediawiki/docroot/wikimedia.org/w/../multiversion/MWMultiVersion.php): File not found
[14:17:50] <gilles>	 (when browsing with the debug header)
[14:18:04] <gilles>	 the scap pull had a bunch of these:
[14:18:04] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.17/cache/l10n
[14:18:05] <zeljkof>	 uh oh
[14:18:18] <gilles>	 is that expected? or something I should clean up?
[14:18:23] <zeljkof>	 l10n messages should be fine
[14:18:55] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.17/cache/l10n
[14:18:55] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.17/cache/l10n
[14:18:55] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.17/cache
[14:18:57] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.17/cache
[14:18:59] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.17
[14:19:01] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.20/cache/l10n
[14:19:03] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.20/cache/l10n
[14:19:05] <gilles>	 cannot delete non-empty directory: php-1.31.0-wmf.20/cache
[14:19:06] <zeljkof>	 looking at it at mwdebug1002
[14:19:09] <gilles>	 that's the full list
[14:19:19] <_joe_>	 gilles: mwdebug1001?
[14:19:25] <gilles>	 mwdebug1002
[14:19:26] <zeljkof>	 that happens sometimes, I think it's a scap regression
[14:19:43] <zeljkof>	 that used to happen a while ago, was fixed, but apparently happens again
[14:19:50] <zeljkof>	 should not cause any problems
[14:20:22] <zeljkof>	 I can see the error in logstash
[14:20:36] <zeljkof>	 https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002
[14:20:45] <zeljkof>	 Fatal error: require_once(/srv/mediawiki/docroot/wikimedia.org/w/../multiversion/MWMultiVersion.php): File not found in /srv/mediawiki/docroot/wikimedia.org/w/thumb.php on line 2
[14:20:57] <_joe_>	 wth?
[14:21:01] <wikibugs>	 (CR) MarkTraceur: [C: 1] Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie)
[14:21:25] <_joe_>	 oh, why is thumb.php called on an appserver?
[14:21:33] <zeljkof>	 gilles: I would suggest reverting
[14:21:35] <_joe_>	 I would very much expect it not to work, indeed
[14:21:46] <zeljkof>	 _joe_: where is that?
[14:22:06] <_joe_>	 "/srv/mediawiki/docroot/wikimedia.org/w/thumb.php"
[14:22:11] <_joe_>	 in your paste
[14:22:17] <gilles>	 _joe_: I'm working on proxying that to thumbor
[14:22:19] <zeljkof>	 ah :) did not notice
[14:22:22] <gilles>	 that's the whole point of the change
[14:22:45] <_joe_>	 gilles: yeah, I don't expect it would work if called on mwdebug1002, tbh, but maybe I'm wrong
[14:22:52] <_joe_>	 gilles: which url are you testing?
[14:22:59] <gilles>	 https://commons.wikimedia.org/w/thumb.php?f=Broom%20icon.svg&w=501
[14:23:03] <gilles>	 or https://commons.wikimedia.beta.wmflabs.org/w/thumb.php?f=Victoria_memorial.jpg&w=300
[14:23:12] <_joe_>	 I guess the former
[14:23:17] <gilles>	 I should have tried with mwdebug1002 before applying the change, for sure
[14:23:18] <_joe_>	 the latter is on the beta cluster
[14:23:32] <_joe_>	 gilles: you can try now, with apache-fast-test on tin
[14:23:41] <gilles>	 how?
[14:24:03] <_joe_>	 echo 'https://commons.wikimedia.org/w/thumb.php?f=Broom%20icon.svg&w=501' > test_thumb
[14:24:21] <_joe_>	 apache-fast-test test_thumb mwdebug1001 mwdebug1002
[14:24:45] <_joe_>	 ok, maybe try with a normal appserver
[14:24:49] <_joe_>	 like mw1261
[14:25:07] <_joe_>	 you will see that request returns 200 OK on mw1261, and 500 on mwdebug1002
[14:25:14] <_joe_>	 so yeah, I'll suggest reverting
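The check _joe_ describes above (send the same URL to several backends and compare the status codes) can be sketched roughly as follows. This is only an illustration of the idea, not how apache-fast-test is implemented: the function names `fetch_status`, `compare_backends` and `disagreeing` are hypothetical, and it assumes each backend answers plain HTTP on port 80 and that the short hostnames resolve from wherever the script runs.

```python
import http.client
from typing import Callable, Dict, Iterable, List


def fetch_status(backend: str, site: str, path: str, timeout: float = 10.0) -> int:
    """GET `path` from `backend` directly, presenting `site` as the Host header."""
    # Assumption: the backend speaks plain HTTP on port 80.
    conn = http.client.HTTPConnection(backend, 80, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": site})
        return conn.getresponse().status
    finally:
        conn.close()


def compare_backends(
    site: str,
    path: str,
    backends: Iterable[str],
    fetch: Callable[[str, str, str], int] = fetch_status,
) -> Dict[str, int]:
    """Request the same URL from every backend and collect the status codes."""
    return {backend: fetch(backend, site, path) for backend in backends}


def disagreeing(statuses: Dict[str, int], reference: str) -> List[str]:
    """List the backends whose status differs from the reference backend's."""
    expected = statuses[reference]
    return sorted(b for b, status in statuses.items() if status != expected)
```

With mw1261 as the reference and the results reported in the channel (200 on mw1261, 500 on mwdebug1002), `disagreeing` would return only mwdebug1002. The `fetch` parameter is there so the comparison can be exercised without touching real servers.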
[14:25:26] <gilles>	 well in practice those will still go to imagescalers
[14:25:47] <zeljkof>	 _joe_, gilles: we have other patches to deploy, I would suggest requesting a deploy window for this, so you have more time to debug
[14:25:57] <zeljkof>	 and revert now
[14:26:04] <_joe_>	 +1
[14:26:07] <gilles>	 sure
[14:26:15] <icinga-wm>	 PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:26:25] <_joe_>	 I'm not even sure that's the problem we're seeing, btw
[14:26:25] <zeljkof>	 gilles: sorry for such a bad first-deploy experience :)
[14:26:46] <zeljkof>	 _joe_: want to go next?
[14:27:01] <gilles>	 so, make a revert commit or just revert it on the deployment machine?
[14:27:06] <icinga-wm>	 RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[14:27:09] <_joe_>	 gilles: revert commit
[14:27:19] <_joe_>	 zeljkof: once this is sorted out, sure
[14:27:19] <zeljkof>	 gilles: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Reverting
[14:27:33] <zeljkof>	 the link is the official docs for urgent reverting
[14:27:38] <wikibugs>	 (PS1) Gilles: Revert "Add Thumbor/Mediawiki shared secret" [mediawiki-config] - https://gerrit.wikimedia.org/r/413165
[14:27:52] <wikibugs>	 (CR) Gilles: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/413165 (owner: Gilles)
[14:27:53] <zeljkof>	 since your commit is not urgent, you can revert it in gerrit, or using the docs, whatever you prefer
[14:29:10] <wikibugs>	 (Merged) jenkins-bot: Revert "Add Thumbor/Mediawiki shared secret" [mediawiki-config] - https://gerrit.wikimedia.org/r/413165 (owner: Gilles)
[14:29:17] <gilles>	 so how do I schedule a deploy window dedicated to this? just make one up and add it to the deployment wiki page?
[14:29:23] <wikibugs>	 (CR) jenkins-bot: Revert "Add Thumbor/Mediawiki shared secret" [mediawiki-config] - https://gerrit.wikimedia.org/r/413165 (owner: Gilles)
[14:29:39] <zeljkof>	 gilles: you should probably ping greg-g 
[14:29:48] <wikibugs>	 (03PS1) 10Jon Harald Søby: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882)
[14:30:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 36.62, 34.57, 32.21
[14:30:12] <_joe_>	 sigh, again?
[14:30:18] <zeljkof>	 gilles: but yes, I think you just add a window to the deployments page when there is nothing else going on, just make sure greg-g knows about it
[14:30:25] <gilles>	 ok I'm done reverting this, you can proceed with the rest of the SWAT
[14:30:26] <_joe_>	 gilles: are you deploying anything, by any chance?
[14:30:30] <gilles>	 sorry for taking half of the window
[14:30:38] <zeljkof>	 gilles: no problem, happens :)
[14:30:43] <_joe_>	 mwdebug1001 is still broken FWIW
[14:30:49] <gilles>	 _joe_: I'm not doing anything
[14:30:55] <_joe_>	 oh ok
[14:31:04] <zeljkof>	 _joe_: for the record, I am also doing nothing 
[14:31:06] <_joe_>	 so I have to pull your change on tin, too?
[14:31:06] <gilles>	 I've just brought back mwdebug1002 to its original state
[14:31:15] <icinga-wm>	 PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:31:18] <_joe_>	 gilles: what about mwdebug1001?
[14:31:24] <gilles>	 _joe_: never touched it
[14:31:36] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 13.71, 15.64, 34.67
[14:31:36] <_joe_>	 uhm, something seems very wrong there
[14:31:56] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 12.47, 13.70, 29.31
[14:33:28] <_joe_>	 zeljkof: I repeat, something is very wrong with mwdebug1001-1002
[14:33:35] <icinga-wm>	 RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[14:34:22] <zeljkof>	 _joe_: uh oh, what is wrong? I never use 1001, only 1002
[14:34:36] <_joe_>	 zeljkof: with both of them
[14:34:46] <_joe_>	 the thumb.php issue seems specific to those servers
[14:34:49] <_joe_>	 and not going away
[14:35:00] <_joe_>	 anyhow, let me deploy my change
[14:35:18] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617)
[14:35:30] <zeljkof>	 hm, if the revert is merged, you can run `scap pull` on both machines, that should bring them back to pre-swat state
[14:35:40] <_joe_>	 zeljkof: yeah gonna do that right now
[14:36:05] <Jhs>	 I have a second small patch I just committed, do you think we can add that one on today's SWAT?
[14:36:16] <Jhs>	 even though mine is the 8th in the list :)
[14:36:30] <zeljkof>	 Jhs: if nothing goes wrong and we do not run out of time... :)
[14:36:40] <zeljkof>	 I am fine with doing more patches
[14:36:45] <zeljkof>	 is it urgent?
[14:36:58] <Jhs>	 not at all :)
[14:37:00] <gilles>	 I could swear that the thumb.php requests through mwdebug1002 worked one or two weeks ago, the last time I deployed a similar config change that enabled proxying of thumb.php requests to thumbor for all public wikis
[14:37:08] <Jhs>	 just adding sitename for Burmese Wiktionary
[14:37:12] <gilles>	 but maybe I misremember
[14:37:23] <_joe_>	 I'm not going to let us deploy anything until the mwdebug issue is understood and resolved
[14:37:26] <_joe_>	 .
[14:37:52] <zeljkof>	 Jhs: add it to the calendar and if things go fine (but probably not, as it stands now) it will be deployed
[14:37:57] <_joe_>	 we can't have a part of the deploy not work reliably
[14:37:59] <Jhs>	 i'll add it to the list, and if we don't have time we don't have time. no problem
[14:38:05] <_joe_>	 !log restarting hhvm on mwdebug1002
[14:38:06] <gilles>	 _joe_: you're only seeing the fatals on the thumb.php requests, though, right? w/load.php requests work fine
[14:38:12] <_joe_>	 gilles: still
[14:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:21] <gilles>	 yeah I know, it's really bizarre
[14:38:25] <zeljkof>	 _joe_: yes, let's make sure no "I broke wikipedia" t-shirts are sent today
[14:38:40] <_joe_>	 restarting hhvm did the trick on mwdebug1002
[14:38:45] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470)
[14:38:45] <_joe_>	 it seems it was in a bad state
[14:38:45] <gilles>	 -_-
[14:38:52] <gilles>	 thanks, hhvm
[14:39:15] <icinga-wm>	 PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:39:33] <zeljkof>	 gilles: I am looking for "cannot delete non-empty directory" scap bug in phab, will report it is happening again
[14:39:48] <_joe_>	 ok, let me merge my change then
[14:39:58] <zeljkof>	 gilles: I did find this T162207
[14:39:59] <stashbot>	 T162207: When "scap pull" does a (slow) CDB rebuild, it should tell me that that's what it's doing - https://phabricator.wikimedia.org/T162207
[14:40:13] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "This should be merged when all m1,2,5 hosts have been upgraded. Until then, we can upgrade them and ln tmp -> run, plus change the basedir" [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo)
[14:40:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[14:40:47] <_joe_>	 zeljkof: deploying my change as soon as it's merged
[14:41:11] <zeljkof>	 _joe_: ok, I'm standing by to take over
[14:41:37] <gilles>	 _joe_: have you restarted hhvm on mwdebug1001 as well?
[14:41:42] <_joe_>	 gilles: yes
[14:41:56] <wikibugs>	 (03Merged) 10jenkins-bot: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[14:42:01] <_joe_>	 !log restarted hhvm on mwdebug1001 too
[14:42:09] <wikibugs>	 (03CR) 10jenkins-bot: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[14:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:27] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989178 (10zeljkofilipin) 05Resolved>03Open This is happening again.  ``` zfilipin@mw...
[14:43:37] <_joe_>	 pulling on mwdebug1002
[14:44:02] <thcipriani|afk>	 FWIW, the non-empty directory thing is a result of a cleanup that did not delete the l10n cache. Since the rsync command excludes the files that remain, but includes a --delete, rsync reports that it is not deleting this directory because it has files in it that rsync is ignoring.
[14:44:27] <zeljkof>	 gilles: T157030
[14:44:27] <stashbot>	 T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030
[14:44:30] <zeljkof>	 will do the cleanup
[14:44:53] <zeljkof>	 thcipriani|afk: just noticed the command that needs to run at mwdebug1002
[14:45:12] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989184 (10zeljkofilipin) 05Open>03Resolved
[14:45:19] <wikibugs>	 10Operations, 10DBA: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (10jcrespo) p:05Triage>03Normal
[14:45:26] <thcipriani|afk>	 zeljkof: maybe hold off until no_j.ustification is around
[14:45:43] <thcipriani|afk>	 there must be some change in how clean works because this was fixed
[14:46:18] <thcipriani|afk>	 he'd know for sure what's happening. IIRC I saw some changes to clean happen recently that maybe don't work as expected.
[14:46:45] <zeljkof>	 thcipriani|afk: should I reopen the bug and assign to him, so we don't forget about it?
[14:46:48] <_joe_>	 zeljkof: I'm deploying now
[14:46:51] <wikibugs>	 (03PS1) 10Gilles: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144)
[14:46:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db2044 (m2) [puppet] - 10https://gerrit.wikimedia.org/r/413159 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo)
[14:47:01] <thcipriani|afk>	 zeljkof: sounds good
[14:47:06] <zeljkof>	 thcipriani|afk: will do
[14:47:13] <zeljkof>	 thcipriani|afk: and thanks :)
[14:47:30] <thcipriani|afk>	 yw :)
[14:47:36] * thcipriani|afk wanders away to make coffee
[14:47:45] <icinga-wm>	 PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:48:37] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989222 (10zeljkofilipin) 05Resolved>03Open a:05thcipriani>03demon @thcipriani sa...
[14:48:54] <wikibugs>	 (03PS3) 10Gilles: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144)
[14:49:03] <wikibugs>	 (03PS2) 10Gilles: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144)
[14:49:14] <logmsgbot>	 !log oblivian@tin Synchronized wmf-config: Serve configuration to mwdebug hosts via etcd (duration: 01m 16s)
[14:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:26] <wikibugs>	 (03PS1) 10Andrew Bogott: add lvs ip for labweb services [dns] - 10https://gerrit.wikimedia.org/r/413169 (https://phabricator.wikimedia.org/T187506)
[14:50:13] <_joe_>	 zeljkof: you can proceed, 
[14:50:22] <zeljkof>	 _joe_: thanks, will do
[14:50:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 114.78, 57.24, 41.08
[14:50:34] <_joe_>	 and of
[14:50:37] <_joe_>	 course
[14:50:38] <zeljkof>	 raynor, Jhs: with 10 minutes left, we have time for 1-2 patches, are any of your patches urgent?
[14:51:25] <raynor>	 not UBN, but both high priority here. both config changes
[14:51:42] <zeljkof>	 raynor: ok, starting with your patches then
[14:51:45] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:51:56] <zeljkof>	 Jhs: sorry, looks like we will not have the time for any of your patches today 
[14:52:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:52:06] <zeljkof>	 (well, in this window)
[14:52:26] <icinga-wm>	 PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:52:43] <zeljkof>	 raynor: anything special about your patches? can not be tested at mwdebug1002, needs a lot of time to test, needs a script to run...?
[14:52:44] <Jhs>	 zeljkof, mine are not urgent, so i'm fine with waiting
[14:52:46] <_joe_>	 !log rolling restart another 4 api appservers
[14:52:59] <wikibugs>	 (03PS2) 10Zfilipin: Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga)
[14:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:11] <raynor>	 nothing special, no scripts required
[14:53:33] <raynor>	 first one hides the feedback link, second one disables event logging
[14:53:41] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga)
[14:53:56] <raynor>	 for popups extension, we have enough events for everyone
[14:53:58] <raynor>	 :)
[14:54:23] <raynor>	 I can check those on prod
[14:55:13] <wikibugs>	 (03Merged) 10jenkins-bot: Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga)
[14:55:20] <zeljkof>	 raynor: can both patches be tested at mwdebug1002?
[14:55:25] <raynor>	 yes
[14:55:35] <zeljkof>	 ok, the first one will be there in a minute
[14:56:24] <zeljkof>	 raynor: 412996 is at mwdebug1002, please test and let me know if I can deploy
[14:56:25] <icinga-wm>	 RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74114 bytes in 0.268 second response time
[14:56:36] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time
[14:56:45] <_joe_>	 let me know when SWAT is done, I have to play a bit with the mwdebug hosts
[14:56:48] <raynor>	 testing
[14:56:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time
[14:56:58] <zeljkof>	 _joe_: will do, in 5 minutes or so
[14:57:08] <wikibugs>	 (03CR) 10jenkins-bot: Disable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: 10Pmiazga)
[14:57:11] <_joe_>	 zeljkof: take your time, I did slow you people down
[14:57:50] <wikibugs>	 (03CR) 10Zfilipin: [C: 031] Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak)
[14:58:21] <gilles>	 _joe_: and let me know when you're done with that, I think I'll just resume my stuff afterwards, with greg-g still asleep right now, haven't got a reply. my config changes are very low risk for general traffic, only focused on thumb.php (which is extremely low traffic)
[14:58:54] <_joe_>	 gilles: yeah I am convinced we saw a red herring there
[14:58:57] <raynor>	 one sec, I need to open the page with ?debug=true, it takes a sec
[14:59:08] <_joe_>	 but you know, better safe than sorry
[14:59:23] <gilles>	 right, especially since the symptoms might come back when we scap pull again
[14:59:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 14.77 seconds
[14:59:36] <_joe_>	 gilles: I doubt that's the case, tbh
[14:59:48] <wikibugs>	 (03PS1) 10Ottomata: [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890)
[15:00:29] <raynor>	 hmm, config change is there but I still see events o_O
[15:00:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[15:00:48] <zeljkof>	 raynor: better than seeing dead people :D
[15:01:35] <zeljkof>	 raynor: do you need more time, or should we revert?
[15:01:56] <raynor>	 give me a minute, I'm checking the code
[15:02:10] <wikibugs>	 (03PS2) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410)
[15:02:15] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 8.35, 14.23, 23.50
[15:02:17] <raynor>	 ok, nevermind
[15:02:25] <raynor>	 if debug=true we always send the event. stupid me
[15:02:35] <raynor>	 zeljkof, it's ok, you can push to prod
[15:02:35] <wikibugs>	 (03PS1) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506)
[15:02:42] <zeljkof>	 raynor: deploying
[15:02:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi)
[15:02:58] <raynor>	 amazing, thx. The second patch is much easier to test
[15:03:07] <wikibugs>	 (03PS2) 10Ottomata: [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890)
[15:03:18] <wikibugs>	 (03PS2) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506)
[15:03:49] <wikibugs>	 (03PS3) 10Zfilipin: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak)
[15:03:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[15:03:59] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412996|Disable Page Previews EventLogging instrumentation (T185973)]] (duration: 01m 13s)
[15:04:02] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak)
[15:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:11] <stashbot>	 T185973: [Config] Disable Page Previews EventLogging instrumentation - https://phabricator.wikimedia.org/T185973
[15:04:15] <icinga-wm>	 RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[15:04:19] <zeljkof>	 raynor: 412996 is deployed, please test
[15:04:35] <zeljkof>	 raynor: will let you know when 412983 is at mwdebug1002
[15:04:59] <wikibugs>	 (03PS3) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506)
[15:05:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 64.60, 39.99, 32.81
[15:05:09] <wikibugs>	 (03PS4) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506)
[15:05:31] <wikibugs>	 (03Merged) 10jenkins-bot: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak)
[15:05:33] <raynor>	 zeljkof - tested, works
[15:05:38] <raynor>	 tested on production
[15:05:48] <zeljkof>	 great
[15:05:52] <wikibugs>	 (03PS3) 10Ottomata: [WIP] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890)
[15:06:14] <zeljkof>	 raynor: the second patch is at mwdebug1002
[15:06:52] <raynor>	 zeljkof: tested on mwdebug1002 - works
[15:07:01] <zeljkof>	 raynor: deploying
[15:07:03] <raynor>	 please deploy
[15:07:09] <wikibugs>	 (03CR) 10jenkins-bot: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak)
[15:08:10] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412983|Removing Mobile beta feedback link (T187712)]] (duration: 01m 12s)
[15:08:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:25] <stashbot>	 T187712: [Config] Remove feedback link from settings page - https://phabricator.wikimedia.org/T187712
[15:08:28] <zeljkof>	 raynor: deployed, please check and thanks for deploying with #releng ;)
[15:08:34] <zeljkof>	 !log EU SWAT finished
[15:08:42] <zeljkof>	 _joe_: I'm done
[15:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:51] <raynor>	 zeljkof: tested on production. works
[15:09:05] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 17.07, 27.81, 29.58
[15:09:07] <zeljkof>	 Jhs: sorry, please reschedule your patches for another swat 
[15:09:13] <zeljkof>	 we ran out of time today
[15:09:15] <raynor>	 thanks for deploying everything
[15:09:28] <zeljkof>	 raynor: no problemo, that is what I do :D
[15:09:36] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog: Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989296 (10Niedzielski)
[15:10:04] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog: Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989312 (10Niedzielski)
[15:10:08] <wikibugs>	 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989313 (10akosiaris) >>! In T187805#3988524, @elukey wrote: >>>! In T187805#3988447, @akosiaris wrote: >> What amount of resources (CPU, mem, disk) are we talking about ? F...
[15:10:29] <_joe_>	 zeljkof: thanks
[15:10:33] <wikibugs>	 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989314 (10akosiaris) Anyway, seems to me fine to go with the VM approach. Wanna file the task under #vm-requests ?
[15:11:27] <wikibugs>	 10Operations, 10vm-requests, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989326 (10elukey)
[15:11:29] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog: Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989296 (10Niedzielski)
[15:11:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989328 (10Cmjohnson)
[15:11:56] <wikibugs>	 10Operations, 10vm-requests, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3986390 (10elukey) >>! In T187805#3989314, @akosiaris wrote: > Anyway, seems to me fine to go with the VM approach. Wanna file the task under #vm-requests ?...
[15:12:05] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3925325 (10Cmjohnson) @jcrespo and @Marostegui  This is all yours. Please resolve once verified.  Thanks!
[15:13:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Overall correct, see the two minor comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:14:02] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989347 (10Niedzielski)
[15:14:11] <wikibugs>	 10Operations, 10Proton, 10Readers-Web-Backlog, 10Readers-Web-Kanbanana-Board, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3989349 (10Niedzielski)
[15:15:35] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 9.39, 12.50, 29.90
[15:17:03] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991)
[15:17:08] <wikibugs>	 10Operations, 10vm-requests, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989356 (10akosiaris) >>! In T187805#3989326, @elukey wrote: >>>! In T187805#3989314, @akosiaris wrote: >> Anyway, seems to me fine to go with the VM approa...
[15:17:22] <_joe_>	 gilles: you can deploy if you want, I'll get a coffee and continue with tests afterwards
[15:17:34] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: set db2011 (old codfw:s2) as spare [puppet] - 10https://gerrit.wikimedia.org/r/413174 (https://phabricator.wikimedia.org/T187886)
[15:17:34] <gilles>	 _joe_: alright, I'm starting
[15:17:43] <wikibugs>	 (03CR) 10Gilles: [C: 032] Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:17:45] <wikibugs>	 (03PS12) 10Chico Venancio: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898)
[15:19:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:19:23] <gilles>	 !log Thumbor private wiki support deployment
[15:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:17] <wikibugs>	 (03CR) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:20:21] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: set db2011 (old codfw:s2) as spare [puppet] - 10https://gerrit.wikimedia.org/r/413174 (https://phabricator.wikimedia.org/T187886) (owner: 10Jcrespo)
[15:22:53] <no_justification>	 Fucking messages directory
[15:23:00] <no_justification>	 I hates it
[15:23:09] <no_justification>	 Has nothing to do with scap clean
[15:23:23] <no_justification>	 We've had that error for *years*
[15:23:37] <no_justification>	 Blame l10nupdate
[15:23:44] <no_justification>	 Fix is easy. 
[15:23:52] <no_justification>	 Ssh everywhere it's complaining and delete
[15:24:27] <logmsgbot>	 !log gilles@tin Synchronized wmf-config/filebackend.php: Thumbor private wiki support deployment: [[gerrit:413168|Add Thumbor/Mediawiki shared secret (T169144)]] (duration: 01m 12s)
[15:24:32] <chasemp>	 !log reboot labtestservices2002
[15:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:41] <stashbot>	 T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144
[15:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:21] <wikibugs>	 (03PS5) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506)
[15:25:42] <wikibugs>	 (03PS6) 10Andrew Bogott: labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506)
[15:26:37] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989410 (10demon) Has nothing to do with scap clean. We've been fighting this same error...
[15:27:05] <logmsgbot>	 !log gilles@tin Synchronized private/PrivateSettings.php.example: Thumbor private wiki support deployment: [[gerrit:413168|Add Thumbor/Mediawiki shared secret (T169144)]] (duration: 01m 11s)
[15:27:06] <icinga-wm>	 RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[15:27:16] <wikibugs>	 (03CR) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] add lvs ip for labweb services [dns] - 10https://gerrit.wikimedia.org/r/413169 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:27:46] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3989413 (10Marostegui) Can this be resolved then?
[15:28:19] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3989416 (10jcrespo) I would say this is resolved- only pending the actual decommission (tracked on separate tickets), and the setup of extra servers fo...
[15:28:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labweb: add lvs service in front of labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/413171 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:29:08] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3989417 (10Marostegui)
[15:29:31] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3934359 (10Marostegui) 05Open>03Resolved Thanks @papaul! This looks good.  We can continue the setup at T184704
[15:29:51] <wikibugs>	 (03CR) 10jenkins-bot: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413168 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:29:54] <wikibugs>	 (03CR) 10Gilles: [C: 032] Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:29:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:30:05] <wikibugs>	 (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991)
[15:30:46] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3989424 (10jcrespo) 05Open>03Resolved a:03jcrespo
[15:31:20] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989429 (10Marostegui) 05Open>03Resolved Thanks @Cmjohnson the host looks good. We can continue the service setup at T184704
[15:31:30] <wikibugs>	 (03Merged) 10jenkins-bot: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:31:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989434 (10Marostegui)
[15:32:16] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989437 (10thcipriani) >>! In T157030#3989410, @demon wrote: > Has nothing to do with sca...
[15:32:32] <wikibugs>	 (03PS8) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506)
[15:32:33] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3989439 (10jcrespo)
[15:32:50] <wikibugs>	 (03PS9) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506)
[15:33:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:33:56] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo)
[15:34:29] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo)
[15:34:38] <wikibugs>	 (03PS1) 10Ottomata: Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175
[15:34:58] <ottomata>	 andrewbogott:  ^ look ok to you?
[15:35:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175 (owner: 10Ottomata)
[15:35:18] <ottomata>	 not really sure how to test other than to merge and try
[15:35:24] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3989446 (10Papaul) Yes we can resolve this
[15:35:45] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989447 (10demon) Sorta. From what I can tell, `rsync` won't delete a destination directo...
[15:35:51] <wikibugs>	 (03CR) 10jenkins-bot: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:35:52] <andrewbogott>	 ottomata: https://gerrit.wikimedia.org/r/#/c/411616/
[15:35:54] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for db2093 - https://phabricator.wikimedia.org/T186172#3989448 (10Papaul)
[15:36:03] <wikibugs>	 (03PS2) 10Ottomata: Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175
[15:36:08] <andrewbogott>	 that whole module is unused, I'm going to remove it, possibly today
[15:36:09] <andrewbogott>	 sorry :)
[15:36:13] <ottomata>	 oh ha!
[15:36:28] <ottomata>	 but, then how am I getting it in cloud?  am i applying the wrong class?
[15:36:30] <ottomata>	 looking at docs...
[15:36:42] <andrewbogott>	 removal is blocked by https://phabricator.wikimedia.org/T187622
[15:36:55] <andrewbogott>	 ottomata: you should be using somethingsomething:standalone
[15:37:03] <ottomata>	 ah
[15:37:05] <ottomata>	 k
[15:37:12] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989462 (10demon) Fwiw, this works just fine:    SSH_AUTH_SOCK=/run/keyholder/proxy.sock...
[15:37:36] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3989465 (10Marostegui)
[15:37:41] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for db2093 - https://phabricator.wikimedia.org/T186172#3989463 (10Marostegui) 05Open>03Resolved Thanks!
[15:37:44] <logmsgbot>	 !log gilles@tin Synchronized wmf-config/filebackend.php: Thumbor private wiki support deployment: [[gerrit:413168|Serve officewiki thumbnails with Thumbor (T169144)]] (duration: 01m 11s)
[15:37:55] <ottomata>	 thanks andrewbogott
[15:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:56] <stashbot>	 T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144
[15:38:05] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989467 (10demon) In the old method, we just did ^^^ and never had a "partial" cleanup li...
[15:38:23] <andrewbogott>	 ottomata: sorry that that old class is still there as a trap.  I've been wanting to delete it for a year but only just chased the last user off of it on Saturday
[15:38:52] <wikibugs>	 (03Abandoned) 10Ottomata: Qualify erb template variables in puppet self.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/413175 (owner: 10Ottomata)
[15:39:28] <wikibugs>	 (03CR) 10Gilles: [C: 032] Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:39:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:39:52] <wikibugs>	 (03PS3) 10Gilles: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144)
[15:40:16] <wikibugs>	 (03CR) 10Gilles: [C: 032] Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:41:32] <icinga-wm>	 PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[15:41:43] <icinga-wm>	 RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[15:41:46] <wikibugs>	 (03Merged) 10jenkins-bot: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:41:56] <wikibugs>	 (03CR) 10jenkins-bot: Serve private wiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413115 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles)
[15:42:40] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989479 (10demon) I think there's two actionables here! # Make sure we delete these direc...
[15:42:54] <ottomata>	 andrewbogott:  fyi,i just applied puppetmaster::standalone role
[15:42:56] <ottomata>	 to a new node
[15:43:02] <ottomata>	 i think a couple of runs will get it
[15:43:06] <andrewbogott>	 ottomata: cool, I think you'll like it better
[15:43:09] <ottomata>	 but first run fails to install puppet-master
[15:43:11] <ottomata>	 Job for puppet-master.service failed. See 'systemctl status puppet-master.service' and 'journalctl -xn' for details.
[15:43:11] <ottomata>	 invoke-rc.d: initscript puppet-master, action "start" failed.
[15:43:11] <ottomata>	 dpkg: error processing package puppet-master (--configure):
[15:43:11] <ottomata>	  subprocess installed post-installation script returned error exit status 1
[15:43:20] <andrewbogott>	 note that that role sets up a puppetmaster but doesn't point anything to use it
[15:43:24] <andrewbogott>	 also, no idea if it works anywhere but jessie
[15:43:25] <ottomata>	 Errors were encountered while processing:
[15:43:25] <ottomata>	  puppet-master
[15:43:27] <ottomata>	 its jessie
[15:43:59] <ottomata>	 andrewbogott:  will it use itself as puppetmaster?
[15:44:08] <andrewbogott>	 no
[15:44:11] <ottomata>	 oh
[15:44:11] <ottomata>	 hmm
[15:44:14] <andrewbogott>	 not unless you set it as its own puppetmaster
[15:44:21] <andrewbogott>	 that's all in the docs you're looking at I think :)
[15:44:41] <ottomata>	 k reading...
[15:44:43] <no_justification>	 !log pruned old 1.29.x and 1.30.x versions that somehow stuck around. Also 1.31.0-wmf.* cache/ directories for unused branches. T157030
[15:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:57] <stashbot>	 T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030
[15:45:05] <ottomata>	 ah right..
[15:45:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/411616 (https://phabricator.wikimedia.org/T182810) (owner: 10Andrew Bogott)
[15:46:09] <wikibugs>	 (03PS1) 10Andrew Bogott: horizon: add a missing arg [puppet] - 10https://gerrit.wikimedia.org/r/413178 (https://phabricator.wikimedia.org/T187506)
[15:46:32] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989505 (10demon) I'm curious if `scap clean` is the wrong approach. A daily (or heck, we...
[15:46:53] <andrewbogott>	 ottomata: sorry, didn't mean to rtfm you — I think it's straightforward, though.  You make a puppetmaster, and then later you decide who uses that puppetmaster (which may or may not be the puppetmaster host itself)
[15:47:09] <andrewbogott>	 I usually find it much less disorienting to have the puppetmaster and the client be different VMs
[15:47:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] horizon: add a missing arg [puppet] - 10https://gerrit.wikimedia.org/r/413178 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[15:49:42] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.40 and port 80: Connection refused
[15:49:46] <icinga-wm>	 PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[15:49:55] <_joe_>	 andrewbogott: ^^
[15:49:59] <volans>	 paged
[15:50:01] <_joe_>	 did you restart pybal?
[15:50:04] <ottomata>	 yeah, that is actually nice andrewbogott
[15:50:11] <ottomata>	 nice that the standalone pm does not have to pm itself
[15:50:19] <_joe_>	 pybal doesn't pick up changes unless you restart it
[15:50:20] <andrewbogott>	 great
[15:50:21] <ottomata>	 also, i've never tried this git remote push thing
[15:50:22] <ottomata>	 looks fancy
[15:50:32] <ottomata>	 i always do rsync to /var/lib/git ...
[15:50:37] <logmsgbot>	 !log gilles@tin Synchronized wmf-config/filebackend.php: Thumbor private wiki support deployment: [[gerrit:413115|Serve private wiki thumbnails with Thumbor (T169144)]] (duration: 01m 12s)
[15:50:39] <andrewbogott>	 _joe_: I didn't do anything that a plain puppet merge didn't do
[15:50:40] <ottomata>	 will try next time i need a longer lived pm
[15:50:42] <_joe_>	 ottomata: we just got paged, and andrew needs to be the one looking into it :P
[15:50:45] <andrewbogott>	 is there a by-hand step I'm missing?
[15:50:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:51] <stashbot>	 T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144
[15:51:15] <_joe_>	 andrewbogott: yes, once puppet has run on the load-balancers, you need to manually restart pybal where relevant
[15:51:24] <_joe_>	 in your case, the eqiad low-traffic pair
[15:51:31] <andrewbogott>	 crap, ok.
[15:51:32] <_joe_>	 so first lvs1006, then 1003
[15:52:11] <andrewbogott>	 lvs1006.eqiad.wmnet?  I can't connect for some reason
[15:52:23] <_joe_>	 .wm.org
[15:52:29] <_joe_>	 I'm on it andrewbogott 
[15:52:36] <andrewbogott>	 thanks
[15:52:58] <_joe_>	 so puppet still didn't run on lvs1006
[15:53:09] <_joe_>	 but it did run on einsteinium I guess
[15:53:09] <andrewbogott>	 meanwhile I should really remove that 'critical' flag, doing now
[15:53:10] <gilles>	 _joe_: I'm done with my deployment
[15:53:18] <_joe_>	 gilles: ack
[15:54:41] <_joe_>	 andrewbogott: so once you've restarted pybal, you can check your pool is there with 
[15:54:45] <wikibugs>	 (03PS1) 10Andrew Bogott: labweb lvs: mark critical: false [puppet] - 10https://gerrit.wikimedia.org/r/413179
[15:54:46] <icinga-wm>	 PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[15:54:46] <_joe_>	 curl localhost:9090/pools | grep labweb
[15:54:55] <wikibugs>	 (03PS2) 10Andrew Bogott: labweb lvs: mark critical: false [puppet] - 10https://gerrit.wikimedia.org/r/413179
[15:55:07] <wikibugs>	 (03PS1) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430)
[15:55:19] <_joe_>	 but curl localhost:9090/pools/labweb_80
[15:55:22] <_joe_>	 seems empty
[15:55:23] <_joe_>	 uhm
[15:55:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labweb lvs: mark critical: false [puppet] - 10https://gerrit.wikimedia.org/r/413179 (owner: 10Andrew Bogott)
[15:57:15] <_joe_>	 andrewbogott: how did you merge your puppet change?
[15:57:19] <_joe_>	 exact command please
[15:57:22] <_joe_>	 :)
[15:57:29] <_joe_>	 the one where you added labweb
[15:57:43] <andrewbogott>	 'sudo puppet merge' on puppetmaster1001.wikimedia.org
[15:57:46] <_joe_>	 ah
[15:57:52] <wikibugs>	 (03PS2) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430)
[15:57:58] <_joe_>	 and you didn't notice errors from conftool-merge?
[15:58:11] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/labweb on labpuppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/labweb
[15:58:14] <_joe_>	 sudo -i puppet merge when you want to merge changes there 
[15:58:20] <_joe_>	 anyways
[15:58:24] <_joe_>	 I'm fixing that too
[15:58:25] <andrewbogott>	 I didn't, but that doesn't mean there weren't any...
[15:58:30] <andrewbogott>	 thanks, what did I miss?
[15:58:43] <_joe_>	 lemme see, are you about to merge another change?
[15:58:56] <andrewbogott>	 Just did (critical: false) but I'm clear now
[15:59:10] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/eqiad/labweb on labpuppetmaster1001 is OK: No errors detected
[15:59:17] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989547 (10thcipriani) Just to create a small test case that demos what's going wrong:  `...
[15:59:21] <andrewbogott>	 conftool says
[15:59:25] <andrewbogott>	 https://www.irccloud.com/pastebin/PRoJNZam/
[15:59:30] <icinga-wm>	 PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[15:59:37] <andrewbogott>	 is that because of the lack of -i ?
[15:59:41] <_joe_>	 yes
[15:59:52] <andrewbogott>	 ok, will try to retrain my fingers
[15:59:53] <_joe_>	 the credentials are available in the root home dir
[16:00:02] <_joe_>	 this is what I did to fix it https://dpaste.de/7ajq/raw
[16:00:22] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#3989548 (10fgiunchedi) Today rsyslogd was "stuck" accepting new connections on lithium and wezen, at about the same time. This is a strace from `check_ssl` on einsteinium:...
[16:00:30] <godog>	 !log restart rsyslogd on lithium and wezen - T136312
[16:00:34] <andrewbogott>	 ok, makes sense
[16:00:41] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[16:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:45] <stashbot>	 T136312: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312
[16:00:50] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1340 days)
[16:01:28] <wikibugs>	 (03PS1) 10Gilles: Stop routing Varnish thumb.php traffic to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899)
[16:01:45] <_joe_>	 gilles: \o/
[16:01:55] <godog>	 \o/ indeed
[16:02:30] <wikibugs>	 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989556 (10elukey)
[16:02:40] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[16:02:54] <godog>	 should recover soon
[16:03:00] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1277 days)
[16:03:44] <wikibugs>	 (03PS1) 10Andrew Bogott: horizon memcache: fix an issue with erb var resolution [puppet] - 10https://gerrit.wikimedia.org/r/413186 (https://phabricator.wikimedia.org/T187506)
[16:04:18] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3989575 (10thcipriani) >>! In T157030#3989447, @demon wrote: > Sorta. From what I can tel...
[16:04:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] horizon memcache: fix an issue with erb var resolution [puppet] - 10https://gerrit.wikimedia.org/r/413186 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[16:05:53] <wikibugs>	 10Operations, 10vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3989586 (10elukey)
[16:06:07] <wikibugs>	 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3989596 (10elukey) done! https://phabricator.wikimedia.org/T187901
[16:06:40] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org])
[16:06:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled
[16:08:00] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org])
[16:10:20] <wikibugs>	 (03PS1) 10Andrew Bogott: labweb: inclued role::lvs::realserver on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413188 (https://phabricator.wikimedia.org/T187506)
[16:11:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labweb: inclued role::lvs::realserver on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413188 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[16:11:18] <ema>	 andrewbogott: varnish reloads are failing with:
[16:11:20] <ema>	 Backend host '"labweb.svc.wikimedia.org"' could not be resolved to an IP address
[16:12:00] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 453 bytes in 0.001 second response time
[16:12:02] <_joe_>	 ema: ouch
[16:12:14] <_joe_>	 ema: i didn't notice that in my review
[16:12:17] <andrewbogott>	 what the heck?  That's the one part of this that I /do/ know how to do, I thought...
[16:12:46] <_joe_>	 andrewbogott: s/wikimedia.org/eqiad.wmnet/
[16:12:55] <ema>	 yup
[16:13:29] <andrewbogott>	 crap, ok, fixing
[16:13:56] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::coordinator: enable mon. for oozie|hive [puppet] - 10https://gerrit.wikimedia.org/r/413189 (https://phabricator.wikimedia.org/T184794)
[16:14:15] <andrewbogott>	 so many moving parts
[16:14:18] <wikibugs>	 (03PS1) 10Andrew Bogott: labweb: correct labweb service hostname [puppet] - 10https://gerrit.wikimedia.org/r/413190
[16:14:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labweb: correct labweb service hostname [puppet] - 10https://gerrit.wikimedia.org/r/413190 (owner: 10Andrew Bogott)
[16:16:36] <ema>	 forced a puppet run on cp1045, all good
[16:16:55] <_joe_>	 ok
[16:16:57] <_joe_>	 cool
[16:19:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled
[16:19:30] <icinga-wm>	 RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[16:19:40] <icinga-wm>	 RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[16:20:37] <_joe_>	 uh?
[16:20:49] <andrewbogott>	 it looks up to me
[16:21:02] <_joe_>	 that means there is another typo maybe
[16:21:20] <andrewbogott>	 probably!
[16:21:41] <_joe_>	 is the server name labweb1001.wikimedia.org?
[16:21:54] <_joe_>	 yes, so no typo
[16:21:58] <_joe_>	 lemme see what's up
[16:22:29] <_joe_>	 curl localhost:9090/pools/labweb_80
[16:22:29] <_joe_>	 labweb1002.wikimedia.org:	enabled/down/not pooled
[16:22:29] <_joe_>	 labweb1001.wikimedia.org:	enabled/down/pooled
[16:22:53] <_joe_>	 same on lvs1003
[16:23:16] <_joe_>	 [labweb_80 IdleConnection] WARN: labweb1001.wikimedia.org (enabled/down/pooled): Connection to 208.80.154.160:80 failed.
[16:23:22] <andrewbogott>	 how specifically does it decide if the host is up or down?
[16:23:24] <_joe_>	 ok, this seems like a network issue
[16:23:27] <andrewbogott>	 oh, ok
[16:23:32] <_joe_>	 andrewbogott: depending on your configs
[16:23:45] <andrewbogott>	 but in this case, it's just seeing if port 80 is up
[16:24:05] <_joe_>	 andrewbogott: no, you are both doing an IdleConnection check, and a ProxyFetch check
[16:24:05] <andrewbogott>	 which it is, but maybe I need a firewall fix...
[16:24:09] <_joe_>	 both fail
[16:24:12] <_joe_>	 and yes, it seems so
[16:24:22] <andrewbogott>	 so which ports should I be opening?
[16:24:40] <wikibugs>	 (03PS4) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890)
[16:24:49] <ottomata>	 elukey: ^ would you review that?
[16:25:12] <_joe_>	 andrewbogott: port 80 I guess, but I think it's more complicated than you think
[16:25:17] <andrewbogott>	 (and shouldn't that be handled by role::lvs::realserver?)
[16:25:25] <andrewbogott>	 yeah, might not be just the firewall
[16:25:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[16:26:02] <ema>	 _joe_: there's a icinga warning on lvs1010 for pybal-etcd connections: 58 connections established with conf1001.eqiad.wmnet:2379 (min=59)
[16:26:15] <ema>	 I'll restart pybal there too 
[16:26:21] <_joe_>	 ema: pybal needs to be restarted there, yes
[16:26:23] <_joe_>	 clearly
[16:26:31] <andrewbogott>	 _joe_: let me know if/when you need me to back this out so you can get on with your life
[16:26:36] <_joe_>	 ema: can you or someone from your team help andrewbogott?
[16:27:11] <_joe_>	 I think the problem is he's trying to do LVS/DR to a public IP from the labs subnet maybe?
[16:27:19] <_joe_>	 I haven't looked into it
[16:27:52] <ema>	 !log lvs1010: restart pybal
[16:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:05] <wikibugs>	 (03PS5) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890)
[16:28:49] <wikibugs>	 (03PS2) 10Elukey: role::analytics_cluster::coordinator: enable mon. for oozie|hive [puppet] - 10https://gerrit.wikimedia.org/r/413189 (https://phabricator.wikimedia.org/T184794)
[16:29:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled
[16:30:00] <_joe_>	 andrewbogott: either you find someone to help troubleshooting this issue, or you should revert
[16:30:07] <andrewbogott>	 yep, ok
[16:31:31] <wikibugs>	 (03PS5) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670
[16:31:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel)
[16:31:54] <ema>	 right so the LVSs can't connect to 208.80.154.160:80
[16:32:30] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org])
[16:32:51] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) andrew bogott This is a work in progress, Im looking at it.
[16:32:51] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled andrew bogott This is a work in progress, Im looking at it.
[16:32:51] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) andrew bogott This is a work in progress, Im looking at it.
[16:32:51] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled andrew bogott This is a work in progress, Im looking at it.
[16:32:51] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) andrew bogott This is a work in progress, Im looking at it.
[16:32:51] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled andrew bogott This is a work in progress, Im looking at it.
[16:33:46] <andrewbogott>	 ema: I'm looking at ferm issues but I suspect we also need to do something on the switch to allow this
[16:34:32] <wikibugs>	 (03CR) 10BryanDavis: "> could I just 'rm -R /usr/local/lib/mediawiki-config' and run puppet" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis)
[16:36:25] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989296 (10thcipriani) Looks like the api is able to render a trivial test or two for me currently: https://en.wikipedia.beta.wmflabs.org/w/api.php?ac...
[16:37:53] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@1be63aa]: Simplify ORES precaching by using the new endpoint T158437
[16:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:09] <stashbot>	 T158437: Change ORES rules to send all events to new "/precache" endpoint - https://phabricator.wikimedia.org/T158437
[16:39:26] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@1be63aa]: Simplify ORES precaching by using the new endpoint T158437 (duration: 01m 33s)
[16:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:21] <wikibugs>	 (03PS1) 10Andrew Bogott: horizon/labweb: open firewall to internal IPs for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/413194 (https://phabricator.wikimedia.org/T187506)
[16:40:43] <_joe_>	 !log testing various etcd failure scenarios on mwdebug1001, T185078
[16:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:57] <stashbot>	 T185078: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078
[16:41:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] horizon/labweb: open firewall to internal IPs for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/413194 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott)
[16:41:30] <icinga-wm>	 RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:46:15] <wikibugs>	 (03PS6) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670
[16:47:21] <wikibugs>	 (03CR) 10EBernhardson: [C: 031] elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel)
[16:49:50] <icinga-wm>	 RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:52:17] <wikibugs>	 (03PS1) 10Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604)
[16:53:36] <wikibugs>	 (03CR) 10Rush: [C: 031] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[16:53:44] <wikibugs>	 (03PS13) 10Rush: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[16:54:19] <wikibugs>	 (03PS7) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670
[16:54:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[16:56:01] <wikibugs>	 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10Andrew)
[16:56:15] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : edit; selector: name=ReadOnly,scope=eqiad
[16:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:03] <wikibugs>	 (03CR) 10Chad: "Having a root just clean it up was what I had hoped for last night on IRC ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis)
[17:01:39] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "toollabs: add apt pinnings for key packages" [puppet] - 10https://gerrit.wikimedia.org/r/413198
[17:02:03] <wikibugs>	 (03CR) 10Chad: "db1102 and db1095 complained too, are they labs replicas but poorly named?" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis)
[17:02:20] <wikibugs>	 (03CR) 10Paladox: [C: 031] Gerrit: Improve registration url [puppet] - 10https://gerrit.wikimedia.org/r/413079 (owner: 10Chad)
[17:02:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "toollabs: add apt pinnings for key packages" [puppet] - 10https://gerrit.wikimedia.org/r/413198 (owner: 10Arturo Borrero Gonzalez)
[17:02:55] <wikibugs>	 (03CR) 10Rush: "rush@tools-checker-01:~$ sudo facter -p | grep -i os_version" [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[17:04:26] <Zoranzoki21>	 Hi
[17:04:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[17:04:36] <wikibugs>	 (03CR) 10Rush: "root@tools-worker-1001:~# sudo facter -p | grep lsbdistcodename" [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[17:05:27] <wikibugs>	 (03PS2) 10Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604)
[17:05:56] <Zoranzoki21>	 I have a question. Why does Gerrit act as if I am blocked again? Sometimes when I want to see my change, it shows that the page is not found. Sometimes I see the page without problems
[17:06:15] <no_justification>	 You are not blocked.
[17:06:28] <Zoranzoki21>	 I know
[17:06:47] <Zoranzoki21>	 But sometimes it shows as if I am
[17:07:04] <_joe_>	 !log finished testing on mwdebug1001 for swat
[17:07:10] <Zoranzoki21>	 Just now, I had problems viewing https://gerrit.wikimedia.org/r/#/c/412947/
[17:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:28] <paladox>	 Zoranzoki21 if you were blocked, it would not let you sign in
[17:07:30] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal
[17:07:43] <paladox>	 I think what you're trying to view is either a deleted change or a draft
[17:07:51] <no_justification>	 That ^
[17:07:55] <Zoranzoki21>	 https://gerrit.wikimedia.org/r/#/c/412947/ is draft?
[17:08:11] <paladox>	 https://gerrit.wikimedia.org/r/#/c/412947/ works for me
[17:08:17] <paladox>	 Zoranzoki21 what error do you get?
[17:08:30] <wikibugs>	 (03PS1) 10Ema: labweb lvs: proxyfetch configuration [puppet] - 10https://gerrit.wikimedia.org/r/413201
[17:08:55] <Zoranzoki21>	 The page you requested was not found, or you do not have permission to view this page.
[17:09:11] <paladox>	 Hmm, does it say internal error?
[17:09:14] <paladox>	 (too)
[17:09:23] <Zoranzoki21>	 no
[17:10:05] <no_justification>	 That shows up for me, even when not logged in
[17:10:10] <no_justification>	 That's inconsistent with being blocked
[17:10:14] <no_justification>	 Or it being private
[17:10:19] <Zoranzoki21>	 So it wants at one point for a few seconds. Well, a couple of times when I refresh, it works smoothly for about an hour or two.
[17:10:33] <Zoranzoki21>	 *it working
[17:10:57] <Zoranzoki21>	 It's like something blocks it from showing, and I do not know what.
[17:11:05] <Zoranzoki21>	 At first I thought that you blocked me again
[17:11:05] <no_justification>	 That sounds like a bad link or something. That's more of a 404-style error than anything
[17:11:22] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@e9a6bb0]: Use post for ORES precache rules T158437
[17:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:36] <stashbot>	 T158437: Change ORES rules to send all events to new "/precache" endpoint - https://phabricator.wikimedia.org/T158437
[17:11:37] <no_justification>	 Well I've got nothing obvious in the error logs.
[17:11:43] <Zoranzoki21>	 But now it is working
[17:12:05] <Zoranzoki21>	 Next time this happens, I will send a screenshot here
[17:12:08] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193)
[17:12:20] <wikibugs>	 (03PS2) 10Ema: labweb lvs: proxyfetch configuration [puppet] - 10https://gerrit.wikimedia.org/r/413201
[17:12:46] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@e9a6bb0]: Use post for ORES precache rules T158437 (duration: 01m 23s)
[17:12:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:57] <wikibugs>	 (03CR) 10Ema: [C: 032] labweb lvs: proxyfetch configuration [puppet] - 10https://gerrit.wikimedia.org/r/413201 (owner: 10Ema)
[17:14:14] <wikibugs>	 (03CR) 10Bstorm: "Note: Puppet is disabled on tools-static-11 and tools-static-10 so that this change doesn't impact the existing setup to allow for smooth " [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm)
[17:17:00] <ema>	 !log eqiad LVSs: bounce pybal for labweb proxyfetch config changes
[17:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:42] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989938 (10Niedzielski) 05Open>03Resolved a:03thcipriani Fixed! Thank you @thcipriani!
[17:18:41] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3975346 (10RStallman-legalteam) Hello all,  Just an update that I checked w/ the attorneys and we're going to continue to do NDAs for all LDAP access requests that...
[17:19:04] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Tracking), 10Release-Engineering-Team (Kanban): Beta cluster api.php never responds - https://phabricator.wikimedia.org/T187891#3989962 (10thcipriani) >>! In T187891#3989938, @Niedzielski wrote: > Fixed! Thank you @thcipriani!  glad to hea...
[17:19:07] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193)
[17:19:20] <wikibugs>	 (03CR) 10BryanDavis: "> db1102 and db1095 complained too, are they labs replicas but poorly" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis)
[17:19:40] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[17:20:14] <wikibugs>	 (03CR) 10Rush: [C: 031] "Best hopes!" [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[17:20:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[17:20:35] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193)
[17:21:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/413202 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[17:21:40] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal
[17:23:00] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal
[17:32:13] <wikibugs>	 (03PS1) 10Rush: toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193)
[17:32:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush)
[17:34:40] <_joe_>	 !log resuming tests on mwdebug1001
[17:34:45] <wikibugs>	 (03PS2) 10Rush: toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193)
[17:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush)
[17:36:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 031 C: 031] toolforge: apply pinning to k8s components [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush)
[17:37:06] <wikibugs>	 (03CR) 10Rush: [V: 032 C: 032] "Sorry style check we have to move forward." [puppet] - 10https://gerrit.wikimedia.org/r/413206 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush)
[17:41:33] <wikibugs>	 (03PS1) 10Ema: icinga: promote check_established_connections alerts to critical [puppet] - 10https://gerrit.wikimedia.org/r/413208 (https://phabricator.wikimedia.org/T170847)
[17:42:59] <wikibugs>	 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10mark) You are setting up a publicly accessible web service, right? So you should probably open up port 80 (and/or 443) to the entire world, not just LVS servers.  Traffic is "routed" via...
[17:43:57] <ema>	 !log eqsin LVSs: upgrade pybal to 1.14.4 
[17:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:48] <wikibugs>	 (03PS3) 10Zoranzoki21: Added throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803)
[17:50:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21)
[17:53:21] <wikibugs>	 (03PS4) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803)
[17:53:25] <wikibugs>	 (03PS5) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803)
[17:57:35] <wikibugs>	 (03PS1) 10Vgutierrez: Provide an UDP monitor. [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151)
[17:59:47] <wikibugs>	 (03PS3) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259)
[18:00:05] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1800).
[18:00:05] <jouncebot>	 razesoldier, Zoranzoki21, and Jhs: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:13] <Jhs>	 I'm here!
[18:00:17] <razesoldier>	 I'm here
[18:00:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron)
[18:00:30] <Zoranzoki21>	 and I
[18:00:48] <Jhs>	 Have the problems from earlier today been solved?
[18:01:40] <Zoranzoki21>	 Who will be our swater today?
[18:03:47] <wikibugs>	 (03CR) 10Elukey: Refactor kafkatee module to support multi instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[18:04:06] <wikibugs>	 10Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3990159 (10MoritzMuehlenhoff) 05Open>03Resolved This is fully rolled out.
[18:07:54] * Jhs pings zeljkof :)
[18:08:23] * Zoranzoki21 trying to ping all swaters:)
[18:12:44] <_joe_>	 !log stopped testing on mwdebug1001 for SWAT window
[18:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:39] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3990194 (10Volker_E) @Dzahn Any updates on above? Accomplishing this task is part of end of Q goals and is dependent on the curren...
[18:16:55] <wikibugs>	 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3990203 (10Niedzielski)
[18:17:05] <zeljkof>	 Jhs, Zoranzoki21: sorry, can not swat
[18:17:18] <Zoranzoki21>	 I know zeljkof. But who can?
[18:17:51] <thcipriani>	 I can SWAT
[18:18:08] <Zoranzoki21>	 oh thank you god. Thank you thcipriani
[18:18:31] <wikibugs>	 (03PS3) 10Thcipriani: Set Topic namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦)
[18:19:00] <thcipriani>	 razesoldier: are you set up to test ^ on the mwdebug machines?
[18:19:18] <wikibugs>	 (03PS1) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212
[18:19:24] <razesoldier>	 Yes, I can test
[18:19:34] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦)
[18:19:37] <thcipriani>	 cool :)
[18:19:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 (owner: 10ArielGlenn)
[18:19:50] <razesoldier>	 via browser extension
[18:20:33] <Jhs>	 yay thcipriani :)
[18:20:44] <wikibugs>	 (03Merged) 10jenkins-bot: Set Topic namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦)
[18:20:55] <wikibugs>	 (03CR) 10jenkins-bot: Set Topic namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 10星耀晨曦)
[18:21:21] <thcipriani>	 razesoldier: your change is live on mwdebug1002, check please
[18:22:01] <razesoldier>	 looks good,
[18:22:22] <razesoldier>	 Can be redirected to topic namespace
[18:22:37] <wikibugs>	 (03PS2) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212
[18:22:39] <thcipriani>	 great, pushing the change live
[18:24:45] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412439|Set Topic namespace alias of zhwiki]] T187546 (duration: 01m 13s)
[18:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:59] <stashbot>	 T187546: Set Topic namespace alias of zhwiki - https://phabricator.wikimedia.org/T187546
[18:25:00] <thcipriani>	 ^ razesoldier your change is live everywhere, thanks for the patch!
[18:25:26] <Zoranzoki21>	 can mine be next? it is easier
[18:25:28] <razesoldier>	 Thanks for your swat :D
[18:27:25] <wikibugs>	 (03PS6) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803)
[18:28:59] <wikibugs>	 (03CR) 10Thcipriani: [C: 04-1] "comment inline" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21)
[18:29:11] <thcipriani>	 Zoranzoki21: time looks wrong ^
[18:29:46] <Zoranzoki21>	 thcipriani: I will fix the patch in a few seconds
[18:29:54] <thcipriani>	 k
[18:30:19] <Jhs>	 Zoranzoki21, you know you can edit directly in Gerrit right? :)
[18:30:22] <wikibugs>	 (03PS7) 10Zoranzoki21: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803)
[18:30:27] <Zoranzoki21>	 I know
[18:30:28] <Zoranzoki21>	 I did it
[18:30:45] <wikibugs>	 (03PS4) 10Thcipriani: Add namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby)
[18:30:47] <thcipriani>	 thanks
[18:30:50] <Jhs>	 (Y)
[18:30:51] <Jhs>	 :)
[18:30:58] <Jhs>	 asn
[18:31:01] <Zoranzoki21>	 thcipriani: Now you can
[18:32:39] <wikibugs>	 (03PS1) 10Rush: toolforge: update pin for kubernetes-client [puppet] - 10https://gerrit.wikimedia.org/r/413213 (https://phabricator.wikimedia.org/T187193)
[18:32:48] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21)
[18:33:13] <Zoranzoki21>	 thcipriani: thanks
[18:33:28] <thcipriani>	 Zoranzoki21: you're welcome, thanks for the patch
[18:33:38] <Zoranzoki21>	 :)
[18:33:39] <thcipriani>	 once it merges, I'll sync it live
[18:33:45] <Zoranzoki21>	 thcipriani: ok
[18:34:17] <wikibugs>	 (03Merged) 10jenkins-bot: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21)
[18:35:42] <Zoranzoki21>	 thcipriani: postmerge scared me again
[18:36:59] <wikibugs>	 (03CR) 10jenkins-bot: Added new throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) (owner: 10Zoranzoki21)
[18:37:00] <wikibugs>	 (03CR) 10Rush: "This worked seemingly fine for all 3 labsdb10[09|10|11]" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis)
[18:37:08] <thcipriani>	 Zoranzoki21: it's queued now https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/ but there's another job running on the executor that post-merge uses
[18:37:24] <Zoranzoki21>	 thcipriani: ok. postmerge always scares me
[18:37:51] <chasemp>	 !log labsdb rm -fR /usr/local/lib/mediawiki-config && puppet agent --test
[18:37:56] <wikibugs>	 (03CR) 10Rush: "@marostegui could we run '/usr/local/lib/mediawiki-config && puppet agent --test' on db1102 and db1095 as an easy fix for submodule cleanu" [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis)
[18:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:43] <wikibugs>	 (03CR) 10Rush: [C: 032] toolforge: update pin for kubernetes-client [puppet] - 10https://gerrit.wikimedia.org/r/413213 (https://phabricator.wikimedia.org/T187193) (owner: 10Rush)
[18:38:45] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby)
[18:39:35] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:412947|Added new throttle rule for Wikipedia Women in Red editathon]] T187803 (duration: 01m 12s)
[18:39:46] <thcipriani>	 ^ Zoranzoki21 your change is live now
[18:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:49] <stashbot>	 T187803: Temporary lift of IP cap on en.wikipedia for 2018-08-03 - https://phabricator.wikimedia.org/T187803
[18:40:12] <Zoranzoki21>	 thcipriani: tnx
[18:40:13] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby)
[18:40:20] <thcipriani>	 yw :)
[18:40:24] <wikibugs>	 (03CR) 10jenkins-bot: Add namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby)
[18:40:50] <Jhs>	 thcipriani, i assume you saw my comment about script run for this one?
[18:42:20] <thcipriani>	 Jhs: saw the tag, do you need namespacedupes run for this?
[18:42:34] <thcipriani>	 Jhs: also, it's live on mwdebug1002, check please
[18:44:23] <Jhs>	 thcipriani, yeah, namespaceDupes to be safe
[18:44:29] <Jhs>	 looks good to me on 1002
[18:45:05] <thcipriani>	 ok, I'll sync live and run namespaceDupes on terbium after
[18:45:14] <Jhs>	 coolio (Y)
[18:48:28] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412676|Add namespace localization for sdwiki]] T186943 (duration: 01m 13s)
[18:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:42] <stashbot>	 T186943: Localize & change namespaces on Sindhi Wikipedia (sdwiki) - https://phabricator.wikimedia.org/T186943
[18:49:56] <thcipriani>	 Jhs: ^ live and namespacedupes run: 2132 links to fix, 2132 were resolvable.
[18:50:19] <Jhs>	 sweet
[18:51:05] <wikibugs>	 (03PS2) 10Thcipriani: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby)
[18:51:07] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby)
[18:52:34] <wikibugs>	 (03Merged) 10jenkins-bot: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby)
[18:52:48] <wikibugs>	 (03CR) 10jenkins-bot: Add sitename for Burmese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413166 (https://phabricator.wikimedia.org/T187882) (owner: 10Jon Harald Søby)
[18:52:53] <wikibugs>	 (03CR) 10Ottomata: Refactor kafkatee module to support multi instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[18:53:21] <wikibugs>	 (03PS6) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890)
[18:53:28] <wikibugs>	 (03PS2) 10Ottomata: Remove kafkatee as a submodule and re-add it into ops/puppet preserving history [puppet] - 10https://gerrit.wikimedia.org/r/413056
[18:53:34] <thcipriani>	 Jhs: sitename patch for mywiktionary is live on mwdebug1002, check please
[18:53:37] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Remove kafkatee as a submodule and re-add it into ops/puppet preserving history [puppet] - 10https://gerrit.wikimedia.org/r/413056 (owner: 10Ottomata)
[18:53:44] <no_justification>	 thcipriani: Run namespaceDupes.php (and then again with --fix) after you finish namespace changes
[18:54:02] <Jhs>	 thcipriani, looks good (Y)
[18:54:36] <no_justification>	 It's a pretty safe script to run, but a dry run never hurts :)
[18:55:07] <thcipriani>	 no_justification: this is likely true. I'll update docs https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes after swat
[18:55:17] <thcipriani>	 Jhs: k, going live.
[18:57:17] <wikibugs>	 (03PS1) 10Ottomata: Revert "Remove kafkatee as a submodule and re-add it into ops/puppet preserving history" [puppet] - 10https://gerrit.wikimedia.org/r/413215
[18:57:21] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Revert "Remove kafkatee as a submodule and re-add it into ops/puppet preserving history" [puppet] - 10https://gerrit.wikimedia.org/r/413215 (owner: 10Ottomata)
[18:59:10] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:413166|Add sitename for Burmese Wiktionary]] T187882 (duration: 01m 06s)
[18:59:17] <thcipriani>	 ^ Jhs live now
[18:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:24] <stashbot>	 T187882: Localized sitename for Burmese Wiktionary - https://phabricator.wikimedia.org/T187882
[18:59:32] <Jhs>	 looks good :)
[18:59:51] <thcipriani>	 great, thanks for the patch :)
[18:59:55] <thcipriani>	 </swat>
[19:00:05] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T1900)
[19:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:00:45] <Jhs>	 thanks for swatting thcipriani :)
[19:00:55] <thcipriani>	 yw :)
[19:02:04] <thcipriani>	 https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes updated
[19:03:10] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[19:04:01] <Niharika>	 Thanks for that, thcipriani. 
[19:05:05] * thcipriani doffs hat
[19:05:09] <wikibugs>	 (03PS1) 10Ottomata: Refactor kafkatee module to support multi instance [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413217 (https://phabricator.wikimedia.org/T187890)
[19:05:31] <wikibugs>	 (03Abandoned) 10Ottomata: Refactor kafkatee module to support multi instance [puppet] - 10https://gerrit.wikimedia.org/r/413170 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[19:06:50] <wikibugs>	 (03PS2) 10Ottomata: Refactor kafkatee module to support multi instance [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413217 (https://phabricator.wikimedia.org/T187890)
[19:07:06] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Refactor kafkatee module to support multi instance [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/413217 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[19:07:33] <wikibugs>	 (03PS1) 10Ottomata: Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890)
[19:11:44] <wikibugs>	 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3990425 (10mobrovac) Given the requirements, I would be inclined to say Kubernetes, but we don't have any services on it yet. So perhaps G...
[19:13:58] <wikibugs>	 (03CR) 10Smalyshev: wdqs: allow configuration of kafka based updates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel)
[19:16:04] <wikibugs>	 (03PS2) 10Ottomata: Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890)
[19:16:30] <icinga-wm>	 PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:21] <icinga-wm>	 RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 74582 bytes in 1.325 second response time
[19:20:22] <wikibugs>	 (03PS3) 10Ottomata: Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890)
[19:21:19] <wikibugs>	 10Operations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#3990473 (10ayounsi) p:05Triage>03Normal
[19:23:41] <wikibugs>	 (03CR) 10Ottomata: "Looks good: https://puppet-compiler.wmflabs.org/compiler02/10071/" [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[19:24:42] <ottomata>	 !log applying changes to kafkatee  module, first rhenium then oxygen.  will require manual config fixings
[19:24:46] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Update kafkatee submodule with multi instance support [puppet] - 10https://gerrit.wikimedia.org/r/413220 (https://phabricator.wikimedia.org/T187890) (owner: 10Ottomata)
[19:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:30] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[19:27:50] <ebernhardson>	 going to sneak another patch into SWAT, just bumping a pool counter limit for something cirrus
[19:27:57] <Hauskatze>	 jouncebot: next
[19:27:57] <jouncebot>	 In 0 hour(s) and 32 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T2000)
[19:28:13] <wikibugs>	 (03PS2) 10EBernhardson: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982
[19:29:21] <wikibugs>	 (03CR) 10EBernhardson: [C: 032] Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 (owner: 10EBernhardson)
[19:29:46] <wikibugs>	 (03CR) 10Zhuyifei1999: "Why is caching the response from api.cdnjs.com needed?" [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm)
[19:30:53] <wikibugs>	 (03Merged) 10jenkins-bot: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 (owner: 10EBernhardson)
[19:31:08] <wikibugs>	 (03CR) 10jenkins-bot: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 (owner: 10EBernhardson)
[19:34:12] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/PoolCounterSettings.php: Increase pool counter workers for cirrus namespace lookup (duration: 01m 13s)
[19:34:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:11] <wikibugs>	 (03PS4) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259)
[19:37:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] openstack: labs-instance-transport1-b-codfw designations [dns] - 10https://gerrit.wikimedia.org/r/413160 (https://phabricator.wikimedia.org/T184209) (owner: 10Rush)
[19:38:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron)
[19:38:23] <wikibugs>	 (03CR) 10Zhuyifei1999: tools-static: Change to reverse proxy of cdnjs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm)
[19:39:00] <icinga-wm>	 PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[kafkatee-webrequest]
[19:41:04] <wikibugs>	 (CR) Bstorm: "Caching the response is how the other bit (cdnjs-index) works.  It used to consume a packages.json from the checkout.  Now, we get the sam" [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[19:41:19] <wikibugs>	 Operations, Cloud-Services, netops: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933#3990570 (chasemp)
[19:41:28] <wikibugs>	 Operations, Cloud-Services, netops: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933#3990586 (chasemp) p:Triage>Low
[19:45:42] <wikibugs>	 (PS1) Cmjohnson: Adding mgmt and production dns [dns] - https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073)
[19:46:10] <wikibugs>	 (CR) Bstorm: "I will also add that, we must cache the api.cdnjs.com response because it takes forever and can fail without special handling.  That can b" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[19:46:32] <wikibugs>	 (CR) Zhuyifei1999: "What I mean is, does cdnjs-index still have to access tools-static in order to generate the index? Can't it fetch the json from the API di" [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[19:49:19] <wikibugs>	 (CR) Bstorm: "yes and yes.  After this is basically working, I can change how the frontend is generated.  I think freeing up the disk space can happen b" [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[19:51:24] <wikibugs>	 (CR) Bstorm: "In fact, I can start setting up the api call in cdnjs-index while this is in review.  :)  It is needed for this change and the present sta" [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[19:53:15] <wikibugs>	 (PS1) Ottomata: Fix typo in kafkatee.systemd.erb WantedBy [puppet/kafkatee] - https://gerrit.wikimedia.org/r/413231 (https://phabricator.wikimedia.org/T187890)
[19:53:27] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] Fix typo in kafkatee.systemd.erb WantedBy [puppet/kafkatee] - https://gerrit.wikimedia.org/r/413231 (https://phabricator.wikimedia.org/T187890) (owner: Ottomata)
[19:54:07] <wikibugs>	 (PS1) Ottomata: Update kafkatee submodule with systemd typo fix [puppet] - https://gerrit.wikimedia.org/r/413232
[19:54:16] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] Update kafkatee submodule with systemd typo fix [puppet] - https://gerrit.wikimedia.org/r/413232 (owner: Ottomata)
[19:56:13] <wikibugs>	 (CR) Zhuyifei1999: tools-static: Change to reverse proxy of cdnjs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[19:57:33] <wikibugs>	 (PS3) Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604)
[19:59:00] <icinga-wm>	 RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:00:04] <jouncebot>	 twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T2000).
[20:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[20:01:22] <wikibugs>	 (PS2) Krinkle: mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183)
[20:07:11] <wikibugs>	 (PS1) Brion VIBBER: WIP - gzip .stl files on transfer (application/sla) [puppet] - https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930)
[20:07:28] <wikibugs>	 (PS1) Ottomata: Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136)
[20:10:10] <wikibugs>	 (PS1) Andrew Bogott: labweb: remove specific memcached port [puppet] - https://gerrit.wikimedia.org/r/413239 (https://phabricator.wikimedia.org/T187506)
[20:10:34] <mutante>	 !log phab2001 - testing phab restart cron 
[20:10:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:57] <wikibugs>	 (CR) Dzahn: [C: 2] "tested and compiled  http://puppet-compiler.wmflabs.org/10072/" [puppet] - https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 20after4)
[20:11:04] <wikibugs>	 (PS2) Dzahn: Phabricator: restart apache every sunday night [puppet] - https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 20after4)
[20:12:14] <wikibugs>	 (CR) Andrew Bogott: [C: 2] labweb: remove specific memcached port [puppet] - https://gerrit.wikimedia.org/r/413239 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott)
[20:14:03] <wikibugs>	 (PS2) Ottomata: Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136)
[20:14:50] <mutante>	 wow, i cant submit it because it needs rebase but the rebase button also doesnt work.. ok..
[20:15:00] <mutante>	 that needs special timing :)
[20:16:47] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] "Looks good: https://puppet-compiler.wmflabs.org/compiler02/10074/oxygen.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata)
[20:16:54] <wikibugs>	 (CR) Ottomata: [C: 2] Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata)
[20:16:58] <wikibugs>	 (PS3) Ottomata: Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136)
[20:17:00] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] Set two webrequest kafkatee instances consuming from analytics and jumbo [puppet] - https://gerrit.wikimedia.org/r/413237 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata)
[20:21:10] <icinga-wm>	 PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:21:21] <ottomata>	 ^ me
[20:21:25] <ottomata>	 just fixed
[20:22:10] <icinga-wm>	 RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational
[20:22:27] <wikibugs>	 (PS1) 20after4: group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413240
[20:22:29] <wikibugs>	 (CR) 20after4: [C: 2] group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413240 (owner: 20after4)
[20:24:08] <wikibugs>	 Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3990747 (Andrew) This particular service is behind the misc-web varnishes.  So port 80 needs to be open to those varnishes and to lvs, but nothing else.  $DOMAIN_NETWORKS covers both those sets, s...
[20:24:17] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413240 (owner: 20after4)
[20:26:07] <wikibugs>	 (PS3) 20after4: Phabricator: restart apache every sunday night [puppet] - https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790)
[20:26:50] <mutante>	 twentyafterfour: oh, you are already fixing it :)
[20:26:57] <mutante>	 it got into a special kind of rebase trap :)
[20:27:09] <wikibugs>	 (CR) jenkins-bot: group1 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413240 (owner: 20after4)
[20:27:22] <mutante>	 i was able to submit now, thx
[20:27:33] <logmsgbot>	 !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.22
[20:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:42] <logmsgbot>	 !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.22 (duration: 01m 08s)
[20:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:45] <twentyafterfour>	 uhm
[20:30:23] <twentyafterfour>	 Database is read-only: The database has been automatically locked while the replica database servers catch up to the master.
[20:31:18] <no_justification>	 Specific wikis? 
[20:31:35] <no_justification>	 I'm guessing mostly s3 wikis cuz group1
[20:31:38] <twentyafterfour>	 commons 
[20:31:45] <no_justification>	 Ah so s4
[20:31:56] <twentyafterfour>	 also wikidata 
[20:32:11] <no_justification>	 Hmmm
[20:32:15] <twentyafterfour>	 happened immediately after group1 
[20:32:22] <twentyafterfour>	 but I don't see how it could be related?
[20:32:29] <twentyafterfour>	 I guess I should roll back anyway
[20:32:49] <twentyafterfour>	 heh, also Notice: Array to string conversion in /srv/mediawiki/php-1.31.0-wmf.21/includes/libs/rdbms/database/position/MySQLMasterPos.php on line 41
[20:33:04] <twentyafterfour>	 but hmm that's the old branch 
[20:33:17] <no_justification>	 I saw that one
[20:33:23] <no_justification>	 Roll back
[20:33:29] <no_justification>	 If it goes away....related :p
[20:34:25] <twentyafterfour>	 !log rolling back group1 to wmf.21
[20:34:35] <wikibugs>	 (PS1) 20after4: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413241
[20:34:37] <wikibugs>	 (CR) 20after4: [C: 2] group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413241 (owner: 20after4)
[20:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:46] <no_justification>	 Hold the sync for like 30 seconds
[20:34:50] <no_justification>	 Wanna check tendril
[20:34:50] <wikibugs>	 (PS1) Ottomata: Set up secondary webrequest camus job to consume from jumbo [puppet] - https://gerrit.wikimedia.org/r/413242
[20:34:51] <twentyafterfour>	 ok
[20:34:54] <wikibugs>	 (CR) Dzahn: [C: 2] "Notice: /Stage[main]/Profile::Phabricator::Main/Cron[phab_restart]/ensure: created" [puppet] - https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 20after4)
[20:35:43] <wikibugs>	 (CR) 20after4: "Thanks, dzahn!" [puppet] - https://gerrit.wikimedia.org/r/413114 (https://phabricator.wikimedia.org/T187790) (owner: 20after4)
[20:36:03] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413241 (owner: 20after4)
[20:36:18] <wikibugs>	 Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3990786 (Dzahn) restart cron has been installed on both servers
[20:36:20] <twentyafterfour>	 no_justification: ok tell me when, I'll sync
[20:36:40] <no_justification>	 Hmm, tendril reports no replag :\
[20:37:03] <wikibugs>	 (CR) jenkins-bot: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413241 (owner: 20after4)
[20:37:12] <no_justification>	 Go ahead and sync
[20:37:41] <wikibugs>	 (CR) Ottomata: "Looks good." [puppet] - https://gerrit.wikimedia.org/r/413242 (owner: Ottomata)
[20:37:42] <twentyafterfour>	 so the code that detects it is broken then?
[20:37:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:37:44] <wikibugs>	 (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10075/analytics1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/413242 (owner: Ottomata)
[20:37:46] <wikibugs>	 (CR) Ottomata: [C: 2] Set up secondary webrequest camus job to consume from jumbo [puppet] - https://gerrit.wikimedia.org/r/413242 (owner: Ottomata)
[20:37:54] <twentyafterfour>	 icinga-wm: you're too slow
[20:38:08] <logmsgbot>	 !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.21
[20:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:21] <logmsgbot>	 !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.21 (duration: 01m 12s)
[20:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:09] <twentyafterfour>	 yeah so there is _something_ wrong with that error reporting in wmf.22
[20:40:20] <twentyafterfour>	 is that code new? 
[20:40:34] <twentyafterfour>	 auto-read-only 
[20:42:49] <wikibugs>	 Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3990805 (Dzahn) @Volker_E I think the ball is in your court. As i said above , it's not a problem as long as you can get your co...
[20:43:44] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:43:45] <wikibugs>	 Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3990811 (Dzahn) please see  https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
[20:44:36] <wikibugs>	 (PS1) Ottomata: Produce webrequest_misc logs to Kafka jumbo instead of Kafka analytics [puppet] - https://gerrit.wikimedia.org/r/413243 (https://phabricator.wikimedia.org/T185136)
[20:45:06] <no_justification>	 twentyafterfour: Question for AaronSchulz
[20:47:04] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3975346 (Dzahn) I'll also start adding the "wmde" users to our "ldap_only" admins group then to avoid confusion.
[20:48:08] <wikibugs>	 Operations, vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3989586 (Dzahn) What host names do we want to use?
[20:48:26] <wikibugs>	 Operations, vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3990824 (Dzahn) (because DNS needs to exist before anything else and the VMs can be created)
[20:51:52] <wikibugs>	 (PS2) Dzahn: Gerrit: Improve registration url [puppet] - https://gerrit.wikimedia.org/r/413079 (owner: Chad)
[20:53:11] <wikibugs>	 (CR) Dzahn: [C: 2] Gerrit: Improve registration url [puppet] - https://gerrit.wikimedia.org/r/413079 (owner: Chad)
[20:53:21] <twentyafterfour>	 no_justification: yep, created a task T187942
[20:53:22] <stashbot>	 T187942: Replication lag detection broken in wmf.22 - https://phabricator.wikimedia.org/T187942
[20:53:39] <wikibugs>	 Operations, Mathoid, Prod-Kubernetes, Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3990859 (mobrovac)
[20:53:42] <twentyafterfour>	 !log MediaWiki Train for 1.31.0-wmf.22 is blocked by T187942
[20:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:55] <wikibugs>	 (CR) Dzahn: [C: 2] "deployed. restarted service on gerrit2001, but not on cobalt" [puppet] - https://gerrit.wikimedia.org/r/413079 (owner: Chad)
[20:55:33] <wikibugs>	 (PS2) Dzahn: Add two exceptions to long-running screen monitoring checks [puppet] - https://gerrit.wikimedia.org/r/412674 (owner: Muehlenhoff)
[20:55:41] <wikibugs>	 (CR) Dzahn: [C: 2] Add two exceptions to long-running screen monitoring checks [puppet] - https://gerrit.wikimedia.org/r/412674 (owner: Muehlenhoff)
[20:56:54] <wikibugs>	 Operations, vm-requests: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3990869 (Dzahn) How about "kafkamon", akin to "netmon"?
[21:00:01] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[21:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T2100).
[21:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[21:01:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[21:02:17] <subbu>	 nothing for parsoid
[21:05:18] <bearND>	 nothing for mobileapps today
[21:10:11] <wikibugs>	 (PS3) Dzahn: Add two exceptions to long-running screen monitoring checks [puppet] - https://gerrit.wikimedia.org/r/412674 (owner: Muehlenhoff)
[21:13:27] <awight>	 I’m pushing a minor ORES service update.
[21:14:34] <logmsgbot>	 !log ppchelko@tin Started deploy [restbase/deploy@56fffcf]: Do not check for article deletion for update requests T181636
[21:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:49] <stashbot>	 T181636: Content service incorrectly reports article as "deleted" - https://phabricator.wikimedia.org/T181636
[21:15:26] <logmsgbot>	 !log awight@tin Started deploy [ores/deploy@7bbf21f]: T187914 on the ores* cluster
[21:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:39] <stashbot>	 T187914: New precache endpoint isn't reporting its metrics correctly - https://phabricator.wikimedia.org/T187914
[21:20:21] <wikibugs>	 (PS4) Dzahn: Add two exceptions to long-running screen monitoring checks [puppet] - https://gerrit.wikimedia.org/r/412674 (owner: Muehlenhoff)
[21:23:44] <elukey>	 !log restart hhvm on mw1221 - high load alarms
[21:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[21:25:17] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3990946 (cwdent)
[21:25:55] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3990951 (cwdent)
[21:26:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[21:27:44] <elukey>	 !log restart hhvm on mw1227 - high load alarms
[21:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:28] <logmsgbot>	 !log awight@tin Finished deploy [ores/deploy@7bbf21f]: T187914 on the ores* cluster (duration: 13m 03s)
[21:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:40] <stashbot>	 T187914: New precache endpoint isn't reporting its metrics correctly - https://phabricator.wikimedia.org/T187914
[21:29:05] <logmsgbot>	 !log awight@tin Started deploy [ores/deploy@addba9c]: T187914 on the scb* cluster
[21:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:31] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops, Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3990975 (cwdent)
[21:30:02] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3990976 (cwdent)
[21:30:11] <elukey>	 !log restart hhvm on mw1229 - high load alarms
[21:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:32] <logmsgbot>	 !log ppchelko@tin Finished deploy [restbase/deploy@56fffcf]: Do not check for article deletion for update requests T181636 (duration: 15m 59s)
[21:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:45] <stashbot>	 T181636: Content service incorrectly reports article as "deleted" - https://phabricator.wikimedia.org/T181636
[21:30:48] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3990984 (cwdent)
[21:34:48] <elukey>	 !log restart hhvm on mw1232 - high load alarms
[21:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:07] <logmsgbot>	 !log awight@tin Finished deploy [ores/deploy@addba9c]: T187914 on the scb* cluster (duration: 10m 02s)
[21:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:22] <stashbot>	 T187914: New precache endpoint isn't reporting its metrics correctly - https://phabricator.wikimedia.org/T187914
[21:44:17] <wikibugs>	 (CR) Cdentinger: [C: -1] "@cmjohnson I was not educated about the vlans when I chose these IPs, I updated all the task descriptions with better information: T186073" [dns] - https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073) (owner: Cmjohnson)
[21:44:55] <elukey>	 !log restart hhvm on mw1233 - high load alarms
[21:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:38] <elukey>	 !log restart hhvm on mw1235 - high load alarms
[21:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:59] <wikibugs>	 Operations, DBA, Release-Engineering-Team, cloud-services-team, wikitech.wikimedia.org: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3896851 (demon) >>! In T184805#3896861, @jcrespo wrote: > Only adding #releng and #wmcs in case they can think of a reason not to move them...
[21:50:39] <elukey>	 !log restart hhvm on mw1224 - high load alarms
[21:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:29] <wikibugs>	 (PS4) Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604)
[21:53:59] <wikibugs>	 (CR) Bstorm: "The new version removes the cron to generate packages.json altogether because I've already merged the api call into cdnjs-index.  This is " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[21:58:32] <wikibugs>	 Operations, DBA, cloud-services-team, wikitech.wikimedia.org, Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3991103 (greg)
[22:04:34] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3991109 (RStallman-legalteam) Raz's NDA is fully signed. Thanks!
[22:31:04] <wikibugs>	 (CR) BBlack: [C: 1] "Looks sane, and doesn't look like any other known mimetypes happen to contain the string "sla"." [puppet] - https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) (owner: Brion VIBBER)
[22:38:47] <wikibugs>	 Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3991212 (mmodell)
[22:38:51] <wikibugs>	 Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3991211 (mmodell) Open>Resolved
[22:52:57] <wikibugs>	 Operations, HHVM, Patch-For-Review, Performance-Team (Radar): HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#3991252 (Joe) we finally tracked this down to `JpegMetadataExtractor::segmentSplitter` where a infinite loop can happen in case the jpeg is broken:  https://gith...
[23:21:46] <wikibugs>	 (PS1) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506)
[23:22:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott)
[23:26:14] <wikibugs>	 (PS2) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506)
[23:26:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott)
[23:29:14] <wikibugs>	 (PS3) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506)
[23:33:07] <wikibugs>	 Operations, ops-eqiad, netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#3991379 (ayounsi) p:Triage>Normal
[23:33:39] <wikibugs>	 (PS2) Gergő Tisza: Enable loginOnly mode for local auth provider on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420)
[23:40:44] <wikibugs>	 Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991415 (ayounsi) p:Triage>Normal
[23:42:28] <wikibugs>	 Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991441 (ayounsi)
[23:57:09] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3991466 (Dzahn) a:MoritzMuehlenhoff>Dzahn