[00:01:03] (03CR) 10Filippo Giunchedi: [C: 031] Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [00:02:11] (03CR) 10Krinkle: [C: 031] "No matches in Git, no matches in mwgrep (NS_MEDIAWIKI and NS_USER), and no requests in the last 7 days (other than my own)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [00:03:03] (03CR) 10Filippo Giunchedi: "Any reason for the actual code to live in puppet as opposed to its own repo?" [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:03:09] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:06] (03CR) 10Dzahn: Phabricator: Allow setting the mysql.user and mysql.pass in labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:04:16] (03CR) 10Addshore: "My initial thought was the number of lines are so small puppet would be easier." [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:04:31] (03CR) 10Dzahn: [C: 04-1] Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:04:33] (03CR) 10Filippo Giunchedi: "I'm not particularly happy with the "service discovery" part (the TODO) but this will do for now" [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [00:04:57] paladox: i ran the compiler, it shows no change, but there is a problem with the patch anyways [00:05:08] (03CR) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:05:18] mutante look on the left of the diff [00:06:20] hmm, i see. i wonder if this is the wrong order of things [00:06:24] might be a trap later [00:06:36] but at least there's a todo comment.. 
so yea [00:06:43] Ok [00:06:44] :) [00:07:25] (03PS8) 10Dzahn: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:07:32] (03CR) 10Krinkle: [C: 04-1] "This is labs-specific but ends up installed and running in prod, too - where we hopefully don't have something important in that similarly" [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [00:09:56] (03CR) 10Chad: [C: 032] Docroot cleanup: Remove old unused blank.gif from HTTPS tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [00:10:04] (03CR) 10Dzahn: [C: 032] "ok, no-op per compiler http://puppet-compiler.wmflabs.org/4666/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:10:22] mutante thanks :) [00:10:41] (03Merged) 10jenkins-bot: Docroot cleanup: Remove old unused blank.gif from HTTPS tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [00:11:24] (03CR) 10Filippo Giunchedi: "It is labs instances data that ends up in the graphite production instance, I can move it to the "production" subrole if that makes things" [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [00:12:12] !log demon@tin Synchronized docroot/foundation/: rm more junk (duration: 00m 45s) [00:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:35] paladox: you can go ahead with testing now [00:15:43] paladox: also.. i think https://gerrit.wikimedia.org/r/#/c/308498/ is not worth it anymore [00:16:05] (03Abandoned) 10Paladox: role/cxserver: Fix role::cxserver not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/308498 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [00:16:09] mutante ok [00:16:10] thanks [00:16:11] paladox: since all of manifests/role/ moved, that is fixed for all besides mariadb and eventlogging now [00:16:50] well.. eh. that said, i dunno about the labs and beta part that you were changing there [00:16:58] maybe that is still needed [00:16:59] anything i can do to help remove junk in docroot? [00:18:29] mutante twentyafterfour i am going to run bin/storage upgrade manually [00:18:34] otherwise the tables wont be created [00:18:41] and then it will all fail [00:18:54] i doint event think we issue that command through puppet [00:18:59] so i am runnign it manually [00:19:09] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:19:25] (03CR) 10Filippo Giunchedi: "IMO yes this code should live outside puppet." [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:20:02] paladox: ok, good question if that is puppetized [00:20:08] I doint think so [00:20:19] mutante also i am going to submit another patch for mysql [00:20:29] it got me furthur, but i also need it set in local.config [00:21:06] (03CR) 10Addshore: "ack." 
[puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:21:09] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [00:21:37] (03PS1) 10Hoo man: Use entity types for the repoNamespaces Wikibase client setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323347 [00:27:23] (03PS1) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:27:41] (03PS2) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:27:52] mutante ^^ could you review that please and run the puppet complier please? [00:28:02] !log reedy@tin Synchronized php-1.29.0-wmf.3/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: Some perf related improvements (duration: 00m 45s) [00:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:50] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 (owner: 10Paladox) [00:30:55] Xhgui is always timing out for me :S [00:31:22] (03PS3) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:32:09] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:29] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:42:10] (03CR) 10Krinkle: graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [00:49:08] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2819801 (10Dzahn) Yes, sounds good. [00:50:04] (03CR) 10Dzahn: [C: 04-1] "compiler says there is a change" [puppet] - 10https://gerrit.wikimedia.org/r/323349 (owner: 10Paladox) [00:50:31] paladox: tested, but there is a diff [00:51:51] mutante oh [00:59:14] (03PS4) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:59:19] mutante ^^ [00:59:26] could you re run puppet compiler please? 
[00:59:31] I hope that fixed it [01:03:04] (03PS1) 10BryanDavis: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) [01:03:07] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 (owner: 10Paladox) [01:03:29] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:04:10] (03PS2) 10BryanDavis: Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [01:04:38] (03PS5) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [01:05:09] (03CR) 10BryanDavis: "Prerequisite for I36b73528ca66da5138e7ac7110ab450eeee5a466" [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [01:05:56] (03PS2) 10BryanDavis: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 [01:05:58] (03PS2) 10BryanDavis: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) [01:10:24] (03CR) 10Krinkle: [C: 031] logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [01:27:37] bblack: hi! when you wrote above ^ Special:Banner, you meant, Special:BannerLoader, right? [01:27:51] How hard would it be to change that to make it not client-side cache? [01:29:33] AndyRussG: I meant all URL's matching Special:Banner.* [01:29:58] AndyRussG: in the code it's: if (req.url !~ "^/(wiki/|(w/index\.php)?\?title=)Special:Banner") { [01:30:54] (which excludes those URLs from having the appserver's Cache-Control header replaced by "private, s-maxage=0, max-age=0, must-revalidate" for user-agent consumption, which is what we do for wiki articles to avoid client-side caching) [01:31:40] it's functionally easy to remove or modify that exclusion, but I have to wonder why we made it in the first place... 
[01:31:52] bblack: quoting ejegg just now in #wikimedia-fundraising: [01:31:54] 19:28 AndyRussG: Here's what I get: cache-control:public, s-maxage=600, max-age=0 [01:31:56] 19:28 that's anonymous on the live site [01:31:58] 19:28 Hmmm [01:32:00] 19:28 And what does that mean [01:32:02] 19:28 < *> AndyRussG hides ignorance under carpet [01:32:04] 19:29 max-age is for private (ie browser) caches [01:32:06] 19:29 s-maxage is for shared caches (ie varnish) [01:32:08] 19:29 and that's what we're outputtingg manually with those header() calls [01:32:10] 19:29 so the browser will always re-request [01:32:37] bblack: K I guess we need to look into it carefully [01:33:08] yes, that's true, I wasn't paying attention to the shared-vs-nonshared [01:33:28] bblack: aha, so there's a general rule that messes with the cache-control header [01:33:37] but in any case, the clause applies to all Special:Banner.*, I don't even know what all of those are, or how they all behave today [01:33:38] but that rule doesn't apply to Special:Banner* pages [01:33:43] yes [01:33:58] so they all retain the headers that we set manually [01:34:02] yes [01:34:17] although I wouldn't make firm bets about how all UAs interpret them [01:34:18] cool, that's what we want, since we do have a bit of special cache header logic in there [01:34:24] bblack: ejegg: isn't that a regex? So it would apply to Special:BannerLoader? [01:34:31] yes, it would [01:34:40] Hmmm K [01:34:59] AndyRussG: yep! so BannerLoader is excluded from the logic that nukes the cache-control header sent from PHP [01:35:04] right [01:35:06] which is what we want [01:35:17] but why do we want that? [01:35:44] without that exclusion, varnish would still operate the same, and users wouldn't be allowed to cache, which seems to be what you want with max-age=0 anyways [01:35:58] it makes the user no-cache bit a bit more explicit and clear though [01:36:00] mutante :) [01:36:01] ohhhhh right, we're just outputting those custom headers for varnish's sake [01:36:32] so if we took that exception away, varnish would still respect our custom s-maxage [01:36:35] but at some point in the past, someone made that specific rule in varnish to let them through, for some past reason [01:36:40] yes [01:36:41] but nothing downstream would cache it [01:36:45] yes [01:36:58] right, OK, I guess we /don't/ want that behavior [01:37:08] sorry for the confusion [01:37:15] at least for this case [01:37:24] what about all other cases for Special:Banner.* [01:37:52] Yeah, I can't think of anywhere we want downstream caches to hang onto the banner content [01:38:18] I mean, it's https all the way, so I'd guess it almost never matters [01:38:38] (keep in mind, for all practical purposes the only downstream cache we care about is the UA) [01:38:56] there are exceptions: authorized-by-the-client TLS proxy-caches, e.g. on corporate networks [01:38:58] but I think it's fair to ignore them for policy purposes and only consider the case of a single UA [01:39:15] right, so the single UA is supposed to use maxage and not s-maxage [01:39:25] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2819845 (10Krenair) So now we have: ```alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (production)$ git grep -E "class .*::beta" modules/mediawiki/ma... 
[01:39:28] in theory yes [01:39:31] so except for TSL proxy caches, we're still seeing the right behavior [01:39:47] I wouldn't put it past someone to say "hey if it's publicly cacheable, why can't the browser cache it too?" [01:40:03] in some logical sense, it doesn't make sense to have s-maxage > maxage [01:40:08] (in public headers) [01:41:55] yeah, I'd say we don't need that rule, but it's not going to hurt except in TLS proxies and non-w3c-compliant UAs [01:42:32] 'The s-maxage directive is always ignored by a private cache. ' https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.3 [01:43:14] So let's not change it now, but let's look again and clean it up in January [01:43:50] k, I have to head out [01:45:01] ejegg|away: thx! [01:47:02] bblack: thx! I am sadly a bit distracted but will dig also check this out a bit more.... Do you by chance have a quick link/pointer to the code in question pls? thx!! :) [01:48:20] WRT to the varnish quick-purge switch u mentioned earlier, do u think we could design something 4 that, and maybe test it quickly Mondayish? On that point, do u have any example code to mebbe use as a starting point? [01:48:44] Sorry to be all bomshell over here ;p [01:48:49] bombshell [01:48:56] no that's not the right word, is it? [01:49:13] Mmmm urgent attack of last minute past deadline arrrrg-ishness [02:14:36] AndyRussG: there's an existing script deployers can use to purge specific URLs that's already well-tested. I'm just suggesting we can build on that and have a little wrapper ready to go that purges all the relevant banner-related URLs [02:15:34] bblack: ah great.... mmm haz links? [02:16:40] not offhand, no [02:16:51] it's part of the mwScript stuff [02:17:14] we can look around for it on Monday [02:17:40] purgeList.php? [02:18:00] yeah that [02:18:06] you pipe in URLs and it sends purges for them IIRC [02:19:25] mwscript itself is nothing special [02:20:43] it just gets the correct path to use and runs php via sudo where appropriate [02:21:30] the multiversion stuff it runs as a php entry point does some special things but ultimately it just results in the running of a normal MW maintenance script [02:23:25] (the sudo is to ensure it runs under the lower-privileged web server user rather than with deployment rights etc.) [02:29:55] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 10m 39s) [02:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 24 02:35:10 UTC 2016 (duration 5m 15s) [02:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:31] bblack: Krenair thx! [02:54:38] back a bit later! [02:59:02] (03CR) 10Gergő Tisza: "Shouldn't it be the other way around? Deploy I36b73528ca66da5138e7ac7110ab450eeee5a466, deploy this (but also get rid of the exception-jso" [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [03:21:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 814.98 seconds [03:24:43] (03CR) 10Krinkle: "Wouldn't a 401 with auth cause the authentication prompt to be shown to the end-user upon viewing the HTML page that embeds the image? 
Tha" [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [03:30:00] (03CR) 10Krinkle: "nvm :)" [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [03:42:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 243.95 seconds [04:10:39] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=728.80 Read Requests/Sec=1469.30 Write Requests/Sec=10.10 KBytes Read/Sec=42218.40 KBytes_Written/Sec=452.00 [04:16:39] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.70 Read Requests/Sec=2.40 Write Requests/Sec=26.40 KBytes Read/Sec=9.60 KBytes_Written/Sec=189.60 [04:27:59] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:52:09] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:55:59] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [05:21:09] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:23:16] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (10demon) Just came across `beta::saltmaster::tools`. It doesn't even appear used... [05:26:40] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820019 (10Krenair) It's used on deployment-salt02: https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-salt02 [05:27:46] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820032 (10demon) Do we even need salt in beta? ;-) [05:31:37] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820037 (10Krenair) That's way out of the scope of this task, but yes, I have used it many times [06:03:59] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:12:49] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:31:59] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:41:49] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:55:36] 06Operations, 10Wikimedia-General-or-Unknown, 07Availability, 13Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2820097 (10Nemo_bis) [07:32:04] !log Stopping MySQL db2070 for maintenance - https://phabricator.wikimedia.org/T149553 [07:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:18] (03PS1) 10Tim Landscheidt: Tools: Remove temporary class role::toollabs::merlbot_proxy [puppet] - 10https://gerrit.wikimedia.org/r/323373 [08:15:13] <_joe_> !log uploaded calico-cni 1.5.1 to jessie-wikimedia [08:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:42] (03CR) 10Yuvipanda: [C: 032] Tools: Remove temporary class role::toollabs::merlbot_proxy [puppet] - 10https://gerrit.wikimedia.org/r/323373 (owner: 10Tim Landscheidt) [08:24:37] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Technical-Debt, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2820182 (10hashar) [08:24:39] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820179 (10hashar) 05Open>03Resolved a:03hashar >>! In T86644#2820032, @demon wrote: > Do we even need salt in beta? ;-) When you get 40+ instances. Yes d... [08:25:29] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [08:27:29] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [16.0] [08:29:02] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Technical-Debt, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2820188 (10Krenair) [08:29:04] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820185 (10Krenair) 05Resolved>03Open a:05hashar>03None This task is not complete. I listed 7 existing cases above. [08:34:54] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820196 (10hashar) Sorry made a mistake when replying :D [08:59:46] !log Deploy alter table S5 - dewiki.revision on db1092 (depooled) - T148967 [08:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:58] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [09:03:28] (03CR) 10Elukey: "Added the patch on deployment-prep if people want to test it. Quick test:" [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [09:03:57] marostegui: morning :P [09:04:16] elukey hahaha morning [09:12:49] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:25:09] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820282 (10Joe) 05Open>03Invalid [09:29:55] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820296 (10Joe) This is a huge discussion to have, and would need a ton of auditing. Basically: - We return 200 for *... [09:30:53] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "See comments on the task." [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [09:38:13] !log Stopping replication db1052 (depooled) for maintenance - T150960 [09:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:25] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960 [09:41:49] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:57:31] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820406 (10elukey) 05Invalid>03Open [10:13:15] <_joe_> !log running commonswiki htmlCacheUpdate jobs on terbium to catch up with the backlog, monitoring caches for vhtcpd queue overflows T151196 [10:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:27] T151196: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196 [10:16:19] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:30] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820445 (10elukey) Re-opening the task after a chat with Joe, let's find a solution for this issue :) What are the sc... [10:17:48] 06Operations, 10Wikimedia-General-or-Unknown, 07Availability, 13Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2820447 (10Joe) So, I solved the "mistery": since I last checked, we're throttling htmlCacheUpdates on the jobrunners... [10:23:54] 06Operations, 13Patch-For-Review: Prometheus cronspam - https://phabricator.wikimedia.org/T151149#2820450 (10elukey) thanks @fgiunchedi !! 
[10:40:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "First pass seems quite ok, comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [10:45:19] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:48:24] 06Operations: silver: / partition low on space - https://phabricator.wikimedia.org/T151493#2820531 (10Peachey88) [10:48:43] 06Operations: silver: /dev/md2 mounted twice - https://phabricator.wikimedia.org/T151489#2820532 (10Peachey88) [10:49:35] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820534 (10Peachey88) [10:50:09] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:52:37] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2819072 (10jcrespo) There is no /srv partition on silver. probably it should check /a instead of / ? [10:56:49] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820552 (10jcrespo) The MariaDB disk space check is a legacy of the past- there should be only one disk check, and the critical (and warning) level should be higher for database... [11:11:46] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820601 (10Volans) @jcrespo see T151489, there is a `/srv` mount point as well as `/a` mount point, they both mount the same partition! [11:12:19] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:19:10] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:26:19] 06Operations, 06Performance-Team, 10Thumbor: Investigate source of thumbnail 302 redirects - https://phabricator.wikimedia.org/T148410#2820634 (10Gilles) [11:40:19] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:41:26] !log Killed the Wikidata JSON dump creation on snapshot1007: Wont succeed before Monday, due to T151356 [11:41:37] apergos: FYI ^ [11:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:38] T151356: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356 [11:41:51] okey dokey [11:42:26] and there's the cron error mail too [11:43:11] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2820680 (10jcrespo) p:05High>03Normal [11:45:36] (03PS3) 10Mobrovac: Trending Edits: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) [11:48:07] (03CR) 10Mobrovac: Trending Edits: Role and module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [11:56:05] 06Operations, 10DBA, 13Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2820707 (10jcrespo) p:05Triage>03Normal [12:00:34] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2820741 (10Addshore) @akosiaris is there any way to get this expedited? (mentioning you as you comple... [12:10:09] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:24] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. merging" [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [12:11:09] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:59] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:12:09] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [12:15:09] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [12:15:09] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [12:16:23] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:05] hm thumbor issues [12:17:09] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [12:17:45] akosiaris: I can log into thumbor1001, is it considered down because of a health check? [12:18:27] gilles: it is not considered down as a host, the service is that it doesn't look to be ok [12:19:08] akosiaris: do you know where the service health check is defined? 
I wonder what it looks at [12:19:13] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.359 second response time [12:19:21] hmm recovered [12:20:11] gilles: yes, https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/hieradata/common/lvs/configuration.yaml;3e0cd4eac8c03eb88ab3a1fdd2bf992387828409$965 [12:20:27] practically tries to fetch /healthcheck over http on port 8800 [12:20:31] thanks [12:21:19] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[trending-edits/deploy] [12:21:41] https://grafana.wikimedia.org/dashboard/db/thumbor?from=now-30m&to=now a spike in 599 it does look like something was up [12:22:13] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [12:22:43] ok so it's flapping, I 'll schedule a 1 hour downtime on icinga while we investigate [12:23:09] so that we don't page every ops member on this cellphone on every flap [12:23:21] the load is definitely high on both boxes [12:23:50] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2820787 (10Shoichi) p:05Triage>03High >>! In T144805#2789485, @Shoichi wrote: > Months ago, I can log in,but it also happen to me. I don't know rember I had set two-factor authentication or no... [12:24:09] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [12:24:38] gilles: very heavily into IOwait as it seems [12:24:50] something in thumbor is doing way too many IOPS [12:25:21] starting 25 mins ago [12:25:24] any offending instance in particular? [12:26:09] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [12:26:46] I can look at what python is doing with manhole if we can figure out which process(es) in particular are doing it [12:27:35] can't see any long running process ... 
they seem to be going through their lifecycle normally [12:27:51] first time I see this error, I'll check if it's new: [12:27:52] Nov 24 12:27:18 thumbor1002 thumbor@8811[89821]: 2016-11-24 12:27:18,634 8811 thumbor:ERROR [ExiftoolRunner] error: 'Deep recursion on subroutine "Image::ExifTool::ProcessDirectory" at /usr/share/perl5/Image/ExifTool/Exif.pm line 4532.\nDeep recursion on subroutine "Image::ExifTool::Exif::ProcessExif" at /usr/share/perl5/Image/ExifTool.pm line 6085.\n' [12:28:03] ah, it could be related [12:28:17] some of the processes consuming CPU are exiftool [12:28:23] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 8.888 second response time [12:28:27] they are not in the top 5, but still [12:28:29] could be related [12:29:17] yep, that error started happening at Nov 24 12:06:26 [12:29:31] ok almost definitely related then [12:30:43] could be a side-effect of the real issue, though [12:31:54] almost OOM killer has been invoked multiple times in the last 30 mins [12:31:58] also* [12:32:11] that happens continuously, thumbor is leaking memory at the moment [12:32:23] 1000-1500 OOM kills per day if I recall correctly [12:32:30] no, it's more promiment now [12:32:35] oh ok [12:33:08] 157 on one host in 30 mins [12:36:09] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [12:36:45] the deep recursion thing is still happening but I can't tell which request triggered it, I'm going to need to live-hack the thumbor code to get more information [12:36:51] !log launched preferred-replica-election to re-add kafka1022 among the Topic partition leader brokers of the Analytics Kafka cluster (all metrics looks good) [12:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:10] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [12:38:09] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:38:39] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[trending-edits/deploy] [12:42:10] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [12:43:09] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [12:43:12] akosiaris: can you make /usr/lib/python2.7/dist-packages/wikimedia_thumbor/exiftool_runner/__init__.py world-writable? [12:43:42] that's where I'll put my live hack to see if it's alwayd the same offending original or something like that [12:46:56] gilles: done. on thumbor1001 [12:47:03] thanks [12:47:05] I 'll also lower the weight of this host a bit [12:47:20] should give a bit more breathing room while we debug [12:47:36] thumbor1002 is not going to be happy, but still... 
[12:48:05] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [12:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:28] I'm going to restart the thumbor instances on thumbor1001 to make the live hack active [12:48:53] !log restarting thumbor on thumbor1001 [12:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:56] !log lower thumbor1001 load by 50% to easy debugging [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:15] so, all thumbor processes are writing to the disk about 1.5MB/s.. total is around 30MB/s [12:53:28] with spikes around 50MB/s [12:53:31] which is not much [12:54:21] !log restarting thumbor on thumbor1001 [12:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:24] of course now that I want to catch the recursion thing it's not happening anymore [13:00:40] it's still happening on thumbor1002, though [13:00:56] either we're unlucky and all those requests are going to thumbor1002 or it was a symptom and not necesserraly the cause [13:01:21] it's not super frequent, though [13:01:36] but it suggests recursing in something that does IO, so... [13:03:32] let's wait a bit more, with something that happens once per minute on average or so, with the new weights it wouldn't be impossible for thumbor1002 to get them all by chance [13:04:22] I can just revert the weight change [13:04:36] or even push more traffic to thumbor1001 [13:04:41] right, please try that, or even skew it the opposite way [13:04:51] !log akosiaris@puppetmaster1001 conftool action : set/weight=20; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [13:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:13] ok, now thumbor1001 should start getting twice the requests thumbor1002 gets [13:07:50] and yet it seems to have no problem now [13:08:02] what on earth ? [13:08:44] right [13:09:09] I can't have the error reoccur on thumbor1001 but it still happens occasionally on thumbor1002 [13:09:25] could it have been a simple spike whose backlog is just still being served on thumbor1002 ? [13:10:16] is thumbor1002 fine now? [13:10:23] no, not really [13:10:31] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thumbor1002&var-network=bond0 [13:10:36] but it is getting better [13:10:54] user cpu % is going up and iowait is slowly becoming less [13:11:28] the load average is definitely dropping, so less work is being scheduled on that box, but that's to be expected [13:11:32] oh, but the disk is full [13:11:40] that might be what's causing all the rest [13:11:50] ? [13:12:09] which one ? I see plenty of space on both [13:12:19] oh no sorry, it's disk utilisation [13:12:35] interesting [13:13:14] restarting thumbor definitely stopped the issue on thumbor1001 but we clearly lost the ability to debug the problem there [13:13:57] can you make the same file writable on thumbor1002? I can just wait for the processes to restart "naturally" due to OOMs [13:14:04] yes [13:14:44] done [13:19:59] akosiaris: can you shift the traffic back to thumbor1002? 
[13:20:52] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [13:20:57] gilles: done [13:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:10] it's quite weird what's going on though [13:21:26] processes are going through their lifecycle quite fine and a restart just fixed it ? [13:21:36] those 2 don't add up very well [13:22:05] right [13:22:09] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:22:25] ah, here's the first time I had a long running process [13:22:29] not very long [13:22:37] but an rsvg-convert stood out for a while [13:22:50] whereas the typical is like a second [13:22:55] that's sort of expected, we get a lot of junk SVGs [13:23:05] rsvg-convert won't run for more than a minute [13:23:24] :( and I though I finally had a culprit [13:23:46] so, memory doesn't seem to be the issue, thumbor1001 was never close to the total memory limit [13:24:18] I can't really tell if it was a huge spike in requests, because we track when they complete (failure or success), not when they come in [13:24:45] to be improved, obviously, at the nginx level would give us the real picture [13:25:19] iowait spiked, but it was much higher earlier this morning on thumbor1001 [13:25:56] even the load was really high earlier today [13:26:20] indeed [13:26:35] the failure rate seems to have dropped on its own on thumbor1002... [13:26:41] yes [13:26:46] load as well, as well as cpu usage [13:27:07] not sure if it was something we did [13:27:13] but we didn't do anything other than shift traffic. do you think it just gave it "breathing room"? 
[13:27:30] it's a plausible explanation [13:27:58] assuming the spike had practically ended by the time we started shifting traffic around [13:28:00] the recursion error is not happening at all on either box [13:28:13] shame to have been unable to know if it was a particular image [13:28:24] I suppose we will see it again [13:28:52] right, I'll make a task to add url request context to as many thumbor errors as possible for future incidents like this [13:28:57] it's clear from the graphs that it has happened multiple times in the previous days [13:29:09] just not that strongly to emit an alert [13:29:22] right [13:31:28] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [13:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:49] !log balance the load between thumbor1001 and thumbor1002 evenly [13:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:36] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2820911 (10Gilles) a:05fgiunchedi>03None [13:43:47] 06Operations, 06Performance-Team, 10Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#2820915 (10Gilles) [13:46:47] 06Operations, 06Performance-Team, 10Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#2820932 (10Gilles) [13:49:17] 06Operations, 10Icinga: register a nickserv account for icinga-wm - https://phabricator.wikimedia.org/T22771#2820946 (10hashar) 05Open>03declined The system we use to relay notifications to IRC does not support nick registration (T48254) and I have declined that task. icinga-wm is one of the few bots tha... [13:50:09] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:50:12] 06Operations, 10Icinga: register a nickserv account for icinga-wm - https://phabricator.wikimedia.org/T22771#2820953 (10hashar) [13:50:14] 06Operations, 10IRCecho: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254#503055 (10hashar) 05Open>03declined That is solely for icinga-wm and it works just fine without nick registration. I dont think anyone will ever add support for nick registration to ircecho, bu... [13:50:53] what is this trending-edits scap failure, anyone knows? [13:51:26] I will investigate if nobody knows about it [13:52:22] 06Operations, 06Performance-Team, 10Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#2820957 (10Gilles) @fgiunchedi what do you think of using something like https://github.com/zebrafishlabs/nginx-statsd or https://github.com/knyar/nginx-lua-promethe... [13:54:29] 06Operations, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: setup a DB backed parser cache - https://phabricator.wikimedia.org/T55457#2820963 (10hashar) [13:55:29] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:58] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2820998 (10Gilles) Thanks, Pythonic Santa! 
[14:04:36] (03PS2) 10KartikMistry: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) [14:09:11] 06Operations, 10Traffic: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821095 (10ema) [14:09:23] 06Operations, 10Traffic: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821110 (10ema) p:05Triage>03Normal [14:10:56] 06Operations, 10Traffic: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821095 (10ema) [14:24:29] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:26:07] 06Operations, 10Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs on cp1055 - https://phabricator.wikimedia.org/T151563#2821164 (10elukey) [14:29:53] 06Operations, 10Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs on cp1055 - https://phabricator.wikimedia.org/T151563#2821164 (10ema) We might have to increase workspace_backend: https://github.com/varnishcache/varnish-cache/issues/1990 [14:30:19] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2821200 (10Gilles) [14:33:43] 06Operations, 10Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2821211 (10elukey) [14:47:07] !log uploaded varnishkafka 1.0.12-1 to carbon main component, replacing version 1.0.7-1 (T150660) [14:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:22] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660 [14:49:08] long live to Varnish 4 [14:59:14] 06Operations, 06Performance-Team, 10Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821235 (10Gilles) [14:59:41] 06Operations, 06Performance-Team, 10Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821235 (10Gilles) [15:02:31] 06Operations, 06Performance-Team, 10Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821252 (10Gilles) @fgiunchedi would adding a minimum version for gifsicle in the python-thumbor-wikimedia package be enough? Or are more steps required to have the jess... 
[15:03:40] !log uploaded varnish 4.1.3-1wm4 to carbon main component, replacing version 3.0.6plus-wm9 (T150660) [15:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:51] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660 [15:04:35] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2821257 (10ema) [15:04:38] 06Operations, 10Traffic, 13Patch-For-Review: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660#2821256 (10ema) 05Open>03Resolved [15:06:57] 06Operations, 10Traffic, 13Patch-For-Review: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503#2821259 (10ema) 05Open>03Resolved a:03ema [15:06:59] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2821261 (10ema) [15:07:32] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/4669/stat1002.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [15:07:39] 06Operations, 10Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2821263 (10ema) [15:07:42] 06Operations, 10Traffic, 13Patch-For-Review: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2821264 (10ema) [15:07:46] 06Operations, 10MediaWiki-API, 10Traffic: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867#2821265 (10ema) [15:07:46] 06Operations, 10Traffic, 07HTTPS, 05codfw-rollout: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325#2821267 (10ema) [15:07:51] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10ema) 05Open>03Resolved [15:08:34] (03PS2) 10Gehel: Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) [15:10:09] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:54] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/4668/ is way too good to be true..." [puppet] - 10https://gerrit.wikimedia.org/r/322898 (owner: 10Alexandros Kosiaris) [15:17:59] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2821275 (10akosiaris) [15:18:05] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2821272 (10akosiaris) 05Open>03Resolved a:03akosiaris This seems to have fallen between the cra... [15:20:15] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2821283 (10ema) [15:20:53] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2821284 (10Addshore) Thanks @akosiaris ! 
[15:21:23] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2821287 (10Addshore) [15:21:27] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2821285 (10Addshore) 05Open>03Resolved Now appearing in grafana [15:22:29] 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2821290 (10ema) We've proposed a patch introducing a varnishd parameter limiting the number of extrachance retries: https://github.com/varnishcache/varnish... [15:23:25] 06Operations, 10Traffic, 13Patch-For-Review: varnish-backend: weekly cron restart for all clusters - https://phabricator.wikimedia.org/T149784#2821291 (10ema) 05Open>03Resolved [15:26:56] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2821292 (10ema) 05Open>03Resolved >>! In T148412#2781980, @elukey wrote: > Last but not the least, no alarms were fired for uploa... [15:31:21] (03PS1) 10Gehel: Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323399 (https://phabricator.wikimedia.org/T149722) [15:32:06] (03CR) 10jenkins-bot: [V: 04-1] Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323399 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [15:34:09] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2821313 (10Gilles) The second example is actually the same issue manifesting itself, which should be resolved by the gifsicle upgrade. [15:35:53] (03PS1) 10Gehel: Revert "Add 'discovery-stats' technical user to the 'stats' group." [puppet] - 10https://gerrit.wikimedia.org/r/323400 [15:36:49] (03CR) 10Gehel: "Adding the "stats" group directly to the "discovery-stats" user conflicts with the way the admin module works." [puppet] - 10https://gerrit.wikimedia.org/r/323400 (owner: 10Gehel) [15:36:54] (03CR) 10Gehel: [C: 032] Revert "Add 'discovery-stats' technical user to the 'stats' group." [puppet] - 10https://gerrit.wikimedia.org/r/323400 (owner: 10Gehel) [15:38:13] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2821323 (10Gilles) [15:39:09] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:00:08] (03PS1) 10Gilles: Nginx timeout should be higher than thumbor subprocess timeout [puppet] - 10https://gerrit.wikimedia.org/r/323403 (https://phabricator.wikimedia.org/T151459) [16:18:03] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821410 (10Gilles) @fgiunchedi I've actually discovered that mediawiki *does* download the original before erroring that way. It's... [16:44:22] To all the best WMF operations team from My family, Happy Thanksgiving, I hope you have a great day, Ill be around as much as possible. WMF wouldnt be the same without everyone of you. 
[17:13:16] (03CR) 10Alexandros Kosiaris: [C: 032] Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - 10https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505) (owner: 10Alex Monk) [17:13:20] (03PS2) 10Alexandros Kosiaris: Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - 10https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505) (owner: 10Alex Monk) [17:13:22] (03CR) 10Alexandros Kosiaris: [V: 032] Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - 10https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505) (owner: 10Alex Monk) [17:14:09] RECOVERY - Restbase root url on restbase2012 is OK: HTTP OK: HTTP/1.1 200 - 15450 bytes in 0.104 second response time [17:15:39] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:16:20] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:25:48] <_joe_> !log turned off additional workers for htmlcacheupdate on commonswiki as the queue has reduced to acceptable sizes (T151196) [17:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:00] T151196: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196 [17:29:59] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.63 seconds [17:30:09] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 614.38 seconds [17:32:54] Isen't that ^^ phabricator? [17:32:57] that could be me [17:33:05] oh [17:33:09] I have sent some extra backup processes [17:33:16] to the slaves [17:33:20] Oh [17:33:23] no impact to the masters [17:33:35] ok [17:34:29] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.98 seconds [17:34:55] yeah, that has some issues I should check, but no production impact [17:36:09] jynus backups are good we love them [17:36:23] yes, we indeed do [17:40:57] I do not know why they cause lag on m3 and not on the other hosts [17:52:49] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821604 (10Gilles) Actually it's more complicated than that... I need to double check whether it does load from swift or not, it s... [18:00:52] (03PS1) 10Jcrespo: Add temporary workaround --skip-ssl until unified cert authority [puppet] - 10https://gerrit.wikimedia.org/r/323420 [18:00:59] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:01:09] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [18:01:48] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821658 (10Gilles) Actual size check: https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/media/Transformational... 
[18:02:24] (PS2) Jcrespo: Add temporary workaround --skip-ssl until unified cert authority [puppet] - https://gerrit.wikimedia.org/r/323420
[18:02:29] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[18:10:33] Operations, Performance-Team, Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821703 (fgiunchedi) @gilles we already have `jessie-backports` enabled in production so the minimum version in python-thumbor-wikimedia should DTRT
[18:13:59] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2821720 (Gilles) I think it might be just a little misunderstanding. If you're talking about describing to the client what kind of media the original...
[18:14:25] (PS1) Ema: varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660)
[18:14:37] (CR) Jcrespo: [C: +2] Add temporary workaround --skip-ssl until unified cert authority [puppet] - https://gerrit.wikimedia.org/r/323420 (owner: Jcrespo)
[18:15:47] Operations, Performance-Team, Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2821739 (Gilles) Yeah, I'm not sure that we rate-limit 404s actually. I'll make sure to do the right thing in Thumbor in regards to that.
[18:16:01] (PS2) Ema: varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660)
[18:23:41] Operations, Performance-Team, Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#2821772 (fgiunchedi) @gilles AFAIK there's no precedent like that no, it would be useful though for other places where we have nginx deployed and want to gain more...
[18:24:50] (PS1) Jcrespo: Avoid --dump-slave when performing backups [puppet] - https://gerrit.wikimedia.org/r/323427
[18:35:56] Operations, Monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579#2821787 (jcrespo)
[18:40:30] Operations, Monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579#2821834 (jcrespo)
[18:42:56] Operations, Performance-Team, Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821844 (fgiunchedi) @Gilles good question, I think for now the simplest thing is to treat multipage documents as exceptions and...
[18:43:19] (CR) Jcrespo: [C: +2] Avoid --dump-slave when performing backups [puppet] - https://gerrit.wikimedia.org/r/323427 (owner: Jcrespo)
[18:47:58] Operations, Performance-Team, Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821863 (Gilles) I think we can make storing that data very efficient by indexing by size. It's indeed unlikely that every singl...
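Two of the patches above tweak how mysqldump is invoked for backups: r/323420 adds --skip-ssl as a stopgap until a unified cert authority exists, and r/323427 drops --dump-slave, a flag that pauses the slave SQL thread for the duration of the dump (a plausible contributor to the m3 lag seen earlier). A sketch of how these flags fit into an invocation, with placeholder host, database, and output path:

    # Sketch of a mysqldump invocation using the flags discussed above.
    # Host, database, and paths are placeholder values, not the real config.
    import subprocess

    def dump_command(host, db):
        return [
            "mysqldump",
            "--host", host,
            "--single-transaction",  # consistent InnoDB snapshot, no locking
            "--skip-ssl",            # temporary workaround from r/323420
            # note: no --dump-slave; per r/323427 it is avoided, since it
            # stops the slave SQL thread while the dump runs
            db,
        ]

    cmd = dump_command("db1048.eqiad.wmnet", "phabricator_maniphest")
    with open("/srv/backups/m3.sql", "wb") as fh:
        subprocess.run(cmd, stdout=fh, check=True)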
[19:05:25] Operations, Discovery, Wikimedia-Apache-configuration, Mobile, Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2821887 (Jdlrobson)
[19:05:56] Operations, Discovery, Wikimedia-Apache-configuration, Mobile, Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#730151 (Jdlrobson) Given this would probably redirect to https://www.wikipedia.org/ probably something of con...
[19:10:59] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:35:09] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 2 minutes ago with 17 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils]
[19:37:20] jouncebot now
[19:37:20] No deployments scheduled for the next 90 hour(s) and 22 minute(s)
[19:39:59] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:43:18] Operations, Performance-Team, Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821974 (fgiunchedi) @Gilles yeah that sounds good! Swift header limit I think is 8k by default, so we should be fairly safe the...
[19:45:59] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:50:57] Puppet, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, and 2 others: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2821982 (hashar) Open>Resolved Thanks @krenair and @akosiaris
[20:06:40] Operations, Puppet, Documentation, Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2821997 (Florian) Ok, published: https://codein.withgoogle.com/dashboard/tasks/5667832180244480/ :) I was free and wrote that the studen...
[20:13:59] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[20:25:09] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:25:22] (CR) Florianschmidtwelzow: [C: +1] Do not send 'exception-json' channel to logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: Gergő Tisza)
[20:54:09] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[21:49:59] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:13:19] (PS1) Gergő Tisza: Use custom LogstashFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/323492 (https://phabricator.wikimedia.org/T145133)
[22:16:05] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2785082 (fgiunchedi) I took a look at this too and couldn't understand why the override d...
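On the T150741 thread running through this log: the core of "reject thumbnail requests that are the same size as the original or bigger" is a plain dimension comparison, though, as fgiunchedi notes above, multipage documents would need to be treated as exceptions. A toy predicate, with invented names rather than Thumbor's real handler API:

    # Hedged sketch of the size guard discussed in T150741: refuse to render
    # a "thumbnail" that would not actually be smaller than the original.
    # Function and parameter names are illustrative, not Thumbor's real API.
    def should_reject(requested_width, requested_height,
                      original_width, original_height):
        """Return True when the request is not actually a downscale."""
        return (requested_width >= original_width or
                requested_height >= original_height)

    assert should_reject(2048, 1536, 1024, 768) is True   # upscale: reject
    assert should_reject(1024, 768, 1024, 768) is True    # same size: reject
    assert should_reject(320, 240, 1024, 768) is False    # genuine thumbnail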
[22:17:59] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[22:33:09] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[22:47:20] (CR) Ppchelko: PDF Render: Create the service's admin group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[23:15:22] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2822200 (aaron) Related (and tricky) task is T123815, where the backoff times are waaay too pessimistic. The X/sec l...
[23:53:41] (PS6) Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - https://gerrit.wikimedia.org/r/323349
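Regarding aaron's point at [23:15:22] that the job-queue backoff times are too pessimistic: with a doubling backoff, a handful of consecutive failures is enough to push the retry delay toward whatever cap is configured, which is why the constants matter so much. A toy illustration with invented constants, not MediaWiki's actual job-queue parameters:

    # Toy exponential backoff to show how quickly delays grow (T151196 /
    # T123815 context). BASE_DELAY and CAP are invented for the example.
    BASE_DELAY = 5    # seconds after the first failure
    CAP = 3600        # maximum delay, in seconds

    def backoff_delay(failures):
        """Exponential backoff: BASE_DELAY * 2^(failures-1), capped at CAP."""
        return min(BASE_DELAY * 2 ** (failures - 1), CAP)

    for n in range(1, 12):
        print(n, backoff_delay(n))  # 5, 10, 20, ... hits the 3600 cap at n=11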