[00:01:03] (03CR) 10Filippo Giunchedi: [C: 031] Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [00:02:11] (03CR) 10Krinkle: [C: 031] "No matches in Git, no matches in mwgrep (NS_MEDIAWIKI and NS_USER), and no requests in the last 7 days (other than my own)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [00:03:03] (03CR) 10Filippo Giunchedi: "Any reason for the actual code to live in puppet as opposed to its own repo?" [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:03:09] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:04:06] (03CR) 10Dzahn: Phabricator: Allow setting the mysql.user and mysql.pass in labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:04:16] (03CR) 10Addshore: "My initial thought was the number of lines are so small puppet would be easier." [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:04:31] (03CR) 10Dzahn: [C: 04-1] Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:04:33] (03CR) 10Filippo Giunchedi: "I'm not particularly happy with the "service discovery" part (the TODO) but this will do for now" [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [00:04:57] paladox: i ran the compiler, it shows no change, but there is a problem with the patch anyways [00:05:08] (03CR) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:05:18] mutante look on the left of the diff [00:06:20] hmm, i see. i wonder if this is the wrong order of things [00:06:24] might be a trap later [00:06:36] but at least there's a todo comment.. 
so yea [00:06:43] Ok [00:06:44] :) [00:07:25] (03PS8) 10Dzahn: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:07:32] (03CR) 10Krinkle: [C: 04-1] "This is labs-specific but ends up installed and running in prod, too - where we hopefully don't have something important in that similarly" [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [00:09:56] (03CR) 10Chad: [C: 032] Docroot cleanup: Remove old unused blank.gif from HTTPS tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [00:10:04] (03CR) 10Dzahn: [C: 032] "ok, no-op per compiler http://puppet-compiler.wmflabs.org/4666/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [00:10:22] mutante thanks :) [00:10:41] (03Merged) 10jenkins-bot: Docroot cleanup: Remove old unused blank.gif from HTTPS tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [00:11:24] (03CR) 10Filippo Giunchedi: "It is labs instances data that ends up in the graphite production instance, I can move it to the "production" subrole if that makes things" [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [00:12:12] !log demon@tin Synchronized docroot/foundation/: rm more junk (duration: 00m 45s) [00:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:35] paladox: you can go ahead with testing now [00:15:43] paladox: also.. i think https://gerrit.wikimedia.org/r/#/c/308498/ is not worth it anymore [00:16:05] (03Abandoned) 10Paladox: role/cxserver: Fix role::cxserver not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/308498 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [00:16:09] mutante ok [00:16:10] thanks [00:16:11] paladox: since all of manifests/role/ moved, that is fixed for all besides mariadb and eventlogging now [00:16:50] well.. eh. that said, i dunno about the labs and beta part that you were changing there [00:16:58] maybe that is still needed [00:16:59] anything i can do to help remove junk in docroot? [00:18:29] mutante twentyafterfour i am going to run bin/storage upgrade manually [00:18:34] otherwise the tables wont be created [00:18:41] and then it will all fail [00:18:54] i doint event think we issue that command through puppet [00:18:59] so i am runnign it manually [00:19:09] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:19:25] (03CR) 10Filippo Giunchedi: "IMO yes this code should live outside puppet." [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:20:02] paladox: ok, good question if that is puppetized [00:20:08] I doint think so [00:20:19] mutante also i am going to submit another patch for mysql [00:20:29] it got me furthur, but i also need it set in local.config [00:21:06] (03CR) 10Addshore: "ack." 
[puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:21:09] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [00:21:37] (03PS1) 10Hoo man: Use entity types for the repoNamespaces Wikibase client setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323347 [00:27:23] (03PS1) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:27:41] (03PS2) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:27:52] mutante ^^ could you review that please and run the puppet complier please? [00:28:02] !log reedy@tin Synchronized php-1.29.0-wmf.3/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: Some perf related improvements (duration: 00m 45s) [00:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:50] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 (owner: 10Paladox) [00:30:55] Xhgui is always timing out for me :S [00:31:22] (03PS3) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:32:09] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:29] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:42:10] (03CR) 10Krinkle: graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [00:49:08] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2819801 (10Dzahn) Yes, sounds good. [00:50:04] (03CR) 10Dzahn: [C: 04-1] "compiler says there is a change" [puppet] - 10https://gerrit.wikimedia.org/r/323349 (owner: 10Paladox) [00:50:31] paladox: tested, but there is a diff [00:51:51] mutante oh [00:59:14] (03PS4) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [00:59:19] mutante ^^ [00:59:26] could you re run puppet compiler please? 
[00:59:31] I hope that fixed it [01:03:04] (03PS1) 10BryanDavis: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) [01:03:07] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 (owner: 10Paladox) [01:03:29] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:04:10] (03PS2) 10BryanDavis: Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [01:04:38] (03PS5) 10Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - 10https://gerrit.wikimedia.org/r/323349 [01:05:09] (03CR) 10BryanDavis: "Prerequisite for I36b73528ca66da5138e7ac7110ab450eeee5a466" [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [01:05:56] (03PS2) 10BryanDavis: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 [01:05:58] (03PS2) 10BryanDavis: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) [01:10:24] (03CR) 10Krinkle: [C: 031] logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [01:27:37] bblack: hi! when you wrote above ^ Special:Banner, you meant, Special:BannerLoader, right? [01:27:51] How hard would it be to change that to make it not client-side cache? [01:29:33] AndyRussG: I meant all URL's matching Special:Banner.* [01:29:58] AndyRussG: in the code it's: if (req.url !~ "^/(wiki/|(w/index\.php)?\?title=)Special:Banner") { [01:30:54] (which excludes those URLs from having the appserver's Cache-Control header replaced by "private, s-maxage=0, max-age=0, must-revalidate" for user-agent consumption, which is what we do for wiki articles to avoid client-side caching) [01:31:40] it's functionally easy to remove or modify that exclusion, but I have to wonder why we made it in the first place... 
[01:31:52] bblack: quoting ejegg just now in #wikimedia-fundraising: [01:31:54] 19:28 AndyRussG: Here's what I get: cache-control:public, s-maxage=600, max-age=0 [01:31:56] 19:28 that's anonymous on the live site [01:31:58] 19:28 Hmmm [01:32:00] 19:28 And what does that mean [01:32:02] 19:28 < *> AndyRussG hides ignorance under carpet [01:32:04] 19:29 max-age is for private (ie browser) caches [01:32:06] 19:29 s-maxage is for shared caches (ie varnish) [01:32:08] 19:29 and that's what we're outputtingg manually with those header() calls [01:32:10] 19:29 so the browser will always re-request [01:32:37] bblack: K I guess we need to look into it carefully [01:33:08] yes, that's true, I wasn't paying attention to the shared-vs-nonshared [01:33:28] bblack: aha, so there's a general rule that messes with the cache-control header [01:33:37] but in any case, the clause applies to all Special:Banner.*, I don't even know what all of those are, or how they all behave today [01:33:38] but that rule doesn't apply to Special:Banner* pages [01:33:43] yes [01:33:58] so they all retain the headers that we set manually [01:34:02] yes [01:34:17] although I wouldn't make firm bets about how all UAs interpret them [01:34:18] cool, that's what we want, since we do have a bit of special cache header logic in there [01:34:24] bblack: ejegg: isn't that a regex? So it would apply to Special:BannerLoader? [01:34:31] yes, it would [01:34:40] Hmmm K [01:34:59] AndyRussG: yep! so BannerLoader is excluded from the logic that nukes the cache-control header sent from PHP [01:35:04] right [01:35:06] which is what we want [01:35:17] but why do we want that? [01:35:44] without that exclusion, varnish would still operate the same, and users wouldn't be allowed to cache, which seems to be what you want with max-age=0 anyways [01:35:58] it makes the user no-cache bit a bit more explicit and clear though [01:36:00] mutante :) [01:36:01] ohhhhh right, we're just outputting those custom headers for varnish's sake [01:36:32] so if we took that exception away, varnish would still respect our custom s-maxage [01:36:35] but at some point in the past, someone made that specific rule in varnish to let them through, for some past reason [01:36:40] yes [01:36:41] but nothing downstream would cache it [01:36:45] yes [01:36:58] right, OK, I guess we /don't/ want that behavior [01:37:08] sorry for the confusion [01:37:15] at least for this case [01:37:24] what about all other cases for Special:Banner.* [01:37:52] Yeah, I can't think of anywhere we want downstream caches to hang onto the banner content [01:38:18] I mean, it's https all the way, so I'd guess it almost never matters [01:38:38] (keep in mind, for all practical purposes the only downstream cache we care about is the UA) [01:38:56] there are exceptions: authorized-by-the-client TLS proxy-caches, e.g. on corporate networks [01:38:58] but I think it's fair to ignore them for policy purposes and only consider the case of a single UA [01:39:15] right, so the single UA is supposed to use maxage and not s-maxage [01:39:25] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2819845 (10Krenair) So now we have: ```alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (production)$ git grep -E "class .*::beta" modules/mediawiki/ma... 
[01:39:28] in theory yes [01:39:31] so except for TSL proxy caches, we're still seeing the right behavior [01:39:47] I wouldn't put it past someone to say "hey if it's publicly cacheable, why can't the browser cache it too?" [01:40:03] in some logical sense, it doesn't make sense to have s-maxage > maxage [01:40:08] (in public headers) [01:41:55] yeah, I'd say we don't need that rule, but it's not going to hurt except in TLS proxies and non-w3c-compliant UAs [01:42:32] 'The s-maxage directive is always ignored by a private cache. ' https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.3 [01:43:14] So let's not change it now, but let's look again and clean it up in January [01:43:50] k, I have to head out [01:45:01] ejegg|away: thx! [01:47:02] bblack: thx! I am sadly a bit distracted but will dig also check this out a bit more.... Do you by chance have a quick link/pointer to the code in question pls? thx!! :) [01:48:20] WRT to the varnish quick-purge switch u mentioned earlier, do u think we could design something 4 that, and maybe test it quickly Mondayish? On that point, do u have any example code to mebbe use as a starting point? [01:48:44] Sorry to be all bomshell over here ;p [01:48:49] bombshell [01:48:56] no that's not the right word, is it? [01:49:13] Mmmm urgent attack of last minute past deadline arrrrg-ishness [02:14:36] AndyRussG: there's an existing script deployers can use to purge specific URLs that's already well-tested. I'm just suggesting we can build on that and have a little wrapper ready to go that purges all the relevant banner-related URLs [02:15:34] bblack: ah great.... mmm haz links? [02:16:40] not offhand, no [02:16:51] it's part of the mwScript stuff [02:17:14] we can look around for it on Monday [02:17:40] purgeList.php? [02:18:00] yeah that [02:18:06] you pipe in URLs and it sends purges for them IIRC [02:19:25] mwscript itself is nothing special [02:20:43] it just gets the correct path to use and runs php via sudo where appropriate [02:21:30] the multiversion stuff it runs as a php entry point does some special things but ultimately it just results in the running of a normal MW maintenance script [02:23:25] (the sudo is to ensure it runs under the lower-privileged web server user rather than with deployment rights etc.) [02:29:55] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 10m 39s) [02:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 24 02:35:10 UTC 2016 (duration 5m 15s) [02:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:31] bblack: Krenair thx! [02:54:38] back a bit later! [02:59:02] (03CR) 10Gergő Tisza: "Shouldn't it be the other way around? Deploy I36b73528ca66da5138e7ac7110ab450eeee5a466, deploy this (but also get rid of the exception-jso" [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [03:21:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 814.98 seconds [03:24:43] (03CR) 10Krinkle: "Wouldn't a 401 with auth cause the authentication prompt to be shown to the end-user upon viewing the HTML page that embeds the image? 
Tha" [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [03:30:00] (03CR) 10Krinkle: "nvm :)" [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [03:42:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 243.95 seconds [04:10:39] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=728.80 Read Requests/Sec=1469.30 Write Requests/Sec=10.10 KBytes Read/Sec=42218.40 KBytes_Written/Sec=452.00 [04:16:39] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.70 Read Requests/Sec=2.40 Write Requests/Sec=26.40 KBytes Read/Sec=9.60 KBytes_Written/Sec=189.60 [04:27:59] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:52:09] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:55:59] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [05:21:09] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:23:16] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (10demon) Just came across `beta::saltmaster::tools`. It doesn't even appear used... [05:26:40] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820019 (10Krenair) It's used on deployment-salt02: https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-salt02 [05:27:46] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820032 (10demon) Do we even need salt in beta? ;-) [05:31:37] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820037 (10Krenair) That's way out of the scope of this task, but yes, I have used it many times [06:03:59] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:12:49] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:31:59] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:41:49] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:55:36] 06Operations, 10Wikimedia-General-or-Unknown, 07Availability, 13Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2820097 (10Nemo_bis) [07:32:04] !log Stopping MySQL db2070 for maintenance - https://phabricator.wikimedia.org/T149553 [07:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:18] (03PS1) 10Tim Landscheidt: Tools: Remove temporary class role::toollabs::merlbot_proxy [puppet] - 10https://gerrit.wikimedia.org/r/323373 [08:15:13] <_joe_> !log uploaded calico-cni 1.5.1 to jessie-wikimedia [08:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:42] (03CR) 10Yuvipanda: [C: 032] Tools: Remove temporary class role::toollabs::merlbot_proxy [puppet] - 10https://gerrit.wikimedia.org/r/323373 (owner: 10Tim Landscheidt) [08:24:37] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Technical-Debt, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2820182 (10hashar) [08:24:39] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820179 (10hashar) 05Open>03Resolved a:03hashar >>! In T86644#2820032, @demon wrote: > Do we even need salt in beta? ;-) When you get 40+ instances. Yes d... [08:25:29] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [08:27:29] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [16.0] [08:29:02] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Technical-Debt, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2820188 (10Krenair) [08:29:04] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820185 (10Krenair) 05Resolved>03Open a:05hashar>03None This task is not complete. I listed 7 existing cases above. [08:34:54] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2820196 (10hashar) Sorry made a mistake when replying :D [08:59:46] !log Deploy alter table S5 - dewiki.revision on db1092 (depooled) - T148967 [08:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:58] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [09:03:28] (03CR) 10Elukey: "Added the patch on deployment-prep if people want to test it. Quick test:" [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [09:03:57] marostegui: morning :P [09:04:16] elukey hahaha morning [09:12:49] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:25:09] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820282 (10Joe) 05Open>03Invalid [09:29:55] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820296 (10Joe) This is a huge discussion to have, and would need a ton of auditing. Basically: - We return 200 for *... [09:30:53] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "See comments on the task." [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [09:38:13] !log Stopping replication db1052 (depooled) for maintenance - T150960 [09:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:25] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960 [09:41:49] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:57:31] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820406 (10elukey) 05Invalid>03Open [10:13:15] <_joe_> !log running commonswiki htmlCacheUpdate jobs on terbium to catch up with the backlog, monitoring caches for vhtcpd queue overflows T151196 [10:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:27] T151196: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196 [10:16:19] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:30] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820445 (10elukey) Re-opening the task after a chat with Joe, let's find a solution for this issue :) What are the sc... [10:17:48] 06Operations, 10Wikimedia-General-or-Unknown, 07Availability, 13Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2820447 (10Joe) So, I solved the "mistery": since I last checked, we're throttling htmlCacheUpdates on the jobrunners... [10:23:54] 06Operations, 13Patch-For-Review: Prometheus cronspam - https://phabricator.wikimedia.org/T151149#2820450 (10elukey) thanks @fgiunchedi !! 
[10:40:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "First pass seems quite ok, comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [10:45:19] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:48:24] 06Operations: silver: / partition low on space - https://phabricator.wikimedia.org/T151493#2820531 (10Peachey88) [10:48:43] 06Operations: silver: /dev/md2 mounted twice - https://phabricator.wikimedia.org/T151489#2820532 (10Peachey88) [10:49:35] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820534 (10Peachey88) [10:50:09] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:52:37] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2819072 (10jcrespo) There is no /srv partition on silver. probably it should check /a instead of / ? [10:56:49] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820552 (10jcrespo) The MariaDB disk space check is a legacy of the past- there should be only one disk check, and the critical (and warning) level should be higher for database... [11:11:46] 06Operations, 10DBA: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820601 (10Volans) @jcrespo see T151489, there is a `/srv` mount point as well as `/a` mount point, they both mount the same partition! [11:12:19] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:19:10] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:26:19] 06Operations, 06Performance-Team, 10Thumbor: Investigate source of thumbnail 302 redirects - https://phabricator.wikimedia.org/T148410#2820634 (10Gilles) [11:40:19] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:41:26] !log Killed the Wikidata JSON dump creation on snapshot1007: Wont succeed before Monday, due to T151356 [11:41:37] apergos: FYI ^ [11:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:38] T151356: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356 [11:41:51] okey dokey [11:42:26] and there's the cron error mail too [11:43:11] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2820680 (10jcrespo) p:05High>03Normal [11:45:36] (03PS3) 10Mobrovac: Trending Edits: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) [11:48:07] (03CR) 10Mobrovac: Trending Edits: Role and module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [11:56:05] 06Operations, 10DBA, 13Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2820707 (10jcrespo) p:05Triage>03Normal [12:00:34] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2820741 (10Addshore) @akosiaris is there any way to get this expedited? (mentioning you as you comple... [12:10:09] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:24] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. merging" [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [12:11:09] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:59] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:12:09] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [12:15:09] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [12:15:09] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [12:16:23] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:05] hm thumbor issues [12:17:09] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [12:17:45] akosiaris: I can log into thumbor1001, is it considered down because of a health check? [12:18:27] gilles: it is not considered down as a host, the service is that it doesn't look to be ok [12:19:08] akosiaris: do you know where the service health check is defined? 
I wonder what it looks at [12:19:13] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.359 second response time [12:19:21] hmm recovered [12:20:11] gilles: yes, https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/hieradata/common/lvs/configuration.yaml;3e0cd4eac8c03eb88ab3a1fdd2bf992387828409$965 [12:20:27] practically tries to fetch /healthcheck over http on port 8800 [12:20:31] thanks [12:21:19] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[trending-edits/deploy] [12:21:41] https://grafana.wikimedia.org/dashboard/db/thumbor?from=now-30m&to=now a spike in 599 it does look like something was up [12:22:13] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [12:22:43] ok so it's flapping, I 'll schedule a 1 hour downtime on icinga while we investigate [12:23:09] so that we don't page every ops member on this cellphone on every flap [12:23:21] the load is definitely high on both boxes [12:23:50] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2820787 (10Shoichi) p:05Triage>03High >>! In T144805#2789485, @Shoichi wrote: > Months ago, I can log in,but it also happen to me. I don't know rember I had set two-factor authentication or no... [12:24:09] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [12:24:38] gilles: very heavily into IOwait as it seems [12:24:50] something in thumbor is doing way too many IOPS [12:25:21] starting 25 mins ago [12:25:24] any offending instance in particular? [12:26:09] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [12:26:46] I can look at what python is doing with manhole if we can figure out which process(es) in particular are doing it [12:27:35] can't see any long running process ... 
they seem to be going through their lifecycle normally [12:27:51] first time I see this error, I'll check if it's new: [12:27:52] Nov 24 12:27:18 thumbor1002 thumbor@8811[89821]: 2016-11-24 12:27:18,634 8811 thumbor:ERROR [ExiftoolRunner] error: 'Deep recursion on subroutine "Image::ExifTool::ProcessDirectory" at /usr/share/perl5/Image/ExifTool/Exif.pm line 4532.\nDeep recursion on subroutine "Image::ExifTool::Exif::ProcessExif" at /usr/share/perl5/Image/ExifTool.pm line 6085.\n' [12:28:03] ah, it could be related [12:28:17] some of the processes consuming CPU are exiftool [12:28:23] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 8.888 second response time [12:28:27] they are not in the top 5, but still [12:28:29] could be related [12:29:17] yep, that error started happening at Nov 24 12:06:26 [12:29:31] ok almost definitely related then [12:30:43] could be a side-effect of the real issue, though [12:31:54] almost OOM killer has been invoked multiple times in the last 30 mins [12:31:58] also* [12:32:11] that happens continuously, thumbor is leaking memory at the moment [12:32:23] 1000-1500 OOM kills per day if I recall correctly [12:32:30] no, it's more promiment now [12:32:35] oh ok [12:33:08] 157 on one host in 30 mins [12:36:09] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [12:36:45] the deep recursion thing is still happening but I can't tell which request triggered it, I'm going to need to live-hack the thumbor code to get more information [12:36:51] !log launched preferred-replica-election to re-add kafka1022 among the Topic partition leader brokers of the Analytics Kafka cluster (all metrics looks good) [12:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:10] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [12:38:09] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:38:39] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[trending-edits/deploy] [12:42:10] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [12:43:09] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [12:43:12] akosiaris: can you make /usr/lib/python2.7/dist-packages/wikimedia_thumbor/exiftool_runner/__init__.py world-writable? [12:43:42] that's where I'll put my live hack to see if it's alwayd the same offending original or something like that [12:46:56] gilles: done. on thumbor1001 [12:47:03] thanks [12:47:05] I 'll also lower the weight of this host a bit [12:47:20] should give a bit more breathing room while we debug [12:47:36] thumbor1002 is not going to be happy, but still... 
[12:48:05] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [12:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:28] I'm going to restart the thumbor instances on thumbor1001 to make the live hack active [12:48:53] !log restarting thumbor on thumbor1001 [12:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:56] !log lower thumbor1001 load by 50% to easy debugging [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:15] so, all thumbor processes are writing to the disk about 1.5MB/s.. total is around 30MB/s [12:53:28] with spikes around 50MB/s [12:53:31] which is not much [12:54:21] !log restarting thumbor on thumbor1001 [12:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:24] of course now that I want to catch the recursion thing it's not happening anymore [13:00:40] it's still happening on thumbor1002, though [13:00:56] either we're unlucky and all those requests are going to thumbor1002 or it was a symptom and not necesserraly the cause [13:01:21] it's not super frequent, though [13:01:36] but it suggests recursing in something that does IO, so... [13:03:32] let's wait a bit more, with something that happens once per minute on average or so, with the new weights it wouldn't be impossible for thumbor1002 to get them all by chance [13:04:22] I can just revert the weight change [13:04:36] or even push more traffic to thumbor1001 [13:04:41] right, please try that, or even skew it the opposite way [13:04:51] !log akosiaris@puppetmaster1001 conftool action : set/weight=20; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [13:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:13] ok, now thumbor1001 should start getting twice the requests thumbor1002 gets [13:07:50] and yet it seems to have no problem now [13:08:02] what on earth ? [13:08:44] right [13:09:09] I can't have the error reoccur on thumbor1001 but it still happens occasionally on thumbor1002 [13:09:25] could it have been a simple spike whose backlog is just still being served on thumbor1002 ? [13:10:16] is thumbor1002 fine now? [13:10:23] no, not really [13:10:31] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thumbor1002&var-network=bond0 [13:10:36] but it is getting better [13:10:54] user cpu % is going up and iowait is slowly becoming less [13:11:28] the load average is definitely dropping, so less work is being scheduled on that box, but that's to be expected [13:11:32] oh, but the disk is full [13:11:40] that might be what's causing all the rest [13:11:50] ? [13:12:09] which one ? I see plenty of space on both [13:12:19] oh no sorry, it's disk utilisation [13:12:35] interesting [13:13:14] restarting thumbor definitely stopped the issue on thumbor1001 but we clearly lost the ability to debug the problem there [13:13:57] can you make the same file writable on thumbor1002? I can just wait for the processes to restart "naturally" due to OOMs [13:14:04] yes [13:14:44] done [13:19:59] akosiaris: can you shift the traffic back to thumbor1002? 
[13:20:52] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [13:20:57] gilles: done [13:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:10] it's quite weird what's going on though [13:21:26] processes are going through their lifecycle quite fine and a restart just fixed it ? [13:21:36] those 2 don't add up very well [13:22:05] right [13:22:09] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:22:25] ah, here's the first time I had a long running process [13:22:29] not very long [13:22:37] but an rsvg-convert stood out for a while [13:22:50] whereas the typical is like a second [13:22:55] that's sort of expected, we get a lot of junk SVGs [13:23:05] rsvg-convert won't run for more than a minute [13:23:24] :( and I though I finally had a culprit [13:23:46] so, memory doesn't seem to be the issue, thumbor1001 was never close to the total memory limit [13:24:18] I can't really tell if it was a huge spike in requests, because we track when they complete (failure or success), not when they come in [13:24:45] to be improved, obviously, at the nginx level would give us the real picture [13:25:19] iowait spiked, but it was much higher earlier this morning on thumbor1001 [13:25:56] even the load was really high earlier today [13:26:20] indeed [13:26:35] the failure rate seems to have dropped on its own on thumbor1002... [13:26:41] yes [13:26:46] load as well, as well as cpu usage [13:27:07] not sure if it was something we did [13:27:13] but we didn't do anything other than shift traffic. do you think it just gave it "breathing room"? 
[13:27:30] it's a plausible explanation [13:27:58] assuming the spike had practically ended by the time we started shifting traffic around [13:28:00] the recursion error is not happening at all on either box [13:28:13] shame to have been unable to know if it was a particular image [13:28:24] I suppose we will see it again [13:28:52] right, I'll make a task to add url request context to as many thumbor errors as possible for future incidents like this [13:28:57] it's clear from the graphs that it has happened multiple times in the previous days [13:29:09] just not that strongly to emit an alert [13:29:22] right [13:31:28] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: thumbor1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=thumbor', 'service=thumbor']) [13:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:49] !log balance the load between thumbor1001 and thumbor1002 evenly [13:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:36] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2820911 (10Gilles) a:05fgiunchedi>03None [13:43:47] 06Operations, 06Performance-Team, 10Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#2820915 (10Gilles) [13:46:47] 06Operations, 06Performance-Team, 10Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#2820932 (10Gilles) [13:49:17] 06Operations, 10Icinga: register a nickserv account for icinga-wm - https://phabricator.wikimedia.org/T22771#2820946 (10hashar) 05Open>03declined The system we use to relay notifications to IRC does not support nick registration (T48254) and I have declined that task. icinga-wm is one of the few bots tha... [13:50:09] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:50:12] 06Operations, 10Icinga: register a nickserv account for icinga-wm - https://phabricator.wikimedia.org/T22771#2820953 (10hashar) [13:50:14] 06Operations, 10IRCecho: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254#503055 (10hashar) 05Open>03declined That is solely for icinga-wm and it works just fine without nick registration. I dont think anyone will ever add support for nick registration to ircecho, bu... [13:50:53] what is this trending-edits scap failure, anyone knows? [13:51:26] I will investigate if nobody knows about it [13:52:22] 06Operations, 06Performance-Team, 10Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#2820957 (10Gilles) @fgiunchedi what do you think of using something like https://github.com/zebrafishlabs/nginx-statsd or https://github.com/knyar/nginx-lua-promethe... [13:54:29] 06Operations, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: setup a DB backed parser cache - https://phabricator.wikimedia.org/T55457#2820963 (10hashar) [13:55:29] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:58] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2820998 (10Gilles) Thanks, Pythonic Santa! 
[14:04:36] (03PS2) 10KartikMistry: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) [14:09:11] 06Operations, 10Traffic: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821095 (10ema) [14:09:23] 06Operations, 10Traffic: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821110 (10ema) p:05Triage>03Normal [14:10:56] 06Operations, 10Traffic: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821095 (10ema) [14:24:29] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:26:07] 06Operations, 10Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs on cp1055 - https://phabricator.wikimedia.org/T151563#2821164 (10elukey) [14:29:53] 06Operations, 10Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs on cp1055 - https://phabricator.wikimedia.org/T151563#2821164 (10ema) We might have to increase workspace_backend: https://github.com/varnishcache/varnish-cache/issues/1990 [14:30:19] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2821200 (10Gilles) [14:33:43] 06Operations, 10Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2821211 (10elukey) [14:47:07] !log uploaded varnishkafka 1.0.12-1 to carbon main component, replacing version 1.0.7-1 (T150660) [14:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:22] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660 [14:49:08] long live to Varnish 4 [14:59:14] 06Operations, 06Performance-Team, 10Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821235 (10Gilles) [14:59:41] 06Operations, 06Performance-Team, 10Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821235 (10Gilles) [15:02:31] 06Operations, 06Performance-Team, 10Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821252 (10Gilles) @fgiunchedi would adding a minimum version for gifsicle in the python-thumbor-wikimedia package be enough? Or are more steps required to have the jess... 
[15:03:40] !log uploaded varnish 4.1.3-1wm4 to carbon main component, replacing version 3.0.6plus-wm9 (T150660) [15:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:51] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660 [15:04:35] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2821257 (10ema) [15:04:38] 06Operations, 10Traffic, 13Patch-For-Review: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660#2821256 (10ema) 05Open>03Resolved [15:06:57] 06Operations, 10Traffic, 13Patch-For-Review: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503#2821259 (10ema) 05Open>03Resolved a:03ema [15:06:59] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2821261 (10ema) [15:07:32] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/4669/stat1002.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [15:07:39] 06Operations, 10Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2821263 (10ema) [15:07:42] 06Operations, 10Traffic, 13Patch-For-Review: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2821264 (10ema) [15:07:46] 06Operations, 10MediaWiki-API, 10Traffic: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867#2821265 (10ema) [15:07:46] 06Operations, 10Traffic, 07HTTPS, 05codfw-rollout: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325#2821267 (10ema) [15:07:51] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10ema) 05Open>03Resolved [15:08:34] (03PS2) 10Gehel: Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) [15:10:09] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:54] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/4668/ is way too good to be true..." [puppet] - 10https://gerrit.wikimedia.org/r/322898 (owner: 10Alexandros Kosiaris) [15:17:59] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2821275 (10akosiaris) [15:18:05] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2821272 (10akosiaris) 05Open>03Resolved a:03akosiaris This seems to have fallen between the cra... [15:20:15] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2821283 (10ema) [15:20:53] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2821284 (10Addshore) Thanks @akosiaris ! 
[15:21:23] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2821287 (10Addshore) [15:21:27] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2821285 (10Addshore) 05Open>03Resolved Now appearing in grafana [15:22:29] 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2821290 (10ema) We've proposed a patch introducing a varnishd parameter limiting the number of extrachance retries: https://github.com/varnishcache/varnish... [15:23:25] 06Operations, 10Traffic, 13Patch-For-Review: varnish-backend: weekly cron restart for all clusters - https://phabricator.wikimedia.org/T149784#2821291 (10ema) 05Open>03Resolved [15:26:56] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2821292 (10ema) 05Open>03Resolved >>! In T148412#2781980, @elukey wrote: > Last but not the least, no alarms were fired for uploa... [15:31:21] (03PS1) 10Gehel: Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323399 (https://phabricator.wikimedia.org/T149722) [15:32:06] (03CR) 10jenkins-bot: [V: 04-1] Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323399 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [15:34:09] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2821313 (10Gilles) The second example is actually the same issue manifesting itself, which should be resolved by the gifsicle upgrade. [15:35:53] (03PS1) 10Gehel: Revert "Add 'discovery-stats' technical user to the 'stats' group." [puppet] - 10https://gerrit.wikimedia.org/r/323400 [15:36:49] (03CR) 10Gehel: "Adding the "stats" group directly to the "discovery-stats" user conflicts with the way the admin module works." [puppet] - 10https://gerrit.wikimedia.org/r/323400 (owner: 10Gehel) [15:36:54] (03CR) 10Gehel: [C: 032] Revert "Add 'discovery-stats' technical user to the 'stats' group." [puppet] - 10https://gerrit.wikimedia.org/r/323400 (owner: 10Gehel) [15:38:13] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2821323 (10Gilles) [15:39:09] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:00:08] (03PS1) 10Gilles: Nginx timeout should be higher than thumbor subprocess timeout [puppet] - 10https://gerrit.wikimedia.org/r/323403 (https://phabricator.wikimedia.org/T151459) [16:18:03] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821410 (10Gilles) @fgiunchedi I've actually discovered that mediawiki *does* download the original before erroring that way. It's... [16:44:22] To all the best WMF operations team from My family, Happy Thanksgiving, I hope you have a great day, Ill be around as much as possible. WMF wouldnt be the same without everyone of you. 
[17:13:16] (03CR) 10Alexandros Kosiaris: [C: 032] Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - 10https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505) (owner: 10Alex Monk) [17:13:20] (03PS2) 10Alexandros Kosiaris: Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - 10https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505) (owner: 10Alex Monk) [17:13:22] (03CR) 10Alexandros Kosiaris: [V: 032] Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - 10https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505) (owner: 10Alex Monk) [17:14:09] RECOVERY - Restbase root url on restbase2012 is OK: HTTP OK: HTTP/1.1 200 - 15450 bytes in 0.104 second response time [17:15:39] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:16:20] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:25:48] <_joe_> !log turned off additional workers for htmlcacheupdate on commonswiki as the queue has reduced to acceptable sizes (T151196) [17:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:00] T151196: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196 [17:29:59] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.63 seconds [17:30:09] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 614.38 seconds [17:32:54] Isen't that ^^ phabricator? [17:32:57] that could be me [17:33:05] oh [17:33:09] I have sent some extra backup processes [17:33:16] to the slaves [17:33:20] Oh [17:33:23] no impact to the masters [17:33:35] ok [17:34:29] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.98 seconds [17:34:55] yeah, that has some issues I should check, but no production impact [17:36:09] jynus backups are good we love them [17:36:23] yes, we indeed do [17:40:57] I do not know why they cause lag on m3 and not on the other hosts [17:52:49] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821604 (10Gilles) Actually it's more complicated than that... I need to double check whether it does load from swift or not, it s... [18:00:52] (03PS1) 10Jcrespo: Add temporary workaround --skip-ssl until unified cert authority [puppet] - 10https://gerrit.wikimedia.org/r/323420 [18:00:59] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:01:09] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [18:01:48] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821658 (10Gilles) Actual size check: https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/media/Transformational... 
[18:02:24] (PS2) Jcrespo: Add temporary workaround --skip-ssl until unified cert authority [puppet] - https://gerrit.wikimedia.org/r/323420
[18:02:29] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[18:10:33] Operations, Performance-Team, Thumbor: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2821703 (fgiunchedi) @gilles we already have `jessie-backports` enabled in production so the minimum version in python-thumbor-wikimedia should DTRT
[18:13:59] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2821720 (Gilles) I think it might be just a little misunderstanding. If you're talking about describing to the client what kind of media the original...
[18:14:25] (PS1) Ema: varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660)
[18:14:37] (CR) Jcrespo: [C: +2] Add temporary workaround --skip-ssl until unified cert authority [puppet] - https://gerrit.wikimedia.org/r/323420 (owner: Jcrespo)
[18:15:47] Operations, Performance-Team, Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2821739 (Gilles) Yeah, I'm not sure that we rate-limit 404s actually. I'll make sure to do the right thing in Thumbor in regards to that.
[18:16:01] (PS2) Ema: varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660)
[18:23:41] Operations, Performance-Team, Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#2821772 (fgiunchedi) @gilles AFAIK there's no precedent like that no, it would be useful though for other places where we have nginx deployed and want to gain more...
[18:24:50] (PS1) Jcrespo: Avoid --dump-slave when performing backups [puppet] - https://gerrit.wikimedia.org/r/323427
[18:35:56] Operations, Monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579#2821787 (jcrespo)
[18:40:30] Operations, Monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579#2821834 (jcrespo)
[18:42:56] Operations, Performance-Team, Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821844 (fgiunchedi) @Gilles good question, I think for now the simplest thing is to treat multipage documents as exceptions and...
[18:43:19] (CR) Jcrespo: [C: +2] Avoid --dump-slave when performing backups [puppet] - https://gerrit.wikimedia.org/r/323427 (owner: Jcrespo)
[18:47:58] Operations, Performance-Team, Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821863 (Gilles) I think we can make storing that data very efficient by indexing by size. It's indeed unlikely that every singl...
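Two of the patches above tweak how mysqldump is invoked for backups: r/323420 adds --skip-ssl as a stopgap until a unified cert authority exists, and r/323427 drops --dump-slave, a flag that pauses the slave SQL thread for the duration of the dump (a plausible contributor to the m3 lag seen earlier). A sketch of how these flags fit into an invocation, with placeholder host, database, and output path:

    # Sketch of a mysqldump invocation using the flags discussed above.
    # Host, database, and paths are placeholder values, not the real config.
    import subprocess

    def dump_command(host, db):
        return [
            "mysqldump",
            "--host", host,
            "--single-transaction",  # consistent InnoDB snapshot, no locking
            "--skip-ssl",            # temporary workaround from r/323420
            # note: no --dump-slave; per r/323427 it is avoided, since it
            # stops the slave SQL thread while the dump runs
            db,
        ]

    cmd = dump_command("db1048.eqiad.wmnet", "phabricator_maniphest")
    with open("/srv/backups/m3.sql", "wb") as fh:
        subprocess.run(cmd, stdout=fh, check=True)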
[19:05:25] Operations, Discovery, Wikimedia-Apache-configuration, Mobile, Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2821887 (Jdlrobson)
[19:05:56] Operations, Discovery, Wikimedia-Apache-configuration, Mobile, Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#730151 (Jdlrobson) Given this would probably redirect to https://www.wikipedia.org/ probably something of con...
[19:10:59] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:35:09] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 2 minutes ago with 17 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils]
[19:37:20] jouncebot now
[19:37:20] No deployments scheduled for the next 90 hour(s) and 22 minute(s)
[19:39:59] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:43:18] Operations, Performance-Team, Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2821974 (fgiunchedi) @Gilles yeah that sounds good! Swift header limit I think is 8k by default, so we should be fairly safe the...
[19:45:59] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:50:57] Puppet, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, and 2 others: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2821982 (hashar) Open>Resolved Thanks @krenair and @akosiaris
[20:06:40] Operations, Puppet, Documentation, Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2821997 (Florian) Ok, published: https://codein.withgoogle.com/dashboard/tasks/5667832180244480/ :) I was free and wrote that the studen...
[20:13:59] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[20:25:09] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:25:22] (CR) Florianschmidtwelzow: [C: +1] Do not send 'exception-json' channel to logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: Gergő Tisza)
[20:54:09] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[21:49:59] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:13:19] (PS1) Gergő Tisza: Use custom LogstashFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/323492 (https://phabricator.wikimedia.org/T145133)
[22:16:05] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2785082 (fgiunchedi) I took a look at this too and couldn't understand why the override d...
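On the T150741 thread running through this log: the core of "reject thumbnail requests that are the same size as the original or bigger" is a plain dimension comparison, though, as fgiunchedi notes above, multipage documents would need to be treated as exceptions. A toy predicate, with invented names rather than Thumbor's real handler API:

    # Hedged sketch of the size guard discussed in T150741: refuse to render
    # a "thumbnail" that would not actually be smaller than the original.
    # Function and parameter names are illustrative, not Thumbor's real API.
    def should_reject(requested_width, requested_height,
                      original_width, original_height):
        """Return True when the request is not actually a downscale."""
        return (requested_width >= original_width or
                requested_height >= original_height)

    assert should_reject(2048, 1536, 1024, 768) is True   # upscale: reject
    assert should_reject(1024, 768, 1024, 768) is True    # same size: reject
    assert should_reject(320, 240, 1024, 768) is False    # genuine thumbnail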
[22:17:59] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[22:33:09] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[22:47:20] (CR) Ppchelko: PDF Render: Create the service's admin group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[23:15:22] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2822200 (aaron) Related (and tricky) task is T123815, where the backoff times are waaay too pessimistic. The X/sec l...
[23:53:41] (PS6) Paladox: Phabricator: Allow setting mysql.user and mysql.pass (part2) [puppet] - https://gerrit.wikimedia.org/r/323349
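Regarding aaron's point at [23:15:22] that the job-queue backoff times are too pessimistic: with a doubling backoff, a handful of consecutive failures is enough to push the retry delay toward whatever cap is configured, which is why the constants matter so much. A toy illustration with invented constants, not MediaWiki's actual job-queue parameters:

    # Toy exponential backoff to show how quickly delays grow (T151196 /
    # T123815 context). BASE_DELAY and CAP are invented for the example.
    BASE_DELAY = 5    # seconds after the first failure
    CAP = 3600        # maximum delay, in seconds

    def backoff_delay(failures):
        """Exponential backoff: BASE_DELAY * 2^(failures-1), capped at CAP."""
        return min(BASE_DELAY * 2 ** (failures - 1), CAP)

    for n in range(1, 12):
        print(n, backoff_delay(n))  # 5, 10, 20, ... hits the 3600 cap at n=11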