[00:16:30] (03PS1) 10Dzahn: static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 [00:28:24] (03PS2) 10Dzahn: static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 [00:30:58] (03PS3) 10Dzahn: static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 (https://phabricator.wikimedia.org/T101734) [00:31:57] (03CR) 10Dzahn: [C: 032] static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 (https://phabricator.wikimedia.org/T101734) (owner: 10Dzahn) [00:32:52] 6operations, 7Database: codfw frontends cannot connect to mysql at db2029 - https://phabricator.wikimedia.org/T104573#1423567 (10Springle) (4) == EINTR on connect. Presumably the max_connections you observed, which in turn possibly something to do with: * hhvm timeout (but presumably T98489 was deployed to C... [00:45:10] (03PS2) 10Ori.livneh: varnishlog: allow passing NULL parameter to VCL_Arg() [puppet] - 10https://gerrit.wikimedia.org/r/222507 [00:45:16] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishlog: allow passing NULL parameter to VCL_Arg() [puppet] - 10https://gerrit.wikimedia.org/r/222507 (owner: 10Ori.livneh) [00:49:40] 6operations, 6Security, 7Database: Deployment/restricted root MySQL access? - https://phabricator.wikimedia.org/T104666#1423605 (10Krenair) 3NEW [00:53:14] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [00:54:49] (03CR) 10Dzahn: Add Phragile module. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [01:00:21] (03PS1) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [01:02:09] (03PS2) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [01:02:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 9 below the confidence bounds [01:06:01] (03PS1) 10Dzahn: add parsoid/ocg/bastiononly use groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 [01:07:30] (03PS3) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [01:08:07] (03PS2) 10Dzahn: add parsoid/ocg/bastiononly user groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 [01:15:49] (03PS1) 10Dzahn: bromine: remove roles except base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:19:09] (03PS2) 10Dzahn: bromine: remove roles except base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/222523 (https://phabricator.wikimedia.org/T101734) [01:22:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60573 bytes in 0.074 second response time [01:24:06] (03PS1) 10Ori.livneh: varnishrls: include cache hit / miss stats from X-Cache header [puppet] - 10https://gerrit.wikimedia.org/r/222524 [01:32:37] 6operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#1423684 (10Krenair) 3NEW [01:33:55] (03PS2) 10Ori.livneh: varnishrls: include cache hit / miss stats from X-Cache header [puppet] - 10https://gerrit.wikimedia.org/r/222524 [01:34:05] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishrls: include cache hit / miss stats from X-Cache header [puppet] - 10https://gerrit.wikimedia.org/r/222524 (owner: 10Ori.livneh) [01:37:43] 6operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#1423691 (10Krenair) [01:38:14] PROBLEM - puppet last run on db1044 is CRITICAL Puppet has 1 failures [01:38:29] (03PS1) 10Ori.livneh: Fix-up for Ia481719de: include 're' [puppet] - 10https://gerrit.wikimedia.org/r/222527 [01:38:44] (03PS2) 10Ori.livneh: Fix-up for Ia481719de: include 're' [puppet] - 10https://gerrit.wikimedia.org/r/222527 [01:38:51] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for Ia481719de: include 're' [puppet] - 10https://gerrit.wikimedia.org/r/222527 (owner: 10Ori.livneh) [01:41:49] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1423694 (10Spage) Should this be closed? `/w/static` and //wikiname//`/load.php` URLs still work on bits, e.g. https://bits.wikimedia.org/static/1.26wmf12/resources/ass... [01:44:15] (03PS3) 10Dzahn: bromine: add standard, remove other role [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:47:28] (03PS4) 10Dzahn: bromine: add standard [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:47:30] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1423696 (10BBlack) Yeah, to the degree possible, we've left everything we can still working on bits during the transition. The next major step we're coming up on is phy... [01:48:16] (03PS5) 10Dzahn: bromine: add standard [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:48:59] (03CR) 10Dzahn: [C: 032] bromine: add standard [puppet] - 10https://gerrit.wikimedia.org/r/222523 (owner: 10Dzahn) [01:51:07] ori: with hit/miss, keep in mind that the X-Cache level doesn't really differentiate hit-for-pass either [01:52:44] (so it's going to call it a hit if the VCL told it to pass the request to the backend for some reason, but that decision to pass it to the backend was a cacheable decision) [01:53:04] RECOVERY - puppet last run on db1044 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [01:53:56] and then also, hit/miss of backend layers in X-Cache gets frozen in true hits in the front layers (backend hit/miss count is frozen into frontend cache response when the frontend caches it, based on what it was during the first fetch from backend). [01:54:28] all in all, relying on X-Cache to determine true cache hitrates is, awkward at best heh :) [01:57:26] 6operations, 5Patch-For-Review: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1423700 (10Dzahn) [01:57:28] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1423698 (10Dzahn) 5Open>3Resolved that was my fault applying a role on initial run that caused a puppet fail before users were setup and not applying "standard' first. works now. resolving. [01:57:30] bblack: would determining cache hit / miss based on some latency threshold work? [01:57:59] I can't think of a reliable way to do that, no. [01:58:20] X-Cache is the right type of thing to do it with, it's just not got sufficient information in the header today. [01:58:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [01:59:29] basically we need a better X-Cache, and there are other things to fold up with that too, in some proposal for better global request tracing [01:59:57] a header that every layer appends to: nginx, each varnish, apache at the appserver, maybe mediawiki tosses something relevant or interesting into it too [02:00:33] that's what google's dapper does: http://research.google.com/pubs/pub36356.html [02:01:04] part of the protobuf RPC protocol that services use to talk to each other is some unique request id [02:01:24] which gets passed around and threaded through the various services implicated in generating a response to some particular user request [02:02:06] well yeah, I guess there's two inverse ways of looking at the problem [02:02:06] twitter took the paper as a spec and built an open-source version, i haven't tried it out tho [02:02:25] 1) generate a unique ID at the very very front layer, and use that ID in all logging/analysis/trace deeper down [02:02:54] 2) generate a single trace-header that every layer on down appends info to, so that the request and response contain the whole chain as they go. [02:03:15] probably both are useful in different ways [02:03:19] Hm.. varnish doesn't have an indication/distinction between responding to a client with an http response it has cached under a key vs. passing from elsewhere? [02:03:48] the http response passed from elsewhere also bears the x-cache header [02:03:49] Krinkle: at the VCL level we can tell which is happened, so I'm sure there's a way to encode that in a header, too. Just needs some thinking. [02:04:08] Ah yeah [02:04:08] but X-Cache currently relies on obj.hit, which can be hit or hit-for-pass [02:04:08] you can also 'subscribe' to VCL routine names [02:04:20] Depending on what we want to measure, the metric ori and I want should consider 'hit' also when it was a pass to backend where it was a hit. [02:04:27] only 'miss' if neither varnish layer had it [02:04:30] (or measure both) [02:04:44] backend varnish that is [02:04:49] I think I got it right [02:04:54] I think it's interesting to look at the two hitrates independently, because we're definitely not stuck with the idea that we need 2-3 layers here :) [02:05:06] I only count when we have two (hit|miss) in the header [02:05:13] and I only count the header as sent to the client [02:05:35] I'm not sure what the "two (hit|miss)" part does [02:05:40] I didn't follow along here, but I scanned something about that header itself being cached? [02:06:24] both hit-for-pass and miss will go through the next layer down [02:06:30] !log ori Synchronized php-1.26wmf12/extensions/CentralAuth: I0e5f2d3b2: Updated mediawiki/core Project: mediawiki/extensions/CentralAuth 7f8da7139714dd5089dd03e8679aba25c2c89c4d (duration: 00m 15s) [02:06:37] Logged the message, Master [02:07:11] yeah, disregard the two (hit|miss); that wasn't the solution. ignoring all instances of X-Cache except TxHeader with remote party == 'client' was the right one [02:07:28] on the frontends, yes [02:07:36] yeah this is just the frontends [02:07:41] jgage: what was your commandline again to copy files without agent forwarding [02:07:45] and instead of somehow trying to collapse a 'miss, hit' into a miss or a hit i just record it as a 'hit_miss' [02:07:47] mutante: scp -3 [02:07:59] scp -3 somehost:~/foo otherhost:~/bar [02:09:06] bblack: I don't understand what X-Cache: hit, hit, hit represents [02:09:10] Krinkle: the cached header part is: when the frontend appends its own cp10xx hit (999) to the rest of the existing line from the deeper cache headers... if it was a true memcache hit in the frontend, the existing line that it's appending to is frozen in the cache. [02:09:18] if you get a hit, why go another layer deep? [02:09:48] oh, wait, I think I get it [02:09:50] ori: I think it builds the other way [02:09:53] ori: that means at some point the object fell out of the front cache and had to be re-fetched a layer down, where it was a hit. and now the request you're looking at is a hit on the cache from that re-fetch [02:10:03] right [02:10:16] so it append to the hit from the backend [02:10:46] bblack: And last front-end 'hit' appendix is not saved I guess. [02:10:50] it does that each time it fetches from memory [02:10:54] each layer does its own separate append to the X-Cache line, and in a "true" cache hit, the info in that line from deeper layers is also frozen in the cache [02:11:28] ah, so the last value is accurate from the frontend. [02:11:41] the final entry on the line is from the actual frontend cache itself, and represents the actual #hits as a live counter on something that's truly hot and stuck there. [02:12:13] (all of this modulo the fact that X-Cache calls somethings hits that are not hits in the sense we care about) [02:12:26] I think we can assume that the varnish backend will always have it if it was a hit in the frontend. Even if it was originally not a backend hit, it will be now. UNless it's dropped off there already. [02:12:42] will/would [02:12:44] I think using a latency threshold is dirty but probably the easiest way to marshal some confidence about this [02:12:44] there are scenarios where it can go either way [02:13:03] these are load.php requests; they take at least >50ms for the backend to generate [02:13:15] the distribution should be very distinctly bimodal [02:13:21] sometimes something sticks fairly well in the backends but occasionally falls off the frontends and refetches. sometimes something stays very hot in the frontends but rarely lasts long in the backends. [02:14:02] ori: yeah for this particular case of load.php, latency is probably a reasonable way to look at it. [02:14:27] or have mediawiki stick a timestamp as a header indicating when the response was generated [02:14:44] Hm.. yeah. we could use a boolean ms time as secondary factor to verify the accuracy of our x-cache data interpretation [02:14:48] anyways, what this all needs in the general case is more VCL work on building a better trace header, which should include 3-way differentiation in varnish on hit vs miss vs hit-for-pass [02:15:07] i get a brain freeze just trying to size up that task [02:15:10] right now miss is a true miss, but hit can be either hit or hit-for-pass [02:15:50] (with hit-for-pass, if it's consistent across the layers, you'd see counter increments at all layers too, but not at the same rates) [02:16:02] bblack: so 'miss' means it is absence in frontend (saying nothing about backend or app server) and 'hit' is hit in frontend. hit-for-pass is.. [02:16:20] (due to chash funneling, hit-for-pass hits would increment more slowly in the frontend and faster in the backend) [02:17:00] brain freeze with you. I think I'm lost, but I can't tell. [02:17:09] Krinkle: the hit/miss entries that get recorded in each entry in X-Cache are only local information, they know nothing about the other layers. [02:17:25] if it records a miss there, that means it was a true miss at its own layer, and it had to fetch from a layer down for sure. [02:17:49] varnish implements its own 304 handling, right? [02:17:58] If it records a hit there, that can be either a true hit (served from memory directly), or a hit-for-pass (meaning it has cached the fact that it always has to force-miss this for a while and fetch it from the backend) [02:18:00] Or do we vary by that, and then cache the handling of the app server? [02:18:15] e.g. there's a cache object for the 200 response and one for the 304 response? [02:18:17] ori: thanks, different solution but works (slowly:) [02:19:09] Krinkle: it can't vary on the response, only on the request (because we can't see a future potential response when deciding whether to use the cache for a given request) [02:19:13] first downloading and then uploading would be the same thing though it seems [02:20:28] right [02:20:43] Krinkle: but I'm not sure about 304 in particular, whether varnish handles that. I would expect it to, but I don't know where the docs are on that. [02:21:32] I mean, on the one hand, we want varnish to cache on itself. But we also need changes to propagate [02:21:51] so eventually it's gonan have to ask the app server whether the E-Tag or Last-Modified is still useable [02:21:53] well 304 is just based on the same basic principles as caching a 200 [02:22:12] different mechanisms, but either way you have a way to tell the cache if you don't want it to keep caching that [02:22:46] yeah, varnish should just do the same thing browsers do. During the public (s)maxage, handle 304 yourself [02:23:01] https://www.varnish-software.com/static/book/HTTP.html [02:23:17] and after that, check with the backend and if it gives 304 (which has no body) extend the same object, and if 200 replace. [02:23:26] ^ that represents varnish's view of it. but it's kind of a mixed document, "let's review how this works in general + this is what varnish does" [02:24:33] ori: Hm.. does varnish say in the log how long it took to get the body? [02:24:50] or would you need to compute it manually [02:24:56] brb, dinner [02:27:42] Krinkle: I'd have varnishrls keep a buffer of timestamps which occurred in the last second [02:28:47] 6operations, 10Traffic, 5Patch-For-Review: Sort out DHE for Forward Secrecy w/ older clients - https://phabricator.wikimedia.org/T104281#1423717 (10BBlack) p:5Triage>3Normal [02:29:17] and pronounce a request a cache hit if the timestamp is from more than a second ago or if it is in the last second's buffer [02:29:48] a one-second threshold may mis-classify some pathological responses, we could make it 5 seconds, that's not too many timestamps to hold in memory [02:30:36] you'd be surprised even on one server, the request rate on load.php heh [02:31:10] I don't get the logic on looking for request-rate gaps though [02:32:42] Hm... the main objective is to determine whether the current request had to be generated by mediawiki or is cached in varnish (e.g. internal 304-ish), separately whether the client made a 304 or 200 trip out of it. [02:32:57] So if x-debug contains 'hit' anywhere, it will have been from varnish land, I guess? [02:33:13] Krinkle: no [02:33:25] it can be "hit hit hit" and have not come from varnish [02:33:40] I don't understand [02:34:07] that's why I'm saying the current X-Cache is insufficient for what we really want to know about hitrates (I think it was more designed for debug tracing) [02:34:50] Krinkle: hit can mean "hit-for-pass", which just means a previously-made explicit decision to always intentionally miss that request for a certain amount of time. [02:35:17] a hit-for-pass object sites in the cache memory like a real hit for lookups, but says "I don't have content, I'm just here to cache the fact that you shouldn't use me" [02:35:22] s/sites/sits/ [02:35:54] Hm.. example? [02:36:07] http://stackoverflow.com/questions/12691489/varnish-hit-for-pass-means [02:36:08] aaanyway, gotta go. Will be back in an hour or so [02:36:11] thanks! [02:36:19] ^ that pretty much explains it, the SO thing [02:36:35] but the bottom line for today is: X-Cache "hit" is either true hit or the above hit-for-pass [02:42:08] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 11m 43s) [02:42:17] Logged the message, Master [02:49:31] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-03 02:49:31+00:00 [02:49:38] Logged the message, Master [02:50:07] !log restbase rolling restart [02:50:15] Logged the message, Master [02:55:03] 6operations: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1423748 (10Krenair) mira has base::firewall which tin doesn't have, and tin has mysql, role::labsdb::manager and role::releases::upload which mira doesn't have. Not sure about role::labsdb::manager or role::rel... [03:06:00] (03PS1) 10Dzahn: install mysql-client in role::deployment:server [puppet] - 10https://gerrit.wikimedia.org/r/222533 (https://phabricator.wikimedia.org/T95436) [03:07:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds [03:13:15] (03PS1) 10Dzahn: role::deployment - no ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/222534 [03:13:32] (03PS2) 10Dzahn: role::deployment - no ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/222534 [03:22:05] (03PS1) 10Dzahn: restbase - fix => alignment (lint) [puppet] - 10https://gerrit.wikimedia.org/r/222535 [03:25:15] (03PS2) 10Dzahn: restbase - fix => alignment (lint) [puppet] - 10https://gerrit.wikimedia.org/r/222535 [03:40:55] (03PS1) 10Dzahn: few more lint fixes in role classes [puppet] - 10https://gerrit.wikimedia.org/r/222536 [03:44:18] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1423787 (10Eevans) This is currently running on restbase100{1,2,6}.eqiad in infinite loops, on 60 second intervals, in screen sessions under my user. [04:08:19] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 18.52% of data above the critical threshold [100000000.0] [04:18:08] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1423807 (10Krenair) [04:19:39] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [04:20:06] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1190079 (10Krenair) How are we going to handle sync of mediawiki-staging between tin and mira? Wouldn't we want any sort of git change on one to be reflected on the... [04:46:56] (03PS1) 10Springle: s7 pager slave partitioning [software] - 10https://gerrit.wikimedia.org/r/222538 [04:47:18] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [04:58:47] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) 3NEW a:3BBlack [04:59:21] bblack: Thanks, that was useful. [05:00:54] 6operations, 10Traffic, 5Patch-For-Review: Sort out DHE for Forward Secrecy w/ older clients - https://phabricator.wikimedia.org/T104281#1423914 (10BBlack) [05:00:55] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1423915 (10BBlack) [05:00:57] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1423917 (10BBlack) [05:00:59] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1423916 (10BBlack) [05:01:01] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1423919 (10BBlack) [05:01:15] lol [05:01:26] wikibugs can't handle a handful of new blocked-by tasks I guess :) [05:03:49] hi YuviPanda [05:05:38] does wikibugs need manual restarts usually on flood? [05:06:34] it doesn't [05:06:50] but it will ony rejoin channels when it has something to say [05:06:57] except -labs, which it always joins [05:16:49] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [06:00:45] _joe_: why are conftool-data and hieradata separate file hierarchies? they are both hierarchies of yaml flies. why not treat them as a single body of metadata that can be operated on by multiple tools? [06:17:45] 6operations, 7Database: codfw frontends cannot connect to mysql at db2029 - https://phabricator.wikimedia.org/T104573#1423975 (10jcrespo) Network conectivity is ok (I cannot discard it being too slow or other problem)- I can curl to the mysql port and I can see the connections initiating on netstat. There was... [06:19:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 3 06:19:02 UTC 2015 (duration 19m 1s) [06:19:08] Logged the message, Master [06:29:58] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [06:30:29] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail [06:31:18] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail [06:31:30] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures [06:31:32] (03PS1) 10Ori.livneh: Tessera on misc vcl: return (pass) early to allow additional HTTP method [puppet] - 10https://gerrit.wikimedia.org/r/222542 [06:31:39] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.506 second response time [06:31:59] PROBLEM - puppet last run on db1022 is CRITICAL Puppet has 1 failures [06:32:00] (03PS2) 10Ori.livneh: Tessera on misc vcl: return (pass) early to allow additional HTTP method [puppet] - 10https://gerrit.wikimedia.org/r/222542 [06:32:44] (03CR) 10Ori.livneh: [C: 04-1] "We probably don't want to do that because it'll also disable caching." [puppet] - 10https://gerrit.wikimedia.org/r/222542 (owner: 10Ori.livneh) [06:33:59] PROBLEM - puppet last run on cp2026 is CRITICAL Puppet has 1 failures [06:34:19] PROBLEM - puppet last run on labstore1003 is CRITICAL Puppet has 1 failures [06:34:19] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet has 1 failures [06:34:39] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 1 failures [06:34:39] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 1 failures [06:34:59] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures [06:37:29] PROBLEM - puppet last run on lvs1002 is CRITICAL Puppet has 1 failures [06:38:08] PROBLEM - puppet last run on lead is CRITICAL Puppet has 1 failures [06:38:08] PROBLEM - puppet last run on db1015 is CRITICAL Puppet has 1 failures [06:38:09] PROBLEM - puppet last run on elastic1008 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on db2064 is CRITICAL Puppet has 1 failures [06:38:28] PROBLEM - puppet last run on wtp2001 is CRITICAL Puppet has 1 failures [06:38:38] PROBLEM - puppet last run on es2001 is CRITICAL Puppet has 1 failures [06:38:39] PROBLEM - puppet last run on db2002 is CRITICAL Puppet has 1 failures [06:38:39] PROBLEM - puppet last run on lvs3001 is CRITICAL Puppet has 1 failures [06:38:49] PROBLEM - puppet last run on ms-fe1001 is CRITICAL Puppet has 1 failures [06:38:50] PROBLEM - puppet last run on db2045 is CRITICAL Puppet has 1 failures [06:38:59] PROBLEM - puppet last run on es2009 is CRITICAL Puppet has 1 failures [06:39:09] PROBLEM - puppet last run on elastic1021 is CRITICAL Puppet has 1 failures [06:39:09] PROBLEM - puppet last run on ruthenium is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on db1059 is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on elastic1012 is CRITICAL Puppet has 1 failures [06:39:20] PROBLEM - puppet last run on iron is CRITICAL Puppet has 1 failures [06:39:39] PROBLEM - puppet last run on ms-fe1004 is CRITICAL Puppet has 1 failures [06:39:39] PROBLEM - puppet last run on wtp2018 is CRITICAL Puppet has 1 failures [06:39:49] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures [06:39:58] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures [06:39:58] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [06:41:18] PROBLEM - puppet last run on mw2082 is CRITICAL Puppet has 1 failures [06:41:39] PROBLEM - puppet last run on mw1254 is CRITICAL Puppet has 2 failures [06:42:10] PROBLEM - puppet last run on mw2213 is CRITICAL Puppet has 1 failures [06:42:18] PROBLEM - puppet last run on mw2118 is CRITICAL Puppet has 1 failures [06:42:29] PROBLEM - puppet last run on mw1009 is CRITICAL Puppet has 1 failures [06:42:59] PROBLEM - puppet last run on mw1160 is CRITICAL Puppet has 1 failures [06:43:09] PROBLEM - puppet last run on mw2033 is CRITICAL Puppet has 1 failures [06:43:29] PROBLEM - puppet last run on mw2105 is CRITICAL Puppet has 1 failures [06:43:38] PROBLEM - puppet last run on mw1099 is CRITICAL Puppet has 1 failures [06:43:39] PROBLEM - puppet last run on mw1117 is CRITICAL Puppet has 1 failures [06:43:39] PROBLEM - puppet last run on mw1003 is CRITICAL Puppet has 1 failures [06:43:39] PROBLEM - puppet last run on mw1008 is CRITICAL Puppet has 1 failures [06:43:49] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures [06:43:59] PROBLEM - puppet last run on mw1176 is CRITICAL Puppet has 1 failures [06:44:08] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 1 failures [06:44:19] PROBLEM - puppet last run on mw1069 is CRITICAL Puppet has 1 failures [06:44:29] PROBLEM - puppet last run on mw1088 is CRITICAL Puppet has 1 failures [06:44:29] PROBLEM - puppet last run on mw1226 is CRITICAL Puppet has 1 failures [06:44:30] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:44:39] PROBLEM - puppet last run on mw1222 is CRITICAL Puppet has 1 failures [06:44:50] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures [06:44:58] PROBLEM - puppet last run on mw1189 is CRITICAL Puppet has 1 failures [06:44:58] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:44:59] PROBLEM - puppet last run on mw1150 is CRITICAL Puppet has 1 failures [06:44:59] RECOVERY - puppet last run on db1022 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:45:08] PROBLEM - puppet last run on mw1068 is CRITICAL Puppet has 1 failures [06:45:09] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures [06:45:09] RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:45:09] PROBLEM - puppet last run on mw2163 is CRITICAL Puppet has 1 failures [06:45:19] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures [06:45:19] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [06:45:19] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:45:29] RECOVERY - puppet last run on labstore1003 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:29] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:45:38] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures [06:45:49] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:45:49] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:45:58] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:45:58] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:45:59] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures [06:46:08] RECOVERY - puppet last run on es2001 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on db2002 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:10] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on db2045 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on es2009 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on elastic1021 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on db1059 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on elastic1012 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on lvs1002 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on iron is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:08] RECOVERY - puppet last run on ms-fe1004 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:09] RECOVERY - puppet last run on wtp2018 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:18] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on lead is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:28] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on db1015 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:47:29] RECOVERY - puppet last run on elastic1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:38] RECOVERY - puppet last run on db2064 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:48] RECOVERY - puppet last run on wtp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:59] RECOVERY - puppet last run on lvs3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on mw1160 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:48:38] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:39] RECOVERY - puppet last run on mw2082 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:48:59] RECOVERY - puppet last run on mw2105 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:48:59] RECOVERY - puppet last run on mw1254 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:09] RECOVERY - puppet last run on mw1003 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:49:09] RECOVERY - puppet last run on mw1008 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:49:19] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:49:29] RECOVERY - puppet last run on mw1176 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:49:38] RECOVERY - puppet last run on mw2213 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:49:39] RECOVERY - puppet last run on mw2118 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:39] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:49:39] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:49:50] RECOVERY - puppet last run on mw1009 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:49:58] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:49:59] RECOVERY - puppet last run on mw1226 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:00] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:50:09] RECOVERY - puppet last run on mw1222 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:20] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:50:28] RECOVERY - puppet last run on mw1189 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:28] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:50:29] RECOVERY - puppet last run on mw1150 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:29] RECOVERY - puppet last run on mw2033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:38] RECOVERY - puppet last run on mw1068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:38] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:39] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:48] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:48] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:49] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:50:58] RECOVERY - puppet last run on mw1099 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:58] RECOVERY - puppet last run on mw1117 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:59] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:28] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:28] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:40] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:52:39] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:28] PROBLEM - puppet last run on labsdb1007 is CRITICAL Puppet has 1 failures [07:18:59] RECOVERY - puppet last run on labsdb1007 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:20:25] We had some issues on s7 replication on dbstore2X hosts due to an incorrect puppet config, I have disabled puppet there until we fix fully the issues and fix the puppet config [07:20:37] (03PS1) 10Muehlenhoff: Annotate some recently assigned CVE IDs for our Linux kernel package [debs/linux] - 10https://gerrit.wikimedia.org/r/222548 [07:22:24] details are on T104471 and it did not affect users, but we were forced to repopulate partiall our backup machines on codfw [07:35:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Annotate some recently assigned CVE IDs for our Linux kernel package [debs/linux] - 10https://gerrit.wikimedia.org/r/222548 (owner: 10Muehlenhoff) [07:47:09] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 766.141468458 [07:50:45] <_joe_> ori: just to avoid confusion, as hiera data is by its nature non-global [07:51:04] <_joe_> (I'm off, just passing by) [07:51:29] <_joe_> ori: but I guess we will and can integrate the two [07:51:57] <_joe_> (puppet will consume the same data as conftool for a number of purposes) [07:52:11] <_joe_> ok, I'm off, see everyone on monday [07:54:31] _joe_: have a good weekend [08:04:40] 6operations, 10MediaWiki-Sites, 10SEO, 5HTTPS-by-default, and 3 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1424083 (10Nemo_bis) >>! In T67402#1414489, @Nemo_bis wrote: > Does the patch fix http://de.wikipedia... [08:30:29] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail [08:47:19] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:52:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 17 data above and 9 below the confidence bounds [09:36:24] !log restbase restarting cassandra on rb1002 [09:36:28] * mobrovac sighs [09:36:31] Logged the message, Master [09:46:29] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [09:51:50] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60560 bytes in 0.686 second response time [09:59:26] (03PS1) 10Muehlenhoff: add ferm rules for redis [puppet] - 10https://gerrit.wikimedia.org/r/222554 [10:00:58] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [10:04:50] (03PS1) 10Muehlenhoff: add ferm rules for memcached [puppet] - 10https://gerrit.wikimedia.org/r/222556 [10:11:10] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:18:36] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Services: Protect background jobs from unhandled exceptions - https://phabricator.wikimedia.org/T104581#1424234 (10mark) [10:26:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [10:36:32] 6operations, 10ops-codfw: Equip osm-cp200{1,2,3,4} with 2 1.2TB SSDs each - https://phabricator.wikimedia.org/T104610#1424322 (10mark) Two of these servers would actually be databases, not caches, so perhaps we should rename them? [10:44:29] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [10:44:45] mobrovac: that you? [10:44:58] nope, restarting [10:45:18] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [10:46:19] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [10:48:57] (03PS1) 10Jcrespo: Change replication filters on dbstore hosts to use one per line [puppet] - 10https://gerrit.wikimedia.org/r/222562 (https://phabricator.wikimedia.org/T104471) [10:50:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:50:44] !log started du of maps project on labstore2001 [10:50:46] mark: ^ [10:50:50] Logged the message, Master [10:51:05] :) [10:51:12] (03PS2) 10Jcrespo: Change replication filters on dbstore hosts to use one per line [puppet] - 10https://gerrit.wikimedia.org/r/222562 (https://phabricator.wikimedia.org/T104471) [10:51:37] mark: going to take a while, I think :) [10:51:44] for sure [10:52:14] I am going to +2 my own change because it is almost an unbreak now (data loss on backups) [10:52:49] rb1005 cass dying on us constantly [10:52:52] * mobrovac looking [10:53:08] (03CR) 10Jcrespo: [C: 032] Change replication filters on dbstore hosts to use one per line [puppet] - 10https://gerrit.wikimedia.org/r/222562 (https://phabricator.wikimedia.org/T104471) (owner: 10Jcrespo) [10:53:48] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [10:56:17] !log restbase disabling puppet on restbase1005 to tweak JVM params for cassandra [10:56:22] Logged the message, Master [10:57:09] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:20] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [10:59:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:00:18] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.003 second response time on port 9042 [11:01:34] uf, ok, saved it [11:01:45] still, old gen space is too high [11:06:19] PROBLEM - puppet last run on mw1005 is CRITICAL Puppet has 1 failures [11:09:15] !log reimports finished on dbstore2* hosts and puppet reenabled after T104471 was fixed [11:09:21] Logged the message, Master [11:09:44] !log lvcreate -L 6TB -n tools-20150703 backup on labstore2001 [11:09:50] Logged the message, Master [11:10:07] mark: ^ created! [11:10:14] I guess this isn't mounted anywhere atm [11:10:18] yes, you can see it in 'lvs' though [11:10:23] indeed [11:10:34] so now you need to create an ext4 file system [11:10:53] for this purpose we'll just do it naively [11:11:24] which is 'mkfs -t ext4 DEVICEFILE' [11:11:24] mkfs? [11:11:38] so first verify the existence of the device file [11:12:08] root@labstore2001:~# ls -l /dev/mapper/backup-tools--20150703 [11:12:23] where's th/dev/mapper/backup-tools--20150703 exists [11:12:24] yeah [11:12:25]