[00:16:30] (03PS1) 10Dzahn: static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 [00:28:24] (03PS2) 10Dzahn: static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 [00:30:58] (03PS3) 10Dzahn: static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 (https://phabricator.wikimedia.org/T101734) [00:31:57] (03CR) 10Dzahn: [C: 032] static-bugzilla: ensure /srv/org/wikimedia exists [puppet] - 10https://gerrit.wikimedia.org/r/222515 (https://phabricator.wikimedia.org/T101734) (owner: 10Dzahn) [00:32:52] 6operations, 7Database: codfw frontends cannot connect to mysql at db2029 - https://phabricator.wikimedia.org/T104573#1423567 (10Springle) (4) == EINTR on connect. Presumably the max_connections you observed, which in turn possibly something to do with: * hhvm timeout (but presumably T98489 was deployed to C... [00:45:10] (03PS2) 10Ori.livneh: varnishlog: allow passing NULL parameter to VCL_Arg() [puppet] - 10https://gerrit.wikimedia.org/r/222507 [00:45:16] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishlog: allow passing NULL parameter to VCL_Arg() [puppet] - 10https://gerrit.wikimedia.org/r/222507 (owner: 10Ori.livneh) [00:49:40] 6operations, 6Security, 7Database: Deployment/restricted root MySQL access? - https://phabricator.wikimedia.org/T104666#1423605 (10Krenair) 3NEW [00:53:14] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [00:54:49] (03CR) 10Dzahn: Add Phragile module. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [01:00:21] (03PS1) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [01:02:09] (03PS2) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [01:02:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 9 below the confidence bounds [01:06:01] (03PS1) 10Dzahn: add parsoid/ocg/bastiononly use groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 [01:07:30] (03PS3) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [01:08:07] (03PS2) 10Dzahn: add parsoid/ocg/bastiononly user groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 [01:15:49] (03PS1) 10Dzahn: bromine: remove roles except base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:19:09] (03PS2) 10Dzahn: bromine: remove roles except base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/222523 (https://phabricator.wikimedia.org/T101734) [01:22:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60573 bytes in 0.074 second response time [01:24:06] (03PS1) 10Ori.livneh: varnishrls: include cache hit / miss stats from X-Cache header [puppet] - 10https://gerrit.wikimedia.org/r/222524 [01:32:37] 6operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#1423684 (10Krenair) 3NEW [01:33:55] (03PS2) 10Ori.livneh: varnishrls: include cache hit / miss stats from X-Cache header [puppet] - 10https://gerrit.wikimedia.org/r/222524 [01:34:05] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishrls: include cache hit / miss stats from X-Cache header [puppet] - 10https://gerrit.wikimedia.org/r/222524 (owner: 10Ori.livneh) [01:37:43] 6operations: Rename 'restricted' group? 
- https://phabricator.wikimedia.org/T104671#1423691 (10Krenair) [01:38:14] PROBLEM - puppet last run on db1044 is CRITICAL Puppet has 1 failures [01:38:29] (03PS1) 10Ori.livneh: Fix-up for Ia481719de: include 're' [puppet] - 10https://gerrit.wikimedia.org/r/222527 [01:38:44] (03PS2) 10Ori.livneh: Fix-up for Ia481719de: include 're' [puppet] - 10https://gerrit.wikimedia.org/r/222527 [01:38:51] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for Ia481719de: include 're' [puppet] - 10https://gerrit.wikimedia.org/r/222527 (owner: 10Ori.livneh) [01:41:49] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1423694 (10Spage) Should this be closed? `/w/static` and //wikiname//`/load.php` URLs still work on bits, e.g. https://bits.wikimedia.org/static/1.26wmf12/resources/ass... [01:44:15] (03PS3) 10Dzahn: bromine: add standard, remove other role [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:47:28] (03PS4) 10Dzahn: bromine: add standard [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:47:30] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1423696 (10BBlack) Yeah, to the degree possible, we've left everything we can still working on bits during the transition. The next major step we're coming up on is phy... 
[01:48:16] (03PS5) 10Dzahn: bromine: add standard [puppet] - 10https://gerrit.wikimedia.org/r/222523 [01:48:59] (03CR) 10Dzahn: [C: 032] bromine: add standard [puppet] - 10https://gerrit.wikimedia.org/r/222523 (owner: 10Dzahn) [01:51:07] ori: with hit/miss, keep in mind that the X-Cache level doesn't really differentiate hit-for-pass either [01:52:44] (so it's going to call it a hit if the VCL told it to pass the request to the backend for some reason, but that decision to pass it to the backend was a cacheable decision) [01:53:04] RECOVERY - puppet last run on db1044 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [01:53:56] and then also, hit/miss of backend layers in X-Cache gets frozen in true hits in the front layers (backend hit/miss count is frozen into frontend cache response when the frontend caches it, based on what it was during the first fetch from backend). [01:54:28] all in all, relying on X-Cache to determine true cache hitrates is, awkward at best heh :) [01:57:26] 6operations, 5Patch-For-Review: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1423700 (10Dzahn) [01:57:28] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1423698 (10Dzahn) 5Open>3Resolved that was my fault applying a role on initial run that caused a puppet fail before users were setup and not applying "standard' first. works now. resolving. [01:57:30] bblack: would determining cache hit / miss based on some latency threshold work? [01:57:59] I can't think of a reliable way to do that, no. [01:58:20] X-Cache is the right type of thing to do it with, it's just not got sufficient information in the header today. 
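The X-Cache caveats bblack describes above (a "hit" entry may really be a hit-for-pass, and the entries from deeper layers are frozen into the cached object, so only the final frontend entry is live information for the current request) are easier to see against a concrete header. A minimal parsing sketch in Python; the header shape (`cp1055 hit (3), ...`) and the `parse_x_cache` helper are illustrative assumptions, not the actual varnishrls code:

```python
import re

# Hypothetical parser for an X-Cache header such as
#   "cp1055 hit (3), cp1068 miss (0), cp4010 hit (999)"
# Per the discussion: each layer appends one entry; on a true frontend
# hit, the deeper-layer entries are frozen copies from the original
# fetch, and a "hit" may actually be a hit-for-pass.
ENTRY_RE = re.compile(r'(\S+)\s+(hit|miss|pass|int)(?:\s+\((\d+)\))?')

def parse_x_cache(header):
    """Return a list of (host, status, hit_count) tuples, frontend last."""
    return [(host, status, int(hits) if hits else 0)
            for host, status, hits in ENTRY_RE.findall(header)]

entries = parse_x_cache("cp1055 hit (3), cp1068 miss (0), cp4010 hit (999)")
# Only the final entry reflects the request actually being observed;
# and even there, "hit" does not distinguish hit-for-pass.
frontend_host, frontend_status, frontend_hits = entries[-1]
```

The parse alone cannot recover a true hit rate, which is the point of the conversation: the header simply doesn't carry enough information.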
[01:58:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [01:59:29] basically we need a better X-Cache, and there are other things to fold up with that too, in some proposal for better global request tracing [01:59:57] a header that every layer appends to: nginx, each varnish, apache at the appserver, maybe mediawiki tosses something relevant or interesting into it too [02:00:33] that's what google's dapper does: http://research.google.com/pubs/pub36356.html [02:01:04] part of the protobuf RPC protocol that services use to talk to each other is some unique request id [02:01:24] which gets passed around and threaded through the various services implicated in generating a response to some particular user request [02:02:06] well yeah, I guess there are two inverse ways of looking at the problem [02:02:06] twitter took the paper as a spec and built an open-source version, i haven't tried it out tho [02:02:25] 1) generate a unique ID at the very very front layer, and use that ID in all logging/analysis/trace deeper down [02:02:54] 2) generate a single trace-header that every layer on down appends info to, so that the request and response contain the whole chain as they go. [02:03:15] probably both are useful in different ways [02:03:19] Hm.. varnish doesn't have an indication/distinction between responding to a client with an http response it has cached under a key vs. passing from elsewhere? [02:03:48] the http response passed from elsewhere also bears the x-cache header [02:03:49] Krinkle: at the VCL level we can tell which it is, so I'm sure there's a way to encode that in a header, too. Just needs some thinking. [02:04:08] Ah yeah [02:04:08] but X-Cache currently relies on obj.hit, which can be hit or hit-for-pass [02:04:08] you can also 'subscribe' to VCL routine names [02:04:20] Depending on what we want to measure, the metric ori and I want should count a 'hit' even when it was a pass to a backend where it was a hit.
[02:04:27] only 'miss' if neither varnish layer had it [02:04:30] (or measure both) [02:04:44] backend varnish that is [02:04:49] I think I got it right [02:04:54] I think it's interesting to look at the two hitrates independently, because we're definitely not stuck with the idea that we need 2-3 layers here :) [02:05:06] I only count when we have two (hit|miss) in the header [02:05:13] and I only count the header as sent to the client [02:05:35] I'm not sure what the "two (hit|miss)" part does [02:05:40] I didn't follow along here, but I scanned something about that header itself being cached? [02:06:24] both hit-for-pass and miss will go through the next layer down [02:06:30] !log ori Synchronized php-1.26wmf12/extensions/CentralAuth: I0e5f2d3b2: Updated mediawiki/core Project: mediawiki/extensions/CentralAuth 7f8da7139714dd5089dd03e8679aba25c2c89c4d (duration: 00m 15s) [02:06:37] Logged the message, Master [02:07:11] yeah, disregard the two (hit|miss); that wasn't the solution. ignoring all instances of X-Cache except TxHeader with remote party == 'client' was the right one [02:07:28] on the frontends, yes [02:07:36] yeah this is just the frontends [02:07:41] jgage: what was your commandline again to copy files without agent forwarding [02:07:45] and instead of somehow trying to collapse a 'miss, hit' into a miss or a hit i just record it as a 'hit_miss' [02:07:47] mutante: scp -3 [02:07:59] scp -3 somehost:~/foo otherhost:~/bar [02:09:06] bblack: I don't understand what X-Cache: hit, hit, hit represents [02:09:10] Krinkle: the cached header part is: when the frontend appends its own cp10xx hit (999) to the rest of the existing line from the deeper cache headers... if it was a true memcache hit in the frontend, the existing line that it's appending to is frozen in the cache. [02:09:18] if you get a hit, why go another layer deep? 
[02:09:48] oh, wait, I think I get it [02:09:50] ori: I think it builds the other way [02:09:53] ori: that means at some point the object fell out of the front cache and had to be re-fetched a layer down, where it was a hit. and now the request you're looking at is a hit on the cache from that re-fetch [02:10:03] right [02:10:16] so it appends to the hit from the backend [02:10:46] bblack: And the last front-end 'hit' appendix is not saved, I guess. [02:10:50] it does that each time it fetches from memory [02:10:54] each layer does its own separate append to the X-Cache line, and in a "true" cache hit, the info in that line from deeper layers is also frozen in the cache [02:11:28] ah, so the last value is accurate from the frontend. [02:11:41] the final entry on the line is from the actual frontend cache itself, and represents the actual #hits as a live counter on something that's truly hot and stuck there. [02:12:13] (all of this modulo the fact that X-Cache calls some things hits that are not hits in the sense we care about) [02:12:26] I think we can assume that the varnish backend will always have it if it was a hit in the frontend. Even if it was originally not a backend hit, it will be now. Unless it's dropped off there already. [02:12:42] will/would [02:12:44] I think using a latency threshold is dirty but probably the easiest way to marshal some confidence about this [02:12:44] there are scenarios where it can go either way [02:13:03] these are load.php requests; they take at least >50ms for the backend to generate [02:13:15] the distribution should be very distinctly bimodal [02:13:21] sometimes something sticks fairly well in the backends but occasionally falls off the frontends and refetches. sometimes something stays very hot in the frontends but rarely lasts long in the backends. [02:14:02] ori: yeah for this particular case of load.php, latency is probably a reasonable way to look at it.
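The latency heuristic ori and bblack just agreed is reasonable for load.php could be sketched as follows. Because backend generation takes >50ms while cache hits return in a few ms, the service-time distribution is bimodal; the 50ms cutoff and the plain list-of-latencies input here are illustrative assumptions, not production values:

```python
# Sketch of the latency-threshold classifier discussed above.
THRESHOLD_S = 0.05  # assumed cutoff between the two modes, in seconds

def classify_by_latency(latencies):
    """Guess 'hit'/'miss' for each per-request service time (seconds)."""
    return ['hit' if t < THRESHOLD_S else 'miss' for t in latencies]

# A bimodal sample: fast responses cluster around a few ms,
# slow ones around backend generation time.
sample = [0.002, 0.003, 0.180, 0.004, 0.095]
labels = classify_by_latency(sample)
estimated_hit_rate = labels.count('hit') / len(labels)
```

As bblack notes, this is "dirty": it misclassifies pathological cases (slow hits, unusually fast misses), but for a strongly bimodal endpoint it gives a usable confidence check on the X-Cache interpretation.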
[02:14:27] or have mediawiki stick a timestamp as a header indicating when the response was generated [02:14:44] Hm.. yeah. we could use a boolean ms time as a secondary factor to verify the accuracy of our x-cache data interpretation [02:14:48] anyways, what this all needs in the general case is more VCL work on building a better trace header, which should include 3-way differentiation in varnish on hit vs miss vs hit-for-pass [02:15:07] i get a brain freeze just trying to size up that task [02:15:10] right now miss is a true miss, but hit can be either hit or hit-for-pass [02:15:50] (with hit-for-pass, if it's consistent across the layers, you'd see counter increments at all layers too, but not at the same rates) [02:16:02] bblack: so 'miss' means it is absent in the frontend (saying nothing about backend or app server) and 'hit' is hit in frontend. hit-for-pass is.. [02:16:20] (due to chash funneling, hit-for-pass hits would increment more slowly in the frontend and faster in the backend) [02:17:00] brain freeze with you. I think I'm lost, but I can't tell. [02:17:09] Krinkle: the hit/miss entries that get recorded in each entry in X-Cache are only local information, they know nothing about the other layers. [02:17:25] if it records a miss there, that means it was a true miss at its own layer, and it had to fetch from a layer down for sure. [02:17:49] varnish implements its own 304 handling, right? [02:17:58] If it records a hit there, that can be either a true hit (served from memory directly), or a hit-for-pass (meaning it has cached the fact that it always has to force-miss this for a while and fetch it from the backend) [02:18:00] Or do we vary by that, and then cache the handling of the app server? [02:18:15] e.g. there's a cache object for the 200 response and one for the 304 response?
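bblack's description of hit-for-pass just above (an object that occupies the cache and "hits" on lookup, but only caches the decision to force a fetch) can be modeled with a toy cache. This is purely an illustration of the semantics, not Varnish internals:

```python
class ToyCache:
    """Toy model of the hit-for-pass behaviour described above.
    Note that 'hit' covers both a true hit and a hit-for-pass --
    exactly the ambiguity the X-Cache header has today."""

    def __init__(self):
        self.store = {}  # url -> ('obj', body) or ('hfp', None)

    def get(self, url, fetch):
        """Return (x_cache_status, body)."""
        entry = self.store.get(url)
        if entry is None:
            body, cacheable = fetch()
            # Cache the object, or cache the *decision* not to cache it.
            self.store[url] = ('obj', body) if cacheable else ('hfp', None)
            return ('miss', body)
        kind, cached_body = entry
        if kind == 'hfp':
            body, _ = fetch()   # forced fetch, yet X-Cache would say "hit"
            return ('hit', body)
        return ('hit', cached_body)  # true hit, served from memory

cache = ToyCache()
fetch_count = [0]

def uncacheable_backend():
    fetch_count[0] += 1
    return ('private page', False)   # e.g. a response marked uncacheable

cache.get('/user', uncacheable_backend)              # first time: 'miss'
status, _ = cache.get('/user', uncacheable_backend)  # 'hit', yet fetched again
```

The second lookup reports "hit" while the backend is contacted both times, which is why counting X-Cache "hit" entries overstates the true hit rate.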
[02:18:17] ori: thanks, different solution but works (slowly:) [02:19:09] Krinkle: it can't vary on the response, only on the request (because we can't see a future potential response when deciding whether to use the cache for a given request) [02:19:13] first downloading and then uploading would be the same thing though it seems [02:20:28] right [02:20:43] Krinkle: but I'm not sure about 304 in particular, whether varnish handles that. I would expect it to, but I don't know where the docs are on that. [02:21:32] I mean, on the one hand, we want varnish to cache on itself. But we also need changes to propagate [02:21:51] so eventually it's gonna have to ask the app server whether the ETag or Last-Modified is still usable [02:21:53] well 304 is just based on the same basic principles as caching a 200 [02:22:12] different mechanisms, but either way you have a way to tell the cache if you don't want it to keep caching that [02:22:46] yeah, varnish should just do the same thing browsers do. During the public (s)maxage, handle 304 yourself [02:23:01] https://www.varnish-software.com/static/book/HTTP.html [02:23:17] and after that, check with the backend and if it gives 304 (which has no body) extend the same object, and if 200 replace. [02:23:26] ^ that represents varnish's view of it. but it's kind of a mixed document, "let's review how this works in general + this is what varnish does" [02:24:33] ori: Hm.. does varnish say in the log how long it took to get the body?
[02:24:50] or would you need to compute it manually [02:24:56] brb, dinner [02:27:42] Krinkle: I'd have varnishrls keep a buffer of timestamps which occurred in the last second [02:28:47] 6operations, 10Traffic, 5Patch-For-Review: Sort out DHE for Forward Secrecy w/ older clients - https://phabricator.wikimedia.org/T104281#1423717 (10BBlack) p:5Triage>3Normal [02:29:17] and pronounce a request a cache hit if the timestamp is from more than a second ago or if it is in the last second's buffer [02:29:48] a one-second threshold may mis-classify some pathological responses, we could make it 5 seconds, that's not too many timestamps to hold in memory [02:30:36] you'd be surprised even on one server, the request rate on load.php heh [02:31:10] I don't get the logic on looking for request-rate gaps though [02:32:42] Hm... the main objective is to determine whether the current request had to be generated by mediawiki or is cached in varnish (e.g. internal 304-ish), separately whether the client made a 304 or 200 trip out of it. [02:32:57] So if x-debug contains 'hit' anywhere, it will have been from varnish land, I guess? [02:33:13] Krinkle: no [02:33:25] it can be "hit hit hit" and have not come from varnish [02:33:40] I don't understand [02:34:07] that's why I'm saying the current X-Cache is insufficient for what we really want to know about hitrates (I think it was more designed for debug tracing) [02:34:50] Krinkle: hit can mean "hit-for-pass", which just means a previously-made explicit decision to always intentionally miss that request for a certain amount of time. [02:35:17] a hit-for-pass object sits in the cache memory like a real hit for lookups, but says "I don't have content, I'm just here to cache the fact that you shouldn't use me" [02:35:54] Hm.. example? [02:36:07] http://stackoverflow.com/questions/12691489/varnish-hit-for-pass-means [02:36:08] aaanyway, gotta go. Will be back in an hour or so [02:36:11] thanks!
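ori's buffer-of-timestamps idea above could be sketched like this. It assumes a mediawiki-supplied generation-timestamp header (the mechanism proposed earlier in the discussion, not an existing varnishrls feature), and the bookkeeping details are guesses at what ori means:

```python
from collections import deque

class TimestampHitGuesser:
    """Sketch of the heuristic described above: call a response a cache
    hit if its backend generation timestamp is older than the window, or
    if the same timestamp was already seen within the window (i.e. the
    object was generated for an earlier request and is now cached).
    Default window is the looser 5s threshold suggested above."""

    def __init__(self, window=5.0):
        self.window = window
        self.recent = deque()  # (seen_at, generated_at) pairs

    def classify(self, now, generated_at):
        # Expire buffer entries older than the window.
        while self.recent and now - self.recent[0][0] > self.window:
            self.recent.popleft()
        is_hit = (now - generated_at > self.window or
                  any(g == generated_at for _, g in self.recent))
        self.recent.append((now, generated_at))
        return 'hit' if is_hit else 'miss'
```

As ori concedes, a fixed window can misclassify pathological responses; it trades exactness for a tiny memory footprint per server.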
[02:36:19] ^ that pretty much explains it, the SO thing [02:36:35] but the bottom line for today is: X-Cache "hit" is either true hit or the above hit-for-pass [02:42:08] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 11m 43s) [02:42:17] Logged the message, Master [02:49:31] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-03 02:49:31+00:00 [02:49:38] Logged the message, Master [02:50:07] !log restbase rolling restart [02:50:15] Logged the message, Master [02:55:03] 6operations: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1423748 (10Krenair) mira has base::firewall which tin doesn't have, and tin has mysql, role::labsdb::manager and role::releases::upload which mira doesn't have. Not sure about role::labsdb::manager or role::rel... [03:06:00] (03PS1) 10Dzahn: install mysql-client in role::deployment:server [puppet] - 10https://gerrit.wikimedia.org/r/222533 (https://phabricator.wikimedia.org/T95436) [03:07:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds [03:13:15] (03PS1) 10Dzahn: role::deployment - no ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/222534 [03:13:32] (03PS2) 10Dzahn: role::deployment - no ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/222534 [03:22:05] (03PS1) 10Dzahn: restbase - fix => alignment (lint) [puppet] - 10https://gerrit.wikimedia.org/r/222535 [03:25:15] (03PS2) 10Dzahn: restbase - fix => alignment (lint) [puppet] - 10https://gerrit.wikimedia.org/r/222535 [03:40:55] (03PS1) 10Dzahn: few more lint fixes in role classes [puppet] - 10https://gerrit.wikimedia.org/r/222536 [03:44:18] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1423787 (10Eevans) This is currently running on restbase100{1,2,6}.eqiad in infinite loops, on 60 second 
intervals, in screen sessions under my user. [04:08:19] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 18.52% of data above the critical threshold [100000000.0] [04:18:08] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1423807 (10Krenair) [04:19:39] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [04:20:06] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1190079 (10Krenair) How are we going to handle sync of mediawiki-staging between tin and mira? Wouldn't we want any sort of git change on one to be reflected on the... [04:46:56] (03PS1) 10Springle: s7 pager slave partitioning [software] - 10https://gerrit.wikimedia.org/r/222538 [04:47:18] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [04:58:47] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) 3NEW a:3BBlack [04:59:21] bblack: Thanks, that was useful. 
[05:00:54] 6operations, 10Traffic, 5Patch-For-Review: Sort out DHE for Forward Secrecy w/ older clients - https://phabricator.wikimedia.org/T104281#1423914 (10BBlack) [05:00:55] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1423915 (10BBlack) [05:00:57] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1423917 (10BBlack) [05:00:59] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1423916 (10BBlack) [05:01:01] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1423919 (10BBlack) [05:01:15] lol [05:01:26] wikibugs can't handle a handful of new blocked-by tasks I guess :) [05:03:49] hi YuviPanda [05:05:38] does wikibugs need manual restarts usually on flood? [05:06:34] it doesn't [05:06:50] but it will only rejoin channels when it has something to say [05:06:57] except -labs, which it always joins [05:16:49] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [06:00:45] _joe_: why are conftool-data and hieradata separate file hierarchies? they are both hierarchies of yaml files. why not treat them as a single body of metadata that can be operated on by multiple tools? [06:17:45] 6operations, 7Database: codfw frontends cannot connect to mysql at db2029 - https://phabricator.wikimedia.org/T104573#1423975 (10jcrespo) Network connectivity is ok (I cannot discard it being too slow or other problem)- I can curl to the mysql port and I can see the connections initiating on netstat. There was...
[06:19:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 3 06:19:02 UTC 2015 (duration 19m 1s) [06:19:08] Logged the message, Master [06:29:58] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [06:30:29] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail [06:31:18] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail [06:31:30] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures [06:31:32] (03PS1) 10Ori.livneh: Tessera on misc vcl: return (pass) early to allow additional HTTP method [puppet] - 10https://gerrit.wikimedia.org/r/222542 [06:31:39] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.506 second response time [06:31:59] PROBLEM - puppet last run on db1022 is CRITICAL Puppet has 1 failures [06:32:00] (03PS2) 10Ori.livneh: Tessera on misc vcl: return (pass) early to allow additional HTTP method [puppet] - 10https://gerrit.wikimedia.org/r/222542 [06:32:44] (03CR) 10Ori.livneh: [C: 04-1] "We probably don't want to do that because it'll also disable caching." 
[puppet] - 10https://gerrit.wikimedia.org/r/222542 (owner: 10Ori.livneh) [06:33:59] PROBLEM - puppet last run on cp2026 is CRITICAL Puppet has 1 failures [06:34:19] PROBLEM - puppet last run on labstore1003 is CRITICAL Puppet has 1 failures [06:34:19] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet has 1 failures [06:34:39] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 1 failures [06:34:39] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 1 failures [06:34:59] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures [06:37:29] PROBLEM - puppet last run on lvs1002 is CRITICAL Puppet has 1 failures [06:38:08] PROBLEM - puppet last run on lead is CRITICAL Puppet has 1 failures [06:38:08] PROBLEM - puppet last run on db1015 is CRITICAL Puppet has 1 failures [06:38:09] PROBLEM - puppet last run on elastic1008 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on db2064 is CRITICAL Puppet has 1 failures [06:38:28] PROBLEM - puppet last run on wtp2001 is CRITICAL Puppet has 1 failures [06:38:38] PROBLEM - puppet last run on es2001 is CRITICAL Puppet has 1 failures [06:38:39] PROBLEM - puppet last run on db2002 is CRITICAL Puppet has 1 failures [06:38:39] PROBLEM - puppet last run on lvs3001 is CRITICAL Puppet has 1 failures [06:38:49] PROBLEM - puppet last run on ms-fe1001 is CRITICAL Puppet has 1 failures [06:38:50] PROBLEM - puppet last run on db2045 is CRITICAL Puppet has 1 failures [06:38:59] PROBLEM - puppet last run on es2009 is CRITICAL Puppet has 1 failures [06:39:09] PROBLEM - puppet last run on elastic1021 is CRITICAL Puppet has 1 failures [06:39:09] PROBLEM - puppet last run on ruthenium is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on db1059 is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on elastic1012 is CRITICAL Puppet has 1 failures [06:39:20] PROBLEM - puppet 
last run on iron is CRITICAL Puppet has 1 failures [06:39:39] PROBLEM - puppet last run on ms-fe1004 is CRITICAL Puppet has 1 failures [06:39:39] PROBLEM - puppet last run on wtp2018 is CRITICAL Puppet has 1 failures [06:39:49] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures [06:39:58] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures [06:39:58] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [06:41:18] PROBLEM - puppet last run on mw2082 is CRITICAL Puppet has 1 failures [06:41:39] PROBLEM - puppet last run on mw1254 is CRITICAL Puppet has 2 failures [06:42:10] PROBLEM - puppet last run on mw2213 is CRITICAL Puppet has 1 failures [06:42:18] PROBLEM - puppet last run on mw2118 is CRITICAL Puppet has 1 failures [06:42:29] PROBLEM - puppet last run on mw1009 is CRITICAL Puppet has 1 failures [06:42:59] PROBLEM - puppet last run on mw1160 is CRITICAL Puppet has 1 failures [06:43:09] PROBLEM - puppet last run on mw2033 is CRITICAL Puppet has 1 failures [06:43:29] PROBLEM - puppet last run on mw2105 is CRITICAL Puppet has 1 failures [06:43:38] PROBLEM - puppet last run on mw1099 is CRITICAL Puppet has 1 failures [06:43:39] PROBLEM - puppet last run on mw1117 is CRITICAL Puppet has 1 failures [06:43:39] PROBLEM - puppet last run on mw1003 is CRITICAL Puppet has 1 failures [06:43:39] PROBLEM - puppet last run on mw1008 is CRITICAL Puppet has 1 failures [06:43:49] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures [06:43:59] PROBLEM - puppet last run on mw1176 is CRITICAL Puppet has 1 failures [06:44:08] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 1 failures [06:44:19] PROBLEM - puppet last run on mw1069 is CRITICAL Puppet has 1 failures [06:44:29] PROBLEM - puppet last run on mw1088 is CRITICAL Puppet has 1 failures [06:44:29] PROBLEM - puppet last run on mw1226 is CRITICAL Puppet has 1 failures [06:44:30] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 
failures [06:44:39] PROBLEM - puppet last run on mw1222 is CRITICAL Puppet has 1 failures [06:44:50] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures [06:44:58] PROBLEM - puppet last run on mw1189 is CRITICAL Puppet has 1 failures [06:44:58] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:44:59] PROBLEM - puppet last run on mw1150 is CRITICAL Puppet has 1 failures [06:44:59] RECOVERY - puppet last run on db1022 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:45:08] PROBLEM - puppet last run on mw1068 is CRITICAL Puppet has 1 failures [06:45:09] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures [06:45:09] RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:45:09] PROBLEM - puppet last run on mw2163 is CRITICAL Puppet has 1 failures [06:45:19] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures [06:45:19] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [06:45:19] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:45:29] RECOVERY - puppet last run on labstore1003 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:29] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:45:38] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures [06:45:49] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:45:49] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:45:58] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:45:58] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:45:59] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures [06:46:08] RECOVERY - puppet last 
run on es2001 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on db2002 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:10] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on db2045 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on es2009 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on elastic1021 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on db1059 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on elastic1012 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on lvs1002 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on iron is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:08] RECOVERY - puppet last run on ms-fe1004 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:09] RECOVERY - puppet last run on wtp2018 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:18] RECOVERY - puppet last run on holmium 
is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on lead is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:28] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on db1015 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:47:29] RECOVERY - puppet last run on elastic1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:38] RECOVERY - puppet last run on db2064 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:48] RECOVERY - puppet last run on wtp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:59] RECOVERY - puppet last run on lvs3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on mw1160 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:48:38] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:39] RECOVERY - puppet last run on mw2082 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:48:59] RECOVERY - puppet last run on mw2105 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:48:59] RECOVERY - puppet last run on mw1254 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:09] RECOVERY - puppet last run on mw1003 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:49:09] RECOVERY - puppet last run on mw1008 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:49:19] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, 
last run 37 seconds ago with 0 failures [06:49:29] RECOVERY - puppet last run on mw1176 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:49:38] RECOVERY - puppet last run on mw2213 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:49:39] RECOVERY - puppet last run on mw2118 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:39] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:49:39] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:49:50] RECOVERY - puppet last run on mw1009 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:49:58] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:49:59] RECOVERY - puppet last run on mw1226 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:00] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:50:09] RECOVERY - puppet last run on mw1222 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:20] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:50:28] RECOVERY - puppet last run on mw1189 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:28] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:50:29] RECOVERY - puppet last run on mw1150 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:29] RECOVERY - puppet last run on mw2033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:38] RECOVERY - puppet last run on mw1068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures 
[06:50:38] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:39] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:48] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:48] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:49] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:50:58] RECOVERY - puppet last run on mw1099 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:58] RECOVERY - puppet last run on mw1117 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:59] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:28] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:28] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:40] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:52:39] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:28] PROBLEM - puppet last run on labsdb1007 is CRITICAL Puppet has 1 failures [07:18:59] RECOVERY - puppet last run on labsdb1007 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:20:25] We had some issues on s7 replication on dbstore2X hosts due to an incorrect puppet config, I have disabled puppet there until we fix fully the issues and fix the puppet config [07:20:37] (03PS1) 10Muehlenhoff: Annotate some recently assigned CVE IDs for our Linux kernel package [debs/linux] - 
10https://gerrit.wikimedia.org/r/222548 [07:22:24] details are on T104471 and it did not affect users, but we were forced to repopulate partially our backup machines on codfw [07:35:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Annotate some recently assigned CVE IDs for our Linux kernel package [debs/linux] - 10https://gerrit.wikimedia.org/r/222548 (owner: 10Muehlenhoff) [07:47:09] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 766.141468458 [07:50:45] <_joe_> ori: just to avoid confusion, as hiera data is by its nature non-global [07:51:04] <_joe_> (I'm off, just passing by) [07:51:29] <_joe_> ori: but I guess we will and can integrate the two [07:51:57] <_joe_> (puppet will consume the same data as conftool for a number of purposes) [07:52:11] <_joe_> ok, I'm off, see everyone on monday [07:54:31] _joe_: have a good weekend [08:04:40] 6operations, 10MediaWiki-Sites, 10SEO, 5HTTPS-by-default, and 3 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1424083 (10Nemo_bis) >>! In T67402#1414489, @Nemo_bis wrote: > Does the patch fix http://de.wikipedia...
[08:30:29] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail [08:47:19] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:52:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 17 data above and 9 below the confidence bounds [09:36:24] !log restbase restarting cassandra on rb1002 [09:36:28] * mobrovac sighs [09:36:31] Logged the message, Master [09:46:29] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [09:51:50] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60560 bytes in 0.686 second response time [09:59:26] (03PS1) 10Muehlenhoff: add ferm rules for redis [puppet] - 10https://gerrit.wikimedia.org/r/222554 [10:00:58] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [10:04:50] (03PS1) 10Muehlenhoff: add ferm rules for memcached [puppet] - 10https://gerrit.wikimedia.org/r/222556 [10:11:10] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:18:36] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Services: Protect background jobs from unhandled exceptions - https://phabricator.wikimedia.org/T104581#1424234 (10mark) [10:26:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [10:36:32] 6operations, 10ops-codfw: Equip osm-cp200{1,2,3,4} with 2 1.2TB SSDs each - https://phabricator.wikimedia.org/T104610#1424322 (10mark) Two of these servers would actually be databases, not caches, so perhaps we should rename them? [10:44:29] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [10:44:45] mobrovac: that you? 
[10:44:58] nope, restarting [10:45:18] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [10:46:19] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [10:48:57] (03PS1) 10Jcrespo: Change replication filters on dbstore hosts to use one per line [puppet] - 10https://gerrit.wikimedia.org/r/222562 (https://phabricator.wikimedia.org/T104471) [10:50:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:50:44] !log started du of maps project on labstore2001 [10:50:46] mark: ^ [10:50:50] Logged the message, Master [10:51:05] :) [10:51:12] (03PS2) 10Jcrespo: Change replication filters on dbstore hosts to use one per line [puppet] - 10https://gerrit.wikimedia.org/r/222562 (https://phabricator.wikimedia.org/T104471) [10:51:37] mark: going to take a while, I think :) [10:51:44] for sure [10:52:14] I am going to +2 my own change because it is almost an unbreak now (data loss on backups) [10:52:49] rb1005 cass dying on us constantly [10:52:52] * mobrovac looking [10:53:08] (03CR) 10Jcrespo: [C: 032] Change replication filters on dbstore hosts to use one per line [puppet] - 10https://gerrit.wikimedia.org/r/222562 (https://phabricator.wikimedia.org/T104471) (owner: 10Jcrespo) [10:53:48] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [10:56:17] !log restbase disabling puppet on restbase1005 to tweak JVM params for cassandra [10:56:22] Logged the message, Master [10:57:09] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:20] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [10:59:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK 
Less than 1.00% above the threshold [250.0] [11:00:18] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.003 second response time on port 9042 [11:01:34] uf, ok, saved it [11:01:45] still, old gen space is too high [11:06:19] PROBLEM - puppet last run on mw1005 is CRITICAL Puppet has 1 failures [11:09:15] !log reimports finished on dbstore2* hosts and puppet reenabled after T104471 was fixed [11:09:21] Logged the message, Master [11:09:44] !log lvcreate -L 6TB -n tools-20150703 backup on labstore2001 [11:09:50] Logged the message, Master [11:10:07] mark: ^ created! [11:10:14] I guess this isn't mounted anywhere atm [11:10:18] yes, you can see it in 'lvs' though [11:10:23] indeed [11:10:34] so now you need to create an ext4 file system [11:10:53] for this purpose we'll just do it naively [11:11:24] which is 'mkfs -t ext4 DEVICEFILE' [11:11:24] mkfs? [11:11:38] so first verify the existence of the device file [11:12:08] root@labstore2001:~# ls -l /dev/mapper/backup-tools--20150703 [11:12:23] /dev/mapper/backup-tools--20150703 exists [11:12:24] yeah [11:12:25] obviously, be careful to get the right one :) [11:12:27] yes [11:12:28] heh [11:12:34] then do it [11:13:46] !log run mkfs -t ext4 /dev/mapper/backup-tools--20150703 on labstore2001 [11:13:52] Logged the message, Master [11:13:56] it's running now [11:16:24] should finish fairly quickly [11:16:59] what is it doing? [11:17:03] let's do this in a screen :) [11:17:54] mark: yeah, done [11:17:57] mark: it is in screen [11:18:03] ok [11:18:10] mark: there's a du on 0, and this on 1 [11:18:15] yep see it [11:18:16] ok [11:18:22] now we need to create a mount point, and mount it [11:18:29] I'm noting these down on http://etherpad.wikimedia.org/p/lvm-labstore-backups as well [11:18:31] for this we probably don't want to put it in /etc/fstab, as it's a pretty temporary filesystem [11:18:33] ok [11:18:35] right [11:18:54] so that's just a mkdir + a mount, right?
[11:18:59] yes [11:19:03] let's put it under /srv somewhere [11:19:07] let's see what's there... [11:19:24] but this will be a one-off manual backup anyway [11:19:27] this should get automated next week [11:19:41] so it doesn't matter much [11:20:13] heh, 'should' [11:20:35] backup-tools-20150703 or something [11:20:43] ok [11:20:52] so /srv/backup/tools-20150703? [11:20:56] no [11:21:04] (03PS1) 10Muehlenhoff: Limit LDAP access to internal [puppet] - 10https://gerrit.wikimedia.org/r/222567 (https://phabricator.wikimedia.org/T102481) [11:21:05] check 'mount [11:21:14] /srv/backup is an existing mount of another backup [11:21:18] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [11:21:20] we don't want it on there :) [11:21:21] ah, I see [11:21:22] ok [11:21:28] well we could, but unnecessarily complicated [11:21:32] right [11:21:41] so /srv/backup-tools- [11:21:45] yes [11:22:24] !log mkdir /srv/backup-tools-20150703 on labstore2001 [11:22:29] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [11:22:31] Logged the message, Master [11:22:39] mark: alright. now to do the actual mount. 
[11:22:49] yes [11:23:02] mount DEVICEFILE DIR really [11:23:18] RECOVERY - puppet last run on mw1005 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:23:18] no extra bits then [11:23:19] cool [11:23:31] in this case not really needed [11:23:42] for a more permanent fs that can be more important :) [11:24:08] !log ran mount /dev/mapper/backup-tools--20150703 /srv/backup-tools-20150703/ on labstore2001 [11:24:09] although this fs could become more permanent if we make snapshots of it and update it [11:24:11] but we'll see [11:24:14] Logged the message, Master [11:24:24] ok [11:24:26] mounted now [11:24:32] 6operations, 10ops-eqiad, 7Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1424407 (10jcrespo) 5Open>3Resolved Created again: ``` db1028:/home/jynus/events_again.log ``` There is definitely some issue there, but it is not affecting the normal performance, so I will set this t... [11:24:33] great [11:24:42] now we can start the rsync [11:24:45] no [11:24:46] sorry :) [11:24:51] snapshot? [11:24:52] we need to make a snapshot on the other side [11:24:54] yep [11:24:59] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [11:25:02] for that we first need to make space [11:25:05] and first remove the existing snapshot [11:25:28] ok. 
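The labstore2001 steps above (carve out a temporary backup LV, put ext4 on it, mount it without touching /etc/fstab) can be sketched as a script. VG/LV names are the ones from the log; the lvcreate/mkfs/mount lines are left commented because they require root and a real 'backup' volume group, so treat this as an illustrative sketch, not a drop-in tool.

```shell
#!/bin/sh
# Sketch of the labstore2001 sequence from the log: create a temporary
# backup LV, format it ext4, and mount it (deliberately not in fstab).
VG=backup
LV=tools-20150703
MNT=/srv/backup-tools-20150703

# lvcreate -L 6T -n "$LV" "$VG"          # needs root; verify with 'lvs'
# LVM doubles each '-' of the LV/VG names in the device-mapper path:
DEV="/dev/mapper/${VG}-$(echo "$LV" | sed 's/-/--/g')"
# ls -l "$DEV"                           # verify the device file first
# mkfs -t ext4 "$DEV"
# mkdir -p "$MNT" && mount "$DEV" "$MNT"
echo "$DEV"
```

The device path it computes matches the one verified in the log (`/dev/mapper/backup-tools--20150703`), which is why checking `ls -l` on it before mkfs is a cheap safety net.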
[11:25:38] * YuviPanda opens another completely different terminal app with a different color scheme [11:25:43] let's first unmount that one on 1002, it's on /mnt/backup/project/tools [11:26:05] it's not listed in fstab, you can likely just unmount it with 'umount DIR' [11:26:08] although it might be in use :) [11:26:10] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.019 second response time on port 9042 [11:26:21] nope [11:26:22] not in use [11:26:28] ok [11:26:29] !log umount /mnt/backup/project/tools/ on labstore1002 [11:26:35] Logged the message, Master [11:26:40] yeah [11:26:46] now we need to remove that logical volume, carefully [11:27:24] > 201506251717 labstore swi-a-s--- 1.94t tools 13.50 [11:27:26] that one? [11:27:43] yes [11:27:59] so that's LV name '201506251717' on VG labstore [11:28:07] and it's 2 TB in size, only using 13.5% [11:28:31] ok [11:28:32] yes [11:29:19] mark: lvremove? [11:38:35] (03CR) 10BBlack: [C: 031] Blacklist kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) (owner: 10Muehlenhoff) [11:38:51] mark: back? :) [11:38:54] yes [11:39:06] ok [11:39:07] where were we :) [11:40:34] mutante: about to remove the labstore [11:40:36] err [11:40:44] mark: about to remove a lv from labstore1002 [11:40:54] 201506251717 labstore swi-a-s--- 1.94t tools 13.50 [11:41:00] last thing I said was: 'lvremove?' [11:41:01] what command? 
[11:41:20] ok [11:41:42] so that says it's a snapshot LV with name '201506251717' in VG 'labstore', snapshot of the 'tools' LV [11:41:54] and it uses 13.5% of space out of its nearly 2 TB, containing old copied-on-write blocks [11:42:15] so it's actually using roughly 270 GB right now [11:42:23] yes, lvremove [11:42:29] check the manpage for the syntax :) [11:42:59] mark: yeah, doing [11:43:34] mark: so lvremove labstore/201506251717 [11:44:16] yes [11:44:21] go ahead [11:44:53] > Do you really want to remove active logical volume 201506251717? [y/n]: [11:45:01] mark: the 'active' means it's just mounted? [11:45:05] no, it's not mounted [11:45:13] you can say 'y' [11:45:23] done! [11:45:37] you can separately deactivate them [11:45:40] but this does it for you [11:45:50] right [11:45:51] ok now it's gone, and we should see free space in 'vgs' in labstore [11:46:06] now we can create a new snapshot of the 'tools' LV [11:46:10] let's make it a bit smaller this time [11:46:15] it's possible to make it larger if needed [11:46:25] 500-1000 GB or so? [11:46:35] ok! [11:46:44] figure out the syntax from the manpage, i'll verify :) [11:46:53] this is not a thin snapshot or anything, just regular [11:46:56] yeah, doing so. [11:47:35] 6operations, 7Database: review eqiad database server quantities / warranties / service(s) - https://phabricator.wikimedia.org/T103936#1424447 (10Christopher) [11:49:50] mark: lvcreate -L 640G -s tools -n tools-20150703 labstore? [11:50:37] yes [11:50:39] go ahead [11:50:53] !log running lvcreate -L 640G -s tools -n tools-20150703 labstore on labstore1002 [11:50:59] Logged the message, Master [11:51:51] > The origin name should include the volume group.
[11:51:57] right [11:52:01] should be labstore/tools then [11:52:14] so for the -s option [11:52:31] (03PS6) 10Paladox: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [11:53:08] hm [11:53:35] no [11:53:46] Physical Volume "labstore" not found in Volume Group "labstore" [11:54:15] mark: not sure what that means. [11:54:23] ok [11:54:26] so the syntax is actually [11:54:38] lvcreate -L 640G -s -n tools-20150703 labstore/tools [11:54:52] oh, hmm. [11:54:54] although [11:54:58] the manpage contradicts itself? [11:55:01] haha [11:55:10] wtf :P [11:55:55] but yeah, that syntax I just wrote should work [11:56:09] mark: so -s just triggers the snapshot behavior and it's based off the LV at the end rather than vg [11:56:11] ugh [11:56:13] ok [11:56:16] yes [11:56:18] not very intuitive [11:57:15] !log run lvcreate -L 640G -s -n tools-20150703 labstore/tools on labstore1002 [11:57:19] good [11:57:21] Logged the message, Master [11:57:31] mark: done [11:57:35] nice [11:57:39] now you can mount the new snapshot [11:57:41] read-only please [11:58:09] so you can mount with the "-o ro" option added [11:58:52] yeah [11:58:57] although I should note that mounting read-only does not strictly mean that nothing gets written to the block device [11:59:02] it might do a journal recovery or such [11:59:07] that's not really relevant/important here [11:59:10] but keep it in mind [11:59:24] there is a way to set a block device read only [11:59:32] blockdev --setro (iirc) [11:59:37] ok! [12:00:55] mark: uh, where do I mount it? /mnt? [12:01:29] we can create a new directory under /mnt/backup [12:01:50] in contrast to labstore2001 that's not a mounted fs (other than / ) [12:02:01] so mkdir /mnt/backup/tools-20150703 ? 
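The snapshot rotation worked out above can be condensed into a sketch: unmount and remove the stale snapshot, then take a fresh, smaller one. Names are from the log; the commands themselves stay commented (root + LVM required). Note the syntax that finally worked: `-s` is a bare flag and the origin goes last as `VG/LV`, not as an argument to `-s`.

```shell
#!/bin/sh
# Sketch of the labstore1002 snapshot rotation from the log.
VG=labstore
OLD=201506251717            # stale snapshot of the 'tools' LV
NEW=tools-20150703

# umount /mnt/backup/project/tools   # old snapshot must not be in use
# lvremove "$VG/$OLD"                # prompts: active LV, answer 'y'
# lvcreate -L 640G -s -n "$NEW" "$VG/tools"   # regular (non-thin) snapshot
echo "$VG/$NEW"
```

Because the snapshot is copy-on-write, 640G only has to hold blocks of the origin that change while the backup runs, which is why it can be far smaller than the 2 TB origin.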
[12:02:06] yeah [12:02:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [12:03:12] mark: mount -o ro /mnt/backup/ /dev/mapper/labstore-tools--20150703 [12:03:20] bah, lag. [12:03:44] that's not correct [12:03:46] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review: Protect background jobs from unhandled exceptions - https://phabricator.wikimedia.org/T104581#1424472 (10mobrovac) The PR for increasing robustness is [here](https://github.com/wikimedia/restbase-mod-table-cassandra/pull/117). [12:03:49] you have the wrong mountpoint [12:04:09] it needs to be a directory under /mnt/backup [12:04:10] whoops you're right [12:04:11] yeah [12:04:14] I created the directory [12:04:17] otherwise you'll overlay the current broken fs [12:04:27] (hide it, really) [12:04:40] mount -o ro /mnt/backup/tools-20150703/ /dev/mapper/labstore-tools--20150703 [12:05:18] mark: ^ [12:05:31] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review: Protect background jobs from unhandled exceptions - https://phabricator.wikimedia.org/T104581#1424474 (10mobrovac) Since it is friday and the background update jobs keep leaving RESTBase processes in a //comatose// state, I am g... [12:05:34] no [12:05:42] device file first, directory mountpoint after :) [12:05:45] bah [12:05:48] could have said that the first try heh [12:06:00] mark: ^^ the plan for restbase re-deployment, will go out in 15 mins or so [12:06:07] mobrovac: checking [12:06:25] kk [12:06:26] mount -o ro /dev/mapper/labstore-tools--20150703 /mnt/backup/tools-20150703/ [12:06:39] am going to run that once you ok, considering two messes :P [12:06:48] why isn't that stuff in gerrit [12:07:33] mark: asking me or YuviPanda ?
[12:07:38] mobrovac: you [12:07:39] mobrovac: you :) [12:07:39] :) [12:07:44] but anyway, not your fault right now [12:07:44] hehe [12:07:45] anyway [12:07:51] mobrovac: i guess, go ahead yeah :) [12:08:09] mark: because for tests we need a local cassandra instance, which on our infra is not currently possible [12:08:59] !log running mount -o ro /dev/mapper/labstore-tools--20150703 /mnt/backup/tools-20150703/ now [12:09:06] Logged the message, Master [12:09:09] YuviPanda: ok :) [12:09:29] mark: yup, mounted [12:09:32] yeah [12:09:36] see the snapshot is already filling up [12:09:46] but even creating a snapshot before on the thin lv on raid6 would cause downtime [12:09:51] now we did it without a hiccup [12:10:11] so is this an async process? [12:10:14] also wheeeee :) [12:10:19] what do you mean? [12:10:27] it's copy on write [12:10:27] what does 'filling up' mean? [12:10:29] aaah [12:10:30] I see [12:10:30] ok [12:10:42] so once you make a snapshot, once something writes to the tools filesystem, first the original block gets copied to the snapshot volume [12:10:45] so it stays the same [12:10:47] right [12:10:52] eventually it'll run out of space [12:10:53] so that's why it can be so much smaller [12:10:56] but by then we should have the backup finished and remove it [12:10:57] yes [12:11:04] right but not until 640G of things change.
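The read-only mount that was just run can be sketched as follows. Argument order is the thing that tripped up the first attempt: device file first, mount point second. As mark notes later, `mount -o ro` alone may still replay the journal on the block device, so `blockdev --setro` (commented below, needs root) is the stricter option when the device must stay byte-identical.

```shell
#!/bin/sh
# Sketch of mounting the snapshot read-only on labstore1002.
DEV=/dev/mapper/labstore-tools--20150703
MNT=/mnt/backup/tools-20150703

# mkdir -p "$MNT"                 # must be a new dir; mounting over
#                                 # /mnt/backup would hide the existing fs
# blockdev --setro "$DEV"         # optional: enforce RO at the block layer
# mount -o ro "$DEV" "$MNT"       # device first, mountpoint second
echo "mount -o ro $DEV $MNT"
```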
[12:11:10] thin LVs are a bit more efficient with sharing [12:11:12] and now backups don't take a month :) [12:11:14] but meh :) [12:11:15] yes [12:11:20] exactly [12:11:22] so we could start an rsync [12:11:25] or we could try something faster [12:11:36] running a 'tar | ssh | untar' [12:11:40] with a big buffer in between [12:11:46] rsync has latency issues on long links [12:11:53] tar of course doesn't, it's unidirectional [12:11:58] right [12:12:01] so the rsync we used is still in the screen [12:12:04] we could try this [12:12:09] but only because the destination is still empty [12:12:12] doesn't work for updating, of course [12:12:14] right. [12:12:17] do you want to figure out that command? [12:12:25] for tar / ssh /untar? [12:12:25] yeah [12:12:28] yep [12:12:30] cool [12:12:31] I'll check with you before doing [12:12:34] yep [12:15:41] so we can speed test that a little [12:16:02] and you could put a 'pv' in the pipe, on the other side, with a large buffer [12:16:08] (or something) [12:16:22] that would also tell you the throughput :) [12:16:53] tar cpf - /mnt/backup/tools-20150703/ | pv | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "tar xpf - -C /srv/backup-tools-20150703' [12:16:55] mark: ^ [12:16:59] what do you mean by 'large buffer' [12:17:10] i'd put the pv on the other side, before the tar [12:17:17] and give it an option to buffer [12:17:21] let's see [12:17:54] pv's -B option [12:18:01] give it 10 MB or more [12:18:05] we can tweak that [12:18:05] ah, I see [12:18:06] and also add [12:18:25] -p -r -e [12:18:29] so we see what's happening [12:19:30] for tar we probably want --xattrs [12:19:37] tar cpf - /mnt/backup/tools-20150703/ | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -p -r -e -t -B 16M | tar xpf - -C /srv/backup-tools-20150703' [12:19:58] --xattrs on tar? [12:20:01] * YuviPanda reads man there [12:20:23] on which side of the tar? :) both? 
[12:20:45] i think that's necessary yes [12:20:52] other than it looks good [12:20:54] so you can test run this a little [12:21:01] just, before you restart you need to rm -rf the destination [12:21:04] (molly-guard!) [12:21:14] let's see what kind of transfer rate pv reports [12:21:34] tar --xattrs cpf - /mnt/backup/tools-20150703/ | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -p -r -e -t -B 16M | tar --xattrs xpf - -C /srv/backup-tools-20150703' [12:21:44] hmm, I'm not sure if I'm doing the arguments right [12:21:55] so [12:22:05] if you do the creating tar with /mnt/backup/tools-20150703/ [12:22:09] that dir will be prepended in the archive [12:22:16] so you probably want to use the current dir [12:22:27] oh, I didn't know that. [12:22:47] cd /mnt/backup/tools-20150703/ ; tar --xattrs cpf - . | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -p -r -e -t -B 16M | tar --xattrs xpf - -C /srv/backup-tools-20150703' [12:23:20] and there's also --acls btw [12:23:22] no idea if they are used [12:23:28] might as well add it now [12:23:33] although it could slow things down I guess [12:23:38] yeah [12:23:56] anyway, you can try this yes [12:24:10] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs cpf - . | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -p -r -e -t -B 16M | tar --acls --xattrs xpf - -C /srv/backup-tools-20150703' [12:24:13] ok! [12:24:33] if we saturate the link we can even have pv rate limit a bit [12:24:36] but it's probably not necessary [12:24:47] although I guess it could be wise for the weekend [12:25:15] pv's -L option, rate limit at 80 MB/s or so [12:25:26] should I do that now? or first see what happens?
[12:25:27] and we can ionice separately if necessary [12:25:37] let's do it now, then we don't -need- to restart [12:25:43] ionice we can do after [12:25:46] ok [12:25:50] don't think it's that necessary now [12:26:16] oh add -b to pv as well [12:26:19] total bytes transferred [12:26:20] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs cpf - . | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -L 80M -p -r -e -t -B 16M | tar --acls --xattrs xpf - -C /srv/backup-tools-20150703' [12:26:28] ok! [12:26:29] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs cpf - . | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -L 80M -p -r -e -b -t -B 16M | tar --acls --xattrs xpf - -C /srv/backup-tools-20150703' [12:26:30] pv doesn't know the total size, but it gives us an indication other than 'df' [12:26:33] git-via-irc [12:26:36] heh [12:26:43] how did the last one do? [12:26:49] err, [12:26:54] the last one looks ok? [12:27:05] yes [12:27:28] can I see your screen? [12:27:43] mismatched quotes [12:27:50] heh indeed [12:27:51] mark: yes. I'm reusing the same screen. [12:28:12] i don't see it [12:28:20] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [12:28:26] mark: oh, haven't run it yet :) [12:28:26] ah [12:28:33] !log cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs cpf - . | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -L 80M -p -r -e -b -t -B 16M | tar --acls --xattrs xpf - -C /srv/backup-tools-20150703" on labstore1002 [12:28:39] Logged the message, Master [12:29:03] > tar: You must specify one of the '-Acdtrux', '--delete' or '--test-label' options [12:29:03] Try 'tar --help' or 'tar --usage' for more information.
[12:29:08] haha [12:29:15] tar bitching twice [12:29:22] it did [12:29:32] it wants - before cpf [12:29:32] etc [12:29:35] I wonder if it needed a - before cpf [12:29:35] yeah [12:29:38] yep [12:29:47] !log cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -L 80M -p -r -e -b -t -B 16M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" on labstore1002 [12:29:53] Logged the message, Master [12:30:03] mark: ok. I see no pv output tho? [12:30:04] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1424509 (10MoritzMuehlenhoff) 3NEW [12:30:14] indeed [12:30:18] perhaps because ssh doesn't have a terminal [12:30:21] but give it a minute [12:30:25] ok! [12:30:40] what might be easiest [12:30:42] is to have two pv's [12:30:51] put one with all the options on the local side [12:30:55] and another just buffering on the other side [12:30:56] can't hurt [12:30:58] yeah [12:31:02] yeah let's do that [12:31:05] so I'll cancel this, clean out the target [12:31:06] and see if it did anything on the dest [12:31:08] yep [12:31:09] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [12:31:12] mark: it did, I see some files [12:31:24] great [12:31:26] !log interrupt tar | ssh | tar on labstore1002 [12:31:30] i'm gonna remount the broken fs read only [12:31:32] on 2001 [12:31:32] Logged the message, Master [12:31:52] !log labstore2001: mount /srv/backup -o remount,ro [12:31:59] Logged the message, Master [12:32:12] mark: ok [12:32:19] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -p -r -e -b -t| ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -B 16M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" [12:32:29] mark: should we buffer on the src as well?
[12:32:40] let's do so, can't hurt [12:32:56] the more buffering here, the more consistent transfer rate [12:33:00] buffer bloat does not hurt here ;) [12:33:11] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -p -r -e -b -t -B 16M | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -B 16M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" [12:33:22] yes, go ahead [12:33:32] after cleaning the dest [12:33:54] !log rm -rf /srv/backup-tools-20150703/* on labstore2001 [12:33:56] yeah [12:34:00] Logged the message, Master [12:34:27] mark: yup, I see pv [12:34:43] nice [12:34:45] mark: getting about 44 MB/s [12:34:49] not too bad [12:34:53] spoke too soon [12:35:01] now 14-20? [12:35:03] with small files that's a challenge [12:35:12] and at least with big files it's limited [12:35:13] it's a bit fluctuatey [12:35:19] yeah that's expected [12:35:23] this isn't at all sequential streaming [12:35:27] !log cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -p -r -e -b -t -B 16M | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -B 16M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" in screen on labstore1002 [12:35:29] right [12:35:29] and it's part of why buffering helps [12:35:34] Logged the message, Master [12:35:37] pv can even show the buffer utilisation [12:35:37] because it's not blockwise [12:35:40] yup [12:35:49] pv seems pretty awesome. [12:36:11] -T shows buffer percentage [12:36:17] mark: want me to restart with that on? [12:36:20] if you want [12:36:29] add whatever you like, try with larger buffer sizes [12:36:34] do it now now we're still at the start :) [12:36:35] yeah, ok! [12:36:44] ssh buffering and optimization can help too [12:36:57] !log interrupted tar | ssh | tar on labstore1002 and cleaned out dest on labstore2001 [12:37:04] Logged the message, Master [12:37:44] would it hurt if you added -v at the last tar? 
[12:37:48] so we see what files it's writing [12:37:51] I think pv has no issues with that [12:37:58] or the first tar, whatever [12:38:04] might be better [12:38:08] doesn't need to transfer that link [12:38:38] mark: yeah, assuming that doesn't lose us the pv display... [12:38:39] (03PS1) 10BBlack: ciperhsuites: add 'mid', changes to strong [puppet] - 10https://gerrit.wikimedia.org/r/222575 [12:38:43] try it :) [12:38:51] yeah [12:38:59] I'm editing the commandline in vim and copy pasting [12:39:09] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpfv - . | pv -L 80M -p -r -e -b -t -B 32M -T | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -B 32M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" [12:39:14] increased buffer to 32M [12:39:14] no [12:39:20] now you have -f -v - [12:39:24] in the first tar [12:39:29] oh [12:39:30] right [12:39:30] use -cpvf - [12:39:42] I forgot that -f - was for stdin [12:39:46] :) [12:40:15] mark: it's a combination of pv and -v alternating randomly [12:40:21] I think I liked it without the -v [12:41:18] ok [12:41:25] mark: although the -v makes me weep - someone has extracted a full page dump onto /data/project on tools instead of just reading it as a gzipped file from /public/dumps [12:41:32] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1424519 (10BBlack) [12:41:37] heh [12:41:49] right now you don't see pv updates during small files [12:41:55] mark: so pv output is visible only when tar is doing a large file.
[12:41:55] yeah [12:41:59] yeah [12:42:00] alright [12:42:03] there's always lsof :) [12:42:08] ditch the -v [12:42:10] yeah [12:42:24] can strace worst case too :) [12:42:41] yep [12:42:52] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [12:42:52] !log interrupt tar | ssh | tar on labstore1002, clean out destination on labstore2001 [12:42:58] Logged the message, Master [12:43:06] !log restbase deploying restbase/deploy @ 1a826a5 [12:43:07] !log cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -p -r -e -b -t -B 32M -T | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -B 32M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" on screen on labstore1002 [12:43:11] Logged the message, Master [12:43:17] Logged the message, Master [12:43:34] mark: hmm, I don't see where it's printing buffer utilization [12:43:47] the --- perhaps? [12:44:07] wait until it hits small files [12:44:17] is that just 'dunnolol?!' 
or 'full' [12:44:18] heh [12:44:18] true [12:44:33] at this point the last time it was slower [12:44:34] during large files it should be able to saturate the 80 MB/s really [12:44:41] while we have a consistent 40MB/s now [12:44:50] well, it has it in cache now too [12:45:01] ah true [12:45:09] there we go :) [12:45:11] yup [12:45:11] :P [12:45:19] PROBLEM - puppet last run on mw1138 is CRITICAL Puppet has 1 failures [12:45:40] so if that --- is buffer utilization I guess we can increase them even more although I'm not sure that's going to give us anything [12:48:08] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:54:53] YuviPanda: on deployment-prep a puppet run fails with "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class puppetmaster::autosigner for deployment-salt.deployment-prep.eqiad.wmflabs ", that seems related to the changes by you and Andrew? [12:56:04] moritzm: oh, not sure. [12:56:15] might be related to the puppet cert change by gage? [12:56:38] mark: fstat is less than 0.5% of syscall time [12:56:41] 91% spent in read [12:56:47] small reads? [12:56:49] 4% in getdents [12:58:04] (03PS2) 10BBlack: ciperhsuites: add 'mid', changes to strong [puppet] - 10https://gerrit.wikimedia.org/r/222575 [12:58:18] mark: yeah. [12:58:30] mark: lots more reads of indiivdual files [12:59:03] mark: lots of getxattr calls that fail as well because files don't have any xattrs :) [12:59:13] that might actually slow it down a lot [12:59:20] i have no idea if that's even necessary [12:59:26] acls, same [12:59:34] (03PS1) 10Glaisher: Enable WikiLove extension at Spanish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222581 (https://phabricator.wikimedia.org/T103424) [12:59:56] right [13:00:04] %T Percentage of the transfer buffer in use. Equivalent to -T. 
Shows "{----}" if the transfer is being done with splice(2), since splicing to or from pipes does not use [13:00:04] the buffer. [13:00:11] heh [13:00:14] so the buffer is inactive. [13:00:39] splice is a syscall that avoids user space for copying between two fds [13:00:54] oh, I see. [13:01:13] that explains the non-change on increasing buffer size [13:01:18] it can be avoided with -C [13:01:25] :D [13:01:28] dunno if the buffer is really needed [13:01:34] would be nice if you didn't have to restart every time eh [13:01:35] well we can test! [13:01:39] yeah sure [13:01:48] as long as we get it running permanently in the next hour or so ;) [13:01:50] splice is pretty awesomely fast if it works for whatever you're doing [13:01:52] yeah [13:02:10] !log interrupt tar | ssh | tar on labstore1002 and killed dest on labstore2001 [13:02:17] Logged the message, Master [13:02:19] RECOVERY - puppet last run on mw1138 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:02:52] !log run cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -C -p -r -e -b -t -B 32M -T | ssh -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -C -B 32M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" on labstore1002 [13:02:58] Logged the message, Master [13:03:16] mark: ok, I see buffer utilization now [13:03:36] no increase in throughput however. [13:03:40] indeed [13:04:21] it maxes out at pretty exactly 45 MB/s eh [13:04:25] yeah [13:04:30] it's also at 50% buffer utilization. [13:04:36] which is strange [13:05:10] (03CR) 10BBlack: [C: 032] "Checked in catalog compiler, no-op for 'compat' hosts (which is everything, currently)." 
[puppet] - 10https://gerrit.wikimedia.org/r/222575 (owner: 10BBlack) [13:05:10] let's see what happens when this hits non-large non-cached files [13:08:03] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review: Protect background jobs from unhandled exceptions - https://phabricator.wikimedia.org/T104581#1424568 (10mobrovac) This has been deployed @ 12:43 UTC, the system seems stable and functional. So far so good - no //unhandled excep... [13:09:16] (03PS1) 10Glaisher: Enable ShortUrl extension at orwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222584 (https://phabricator.wikimedia.org/T103644) [13:09:20] PROBLEM - puppet last run on mw2104 is CRITICAL puppet fail [13:11:13] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1424579 (10jcrespo) Question: * has iptables impact been measured? This may sound stupid, and I have never heard of such a thing (after all, it is kernel code), but please note that some servers have uncommon pat... [13:11:17] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1424580 (10BBlack) [13:11:38] mark: I don't see it going past 45MB/s nor using more than 50% buffer. but, we're doing about 2G / minute now. [13:12:03] it's probably tar limiting it [13:12:09] could test that with pv as well, piping to /dev/null [13:12:22] true! [13:12:33] let me do that [13:12:39] this isn't going to be truly optimal however, tar itself works very sequential [13:12:43] moritzm, T104699#1424579 [13:12:54] so if it's i/o bound, that won't help [13:13:05] bblack: how's that parallel rsync coming ;) [13:13:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [13:13:15] :P [13:13:19] !log run cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . 
| pv -L 80M -C -p -r -e -b -t -B 32M -T > /dev/null on labstore1002 [13:13:26] Logged the message, Master [13:13:30] mark: haha, equivalent performance. [13:13:34] ok [13:13:35] no [13:13:38] it's hitting 80MB/s [13:13:43] hah [13:13:46] then it's ssh [13:13:47] and 100% buffer [13:13:49] right [13:13:51] if you have a list of subdirs, you could fire off several rsyncs, one per subdir or set of subdirs :) [13:13:58] we're actually using tar right now [13:14:01] to avoid some latency [13:14:03] but now ssh is limiting [13:14:13] ssh does have some flags/settings to work around BDP issues I think, but I have to remember what they are [13:15:24] !log /dev/null filled up on labstore1002, aborting pipe of valuable user data into it. [13:15:30] Logged the message, Master [13:15:50] wat? [13:16:02] hmm I think I was thinking of hpn-ssh, and it appears openssh 4.7+ got some of that backported anyways [13:17:07] !log clean out tar | ssh | tar target on labstore2001 [13:17:14] Logged the message, Master [13:18:26] mark: even if it's doing on avg of 20 MB/s it'll be done in 3 days. [13:18:36] sure [13:18:48] I'm kind of ok with that :) [13:18:53] it's "adequate" now, it just annoys me a bit ;) [13:18:58] agreed, yeah [13:19:01] actuallyy [13:19:05] I can strace ssh [13:19:07] to see what it's doing [13:20:21] 23% in select, 40% in write 20 in read [13:20:35] but I guess some of it is just the encryption overhead [13:20:47] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1424590 (10Tobi_WMDE_SW) @samuwmde and @johl are going to create Sprint projects for the WMDE communication team. Can you please give project-... 
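For scale, the "done in 3 days" figure above holds up as a back-of-the-envelope calculation (the 5 TiB dataset size here is an assumption for illustration; the log doesn't state how big the tools backup is):

```shell
# Hours needed to move a dataset of a given size (TiB) at a given rate (MiB/s).
xfer_hours() {
  awk -v tib="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tib * 1024 * 1024 / rate / 3600 }'
}

xfer_hours 5 20   # 72.8 hours, i.e. roughly 3 days at a pessimistic 20 MB/s
xfer_hours 5 45   # 32.4 hours at the ~45 MB/s actually observed
```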
[13:21:04] 6operations, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1424592 (10jcrespo) [13:21:42] you could do "-c none" and kill encryption :) [13:21:58] don't we disable that? :) [13:22:12] pretty sure we do [13:22:17] oh I think ssh in general disables that these days [13:22:19] we can select another though [13:22:24] it's not even listed in the manpage heh [13:22:43] and yeah, we did limit the options in all our sshd_config too [13:23:05] Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr [13:23:14] they're both jessie [13:23:23] suggestions on which ones to try, maybe? [13:23:39] a looong time ago it was arcfour for speedy transfers, but that's not even enabled anymore I think ;) [13:23:40] honestly the chacha20 option at the front of the list there is probably the fastest [13:23:55] I assume front of list means it's the default negotiation between jessie<->jessie for us now [13:23:57] I guess that's probably what it's using by default... [13:24:07] try aes256 then [13:24:28] aes128-gcm you mean? [13:24:44] > debug1: kex: server->client aes128-ctr umac-128-etm@openssh.com none [13:24:49] > debug1: kex: client->server aes128-ctr umac-128-etm@openssh.com none [13:24:58] ok [13:25:00] if I'm reading those right it's using aes128-ctr? [13:25:03] (03PS1) 10Jcrespo: Repool db2047 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222591 [13:25:20] try asking for chacha20-poly1305@openssh.com ? 
[13:25:28] (with -c I mean) [13:25:30] (03CR) 10Jcrespo: [C: 032] Repool db2047 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222591 (owner: 10Jcrespo) [13:25:59] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:26:02] debug1: kex: client->server chacha20-poly1305@openssh.com none [13:26:05] debug1: kex: server->client chacha20-poly1305@openssh.com none [13:26:07] alright [13:26:14] mark: let me try with this now :) [13:26:17] I bet it's faster [13:26:49] if it's the bottleneck [13:26:51] cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -C -p -r -e -b -t -B 32M -T | ssh -c chacha20-poly1305@openssh.com -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -C -B 32M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" [13:26:52] indeed [13:26:54] we'll find out [13:27:15] !log interrupting tar |ssh | tar script and cleaning out destination again [13:27:21] Logged the message, Master [13:27:28] !log run cd /mnt/backup/tools-20150703/ ; tar --acls --xattrs -cpf - . | pv -L 80M -C -p -r -e -b -t -B 32M -T | ssh -c chacha20-poly1305@openssh.com -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -C -B 32M | tar --acls --xattrs -xpf - -C /srv/backup-tools-20150703" on labstore1002 [13:27:34] Logged the message, Master [13:27:40] !log jynus Synchronized wmf-config/db-codfw.php: repool db2047 after maintenance (duration: 00m 22s) [13:27:46] Logged the message, Master [13:27:58] also if chachapoly doesn't work out, aes128-gcm should in theory be faster than aes128-ctr [13:27:59] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1424614 (10MoritzMuehlenhoff) > * has iptables impact been measured? This may sound stupid, and I have never heard of such a thing (after all, it is kernel code), but please note that some servers have uncommon pa... 
[13:28:11] it seems exactly the same [13:28:15] i think the bottleneck is elsewhere [13:28:18] yeah [13:28:30] we could win a little bit by gzip, but meh [13:30:37] mark: yeah. I think I'm going to write up what we did in http://etherpad.wikimedia.org/p/lvm-labstore-backups later, and get some food in the meantime. [13:30:42] ok [13:30:51] that was fun, although we are ending with the same throughput we started with :( [13:31:42] we can optimize that some other time [13:33:17] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1424619 (10BBlack) >>! In T104699#1424579, @jcrespo wrote: > Question: > > * has iptables impact been measured? I don't know that it will matter for **most** hosts, but for our high-volume traffic hosts (e.g. L... [13:36:22] (03PS1) 10Hashar: Remove Gerrit replication to lanthanum.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/222595 (https://phabricator.wikimedia.org/T86658) [13:37:28] (03CR) 10Andrew Bogott: [C: 031] "This looks fine. We should definitely turn on ferm on those hosts, but I'm off today and I'd like to be around when it happens :)" [puppet] - 10https://gerrit.wikimedia.org/r/222567 (https://phabricator.wikimedia.org/T102481) (owner: 10Muehlenhoff) [13:37:48] moritzm: I can look into your deployment-prep problem later [13:38:27] can anyone please merge in a Gerrit configuration change? https://gerrit.wikimedia.org/r/#/c/222595/ , that disable the replication of all git repo to lanthanum.eqiad.wmnet . The machine no more needs it :-} [13:38:53] moritzm: what is going on with deployment-prep? [13:39:53] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1424645 (10Qgil) Sorry, both have been added now. Please note that the guidelines for creating projects have been modified recently. Please fa... 
[13:40:08] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1424647 (10MoritzMuehlenhoff) >>! In T104699#1424619, @BBlack wrote: >>>! In T104699#1424579, @jcrespo wrote: >> Question: >> >> * has iptables impact been measured? > > I don't know that it will matter for **m... [13:40:58] YuviPanda: I'm looking into it myself ATM [13:42:59] hashar: a puppet run fails with "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class puppetmaster::autosigner for deployment-salt.deployment-prep.eqiad.wmflabs " I suppose is related to the merge ofhttps://gerrit.wikimedia.org/r/#/c/218380/ 16 hours ago [13:46:26] moritzm: you can try restarting the puppetmaster on it [13:46:31] sometime it misses classes for random reasons :-( [13:48:44] hashar: that would be puppetqd? [13:49:14] on deployment-salt.deployment-prep.eqiad.wmflab [13:49:19] should be `puppet master` [13:49:30] ok [13:51:24] apparently the last catalog was version 1435846839 [13:51:49] or Thu Jul 02 14:20:39 2015 UTC [13:53:47] (03CR) 10Paladox: [C: 031] add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [13:53:55] (03PS7) 10Paladox: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [13:56:08] moritzm: I killed the puppetmaster, now that fails with a different error. Cant find role::beta::puppetmaster :-D [13:56:33] ah no [13:56:42] (03PS1) 10Jcrespo: repool db1022 with throttled traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222598 [13:56:45] Could not find class puppetmaster::autosigner again [13:57:14] hashar: if you have full-modern puppet, it will autosign by default. [13:57:19] And I renamed the autosigner class. 
[13:57:29] the class is applied on the node, but got removed hehe [13:57:40] andrewbogott: good morning :} [13:57:46] hm, yeah — I guess I should either fix this is in ldap or send an email. [13:57:50] * andrewbogott looks at ldap [13:58:32] wow, it’s used in 8 places, that’s 2x as much as I expected. [13:58:36] we were probably the sole users of that trick [13:58:40] oh [13:58:48] stay tuned, I’m going to fix it labs-wide [13:59:09] (03PS2) 10Jcrespo: repool db1022 with throttled traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222598 [14:00:18] (03CR) 10Jcrespo: [C: 032] repool db1022 with throttled traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222598 (owner: 10Jcrespo) [14:01:00] andrewbogott: I removed the class from deployment-prep and integration labs projects [14:01:22] hashar: you should use ‘puppetmaster::certcleaner’ instead. [14:01:28] I just changed it for every project that still had it turned on [14:01:48] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 118 failures [14:02:14] \O/ [14:02:38] look ok now? [14:04:00] yeah [14:04:01] hashar: and, if you don’t mind, can you verify that it still autosigns? This is a use case that I kind of forgot about :( [14:04:50] both still have the puppetsigner.py cron entry [14:05:49] huh, really? I thought I cleaned that up. [14:05:51] * andrewbogott looks [14:06:00] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1022 (low traffic) (duration: 00m 54s) [14:06:03] maybe because there is no puppet stuff to remove it [14:06:06] Logged the message, Master [14:06:14] hm, nope, I will fix [14:06:49] PROBLEM - puppet last run on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:56] hashar: oh, it’s because you didn’t turn on puppetmaster::certcleaner [14:07:07] that’ll clean up the old cron and also do some potentially useful things. 
[14:07:17] oh [14:07:48] doing it now [14:08:30] RECOVERY - puppet last run on mw1040 is OK Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:11:16] Notice: /Stage[main]/Puppetmaster::Certcleaner/Cron[puppet_salt_certificate_cleaner]/ensure: created [14:11:19] andrewbogott: that fixed it :-} [14:11:31] morebots: deployment-prep puppetmaster is all fixed up now! [14:11:31] I am a logbot running on tools-exec-1217. [14:11:31] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:11:31] To log a message, type !log . [14:11:31] great — sorry I forgot you yesterday :) [14:11:57] so it is puppet and salt signing the certs automatically for us right ? [14:12:13] yep [14:12:19] \O/ [14:12:21] or at least that’s the idea. It’s a new feature in both. [14:12:36] It should speed up new instance building slightly. [14:13:32] great! [14:13:39] the labs infra keep improving [14:15:38] …slowly… [14:15:58] the killer feature was getting rid of the ec2id from hostname [14:16:07] that largely improved everyone workflow :} [14:16:32] Yeah, lots of things are more sane now thanks to that. And it hasn’t had /that/ many unintended consequences. [14:17:14] OK, I’m off today but will check back in a couple of hours in case YuviPanda is blocked by anything. [14:17:34] andrewbogott_afk: thanks to have showed up! have a good day :) [14:18:19] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:34] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#1424790 (10hashar) [14:18:37] andrewbogott_afk: thanks! [14:19:37] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#1424792 (10hashar) Hey @RobH , we will soon no more have any use for `lanthanum.eqiad.wmnet`. 
What is the process on #operations side to have the machine put ba... [14:27:31] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1424800 (10Eevans) >>! In T104208#1423787, @Eevans wrote: > This is currently running on restbase100{1,2,6}.eqiad in infinite loops, on 60 second interva... [14:34:42] (03PS3) 10Muehlenhoff: Blacklist kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) [14:36:26] (03PS4) 10Muehlenhoff: Blacklist kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) [14:37:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Blacklist kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) (owner: 10Muehlenhoff) [14:44:13] what's the current status re 16.45 < icinga-wm> PROBLEM - puppet last run on ocg1003 is CRITICAL Puppet has 2 failures [14:45:30] Nemo_bis, icinga shows puppet last run on that host is OK now... [14:52:57] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1424892 (10Eevans) @fgiunchedi with [[https://github.com/eevans/cassandra-metrics-collector/commit/8c75cd0dc69771f9f3cc50fc2f3e863eb2eab16|this changeset... [14:55:56] Any ops around? [14:56:20] Something seems broken with OCG: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=PDF%20servers%20eqiad&m=cpu_report&r=day&s=descending&hc=4&mc=2&st=1435935300&g=network_report&z=large [14:56:28] Users in -tech are reporting issues with it too [14:58:11] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#1424908 (10RobH) When you finish its use completely, go ahead and assign me this task and I'll take over the reclaim. 
(We'll wipe the system of data so ensure y... [15:04:30] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:05:21] 6operations, 5Patch-For-Review: Blacklist kernel modules - https://phabricator.wikimedia.org/T102600#1424922 (10MoritzMuehlenhoff) Now that the puppet change is merged, I've removed the old /etc/modprobe.d/blacklist-overlayfs.conf hotfix. [15:16:03] "Bundling process died with non zero code: 1" [15:17:11] All health checks are ok, and the processes are running [15:19:32] 6operations, 10OCG-General-or-Unknown, 6Services: Issues with OCG service in production? - https://phabricator.wikimedia.org/T104708#1424940 (10Krenair) 3NEW [15:21:17] 6operations, 10OCG-General-or-Unknown, 6Services: Issues with OCG service in production? - https://phabricator.wikimedia.org/T104708#1424950 (10jcrespo) ``` { "_index": "logstash-2015.07.03", "_type": "mw-ocg-service", "_id": "j_8xKLSSQYuVvrfvHiKqFA", "_score": null, "_source": { "host": "ocg10... [15:21:29] mark: so I'm going to do the others now, and will log as I go [15:21:37] I do not know enough about that service to try to fix it [15:21:42] mark: and check with you before anything destructive [15:24:33] mark: I"ll make 'others' 5T. 3.3T is used atm. [15:25:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: MediaWiki deployment shell access request - https://phabricator.wikimedia.org/T104546#1424952 (10RobH) a:3RobH I'll claim this and fold it into my merge of his sudo rights next week. (The 3 day wait hasn't passed for this either, so meeting or not it m... [15:25:35] !log begin process of backing up others (all labs projects except tools) on to labstore2001 from labstore1002 [15:25:42] Logged the message, Master [15:25:53] hah [15:26:13] mark: hmmm, we have *just* enough space for others. 
[15:26:20] > Free PE / Size 1003416 / 3.83 TiB [15:26:24] on backup in labstore2001 [15:26:33] while /dev/mapper/labstore-others 11T 3.3T 7.6T 31% /srv/others [15:26:43] so that's 500G I guess. [15:26:50] mark: should I still do it? [15:28:10] YuviPanda: yes [15:28:15] ok then [15:28:20] I'll allocate 3.5T [15:28:42] !log run lvcreate -L 3.5T -n others-20150703 backup on labstore2001 [15:28:49] Logged the message, Master [15:29:02] mobrovac: are those OCG errors possibly related to your restbase deployment? [15:29:23] "error fetching restbase1 result" [15:29:25] !log running mkfs -t ext4 /dev/mapper/backup-others--20150703 on labstore2001 [15:29:30] Logged the message, Master [15:29:42] lemme take a look at the logs [15:30:07] started around 13:54 UTC I think [15:31:44] !log run lvcreate -L 640G -s -n others-20150703 labstore/others on labstore1002 [15:31:50] Logged the message, Master [15:32:21] mark: those errors have the wrong url [15:32:38] mark: http://10.2.2.17:72310/ ???? [15:32:45] no idea [15:32:52] !log run mkdir /mnt/backup/others-20150703 on labstore1002 [15:32:56] also, no domain in the URL which is needed by restbase [15:32:58] Logged the message, Master [15:33:33] !log run mount -o ro /dev/mapper/labstore-others--20150703 /mnt/backup/others-20150703/ on labstore1002 [15:33:40] Logged the message, Master [15:33:57] !log mkfs -t ext4 /dev/mapper/backup-others--20150703 on labstore2001 completed [15:34:03] Logged the message, Master [15:34:27] !log mkdir /srv/backup-others-20150703 on labstore2001 [15:34:34] Logged the message, Master [15:35:03] !log mount /dev/mapper/backup-others--20150703 /srv/backup-others-20150703/ on labstore2001 [15:35:09] Logged the message, Master [15:35:16] the pdf thing started around 18-19 UTC yesterday [15:35:51] !log cd /mnt/backup/others-20150703/ ; tar --acls --xattrs -cpf - . 
| pv -L 80M -C -p -r -e -b -t -B 32M -T | ssh -c chacha20-poly1305@openssh.com -i ~/.ssh/id_labstore root@labstore2001.codfw.wmnet "pv -C -B 32M | tar --acls --xattrs -xpf - -C /srv/backup-others-20150703" on labstore1002 [15:35:58] Logged the message, Master [15:36:45] mark: alright, so that's the others backup [15:36:58] \o/ [15:37:01] going at around 40 as well [15:37:03] :) [15:38:20] mark: don't think this would've been this painless a few months ago [15:38:25] but \o/ :) [15:38:29] it wasn't no [15:39:16] I'm curious to where the 44 number comes from... [15:39:33] I'll keep an eye on it over the next few days as well. [15:47:35] (03CR) 10QChris: [C: 031] Remove Gerrit replication to lanthanum.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/222595 (https://phabricator.wikimedia.org/T86658) (owner: 10Hashar) [15:58:20] mobrovac: that seems to be normal though, it's even in the docs :) [16:07:13] mobrovac: yeah seeing more conn refused errors from the bundler to restbase [16:11:38] where is it getting that url from [16:12:59] firewall? [16:13:05] no it's the wrong port [16:13:10] oh [16:13:15] 72310 instead of 7231? [16:32:29] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [16:45:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:52:06] are the pdf servers still an issue [16:52:07] ? [16:53:38] Nemo_bis, ^ [16:54:09] yes [16:54:13] we just found it [16:54:23] indeed [16:54:39] fucking php [16:54:43] just figured it out or just identified the issue? [16:54:49] just figured it out [16:54:49] $url = preg_replace( [16:54:49] '#/?$#', [16:54:49] '/' + $domain + '/v1/', [16:54:49] $params['url'] [16:54:49] ); [16:54:53] find the error [16:55:01] the addition [16:55:02] we've been staring at it for a while now [16:55:04] Sigh. 
[16:55:12] both mark and me [16:55:20] * paravoid feels completely stupid [16:55:21] I did something just like that a few days ago. They've had me doing JS for too long :p [16:55:24] i don't feel stupid [16:55:26] i don't ever write php [16:55:29] PROBLEM - Apache HTTP on mw1049 is CRITICAL - Socket timeout after 10 seconds [16:55:29] PROBLEM - HHVM rendering on mw1049 is CRITICAL - Socket timeout after 10 seconds [16:55:31] and that is stupid ;) [16:55:40] PROBLEM - Disk space on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:40] PROBLEM - nutcracker port on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:59] PROBLEM - dhclient process on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:08] PROBLEM - SSH on mw1049 is CRITICAL - Socket timeout after 10 seconds [16:56:19] PROBLEM - HHVM processes on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:19] PROBLEM - nutcracker process on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:28] handy: http://www.vclfiddle.net/ [16:56:39] PROBLEM - salt-minion processes on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:49] PROBLEM - puppet last run on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:59] PROBLEM - DPKG on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:00] PROBLEM - RAID on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:09] PROBLEM - configured eth on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:53] caused by https://gerrit.wikimedia.org/r/#/c/220036/2/RenderingAPI.php I guess [16:58:08] yes [16:58:38] Shall I change that and see if it fixes the issue? 
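The `+` in that preg_replace is the whole bug: PHP 5 arithmetic coerces non-numeric strings to 0, so `'/' + $domain + '/v1/'` evaluates to 0 and the replacement becomes the string "0", which is exactly how port 7231 turned into the mysterious 72310 seen earlier. A shell re-enactment of the substitution (the en.wikipedia.org domain is illustrative):

```shell
url='http://10.2.2.17:7231'

# What PHP 5 effectively did: the replacement for the '/?$' anchor was "0",
# which gets appended to a URL that has no trailing slash:
printf '%s\n' "$url" | sed -E 's#/?$#0#'
# -> http://10.2.2.17:72310   (the bogus port from the OCG logs)

# What string concatenation with '.' would have produced instead:
printf '%s\n' "$url" | sed -E 's#/?$#/en.wikipedia.org/v1/#'
# -> http://10.2.2.17:7231/en.wikipedia.org/v1/
```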
[16:58:46] i'm pretty sure it will
[16:58:50] please do, unless paravoid is already on it
[16:59:40] 19:59 < grrrit-wm> (PS1) Faidon Liambotis: Fix typo w/ VRS URL construction, commit e126f75 [extensions/Collection] - https://gerrit.wikimedia.org/r/222615
[16:59:54] :)
[17:00:10] now to apply this to all branches is a total PITA right?
[17:00:24] I need three backports or whatever plus another 3 mediawiki/core commits, right?
[17:00:25] it's just wmf12 I think?
[17:00:35] but what do I know
[17:00:38] I'll handle it.
[17:00:44] all this newfangled multiversion stuff :)
[17:01:26] . o O ( when I was a boy, we had ONE mediawiki version running only... )
[17:01:33] Krenair: many thanks :)
[17:01:38] yeah thanks :)
[17:01:58] We do right now as well:
[17:02:00] krenair@tin:/srv/mediawiki-staging/php-1.26wmf12/extensions/Collection ((57f718f...))$ mwversionsinuse
[17:02:01] 1.26wmf12
[17:02:12] good
[17:02:18] oh that's handy
[17:02:19] RECOVERY - salt-minion processes on mw1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:02:23] the train is going a bit fast these days
[17:02:39] RECOVERY - configured eth on mw1049 is OK - interfaces up
[17:02:48] RECOVERY - RAID on mw1049 is OK no RAID installed
[17:03:09] RECOVERY - Disk space on mw1049 is OK: DISK OK
[17:03:09] RECOVERY - nutcracker port on mw1049 is OK: TCP OK - 0.000 second response time on port 11212
[17:03:29] RECOVERY - dhclient process on mw1049 is OK: PROCS OK: 0 processes with command name dhclient
[17:03:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 9 below the confidence bounds
[17:03:39] RECOVERY - HHVM processes on mw1049 is OK: PROCS OK: 6 processes with command name hhvm
[17:03:39] RECOVERY - nutcracker process on mw1049 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[17:04:18] RECOVERY - puppet last run on mw1049 is OK Puppet is currently enabled, last run 29 minutes ago with 0 failures
[17:04:19] RECOVERY - DPKG on mw1049 is OK: All packages OK
[17:04:39] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.151 second response time
[17:04:40] RECOVERY - HHVM rendering on mw1049 is OK: HTTP OK: HTTP/1.1 200 OK - 68428 bytes in 0.273 second response time
[17:05:16] !log krenair Synchronized php-1.26wmf12/extensions/Collection/RenderingAPI.php: https://gerrit.wikimedia.org/r/#/c/222616/ - hoping this fixes T104708 (duration: 00m 44s)
[17:05:19] RECOVERY - SSH on mw1049 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[17:05:23] Logged the message, Master
[17:05:46] Krenair: didn't that need a mw/core commit for the submodule update as well?
[17:05:51] Nope.
[17:05:57] Gerrit does that automatically now
[17:05:57] oh?
[17:06:01] it does??
[17:06:03] omg
[17:06:05] yes
[17:06:35] gerrit or jenkins/zuul?
[17:06:44] https://git.wikimedia.org/log/mediawiki%2Fcore.git/refs%2Fheads%2Fwmf%2F1.26wmf12
[17:07:10] that's awesome
[17:07:19] (PS1) Jforrester: Enable VisualEditor for the Portal namespace on jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/222617
[17:07:57] I just tested a few pages and it looks OK now
[17:08:06] yep, logstash seems to confirm as well
[17:08:08] graphs for those servers have picked back up again too
[17:11:07] operations, OCG-General-or-Unknown, Services: Issues with OCG service in production? - https://phabricator.wikimedia.org/T104708#1425178 (Krenair) a:faidon Caused by https://gerrit.wikimedia.org/r/#/c/220036/ - fixed by https://gerrit.wikimedia.org/r/#/c/222615/ I just deployed it and it looks okay...
[17:11:12] operations, OCG-General-or-Unknown, Services: Issues with OCG service in production? - https://phabricator.wikimedia.org/T104708#1425180 (Krenair) Open>Resolved
[17:21:40] mark, paravoid: I left a note on the enwiki VPT section about it, thanks for investigating that
[17:28:28] !log pooled mw1152 (HHVM image scaler) for debugging.
[17:28:35] Logged the message, Master
[17:33:59] PROBLEM - HHVM rendering on mw1152 is CRITICAL - Socket timeout after 10 seconds
[17:33:59] PROBLEM - Apache HTTP on mw1152 is CRITICAL - Socket timeout after 10 seconds
[17:35:00] ori: what is that?
[17:35:35] hm
[17:35:36] dunno
[17:35:39] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time
[17:35:39] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 68427 bytes in 0.094 second response time
[17:36:03] imagescalers have (had?) a very low maxclients setting
[17:36:20] are you weighting this server differently?
[17:36:24] nope
[17:42:18] (PS2) Jforrester: Enable VisualEditor for the Portal namespace on jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/222617 (https://phabricator.wikimedia.org/T97313)
[17:48:41] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[17:49:59] ^ paravoid
[17:50:04] see, they're spiking
[18:10:08] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[18:10:22] hello
[18:10:36] mobrovac: ^
[18:10:59] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused
[18:11:08] PROBLEM - Apache HTTP on mw1152 is CRITICAL - Socket timeout after 10 seconds
[18:11:09] PROBLEM - HHVM rendering on mw1152 is CRITICAL - Socket timeout after 10 seconds
[18:12:49] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.185 second response time
[18:12:50] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 68425 bytes in 3.191 second response time
[18:15:36] operations, OCG-General-or-Unknown, Services: Issues with OCG service in production - https://phabricator.wikimedia.org/T104708#1425471 (Nemo_bis) p:Triage>Unbreak!
[18:16:58] Nemo_bis: did you mean to do that?
[18:16:59] PROBLEM - HHVM rendering on mw1077 is CRITICAL - Socket timeout after 10 seconds
[18:17:08] PROBLEM - Apache HTTP on mw1077 is CRITICAL - Socket timeout after 10 seconds
[18:17:34] why not
[18:18:50] PROBLEM - Disk space on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:18:50] PROBLEM - nutcracker process on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:18:58] PROBLEM - DPKG on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:18:59] PROBLEM - nutcracker port on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:19:48] PROBLEM - puppet last run on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:19:48] PROBLEM - RAID on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:19:49] PROBLEM - SSH on mw1077 is CRITICAL - Socket timeout after 10 seconds
[18:20:07] Nemo_bis: the last comment is that it appears to be fixed. is it not?
[18:20:19] PROBLEM - configured eth on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
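Earlier in the log, `mwversionsinuse` on tin reports a single live version (1.26wmf12). A minimal sketch of the idea behind such a tool, derived from a wikiversions.json-style wiki-to-version mapping; the file structure and names here are assumptions for illustration, not the actual multiversion implementation:

```python
import json

def versions_in_use(wikiversions):
    """Return the sorted distinct MediaWiki versions currently mapped to wikis."""
    return sorted(set(wikiversions.values()))

# Hypothetical mapping mirroring the single-version state seen in the log:
mapping = json.loads('{"enwiki": "php-1.26wmf12", "jawiki": "php-1.26wmf12"}')
print(versions_in_use(mapping))  # ['php-1.26wmf12']
```

During a train deploy the real mapping briefly holds two versions, which is why `mwversionsinuse` usually prints more than one line.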
[18:20:28] PROBLEM - Apache HTTP on mw1152 is CRITICAL - Socket timeout after 10 seconds
[18:20:29] PROBLEM - HHVM rendering on mw1152 is CRITICAL - Socket timeout after 10 seconds
[18:21:09] ori: it's still marked resolved, AFAICS https://phabricator.wikimedia.org/T104708
[18:21:21] oh i see
[18:21:51] a network outage
[18:21:53] just what I needed
[18:22:10] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time
[18:22:10] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 68425 bytes in 0.121 second response time
[18:22:45] hm, possibly not related to wikimedi
[18:22:49] RECOVERY - nutcracker port on mw1077 is OK: TCP OK - 0.000 second response time on port 11212
[18:22:49] (a)
[18:23:06] i can depool it if you need to attend to other things
[18:23:19] I lost my connection to the bastion :)
[18:23:40] PROBLEM - salt-minion processes on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:26:09] PROBLEM - dhclient process on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:26:09] PROBLEM - HHVM processes on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:27:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[18:28:19] PROBLEM - nutcracker port on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:41:57] greg-g: hey hoo would like to deploy a few backported patches for the wikidata quality extensions and then enable them on monday 13:30 - 15:00 Berlin time (2 hours before morning swat) (asking for him because he has power outage) is that ok for you that i add it to the deployments wiki page?
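The recurring "HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]" alerts follow a simple pattern: fetch a window of datapoints from Graphite and alert when too large a fraction exceeds a fixed value. A minimal sketch of that logic, with the 500.0 threshold taken from the log itself but the 5% alert cutoff being an assumption for illustration:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

def check(datapoints, threshold=500.0, critical_pct=5.0):
    """Return an Icinga-style (state, percentage) pair for a window of datapoints."""
    pct = percent_above(datapoints, threshold)
    state = "CRITICAL" if pct > critical_pct else "OK"
    return state, round(pct, 2)

# One spike of 900 req/min among 14 samples gives 1/14 = 7.14%, as in the alert:
print(check([120, 130, 900, 110] + [100] * 10))  # ('CRITICAL', 7.14)
```

Filtering out `None` first matters with Graphite data, since missing datapoints would otherwise skew the percentage.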
[18:42:24] jzerebecki: yeppers
[18:42:33] k thx
[18:42:35] fyi: I'm going to be on vacation all next week :)
[18:42:59] RECOVERY - Disk space on mw1077 is OK: DISK OK
[18:42:59] RECOVERY - nutcracker port on mw1077 is OK: TCP OK - 0.000 second response time on port 11212
[18:43:08] RECOVERY - nutcracker process on mw1077 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[18:43:09] RECOVERY - DPKG on mw1077 is OK: All packages OK
[18:43:49] RECOVERY - RAID on mw1077 is OK no RAID installed
[18:43:49] RECOVERY - puppet last run on mw1077 is OK Puppet is currently enabled, last run 46 minutes ago with 0 failures
[18:43:49] RECOVERY - salt-minion processes on mw1077 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:43:49] RECOVERY - SSH on mw1077 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[18:44:18] RECOVERY - configured eth on mw1077 is OK - interfaces up
[18:44:38] RECOVERY - dhclient process on mw1077 is OK: PROCS OK: 0 processes with command name dhclient
[18:44:38] RECOVERY - HHVM processes on mw1077 is OK: PROCS OK: 6 processes with command name hhvm
[18:44:40] (PS1) Ori.livneh: mediawiki apache config: don't load mod_deflate on HHVMs [puppet] - https://gerrit.wikimedia.org/r/222673
[18:53:28] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[18:54:49] PROBLEM - puppet last run on mw1077 is CRITICAL puppet fail
[19:09:42] (PS1) Yuvipanda: labstore: Minor code cleanup of the exports daemon [puppet] - https://gerrit.wikimedia.org/r/222690
[19:10:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[19:12:34] (PS1) Jeremyb: add HTTPS variants for wmfblog in feed whitelists [mediawiki-config] - https://gerrit.wikimedia.org/r/222691 (https://phabricator.wikimedia.org/T104727)
[19:14:36] greg-g, who should we ask about deployment decisions next week?
[19:17:49] PROBLEM - puppet last run on mw1069 is CRITICAL puppet fail
[19:19:19] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[19:20:59] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 42.92 ms
[19:22:30] (PS1) Dzahn: static-bugzilla: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/222692
[19:28:59] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[19:40:29] (PS20) Paladox: Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/207909
[19:51:07] as far as I can tell, looking at the graphs in the dashboards that mobrovac sent in his email to the ops list restbase1001 needs a restart, and looks like restbase1005 is beginning to get there. /cc urandom
[19:55:55] /cc James_F ^
[20:03:21] subbu: :-(
[20:03:28] subbu: Who can do that?
[20:04:50] I don't know. have to be some ops person. looking at the services channel, urandom said "i'll be at a hotel with wifi by ~9pm EST" .. of course, i don't know if this is a big deal or not. but, just flagging based on the mail that marko had sent on the list.
[20:05:56] reg big deal or not .. it depends on whether there is enough redundant capacity in the 6 nodes that one being down is not an issue.
[20:07:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[20:11:38] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon
[20:12:00] !log restarted cassandra on restbase1001
[20:12:07] Logged the message, Master
[20:13:00] ori, thanks. i saw this too late .. i emailed the ops list just now. so maybe respond to that.
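The "Cassandra database" check above is a process-table match: it counts processes with UID 111 (cassandra), command name `java`, and `CassandraDaemon` in the arguments, going CRITICAL at zero. A minimal sketch of that matching logic over a pre-collected process list (the tuple layout here is an assumption; the real check is Nagios `check_procs` via NRPE):

```python
def count_matching(procs, uid, command, arg):
    """Count entries in procs, an iterable of (uid, command, cmdline) tuples,
    matching the given uid and command with arg somewhere in the cmdline."""
    return sum(
        1
        for p_uid, p_cmd, p_args in procs
        if p_uid == uid and p_cmd == command and arg in p_args
    )

def check_cassandra(procs, uid=111):
    """Icinga-style result for the check described in the log."""
    n = count_matching(procs, uid, "java", "CassandraDaemon")
    if n == 0:
        return "CRITICAL: 0 processes"
    return "OK: %d process(es)" % n

procs = [(111, "java", "org.apache.cassandra.service.CassandraDaemon"),
         (0, "java", "some-other-jvm")]
print(check_cassandra(procs))  # OK: 1 process(es)
```

Matching on UID plus an argument substring, rather than command name alone, is what lets the check distinguish the Cassandra JVM from any other `java` process on the host.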
[20:13:47] done
[20:13:59] PROBLEM - puppet last run on mw2194 is CRITICAL puppet fail
[20:14:09] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.008 second response time on port 9042
[20:30:39] RECOVERY - puppet last run on mw2194 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[20:47:04] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1425681 (Krenair)
[21:02:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[21:15:14] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1425725 (BBlack) I think the whole point of phab.wmfusercontent.org was a security thing to begin with, so that scripts/content/whatever uploaded to phab by users wouldn't be consi...
[21:28:15] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1425751 (Legoktm) We should add some content at https://wmfusercontent.org then to state that.
[21:30:38] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[21:35:02] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1425765 (Mike_Peel) BBlack: why?
[21:39:28] (CR) Krinkle: Log privileged users with short passwords (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: CSteipp)
[21:46:49] !log depooled mw1152
[21:46:56] Logged the message, Master
[22:03:52] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1425783 (csteipp) >>! In T104730#1425725, @BBlack wrote: > I think the whole point of phab.wmfusercontent.org was a security thing to begin with, so that scripts/content/whatever u...
[22:09:50] ori: Hm... 22 million HTTP 204 responses randomly showed up :D
[22:13:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[22:14:10] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1425822 (Krinkle)
[22:26:31] operations, Graphite: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1425863 (Krinkle) NEW
[22:27:55] operations, Graphite: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1425873 (Krinkle)
[22:32:03] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1425882 (Mike_Peel) Or they could be uploaded to commons, which copes with svgs just fine?
[22:36:07] !log restarted apache on silver to see if it would make https://gerrit.wikimedia.org/r/#/c/221969/ take effect for T104360. It did not.
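The "HTTP error ratio anomaly detection" check seen throughout this log reports counts of datapoints above and below confidence bounds (e.g. "10 data above and 9 below"). A minimal sketch of that counting step, assuming bounds have already been computed upstream (in production they come from Graphite's Holt-Winters confidence bands; the `max_outliers` cutoff here is an assumption for illustration):

```python
def outliers(series):
    """Count datapoints outside their confidence bounds.

    series: iterable of (value, lower_bound, upper_bound) tuples.
    Returns an (above, below) pair of counts.
    """
    above = sum(1 for v, lo, hi in series if v > hi)
    below = sum(1 for v, lo, hi in series if v < lo)
    return above, below

def anomaly_detected(series, max_outliers=5):
    """Flag an anomaly when too many points fall outside the bounds."""
    above, below = outliers(series)
    return (above + below) > max_outliers

# Six spikes above a [100, 500] band among sixteen samples trip the check:
series = [(600, 100, 500)] * 6 + [(300, 100, 500)] * 10
print(outliers(series), anomaly_detected(series))
```

Counting excursions on both sides of the band is what lets the check catch sudden drops (e.g. lost traffic) as well as error spikes.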
[22:36:11] Logged the message, Master
[22:38:58] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[22:58:59] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[23:06:19] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.628 second response time
[23:19:25] (PS1) Ori.livneh: Force 'Transfer-Encoding: Chunked' header on 404 responses [mediawiki-config] - https://gerrit.wikimedia.org/r/222721
[23:19:58] (PS2) Ori.livneh: Force 'Transfer-Encoding: Chunked' header on 404 responses [mediawiki-config] - https://gerrit.wikimedia.org/r/222721
[23:20:06] (CR) Ori.livneh: [C: 2] Force 'Transfer-Encoding: Chunked' header on 404 responses [mediawiki-config] - https://gerrit.wikimedia.org/r/222721 (owner: Ori.livneh)
[23:20:12] (Merged) jenkins-bot: Force 'Transfer-Encoding: Chunked' header on 404 responses [mediawiki-config] - https://gerrit.wikimedia.org/r/222721 (owner: Ori.livneh)
[23:20:39] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL check_failover servers up 2 down 1
[23:24:25] !log ori Synchronized w/404.php: Force 'Transfer-Encoding: Chunked' header on 404 responses (duration: 00m 31s)
[23:24:29] Logged the message, Master
[23:30:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 30.77% of data above the critical threshold [500.0]
[23:41:29] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[23:54:09] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.222 second response time
[23:55:56] !log legoktm Synchronized php-1.26wmf12/extensions/WikiLove/: WikiLove+UserMerge fixes (duration: 00m 18s)
[23:56:01] Logged the message, Master
[23:56:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[23:59:52] !log legoktm Synchronized php-1.26wmf12/extensions/Translate/: Translate+UserMerge fixes (duration: 00m 17s)
[23:59:56] Logged the message, Master