[00:02:52] !log Deployed fix for T95589 [00:02:58] Logged the message, Master [00:06:37] Our patches are done rmoen, although officially your swat window is over... Officially. [00:07:36] (03PS1) 10Ori.livneh: Add dummy "/preconnect" URL endpoint on restbase varnishes [puppet] - 10https://gerrit.wikimedia.org/r/203262 [00:08:08] bblack: ^ is a varnish change but a simple one. [00:08:16] gwicke: ^ [00:09:25] ori, would be funny if cluster dies because of it ;) [00:09:41] Hilarious, I'm sure :) [00:10:08] sorry, i meant the other patch - where all comments are replaced with '//' [00:10:27] this one is less intrusive )) [00:10:51] That reminds me, I have multiple-hundreds-of-lines changes to make to configuration. [00:11:01] Now you've really doomed the cluster yurik :) [00:11:24] (03CR) 10GWicke: [C: 031] "Looks good to me. It's unlikely to conflict with domains, although /_preconnect might be even less likely to do so." [puppet] - 10https://gerrit.wikimedia.org/r/203262 (owner: 10Ori.livneh) [00:11:57] ori, q for you - i was discussing perf evaluation with tfinc, and he suggested i talk to you - what would be a good way to evaluate performance of a server configuration? labs is good for functionality, not perf [00:12:17] this is for the osm tile server [00:13:19] get a https://www.blitz.io/ trial [00:14:03] (03PS2) 10Ori.livneh: Add dummy "/preconnect" URL endpoint on restbase varnishes [puppet] - 10https://gerrit.wikimedia.org/r/203262 [00:14:40] yurik: see if you like it; if you find it useful we can procure a subscription [00:14:48] ori, that's good, but what about the servers themselves? [00:15:03] virt cluster is not the same as hardware [00:15:06] esp for disk io [00:15:18] well, what do you want to evaluate? 
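The load-testing question above can be approximated locally before reaching for a blitz.io trial. A minimal sketch, assuming a throwaway local stub server standing in for the not-yet-built tileserver (the handler, port choice, and path are all illustrative):

```python
import http.server
import threading
import time
import urllib.request

class TileStub(http.server.BaseHTTPRequestHandler):
    """Serves a fixed fake 'tile' for any path; a stand-in for the real
    tileserver, which doesn't exist yet."""
    def do_GET(self):
        body = b"\x89PNG fake tile"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # keep the benchmark output clean
        pass

def measure_rps(url, n=100):
    """Crude serial throughput probe -- a laptop-scale stand-in for the
    blitz.io-style load test discussed above, not a production benchmark."""
    t0 = time.perf_counter()
    for _ in range(n):
        urllib.request.urlopen(url).read()
    return n / (time.perf_counter() - t0)

# Port 0 lets the OS pick any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), TileStub)
threading.Thread(target=server.serve_forever, daemon=True).start()
rps = measure_rps(f"http://127.0.0.1:{server.server_address[1]}/tile/0/0/0.png")
server.shutdown()
```

As gwicke notes below, laptop numbers are only a rough proxy for production hardware, but they give a defensible figure to put in a procurement request.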
[00:15:43] basically - the perf of a map tile server (that hasn't been built yet) [00:16:02] because that will determine our server need and configuration [00:16:09] of a cluster [00:16:53] ask rob.h for a bare metal spare from the list: https://wikitech.wikimedia.org/wiki/Server_Spares [00:17:01] say you need it for a week to do load testing [00:17:28] you can ask for two servers with different builds so you can get a sense of the impact of different build configurations [00:18:00] yurik: if it's about latency & not volume, then testing on your local laptop should be pretty accurate [00:18:20] you can subtract a couple of % to approximate production hardware for single-thread performance [00:19:56] thank you both, good thoughts [00:20:29] i like gwicke's idea [00:21:27] that way, when you file a procurement request, you can simply say: "this is the volume of requests we're expecting once we're in production; here's how my laptop performs. what would be a reasonable build configuration?" [00:24:25] (03CR) 10BBlack: [C: 04-1] "Insufficient cowbell" [puppet] - 10https://gerrit.wikimedia.org/r/203262 (owner: 10Ori.livneh) [00:24:46] (03CR) 10BBlack: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/203262 (owner: 10Ori.livneh) [00:25:01] 6operations, 7Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1196497 (10GWicke) Not sure why, but it looks like the old data didn't make it across from rate.wsp. The new sample_rate metrics all look like they started from scratc... [00:25:09] hah [00:25:34] bblack: ok if i merge?
[00:25:38] yeah [00:25:45] thanks, going for it [00:26:08] (03CR) 10Ori.livneh: [C: 032] Add dummy "/preconnect" URL endpoint on restbase varnishes [puppet] - 10https://gerrit.wikimedia.org/r/203262 (owner: 10Ori.livneh) [00:27:02] forcing a puppet run on cp1045 to verify [00:27:07] ori, i just wish the procurement requests weren't due a year in advance ;) [00:28:17] yurik: it depends on how you approach it. if you have a puppet module with a 'decom.pp' manifest that cleanly removes the role, and you cite the patch that introduces it in the procurement request, and you give some details about how you plan to load-test, it may get processed quickly [00:29:26] cool, works on cp1045 [00:30:07] yurik: for prod OSM tileservers, your procurement request is easy without benchmarking. You need approximately all the servers :) [00:30:38] (03PS1) 10BBlack: T86663 5.5: pool 3046; depool 3007,amssq59-60 [puppet] - 10https://gerrit.wikimedia.org/r/203264 [00:30:46] bblack, exactly! +20% on top for the safety [00:30:56] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.5: pool 3046; depool 3007,amssq59-60 [puppet] - 10https://gerrit.wikimedia.org/r/203264 (owner: 10BBlack) [00:32:18] are there any numbers for varnishes - e.g. how many smallish images (256x256) it can serve? [00:32:28] assuming in cache [00:33:03] in other words - how many 5-10kb images can varnish serve per second / per connection [00:33:11] !log rmoen Synchronized php-1.26wmf1/extensions/MobileFrontend/: sync mobilefrontend for cherry-pick (duration: 01m 07s) [00:33:16] Logged the message, Master [00:33:36] yurik: in terms of RAM, varnish has an overhead of about 1k per object [00:33:43] yurik: don't know. design for scalability!
[00:33:58] (per https://www.varnish-cache.org/docs/3.0/tutorial/sizing_your_cache.html) [00:34:17] bblack, obviously, but i still need to approximate how much one varnish server can serve [00:34:22] once you have a suitably-scalable design for the whole affair, it shouldn't be that huge a deal to put the pieces of that design together on several reasonable test hosts and take some perf measurements to decide how much to buy. [00:34:58] !log rmoen Synchronized php-1.26wmf1/extensions/Gather/: update gather to master (duration: 01m 07s) [00:35:02] Logged the message, Master [00:35:02] bblack, unfortunately we already need to give some hard numbers [00:35:06] but really, without the whole-stack picture of how it will work to test with, it's going to be hard to estimate anything [00:35:10] MaxSem^ [00:35:20] my thoughts exactly ) [00:35:45] you could get a lot of mileage out of labs [00:36:01] (btw, I'd highly recommend trying to architect this as if SVG were all that mattered and work natively in SVG only, and then tack on an SVG->PNG/JPG layer only for browsers that need it and request them) [00:36:37] bblack, sure, but why? aren't SVGs much bigger? [00:36:55] * ori tries SO HARD to be good [00:37:04] * ori sits on hands and rocks back and forth. [00:37:06] !log rmoen Synchronized php-1.25wmf24/extensions/MobileFrontend/: sync mobilefrontend for cherry-pick (duration: 01m 07s) [00:37:09] Logged the message, Master [00:37:12] ori, do tell :D [00:37:22] not really, especially for the smaller tiles that may contain few vector objects. also, easier to work with for all kinds of transformations (like translating text labels), and easier to scale (duh) [00:37:25] i haven't seen recent stats for svg [00:37:38] scaling - yes :) [00:37:53] btw, our initial target is mobile apps [00:38:04] would svg cause much heavier battery use?
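The ~1k-per-object overhead cited above translates directly into a quick cache-sizing estimate. A back-of-the-envelope sketch; the tile count and average size below are illustrative guesses, not measurements:

```python
def varnish_ram_estimate(n_objects, avg_object_bytes, overhead=1024):
    """Rough RAM needed to keep n_objects hot in a Varnish malloc cache.

    The ~1 KiB per-object overhead figure comes from the Varnish 3.0
    cache-sizing doc linked in the discussion; treat it as an estimate,
    not a guarantee.
    """
    return n_objects * (avg_object_bytes + overhead)

# e.g. keeping 10 million ~8 KiB tiles in cache:
gib = varnish_ram_estimate(10_000_000, 8 * 1024) / 2**30  # roughly 86 GiB
```

The overhead term dominates only for very small objects; for 5-10 KB tiles it adds roughly 10-20% on top of the raw payload size.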
[00:38:11] mmm, perf testing on laptop is impossible because full db is over 500G, and using just a part of it kinda kills the purpose [00:38:34] I think battery use is the least of your concerns right now, I donno :) [00:38:48] hehe [00:39:03] !log rmoen Synchronized php-1.25wmf24/extensions/Gather/: update gather to master (duration: 01m 13s) [00:39:06] Logged the message, Master [00:39:18] yurik: as I said before, I would start your architecture planning by looking at your tile generation latency first [00:39:35] as that combined with your target latency will drive a lot of it [00:40:02] it determines whether you can afford cache misses for example [00:40:07] tile-caching is a very specialized thing, too. you may or may not even want to use varnish for it. [00:40:10] gwicke, i agree, will need to work with some OSM ppl to get some suggestions / numbers from them [00:40:29] cache misses are unavoidable [00:40:46] bblack, are you proposing alternative caching/file cache? [00:40:58] for a narrow definition of cache, yes [00:41:03] cassandra! [00:41:05] not if you count storage [00:41:33] bblack, osm.org uses squid to some degree of success, I had an impression that varnish is just better than it [00:41:37] I'm pretty sure google maps is not just caching their tiles [00:41:40] regardless, whatever you cache tiles with will want a specialized hash that tries to be geospatially-local. e.g.
if tile filenames or indexes are by top-left coordinate + size, hash-bin them by the top-left coordinate of the enclosing block of a certain larger size (which is larger than most tiles, but still numerous enough chunks in total to make spreading over many servers easy) [00:42:04] of course [00:42:27] all of those little details are going to factor greatly into how much hardware you end up needing I think [00:42:32] i'm thinking of using rendezvous algo by geo location [00:42:35] quadtree coordinates kind of do that for free [00:43:01] this allows TOP N hashing load balancing among multiple servers [00:43:33] similar to consistent hashing, but more flexible [00:43:50] yeah [00:44:10] unfortunately it seems there is only a consistent hashing algo for VCL [00:44:15] well, once you decide on a way to hash that makes geospatial sense, you can always plug that into a consistent-hashing mechanism as well [00:44:18] (and should) [00:44:33] VCL has any hash algorithm you want it to have, if you're willing to write code :) [00:45:25] bblack, consistent hashing seems worse than HRW (rendezvous). Consistent is implemented here https://code.uplex.de/uplex-varnish/libvmod-vslp [00:45:39] eh [00:46:04] "consistent hashing" is not a data-hashing algorithm. it's a thing you do with the output of a data-hashing algorithm to make server failover less painful [00:46:18] you can mix and match whatever data-hashing you want with the consistent technique [00:46:36] correct - here by hash i really meant load balancing / failover technique [00:46:44] right [00:47:12] but it seems there is really no need to do geo-proximity for the backend servers - MaxSem?
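The HRW (rendezvous) scheme being discussed fits in a few lines: every server is scored against the key, and the top-N scorers own it. A minimal sketch; the server names and the choice of MD5 as the scoring hash are illustrative:

```python
import hashlib

def hrw_pick(key, servers, n=1):
    """Rendezvous (highest-random-weight) hashing.

    Each server gets a pseudo-random score for the key; the n highest
    scorers own it.  When a server disappears, only the keys it was
    winning get remapped -- the same failover property consistent
    hashing is after, without a ring.
    """
    return sorted(
        servers,
        key=lambda s: hashlib.md5(f"{s}|{key}".encode()).hexdigest(),
        reverse=True,
    )[:n]
```

The `key` here could be a tile path, or better a geospatially-binned key (metatile or quadtree block), so nearby tiles land on the same backend.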
[00:47:30] tiles get rendered in bunches - 9x9, so hash would have to account for that, [00:47:39] probably, as long as we hash by metatile [00:47:43] exactly [00:48:24] then again, if SQL queries go to close geo-chunks, it might be more efficient [00:48:49] but again, will need to talk to OSM ppl more [00:48:54] right [00:49:44] mmm, gotta investigate if pg works faster if there's a bunch of queries in the same area. common sense says yes, but who knows their r-trees [00:50:19] there's layers of stuff here I don't even remember. But I remember thinking that the layer that does the initial db -> svg transform and saves all the svgs down to some level of detail.... [00:50:36] you'd configure N servers at that layer which are all perfectly capable of answering any request [00:50:48] is there a finite number of rendered tiles, and is it plausible to have all of them in the cache at all times? [00:51:03] ori, 1) yes 2) not really [00:51:27] and have the varnish (or varnish-like layer) backending to those, to fetch SVGs, and have it using a geospatially-aware hash for that backend-server mapping [00:51:31] it's gonna be ~60T. nobody knows for sure because nobody did it [00:51:56] (with chash thinking thrown in, so that you don't clobber the whole scheme because one of them died) [00:52:47] ori: (2) is not really, but it's a yes down to a certain level of detail zoom. you basically define a cutoff below which (a) it's impractical to cache all the tiles but (b) they're so small you can render them on the fly when requested (from larger SVGs of the area, not from SQL queries) [00:53:09] what about ? [00:53:37] amazon might be interested in hosting that. they don't have a map product to compete with apple or google. [00:54:06] the whole reason we want it self-hosted is to avoid privacy issues I think. [00:54:34] I don't know how much we leak of that if we just backend to another service from ours for data that's cached up, though.
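Hashing by metatile, combined with the "quadtree coordinates kind of do that for free" locality idea above, might look like the sketch below. The chat mentions both 9x9 and 8x8 bunches; 8x8 is mod_tile's default metatile size, so that is assumed here, and the Morton key is one standard way to bit-interleave quadtree coordinates:

```python
METATILE = 8  # mod_tile's default; the chat also mentions 9x9 bunches

def metatile_key(z, x, y):
    """All tiles rendered together in one metatile share a key, so a
    backend hash on this key keeps a whole rendering batch on one server."""
    return (z, x // METATILE, y // METATILE)

def morton(x, y, bits=16):
    """Bit-interleaved (Z-order) key: nearby coordinates get nearby keys,
    one cheap way to make a cache or store geospatially local."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key
```

Feeding `morton(*metatile_key(z, x, y)[1:])` (or the tuple itself) into the backend-selection hash would give both batch affinity and geographic clustering.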
[00:54:37] also, one thing is to store for researchers, another thing is to store and serve at a scale [00:54:53] It's S3 [00:55:17] last time I talked to/about OSM, OSM.org didn't want to work at scale for endusers regardless, but I don't know if that's still the case (otherwise why wouldn't we just offload tileserving directly to them) [00:55:38] (well and also (b) OSM.org privacy policy, while better than gmaps, is still not quite like ours...) [00:55:44] MaxSem: is your 60T number assuming any kind of compression, or does it count each blue or green tile? [00:56:30] osm.o's maps are for mappers - they don't have the bucks to make a generic tileserver cluster [00:57:10] gwicke, png's are already compressed, and yes - all those squares are included afaik [00:58:07] the point is more that at high zoom levels and certain blue regions the delta between the tiles becomes rather small [00:58:46] (03PS1) 10BBlack: T86663 5.5: switch cp3007 role [puppet] - 10https://gerrit.wikimedia.org/r/203267 [00:58:57] no, they don't have a duplicate-free store atm [00:59:10] there are a couple of really popular solid-coloured tiles, which you could represent as a number or compress generically in a block [00:59:39] it's slightly less popular because they're stored as 8x8 metatiles [00:59:47] I really wish we could solve the privacy issue and then grant money from wmf->osm to scale up their hardware for tileserving in exchange for a right to use them as our tileserver [01:00:44] gwicke: "Indian Ocean, Atlantic Ocean... all the same #0000FF to me" [01:00:46] there are a couple of storage systems that can do block compression for you [01:00:48] bblack, it also gets to the point where they might need ft devs to scale it. which costs more than h/w [01:00:59] as a naive implementation [01:01:14] yeah but same for us. either way the money gets spent. 
it's a matter of existing expertise and organizational focus, imho [01:01:17] ori: yup [01:02:36] bblack, I'm not familiar with squid - is it much harder to do all the stuff we discussed for varnish in squid? [01:02:54] (03CR) 10Dzahn: "rbf1001 has rdf1002: rdf1001 D" [puppet] - 10https://gerrit.wikimedia.org/r/203216 (https://phabricator.wikimedia.org/T95153) (owner: 10Dzahn) [01:03:05] I don't know if harder or easier really. it's harder to do as good a job in squid for sure, varnish is far more performant and flexible. [01:03:14] heh [01:03:23] it might be easier in squid in practice because you simply can't do all the optimal things you'd try to do in varnish :) [01:03:56] might be one reason why their infra is so unscalable [01:04:01] it's kinda like asking whether it's easier or harder to do some_performance_intensive_thing in C or PHP [01:04:29] hehe, but i would rather stay with varnish - i already know it relatively well, thus more chance to do some creative breaking with it [01:04:44] (03CR) 10BBlack: [C: 032] T86663 5.5: switch cp3007 role [puppet] - 10https://gerrit.wikimedia.org/r/203267 (owner: 10BBlack) [01:05:01] unless there is some really good reason to switch, like more features / much higher perf [01:05:02] I'm not thinking of using squid, I'm thinking about their existing infra [01:05:09] (03CR) 10Dzahn: [C: 032] delete rbf hosts from DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/203216 (https://phabricator.wikimedia.org/T95153) (owner: 10Dzahn) [01:05:31] yeah [01:05:35] MaxSem, we should use ops experience of squid to varnish migration ) [01:05:44] and help them migrate :D [01:06:00] thus they will make identical infrastructure [01:06:02] and why they have 2 tileservers that are separated only by geodns so that both could be rendering the same tile at the same time [01:06:46] bblack, i didn't get your thought about svg & N servers
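The "duplicate-free store" idea raised earlier (many tiles are identical solid-colour squares of ocean or land) can be sketched as a content-addressed index. A toy version, with in-memory dicts standing in for whatever real backend would be used:

```python
import hashlib

class DedupTileStore:
    """Toy content-addressed tile store: identical tiles (the same solid
    ocean blue, say) are stored once and referenced by digest -- the
    'duplicate-free store' idea from the discussion.  The dicts here are
    a stand-in for a real storage backend."""

    def __init__(self):
        self.blobs = {}   # digest -> tile bytes (stored once)
        self.index = {}   # (z, x, y) -> digest

    def put(self, z, x, y, data):
        digest = hashlib.sha1(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.index[(z, x, y)] = digest

    def get(self, z, x, y):
        return self.blobs[self.index[(z, x, y)]]
```

With most high-zoom ocean tiles byte-identical, the index costs a few dozen bytes per tile while the payload is stored once, which is where the bulk of the "~60T" estimate could collapse.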
little. but the servers directly beneath varnish, which can talk to SQL databases and pre-/re- generate SQL->SVG, and then probably SVG->SVG for zoomier tiles, and maybe even also SVG->PNG/JPG, it matters a lot [01:08:13] so you want the geospatially-aware hash on the balancing from varnish to that layer [01:08:29] (03CR) 10Dzahn: "rbf1001 has rdf1002: rdf1001 D" [puppet] - 10https://gerrit.wikimedia.org/r/203217 (https://phabricator.wikimedia.org/T95153) (owner: 10Dzahn) [01:09:07] mod_tile does SVG->SVG for scaling? [01:09:12] interesting [01:09:23] I don't know/remember, but either way, find a way :) [01:09:46] yep, that's what MaxSem is for :D [01:09:52] worst case, pregenerate SVG for the first N zoom layers until it gets unwieldy, then generate on the fly beneath that layer [01:10:16] yeah, i also suspect we can easily do rsync of those tiles between servers [01:10:29] they rarely change, and easier to gen them just once [01:10:49] (to be fair "matters little at the varnish layer" is assuming varnish is just 1 layer with random distribution. if structured as 2layer like our text/upload caches, then yes, you'd want the same geo-aware hash for varnish layer1->layer2 as well I'd think, or at least similar with perhaps different zoom cutoffs and such) [01:11:23] yes, that's what i am planning too. Actually I am debating if varnish#1 should even have a cache [01:11:39] in other words use varnish #1 as a customizable load balancer [01:11:49] (03PS1) 10Dereckson: Throttle rule for Editatón Ciencia y Tecnología en Chile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203269 [01:11:52] having a memory cache there helps a ton with most other things, probably would here too [01:12:05] just to absorb hits on whatever the hottest tiles are from the lower layers [01:12:23] (03PS4) 10Dzahn: remove rbf* production dns [dns] - 10https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153) (owner: 10John F. 
Lewis) [01:13:12] bblack, i wonder how much of an impact it is to actually cache vs pass [01:13:19] (esp useful in a perf spike, like if some reddit user posts a link to a specific funny map URL and thousands of excess hits all go for one zoomed area out of nowhere) [01:13:32] yurik: are you considering storing tiles as files? [01:13:34] guess you are right [01:13:39] gwicke, not really [01:13:48] mod_tile will store them as files [01:13:50] the front layer doesn't have much impact on broad general access patterns; it absorbs spikes and the heaviest content of the moment [01:13:52] in its own storage [01:13:53] okay, was worried for a moment there [01:13:53] gwicke, yes - tiles are cached on tileservers [01:13:54] (FS) [01:14:55] (03CR) 10Dzahn: [C: 032] remove rbf* production dns [dns] - 10https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153) (owner: 10John F. Lewis) [01:14:56] actually, mod_tile is modular enough so you can plug whatever store system you want. that's why I was pondering about Cassandra [01:15:07] if you store a significant number of tiny files in the fs you probably won't have a lot of fun [01:15:39] beyond a certain zoom level, pre-rendering or storing never makes sense really [01:16:04] it will be faster to just generate those on the fly when requested, with varnish mitigating repeated requests for the same high-zoom tiles [01:16:25] bblack, thanks, interesting idea [01:16:26] (and hopefully generate them on the fly in an efficient way, e.g. from the SVG you have available from a lower zoom level rather than SQL?) [01:17:03] I'd think that the bigger players pre-generate based on access stats [01:17:05] bblack, zoom levels vary in details (and even coloring) [01:17:10] (03PS2) 10Dzahn: beta: lint [puppet] - 10https://gerrit.wikimedia.org/r/202655 [01:17:29] and some rules perhaps, like [01:17:39] gwicke: the outer (e.g.
varnish) layer will cover the same thing really and auto-tune to access patterns, it just needs to be sufficiently-sized. [01:18:09] (^ re: bigger players pre-generate) [01:18:14] yeah, but when you roll out the new map look you'll want to pre-render it all [01:18:44] otherwise your latency sucks on roll-out [01:18:45] or just don't flip the switch all at once. does it matter if every user on every wiki sees the new tiles at the same instant? [01:19:05] nah, i think we can show different stuff [01:19:06] (03PS1) 10BBlack: T86663 5.2: repool cp3007 [puppet] - 10https://gerrit.wikimedia.org/r/203270 [01:19:08] (03PS3) 10Dzahn: beta: lint [puppet] - 10https://gerrit.wikimedia.org/r/202655 [01:19:13] (03PS2) 10Dzahn: backup: lint [puppet] - 10https://gerrit.wikimedia.org/r/202656 [01:19:19] people probably expect things to line up ;) [01:19:19] as long as it is per user i guess [01:19:21] (03CR) 10Dzahn: [C: 032] beta: lint [puppet] - 10https://gerrit.wikimedia.org/r/202655 (owner: 10Dzahn) [01:19:36] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.2: repool cp3007 [puppet] - 10https://gerrit.wikimedia.org/r/203270 (owner: 10BBlack) [01:19:38] (03CR) 10jenkins-bot: [V: 04-1] backup: lint [puppet] - 10https://gerrit.wikimedia.org/r/202656 (owner: 10Dzahn) [01:19:44] could be fun though, a checker-board of old & new tiles [01:20:07] :) [01:20:07] (03PS3) 10Dzahn: backup: lint [puppet] - 10https://gerrit.wikimedia.org/r/202656 [01:20:09] gwicke, i guess new things like that should deploy per ip range unless we want to build a separate migration cluster [01:20:18] (03CR) 10jenkins-bot: [V: 04-1] backup: lint [puppet] - 10https://gerrit.wikimedia.org/r/202656 (owner: 10Dzahn) [01:20:34] just show a site notice "sorry, our maps gonna suck for a while". wikipedia style! 
[01:20:52] yurik: yes, that'd be a good way to do it [01:20:57] if the tile URLs have a /vNNN/ in the path, you could progressively update whatever generates tile URL refs by-user somehow and let the caches swap out slowly. [01:20:58] or use the replica in the other DC [01:21:05] hmm... actually this will happen regardless - because different tiles might have different expiration :) [01:21:13] in browser [01:21:40] (03PS4) 10Dzahn: backup: lint [puppet] - 10https://gerrit.wikimedia.org/r/202656 [01:21:44] in other words, if you want cache-consistency to a user, don't publish URLs that GET data which mutates [01:21:57] put a version in the URL somehow and don't mutate extant URL paths [01:22:04] yep ) [01:22:17] that's not how google maps works [01:22:23] gwicke, re lots of files: they seem to have a 35% inode usage atm [01:22:41] MaxSem: it's doable, but performance sucks [01:22:56] how google maps works probably isn't all that relevant to us [01:22:58] you'll use more space than your net image size [01:23:02] * yurik proposes to use mysql with blobs for tile store... 
and hides [01:23:09] (03CR) 10Dzahn: [C: 032] backup: lint [puppet] - 10https://gerrit.wikimedia.org/r/202656 (owner: 10Dzahn) [01:23:13] while with other storage solutions you'll probably end up using a good amount less [01:23:15] how google maps works, at some level, is on 3,000 engineers and $300M of hardware :P [01:23:17] * yurik also contemplates nosql db [01:23:38] all those identical tiles compress rather well, for example [01:24:03] even better if you only store a number instead [01:24:21] or a string, like #0000ff [01:24:29] ;) [01:24:29] yurik, with cassandra's 10ms latency for objects of our size, can be good for us [01:25:14] couchbase might also be worth a look [01:25:55] we already have cassie [01:26:05] MaxSem, i say we do it "easy path" - storing things on FS at first, use https://www.blitz.io/ that ori proposed, and decide if it is good enough for the first pass [01:26:25] once we get some traction, we could switch to cassandra or couchbase or whatever else we may find [01:26:37] basically - deliver something fast first [01:26:45] then improve incrementally [01:26:49] If I were you I'd generate a modestly-sized dataset (200G or so), then store that on the fs vs. db & compare [01:26:49] otherwise we will never ship ) [01:27:05] (03PS2) 10Alex Monk: Throttle rule for Editatón Ciencia y Tecnología en Chile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203269 (https://phabricator.wikimedia.org/T95302) (owner: 10Dereckson) [01:27:33] gwicke, can't do a quick comparison cuz need to write a store code [01:27:44] yurik: I'm pretty sure there are values between "now" and "never" [01:27:46] gwicke, i agree that it might be the way to do it - but i feel it is more important to have something working reasonably well first - because otherwise we might spend too much time with all the extra components, and later realize we were doing it all wrong [01:27:52] I tend to agree you'll have to take an incremental approach to your architecture's scalability.
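gwicke's fs-vs-db comparison could start much smaller than 200G. A toy sketch with sqlite standing in for "db"; the tile counts and sizes are deliberately tiny, and a real comparison would use production-like sizes, filesystems, and concurrency:

```python
import os
import sqlite3
import tempfile
import time

def bench(n_tiles=1000, tile_size=8 * 1024):
    """Write and read n_tiles small blobs both as individual files and
    as rows in a sqlite table; returns elapsed seconds for each.
    A sketch of the comparison discussed, not the 200G experiment."""
    tile = os.urandom(tile_size)
    results = {}

    with tempfile.TemporaryDirectory() as d:
        # Filesystem: one file per tile.
        t0 = time.perf_counter()
        for i in range(n_tiles):
            with open(os.path.join(d, f"{i}.png"), "wb") as f:
                f.write(tile)
        for i in range(n_tiles):
            with open(os.path.join(d, f"{i}.png"), "rb") as f:
                f.read()
        results["fs"] = time.perf_counter() - t0

        # sqlite: one row per tile, blobs in a single table.
        db = sqlite3.connect(os.path.join(d, "tiles.db"))
        db.execute("CREATE TABLE tiles (id INTEGER PRIMARY KEY, data BLOB)")
        t0 = time.perf_counter()
        with db:
            db.executemany(
                "INSERT INTO tiles VALUES (?, ?)",
                ((i, tile) for i in range(n_tiles)),
            )
        for i in range(n_tiles):
            db.execute("SELECT data FROM tiles WHERE id = ?", (i,)).fetchone()
        results["sqlite"] = time.perf_counter() - t0
        db.close()

    return results
```

Even a toy run like this surfaces the per-file overhead (inode churn, 4k block slack, dentry lookups) that makes millions of tiny files painful on most filesystems.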
[01:27:53] like "soon", or "later", or "in a little while" [01:28:24] but do try to at least have a repeatedly-updated future roadmap of the intended plan to scale much further, subject to feedback loop of what happens with the earlier stuff [01:28:31] I'm all for disk based tiles [01:28:40] our first users - the app with beta ON [01:28:44] (03CR) 10Alex Monk: [C: 032] Throttle rule for Editatón Ciencia y Tecnología en Chile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203269 (https://phabricator.wikimedia.org/T95302) (owner: 10Dereckson) [01:28:49] (03Merged) 10jenkins-bot: Throttle rule for Editatón Ciencia y Tecnología en Chile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203269 (https://phabricator.wikimedia.org/T95302) (owner: 10Dereckson) [01:28:55] so the load is relatively low, and pain tolerance of beta users is high ) [01:29:03] MaxSem: I just benchmarked html dumps in sqlite vs. the fs [01:29:13] (unless paravoid manages to recall why we decided to use an object store a while ago) :P [01:29:17] sqlite is a lot faster, even with a 181G db [01:29:18] (because those longer-term scaling plans could have far-reaching effects on how you structure things even at smaller scales. 
You may need to plan/code ahead a lot) [01:29:28] just because the files are so small [01:29:59] gwicke, sqlite starts to suck when it comes to multithreading [01:30:02] (03PS1) 10BBlack: T86663 5.6: pool 3047; depool 3008,amssq6[12] [puppet] - 10https://gerrit.wikimedia.org/r/203271 [01:30:06] for example, most filesystems store files in 4k blocks [01:30:15] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.6: pool 3047; depool 3008,amssq6[12] [puppet] - 10https://gerrit.wikimedia.org/r/203271 (owner: 10BBlack) [01:30:18] if your file is 1k, you'll still use 4k of disk [01:30:37] (03PS1) 10Dzahn: remove cp3001,cp3002 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/203272 (https://phabricator.wikimedia.org/T94215) [01:30:40] MaxSem: agreed, point isn't that you should use sqlite [01:30:49] my file is 200k (?, don't remember) [01:31:13] it's more that even something as basic as sqlite can outperform ext4 if the files are small [01:31:21] !log krenair Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/203269/ - trivial throttle change for event this weekend (duration: 01m 06s) [01:31:27] Logged the message, Master [01:31:50] MaxSem, do you have any idea what might be up with mw2129? [01:31:50] sqlite kinda sucks at threading, though, if it's a storage backend on a large-scale server with lots of threads of whatever accessing it [01:32:01] Krenair, me? [01:32:11] I don't even have access atm [01:32:16] (at least, last I recall from mucking with it server-side something like 7-8 years ago heh) [01:32:18] bblack: yup ;) [01:32:37] as I said, it's not about using sqlite at all [01:32:54] there's way better storage systems around, especially for high concurrency [01:32:58] oh heh, MaxSem already said that. I suck at multitasking! 
:) [01:33:02] (03PS1) 10Dzahn: remove cp3001,cp3002 from hiera ipsec data [puppet] - 10https://gerrit.wikimedia.org/r/203273 (https://phabricator.wikimedia.org/T94215) [01:33:57] gwicke, i totally agree with you about storage - we should probably use something better than FS, but it is not a must have for V1 [01:34:00] mutante: +1, I don't think those servers will ever get powered up again, at least not as cp300[12] [01:34:02] MaxSem: looking at openstreetmap.org, most tiles are < 10k [01:34:16] gwicke, *64 [01:34:39] bblack: thanks [01:34:45] ah, they do some custom batching? [01:35:04] they render and store 8x8 metatiles [01:35:45] render sounds good, but storage I'm less certain about [01:35:58] sounds complicated, as you should be able to get the same locality with generic solutions [01:36:34] !log powercycling mw2129 [01:36:38] Logged the message, Master [01:36:56] (03PS1) 10BBlack: T86663 5.6: switch cp3008 role [puppet] - 10https://gerrit.wikimedia.org/r/203274 [01:36:58] there are some nice ways of bit-interleaving quadtrees [01:37:08] Krenair: ^ it was just powered off, don't see a reason for it [01:37:44] gwicke, well - it doesn't matter for their current architecture that's not horizontally scalable, we might indeed have to come up with something different [01:37:51] (03CR) 10BBlack: [C: 032] T86663 5.6: switch cp3008 role [puppet] - 10https://gerrit.wikimedia.org/r/203274 (owner: 10BBlack) [01:38:01] (03PS1) 10Ori.livneh: rrd-navtiming: pass things around rather than use global state [puppet] - 10https://gerrit.wikimedia.org/r/203275 [01:38:08] (at a later point) [01:38:23] Krenair: mw2129 login: [01:38:39] RECOVERY - Host mw2129 is UP: PING OK - Packet loss = 0%, RTA = 43.61 ms [01:38:43] (03CR) 10Ori.livneh: [C: 032 V: 032] rrd-navtiming: pass things around rather than use global state [puppet] - 10https://gerrit.wikimedia.org/r/203275 (owner: 10Ori.livneh) [01:38:44] ori, of course there is difference - but unless we deliver "something" "soonish", we
might not have it as a project, but rather as a volunteer effort, which becomes much more unpredictable ))) [01:38:53] it's back up, i gotta run though to bring back my zipcar [01:38:57] yeah I know. I'm just taunting you a little. [01:40:16] mutante, oh - you're commuting via zipcars now? :) [01:40:41] I'm heading out, see you later! [01:41:04] thanks mutante :) [01:42:09] (03PS1) 10BBlack: T86663 5.6: repool cp3008 [puppet] - 10https://gerrit.wikimedia.org/r/203276 [01:42:22] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.6: repool cp3008 [puppet] - 10https://gerrit.wikimedia.org/r/203276 (owner: 10BBlack) [01:46:35] I kind of wonder what our procedure should be when these hosts die and have to be restarted [01:46:47] codfw isn't yet receiving traffic, right? [01:46:54] Krenair, YELL LOUDLY! [01:46:59] haha [01:47:11] but when they come back up, they will have missed deployments [01:47:42] are they immediately back serving traffic? I have no idea how that part of our infrastructure works [01:48:02] theoretically, it's the responsibility of whoever took it back up to sync [01:48:19] and only then repool [01:48:57] in practice, if it died very recently, changes aren't that significant [01:48:59] is repooling manual? [01:51:43] well yeah, but security deployments etc. [01:52:02] but not to have to wait for the next scap [02:00:59] (03PS5) 10Dereckson: Added *.adlibhosting.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202736 (https://phabricator.wikimedia.org/T95418) [02:02:59] (03CR) 10Dereckson: "PS5: Per bug comments, instead to only allow ymt.adlibhosting.com in addition to am.adlibhosting.com, we directly allow *.adlibhosting.com" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202736 (https://phabricator.wikimedia.org/T95418) (owner: 10Dereckson) [02:06:39] (03CR) 10Gage: [C: 04-1] "This is correct, but I want to wait to merge it.
Berkelium & Curium are IPsec test nodes and it's helpful to have a test case for security" [puppet] - 10https://gerrit.wikimedia.org/r/203273 (https://phabricator.wikimedia.org/T94215) (owner: 10Dzahn) [02:22:17] hey [02:22:24] something's wrong [02:23:57] db1064 & db1070, both s4, have their network saturated [02:24:33] since about ~00:00 UTC [02:28:15] lots of [02:28:16] SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.4 */ img_metadata FROM `image` WHERE img_name = 'Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu' AND img_timestamp = '20120302033448' LIMIT 1; [02:28:20] from googlebot [02:28:37] the query returns a big Djvu XML [02:29:10] SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */ img_metadata FROM `image` WHERE img_name = 'United_States_Statutes_at_Large_Volume_46_Part_1.djvu' AND img_timestamp = '20120126020500' LIMIT 1 [02:29:14] even worse [02:29:42] lots of djvu from googlebot [02:32:02] lots of [02:32:10] cp4011.ulsfo.wmnet 151209068 2015-04-10T02:30:37 4.097393274 66.249.67.20 miss/200 3983 GET http://en.m.wikisource.org/wiki/Page:The_Texan_Star.djvu/9 - [02:32:14] text/html; charset=UTF-8 - - Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - page_id=582865;ns=104 [02:32:19] http://en.m.wikisource.org/wiki/Page:ScienceAndHypothesis1905.djvu/270 [02:32:22] http://en.m.wikisource.org/wiki/Page:Cadet_Handbook_and_Section_Roll.pdf/14 [02:32:25] etc. etc. [02:32:27] MaxSem: ^^^ [02:32:51] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 09m 03s) [02:32:53] uh, I don't have access [02:32:55] googlebot crawling en.m.wikisource, every single page of every djvu/pdf [02:32:59] Logged the message, Master [02:33:02] can only look up sources for you [02:33:11] and for each of those huge djvu xml metadata are being fetched from the database [02:33:33] mmm, are they not cached?
[02:33:59] page 270 of Page:ScienceAndHypothesis1905.djvu on mobile? [02:34:05] no, probably no one ever goes there :) [02:34:36] has something changed there lately? [02:35:46] not that I know of... [02:39:34] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-10 02:38:30+00:00 [02:39:37] Logged the message, Master [02:46:29] paravoid: this sounds oddly familiar. remember that issue some months ago with some djvu for ... some giant old book that was scanned, maybe ref'd from a french wiki? [02:47:07] you might be remembering some of the swift outages [02:47:49] no, I don't think this was a swift outage [02:48:20] it was a lot like above: we had some perf issues, DB load, evidence leading to massive hits on many subpages of a djvu file on wikisource, googlebot involved, etc [02:48:37] the solution would be to store texts per page, outside of metadata [02:48:46] as for what to do, no idea :P [02:49:16] bblack: ok, I don't remember that at all [02:49:17] I'm trying to recall / search for the specific set of djvu files that were involved before [02:50:11] :P at phab search helpfulness [02:50:15] crawling of djvu is over, dbs have recovered [02:50:20] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=bytes_out&mreg[]=^bytes_out%24&hreg[]=^db1064&aggregate=1&hl=db1064.eqiad.wmnet|MySQL%20eqiad [02:50:35] and http://ganglia.wikimedia.org/latest/?r=hour&tab=ch&hreg[]=^db1070 [02:50:45] very good [02:51:07] * MaxSem goes to sleep [02:51:13] yeah same here [02:51:29] I was sleeping before, opened my eyes, saw the alerts and came to check them [02:53:28] well, I wasn't sleeping at all [02:56:41] (03CR) 10Mjbmr: "Recheck please, this wiki is small, I can't get more support than this."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr) [03:00:40] (03CR) 10Alex Monk: "I doubt this wiki has enough active users for there to be any point doing this, but I'll rescind my -2 as there is more than just the prop" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr) [03:05:09] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 09m 28s) [03:05:15] Logged the message, Master [03:05:41] PROBLEM - puppet last run on ms-be2002 is CRITICAL puppet fail [03:10:41] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [03:11:59] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-10 03:10:56+00:00 [03:12:04] ah I found what I was trying to remember re: djvu before. I'm not completely crazy. The specific djvu's in question in that past incident were all from the set(s) at http://fr.wikisource.org/wiki/Revue_des_Deux_Mondes, if that rings any bells [03:12:05] Logged the message, Master [03:12:20] (03PS6) 10Dereckson: Added *.adlibhosting.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202736 (https://phabricator.wikimedia.org/T95418) [03:12:54] I haven't found it yet, but at the time, I was able to find some page somewhere on wikisource or commons which, on a single page, generated thumbnails of many many pages from that djvu and would slow everything to a crawl, or something... 
[03:16:38] (03PS1) 10Dereckson: Enable SandboxLink extension on fr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203279 (https://phabricator.wikimedia.org/T95604) [03:20:03] (03PS2) 10Dereckson: Enable SandboxLink extension on fr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203279 (https://phabricator.wikimedia.org/T95604) [03:23:01] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:24:18] (03CR) 10Mjbmr: "Your consider is wrong, we all made changes to do wiki since 2010 with small active users." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr) [03:26:21] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:33:20] bblack: there was also at least one incident involving https://commons.wikimedia.org/wiki/Special:NewFiles and huge multi-page & ~150mb tif files uploaded by some library [03:33:29] (03PS3) 10Tim Landscheidt: Labs: Mute client-side notifications for wikitech Puppet status [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [03:39:04] (03CR) 10Tim Landscheidt: "As mentioned above, if after merging there is a sudden tsunami of Labs instances with Puppet status "stalled", this change would need to b" [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [03:50:55] (03CR) 10Mxn: "> Looks like ilowiki was the one you modified yourself?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201915 (https://phabricator.wikimedia.org/T37337) (owner: 10Mxn) [03:55:31] (03PS1) 10Dereckson: Flagged revisions configuration on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203283 (https://phabricator.wikimedia.org/T95085) [03:59:36] (03CR) 10Springle: "~35G is the uncompressed InnoDB data length without indexes. A dump would be smaller." 
[dumps] (ariel) - 10https://gerrit.wikimedia.org/r/200313 (owner: 10Kelson) [04:00:23] (03CR) 10Springle: [C: 031] mariadb: lint fixes in role class [puppet] - 10https://gerrit.wikimedia.org/r/202645 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [04:09:32] (03CR) 10Mxn: "> Also, I ran `mwgrep min-device-pixel-ratio | grep Common.css` and removed the entries already addressed here:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201915 (https://phabricator.wikimedia.org/T37337) (owner: 10Mxn) [04:17:09] (03CR) 10Dzahn: [C: 031] scholarships - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/199126 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [04:17:20] (03CR) 10Dzahn: [C: 031] dbtree - Raise HSTS max-age to 1 year and add always flag [puppet] - 10https://gerrit.wikimedia.org/r/202267 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [04:18:56] (03PS4) 10Mxn: Set $wgLogoHD for wikis that currently do so in MediaWiki:Common.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201915 (https://phabricator.wikimedia.org/T37337) [04:25:02] PROBLEM - puppet last run on mw2046 is CRITICAL puppet fail [04:33:08] (03CR) 10Andrew Bogott: "Am I correct in understanding that in conjunction with https://gerrit.wikimedia.org/r/#/c/199791/1/modules/mediawiki/files/apache/sites/re" [dns] - 10https://gerrit.wikimedia.org/r/199796 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn) [04:34:15] (03CR) 10Andrew Bogott: [C: 031] drop shop & store entries from some projects [dns] - 10https://gerrit.wikimedia.org/r/196605 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn) [04:38:25] (03CR) 10Mxn: "I belatedly realized that almost none of these files were upload-protected on Commons. 
(All their image description pages were edit-protec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201915 (https://phabricator.wikimedia.org/T37337) (owner: 10Mxn) [04:42:21] RECOVERY - puppet last run on mw2046 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:55:19] (03CR) 10Giuseppe Lavagetto: [C: 031] remove pointless (I think!) esams $ganglia_aggregator and cache_upload def for esams [puppet] - 10https://gerrit.wikimedia.org/r/203087 (owner: 10BBlack) [04:55:49] (03CR) 10Giuseppe Lavagetto: [C: 031] delete rbf hosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/203217 (https://phabricator.wikimedia.org/T95153) (owner: 10Dzahn) [04:57:12] (03PS2) 10Giuseppe Lavagetto: hiera nuyaml: disable mainrole lookups [puppet] - 10https://gerrit.wikimedia.org/r/202749 (owner: 10Alexandros Kosiaris) [05:00:00] (03CR) 10Dzahn: [C: 04-1] "it might be unlikely that he uses icinga commands but i think technically laner should stay because he is a volunteer ops and still has sh" [puppet] - 10https://gerrit.wikimedia.org/r/202759 (owner: 10Andrew Bogott) [05:16:27] (03CR) 10Ori.livneh: [C: 031] "Nice! Tested, works. One small suggestion inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [05:30:40] <_joe_> ori: thanks a lot, people just want that tool :) [05:32:20] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera nuyaml: disable mainrole lookups [puppet] - 10https://gerrit.wikimedia.org/r/202749 (owner: 10Alexandros Kosiaris) [05:32:51] PROBLEM - puppet last run on mw1044 is CRITICAL Puppet has 1 failures [05:48:21] RECOVERY - puppet last run on mw1044 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [05:57:16] (03PS5) 10Giuseppe Lavagetto: Add hiera lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [05:59:26] (03CR) 10Giuseppe Lavagetto: Add hiera lookup tool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [05:59:48] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 10 05:58:44 UTC 2015 (duration 58m 43s) [05:59:55] Logged the message, Master [06:01:09] (03PS1) 1020after4: Parameterize the path to /var/lib/l10nupdate (References T95564) [puppet] - 10https://gerrit.wikimedia.org/r/203286 (https://phabricator.wikimedia.org/T95564) [06:01:34] (03PS6) 10Giuseppe Lavagetto: Add hiera lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [06:03:24] (03CR) 10Giuseppe Lavagetto: [C: 032] Add hiera lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [06:27:51] (03PS2) 10Tim Starling: Switch some usages of 'wiki' to 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194086 (https://phabricator.wikimedia.org/T91340) (owner: 10MaxSem) [06:29:35] (03CR) 10Tim Starling: [C: 032] "Perhaps. But that does not block this change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/194086 (https://phabricator.wikimedia.org/T91340) (owner: 10MaxSem) [06:29:40] (03Merged) 10jenkins-bot: Switch some usages of 'wiki' to 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194086 (https://phabricator.wikimedia.org/T91340) (owner: 10MaxSem) [06:30:40] PROBLEM - puppet last run on elastic1030 is CRITICAL Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on db1015 is CRITICAL Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures [06:31:41] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 2 failures [06:32:21] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures [06:32:40] PROBLEM - puppet last run on mw1228 is CRITICAL Puppet has 1 failures [06:33:02] (03Abandoned) 10Giuseppe Lavagetto: txstatsd: stop service before trying to remove the user [puppet] - 10https://gerrit.wikimedia.org/r/203029 (owner: 10Giuseppe Lavagetto) [06:33:31] PROBLEM - puppet last run on mw1009 is CRITICAL Puppet has 1 failures [06:34:09] !log tstarling Synchronized wmf-config/InitialiseSettings.php: wmgEnableRandomRootPage (duration: 00m 11s) [06:34:15] Logged the message, Master [06:34:51] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 2 failures [06:35:21] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [06:35:41] PROBLEM - puppet last run on mw2030 is CRITICAL Puppet has 1 failures [06:36:20] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:36:40] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:43:41] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 17.24% of data above the critical threshold [100000000.0] [06:45:30] RECOVERY - puppet last run on db1015 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:32] RECOVERY 
- puppet last run on cp4003 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on mw1009 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:31] RECOVERY - puppet last run on mw1228 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:20] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:32] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:48:00] RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:01] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:48:31] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:25:21] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [07:30:24] (03CR) 10Hashar: [C: 031] "Since labsstatus is running on the master, that is probably fine." [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [07:32:04] (03CR) 10Hashar: [C: 031] contint: 'zip' package via ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/203203 (owner: 10Hashar) [07:43:03] (03CR) 10Hashar: [C: 04-1] "Quick question about @resolve." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [08:30:22] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1196972 (10Joe) 3NEW [08:33:05] _joe_: there's also https://github.com/ha/doozerd [08:33:21] <_joe_> yeah but doozer was kind of discontinued lately [08:33:33] oh really? bummer [08:33:36] <_joe_> yep [08:33:48] <_joe_> I first used doozer, then fled to etcd [08:33:58] <_joe_> anything to avoid java :P [08:39:19] mobrovac: you reckon https://gerrit.wikimedia.org/r/203210 is good to merge? [08:39:40] godog: yup, gave it a look just now [08:39:46] (03CR) 10Mobrovac: [C: 031] Switch restbase from txstatsd to statsd backend [puppet] - 10https://gerrit.wikimedia.org/r/203210 (owner: 10GWicke) [08:43:28] (03PS2) 10Filippo Giunchedi: Switch restbase from txstatsd to statsd backend [puppet] - 10https://gerrit.wikimedia.org/r/203210 (owner: 10GWicke) [08:43:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Switch restbase from txstatsd to statsd backend [puppet] - 10https://gerrit.wikimedia.org/r/203210 (owner: 10GWicke) [08:44:43] mobrovac: cool! merged, I'll bounce the cluster later on in turn [08:45:03] godog: no need actually [08:45:30] mobrovac: ah! picks up the config by itself? [08:45:35] the changes i did yesterday work the same [08:45:39] godog: eh you wish [08:45:41] :P [08:46:08] haha trying to be careful what I wish for [08:47:15] (03PS1) 10Mobrovac: Citoid: Change Zotero's host for Beta [puppet] - 10https://gerrit.wikimedia.org/r/203294 [08:48:46] mobrovac: srsly tho, why no need? 
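The restbase switch from txstatsd to statsd merged above is low-risk partly because both daemons consume the same plain-text statsd wire format over UDP. A minimal sketch of emitting one metric in that format (the localhost address and port 8125 are illustrative defaults, not the production statsd endpoint):

```python
import socket

def statsd_line(name, value, metric_type):
    """Render one metric in statsd's "name:value|type" line format
    (type "ms" = timer, "c" = counter, "g" = gauge)."""
    return "%s:%s|%s" % (name, value, metric_type)

def send_metric(name, value, metric_type, host="localhost", port=8125):
    # Fire-and-forget over UDP: the sender never learns whether the
    # daemon received the sample, which is why switching backends
    # underneath the clients is transparent to them.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_line(name, value, metric_type).encode(),
                    (host, port))
    finally:
        sock.close()
```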
[08:50:13] godog: so, we can choose between txstatsd and statsd, txstatsd was the default, but it has a param where you can set it in mode statsd, so i set that yesterday and deployed [08:50:27] the patch here merely switches explicitly from txstatsd to statsd [08:50:49] mobrovac: ahhh ok, that makes sense now, thanks! [08:59:50] RECOVERY - Disk space on ms-be1005 is OK: DISK OK [09:00:37] 6operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1197052 (10Joe) 3NEW [09:01:12] !log reboot ms-be1005, new disk didn't show up with the right letter [09:01:17] Logged the message, Master [09:02:12] E: instead of D: ? [09:02:27] A:\ [09:06:08] I guess A: and B: are still reserved for floppy disks [09:06:41] http://en.wikipedia.org/wiki/Drive_letter_assignment#Order_of_assignment [09:07:11] godog: meanwhile the Zuul packages for our Precise and Trusty distributions are ready [09:07:34] and I have deployed them on labs :) [09:09:10] hashar: cool, I'll take a look later [09:09:17] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1197069 (10Joe) [09:09:19] 6operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1197068 (10Joe) [09:09:26] godog: I made a point of commenting the change between patchsets [09:09:35] not much changed recently [09:09:49] I would like to switch the Zuul server on gallium early next week [09:10:03] the package is installed there already but need some puppet change to switch :D [09:16:21] RECOVERY - puppet last run on ms-be1005 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:18:34] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T95268#1197080 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi indeed the fs was still mounted, a reboot has let the new disk appear as sdf [09:20:24] k, let's test 
the waters ... [09:20:38] any opsen which would like to work on https://phabricator.wikimedia.org/T95253 ? [09:20:40] :) [09:23:40] <_joe_> mobrovac: kind of "yes", if I finish what I'm doing this morning I can think of shooting at a patch in the afternoon or on monday [09:26:36] that'd be wonderful _joe_ ! [09:29:36] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1197101 (10mobrovac) [09:29:57] _joe_: will assign to you then [09:31:35] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1197109 (10mobrovac) a:3Joe @Joe offered to have a take at it. [09:41:46] (03CR) 10Filippo Giunchedi: Citoid: Change Zotero's host for Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203294 (owner: 10Mobrovac) [09:43:22] (03CR) 10Mobrovac: Citoid: Change Zotero's host for Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203294 (owner: 10Mobrovac) [09:44:04] (03CR) 10Muehlenhoff: "I don't think we need to remove Camellia at this point:" [puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [09:45:25] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Pybal RunCommand monitor doesn't work correctly on ubuntu trusty - https://phabricator.wikimedia.org/T94822#1197130 (10Joe) In the pybal logs I can see: ``` 2015-04-02 10:46:49.927886 [apaches_80] Monitoring instance RunCommand reports server mw2165.codf... 
[09:46:07] <_joe_> !log stopping and starting pybal on lvs2003, tests for T94822 [09:46:13] Logged the message, Master [09:46:38] (03PS2) 10Mobrovac: Citoid: Change Zotero's host for Beta [puppet] - 10https://gerrit.wikimedia.org/r/203294 (https://phabricator.wikimedia.org/T95616) [09:55:29] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [10:08:12] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Pybal RunCommand monitor doesn't work correctly on ubuntu trusty - https://phabricator.wikimedia.org/T94822#1197240 (10Joe) Restarting pybal and checking logs on the pybal server and one server in a pool with check_apache, I can see no connection happenin... [10:08:24] (03PS1) 10Filippo Giunchedi: point labmon1001 statsite to itself [puppet] - 10https://gerrit.wikimedia.org/r/203303 [10:09:52] (03CR) 10Filippo Giunchedi: "$ utils/hiera_lookup --fqdn=labmon1004.eqiad.wmflabs statsite::instance::graphite_host" [puppet] - 10https://gerrit.wikimedia.org/r/203303 (owner: 10Filippo Giunchedi) [10:10:56] (03PS2) 10Tim Landscheidt: Tools: Fix and clean up generation of /etc/ssh/ssh_known_keys [puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) [10:11:09] (03CR) 10Filippo Giunchedi: "note however that related https://gerrit.wikimedia.org/r/#/c/203106/ didn't seem to work for hosts in labs, at least using hiera_lookup:" [puppet] - 10https://gerrit.wikimedia.org/r/203303 (owner: 10Filippo Giunchedi) [10:13:51] (03CR) 10Filippo Giunchedi: [C: 031] "also I wonder if "" around key names matter? I've used no quotes in https://gerrit.wikimedia.org/r/#/c/203303/ for the key name and it see" [puppet] - 10https://gerrit.wikimedia.org/r/203294 (https://phabricator.wikimedia.org/T95616) (owner: 10Mobrovac) [10:19:10] (03CR) 10Mobrovac: "Not that I know of. YAML requires (double-)quotes around key names only when they contain spaces and such. 
Not sure why this particular fi" [puppet] - 10https://gerrit.wikimedia.org/r/203294 (https://phabricator.wikimedia.org/T95616) (owner: 10Mobrovac) [10:36:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [10:44:09] 6operations, 10Parsoid, 7service-runner: Decide whether to install heapdump by default, or continue to install npm & install on demand - https://phabricator.wikimedia.org/T95431#1197315 (10mobrovac) Based on discussions with @GWicke, due to the constraints and dependencies introduced by //heapdump//, we deci... [10:44:32] 6operations, 10Parsoid, 6Services, 7service-runner: Decide whether to install heapdump by default, or continue to install npm & install on demand - https://phabricator.wikimedia.org/T95431#1197316 (10mobrovac) p:5Triage>3Normal [11:02:01] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL Not all configured Carbon instances are running. [11:03:01] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL Anomaly detected: 17 data above and 2 below the confidence bounds [11:13:16] (03Abandoned) 10BBlack: remove CAMELLIA from ciphersuites [puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [11:15:42] (03PS3) 10BBlack: remove pointless esams $ganglia_aggregator from bits [puppet] - 10https://gerrit.wikimedia.org/r/203087 [11:16:16] (03CR) 10BBlack: [C: 032 V: 032] remove pointless esams $ganglia_aggregator from bits [puppet] - 10https://gerrit.wikimedia.org/r/203087 (owner: 10BBlack) [11:20:51] (03CR) 10BBlack: "Faidon ended up re-working some of our SSL cert deployment/monitoring infrastructure over several commits recently. 
I think this addresse" [puppet] - 10https://gerrit.wikimedia.org/r/15561 (owner: 10Catrope) [11:23:52] 6operations, 7Graphite: Counts with underscore in name no longer updated since move to statsite (cassandra metrics) - https://phabricator.wikimedia.org/T95627#1197383 (10fgiunchedi) looks like a statsite shortcoming in parsing scientific notation numbers, I've inquired upstream in https://github.com/armon/stat... [11:30:55] _joe_: thoughts on https://gerrit.wikimedia.org/r/#/c/203303/ ? [11:32:04] <_joe_> godog: I'll take a look [11:32:19] <_joe_> godog: looks good [11:32:53] (03CR) 10Giuseppe Lavagetto: [C: 031] point labmon1001 statsite to itself [puppet] - 10https://gerrit.wikimedia.org/r/203303 (owner: 10Filippo Giunchedi) [11:33:38] _joe_: thanks, also I had a doubt in the comments re: https://gerrit.wikimedia.org/r/#/c/203106/ [11:33:51] (03PS2) 10Filippo Giunchedi: point labmon1001 statsite to itself [puppet] - 10https://gerrit.wikimedia.org/r/203303 [11:33:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] point labmon1001 statsite to itself [puppet] - 10https://gerrit.wikimedia.org/r/203303 (owner: 10Filippo Giunchedi) [11:36:11] PROBLEM - puppet last run on labmon1001 is CRITICAL Puppet has 1 failures [11:36:28] <_joe_> mh doesn't sound good :P [11:37:39] indeed, it didn't work [11:37:49] <_joe_> oh gee [11:38:09] <_joe_> statsite::instance is a define [11:38:11] <_joe_> not a class [11:38:14] <_joe_> right? [11:38:18] that's correct [11:38:23] <_joe_> lemme look at the code [11:38:39] <_joe_> so hiera autolookup only happens for classes, of course [11:39:11] "of course" [11:39:21] <_joe_> well, it kind of makes sense [11:39:33] <_joe_> you can have multiple defines with conflicting parameters [11:39:44] <_joe_> while that can't happen for classes [11:40:44] mhh ok, what's the best way to fix it? 
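_joe_'s point above — hiera automatic parameter lookup fires only for classes, never for defines like `statsite::instance`, because a define can be declared many times with conflicting parameters — can be pictured with a toy resolver. This is a simplified emulation, not Puppet's actual lookup code, and the hostname value is a placeholder:

```python
# Placeholder hiera data; the real key from the log is
# "statsite::instance::graphite_host", the value here is made up.
HIERA_DATA = {"statsite::instance::graphite_host": "graphite.example.net"}

def auto_lookup(resource, is_class, param):
    """Hiera autolookup composes the key "<class>::<param>" -- but only
    for classes; defines never get that implicit pass."""
    if not is_class:
        return None
    return HIERA_DATA.get("%s::%s" % (resource, param))

def explicit_lookup(key):
    """What the fix in the follow-up patch does instead: call
    hiera(key) explicitly from the manifest."""
    return HIERA_DATA.get(key)
```

This matches the observed behaviour: `utils/hiera_lookup` resolves the key fine, but the define never asked for it until the lookup was made explicit.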
[11:41:01] <_joe_> I'm looking [11:41:28] <_joe_> well an obvious way would be create a class parameter for role::statsite [11:41:49] <_joe_> but we could also do it more creatively [11:42:27] kart_, around? [11:42:41] heh, whichever is more obvious, to my untrained eye it looks GIGO [11:43:57] (03PS1) 10Giuseppe Lavagetto: statsite: look up the graphite host on hiera [puppet] - 10https://gerrit.wikimedia.org/r/203310 [11:44:05] <_joe_> godog: this is an option ^^ [11:45:13] is statsd.eqiad.wmnet still valid ? [11:48:31] RECOVERY - puppet last run on labmon1001 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:54:04] Hi wikimedia-operations, can I ask where you have stats for Requests per second handled on load balancers for http / https? [11:54:40] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [11:54:41] PROBLEM - carbon-cache too many creates on graphite2001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [12:01:41] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 34393 MB (3% inode=99%) [12:03:51] oh [12:03:59] godog: graphite1001 already out of disk hehe [12:05:15] (03PS2) 10Tim Landscheidt: Tools: Allow proxy certificate to be manually managed [puppet] - 10https://gerrit.wikimedia.org/r/198665 [12:05:17] (03PS1) 10Tim Landscheidt: dynamicproxy: Provide list of active proxy entries for urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) [12:08:22] (03CR) 10Tim Landscheidt: [C: 04-1] "Tested that it works, but haven't tested yet that it can actually replace portgranter's data in labs/toollabs:www/content/list.php, so nee" [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [12:08:32] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: trying to 
investigate T90382 with some temp debugging (duration: 00m 12s) [12:08:36] Logged the message, Master [12:14:43] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: trying something else (duration: 00m 13s) [12:14:49] Logged the message, Master [12:18:11] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [12:18:22] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK No anomaly detected [12:19:50] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: (no message) (duration: 00m 11s) [12:19:57] Logged the message, Master [12:22:35] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: (no message) (duration: 00m 12s) [12:22:39] Logged the message, Master [12:24:19] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: (no message) (duration: 00m 12s) [12:28:16] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: (no message) (duration: 00m 12s) [12:29:15] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: (no message) (duration: 00m 12s) [12:29:25] this bug is so weird [12:31:15] !log krenair Synchronized php-1.25wmf24/includes/specialpage/SpecialPageFactory.php: removing debug (duration: 00m 13s) [12:31:19] Logged the message, Master [12:51:49] !log metrics from labs on graphite1001 by mistake, purging [12:51:53] Logged the message, Master [13:06:32] _joe_: looks good to me, I'm amending that [13:09:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 7 below the confidence bounds [13:09:31] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL Anomaly detected: 11 data above and 7 below the confidence bounds [13:11:57] (03CR) 10Filippo Giunchedi: [C: 031] statsite: look up the graphite host on hiera [puppet] - 
10https://gerrit.wikimedia.org/r/203310 (owner: 10Giuseppe Lavagetto) [13:12:30] (03PS1) 10BBlack: T86663 5.3: pool 3038; depool 3005,amssq55 [puppet] - 10https://gerrit.wikimedia.org/r/203316 [13:13:05] 7Puppet, 10Continuous-Integration: Puppet run interrupted by "puppet-agent: Caught TERM; calling stop" - https://phabricator.wikimedia.org/T95683#1197667 (10Krinkle) 3NEW [13:13:23] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.3: pool 3038; depool 3005,amssq55 [puppet] - 10https://gerrit.wikimedia.org/r/203316 (owner: 10BBlack) [13:18:28] Krenair: If you're deploying, maybe we can send out https://gerrit.wikimedia.org/r/#/c/203300/ to gather over the weekend? [13:19:06] I was just fiddling with FlaggedRevs debugging [13:19:56] but yeah [13:20:50] :D [13:20:51] Thx [13:20:59] Really I should be dealing with coursework today rather than looking at wikimedia stuff [13:23:11] (03PS1) 10BBlack: T86663 5.3: switch cp3005 role [puppet] - 10https://gerrit.wikimedia.org/r/203319 [13:23:26] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.3: switch cp3005 role [puppet] - 10https://gerrit.wikimedia.org/r/203319 (owner: 10BBlack) [13:26:08] Krenair: Sure. I'll do it then. [13:30:20] 6operations, 7Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1197711 (10fgiunchedi) odd, that should have worked, also there's currently ~1700 metrics under restbase/ does that seem ok? 
we can restore from backups and rename too [13:30:55] (03PS1) 10BBlack: T86663 5.3: repool cp3005 [puppet] - 10https://gerrit.wikimedia.org/r/203322 [13:31:09] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.3: repool cp3005 [puppet] - 10https://gerrit.wikimedia.org/r/203322 (owner: 10BBlack) [13:33:38] (03Draft1) 10Dereckson: User rights configuration on ne.wikipedia - Reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203323 [13:34:12] (03PS2) 10Dereckson: User rights configuration on ne.wikipedia - Reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203323 (https://phabricator.wikimedia.org/T95101) [13:35:58] 6operations, 7Graphite: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1197717 (10fgiunchedi) 3NEW [13:36:13] 6operations, 7Graphite: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1197727 (10fgiunchedi) p:5Triage>3Normal a:3fgiunchedi [13:40:01] (03CR) 10BBlack: [C: 031] IPsec: improved cipher selection [puppet] - 10https://gerrit.wikimedia.org/r/201135 (owner: 10Gage) [13:40:10] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [13:40:21] RECOVERY - carbon-cache too many creates on graphite2001 is OK Less than 1.00% above the threshold [500.0] [13:41:21] Is "too many creates" somehow related to the introduction of new metrics that might be triggered by me pooling servers? [13:45:02] 6operations, 7Graphite: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1197767 (10Ottomata) Oh, awesome, +1. Want to try this out with me today? [13:47:49] bblack: since it recovered I don't think so, it was earlier that labs metrics got sent to prod :! 
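On the statsite scientific-notation shortcoming godog reported upstream earlier (T95627): a guess at the general failure mode — this illustrates the class of bug, not statsite's actual C parser — is a value grammar that admits only plain decimals, so samples serialized as e.g. "4.2e-05" are dropped while float parsing would accept both spellings:

```python
import re

# What a naive metric-value grammar allows: optional sign, digits,
# optional fractional part -- no exponent.
PLAIN_DECIMAL = re.compile(r'^-?\d+(\.\d+)?$')

def naive_parse(value):
    """Rejects scientific notation, mimicking the suspected shortcoming."""
    if not PLAIN_DECIMAL.match(value):
        raise ValueError("unparseable metric value: %r" % value)
    return float(value)

def robust_parse(value):
    """float() handles "1.5", "-3", and "4.2e-05" alike."""
    return float(value)
```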
[13:49:15] 6operations, 7Graphite: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1197782 (10fgiunchedi) if we can do it without going full-on for everything sure, otherwise I'd postpone to monday
[13:49:46] 10Ops-Access-Requests, 6operations, 10Analytics: Grant Sati access to geowiki - https://phabricator.wikimedia.org/T95494#1197783 (10Ottomata) I sent an email yesterday to Jody asking for confirmation of Sati's NDA, as the instructions in your link say to do.
[13:54:16] Krinkle, looks like it didn't work
[13:54:19] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/4623/console
[13:54:22] 13:50:23 ............................................................. 9577 / 9673 ( 99%)
[13:54:22] 13:50:25 .....................................................Build timed out (after 30 minutes). Marking the build as failed.
[13:54:58] Krenair: Try again
[13:55:07] (03PS1) 10BBlack: T86663 5.4: pool 3039; depool 3006,amssq56 [puppet] - 10https://gerrit.wikimedia.org/r/203324
[13:55:09] (03PS1) 10BBlack: T86663 5.4: switch cp3006 role [puppet] - 10https://gerrit.wikimedia.org/r/203325
[13:55:11] (03PS1) 10BBlack: T86663 5.4: repool cp3006 [puppet] - 10https://gerrit.wikimedia.org/r/203326
[13:55:13] (03PS1) 10BBlack: T86663 5.7: pool 3048; depool 3009,amssq57 [puppet] - 10https://gerrit.wikimedia.org/r/203327
[13:55:15] (03PS1) 10BBlack: T86663 5.7: switch cp3009 role [puppet] - 10https://gerrit.wikimedia.org/r/203328
[13:55:17] (03PS1) 10BBlack: T86663 5.7: repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/203329
[13:55:19] (03PS1) 10BBlack: T86663 5.8: pool 3049; depool 3010,amssq58 [puppet] - 10https://gerrit.wikimedia.org/r/203330
[13:55:21] (03PS1) 10BBlack: T86663 5.8: switch cp3010 role [puppet] - 10https://gerrit.wikimedia.org/r/203331
[13:55:23] (03PS1) 10BBlack: T86663 5.8: repool cp3010 [puppet] - 10https://gerrit.wikimedia.org/r/203332
[13:56:15] ^ prepped for later, pausing at least a few hours for now
[13:58:08] (03PS1) 10Dereckson: User rights configuration on ne.wikipedia - Abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203333 (https://phabricator.wikimedia.org/T95102)
[13:59:31] (03PS2) 10Dereckson: User rights configuration on ne.wikipedia - Abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203333 (https://phabricator.wikimedia.org/T95102)
[14:01:14] (03Abandoned) 10BBlack: convert txstatsd to service_unit [puppet] - 10https://gerrit.wikimedia.org/r/193542 (owner: 10BBlack)
[14:01:47] (03PS1) 10Mobrovac: Add a generic SCA service module [puppet] - 10https://gerrit.wikimedia.org/r/203334 (https://phabricator.wikimedia.org/T95533)
[14:02:39] (03CR) 10jenkins-bot: [V: 04-1] Add a generic SCA service module [puppet] - 10https://gerrit.wikimedia.org/r/203334 (https://phabricator.wikimedia.org/T95533) (owner: 10Mobrovac)
[14:02:50] damn
[14:05:34] (03Draft1) 10Dereckson: User rights configuration on ne.wikipedia - Filemover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203335 (https://phabricator.wikimedia.org/T95103)
[14:06:00] (03CR) 10Mobrovac: "damn strange strange Puppet bug https://tickets.puppetlabs.com/browse/PUP-1245 ." [puppet] - 10https://gerrit.wikimedia.org/r/203334 (https://phabricator.wikimedia.org/T95533) (owner: 10Mobrovac)
[14:10:48] godog: do you think I can close https://phabricator.wikimedia.org/T95579 ?
[14:11:02] (03CR) 10Dereckson: [C: 031] Create 'Portal' and 'Portal_Discussió' namespaces at cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199923 (https://phabricator.wikimedia.org/T93811) (owner: 10Gerardduenas)
[14:13:43] gwicke: yeah I think so
[14:14:45] 6operations, 7Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1197860 (10GWicke) @fgiunchedi, it might be worth trying to restore those rates from the backup (and moving them directly to sample_rate). It would bring back the hist...
[14:15:24] What the hell happened to mediawiki-phpunit-zend
[14:15:32] It went from 12 to 18 minutes this week.
[14:16:56] <^d> Shitty test?
[14:18:47] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/4625/testReport/(root)/
[14:18:52] Sort by Duration
[14:19:24] (03CR) 10Dereckson: [C: 031] Add import sources for cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199927 (https://phabricator.wikimedia.org/T93750) (owner: 10Gerardduenas)
[14:23:58] godog: what should we do about the rates?
[14:25:06] (03CR) 10Dereckson: [C: 04-1] "Please split this change into two changes, one per bug." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (owner: 10Cenarium)
[14:25:17] (03PS30) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567)
[14:25:33] gwicke: I think restoring makes sense, I'm looking into that
[14:26:03] godog: kk, thx!
[14:26:18] it's a general problem for all timers, we just tried to fix restbase & cassandra first
[14:26:36] (03CR) 10JanZerebecki: "PS30: Preserve mode and links when cp-ing the build result." [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) (owner: 10JanZerebecki)
[14:27:39] (03CR) 10Steinsplitter: [C: 031] Added *.adlibhosting.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202736 (https://phabricator.wikimedia.org/T95418) (owner: 10Dereckson)
[14:34:49] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1197953 (10Andrew) @Mholloway: This should happen on Monday; feel free to nag me if it doesn't :)
[14:37:22] 6operations, 7HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#1197979 (10chasemp) a:5chasemp>3Jdforrester-WMF
[14:40:20] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[14:43:11] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK No anomaly detected
[14:44:24] (03PS1) 10Andrew Bogott: Add mholloway to bastion-only and researchers. [puppet] - 10https://gerrit.wikimedia.org/r/203341
[14:45:15] (03CR) 10jenkins-bot: [V: 04-1] Add mholloway to bastion-only and researchers. [puppet] - 10https://gerrit.wikimedia.org/r/203341 (owner: 10Andrew Bogott)
[14:46:59] 6operations, 7Graphite: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1198011 (10Ottomata) Either way is fine with me. We could easily test by just changing on one node with a puppet conditional.
[14:47:32] 6operations, 7HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#1198014 (10Krenair) @chasemp: It's not clear to me what James can really be expected to do here? This is a task for an ops engineer.
[14:49:40] (03PS2) 10Andrew Bogott: Add mholloway to bastion-only and researchers. [puppet] - 10https://gerrit.wikimedia.org/r/203341
[14:56:01] PROBLEM - puppet last run on lvs2003 is CRITICAL Puppet has 1 failures
[14:57:02] <_joe_> mmmh that may be me
[14:57:12] <_joe_> checking
[14:57:34] (03PS1) 10Hashar: base: vim -> vim-nox [puppet] - 10https://gerrit.wikimedia.org/r/203342
[14:59:22] Aw, crap. The ops presentation team is today; I miscounted weeks.
[14:59:32] * Coren hurries to finish his slides.
[14:59:52] godog: I think there's some weird scaling going on with counters too
[15:00:10] Coren: did you make a calendar invite for it?
[15:00:14] I thought the meeting was set to repeat so I relied on it not being there on the calendar.
[15:00:16] cuz i forgot until you said that, and i dont see it.
[15:00:25] heh
[15:00:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[15:00:49] I think its up to whoever is giving the talk for that day to create the calendar entry.
[15:01:03] Well yeah, it's obvious in retrospect. :-)
[15:01:32] gwicke: can you put the details on phab? easier to track
[15:01:40] (cheat and duplicate the one from a couple of weeks ago)
[15:01:46] faster than trying to get the entire team in invite list ;D
[15:01:50] godog: restbase.sys_parsoid_generateAndSave.unchanged_rev_render used to have .rate and .count children, which are now gone; instead, that name has a rate that seems to be multiplied by 1000, which can be fixed by using the graphite 'ScaleToSeconds' function
[15:02:37] godog: is that known / expected?
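The discrepancy gwicke describes above is a unit change: the metric now carries the raw sum for each flush window, and Graphite's scaleToSeconds() renormalizes it to a per-second rate. A minimal sketch of that renormalization; the 60-second window in the example is an illustrative assumption, not the cluster's actual flush interval:

```python
def scale_to_seconds(values, step, seconds=1):
    # Mimic Graphite's scaleToSeconds(series, seconds): each datapoint is a
    # sum over one `step`-second window; rescale it to a per-`seconds` value.
    # None entries (gaps in the series) are passed through unchanged.
    return [None if v is None else v * seconds / step for v in values]

# Counts summed over assumed 60 s windows, rescaled to per-second rates:
rates = scale_to_seconds([600.0, 1200.0, 0.0], step=60)
```

Equivalently, wrapping the target in `scaleToSeconds(<metric>, 1)` in the Graphite UI performs the same division by the series step.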
[15:03:43] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Allow mobrovac to restart Zotero - https://phabricator.wikimedia.org/T95400#1198061 (10RobH) p:5Triage>3Normal
[15:04:28] robh: Bleh, scheduling conflict with the Call to Action thing or just after paravoid is done for the day if I push it to 18h UTC
[15:05:09] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Allow mobrovac to restart Zotero - https://phabricator.wikimedia.org/T95400#1198065 (10GWicke) Should we include services in general?
[15:05:12] 10Ops-Access-Requests, 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198063 (10RobH) This actually has to have @mark approval, not Toby. (Daniel and I discussed in IRC, this is a task u...
[15:06:19] Coren: punt to next week?
[15:06:33] Yeah, might be wiser.
[15:06:40] 10Ops-Access-Requests, 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198068 (10mark) Approved.
[15:06:57] its an informal talk so i think pushing is fine, gives it a week on folks calendars to stare at them
[15:07:21] 6operations, 7Graphite, 5Patch-For-Review: Counters now only provide rates (multiplied by 1000?) - https://phabricator.wikimedia.org/T95703#1198069 (10GWicke) 3NEW a:3fgiunchedi
[15:07:42] 6operations, 7Graphite: Counters now only provide rates (multiplied by 1000?) - https://phabricator.wikimedia.org/T95703#1198069 (10GWicke)
[15:07:47] Also this means I can concentrate on finishing the idmap thing today which is going to make paravoid happier than the talk. :-)
[15:08:07] (Also me - that dependency on LDAP is teh sux)
[15:10:14] 6operations, 10ops-eqiad, 6Labs: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1198079 (10Cmjohnson) A case with HP has been opened because that...
[15:11:01] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[15:11:50] RECOVERY - puppet last run on lvs2003 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[15:14:59] 6operations, 10Continuous-Integration: Upload jenkins-debian-glue for Jessie on apt.wikimedia.org - https://phabricator.wikimedia.org/T95006#1198095 (10hashar) a:5hashar>3None
[15:15:12] any bacula experts? having troubles with a restore job waiting on higher jobs to finish
[15:15:41] 6operations, 10Continuous-Integration: Upload jenkins-debian-glue for Jessie on apt.wikimedia.org - https://phabricator.wikimedia.org/T95006#1177778 (10hashar) Packages for Jessie are available at http://people.wikimedia.org/~hashar/debs/jenkins-debian-glue/ . The task is now pending upload by #operations .
[15:16:01] godog: I think akosiaris is best bet to ask if he's about
[15:16:54] chasemp: I think so too, holiday in greece tho, will go for phab
[15:18:53] <_joe_> godog: ach had that problem a few months back, but alex came to rescue :)
[15:19:19] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1198102 (10Cmjohnson) a:5Cmjohnson>3RobH Rob, These are 3.5" disk bays. I have 1TB disks on-site and will swap them (just give me the +1). CJ
[15:21:29] cmjohnson1: will you have a chance to try re-seating that drive today?
[15:21:40] 6operations: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705#1198107 (10fgiunchedi) 3NEW a:3akosiaris
[15:21:52] andrewbogott: already tried that...the disk is bad. I have a support ticket in
[15:21:59] hopefully they will get back to me soon
[15:22:00] dang, ok. thanks
[15:22:11] I guess it takes 3+ weeks to mail out a replacement drive too?
[15:22:40] hah..HP support is new to me but iirc we have 24 hour shipping
[15:22:44] Oh, I should’ve read my phab backlog before nagging you
[15:23:04] np...getting me back from the other day :-P
[15:26:10] (03PS3) 10Andrew Bogott: admin: add tomasz to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/202095 (https://phabricator.wikimedia.org/T95036) (owner: 10Dzahn)
[15:27:29] (03CR) 10Andrew Bogott: [C: 032] admin: add tomasz to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/202095 (https://phabricator.wikimedia.org/T95036) (owner: 10Dzahn)
[15:27:47] (03PS2) 10Filippo Giunchedi: statsite: look up the graphite host on hiera [puppet] - 10https://gerrit.wikimedia.org/r/203310 (owner: 10Giuseppe Lavagetto)
[15:27:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: look up the graphite host on hiera [puppet] - 10https://gerrit.wikimedia.org/r/203310 (owner: 10Giuseppe Lavagetto)
[15:28:03] 10Ops-Access-Requests, 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198133 (10Andrew)
[15:28:49] 10Ops-Access-Requests, 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1198137 (10Andrew) 5Open>3Resolved Done. Tomasz, re-open this ticket or ping me directly if you don't have access...
[15:32:53] (03PS1) 10Filippo Giunchedi: gdash: rename rate into sample_rate [puppet] - 10https://gerrit.wikimedia.org/r/203351
[15:33:28] (03PS3) 10Andrew Bogott: Add mholloway to bastion-only and researchers. [puppet] - 10https://gerrit.wikimedia.org/r/203341
[15:33:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: rename rate into sample_rate [puppet] - 10https://gerrit.wikimedia.org/r/203351 (owner: 10Filippo Giunchedi)
[15:36:01] (03PS1) 10Papaul: added betelgeuse [dns] - 10https://gerrit.wikimedia.org/r/203352
[15:38:11] 10Ops-Access-Requests, 6operations, 3Continuous-Integration-Isolation: Grant hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1198163 (10Andrew) hashar, I will raise this issue during the Ops meeting on Monday.
[15:39:18] andrewbogott: note the machine is not yet installed :D
[15:39:37] yeah, I know. I think that’s going to be the easy part.
[15:40:00] Although, that subnet is getting very crowded so you might wind up blocked on my decommissioning a cisco :(
[15:41:01] (03CR) 10RobH: [C: 032] added betelgeuse [dns] - 10https://gerrit.wikimedia.org/r/203352 (owner: 10Papaul)
[15:43:31] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1198164 (10Andrew) @fgiunchedi why is daily rotation not practical for unsampled logs? Too big?
[15:44:49] (03CR) 10Andrew Bogott: [C: 032] Redirect wikibook(s).(org|com) to www.wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher)
[15:45:32] andrewbogott: Thanks! :-)
[15:45:47] * andrewbogott watches urls nervously
[15:47:48] (03PS1) 10Dereckson: Namespace configuration on it.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203354 (https://phabricator.wikimedia.org/T93870)
[15:49:18] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1198174 (10fgiunchedi) @andrew yes, difficult to grep and compress and trim if needed
[15:50:23] godog: do you want to take ownership of that ticket and set up cronolog?
[15:50:59] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1198202 (10hashar) Out of curiosity, since analytics already uses zookeeper for hive/kafka, maybe it should be given a try first and other solutions looked at if zookeeper d...
[15:51:26] andrewbogott: TBH no if possible, I'm trying not to add more stateful problems to my plate :)
[15:51:36] hm, me neither :)
[15:52:00] For some reason phab shows that bug as pending patch review so it keeps rising to the top
[15:52:11] if you are on duty I got a low hanging fruit to upload a package to apt.wm.o :)
[15:52:20] https://phabricator.wikimedia.org/T95006 already build it :)
[15:52:42] 6operations, 7Graphite: Counters now only provide rates (multiplied by 1000?) - https://phabricator.wikimedia.org/T95703#1198204 (10fgiunchedi) it is, counters by default in statsites are defined as the sum of the counter values (unless extended counters) ``` STREAM("%s%s|%f|%lld\n", prefix, n...
[15:53:45] hashar: ok! looking
[15:54:52] not sure whether you want to rebuild it
[15:57:15] hashar: so, /all/ the .deb files on that page?
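The statsite snippet godog quotes in T95703 above reflects the default counter behaviour: on each flush, the daemon emits the sum of the values received for a counter during that window (extended counters add further statistics). A toy model of that aggregation, not statsite's actual C code, with made-up metric names:

```python
def flush_counters(samples):
    # Sum statsd-style counter samples ("name:value|c") per metric name for
    # one flush window -- the default (non-extended) statsite counter output.
    sums = {}
    for sample in samples:
        name, rest = sample.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":  # this sketch ignores timers, gauges, etc.
            sums[name] = sums.get(name, 0.0) + float(value)
    return sums

# Two increments of "hits" in one window flush as a single summed value:
window = flush_counters(["hits:1|c", "hits:3|c", "errs:2|c"])
```

Graphite then stores these per-window sums, which is why dashboards reading them as per-second rates needed the scaleToSeconds() adjustment discussed earlier in the channel.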
[15:58:27] 6operations, 10Parsoid, 6Services, 7service-runner: Decide whether to install heapdump by default, or continue to install npm & install on demand - https://phabricator.wikimedia.org/T95431#1198225 (10GWicke) > The caveat here is that we need to install these packages consistently on all servers. Why would...
[15:58:48] hashar: or just _all_?
[15:59:40] andrewbogott: all the .deb yeah
[15:59:46] you probably need the .dsc as well
[16:00:03] Publish ALL THE DEBS!
[16:00:06] the source package provides several binary packages with different dependencies
[16:00:21] hashar: doesn’t jenkins-debian-glue_0.11.0_all.deb contain all of the above?
[16:01:21] (03PS1) 10Ottomata: Add HiveContext support to spark by default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/203358
[16:01:55] (03CR) 10Ottomata: [C: 032] Add HiveContext support to spark by default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/203358 (owner: 10Ottomata)
[16:02:18] (03PS1) 10Ottomata: Update cdh module with spark hive support [puppet] - 10https://gerrit.wikimedia.org/r/203359
[16:02:33] (03PS2) 10Ottomata: Update cdh module with spark hive support [puppet] - 10https://gerrit.wikimedia.org/r/203359
[16:03:14] andrewbogott: the 'buildenv' variables are more or less the same, but come with different dependencies
[16:03:21] andrewbogott: make it easier to install dependent tools
[16:03:23] hm, ok
[16:03:40] on ops/puppet git grep jenkins-debian yields a bunch of them
[16:04:15] (03CR) 10Ottomata: [C: 032] Update cdh module with spark hive support [puppet] - 10https://gerrit.wikimedia.org/r/203359 (owner: 10Ottomata)
[16:05:24] andrewbogott: the buildenv packages just provide some basic shell script and /usr/share/doc entry. They in turn depends on the jenkins-debian-glue package and some others
[16:06:16] hashar: ok, done
[16:07:08] andrewbogott: confirmed! that is awesome :)
[16:07:20] can close https://phabricator.wikimedia.org/T95006 :)
[16:07:36] 6operations: Allow access to https://archiva.wikimedia.org from analytics nodes. - https://phabricator.wikimedia.org/T95712#1198240 (10Ottomata) 3NEW
[16:07:45] see you in 3 hours for the Ci isolation meeting
[16:09:26] springle: heh, the 24h graphs on https://tendril.wikimedia.org/host/view/db1055.eqiad.wmnet/3306 should say 12h
[16:13:11] andrewbogott: hey. when will it start taking effect? has it been deployed yet?
[16:13:26] Glaisher: should’ve by now. I don’t know what’s up
[16:14:01] :S
[16:14:44] well, I guess maybe it requires apache to restart before it takes effect. Hm...
[16:22:39] 6operations: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1198268 (10Dereckson) 3NEW
[16:23:01] 7Blocked-on-Operations, 6Commons, 10Wikimedia-Site-requests: Add *.wmflabs.org to `wgCopyUploadsDomains` - https://phabricator.wikimedia.org/T78167#1198282 (10Dereckson)
[16:23:24] JohnLewis: would you expect https://gerrit.wikimedia.org/r/#/c/185474/ to take effect after a merge, or is there some other step I’m missing?
[16:24:04] 6operations: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1198268 (10Dereckson)
[16:24:20] 6operations: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1198268 (10Dereckson)
[16:24:23] andrewbogott: it's apache so I'd expect you'd need to graceful the main apache cluster
[16:24:43] <_joe_> JohnLewis: I expect not
[16:24:50] Yeah, puppet should do that
[16:24:51] <_joe_> what change are you guys talking about?
[16:24:56] https://gerrit.wikimedia.org/r/#/c/185474/
[16:25:20] _joe_: to be honest, I've never looked at how this stuff is used and that was just a guess :)
[16:25:44] <_joe_> andrewbogott: checking
[16:25:57] thanks
[16:26:19] <_joe_> andrewbogott: it works on mw1018
[16:26:28] <_joe_> checked with curl -H 'Host: wikibooks.org' localhost -I
[16:26:40] _joe_: so it's an auto thing? okay :)
[16:26:43] <_joe_> so I guess if it doesn't work live, it has to do with caching :)
[16:27:14] _joe_: fair enough, that’s what I was hoping. I will just wait :)
[16:27:22] <_joe_> curl wikibooks.org tells me Location: http://en.wikibooks.org/ right now btw
[16:27:42] <_joe_> so... what is wrong here?
[16:27:58] <_joe_> oh sorry it should redir to www
[16:28:03] <_joe_> I got it wrong :)
[16:28:03] yeah
[16:28:21] <_joe_> so yeah, still cached
[16:28:34] <_joe_> you /could/ de-cache it
[16:28:45] <_joe_> but purging our caches is painful and dangerous
[16:29:04] It’s far from urgent. I just wanted to make sure I wasn’t leaving us in a broken/indeterminant state
[16:29:07] <_joe_> so I advise not to
[16:29:17] um… indeterminate
[16:29:32] <_joe_> ok
[16:29:37] <_joe_> uh it's /late/
[16:29:45] <_joe_> and I'm up since 5 am
[16:29:47] <_joe_> cya
[16:29:56] so long!
[16:31:33] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1198308 (10fgiunchedi) >>! In T93790#1163395, @GWicke wrote: >>>! In T93790#1148439, @faidon wrote: >> What "does not have a lot of margin on IO bandwidth and storage capacity" mean exact...
[16:31:50] (03PS2) 10Cenarium: Give patrol to reviewers for testwiki/enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798)
[16:32:31] andrewbogott: I've also added you to two patches relating to labsconsole.wm.o / the redirect to wikitech
[16:32:37] ok
[16:40:31] (03PS1) 10Cenarium: Remove 'autoreview' usergroup from enwiki/testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203370 (https://phabricator.wikimedia.org/T91934)
[16:40:36] (03CR) 10jenkins-bot: [V: 04-1] Remove 'autoreview' usergroup from enwiki/testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203370 (https://phabricator.wikimedia.org/T91934) (owner: 10Cenarium)
[16:41:10] RECOVERY - Disk space on graphite1001 is OK: DISK OK
[16:45:16] (03PS2) 10Cenarium: Remove 'autoreview' usergroup from enwiki/testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203370 (https://phabricator.wikimedia.org/T91934)
[16:56:11] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1198379 (10GWicke) > Do you have figures on the expected growth rate? This depends on the factors mentioned in the summary. The disk space graphs in http://grafana.wikimedia.org/#/dashbo...
[16:59:23] 6operations, 10Parsoid, 6Services, 7service-runner: Decide whether to install heapdump by default, or continue to install npm & install on demand - https://phabricator.wikimedia.org/T95431#1198394 (10mobrovac) >>! In T95431#1198225, @GWicke wrote: >> The caveat here is that we need to install these package...
[17:03:27] 6operations, 10Parsoid, 6Services, 7service-runner: Decide whether to install heapdump by default, or continue to install npm & install on demand - https://phabricator.wikimedia.org/T95431#1198402 (10GWicke) > Ups, my wording seems to be completely messed up today. It should read ...on all servers node.js...
[17:22:44] (03CR) 10Aaron Schulz: [C: 031] delete rbf hosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/203217 (https://phabricator.wikimedia.org/T95153) (owner: 10Dzahn)
[17:24:15] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1198469 (10GWicke) @mobrovac, could you add the info from the mail thread here as well?
[17:30:45] (03PS1) 10Cmjohnson: Adding dns entries (all) for berrylium [dns] - 10https://gerrit.wikimedia.org/r/203373
[17:52:33] (03CR) 10Mjbmr: "@Krenair I'm not sure where else I could request this. Should I abandon it and close the vote page?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[17:53:58] (03CR) 10Alex Monk: "No, because I'm not attempting to veto this. I'm just not handling it myself." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[17:56:11] (03CR) 10Mjbmr: "I fill that, but not no one else is willing to handle this because of your conversation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[17:59:05] (03Abandoned) 10Mjbmr: Add autopatrolled for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[18:03:02] (03CR) 10Dzahn: "Yes, that is the intention. to only switch around "shop" and "store" but change nothing else. so everything that used to be redirected to " [dns] - 10https://gerrit.wikimedia.org/r/199796 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[18:05:59] greg-g: I've got an UBN! patch for 1.26wmf1 that legoktm found. Ok to deploy? -- https://gerrit.wikimedia.org/r/#/c/203376/
[18:06:59] (03CR) 10Dzahn: "the intention is that, combined with https://gerrit.wikimedia.org/r/#/c/199796/, "shop" and "store" are simply switched around, so that "s" [puppet] - 10https://gerrit.wikimedia.org/r/199791 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[18:08:31] (03CR) 10Dzahn: [C: 032] delete rbf hosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/203217 (https://phabricator.wikimedia.org/T95153) (owner: 10Dzahn)
[18:10:44] (03PS1) 10Aaron Schulz: Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397)
[18:12:50] (03CR) 10Aklapper: "Mjbmr: Please refrain from using wording that could be interpreted as accusations. Please assume that people mean well. Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[18:18:57] mutante: I just forwarded you an email from shopify. I’m tempted to assume that they just don’t know our process and ignore it, but can you confirm that that’s right?
[18:19:12] Or rather, read it as ‘change your redirects’ and ignore the specific bits.
[18:22:01] 6operations, 10hardware-requests, 5Patch-For-Review: Decom/repurpose rbf* hosts - https://phabricator.wikimedia.org/T95153#1198957 (10Dzahn) @robh for rbf* i went through all the decom steps listed above, incl. DNS removal and powering them down. in this case the entire server type "rbf" is not going to e...
[18:22:42] 6operations, 10hardware-requests, 5Patch-For-Review: Decom/repurpose rbf* hosts - https://phabricator.wikimedia.org/T95153#1198964 (10Dzahn) a:5Dzahn>3RobH
[18:24:22] Heads up. I'm going to deploy a tiny but important logging fix for 1.26wmf1 now
[18:25:34] andrewbogott: it sounds she wants me to reply just to confirm it's valid.. on it
[18:26:02] mutante: I can reply. I’m just confused by her very specific ‘point your CNAME for store to c.ssl.shopify.com’
[18:26:38] I’m really surprised by their “So it’ll be broken for a few hours, who cares?” approach to this.
[18:26:42] Krinkle: You've got an undeployed change merged in 1.26wmf1 (Title: Add debug logging for I2b36b7a3 and I62fe3f700). I'll deploy it along with legoktm's patch
[18:27:00] bd808: Thanks. Sorry I forgot about that one, I've still got a tin tab open.
[18:27:03] Go ahead :)
[18:27:30] No worries. I would have bitched if it wasn't so trivial :)
[18:30:39] !log bd808 Synchronized php-1.26wmf1/includes/Title.php: Title: Add debug logging for I2b36b7a3 and I62fe3f700 (f45a334e) (duration: 00m 12s)
[18:30:44] Krinkle: ^
[18:30:45] Logged the message, Master
[18:31:47] mutante: that’s totally not the email I was talking about
[18:31:47] !log bd808 Synchronized php-1.26wmf1/includes/debug/logger/LegacyLogger.php: debug: Add missing use DateTimeZone in LegacyLogger.php (2c8f292c) (duration: 00m 14s)
[18:31:50] Logged the message, Master
[18:31:56] legoktm: ^
[18:33:29] andrewbogott: ? but i just received that from you?
[18:33:36] really?
[18:33:38] * andrewbogott checks
[18:33:53] The one I just sent starts with “Noted. I'll pass on your feedback to our ops team.”
[18:34:11] So I was asking about the… most recent part.
[18:34:24] bd808: Thx
[18:34:38] mutante: sorry, you did indeed respond to the email I just forwarded. But to an older section of the conversation, I guess?
[18:35:10] andrewbogott: you mean this "You can point your CNAME for store to c.ssl.shopify.com whenever you're ready!" then?
[18:35:15] Yes~
[18:35:17] ! That is the part I am asking about
[18:35:33] which is why I keep quoting that sentence in my questions to you :)
[18:36:14] yea, see my response in the other channel
[18:36:19] i'll reply Effie too
[18:36:25] ah
[18:36:44] that was some kind of quoting/forward messup there, gotcha
[18:37:43] bd808: looks to have stopped
[18:38:16] !log Updated scap to f9b9a82 (Remove exotic unicode from ascii logo)
[18:38:23] Logged the message, Master
[18:38:26] !log Trebuchet fetch failed for scap/scap on mw2128 and mw1222
[18:38:29] Logged the message, Master
[18:38:42] !log Trebuchet checkout failed for scap/scap on mw2128, mw1113, mw1222, and mw1104
[18:38:45] Logged the message, Master
[18:40:06] ottomata: an1020 ..will need a new mainboard. :-(
[18:41:01] (03PS2) 10Andrew Bogott: shop redirects: store instead of shop [puppet] - 10https://gerrit.wikimedia.org/r/199791 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[18:41:10] (03PS2) 10Andrew Bogott: shop URL: change 'shop' to 'store' [dns] - 10https://gerrit.wikimedia.org/r/199796 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[18:41:32] (03PS1) 10Legoktm: Set $wgCentralAuthCheckSULMigration = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203381 (https://phabricator.wikimedia.org/T95735)
[18:41:42] (03CR) 10jenkins-bot: [V: 04-1] Set $wgCentralAuthCheckSULMigration = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203381 (https://phabricator.wikimedia.org/T95735) (owner: 10Legoktm)
[18:41:52] (03CR) 10Cmjohnson: [C: 032] Adding dns entries (all) for berrylium [dns] - 10https://gerrit.wikimedia.org/r/203373 (owner: 10Cmjohnson)
[18:42:17] (03PS2) 10Legoktm: Set $wgCentralAuthCheckSULMigration = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203381 (https://phabricator.wikimedia.org/T95735)
[18:43:00] (03CR) 10BryanDavis: [C: 031] Set $wgCentralAuthCheckSULMigration = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203381 (https://phabricator.wikimedia.org/T95735) (owner: 10Legoktm)
[18:43:34] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1199116 (10GWicke)
[18:44:37] (03PS1) 10coren: Labs: Disable idmap on instances [puppet] - 10https://gerrit.wikimedia.org/r/203384 (https://phabricator.wikimedia.org/T95555)
[18:45:00] cmjohnson1: ok
[18:45:02] thanks
[18:45:06] how long do you think that will take?
[18:46:28] (03PS2) 10Legoktm: Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397) (owner: 10Aaron Schulz)
[18:47:42] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1199130 (10GWicke)
[18:47:55] (03CR) 10Legoktm: [C: 031] Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397) (owner: 10Aaron Schulz)
[18:48:22] 6operations, 6Multimedia, 6Parsoid-Team, 6Release-Engineering, and 2 others: Prepare Platform/Ops April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1199132 (10Qgil)
[18:48:56] 6operations, 10ops-eqiad, 10Analytics-Cluster: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1199134 (10Cmjohnson) spent some time chatting with Dell tech. I did get firmware updates for the R720 that are bootable so I would like attempt to upgrade the bios on a few of the older s...
[18:49:03] (03PS2) 10Mobrovac: Add a generic SCA service module [puppet] - 10https://gerrit.wikimedia.org/r/203334 (https://phabricator.wikimedia.org/T95533)
[18:49:05] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1136887 (10GWicke)
[18:49:26] 6operations, 10ops-eqiad, 10ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1199143 (10Cmjohnson) This will be replaced 4/14 at 10am EST.
[18:49:27] bd808: I trust you
[18:49:46] greg-g: good because I already did it :)
[18:49:54] whew
[18:50:04] (03CR) 10Mjbmr: "I meant "I feel that"." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[18:50:09] (03CR) 10Mholloway: [C: 031] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/203341 (owner: 10Andrew Bogott)
[18:51:08] (03CR) 10Mjbmr: "btw, it's already vetoed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[18:52:00] bd808: and now I just saw the flurry of task notifications (I'm "watching" -log-errors, so I saw it all). I love reading backlog and watching a UBN bug go from creation to resolved. :)
[18:52:47] "I love it when a plan comes together"
[18:52:54] 6operations, 10ops-eqiad, 10ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1199180 (10Cmjohnson) Updated icinga with the downtime 4/14 1400 -1415
[18:53:06] * bd808 goes to watch A-Team while he eats lunch
[18:53:46] (03PS2) 10BBlack: T86663 5.8: pool 3049; depool 3010 [puppet] - 10https://gerrit.wikimedia.org/r/203330
[18:53:48] (03PS2) 10BBlack: T86663 5.8: switch cp3010 role [puppet] - 10https://gerrit.wikimedia.org/r/203331
[18:53:50] (03PS2) 10BBlack: T86663 5.7: switch cp3009 role [puppet] - 10https://gerrit.wikimedia.org/r/203328
[18:53:52] (03PS2) 10BBlack: T86663 5.7: repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/203329
[18:53:54] (03PS2) 10BBlack: T86663 5.8: repool cp3010 [puppet] - 10https://gerrit.wikimedia.org/r/203332
[18:53:56] (03PS2) 10BBlack: T86663 5.4: switch cp3006 role [puppet] - 10https://gerrit.wikimedia.org/r/203325
[18:53:58] (03PS2) 10BBlack: T86663 5.4: pool 3039; depool 3006 [puppet] - 10https://gerrit.wikimedia.org/r/203324
[18:54:00] (03PS2) 10BBlack: T86663 5.7: pool 3048; depool 3009 [puppet] - 10https://gerrit.wikimedia.org/r/203327
[18:54:02] (03PS2) 10BBlack: T86663 5.4: repool cp3006 [puppet] - 10https://gerrit.wikimedia.org/r/203326
[18:54:04] (03PS1) 10BBlack: T86663: depool amssq56-58 [puppet] - 10https://gerrit.wikimedia.org/r/203385
[18:54:06] (03PS1) 10BBlack: remove amssq* refs from site.pp and cache role stuff [puppet] - 10https://gerrit.wikimedia.org/r/203386
[18:55:35] (03CR) 10Dereckson: "Don't forget to edit commit message too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium)
[18:58:58] 6operations, 10ops-esams: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1199200 (10BBlack) 3NEW
[19:01:06] are logs for private wikis (such as office) still kept with everything else on fluorine?
[19:01:46] yes [19:02:26] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199213 (10Cmjohnson) [19:02:29] 6operations, 10ops-eqiad, 10Analytics-EventLogging: vanadium failed disk /dev/sda - https://phabricator.wikimedia.org/T94926#1199210 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson replaced disk [19:10:27] 6operations, 10ops-eqiad, 10ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1199243 (10Cmjohnson) returned the disk Dell sent me. Tracking numbers FEDEX 9611918 2393026 47861526 [19:12:25] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199253 (10Cmjohnson) confirmed ge-4/0/11 is vanadium. I deleted the interface from the switch The disk was replaced. @[[ https://phabricator.wikimedia.org/p/RobH/ | Robh ]] [19:13:04] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199255 (10Cmjohnson) @[[ https://phabricator.wikimedia.org/p/RobH/ | RobH ]] did you add to server spares? [19:13:51] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1199257 (10Cmjohnson) confirmed in IRC that no it wasn't done...keeping ticket to complete [19:24:01] PROBLEM - RAID on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:24:41] PROBLEM - SSH on analytics1017 is CRITICAL - Socket timeout after 10 seconds [19:26:12] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:26:59] 6operations, 10hardware-requests, 5Patch-For-Review: Decom/repurpose rbf* hosts - https://phabricator.wikimedia.org/T95153#1199298 (10Dzahn) service tags: 8W62BP1 (rbf1001) 8W23BP1 (rbf1002) 17046X1 (rbf2001) 47046X1 (rbf2002) [19:27:20] RECOVERY - RAID on analytics1017 is OK no disks configured for RAID [19:40:43] 6operations: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1199364 (10Aklapper) @Dereckson: Why was this added to Security-General (because that does not mean much)? Because you want input from the Security team? [19:42:56] 6operations: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1199380 (10Dereckson) >>! In T95714#1199364, @Aklapper wrote: > @Dereckson: Why was this added to Security-General (because that does not mean much)? Because you want input from the Security team? Indeed. [19:43:17] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1199383 (10Dereckson) [19:44:06] (03CR) 10Alex Monk: "Hoo? This has been waiting for you for over a year." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://bugzilla.wikimedia.org/56169) (owner: 10Gerrit Patch Uploader) [19:45:17] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1199405 (10Dzahn) Wanting input from the security team seems the reason. If that tag has no meaning why do we have it? [19:45:34] (03CR) 10Alex Monk: "Tony: Any plans to fix this patch?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://bugzilla.wikimedia.org/61888) (owner: 1001tonythomas) [19:46:37] (03PS2) 10Dzahn: Adjust RSS whitelist to allow mediawiki blog feeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://phabricator.wikimedia.org/T63888) (owner: 1001tonythomas) [19:47:20] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-Road-Warrior:-Mad-Max-2: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1199426 (10phuedx) This might even be a NOP as we're planning on using categories initially. [19:47:57] (03PS1) 10Dereckson: Logo configuration on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203422 (https://phabricator.wikimedia.org/T75424) [19:48:11] (03CR) 10Alex Monk: [C: 04-1] "It's probably going to error in some way." [puppet] - 10https://gerrit.wikimedia.org/r/139581 (owner: 10Withoutaname) [19:48:36] (03PS4) 10Dzahn: Add namespace aliases for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://phabricator.wikimedia.org/T58169) (owner: 10Gerrit Patch Uploader) [19:48:57] (03PS5) 10Dzahn: Add namespace aliases for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://phabricator.wikimedia.org/T58169) (owner: 10Gerrit Patch Uploader) [19:50:43] (03CR) 10Dzahn: "fwiw: "https://mingle.corp.wikimedia.org" <-- afaik that's outsource to thoughtworks and/or deprecated" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://phabricator.wikimedia.org/T63888) (owner: 1001tonythomas) [19:53:50] 6operations, 10ops-eqiad, 6Labs: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1199472 (10Cmjohnson) hank you for contacting HP e-Solutions. Wi... 
[19:54:41] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1199479 (10Dereckson) [19:55:03] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-Road-Warrior:-Mad-Max-2: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1199481 (10JKatzWMF) 5Open>3Resolved [20:04:50] PROBLEM - puppet last run on mw2197 is CRITICAL puppet fail [20:05:47] 10Ops-Access-Requests, 6operations, 3Continuous-Integration-Isolation: Grant hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1199550 (10hashar) We spoke about it during our weekly Friday checkin. Agreed root access would be a convenience to bootstrap the servi... [20:22:20] RECOVERY - puppet last run on mw2197 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:37:42] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199641 (10RobH) [20:37:44] 6operations, 10ops-eqiad: labnodepool1001 setup tasks: labels/ports/racktables - https://phabricator.wikimedia.org/T95048#1199639 (10RobH) 5Resolved>3Open please update racktables, as it still shows in row c [20:39:16] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179075 (10RobH) [20:39:42] 6operations, 10Continuous-Integration, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1199645 (10hashar) 3NEW [20:43:29] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199668 (10RobH) [21:05:27] Deskana: have you talked to bd808 and DanD about https://phabricator.wikimedia.org/T95019? 
[21:05:27] bearND: I talked to Dan about it, but I probably only mentioned it in passing to bd808. [21:05:29] We are planning on getting server side EL instrumentation on the login funnel that may help in the future for that particular problem [21:05:29] Deskana: I'm not sure who the best person to talk about this is. I think YuviPanda knows the labs side of it but we need it for production. [21:05:29] bearND: YuviPanda mentioned some more rigorous monitoring thing that's coming up. It'll still be good to chat to him. [21:05:29] It falls under the umbrella theme of "we don't watch our logs well" today [21:05:29] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199682 (10RobH) [21:05:29] bd808: Would the EL instrumentation be a real-time alerting mechanism? [21:05:30] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179075 (10RobH) OS is installed, but attempting to sign keys afterwards has lead to an issue. I cannot ssh or ping labnodepool1001.eqiad.wmnet from palladium (puppetmaster). I can do so... [21:05:30] 6operations, 6Engineering-Community, 6WMF-Legal, 6WMF-NDA: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1199684 (10JanZerebecki) @Dzahn What still needs to be done to resolve this task? https://wikitech.wikimedia.org/wiki/Volunteer_NDA currently say "THIS PRO... [21:05:35] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199688 (10RobH) bastion, carbon, gallium... hosts in public IP vlans can ping the host, but nothing in the private vlans... [21:06:42] bearND: not in an of itself, no [21:06:50] *in and of [21:07:03] account creation is tough because captcha [21:07:07] serverside EL seems best, yeah [21:07:34] But you don't want to get paged every time there is a captcha failure. 
trust me on that [21:08:09] bd808: agreed [21:08:09] so what you really want is a probabilistic trend monitoring alert [21:08:09] heh [21:08:12] and yeah that would be sweet for many many things [21:08:38] a few, yeah [21:08:52] how many times has hoyt-winters come up? [21:08:57] hoyt? [21:09:09] holt [21:09:19] https://github.com/etsy/skyline [21:09:29] bd808: Would be sweet, indeed. Is that what you are planning on working on? [21:09:48] "planning on", nope. Dreaming about, yup [21:09:51] (03CR) 10BBlack: [C: 032] T86663: depool amssq56-58 [puppet] - 10https://gerrit.wikimedia.org/r/203385 (owner: 10BBlack) [21:10:38] I'm still trying to get logstash stable and useful. That's only 18 months of work thus far with the time I've been able to steal to work on it [21:12:31] Are we using similar monitoring tools for labs and production? [21:12:36] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199739 (10RobH) chatted with andrew, this is a known thing, and iron can ssh in. resuming installation [21:12:57] eyeballs and irc [21:14:41] We have various monitors for system health but nothing that I know of that watches error trends on the wikis with any type of alerting [21:14:41] afaik this is what we have in terms of graphite alerts so far: https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/manifests/monitoring/graphite.pp#L7 [21:15:36] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199756 (10RobH) a:5RobH>3None [21:15:42] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179075 (10RobH) puppet/salt accepted, system ready for service implementation. 
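The "probabilistic trend monitoring alert" bd808 and YuviPanda are circling (and that Etsy's skyline tooling linked above builds on) is essentially Holt-Winters-style exponential smoothing: track a level and a trend, forecast the next point, and alert when the observation deviates too far from the forecast. A minimal sketch of the idea, not skyline's actual algorithm — the smoothing constants and threshold here are illustrative, not tuned:

```python
# Minimal double-exponential-smoothing (Holt) anomaly detector.
# alpha/beta/k are illustrative values, not tuned for any real metric.
def holt_anomalies(series, alpha=0.5, beta=0.3, k=3.0):
    level, trend, dev = series[0], 0.0, 0.0
    flagged = []
    for i, x in enumerate(series[1:], start=1):
        forecast = level + trend
        err = x - forecast
        # Only flag once the running deviation estimate has warmed up.
        if i > 3 and abs(err) > k * dev:
            flagged.append(i)
        dev = alpha * abs(err) + (1 - alpha) * dev
        last_level = level
        level = alpha * x + (1 - alpha) * forecast
        trend = beta * (level - last_level) + (1 - beta) * trend
    return flagged

# A flat error rate with one spike: only the spike should be flagged.
print(holt_anomalies([10, 10, 11, 10, 10, 500, 10, 10]))  # → [5]
```

A production system (skyline included) layers several such detectors and votes among them, precisely to avoid paging on every captcha failure.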
[21:16:02] actually not true, there are more: https://github.com/wikimedia/operations-puppet/search?utf8=%E2%9C%93&q=graphite_threshold [21:16:44] I wished setting those up didn't involve puppet though [21:18:08] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1199769 (10hashar) 5Open>3stalled Thank you very much @RobH ! Service implementation is pending gaining access to it via T95303 that will be discussed Monday during the Ops meeting. [21:20:11] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [21:22:21] Thanks! I assume the output of those tools is something like above icinga IRC message and some nice graphs, right? [21:22:58] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0] [21:23:28] graphs for some things at https://gdash.wikimedia.org/ [21:24:15] bd808: great. Thanks! [21:24:20] and http://grafana.wikimedia.org/, for example http://grafana.wikimedia.org/#/dashboard/db/restbase [21:25:12] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1199793 (10yuvipanda) 3NEW [21:29:47] (03CR) 10BBlack: [C: 032] remove amssq* refs from site.pp and cache role stuff [puppet] - 10https://gerrit.wikimedia.org/r/203386 (owner: 10BBlack) [21:30:57] 6operations, 10ops-esams: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1199824 (10BBlack) [21:36:48] PROBLEM - puppet last run on amssq36 is CRITICAL puppet fail [21:37:00] :P [21:37:47] bblack: stop causing puppet failures :P [21:38:18] PROBLEM - HTTPS on amssq45 is CRITICAL: Return code of 255 is out of bounds [21:38:18] PROBLEM - HTTPS on amssq46 is CRITICAL: Return code of 255 is out of bounds [21:38:18] PROBLEM - HTTPS on amssq56 is CRITICAL: Return code of 255 is out of bounds [21:38:18] PROBLEM - HTTPS on
amssq57 is CRITICAL: Return code of 255 is out of bounds [21:38:18] PROBLEM - HTTPS on amssq48 is CRITICAL: Return code of 255 is out of bounds [21:38:19] PROBLEM - HTTPS on amssq33 is CRITICAL: Return code of 255 is out of bounds [21:38:19] PROBLEM - HTTPS on amssq35 is CRITICAL: Return code of 255 is out of bounds [21:38:28] PROBLEM - HTTPS on amssq38 is CRITICAL: Return code of 255 is out of bounds [21:38:28] PROBLEM - HTTPS on amssq55 is CRITICAL: Return code of 255 is out of bounds [21:38:29] PROBLEM - HTTPS on amssq44 is CRITICAL: Return code of 255 is out of bounds [21:38:29] PROBLEM - HTTPS on amssq61 is CRITICAL: Return code of 255 is out of bounds [21:38:29] PROBLEM - HTTPS on amssq37 is CRITICAL: Return code of 255 is out of bounds [21:38:35] ignore all of this crap ^ [21:38:38] PROBLEM - Varnishkafka log producer on amssq43 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:38] PROBLEM - HTTPS on amssq34 is CRITICAL: Return code of 255 is out of bounds [21:38:38] PROBLEM - Varnish HTTP text-backend on amssq45 is CRITICAL: Connection refused [21:38:38] PROBLEM - Varnish HTTP text-frontend on amssq40 is CRITICAL: Connection refused [21:38:38] PROBLEM - HTTPS on amssq59 is CRITICAL: Return code of 255 is out of bounds [21:38:38] PROBLEM - HTTPS on amssq54 is CRITICAL: Return code of 255 is out of bounds [21:38:38] PROBLEM - Varnish HTTP text-backend on amssq31 is CRITICAL: Connection refused [21:38:39] PROBLEM - Varnish HTTP text-frontend on amssq34 is CRITICAL: Connection refused [21:38:39] PROBLEM - Varnish HTTP text-frontend on amssq51 is CRITICAL: Connection refused [21:38:39] bblack: I hope your... 
yeah [21:38:40] PROBLEM - Varnish HTTP text-frontend on amssq59 is CRITICAL: Connection refused [21:38:40] PROBLEM - Varnish HTTP text-backend on amssq54 is CRITICAL: Connection refused [21:38:41] PROBLEM - Varnish HTTP text-backend on amssq56 is CRITICAL: Connection refused [21:38:41] PROBLEM - HTTPS on amssq42 is CRITICAL: Return code of 255 is out of bounds [21:38:42] PROBLEM - Varnish HTTP text-backend on amssq47 is CRITICAL: Connection refused [21:38:53] PROBLEM - Varnishkafka log producer on amssq31 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:53] PROBLEM - Varnishkafka log producer on amssq34 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:54] PROBLEM - HTTPS on amssq52 is CRITICAL: Return code of 255 is out of bounds [21:38:57] PROBLEM - HTTPS on amssq47 is CRITICAL: Return code of 255 is out of bounds [21:38:57] /ignore icinga-wm 1hour .*Varnish* [21:38:57] PROBLEM - Varnishkafka log producer on amssq39 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:57] PROBLEM - Varnishkafka log producer on amssq33 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:57] PROBLEM - Varnishkafka log producer on amssq53 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:57] PROBLEM - Varnishkafka log producer on amssq45 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:58] PROBLEM - Varnishkafka log producer on amssq38 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:58] PROBLEM - Varnishkafka log producer on amssq50 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:59] PROBLEM - Varnishkafka log producer on amssq47 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:38:59] PROBLEM - Varnishkafka log producer on amssq56 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:39:00] PROBLEM - 
Varnish HTTP text-frontend on amssq44 is CRITICAL: Connection refused [21:39:00] PROBLEM - Varnish HTTP text-backend on amssq52 is CRITICAL: Connection refused [21:39:01] PROBLEM - Varnish HTTP text-frontend on amssq41 is CRITICAL: Connection refused [21:39:01] PROBLEM - Varnishkafka log producer on amssq58 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:39:01] :) [21:39:44] good luck! [21:39:49] I am off for the week-end [21:39:55] heh, there goes icinga [21:40:22] PROBLEM - Varnishkafka log producer on amssq60 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:40:24] neon was already in the process of deleting those hosts from monitoring anyways, it just takes for-freaking-ever to run [21:44:35] (03CR) 10BBlack: [C: 032] T86663 5.4: pool 3039; depool 3006 [puppet] - 10https://gerrit.wikimedia.org/r/203324 (owner: 10BBlack) [21:46:26] ok they're completely gone from monitoring now, should be no more amssq* spam [21:49:26] salt wins at math, again! [21:49:44] I tell it to do a list of things in batches of 52% at a time, and somehow it gets 3 batches :P [21:50:16] perhaps there's a --round-up flag somewhere [21:51:03] (03CR) 10BBlack: [C: 032] T86663 5.4: switch cp3006 role [puppet] - 10https://gerrit.wikimedia.org/r/203325 (owner: 10BBlack) [21:55:54] (03CR) 10BBlack: [C: 032] T86663 5.4: repool cp3006 [puppet] - 10https://gerrit.wikimedia.org/r/203326 (owner: 10BBlack) [21:56:34] bblack: it knows something we don't yet :p [21:57:22] the question whose answer is 42? [22:20:07] bd808: ori legoktm are there any docs on sending data from mw to graphite / statsd? [22:20:14] searching for statsd on mw.org isn’t yielding anything useful [22:20:39] there's a new thingy that Ori made I think... [22:21:33] hrm... 
wfLogProfilingData() does it [22:22:20] hmm [22:22:23] YuviPanda: BufferingStatsdDataFactory I think [22:22:45] Which you can get from a context as getStats() [22:22:51] aaah, nice [22:22:54] bearND: ^ [22:22:59] bd808: any examples? [22:23:00] * YuviPanda greps [22:24:01] oh [22:24:03] that’s fairly trivial [22:24:07] page/Article.php has one [22:24:16] yeah [22:24:18] just saw that [22:24:22] quite sane [22:24:24] thanks bd808 [22:24:33] yw [22:24:41] it's really new tech [22:24:46] yeah [22:24:48] (03CR) 10BBlack: [C: 032] T86663 5.7: pool 3048; depool 3009 [puppet] - 10https://gerrit.wikimedia.org/r/203327 (owner: 10BBlack) [22:24:50] shinnnyyy [22:25:11] bd808: I just had a meeting with Deskana and bearND, suggested they graphite count captcha failures [22:25:21] since they’re basically the only users of the createaccount API... [22:25:43] tgr is going to be adding EL logging to all that stuff [22:26:13] yeah, but you can’t alert on EL, no? [22:26:19] apps already has EL for that [22:26:21] probably not :/ [22:26:53] having to dupe things in EL and graphite seems dumb [22:27:15] If I can tack on some alerting to our EventLogging then that's easy [22:27:20] But yeah, AFAIK that is not possible [22:27:30] (should it be possible? I think so?) [22:27:39] could we just alert based on the number of captcha attempts/failures on the API response side of things? 
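For context on what bd808's `BufferingStatsdDataFactory` / `getStats()` pointer bottoms out in: MediaWiki buffers metric updates and flushes them as plain statsd datagrams over UDP. A minimal Python sketch of that wire format (a counter is `name:value|c`) — the metric name and the localhost:8125 endpoint below are illustrative assumptions, not real MediaWiki metrics or production config:

```python
import socket

def statsd_increment(name, value=1, host="localhost", port=8125):
    """Format and send a statsd counter datagram ("name:value|c")."""
    payload = "%s:%d|c" % (name, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload  # returned so the format is easy to inspect

# Hypothetical metric name for counting captcha failures server-side:
statsd_increment("MediaWiki.captcha.failure")
```

Because it is fire-and-forget UDP, emitting a counter like this from the createaccount API path adds essentially no latency, which is why counting captcha failures in graphite is cheap to do alongside the EventLogging instrumentation discussed here.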
[22:27:41] trick YuviPanda into making it possible ;) [22:28:29] I love that you guys are going after this, but you may be trying to empty the ocean with an eyedropper [22:28:59] boo [22:28:59] we need systematic controls and alerting not just "oops this broke once" fixes [22:29:28] err, that boo wasn’t for yo [22:29:30] you [22:29:41] * bd808 picks up the drum that ori likes to get out once a quarter and pounds out a tune [22:29:45] I looked at the code and there’s no way to get a context where I want it, but I’ll avert this nerdsnipe [22:29:52] * YuviPanda listens to bd808 [22:30:44] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 3Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1199963 (10hashar) We had three meeting already with @chasemp @andrew and @hashar . We are exchanging on a weekly basis. [22:31:20] (03CR) 10BBlack: [C: 032] T86663 5.7: switch cp3009 role [puppet] - 10https://gerrit.wikimedia.org/r/203328 (owner: 10BBlack) [22:35:30] bd808: I wonder if a part of the solution is to recognize that the ‘action’ pattern in EL is super heavily used, and start supporting that as a first class citizen [22:35:45] and then basically have graphite metrics for Event.Action, rather than just Event which is what we have no [22:35:46] *now [22:35:52] hmm [22:35:58] that’s just passing the buck a little bit, I guess [22:36:06] not sure how we can alert directly from EL, however [22:36:09] but that would indeed be awesome [22:36:20] let me file a bug [22:38:20] All I know about EL is that it's a magic thing that lets you make pretty graphs [22:40:29] (03CR) 10BBlack: [C: 032] T86663 5.7: repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/203329 (owner: 10BBlack) [22:42:03] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1199993 (10Dzahn) Shopify said "You can point your CNAME 
for store to c.ssl.shopify.com.." but we have not used this before, the CNAME we have is for "s... [23:10:00] (03CR) 10BBlack: [C: 032] T86663 5.8: pool 3049; depool 3010 [puppet] - 10https://gerrit.wikimedia.org/r/203330 (owner: 10BBlack) [23:14:48] (03CR) 10BBlack: [C: 032] T86663 5.8: switch cp3010 role [puppet] - 10https://gerrit.wikimedia.org/r/203331 (owner: 10BBlack) [23:16:28] (03CR) 10Dzahn: [C: 032] contint: 'zip' package via ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/203203 (owner: 10Hashar) [23:18:25] springle: ping [23:18:31] (03CR) 10Dzahn: [C: 032] package_builder: use ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/203198 (owner: 10Hashar) [23:18:56] (03PS3) 10BBlack: T86663 5.8: repool cp3010 [puppet] - 10https://gerrit.wikimedia.org/r/203332 [23:19:08] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.8: repool cp3010 [puppet] - 10https://gerrit.wikimedia.org/r/203332 (owner: 10BBlack) [23:19:14] bd808: Deskana bearND https://phabricator.wikimedia.org/T95780 [23:19:31] bd808: I think ^ is less ‘empty ocean with an eye dropper' [23:19:38] more like, ‘with a bucket’ or maybe even a ship [23:20:12] cool [23:20:59] 6operations, 7HTTPS, 3HTTPS-by-default, 5Patch-For-Review: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1200097 (10BBlack) 5Open>3Resolved Done! [23:21:07] (03CR) 10Dzahn: [C: 032] package_builder: fix dependency order for hooks [puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [23:21:52] Deskana: can you tell me when the account creation outage was? 
[23:21:53] just date [23:22:17] YuviPanda: Erm [23:22:21] YuviPanda: Not off the top of my head [23:22:24] heh [23:22:25] alright [23:22:29] YuviPanda: Hold on, I can find out [23:22:48] (03CR) 10Dzahn: [C: 031] contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [23:22:59] YuviPanda: https://phabricator.wikimedia.org/T94915 [23:23:04] godog: ori hmm, missing data for EL here: [23:23:08] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1428708160.935&from=-24days&target=eventlogging.schema.MobileWikiAppCreateAccount.p99&target=eventlogging.schema.MobileWikiAppCreateAccount.rate [23:23:09] hmm [23:23:12] I wonder if that’s the rename? [23:23:15] YuviPanda: Quote Deskana, "This issue appears to have begun at around 20150401230000 UTC (11pm Wednesday 1st April, UTC)." [23:23:50] Deskana: cool. I was trying to see if looking at general ‘rate of events’ is going to have helped [23:24:02] but: we’re missing data until 7th April... [23:24:12] from 23rd March [23:24:14] * YuviPanda files [23:24:15] bug [23:24:29] YuviPanda: From EL? Hopefully not. [23:24:34] not from EL no [23:24:35] 6operations, 10ops-esams: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1200103 (10BBlack) Done so far: 1. Removed from puppet/pybal cache pools/definitions 2. Removed from site.pp 3. puppeted once without any specific role 4. shredded private keys (for whatever that's worth...)... [23:24:35] from graphite [23:24:40] Okay [23:24:42] * Deskana breathes a sigh of relief [23:24:51] at one time i had seen we have an sql server that basically contains a bunch of stats and samples of queries run against prod databases, how can i access that to run some queries? 
[23:25:23] 6operations, 10Analytics-EventLogging, 7Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200104 (10yuvipanda) 3NEW [23:26:44] (03CR) 10Dzahn: [C: 031] package_builder: fix dependency order for hooks [puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [23:27:13] (03CR) 10Dzahn: "eh just changed to +1 so others are not being surprised when it automerges when the dependency merges" [puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [23:28:28] (03CR) 10Dzahn: [C: 032] mariadb: lint fixes in role class [puppet] - 10https://gerrit.wikimedia.org/r/202645 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [23:29:13] (03PS2) 10Dzahn: mariadb: lint fixes in role class [puppet] - 10https://gerrit.wikimedia.org/r/202645 (https://phabricator.wikimedia.org/T93645) [23:30:39] YuviPanda: Deskana: Thanks guys! I'm glad you're handling this. [23:30:52] (03PS2) 10Dzahn: remove cp3001,cp3002 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/203272 (https://phabricator.wikimedia.org/T94215) [23:30:59] (03CR) 10Dzahn: [C: 032] remove cp3001,cp3002 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/203272 (https://phabricator.wikimedia.org/T94215) (owner: 10Dzahn) [23:32:55] (03PS2) 10Dzahn: decom cp3001,cp3002. keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/200222 (https://phabricator.wikimedia.org/T94215) [23:33:02] bearND: I cc’d you on the new bug as well [23:33:41] (03CR) 10Dzahn: [C: 032] decom cp3001,cp3002. keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/200222 (https://phabricator.wikimedia.org/T94215) (owner: 10Dzahn) [23:33:51] YuviPanda: yes, saw that. Thanks! 
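The gap YuviPanda spotted in the graphite render URL above can also be checked programmatically: with `format=json`, graphite's `/render` endpoint returns each target's `datapoints` as `[value, timestamp]` pairs, with `null` values where data is missing. A small sketch of gap detection over that payload — the sample data below is fabricated, only shaped like graphite's output:

```python
import json

def missing_ranges(datapoints):
    """Return (start_ts, end_ts) spans where a graphite series has null values.

    `datapoints` is the per-target list from /render?format=json:
    [[value_or_null, unix_timestamp], ...]
    """
    gaps, start = [], None
    for value, ts in datapoints:
        if value is None:
            if start is None:
                start = ts
            end = ts
        elif start is not None:
            gaps.append((start, end))
            start = None
    if start is not None:
        gaps.append((start, end))
    return gaps

# Fabricated response snippet shaped like graphite's JSON output:
payload = json.loads('[{"target": "eventlogging.schema.Foo.rate", '
                     '"datapoints": [[1.0, 100], [null, 160], [null, 220], [2.0, 280]]}]')
print(missing_ranges(payload[0]["datapoints"]))  # → [(160, 220)]
```

Running something like this against the two targets in the render URL would have pinned down the 24/3–7/4 hole filed as T95781 without eyeballing the graph.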
[23:40:06] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 1 failures [23:41:47] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [23:44:56] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 16 failures [23:46:06] nuria: milimetric puppet failures? [23:46:09] on EL [23:49:17] PROBLEM - puppet last run on mw1222 is CRITICAL Puppet has 1 failures [23:50:32] (03PS2) 10Gage: remove cp3001,cp3002 from hiera ipsec data [puppet] - 10https://gerrit.wikimedia.org/r/203273 (https://phabricator.wikimedia.org/T94215) (owner: 10Dzahn) [23:52:02] (03CR) 10Gage: [C: 032] remove cp3001,cp3002 from hiera ipsec data [puppet] - 10https://gerrit.wikimedia.org/r/203273 (https://phabricator.wikimedia.org/T94215) (owner: 10Dzahn) [23:52:37] oh gage, didnt want it for testing anymore? [23:53:20] (03PS3) 10Gage: IPsec: improved cipher selection [puppet] - 10https://gerrit.wikimedia.org/r/201135 [23:53:58] the test conditions changed when the dns entries went away, but i got what i needed :) [23:54:27] ipsec.conf template is populated based on dns [23:54:55] ooh, i can easily add it back if you like [23:54:57] (03CR) 10Gage: [C: 032] IPsec: improved cipher selection [puppet] - 10https://gerrit.wikimedia.org/r/201135 (owner: 10Gage) [23:55:08] no need, thanks though [23:55:10] btw "Do not include null cipher" sounds good [23:55:14] ok [23:55:28] hehe [23:55:57] the syntax is confusing, the documentation said the format was encryption-identity-pseudorandomfunction-dh [23:56:11] yea, i saw that part where he said the docs are wrong [23:56:12] but in fact it allows multiple encryption algos [23:56:16] and then it adds null cipher.. wth?:) [23:56:28] yeah, not what i intended! [23:56:36] yay for review from strongswan maintainer [23:56:50] yes! 
that's cool [23:57:29] username "ecdsa", you can tell he is into crypot [23:57:34] crypto [23:57:42] heh yeah [23:58:00] 6operations, 10Analytics-EventLogging, 7Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200234 (10Nuria) This was caused by the migration and we corrected the problem already. Please see: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-Event... [23:58:36] 6operations, 10Analytics-EventLogging, 7Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200245 (10Nuria) Closing ticket, let me know if something needs to happen here additionally. [23:58:44] 6operations, 10Analytics-EventLogging, 7Graphite: EL graphite data missing from 24/3 to 7/4 - https://phabricator.wikimedia.org/T95781#1200246 (10Nuria) 5Open>3Resolved