[00:38:43] what's with the new audio players on mediawiki sites? [00:38:57] s/mediawiki/wikimedia [00:40:58] YairRand: http://blog.wikimedia.org/2012/11/08/introducing-wikipedias-new-html5-video-player/ [00:41:08] problem? [00:41:52] it's really not helpful for wiktionary's audio samples to have "Timed text" buttons [00:42:55] and the audio players are huge compared to the old buttons [00:44:34] I suppose the only way to fix this is to throw piles of css in common.css at it? (this display:none, that display:none, shrink width with probably-soon-to-break pixel measurements...) [00:45:13] or is there maybe a way to just disable the new stuff? [00:55:51] the new player also appears to be advertising a link to what appears to be a commercial website [00:56:17] !log on fenari: killed stuck git processes run by ExtensionDistributor and LocalisationUpdate [00:56:23] Logged the message, Master [00:59:10] and it also crams a giant "audio" image into the settings box [00:59:25] and wastes most of the space in it anyway [01:00:28] and the "Credits" button doesn't seem to even do anything [01:00:50] and the settings box gets placed underneath other players [01:01:24] and sometimes it doesn't even play the audio [01:03:55] and the line-height looks ridiculous [01:04:48] and the volume settings takes up way more space than the old one [01:11:30] conclusion: new audio player is bad stuff for wiktionary [01:12:51] can anything be done about this? [01:15:35] I would suggest filing a bug [01:15:42] But it appears BZ is broken at the moment [01:15:59] erk [01:32:48] YairRand, BZ is back [01:33:04] ah, thanks. I'll go file that bug. [01:47:35] pgehres: are you guys done w/yr deploy? do you mind if i sync a js file? [01:47:51] we are long done, all yours [01:48:03] thanks [03:17:32] is TMH meant to play videos when you click the "play" button, or just lock up the browser with 100% CPU? [03:17:44] because for me it just does the latter [03:21:03] :\ [03:31:42] argh [03:31:46] it fixed itself [03:32:05] I don't know why I still use this buggy browser anyway [03:41:59] TimStarling I can explain [03:42:16] its a bug in flash, it sometimes happens - for most rarely [03:42:46] was it full screen? [03:43:11] it's not using flash [03:43:55] hmm [03:45:05] no clue then [12:50:22] could someone take a look at why the WantedCategories special page is not being refreshed in the past week? [12:50:29] please [13:00:47] lunch [14:07:15] <^Mike> Does domas still help out with database & performance work at WMF? [14:55:11] andre__: that table is delicious, thanks [17:47:07] ^Mike: yes [17:51:23] <^Mike> jeremyb: Yes... about domas? [18:01:02] ^Mike: yes [18:01:12] <^Mike> cool, thanks [18:35:54] bummer, logs permissions seem messed up: [18:35:59] ForbiddenYou don't have permission to access /~petrb/logs/#wikimedia-tech/ on this server. [18:36:03] http://bots.wmflabs.org/~petrb/logs/%23wikimedia-tech/ [18:39:24] Reedy: we never worked out if we want to shift the Monday deploy to Tuesday [19:56:30] robla: I'll probably just do it as normal on monday [20:17:57] hi notpeter - still good to chat in about 15 min? [20:38:00] notpeter: ping [20:38:12] hey [20:38:23] it is time, isnt it [20:38:50] is now good, notpeter? [20:39:37] give me 5, and I'm all yours [20:40:04] okay! 
[20:46:19] so notpeter I am interviewing you so I can blog about Ops for the tech blog [20:46:33] well, the WMF blog, tagged/categorized appropriately :) [20:46:41] ok, cool [20:46:52] notpeter: my assumption is that these questions make sense: [20:46:53] Wikimedia Operations is coming out of a massive hole of technical debt that few Wikimedians really appreciate. Two years ago, what kind of horrifying fragility were we subject to, and how did that affect us? Now, what improvements in service quality have we seen, and how did we get here? And how far do we have to go, and what will necessarily suffer in the interim? [20:47:29] sure, so let's start with when I started and what it was like [20:47:48] I started here in march 2011 [20:47:52] so about 20 months ago [20:47:58] right [20:48:03] this was at the beginning of a big hiring push for the foundation [20:48:04] around the same time as me actually! [20:48:07] ha, yeah [20:48:07] yep! [20:48:30] when I got here, Ithink the ops team consisted of about 6 people [20:48:55] which is completely amazing [20:48:56] supporting basically the 5th highest traffic site on the internet? [20:48:59] yeah [20:49:10] what are the numbers for similar commercial sites? [20:49:15] like, 300 Ops people? [20:49:33] certainly orders of magnitude more headcount [20:49:48] I'd say at least hundreds. probably clsoer to thousands for google [20:49:51] yeah [20:50:58] but, this also meant that out of the fast/cheap/well triangle, we'd gone with fast and cheap [20:51:21] it was kinda like many many layers of really artfully applied duct tape [20:51:24] "fast" here meaning that we got quick-and-dirty solutions because problems had to be solved NOW? [20:51:30] yeah [20:51:46] with so few ops engineers, you're always playing catchup [20:51:52] long-term is hard [20:52:08] was there any kind of master architectural plan? [20:52:14] as of 2011? [20:52:14] oh, definitely [20:52:22] like, with strategic goals? [20:52:31] at least technical goals [20:52:44] for example, when I got here, we were just starting our move into eqiad [20:52:51] which was a huge goal [20:53:04] I odn't think anyone really appreciated how big of a project that was going to be [20:53:29] our first problem was that, for example, very little was in puppet [20:53:32] we had never set up a data center of that complexity from scratch before [20:53:34] yeah [20:53:41] right? we had grown Tampa [20:53:48] it was layers of duct tape that had been biult up over years [20:53:51] yeah [20:54:05] (I realize this is an interview, so I'll interject this and then shut up: IIRC late 2010 / early 2011 was pretty much when we started to actually have good planning in that regard) [20:54:06] Ori had a line: "grown like mold on a shower curtain" re some piece of the codebase [20:54:09] and our amsterdam center is just so much shorter [20:54:19] RoanKattouw: we're doing this in public so we can get corrected by folks like you! [20:54:26] :) [20:54:27] "shorter" notpeter? [20:54:28] sumanah: hahahah [20:54:31] not sure what that means [20:54:44] sorry, smaller [20:54:45] Just providing some context regarding things that just predated notpeter [20:54:49] sure [20:55:03] it's just a caching center, which has so few moving parts [20:55:22] like a short stack of pancakes instead of a regular stack, eh? :) [20:55:28] hahah, yes. 
[20:55:32] (or of stroopwafels since it's in NL) [20:55:35] my fingers didn't get that one quite right [20:55:37] no prob [20:55:44] I thought it was Teh l33t Lingo [20:55:57] OK, so, very little was in puppet [20:56:11] so, when I got here, for example, our puppet manifests for our databases were a file that just had a comment that said "domas is a slacker" [20:56:17] hahaha [20:56:17] and leveraging the power of puppet to keep our configurations in sync was kind of key to making 2 real data centers work, right? [20:56:29] our whole search infrastructure existed outside of puppet control [20:56:36] notpeter: you're not kidding are you -- is there any way I can link to that file? [20:56:40] that version of it [20:56:48] sumanah: I can look :) [20:57:19] sumanah: yeah, I mean, we could have created a whole tappestry of duct tape, but this was also an oppertunity to do things right [20:57:26] automate [20:57:38] what were we actually doing to keep track of config changes? people just did !log every time and you checked that when troubleshooting? [20:57:39] so that when we have a third data center it will be a much much shorter process [20:57:43] YES [20:58:09] there was a lot of !log [20:58:20] Yup, pretty much [20:58:22] RoanKattouw: when you say that we started to get good planning in 2010/2011, can you give me something to link to? [20:58:31] but also, when the ops team is only 6 people, and everyone has been around for years, every just knows all the parts [20:58:47] sumanah: Perhaps some of the communication regarding the eqiad build-out? I think that was getting off the ground in late '10 [20:59:07] As in, us negotiating with the company and getting the space, announcing we got it, etc. [20:59:08] RoanKattouw: you know the Ops docs landscape far better than I do [20:59:39] like, whether this would have been on public mailing lists, blog posts, wikitech.wikimedia.org, private lists so I can't link to stuff, internal conversation lost in the mists of IRC logs, etc [20:59:48] I basically consider the dedication of a ~$3.3M budget to the eqiad build-out the "start of proper ops planning", but that's my opinion of where that milestone belongs, others may have different insights [20:59:56] Right [20:59:59] RoanKattouw: and when was that? [21:00:06] FY 2010-2011? [21:00:08] Well there was definitely public communication about this [21:00:12] Yeah so probably the FY 10-11 budget [21:00:24] notpeter: was the set of services that Ops supports smaller when you arrived? [21:00:28] Would have included a line item for significant spending on the new data center [21:00:37] in terms of the number of services and customers? [21:00:42] of course Labs is a huge new component [21:01:16] yeah, there was no labs [21:01:22] I'm looking at https://meta.wikimedia.org/wiki/Wikimedia_services [21:01:30] but also, ops was less able to give support to many teams [21:01:43] for example, fundraising just had a couple of boxes [21:01:50] and could do whatever they wanted on them [21:02:15] as opposed to now where jeff is working on making an awesome, pci compliant system with them full time [21:02:40] or analytics was very... independent/unsupported [21:02:42] Right, Jeff is a hybrid Ops/Fundraising person, right? embedded in 1, reporting sorta to another [21:02:59] what's the Analytics Ops support now, for contrast? 
[21:03:00] because there was so little human-hours to give to supporting things that weren't just keeping the site up [21:03:25] yeah, jeff and andrew otto are both kinda on ops, kinda on another team [21:04:07] (org chart will show them differently, though, as andrew's very much on analytics, and jeff is technically on ops) [21:04:41] but so, I think that the eqiad buildout is very demonstrative of the amount of debt that ops was in [21:04:42] right, and so Jeff started out with 100% all the Ops privileges, and Andrew sort of had to get them over time? [21:04:46] yeah [21:04:50] ok [21:04:57] we bought a bunch of boxes [21:05:13] I'm also curious about whether you think that kind of embedded cross-team approach is the future of Ops in some sense, but that can wait, it's not crucial to this story [21:05:21] and we're still going through the process of trying to set everything up in a nice, clean, automated way [21:05:24] right [21:05:45] oh, hhhmmmm... hopefully not, tbh [21:06:09] I would hope that we get to a point where ops has enough human-hours that we don't let anything fall through the cracks [21:06:19] which I thinkthe embedded cross-team thing solves [21:06:24] I mean, we'll all specialize in stuff [21:06:27] as happens [21:06:50] or, well, eventually ops will split if/when the org continues to grow [21:06:55] it has to, in some way [21:07:12] and I think that will determine how ops interacts with/serves other depts [21:07:45] (btw is there a central list of which Ops people specialize in what?) [21:07:59] (I don't think so, tbh) [21:08:06] (OK.) [21:08:13] Anyway, back to the contrast between early 2011 and now [21:08:25] I think another thing that illustrates our growth and maturity is our downtime [21:08:29] so you wanted to get all these boxes configured in a robust and repeatable way [21:08:37] oh yes! i wanted to hear about uptime figures [21:08:56] I'll have to hunt them up after this [21:08:59] as I don't have them handy [21:09:06] I know they're better than when I started [21:09:20] RoanKattouw: maybe you know them offhand! [21:09:27] No, I don't [21:09:28] but, I think something that's less visible to people outside of ops is the kind of downtime we have [21:09:33] I know anecdotally things have gotten better [21:09:41] the KIND of downtime - I listen with eagerness! [21:09:51] for example, we no longer have much of the variety of "oops, bumbed that cable" [21:09:57] or "that one box died" [21:10:07] because things are much more robust now [21:10:11] much more redundant [21:10:13] Or, "the master DB server has a full disk" [21:10:14] how did we get there? [21:10:17] right [21:10:22] That one happened a few times a few years ago, and doesn't happen any more now [21:10:45] RoanKattouw: ah yes, the one this year was the *parser cache* db being full, right? which sucks but at least is not master [21:10:54] so, a lot of it is a product of the massive automation push we've been going through [21:11:04] Right, that one [21:11:08] which let's us create redundantly far more easily [21:11:20] and let's us spend our time not fighting fires [21:11:30] notpeter: so, multiple people spending lots of time every day working to refactor and automate and add monitoring/instrumentation for lots of services? [21:11:36] That was the investment, right? [21:11:52] yeah [21:12:03] mass puppetization of *everything* [21:12:03] tell me more about "create redundantly far more easily" - create redundancy? 
[21:12:08] not just the core components [21:12:37] for example, I've worked on search a lot [21:12:48] when I got here, nothing was in puppet [21:13:02] now, we have two fully independent search setups [21:13:06] one in each dc [21:13:22] failover takes a couple of minutes at most [21:13:35] when you say "everything" - name names! (the blog post will go better with, like, a list of like 7 things we puppetized, so, search, what else?) [21:13:38] Oh, I remember how for basically all wikis except enwiki, there was only one search server. Hurray SPOFs [21:13:57] and one dc could get wiped out in a hurricane and I could have another copy up and running in a day (assuming hardware) [21:15:27] <^demon|zzz> notpeter: Speaking of search, wikidatawiki still seems to have no index :\ [21:15:36] since I've been here: all of the DBs, search, fundraising, analytics, parsercache, uh..... probably more. although I'd have to look [21:16:12] ^demon|zzz: yes, I've spent some time looking at what's broken.... but the answer is "search" [21:17:06] <^demon|zzz> notpeter: Glad to know we've narrowed it down ;-) [21:17:16] ^demon|zzz: but yes, this is on my list [21:17:37] <^demon|zzz> Cool, thanks! [21:17:41] Lemme look at https://gerrit.wikimedia.org/r/#/q/status:merged+project:%255Eoperations.*,n,z .... PHP, Apaches, nginx, MediaWiki itself, Nagios, logging stuff like udp2log [21:17:59] MW puppetization is used in labs but not prod [21:18:21] nginx was puppetized from the start, I believe, because we started using it (for HTTPS termination) when we were already in the puppet push [21:18:27] logging stuff = analytics, essentially [21:18:37] yeah [21:18:45] Good call on Nagios [21:18:51] but, oh man, udp2log stuffs... wow, that was a big unsorting [21:18:54] Puppetizing monitoring is very important [21:18:54] yeah, nagios [21:18:55] ganglia, zuul, partman, memcached, pybal [21:19:02] zuul is new as well [21:19:12] The others are good calls [21:19:22] I think that memecache and pybal were fully puppetized when I got here [21:19:59] basically I should look at https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=manifests;h=40d0611729052ffa61e0890cdf9ae3d709521026;hb=HEAD [21:20:01] <^demon|zzz> I believe pybal was one of the earlier things to get puppetized. [21:20:05] which is not a short list [21:20:08] That's quite possible [21:20:09] <^demon|zzz> s/earlier/earliest/ [21:20:25] so, to get all buzzwordy for a sec, this was the ops dept going from sysadmin to dev-ops, as it's when we started treating our infrastructure as a codebase [21:20:26] I can also just look through the monthly reports to see what got puppetized when [21:20:30] memcached is fairly straightforward, and my impression of pybal is that it's easy if your name is Mark Bergsma and dark magic otherwise [21:20:34] sumanah: are you just looking for a list of things that are puppetized? [21:20:39] RoanKattouw: :) [21:21:00] * ori-l-away 's channel buffer doesn't go far enough [21:21:08] ori-l-away: for the purposes of writing a blog post about Ops's recent past and near future, I am hearing from notpeter about, among other things, the puppetization push [21:21:30] oh.. 
lemme see if i can help [21:21:45] ori-l: and so the narrative will include "since Peter got here, look at the giant list of configurations that had to go into puppet so we could actually do a second DC in a repeatable way" [21:21:54] but I think at this point that we're very nearly at a point where we can manage our whole infrastructure without needing to log into hosts [21:22:05] "since" meaning timewise, since Mar 2011, not in a causal way [21:22:13] which is the whole goal :) [21:22:16] yay! [21:22:28] notpeter: for the lay audience: why is logging into hosts a bad thing? [21:22:41] * sumanah has her guesses [21:22:42] because it means that you're doing things by hand [21:22:55] and/or that what you're doing isn't going through code review [21:23:25] moving to gerrit for our puppet repos, completely independent of labs, is awesome [21:23:34] it means I can really easily see what my coworkers are doing [21:23:40] I can ask for review when needed [21:23:50] it's a huge sign of maturation of our dept [21:24:06] previously, what was happening instead? [21:24:13] people just changed files and did a !log ? [21:24:17] maybe asked for code review sometimes? [21:24:39] well, when I got here, everything was done on a local svn repo or our puppetmaster [21:24:44] and then pushed out from there [21:24:53] which kinda works if you have 6 or fewer people [21:24:54] "puppetmaster" - basically a little git repo? [21:25:01] oh wait, you mean the master manifest [21:25:06] sumanah: here's a list of all the hostclasses currently defined in puppet: http://dpaste.de/CsPPJ/ there's no 1:1 correlation between pieces of software and hostclasses, but it's close enough [21:25:06] yeah [21:25:07] wait, I don't know what you mean [21:25:24] the box that everything talks to and asks it what puppet manifests it gets [21:25:37] and thus what packages/configs/etc it gets [21:25:44] It's essentially what controls everything based on the manifests [21:25:56] oh a puppetmaster is a kind of box [21:26:08] or a machine that is Master [21:26:09] ok [21:26:13] yeah [21:26:14] Basically [21:26:16] yeah, it's the one that all the other boxes ask "yo, what should I have on me" [21:26:29] ori-l: wow. each of those 725 lines is an artifact of blood, sweat, and tears [21:26:46] notpeter: I love that, as though the puppetmaster is a Gladwellian Influencer or Maven [21:27:13] it is our overlord, as the nagios box is our taskmaster :) [21:27:23] how long ago did we start using Nagios by the way [21:27:36] I think that was a danese era thing [21:27:50] and have we been steadily improving our usage of it somehow? I'm guessing yes (goes along with the robustness stuff) [21:27:53] mark said that before her, there was always someone online, so it worked well enough [21:28:05] kinda yes [21:28:17] although the work that asher has done creating profiling data is more useful [21:28:38] nagios is great for telling you when things are broken, and crap for telling you why [21:28:46] sumanah: Nagios was around when I started in '09, possibly even in '07 [21:28:59] when you say "profiling" what do you mean? like, "we conclude that over time this service usually does foo and right now it is doing bar and that's not a good sign"? [21:29:18] RoanKattouw: yes, you're right. 
mark was saying that pager duty didn't start until danese [21:29:32] like more ganglia statistics [21:29:32] sumanah: Profiling is the act of generating data on "how much time does large task X spend doing small subtask Y" [21:29:33] and graphite [21:30:17] so, ganglia.wikimedia.org is far more comprehensive at this point [21:30:23] so it generates information that you add to your engineering reflexes of "whoa that's not right" to make actionable wisdom? [21:30:32] and this https://graphite.wikimedia.org/dashboard/ [21:30:38] (need labs account to see it) [21:30:46] yeah [21:30:47] notpeter: what is the relationship among graphite, ganglia, and nagios? [21:30:49] The reason for that is that 1) one of those small Ys might actually be not so small, and be a problem, and 2) per the 80-20 rule, for some Ys optimization will have a larger impact, so you wanna find those [21:30:51] what is a subset of what :) [21:31:06] they're all different tools, tbh [21:31:12] nagios is purely for alerting [21:31:24] ganglia is more for perf data at a host level [21:31:31] graphite more for perf data at the application level [21:32:03] ok, so both ganglia & graphite might generate data that nagios then picks up to SMS someone? [21:32:17] Not exactly [21:32:20] at this point they're completely independent [21:32:38] Nagios checks are mostly behavioral checks, not always value-based or graph-based [21:32:39] nagios just checks things like "does port 80 return an http 301" [21:32:44] yeah [21:32:47] Some are, like the amount of free space, and some are not, like ... yeah the 301 [21:32:55] whereas nagios and graphite are time series data [21:33:02] *ganglia and graphite [21:33:04] er [21:33:05] yeah [21:33:07] that [21:33:08] sorry [21:33:13] :) [21:33:16] ok, so nagios is basically automated testing of our site that screams and sends up alarms when it fails [21:33:23] ye [21:33:23] s [21:33:27] Yeah they're really just graphs, they're passive. Nagios is active. [21:33:30] and very coarse testing at that [21:34:07] ok, so, we got Nagios more than 5 years ago, and when did we start using graphite & ganglia? [21:34:18] ganglia was in place when I got here [21:34:25] but it gets a lot more data [21:34:27] I assume that people sometimes casting an eye on the latter is also part of our reduced downtime [21:34:30] and ganglia is within the last year [21:34:38] yeah [21:34:41] better data, really [21:34:46] reduces downtime [21:35:05] *graphite was in the last year [21:35:12] "better data" - because we've instrumented better and chopped down to the things we care about? signals that proxy closely things we care about? [21:35:14] Ganglia is also older than I remember [21:35:17] god damnit [21:35:20] thank you roan :) [21:35:27] No worries :) [21:35:32] It's hard to keep it all straight [21:35:43] is there some kind of regular "let's check the dashboard to look for upcoming problems" process? 
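notpeter's description of Nagios just above (simple behavioral probes like "does port 80 return an http 301", with an alarm when they fail) maps onto very little code. Purely as an illustration, not WMF's actual check_http plugin or its configuration, and with the target host and expected status invented, a Nagios-style active check looks roughly like this in Python:

```python
#!/usr/bin/env python3
"""Toy Nagios-style active check: does port 80 return the expected HTTP status?

Illustration only; real deployments use the stock check_http plugin. The host
and expected status code below are hypothetical.
"""
import http.client
import sys

HOST = "www.example.org"       # hypothetical target
EXPECTED_STATUS = 301          # e.g. an expected HTTP->HTTPS redirect

def check(host, expected, timeout=10):
    try:
        conn = http.client.HTTPConnection(host, 80, timeout=timeout)
        conn.request("GET", "/")
        status = conn.getresponse().status
        conn.close()
    except (OSError, http.client.HTTPException) as exc:
        print("CRITICAL - connection failed: %s" % exc)
        return 2                                   # Nagios CRITICAL
    if status == expected:
        print("OK - port 80 returned %d as expected" % status)
        return 0                                   # Nagios OK
    print("CRITICAL - expected %d, got %d" % (expected, status))
    return 2

if __name__ == "__main__":
    sys.exit(check(HOST, EXPECTED_STATUS))
```

Nagios runs a check like this on a schedule and pages when the exit code goes non-zero, which is exactly the "great for telling you when things are broken, and crap for telling you why" behaviour notpeter described earlier: it knows the response is wrong, not what broke it.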
[21:35:59] Except if you're like me and you're intimately familiar with the mess of what lives where because you had to add SSL cert checks for everything [21:36:06] so, take for example: http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:36:15] sumanah: yes, CT pming us :) [21:36:24] on that page [21:36:36] all of the apache-specific data is new [21:36:46] we've always had the various bits of host data [21:36:53] liek free disk, load, etc [21:37:03] er, this onehttp://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&h=mw17.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:37:06] http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&h=mw17.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:37:21] Where is the Apache-specific data? I looked for the word "apache" and didn't see it [21:37:29] sorry, second link [21:37:34] like, requests per second [21:37:43] idle threads [21:37:44] etc [21:37:44] oh ok [21:37:56] so with that [21:38:17] Someone in Ops had to code up the instrumentation of Apache so we could get that data into ganglia? [21:38:22] one can look at it and make better deductions than just "yup, server's under a lot of load...." [21:38:30] oh wait, Apache is an app, so, graphite? [21:38:52] no, graphite is more for mediawiki [21:38:55] (kinda) [21:39:05] (there's a lot of overlap...) [21:39:33] so, there's a plugin for ganglia that does apache performance stats [21:39:44] it took me a couple of hours to set it all up [21:39:46] *but* [21:40:00] again, that's being forward thinking, debt that we had to work off [21:40:17] instead of just cursing ourselves when it wasn't there when we needed it [21:40:45] It's a massive undertaking to decide to do things The Right Way, set up a platform, instead of doing a million one-offs [21:40:51] look at how Kraken is progressing [21:40:55] yeah [21:41:02] sumanah: not that i'm anxious to re-live it, if you look at the wikitech archives from early june-ish sometime, i took down the app cluster by using clicktracking unscrupulously [21:41:11] and tim diagnosed the problem by looking at ganglia [21:41:12] and each of those things doesn't just take time to set up, it costs you time if you don't have it in place [21:41:54] ori-l: you weren't around during last year's fundraiser [21:42:59] sumanah: can we get an opportunity to edit your ops blog post before publishing? [21:43:21] ori-l: Brandon Harris linked to a Fundraiser Stats page in like a Quora or Reddit AMA or Twitter or something [21:43:21] and we had to kill it because of the load [21:43:21] I don't know how we figured it out, whether ganglia was part of it [21:43:23] but I totally get it, we need ganglia, how were we ever troubleshooting without it? [21:43:32] logging onto individual servers and doing a ps? [21:43:39] ganglia was not part of it [21:43:44] notpeter: ^ [21:43:49] wait, you were not around [21:43:51] pre-Ganglia [21:44:13] but as you say, we've improved our usage by doing things like taking the time to add that Apache perfdata plugin [21:45:01] binasher: mais oui, and of course you can correct me or anyone else here now as well [21:45:47] if someone in Ops wanted to write it they should feel free, I'm just interviewing y'all and writing it because I figure it'll get done faster that way [21:45:55] so yeah, a million little things that cost you time if you don't have them, but only take a little while to do [21:45:56] you have a lot on your plate [21:46:10] but so many of them! 
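The Apache metrics notpeter mentions above (requests per second, idle threads) are the kind of thing gmond collects through metric modules written in Python. The following is only a rough sketch of what such a plugin involves, not the module actually deployed at WMF; it assumes Apache's mod_status is enabled at /server-status and scrapes its machine-readable "?auto" output:

```python
"""Rough sketch of a gmond Python metric module for Apache mod_status.

Not the plugin actually used in production; the URL, metric names and
parameters here are assumptions for illustration.
"""
import urllib.request

STATUS_URL = "http://127.0.0.1/server-status?auto"   # assumes mod_status is enabled

def _scrape():
    """Fetch mod_status '?auto' output and return its fields as a dict."""
    fields = {}
    with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
    return fields

def req_per_sec(name):
    """Callback gmond invokes to read the current value of a metric."""
    return float(_scrape().get("ReqPerSec", 0))

def idle_workers(name):
    return int(_scrape().get("IdleWorkers", 0))

def metric_init(params):
    """gmond calls this once; it returns the metric descriptors."""
    common = {"time_max": 60, "slope": "both", "groups": "apache"}
    return [
        dict(common, name="apache_req_per_sec", call_back=req_per_sec,
             value_type="float", format="%f", units="req/s",
             description="Apache requests per second"),
        dict(common, name="apache_idle_workers", call_back=idle_workers,
             value_type="uint", format="%u", units="workers",
             description="Idle Apache workers"),
    ]

def metric_cleanup():
    pass
```

gmond loads a module like this through a small .pyconf stanza and the values show up in the ganglia web frontend; each such plugin is one of the small, "couple of hours" investments notpeter describes paying down in advance rather than cursing its absence during an outage.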
and each one eating up a little bit of our time [21:46:16] right. I want to come back to the downtime thing [21:46:21] ok [21:46:26] when you said the kind of downtime changed [21:46:43] the old downtime was "oops" - bumped a cable, edited the wrong file [21:46:50] what's the new downtime? :) [21:46:59] and what will the future downtime be? ;) [21:47:27] well, oops, or "that single point of failure died" [21:47:46] outages are now more likely to be from dev errors, which is a good thing [21:47:53] I mean, oops is eternal. no matter how good your code review is, etc, people will make mistakes [21:48:12] but yeah, I was going to say new features [21:48:17] be it in ops land or dev land [21:48:29] notpeter: yeah, on review of the backscroll I should have said SPOF was the old downtime [21:48:47] notpeter: did you get a chance to merge the coredb module? [21:48:55] no [21:49:04] I want to wrap up because I have taken more of your time than I promised to. Last bit: And how far do we have to go, and what will necessarily suffer in the interim? [21:49:08] it needs to be massively rebased [21:49:12] will do that after interview [21:49:22] sounds good [21:50:22] we're very close to having things in a very good state. one where we can fully fail over to having eqiad be our primary DC and even spin up more datacenters dramatically more quickly [21:50:34] binasher: when you say "ganglia was not part of it" you mean figuring out that Fundraiser Stats load thing and disabling it? [21:50:40] the current plan is to fail over to eqiad after the fundraiser [21:51:04] sumanah: aye [21:51:06] which is dramatically late, but there was a dramatic misunderestimation of the amount of debt that we were in [21:51:23] * sumanah appreciates usage of "misunderestimation" [21:52:11] going from one DC to two is very hard. two to three will be orders of magnitude easier [21:52:39] Can I give readers a guest login to https://graphite.wikimedia.org/dashboard/ ? [21:52:51] no. [21:52:52] nope. there's potentially sensitive data in there [21:53:49] as you mention, another part of the maturing organization is better transparency to your coworkers & better ability to contribute and ask for code review [21:53:55] via Gerrit [21:54:02] <^demon|zzz> sumanah: Also, that uses LDAP...so it would also grant them guest access to Gerrit, labsconsole, and half a dozen other things. [21:54:09] <^demon|zzz> (Even if graphite alone was ok to give out) [21:54:53] so, is there a separate permission for seeing the graphite dashboard, or is it "anyone with an LDAP account"? [21:55:08] does that mean that anyone with gerrit access can view it? [21:55:52] notpeter: following up on that: I can appreciate that until the eqiad dc is up and the failover works, a lot of other stuff is going to have to be on the back burner [21:56:02] <^demon|zzz> I don't know if everyone in LDAP can access it. But I know everyone who can access it can also use Gerrit. [21:56:17] notpeter: is that fair to say? [21:57:00] kinda yes, kinda no. I think that until it's done we can only give maintenance amounts of time/energy to many things [21:57:21] so, I guess yes :) [21:57:23] notpeter: but after that .... what's the next challenge going to be for Ops? more community collaboration/transparency/mentorship? Labs? another service that you're waiting to build out? any other big x-to-y switches on the horizon? 
[21:58:49] RoanKattouw: binasher: I'll be drafting this in https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts starting once the interview is over so it'll be very easy to correct me on stuff :D [21:59:14] thanks [21:59:15] heh, you should ask an architect :) now that you ask, I realize that I'm too immersed in the day to day to really know the answer to that [21:59:22] HEY ASHER [21:59:25] ARCHITECT ASHER [21:59:36] :) [22:00:16] well, mark/ct/erik might be a more suited architect, tbh [22:00:44] btw, we need to convince asher to have user dbs in the labs replication dataserver :) [22:00:53] I mean, I hope the asnwer to "where is this all going?" is "mars" [22:01:29] you may need to change linux distribution before deploying there ;) [22:01:32] I am not eager to try to take any of Bergsma's or Woo's time - I've already taken an hour+ of yours - but I'll see what I can do [22:01:35] Platonides: you'll probably get federated versions at least [22:01:52] notpeter: Cooling the servers would be a lot easier, but the dust might be a problem [22:02:03] notpeter: That and the latency problems. But we can deal with that. [22:02:04] ok, thanks to all of you for your time on this, I really appreciate it [22:02:19] sumanah: definitely! [22:02:27] marktraceur: I look forward to solving these problems :) [22:02:28] I haven't used the federated engine [22:02:33] it could work [22:02:34] the lay readers are also going to be really grateful to see this glimpse behind the scenes, the donors will be glad to know what their money is paying for, as always [22:02:44] totally! [22:02:50] Platonides: mariadb comes with a newer version called "federatedX" [22:02:55] and your colleagues in all departments might also appreciate seeing what you've been laboring to chip away at [22:03:05] so, thanks for your time. [22:03:13] will it be using maria? [22:03:30] it won't perform as well as if you actually had userdb's in the core labsdb instances, but we should see if it works well enough [22:03:35] yeah [22:03:41] most definitely [22:03:58] <^Mike> maria always makes me hungry... marinara sauce over pasta, mmmmm :P [22:04:19] long term, production will probably use mariadb as well, though we don't have a target date [22:04:30] it is currently using oracle mysql, right? [22:04:51] with facebook's patchset [22:04:51] (probably with added patches) [22:04:56] yup [22:08:29] sumanah: the challenges of making mediawiki scale aren't going to go away any time soon, nor will the need for incremental architecture modernization at multiple levels. i don't think we really have challenges that end, to be replaced by new ones. there's a continuing arc of refinement in operations. [22:09:39] binasher: not to be picayune, but surely we can find milestones and goals, right? [22:10:48] there's a lot of incremental improvement but we also have big pushes in our past (like, the push to puppetize so we can replicate data centers more easily) so I presume we might have similarly big goal-oriented pushes in our future [22:10:53] sumanah: there is an operations roadmap [22:12:15] * sumanah searches for operations roadmap https://wikitech.wikimedia.org/index.php?search=roadmap&title=Special%3ASearch [22:12:22] binasher: I don't know where to look for it. help? 
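Back to the FederatedX exchange a few lines up: a FEDERATED table (FederatedX in MariaDB) is a local table definition whose rows actually live on another MySQL/MariaDB server, which is why binasher expects it to work but perform worse than genuinely local user databases. A hedged sketch, with every host, credential, schema and table name invented for illustration (and no claim that this is how labsdb was actually built), assuming the PyMySQL client library is available:

```python
"""Sketch: exposing a user-database table through the FEDERATED/FederatedX engine.

All names, hosts and credentials below are made up; this is not the actual
labsdb replication setup under discussion.
"""
import pymysql  # assumes the PyMySQL client library is installed

# The local definition mirrors the remote table's columns; the CONNECTION
# string tells the engine where the rows really live.
DDL = """
CREATE TABLE user_notes (
    page_id INT UNSIGNED NOT NULL,
    note    TEXT
) ENGINE=FEDERATED
  CONNECTION='mysql://labsuser:secret@userdb.example.internal:3306/u_example/user_notes'
"""

conn = pymysql.connect(host="labsdb.example.internal",
                       user="labsuser", password="secret",
                       database="u_example")
try:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # Queries against the local table are forwarded to the remote server,
        # which is why it "won't perform as well" as a truly local userdb.
        cur.execute("SELECT COUNT(*) FROM user_notes")
        print(cur.fetchone()[0])
finally:
    conn.close()
```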
[22:13:02] I know about https://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals and I know about the immediate "next month or 2" roadmap on mediawiki.org [22:13:03] i don't know if it's public, woosters ^^^ [22:13:59] there's also the 2013 techops budget, which i only have in the form of a google spreadsheet, but it contains a project tab [22:14:29] i.e. the west coast datacenter is one "what's next" [22:15:08] but much of what we do does not fit into that paradigm [22:15:21] certainly [22:15:28] yeah, lots of what I do is gardening too [22:17:32] and part of the narrative is getting better at the gardening [22:43:14] aah gardening – https://meta.wikimedia.org/wiki/Listening_to_our_garden <-- sumanah [22:43:17] hi Betacommand [22:43:48] Nemo_bis: hay, looking into rewriting your dumper using pywiki [22:44:44] Nemo_bis: nod, nod [22:49:42] Nemo_bis: should be fairly easy, just need to break down the XML format into something I can understand/create a template for [22:50:18] Betacommand: really? [22:50:30] I mean, that shouldn't be needed ^^' [22:50:52] Nemo_bis: pywikipedia already has all the functions needed to extract the data [22:50:52] anyway https://meta.wikimedia.org/wiki/Listening_to_our_garden [22:51:03] Betacommand: I wouldn't be *SO* sure [22:51:10] ehm wrong link [22:51:17] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-November/000630.html [22:51:39] what about that? [22:51:56] someone else already asked about the format, so you might be interested, dunno [22:52:15] Nemo_bis: that was me [22:52:16] anyway basically you have to take whatever is among and concatenate it [22:52:39] Betacommand: *facepalm* [22:52:54] Nemo_bis: when working with the API and python I dont use XML I use json [22:53:26] XML is evil [22:53:50] json arrays are so much neater [22:54:41] hmm [22:54:52] but then how do you get the XML [22:55:17] why reconstruct it when you can get Special:Export or whatever produce it for you? [22:55:34] (all the rest can be done with JSON if one prefers so) [22:57:40] Nemo_bis: when working with large queries/spanning multiple requests and/or functions that dont use the api I wont get XML [22:58:28] its far easier to do a page.get() and not have to focus on whether its screenscraping or using the API [22:58:59] then throw page_text and other data into the XML file [23:18:09] binasher: so K4-713 has seen increases in bits response times [23:18:15] what pictures are both of you looking at? [23:18:39] Well, last night, we were getting reports in from everywhere of CSS not loading at all... [23:19:07] er.. did someone open a ticket or email ops about that? [23:19:07] uuuuuuhhhhh, ok [23:19:11] ...and I traced it back to the skin not loading all the way. [23:19:26] By the time I was about ready to tell anybody about it, I could no longer reproduce. [23:19:37] time frames? [23:20:05] Let me see... it was somewhere around 6pm PST. I can get more precise if I look at my chatlogs. [23:20:56] * K4-713 is looking [23:21:46] Yeah, I've got 6:53pm PST for the first I'd heard of a problem. [23:22:02] where were they coming from? [23:22:37] Looking for that too... [23:24:12] Seemed to be https://bits.wikimedia.org/donate.wikimedia.org/load.php?debug=false&lang=en&modules=site&only=styles&skin=vector&* [23:24:33] the bits monitor in watchmouse hasn't had any errors in the last several days, and performance has been steady [23:26:00] That's strange. 
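Picking up Betacommand and Nemo_bis's exchange above about rewriting the dumper: the JSON-from-the-API approach Betacommand prefers looks roughly like the following, shown here with only the Python standard library rather than pywikipedia (whose page.get() wraps this kind of request); the wiki and page title are arbitrary examples:

```python
"""The JSON-over-the-API approach described above, using only the standard
library. The wiki and page title are arbitrary examples; pywikipedia's
page.get() hides this kind of request behind a method call."""
import json
import urllib.parse
import urllib.request

API = "https://en.wiktionary.org/w/api.php"

def fetch_wikitext(title):
    """Return the current wikitext of a page via action=query, format=json."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    })
    req = urllib.request.Request(API + "?" + params,
                                 headers={"User-Agent": "dumper-rewrite-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # Results are keyed by page id; take the only page and its latest revision.
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

if __name__ == "__main__":
    print(fetch_wikitext("water")[:300])
```

If the goal is dump-format output rather than reassembling the XML by hand, Special:Export will still hand back the export XML with the page text inside it, which is the shortcut Nemo_bis is pointing at.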
[23:26:26] so, I finished upgrading bits caches at around 1640 pst yesterday [23:26:54] so the timing suggests that what you saw was not due to upgrade of bits caches [23:27:33] hm. I had heard rumblings of people seeing issues earlier in the day. but they weren't very specific about the time, unfortunately. [23:27:39] er, the day before [23:27:45] sorry [23:27:58] utc-based confusions [23:28:22] K4-713: is there a bugzilla ticket? [23:29:22] I didn't make one yet. [23:29:42] ah, users often make them [23:30:01] Wasn't even sure bugzilla was the appropriate place for stuff like that. [23:30:08] Usually, donors having problems just email us. [23:30:27] I think we received a few, actually. [23:32:40] binasher: Actually, I just checked. We received no emails that were clearly about missing css, but apparently they are usually a bit difficult to decipher. [23:33:00] How should I surface this stuff in the future? [23:33:39] donors.. hmm was this on fundraising pages vs. wikipedia? i've been looking for anything in the enwiki village pump [23:33:52] possibly? [23:34:15] Sorry: I'm usually at least one layer down from donor contact. [23:35:00] Which means that I generally don't hear about intermittent things until they've nearly gone away and I can't live-diagnose. [23:35:02] bits issues have more often been mediawiki issues than ops/varnish issues [23:35:44] which affects reporting, but it can be difficult to know one way or another [23:36:30] but emailing ops@ or ops-request@ with as much information as possible would be good [23:37:01] I can tell you one thing, though, that might not matter: I was WFH yesterday, and at home I'm running IPv6. When I was trying to get donate css to break, the one time I had the firebug net tab open, the URL that should contain the missing css was the only page subrequest that resolved to an IPv4 address. [23:37:36] All the other bits addresses (and everything else) were... fine. [23:38:10] that's weird [23:38:27] But, I was on IRC with people in the office, and they were hard-reloading and seeing the same symptoms. [23:38:41] ...only, obviously, all IPv4. [23:38:55] But, yeah: That was strange. [23:39:34] was everyone hard-reloading a specific page? [23:39:49] No, I think we were all on different donate pages. [23:40:56] HA. I just got it to misbehave. [23:41:07] ...it's still going on. [23:45:48] binasher: I'm getting some odd status codes from bits, intermittently. [23:45:54] Like, 301 and 302. [23:45:56] Same page. [23:47:38] what page? [23:48:45] Well, hard-reloading donate.wikimedia.org... but actually they seem to sort of move around amongst the subrequests. [23:49:07] https://bits.wikimedia.org/donate.wikimedia.org/load.php?debug=false&lang=en&modules=ext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.vector&only=styles&skin=vector&* [23:49:10] That one did it a couple times. [23:49:35] returned a 301? [23:49:42] Yes. [23:49:52] what to? [23:49:52] But, not all the time. Which is strange. [23:50:26] can you grab a response header from one of these? [23:50:46] Gah, I got excited and blew away my data. Trying to catch a new one. [23:51:36] K4-713: if varnish returns a 301, it's because mediawiki returned a 301 [23:52:36] K4-713: i.e. 
http://pastebin.mozilla.org/1930602 [23:54:38] K4-713: i found the problem, it's in mediawiki [23:54:39] just a sec [23:55:10] * RoanKattouw looks up [23:55:48] binasher: If you found an MW problem, especially in load.php / ResourceLoader, I'd be interested to hear about it [23:55:58] <^demon|zzz> RoanKattouw: You're not looking up, you're looking forward. [23:56:03] hehe [23:56:40] K4-713: http://pastebin.mozilla.org/1930627 --- if mw returns different responses based on the presence of X-Forwarded-Proto, it has to be varied on [23:56:58] Hmm [23:57:41] "Hmm" indeed! [23:57:46] That's a bug in the HTTPS forcing code [23:57:46] and yeah, if a browser gets a 301 that points back to the exact same url (with the https) as what it requested, it isn't going to request it in a loop [23:57:57] Good find [23:58:08] * RoanKattouw walks to Chris's cubicle [23:58:25] But, why is it so intermittent, I wonder? [23:58:38] * marktraceur is very amused to watch events announced on IRC unfold in front of him [23:59:05] K4-713: because the cache control is short and there are multiple bits servers [23:59:48] K4-713: how are those donate css files getting requested without https? [23:59:56] its probably a rarer thing
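binasher's pastebins show the same load.php URL answering differently depending on whether X-Forwarded-Proto is present, without that header appearing in Vary, so the bits caches could hand an HTTPS client a stale 301 pointing at the exact URL it had just requested. As a generic illustration of how one might look for that kind of mismatch from a client (in WMF's real setup the header is normally added by the SSL terminator, so probing from outside will not reproduce the internal requests exactly; the URL is the one from the discussion above):

```python
"""Probe the same URL with and without X-Forwarded-Proto and compare the
responses: a generic way to spot content that changes with a request header
that is not listed in Vary. Standard library only."""
import http.client
import urllib.parse

URL = ("https://bits.wikimedia.org/donate.wikimedia.org/load.php"
       "?debug=false&lang=en&modules=site&only=styles&skin=vector&*")

def probe(extra_headers):
    parts = urllib.parse.urlsplit(URL)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    headers = {"User-Agent": "vary-probe-sketch/0.1"}
    headers.update(extra_headers)
    conn.request("GET", parts.path + "?" + parts.query, headers=headers)
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection closes cleanly
    info = (resp.status, resp.getheader("Location"), resp.getheader("Vary"))
    conn.close()
    return info

without_xfp = probe({})
with_xfp = probe({"X-Forwarded-Proto": "https"})

print("no header:   status=%s location=%s vary=%s" % without_xfp)
print("with header: status=%s location=%s vary=%s" % with_xfp)
# If the two statuses differ (e.g. 200 vs 301) but neither response lists
# X-Forwarded-Proto in Vary, a shared cache can serve the wrong variant,
# which matches the intermittent 301s K4-713 was seeing.
```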