[00:38:43] what's with the new audio players on mediawiki sites? [00:38:57] s/mediawiki/wikimedia [00:40:58] YairRand: http://blog.wikimedia.org/2012/11/08/introducing-wikipedias-new-html5-video-player/ [00:41:08] problem? [00:41:52] it's really not helpful for wiktionary's audio samples to have "Timed text" buttons [00:42:55] and the audio players are huge compared to the old buttons [00:44:34] I suppose the only way to fix this is to throw piles of css in common.css at it? (this display:none, that display:none, shrink width with probably-soon-to-break pixel measurements...) [00:45:13] or is there maybe a way to just disable the new stuff? [00:55:51] the new player also appears to be advertising a link to what appears to be a commercial website [00:56:17] !log on fenari: killed stuck git processes run by ExtensionDistributor and LocalisationUpdate [00:56:23] Logged the message, Master [00:59:10] and it also crams a giant "audio" image into the settings box [00:59:25] and wastes most of the space in it anyway [01:00:28] and the "Credits" button doesn't seem to even do anything [01:00:50] and the settings box gets placed underneath other players [01:01:24] and sometimes it doesn't even play the audio [01:03:55] and the line-height looks ridiculous [01:04:48] and the volume settings takes up way more space than the old one [01:11:30] conclusion: new audio player is bad stuff for wiktionary [01:12:51] can anything be done about this? [01:15:35] I would suggest filing a bug [01:15:42] But it appears BZ is broken at the moment [01:15:59] erk [01:32:48] YairRand, BZ is back [01:33:04] ah, thanks. I'll go file that bug. [01:47:35] pgehres: are you guys done w/yr deploy? do you mind if i sync a js file? [01:47:51] we are long done, all yours [01:48:03] thanks [03:17:32] is TMH meant to play videos when you click the "play" button, or just lock up the browser with 100% CPU? [03:17:44] because for me it just does the latter [03:21:03] :\ [03:31:42] argh [03:31:46] it fixed itself [03:32:05] I don't know why I still use this buggy browser anyway [03:41:59] TimStarling I can explain [03:42:16] its a bug in flash, it sometimes happens - for most rarely [03:42:46] was it full screen? [03:43:11] it's not using flash [03:43:55] hmm [03:45:05] no clue then [12:50:22] could someone take a look at why the WantedCategories special page is not being refreshed in the past week? [12:50:29] please [13:00:47] lunch [14:07:15] <^Mike> Does domas still help out with database & performance work at WMF? [14:55:11] andre__: that table is delicious, thanks [17:47:07] ^Mike: yes [17:51:23] <^Mike> jeremyb: Yes... about domas? [18:01:02] ^Mike: yes [18:01:12] <^Mike> cool, thanks [18:35:54] bummer, logs permissions seem messed up: [18:35:59] ForbiddenYou don't have permission to access /~petrb/logs/#wikimedia-tech/ on this server. [18:36:03] http://bots.wmflabs.org/~petrb/logs/%23wikimedia-tech/ [18:39:24] Reedy: we never worked out if we want to shift the Monday deploy to Tuesday [19:56:30] robla: I'll probably just do it as normal on monday [20:17:57] hi notpeter - still good to chat in about 15 min? [20:38:00] notpeter: ping [20:38:12] hey [20:38:23] it is time, isnt it [20:38:50] is now good, notpeter? [20:39:37] give me 5, and I'm all yours [20:40:04] okay! 
[20:46:19] so notpeter I am interviewing you so I can blog about Ops for the tech blog [20:46:33] well, the WMF blog, tagged/categorized appropriately :) [20:46:41] ok, cool [20:46:52] notpeter: my assumption is that these questions make sense: [20:46:53] Wikimedia Operations is coming out of a massive hole of technical debt that few Wikimedians really appreciate. Two years ago, what kind of horrifying fragility were we subject to, and how did that affect us? Now, what improvements in service quality have we seen, and how did we get here? And how far do we have to go, and what will necessarily suffer in the interim? [20:47:29] sure, so let's start with when I started and what it was like [20:47:48] I started here in march 2011 [20:47:52] so about 20 months ago [20:47:58] right [20:48:03] this was at the beginning of a big hiring push for the foundation [20:48:04] around the same time as me actually! [20:48:07] ha, yeah [20:48:07] yep! [20:48:30] when I got here, Ithink the ops team consisted of about 6 people [20:48:55] which is completely amazing [20:48:56] supporting basically the 5th highest traffic site on the internet? [20:48:59] yeah [20:49:10] what are the numbers for similar commercial sites? [20:49:15] like, 300 Ops people? [20:49:33] certainly orders of magnitude more headcount [20:49:48] I'd say at least hundreds. probably clsoer to thousands for google [20:49:51] yeah [20:50:58] but, this also meant that out of the fast/cheap/well triangle, we'd gone with fast and cheap [20:51:21] it was kinda like many many layers of really artfully applied duct tape [20:51:24] "fast" here meaning that we got quick-and-dirty solutions because problems had to be solved NOW? [20:51:30] yeah [20:51:46] with so few ops engineers, you're always playing catchup [20:51:52] long-term is hard [20:52:08] was there any kind of master architectural plan? [20:52:14] as of 2011? [20:52:14] oh, definitely [20:52:22] like, with strategic goals? [20:52:31] at least technical goals [20:52:44] for example, when I got here, we were just starting our move into eqiad [20:52:51] which was a huge goal [20:53:04] I odn't think anyone really appreciated how big of a project that was going to be [20:53:29] our first problem was that, for example, very little was in puppet [20:53:32] we had never set up a data center of that complexity from scratch before [20:53:34] yeah [20:53:41] right? we had grown Tampa [20:53:48] it was layers of duct tape that had been biult up over years [20:53:51] yeah [20:54:05] (I realize this is an interview, so I'll interject this and then shut up: IIRC late 2010 / early 2011 was pretty much when we started to actually have good planning in that regard) [20:54:06] Ori had a line: "grown like mold on a shower curtain" re some piece of the codebase [20:54:09] and our amsterdam center is just so much shorter [20:54:19] RoanKattouw: we're doing this in public so we can get corrected by folks like you! [20:54:26] :) [20:54:27] "shorter" notpeter? [20:54:28] sumanah: hahahah [20:54:31] not sure what that means [20:54:44] sorry, smaller [20:54:45] Just providing some context regarding things that just predated notpeter [20:54:49] sure [20:55:03] it's just a caching center, which has so few moving parts [20:55:22] like a short stack of pancakes instead of a regular stack, eh? :) [20:55:28] hahah, yes. 
[20:55:32] (or of stroopwafels since it's in NL) [20:55:35] my fingers didn't get that one quite right [20:55:37] no prob [20:55:44] I thought it was Teh l33t Lingo [20:55:57] OK, so, very little was in puppet [20:56:11] so, when I got here, for example, our puppet manifests for our databases were a file that just had a comment that said "domas is a slacker" [20:56:17] hahaha [20:56:17] and leveraging the power of puppet to keep our configurations in sync was kind of key to making 2 real data centers work, right? [20:56:29] our whole search infrastructure existed outside of puppet control [20:56:36] notpeter: you're not kidding are you -- is there any way I can link to that file? [20:56:40] that version of it [20:56:48] sumanah: I can look :) [20:57:19] sumanah: yeah, I mean, we could have created a whole tappestry of duct tape, but this was also an oppertunity to do things right [20:57:26] automate [20:57:38] what were we actually doing to keep track of config changes? people just did !log every time and you checked that when troubleshooting? [20:57:39] so that when we have a third data center it will be a much much shorter process [20:57:43] YES [20:58:09] there was a lot of !log [20:58:20] Yup, pretty much [20:58:22] RoanKattouw: when you say that we started to get good planning in 2010/2011, can you give me something to link to? [20:58:31] but also, when the ops team is only 6 people, and everyone has been around for years, every just knows all the parts [20:58:47] sumanah: Perhaps some of the communication regarding the eqiad build-out? I think that was getting off the ground in late '10 [20:59:07] As in, us negotiating with the company and getting the space, announcing we got it, etc. [20:59:08] RoanKattouw: you know the Ops docs landscape far better than I do [20:59:39] like, whether this would have been on public mailing lists, blog posts, wikitech.wikimedia.org, private lists so I can't link to stuff, internal conversation lost in the mists of IRC logs, etc [20:59:48] I basically consider the dedication of a ~$3.3M budget to the eqiad build-out the "start of proper ops planning", but that's my opinion of where that milestone belongs, others may have different insights [20:59:56] Right [20:59:59] RoanKattouw: and when was that? [21:00:06] FY 2010-2011? [21:00:08] Well there was definitely public communication about this [21:00:12] Yeah so probably the FY 10-11 budget [21:00:24] notpeter: was the set of services that Ops supports smaller when you arrived? [21:00:28] Would have included a line item for significant spending on the new data center [21:00:37] in terms of the number of services and customers? [21:00:42] of course Labs is a huge new component [21:01:16] yeah, there was no labs [21:01:22] I'm looking at https://meta.wikimedia.org/wiki/Wikimedia_services [21:01:30] but also, ops was less able to give support to many teams [21:01:43] for example, fundraising just had a couple of boxes [21:01:50] and could do whatever they wanted on them [21:02:15] as opposed to now where jeff is working on making an awesome, pci compliant system with them full time [21:02:40] or analytics was very... independent/unsupported [21:02:42] Right, Jeff is a hybrid Ops/Fundraising person, right? embedded in 1, reporting sorta to another [21:02:59] what's the Analytics Ops support now, for contrast? 
[21:03:00] because there was so little human-hours to give to supporting things that weren't just keeping the site up [21:03:25] yeah, jeff and andrew otto are both kinda on ops, kinda on another team [21:04:07] (org chart will show them differently, though, as andrew's very much on analytics, and jeff is technically on ops) [21:04:41] but so, I think that the eqiad buildout is very demonstrative of the amount of debt that ops was in [21:04:42] right, and so Jeff started out with 100% all the Ops privileges, and Andrew sort of had to get them over time? [21:04:46] yeah [21:04:50] ok [21:04:57] we bought a bunch of boxes [21:05:13] I'm also curious about whether you think that kind of embedded cross-team approach is the future of Ops in some sense, but that can wait, it's not crucial to this story [21:05:21] and we're still going through the process of trying to set everything up in a nice, clean, automated way [21:05:24] right [21:05:45] oh, hhhmmmm... hopefully not, tbh [21:06:09] I would hope that we get to a point where ops has enough human-hours that we don't let anything fall through the cracks [21:06:19] which I thinkthe embedded cross-team thing solves [21:06:24] I mean, we'll all specialize in stuff [21:06:27] as happens [21:06:50] or, well, eventually ops will split if/when the org continues to grow [21:06:55] it has to, in some way [21:07:12] and I think that will determine how ops interacts with/serves other depts [21:07:45] (btw is there a central list of which Ops people specialize in what?) [21:07:59] (I don't think so, tbh) [21:08:06] (OK.) [21:08:13] Anyway, back to the contrast between early 2011 and now [21:08:25] I think another thing that illustrates our growth and maturity is our downtime [21:08:29] so you wanted to get all these boxes configured in a robust and repeatable way [21:08:37] oh yes! i wanted to hear about uptime figures [21:08:56] I'll have to hunt them up after this [21:08:59] as I don't have them handy [21:09:06] I know they're better than when I started [21:09:20] RoanKattouw: maybe you know them offhand! [21:09:27] No, I don't [21:09:28] but, I think something that's less visible to people outside of ops is the kind of downtime we have [21:09:33] I know anecdotally things have gotten better [21:09:41] the KIND of downtime - I listen with eagerness! [21:09:51] for example, we no longer have much of the variety of "oops, bumbed that cable" [21:09:57] or "that one box died" [21:10:07] because things are much more robust now [21:10:11] much more redundant [21:10:13] Or, "the master DB server has a full disk" [21:10:14] how did we get there? [21:10:17] right [21:10:22] That one happened a few times a few years ago, and doesn't happen any more now [21:10:45] RoanKattouw: ah yes, the one this year was the *parser cache* db being full, right? which sucks but at least is not master [21:10:54] so, a lot of it is a product of the massive automation push we've been going through [21:11:04] Right, that one [21:11:08] which let's us create redundantly far more easily [21:11:20] and let's us spend our time not fighting fires [21:11:30] notpeter: so, multiple people spending lots of time every day working to refactor and automate and add monitoring/instrumentation for lots of services? [21:11:36] That was the investment, right? [21:11:52] yeah [21:12:03] mass puppetization of *everything* [21:12:03] tell me more about "create redundantly far more easily" - create redundancy? 
[21:12:08] not just the core components [21:12:37] for example, I've worked on search a lot [21:12:48] when I got here, nothing was in puppet [21:13:02] now, we have two fully independent search setups [21:13:06] one in each dc [21:13:22] failover takes a couple of minutes at most [21:13:35] when you say "everything" - name names! (the blog post will go better with, like, a list of like 7 things we puppetized, so, search, what else?) [21:13:38] Oh, I remember how for basically all wikis except enwiki, there was only one search server. Hurray SPOFs [21:13:57] and one dc could get wiped out in a hurricane and I could have another copy up and running in a day (assuming hardware) [21:15:27] <^demon|zzz> notpeter: Speaking of search, wikidatawiki still seems to have no index :\ [21:15:36] since I've been here: all of the DBs, search, fundraising, analytics, parsercache, uh..... probably more. although I'd have to look [21:16:12] ^demon|zzz: yes, I've spent some time looking at what's broken.... but the answer is "search" [21:17:06] <^demon|zzz> notpeter: Glad to know we've narrowed it down ;-) [21:17:16] ^demon|zzz: but yes, this is on my list [21:17:37] <^demon|zzz> Cool, thanks! [21:17:41] Lemme look at https://gerrit.wikimedia.org/r/#/q/status:merged+project:%255Eoperations.*,n,z .... PHP, Apaches, nginx, MediaWiki itself, Nagios, logging stuff like udp2log [21:17:59] MW puppetization is used in labs but not prod [21:18:21] nginx was puppetized from the start, I believe, because we started using it (for HTTPS termination) when we were already in the puppet push [21:18:27] logging stuff = analytics, essentially [21:18:37] yeah [21:18:45] Good call on Nagios [21:18:51] but, oh man, udp2log stuffs... wow, that was a big unsorting [21:18:54] Puppetizing monitoring is very important [21:18:54] yeah, nagios [21:18:55] ganglia, zuul, partman, memcached, pybal [21:19:02] zuul is new as well [21:19:12] The others are good calls [21:19:22] I think that memecache and pybal were fully puppetized when I got here [21:19:59] basically I should look at https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=manifests;h=40d0611729052ffa61e0890cdf9ae3d709521026;hb=HEAD [21:20:01] <^demon|zzz> I believe pybal was one of the earlier things to get puppetized. [21:20:05] which is not a short list [21:20:08] That's quite possible [21:20:09] <^demon|zzz> s/earlier/earliest/ [21:20:25] so, to get all buzzwordy for a sec, this was the ops dept going from sysadmin to dev-ops, as it's when we started treating our infrastructure as a codebase [21:20:26] I can also just look through the monthly reports to see what got puppetized when [21:20:30] memcached is fairly straightforward, and my impression of pybal is that it's easy if your name is Mark Bergsma and dark magic otherwise [21:20:34] sumanah: are you just looking for a list of things that are puppetized? [21:20:39] RoanKattouw: :) [21:21:00] * ori-l-away 's channel buffer doesn't go far enough [21:21:08] ori-l-away: for the purposes of writing a blog post about Ops's recent past and near future, I am hearing from notpeter about, among other things, the puppetization push [21:21:30] oh.. 
lemme see if i can help [21:21:45] ori-l: and so the narrative will include "since Peter got here, look at the giant list of configurations that had to go into puppet so we could actually do a second DC in a repeatable way" [21:21:54] but I think at this point that we're very nearly at a point where we can manage our whole infrastructure without needing to log into hosts [21:22:05] "since" meaning timewise, since Mar 2011, not in a causal way [21:22:13] which is the whole goal :) [21:22:16] yay! [21:22:28] notpeter: for the lay audience: why is logging into hosts a bad thing? [21:22:41] * sumanah has her guesses [21:22:42] because it means that you're doing things by hand [21:22:55] and/or that what you're doing isn't going through code review [21:23:25] moving to gerrit for our puppet repos, completely independent of labs, is awesome [21:23:34] it means I can really easily see what my coworkers are doing [21:23:40] I can ask for review when needed [21:23:50] it's a huge sign of maturation of our dept [21:24:06] previously, what was happening instead? [21:24:13] people just changed files and did a !log ? [21:24:17] maybe asked for code review sometimes? [21:24:39] well, when I got here, everything was done on a local svn repo or our puppetmaster [21:24:44] and then pushed out from there [21:24:53] which kinda works if you have 6 or fewer people [21:24:54] "puppetmaster" - basically a little git repo? [21:25:01] oh wait, you mean the master manifest [21:25:06] sumanah: here's a list of all the hostclasses currently defined in puppet: http://dpaste.de/CsPPJ/ there's no 1:1 correlation between pieces of software and hostclasses, but it's close enough [21:25:06] yeah [21:25:07] wait, I don't know what you mean [21:25:24] the box that everything talks to and asks it what puppet manifests it gets [21:25:37] and thus what packages/configs/etc it gets [21:25:44] It's essentially what controls everything based on the manifests [21:25:56] oh a puppetmaster is a kind of box [21:26:08] or a machine that is Master [21:26:09] ok [21:26:13] yeah [21:26:14] Basically [21:26:16] yeah, it's the one that all the other boxes ask "yo, what should I have on me" [21:26:29] ori-l: wow. each of those 725 lines is an artifact of blood, sweat, and tears [21:26:46] notpeter: I love that, as though the puppetmaster is a Gladwellian Influencer or Maven [21:27:13] it is our overlord, as the nagios box is our taskmaster :) [21:27:23] how long ago did we start using Nagios by the way [21:27:36] I think that was a danese era thing [21:27:50] and have we been steadily improving our usage of it somehow? I'm guessing yes (goes along with the robustness stuff) [21:27:53] mark said that before her, there was always someone online, so it worked well enough [21:28:05] kinda yes [21:28:17] although the work that asher has done creating profiling data is more useful [21:28:38] nagios is great for telling you when things are broken, and crap for telling you why [21:28:46] sumanah: Nagios was around when I started in '09, possibly even in '07 [21:28:59] when you say "profiling" what do you mean? like, "we conclude that over time this service usually does foo and right now it is doing bar and that's not a good sign"? [21:29:18] RoanKattouw: yes, you're right. 
mark was saying that pager duty didn't start until danese [21:29:32] like more ganglia statistics [21:29:32] sumanah: Profiling is the act of generating data on "how much time does large task X spend doing small subtask Y" [21:29:33] and graphite [21:30:17] so, ganglia.wikimedia.org is far more comprehensive at this point [21:30:23] so it generates information that you add to your engineering reflexes of "whoa that's not right" to make actionable wisdom? [21:30:32] and this https://graphite.wikimedia.org/dashboard/ [21:30:38] (need labs account to see it) [21:30:46] yeah [21:30:47] notpeter: what is the relationship among graphite, ganglia, and nagios? [21:30:49] The reason for that is that 1) one of those small Ys might actually be not so small, and be a problem, and 2) per the 80-20 rule, for some Ys optimization will have a larger impact, so you wanna find those [21:30:51] what is a subset of what :) [21:31:06] they're all different tools, tbh [21:31:12] nagios is purely for alerting [21:31:24] ganglia is more for perf data at a host level [21:31:31] graphite more for perf data at the application level [21:32:03] ok, so both ganglia & graphite might generate data that nagios then picks up to SMS someone? [21:32:17] Not exactly [21:32:20] at this point they're completely independent [21:32:38] Nagios checks are mostly behavioral checks, not always value-based or graph-based [21:32:39] nagios just checks things like "does port 80 return an http 301" [21:32:44] yeah [21:32:47] Some are, like the amount of free space, and some are not, like ... yeah the 301 [21:32:55] whereas nagios and graphite are time series data [21:33:02] *ganglia and graphite [21:33:04] er [21:33:05] yeah [21:33:07] that [21:33:08] sorry [21:33:13] :) [21:33:16] ok, so nagios is basically automated testing of our site that screams and sends up alarms when it fails [21:33:23] ye [21:33:23] s [21:33:27] Yeah they're really just graphs, they're passive. Nagios is active. [21:33:30] and very coarse testing at that [21:34:07] ok, so, we got Nagios more than 5 years ago, and when did we start using graphite & ganglia? [21:34:18] ganglia was in place when I got here [21:34:25] but it gets a lot more data [21:34:27] I assume that people sometimes casting an eye on the latter is also part of our reduced downtime [21:34:30] and ganglia is within the last year [21:34:38] yeah [21:34:41] better data, really [21:34:46] reduces downtime [21:35:05] *graphite was in the last year [21:35:12] "better data" - because we've instrumented better and chopped down to the things we care about? signals that proxy closely things we care about? [21:35:14] Ganglia is also older than I remember [21:35:17] god damnit [21:35:20] thank you roan :) [21:35:27] No worries :) [21:35:32] It's hard to keep it all straight [21:35:43] is there some kind of regular "let's check the dashboard to look for upcoming problems" process? 
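notpeter's description of Nagios just above (simple behavioral probes like "does port 80 return an http 301", with an alarm when they fail) maps onto very little code. Purely as an illustration, not WMF's actual check_http plugin or its configuration, and with the target host and expected status invented, a Nagios-style active check looks roughly like this in Python:

```python
#!/usr/bin/env python3
"""Toy Nagios-style active check: does port 80 return the expected HTTP status?

Illustration only; real deployments use the stock check_http plugin. The host
and expected status code below are hypothetical.
"""
import http.client
import sys

HOST = "www.example.org"       # hypothetical target
EXPECTED_STATUS = 301          # e.g. an expected HTTP->HTTPS redirect

def check(host, expected, timeout=10):
    try:
        conn = http.client.HTTPConnection(host, 80, timeout=timeout)
        conn.request("GET", "/")
        status = conn.getresponse().status
        conn.close()
    except (OSError, http.client.HTTPException) as exc:
        print("CRITICAL - connection failed: %s" % exc)
        return 2                                   # Nagios CRITICAL
    if status == expected:
        print("OK - port 80 returned %d as expected" % status)
        return 0                                   # Nagios OK
    print("CRITICAL - expected %d, got %d" % (expected, status))
    return 2

if __name__ == "__main__":
    sys.exit(check(HOST, EXPECTED_STATUS))
```

Nagios runs a check like this on a schedule and pages when the exit code goes non-zero, which is exactly the "great for telling you when things are broken, and crap for telling you why" behaviour notpeter described earlier: it knows the response is wrong, not what broke it.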
[21:35:59] Except if you're like me and you're intimately familiar with the mess of what lives where because you had to add SSL cert checks for everything [21:36:06] so, take for example: http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:36:15] sumanah: yes, CT pming us :) [21:36:24] on that page [21:36:36] all of the apache-specific data is new [21:36:46] we've always had the various bits of host data [21:36:53] liek free disk, load, etc [21:37:03] er, this onehttp://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&h=mw17.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:37:06] http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&h=mw17.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:37:21] Where is the Apache-specific data? I looked for the word "apache" and didn't see it [21:37:29] sorry, second link [21:37:34] like, requests per second [21:37:43] idle threads [21:37:44] etc [21:37:44] oh ok [21:37:56] so with that [21:38:17] Someone in Ops had to code up the instrumentation of Apache so we could get that data into ganglia? [21:38:22] one can look at it and make better deductions than just "yup, server's under a lot of load...." [21:38:30] oh wait, Apache is an app, so, graphite? [21:38:52] no, graphite is more for mediawiki [21:38:55] (kinda) [21:39:05] (there's a lot of overlap...) [21:39:33] so, there's a plugin for ganglia that does apache performance stats [21:39:44] it took me a couple of hours to set it all up [21:39:46] *but* [21:40:00] again, that's being forward thinking, debt that we had to work off [21:40:17] instead of just cursing ourselves when it wasn't there when we needed it [21:40:45] It's a massive undertaking to decide to do things The Right Way, set up a platform, instead of doing a million one-offs [21:40:51] look at how Kraken is progressing [21:40:55] yeah [21:41:02] sumanah: not that i'm anxious to re-live it, if you look at the wikitech archives from early june-ish sometime, i took down the app cluster by using clicktracking unscrupulously [21:41:11] and tim diagnosed the problem by looking at ganglia [21:41:12] and each of those things doesn't just take time to set up, it costs you time if you don't have it in place [21:41:54] ori-l: you weren't around during last year's fundraiser [21:42:59] sumanah: can we get an opportunity to edit your ops blog post before publishing? [21:43:21] ori-l: Brandon Harris linked to a Fundraiser Stats page in like a Quora or Reddit AMA or Twitter or something [21:43:21] and we had to kill it because of the load [21:43:21] I don't know how we figured it out, whether ganglia was part of it [21:43:23] but I totally get it, we need ganglia, how were we ever troubleshooting without it? [21:43:32] logging onto individual servers and doing a ps? [21:43:39] ganglia was not part of it [21:43:44] notpeter: ^ [21:43:49] wait, you were not around [21:43:51] pre-Ganglia [21:44:13] but as you say, we've improved our usage by doing things like taking the time to add that Apache perfdata plugin [21:45:01] binasher: mais oui, and of course you can correct me or anyone else here now as well [21:45:47] if someone in Ops wanted to write it they should feel free, I'm just interviewing y'all and writing it because I figure it'll get done faster that way [21:45:55] so yeah, a million little things that cost you time if you don't have them, but only take a little while to do [21:45:56] you have a lot on your plate [21:46:10] but so many of them! 
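The Apache metrics notpeter mentions above (requests per second, idle threads) are the kind of thing gmond collects through metric modules written in Python. The following is only a rough sketch of what such a plugin involves, not the module actually deployed at WMF; it assumes Apache's mod_status is enabled at /server-status and scrapes its machine-readable "?auto" output:

```python
"""Rough sketch of a gmond Python metric module for Apache mod_status.

Not the plugin actually used in production; the URL, metric names and
parameters here are assumptions for illustration.
"""
import urllib.request

STATUS_URL = "http://127.0.0.1/server-status?auto"   # assumes mod_status is enabled

def _scrape():
    """Fetch mod_status '?auto' output and return its fields as a dict."""
    fields = {}
    with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
    return fields

def req_per_sec(name):
    """Callback gmond invokes to read the current value of a metric."""
    return float(_scrape().get("ReqPerSec", 0))

def idle_workers(name):
    return int(_scrape().get("IdleWorkers", 0))

def metric_init(params):
    """gmond calls this once; it returns the metric descriptors."""
    common = {"time_max": 60, "slope": "both", "groups": "apache"}
    return [
        dict(common, name="apache_req_per_sec", call_back=req_per_sec,
             value_type="float", format="%f", units="req/s",
             description="Apache requests per second"),
        dict(common, name="apache_idle_workers", call_back=idle_workers,
             value_type="uint", format="%u", units="workers",
             description="Idle Apache workers"),
    ]

def metric_cleanup():
    pass
```

gmond loads a module like this through a small .pyconf stanza and the values show up in the ganglia web frontend; each such plugin is one of the small, "couple of hours" investments notpeter describes paying down in advance rather than cursing its absence during an outage.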
and each one eating up a little bit of our time [21:46:16] right. I want to come back to the downtime thing [21:46:21] ok [21:46:26] when you said the kind of downtime changed [21:46:43] the old downtime was "oops" - bumped a cable, edited the wrong file [21:46:50] what's the new downtime? :) [21:46:59] and what will the future downtime be? ;) [21:47:27] well, oops, or "that single point of failure died" [21:47:46] outages are now more likely to be from dev errors, which is a good thing [21:47:53] I mean, oops is eternal. no matter how good your code review is, etc, people will make mistakes [21:48:12] but yeah, I was going to say new features [21:48:17] be it in ops land or dev land [21:48:29] notpeter: yeah, on review of the backscroll I should have said SPOF was the old downtime [21:48:47] notpeter: did you get a chance to merge the coredb module? [21:48:55] no [21:49:04] I want to wrap up because I have taken more of your time than I promised to. Last bit: And how far do we have to go, and what will necessarily suffer in the interim? [21:49:08] it needs to be massively rebased [21:49:12] will do that after interview [21:49:22] sounds good [21:50:22] we're very close to having things in a very good state. one where we can fully fail over to having eqiad be our primary DC and even spin up more datacenters dramatically more quickly [21:50:34] binasher: when you say "ganglia was not part of it" you mean figuring out that Fundraiser Stats load thing and disabling it? [21:50:40] the current plan is to fail over to eqiad after the fundraiser [21:51:04] sumanah: aye [21:51:06] which is dramatically late, but there was a dramatic misunderestimation of the amount of debt that we were in [21:51:23] * sumanah appreciates usage of "misunderestimation" [21:52:11] going from one DC to two is very hard. two to three will be orders of magnitude easier [21:52:39] Can I give readers a guest login to https://graphite.wikimedia.org/dashboard/ ? [21:52:51] no. [21:52:52] nope. there's potentially sensitive data in there [21:53:49] as you mention, another part of the maturing organization is better transparency to your coworkers & better ability to contribute and ask for code review [21:53:55] via Gerrit [21:54:02] <^demon|zzz> sumanah: Also, that uses LDAP...so it would also grant them guest access to Gerrit, labsconsole, and half a dozen other things. [21:54:09] <^demon|zzz> (Even if graphite alone was ok to give out) [21:54:53] so, is there a separate permission for seeing the graphite dashboard, or is it "anyone with an LDAP account"? [21:55:08] does that mean that anyone with gerrit access can view it? [21:55:52] notpeter: following up on that: I can appreciate that until the eqiad dc is up and the failover works, a lot of other stuff is going to have to be on the back burner [21:56:02] <^demon|zzz> I don't know if everyone in LDAP can access it. But I know everyone who can access it can also use Gerrit. [21:56:17] notpeter: is that fair to say? [21:57:00] kinda yes, kinda no. I think that until it's done we can only give maintenance amounts of time/energy to many things [21:57:21] so, I guess yes :) [21:57:23] notpeter: but after that .... what's the next challenge going to be for Ops? more community collaboration/transparency/mentorship? Labs? another service that you're waiting to build out? any other big x-to-y switches on the horizon? 
[21:58:49] RoanKattouw: binasher: I'll be drafting this in https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts starting once the interview is over so it'll be very easy to correct me on stuff :D [21:59:14] thanks [21:59:15] heh, you should ask an architect :) now that you ask, I realize that I'm too immersed in the day to day to really know the answer to that [21:59:22] HEY ASHER [21:59:25] ARCHITECT ASHER [21:59:36] :) [22:00:16] well, mark/ct/erik might be a more suited architect, tbh [22:00:44] btw, we need to convince asher to have user dbs in the labs replication dataserver :) [22:00:53] I mean, I hope the asnwer to "where is this all going?" is "mars" [22:01:29] you may need to change linux distribution before deploying there ;) [22:01:32] I am not eager to try to take any of Bergsma's or Woo's time - I've already taken an hour+ of yours - but I'll see what I can do [22:01:35] Platonides: you'll probably get federated versions at least [22:01:52] notpeter: Cooling the servers would be a lot easier, but the dust might be a problem [22:02:03] notpeter: That and the latency problems. But we can deal with that. [22:02:04] ok, thanks to all of you for your time on this, I really appreciate it [22:02:19] sumanah: definitely! [22:02:27] marktraceur: I look forward to solving these problems :) [22:02:28] I haven't used the federated engine [22:02:33] it could work [22:02:34] the lay readers are also going to be really grateful to see this glimpse behind the scenes, the donors will be glad to know what their money is paying for, as always [22:02:44] totally! [22:02:50] Platonides: mariadb comes with a newer version called "federatedX" [22:02:55] and your colleagues in all departments might also appreciate seeing what you've been laboring to chip away at [22:03:05] so, thanks for your time. [22:03:13] will it be using maria? [22:03:30] it won't perform as well as if you actually had userdb's in the core labsdb instances, but we should see if it works well enough [22:03:35] yeah [22:03:41] most definitely [22:03:58] <^Mike> maria always makes me hungry... marinara sauce over pasta, mmmmm :P [22:04:19] long term, production will probably use mariadb as well, though we don't have a target date [22:04:30] it is currently using oracle mysql, right? [22:04:51] with facebook's patchset [22:04:51] (probably with added patches) [22:04:56] yup [22:08:29] sumanah: the challenges of making mediawiki scale aren't going to go away any time soon, nor will the need for incremental architecture modernization at multiple levels. i don't think we really have challenges that end, to be replaced by new ones. there's a continuing arc of refinement in operations. [22:09:39] binasher: not to be picayune, but surely we can find milestones and goals, right? [22:10:48] there's a lot of incremental improvement but we also have big pushes in our past (like, the push to puppetize so we can replicate data centers more easily) so I presume we might have similarly big goal-oriented pushes in our future [22:10:53] sumanah: there is an operations roadmap [22:12:15] * sumanah searches for operations roadmap https://wikitech.wikimedia.org/index.php?search=roadmap&title=Special%3ASearch [22:12:22] binasher: I don't know where to look for it. help? 
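Back to the FederatedX exchange a few lines up: a FEDERATED table (FederatedX in MariaDB) is a local table definition whose rows actually live on another MySQL/MariaDB server, which is why binasher expects it to work but perform worse than genuinely local user databases. A hedged sketch, with every host, credential, schema and table name invented for illustration (and no claim that this is how labsdb was actually built), assuming the PyMySQL client library is available:

```python
"""Sketch: exposing a user-database table through the FEDERATED/FederatedX engine.

All names, hosts and credentials below are made up; this is not the actual
labsdb replication setup under discussion.
"""
import pymysql  # assumes the PyMySQL client library is installed

# The local definition mirrors the remote table's columns; the CONNECTION
# string tells the engine where the rows really live.
DDL = """
CREATE TABLE user_notes (
    page_id INT UNSIGNED NOT NULL,
    note    TEXT
) ENGINE=FEDERATED
  CONNECTION='mysql://labsuser:secret@userdb.example.internal:3306/u_example/user_notes'
"""

conn = pymysql.connect(host="labsdb.example.internal",
                       user="labsuser", password="secret",
                       database="u_example")
try:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # Queries against the local table are forwarded to the remote server,
        # which is why it "won't perform as well" as a truly local userdb.
        cur.execute("SELECT COUNT(*) FROM user_notes")
        print(cur.fetchone()[0])
finally:
    conn.close()
```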
[22:13:02] I know about https://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals and I know about the immediate "next month or 2" roadmap on mediawiki.org [22:13:03] i don't know if it's public, woosters ^^^ [22:13:59] there's also the 2013 techops budget, which i only have in the form of a google spreadsheet, but it contains a project tab [22:14:29] i.e. the west coast datacenter is one "what's next" [22:15:08] but much of what we do does not fit into that paradigm [22:15:21] certainly [22:15:28] yeah, lots of what I do is gardening too [22:17:32] and part of the narrative is getting better at the gardening [22:43:14] aah gardening – https://meta.wikimedia.org/wiki/Listening_to_our_garden <-- sumanah [22:43:17] hi Betacommand [22:43:48] Nemo_bis: hay, looking into rewriting your dumper using pywiki [22:44:44] Nemo_bis: nod, nod [22:49:42] Nemo_bis: should be fairly easy, just need to break down the XML format into something I can understand/create a template for [22:50:18] Betacommand: really? [22:50:30] I mean, that shouldn't be needed ^^' [22:50:52] Nemo_bis: pywikipedia already has all the functions needed to extract the data [22:50:52] anyway https://meta.wikimedia.org/wiki/Listening_to_our_garden [22:51:03] Betacommand: I wouldn't be *SO* sure [22:51:10] ehm wrong link [22:51:17] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-November/000630.html [22:51:39] what about that? [22:51:56] someone else already asked about the format, so you might be interested, dunno [22:52:15] Nemo_bis: that was me [22:52:16] anyway basically you have to take whatever is among and concatenate it [22:52:39] Betacommand: *facepalm* [22:52:54] Nemo_bis: when working with the API and python I dont use XML I use json [22:53:26] XML is evil [22:53:50] json arrays are so much neater [22:54:41] hmm [22:54:52] but then how do you get the XML [22:55:17] why reconstruct it when you can get Special:Export or whatever produce it for you? [22:55:34] (all the rest can be done with JSON if one prefers so) [22:57:40] Nemo_bis: when working with large queries/spanning multiple requests and/or functions that dont use the api I wont get XML [22:58:28] its far easier to do a page.get() and not have to focus on whether its screenscraping or using the API [22:58:59] then throw page_text and other data into the XML file [23:18:09] binasher: so K4-713 has seen increases in bits response times [23:18:15] what pictures are both of you looking at? [23:18:39] Well, last night, we were getting reports in from everywhere of CSS not loading at all... [23:19:07] er.. did someone open a ticket or email ops about that? [23:19:07] uuuuuuhhhhh, ok [23:19:11] ...and I traced it back to the skin not loading all the way. [23:19:26] By the time I was about ready to tell anybody about it, I could no longer reproduce. [23:19:37] time frames? [23:20:05] Let me see... it was somewhere around 6pm PST. I can get more precise if I look at my chatlogs. [23:20:56] * K4-713 is looking [23:21:46] Yeah, I've got 6:53pm PST for the first I'd heard of a problem. [23:22:02] where were they coming from? [23:22:37] Looking for that too... [23:24:12] Seemed to be https://bits.wikimedia.org/donate.wikimedia.org/load.php?debug=false&lang=en&modules=site&only=styles&skin=vector&* [23:24:33] the bits monitor in watchmouse hasn't had any errors in the last several days, and performance has been steady [23:26:00] That's strange. 
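Picking up Betacommand and Nemo_bis's exchange above about rewriting the dumper: the JSON-from-the-API approach Betacommand prefers looks roughly like the following, shown here with only the Python standard library rather than pywikipedia (whose page.get() wraps this kind of request); the wiki and page title are arbitrary examples:

```python
"""The JSON-over-the-API approach described above, using only the standard
library. The wiki and page title are arbitrary examples; pywikipedia's
page.get() hides this kind of request behind a method call."""
import json
import urllib.parse
import urllib.request

API = "https://en.wiktionary.org/w/api.php"

def fetch_wikitext(title):
    """Return the current wikitext of a page via action=query, format=json."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    })
    req = urllib.request.Request(API + "?" + params,
                                 headers={"User-Agent": "dumper-rewrite-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # Results are keyed by page id; take the only page and its latest revision.
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

if __name__ == "__main__":
    print(fetch_wikitext("water")[:300])
```

If the goal is dump-format output rather than reassembling the XML by hand, Special:Export will still hand back the export XML with the page text inside it, which is the shortcut Nemo_bis is pointing at.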
[23:26:26] so, I finished upgrading bits caches at around 1640 pst yesterday [23:26:54] so the timing suggests that what you saw was not due to upgrade of bits caches [23:27:33] hm. I had heard rumblings of people seeing issues earlier in the day. but they weren't very specific about the time, unfortunately. [23:27:39] er, the day before [23:27:45] sorry [23:27:58] utc-based confusions [23:28:22] K4-713: is there a bugzilla ticket? [23:29:22] I didn't make one yet. [23:29:42] ah, users often make them [23:30:01] Wasn't even sure bugzilla was the appropriate place for stuff like that. [23:30:08] Usually, donors having problems just email us. [23:30:27] I think we received a few, actually. [23:32:40] binasher: Actually, I just checked. We received no emails that were clearly about missing css, but apparently they are usually a bit difficult to decipher. [23:33:00] How should I surface this stuff in the future? [23:33:39] donors.. hmm was this on fundraising pages vs. wikipedia? i've been looking for anything in the enwiki village pump [23:33:52] possibly? [23:34:15] Sorry: I'm usually at least one layer down from donor contact. [23:35:00] Which means that I generally don't hear about intermittent things until they've nearly gone away and I can't live-diagnose. [23:35:02] bits issues have more often been mediawiki issues than ops/varnish issues [23:35:44] which affects reporting, but it can be difficult to know one way or another [23:36:30] but emailing ops@ or ops-request@ with as much information as possible would be good [23:37:01] I can tell you one thing, though, that might not matter: I was WFH yesterday, and at home I'm running IPv6. When I was trying to get donate css to break, the one time I had the firebug net tab open, the URL that should contain the missing css was the only page subrequest that resolved to an IPv4 address. [23:37:36] All the other bits addresses (and everything else) were... fine. [23:38:10] that's weird [23:38:27] But, I was on IRC with people in the office, and they were hard-reloading and seeing the same symptoms. [23:38:41] ...only, obviously, all IPv4. [23:38:55] But, yeah: That was strange. [23:39:34] was everyone hard-reloading a specific page? [23:39:49] No, I think we were all on different donate pages. [23:40:56] HA. I just got it to misbehave. [23:41:07] ...it's still going on. [23:45:48] binasher: I'm getting some odd status codes from bits, intermittently. [23:45:54] Like, 301 and 302. [23:45:56] Same page. [23:47:38] what page? [23:48:45] Well, hard-reloading donate.wikimedia.org... but actually they seem to sort of move around amongst the subrequests. [23:49:07] https://bits.wikimedia.org/donate.wikimedia.org/load.php?debug=false&lang=en&modules=ext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.vector&only=styles&skin=vector&* [23:49:10] That one did it a couple times. [23:49:35] returned a 301? [23:49:42] Yes. [23:49:52] what to? [23:49:52] But, not all the time. Which is strange. [23:50:26] can you grab a response header from one of these? [23:50:46] Gah, I got excited and blew away my data. Trying to catch a new one. [23:51:36] K4-713: if varnish returns a 301, it's because mediawiki returned a 301 [23:52:36] K4-713: i.e. 
http://pastebin.mozilla.org/1930602 [23:54:38] K4-713: i found the problem, it's in mediawiki [23:54:39] just a sec [23:55:10] * RoanKattouw looks up [23:55:48] binasher: If you found an MW problem, especially in load.php / ResourceLoader, I'd be interested to hear about it [23:55:58] <^demon|zzz> RoanKattouw: You're not looking up, you're looking forward. [23:56:03] hehe [23:56:40] K4-713: http://pastebin.mozilla.org/1930627 --- if mw returns different responses based on the presence of X-Forwarded-Proto, it has to be varied on [23:56:58] Hmm [23:57:41] "Hmm" indeed! [23:57:46] That's a bug in the HTTPS forcing code [23:57:46] and yeah, if a browser gets a 301 that points back to the exact same url (with the https) as what it requested, it isn't going to request it in a loop [23:57:57] Good find [23:58:08] * RoanKattouw walks to Chris's cubicle [23:58:25] But, why is it so intermittent, I wonder? [23:58:38] * marktraceur is very amused to watch events announced on IRC unfold in front of him [23:59:05] K4-713: because the cache control is short and there are multiple bits servers [23:59:48] K4-713: how are those donate css files getting requested without https? [23:59:56] its probably a rarer thing
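binasher's pastebins show the same load.php URL answering differently depending on whether X-Forwarded-Proto is present, without that header appearing in Vary, so the bits caches could hand an HTTPS client a stale 301 pointing at the exact URL it had just requested. As a generic illustration of how one might look for that kind of mismatch from a client (in WMF's real setup the header is normally added by the SSL terminator, so probing from outside will not reproduce the internal requests exactly; the URL is the one from the discussion above):

```python
"""Probe the same URL with and without X-Forwarded-Proto and compare the
responses: a generic way to spot content that changes with a request header
that is not listed in Vary. Standard library only."""
import http.client
import urllib.parse

URL = ("https://bits.wikimedia.org/donate.wikimedia.org/load.php"
       "?debug=false&lang=en&modules=site&only=styles&skin=vector&*")

def probe(extra_headers):
    parts = urllib.parse.urlsplit(URL)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    headers = {"User-Agent": "vary-probe-sketch/0.1"}
    headers.update(extra_headers)
    conn.request("GET", parts.path + "?" + parts.query, headers=headers)
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection closes cleanly
    info = (resp.status, resp.getheader("Location"), resp.getheader("Vary"))
    conn.close()
    return info

without_xfp = probe({})
with_xfp = probe({"X-Forwarded-Proto": "https"})

print("no header:   status=%s location=%s vary=%s" % without_xfp)
print("with header: status=%s location=%s vary=%s" % with_xfp)
# If the two statuses differ (e.g. 200 vs 301) but neither response lists
# X-Forwarded-Proto in Vary, a shared cache can serve the wrong variant,
# which matches the intermittent 301s K4-713 was seeing.
```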