[00:01:15] (03PS2) 10Dzahn: remove pmtpa access switches [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 [00:02:02] (03CR) 10Ori.livneh: "> I'm not sure how the HHVM configuration interacts with the non-HHVM configuration. Will the variant aliases still work?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 (owner: 10Ori.livneh) [00:03:11] (03CR) 10Dzahn: [C: 04-1] "need to check what/if any of this is used for 10th floor" [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 (owner: 10Dzahn) [00:11:29] (03PS1) 10Dzahn: retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 [00:14:48] (03PS1) 10Dzahn: delete anything 'toolserver' [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 [00:15:34] (03CR) 10Dzahn: [C: 04-1] "after "July 1st 1:00 am UTC, the Toolserver accounts will be expired and the" [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [00:16:32] (03PS3) 10Ori.livneh: Apache config for Wikivoyage using mod_proxy_fcgi [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 [00:21:03] (03PS1) 10Dzahn: wikimediafoundation - align and tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/143212 [00:24:37] (03CR) 10Dzahn: [C: 031] Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [00:28:54] (03PS7) 10Reedy: Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [00:31:43] springle: http://chr13.com/2014/03/10/using-google-to-ddos-any-website/ [00:31:56] reminded me of the tendril report [00:35:38] YuviPanda|zz: I know you're not around anymore but...pong. 
I'll try to hit you up to talk about that stuff tomorrow [00:38:34] (03CR) 10Dzahn: [C: 031] "lgtm, but can we not introduce the literal tabs?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:44:17] (03PS2) 10Dzahn: update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:44:53] AaronSchulz: wow [00:47:08] chasemp: still around? I popped back in for a bit (can't sleep) [00:47:10] if not I'll just poke you tomorrow [00:47:15] yo [00:47:19] chasemp: yo! [00:47:56] chasemp: around and want to talk about it now? [00:48:00] so to give some context to your concern, there has been talk of primary data submission from diamond, as in even binary state data for services and such [00:48:06] chasemp: ah! [00:48:07] chasemp: right. [00:48:13] but it was always maintained ops side that the submission would be to icinga [00:48:30] or equivalent and that only the anomaly type stuff was alerted through graphite [00:48:31] as in? diamond -> icinga? or diamond -> graphite <- icinga [00:48:49] as in, we have a mechanism (which we make?) for diamond => icinga [00:49:07] ideally....in my personal world...we use passive checks much more heavily [00:49:07] ah, hmm. [00:49:11] and have hosts themselves be responsible for checking in [00:49:18] so I told you that just so we are on the same page [00:49:18] but [00:49:39] as far as actual alerting how you see fit in labs, I think it's in your hands [00:49:45] as the prod stuff needs lots of TLC atm [00:49:58] (03PS3) 10Dzahn: update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:50:06] not suggesting you should implement all that, just that it's in the mix somewhere [00:50:18] personally I think trying out some alerting dashboards would be really cool [00:50:26] chasemp: I agree too.
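The passive-check model chasemp sketches above (hosts submit their own results, rather than icinga polling them) comes down to writing result lines into icinga's external command file. A minimal sketch: the `PROCESS_SERVICE_CHECK_RESULT` line format is standard Nagios/Icinga, but the command-file path and the host/service names here are illustrative assumptions, not the actual WMF setup:

```python
import time

def passive_check_line(host, service, status, output, ts=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT line for Icinga's external
    command file (status: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)."""
    ts = int(ts if ts is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, status, output)

def submit(line, cmd_file="/var/lib/icinga/rw/icinga.cmd"):
    # Icinga consumes lines appended to this FIFO; path is an assumption.
    with open(cmd_file, "a") as f:
        f.write(line + "\n")
```

A host's cron would then call something like `submit(passive_check_line("tools-login", "puppet", 0, "puppet ran"))`, which is the "hosts check in themselves" flow rather than central polling.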
[00:50:27] who knows maybe you love it [00:50:30] maybe it's java [00:50:33] (03CR) 10Dzahn: "also fixed some validation errors in PS3, (see diff between PS2 and PS3), now: This document was successfully checked as XHTML 1.0 Transit" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [00:50:35] heheh [00:50:53] chasemp: so one thing we can do is perhaps setup cabotapp (or equivalent) only for *toollabs*, and see how the experience is. [00:51:03] I think that's reasonable [00:51:08] it looks neat [00:51:24] chasemp: only thing I have against it is that checks aren't in version control, but need to be in a db. [00:51:29] that makes me feel weiirddd [00:51:39] not sure I understand? [00:51:41] so ideally I'd find something that lets me put the checks in a git repo [00:51:48] you mean the check logic itself? [00:51:52] chasemp: yeah [00:52:16] the graphite expressions will be stored in a db that cabotapp manages [00:52:18] yeah I'm sure we can do it right if we like it, I've done lots of things like that with git and post commit hooks [00:52:32] :D yeah, so that's a possibility too. [00:53:01] assume you've seen http://graphite.readthedocs.org/en/latest/tools.html ? [00:53:09] lots of effort in this space atm [00:53:10] chasemp: yeah, that's where I picked this up from [00:53:20] I've used most of those actually [00:53:41] oh [00:53:46] https://github.com/livingsocial/rearview/ looks nice too, but rub. [00:53:49] *ruby [00:53:51] but not cabot as it's new to me [00:54:01] and requires jvm [00:54:40] chasemp: have you used any of those tools for alerts? [00:54:55] no as is the case for everyone we rolled our own at my last place [00:55:08] hehe [00:55:14] I'd like to avoid that [00:55:26] chasemp: https://github.com/datacratic/check_graphite or similar is also a possibility [00:55:42] _joe_ actually rewrote most of that quite nicely [00:55:50] but it hasn't seen too much implementation yet [00:55:53] oh, check_graphite? 
[00:55:56] if you were curious he would be the guy to talk to [00:55:57] yes [00:56:21] nice. I wouldn't mind using check_graphite + icinga, but that seems a bit wasteful (+ brings all of icinga's complexity with it for little benefit) [00:56:27] (03PS3) 10Dzahn: modules/coredb_mysql/ sans systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137994 (owner: 10Rush) [00:57:05] well the hardest parts for an alerting system isn't alerting actually it's the finagle bits of assigning alerts, accepting them, silencing them [00:57:06] etc [00:57:13] chasemp: the previous icinga on labs effort stalled because prod's icinga setup was too prod specific, and hence petan just went his own way (and it was unpuppetized). I'd say my first priority is making sure that doesn't happen, and that any solution has buy-in from ops. [00:57:15] you get a lot more than you realize with that underbelly [00:57:18] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 1 00:57:10 UTC 2014 [00:57:21] chasemp: that's true [00:57:44] idk how serious akosiaris is about shinken [00:57:58] but if he has more or less made up his mind maybe just deploy it in toollabs [00:58:04] rather than unraveling icinga [00:58:12] easier to build new than remodel [00:58:23] we could do that too, but then how do you define checks? collecting resources has the same problem with shinken, I'd suppose [00:58:38] ah I see [00:58:52] I have a few thoughts on it but let me noodle on it for a bit [00:59:05] chasemp: ok! [00:59:40] (03PS4) 10Dzahn: gerrit - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138008 (owner: 10Rush) [00:59:42] but say you get the histogram / trending portion stood up [00:59:47] that seems like the place to start [00:59:52] maybe you end up happy without icinga idk [01:00:24] I'd be happy to start with something as trivial as 'host is down!' and 'out of disk space!'
(two things people on mailing lists had to alert us today, and happens almost every other week) :) [01:01:42] (03PS4) 10Dzahn: ganglia - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138010 (owner: 10Rush) [01:02:00] seems like you could do that second with most of the graphite dashboard tools [01:02:18] chasemp: yeah, with an alerting component [01:02:21] and we could write a really, really trivial cron to check for a host that hasn't reported a metric in x number of minutes [01:02:39] right [01:02:45] or even add that into cabotapp [01:02:52] yeah probably ideally that [01:03:01] it right now doesn't support reporting to IRC, for example (but reports to HipChat >_>) [01:03:25] yeah I kind of don't love our whole irc bots thing honestly [01:03:37] I think we talked previously about using a redis pubsub or some pubsub queue [01:03:43] hah, a new bot every few months as well [01:03:46] that can be used as an irc events pipeline [01:03:47] chasemp: oh, but where would that feed to? [01:04:23] chasemp: ah, got it [01:04:23] what I did before was a single bot(-ish) that could read from the queue (as could any consumer) and then you publish events [01:04:29] like an operational events pipeline [01:04:31] yeah [01:04:42] it's not complicated really but far more flexible than a lot of event islands [01:04:49] and we used that to post events to graphite too [01:04:54] faidon was interested in http://www.fedmsg.com/en/latest/ a while back [01:05:04] so syncs, etc all could be overlayed on graphs [01:05:34] in my (highly biased) opinion EventLogging is more robust, but it doesn't target operations stuff currently [01:05:38] ori: that's...pretty similar to how i've used redis in the past [01:05:50] I admit I know little of the current eventlogging stuff [01:06:08] I saw a chart from nuria it seemed not simple**tm [01:06:12] ori: do you have any thoughts/opinions on cabotapp and friends?
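The "really, really trivial cron" mentioned above (flag any host whose newest graphite datapoint is older than x minutes) could be sketched like this. The graphite endpoint and metric pattern are assumptions; the JSON shape is graphite's standard render-API format of `{"target": ..., "datapoints": [[value, timestamp], ...]}`:

```python
import json
import urllib.request

GRAPHITE = "http://graphite.wmflabs.org"  # hypothetical endpoint

def stale_hosts(last_seen, now, max_age_minutes=10):
    """Return hosts whose newest datapoint is older than max_age_minutes.
    last_seen maps host -> unix timestamp of its latest metric."""
    cutoff = now - max_age_minutes * 60
    return sorted(h for h, ts in last_seen.items() if ts < cutoff)

def fetch_last_seen(metric="*.cpu.total.idle"):
    """Query graphite's render API for the newest non-null point per series."""
    url = "%s/render?target=%s&from=-1h&format=json" % (GRAPHITE, metric)
    series = json.load(urllib.request.urlopen(url))
    out = {}
    for s in series:
        points = [(v, t) for v, t in s["datapoints"] if v is not None]
        if points:
            out[s["target"]] = points[-1][1]  # timestamp of newest point
    return out
```

A cron would combine the two (`stale_hosts(fetch_last_seen(), time.time())`) and page or post somewhere for any host returned.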
[01:06:25] i've never seen cabotapp [01:06:36] (http://cabotapp.com/) [01:06:52] metrics based alerts. [01:07:30] seems okay, i'm just suspicious of new solutions [01:07:39] yeah, me too. [01:08:01] let's not marry it, but maybe drinks a few dates [01:08:17] then we say we are moving out of the country if it sucks [01:08:26] not marrying *anything* sounds great to me! :) [01:08:39] everybody wants to bring their ex from last work :) [01:08:45] * YuviPanda is actually moving out of the country soon too, so will work well [01:08:55] mutante: heh, that's so true [01:10:27] we have graphite, gdash, logstash, kibana, ganglia, diamond, txstatsd, python-statsd, icinga, tendril, udp2log, logster [01:10:48] we have python-statsd running somewhere as well? [01:10:51] YuviPanda: anyways, hope something there was helpful. tl;dr...I think giving it a try is cool. also, your use case may be simple enough it's all you need. [01:10:57] python-statsd is a client lib I think [01:11:03] ah [01:11:11] and I thought ganglia was going to die... [01:11:29] not dead yet tho [01:11:46] don't know tendril, ori? [01:11:48] watchmouse should make sure the other monitoring tools are actually up [01:11:51] YuviPanda: you know how it's more fun to roll out something new than sunset an existing solution? other people feel like that too :P [01:11:57] ori: :P [01:12:19] idk I love ripping things out :) [01:12:27] I'm more impressed with removing lines of code than adding them :) [01:12:29] mutante: PROBLEM - Number of monitoring apps is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0] [01:12:38] ori: :)) [01:12:44] * YuviPanda likes ripping things out too [01:12:51] go kill toolserver :) [01:12:52] tendril is https://tendril.wikimedia.org/ [01:12:53] might get rid of all ganglia things that are labs specific. 
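The "operational events pipeline" chasemp describes above (one queue, many consumers: IRC bot, graphite annotations, logs) mostly needs an agreed-on event format plus a pub/sub channel. A sketch assuming redis and the redis-py client; the channel name and event fields are made up for illustration:

```python
import json
import time

def make_event(source, kind, message, ts=None):
    """Serialize an operational event; any consumer subscribed to the
    same channel (IRC bot, graphite annotator, logger) can decode it."""
    return json.dumps({
        "source": source,
        "kind": kind,
        "message": message,
        "ts": int(ts if ts is not None else time.time()),
    }, sort_keys=True)

# Publishing side (needs a running redis and the redis-py client):
#   import redis
#   redis.StrictRedis().publish("ops-events", make_event("cabot", "alert", "disk full"))
#
# Consumer side (e.g. the single IRC bot reading the queue):
#   pubsub = redis.StrictRedis().pubsub()
#   pubsub.subscribe("ops-events")
#   for msg in pubsub.listen():
#       if msg["type"] == "message":
#           handle(json.loads(msg["data"]))
```

The point of the design is that syncs, deploys, and alerts all become plain events on one channel instead of "a lot of event islands", each with its own bot.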
[01:12:58] mutante: ALREADY DEAD :) [01:13:03] s pringle's own [01:13:05] kinda cool actually [01:13:10] YuviPanda: https://gerrit.wikimedia.org/r/#/c/143209/ [01:13:22] ori: didn't we have ismaheal or something earlier? [01:13:28] oh, i forgot, there's ishmael too [01:13:29] "mha" ? [01:13:32] heh [01:14:04] to be fair several of those are at different layers in the monitoring stack not direct competitors :) [01:14:48] oh i'm not pointing fingers, i am directly responsible for some of the duplication [01:15:07] but i'm just leery of single-purpose monitoring solutions for exactly that reason [01:15:15] yeah agreed [01:15:18] +1 [01:15:31] bottom line is we need to figure out what we are going to do w/ icinga [01:15:41] so tools, etc know which way to go as well I guess [01:15:42] <^demon|lunch> Maybe we can move monitoring to phabricator since it's going to be our do-everything tool. [01:15:51] ^demon|lunch: heheheh [01:15:59] yeah, if icinga is going to be kept at least it should be moved to be a module [01:16:01] you laugh, but... https://secure.phabricator.com/book/phabricator/article/herald/ [01:16:20] herald is quite different from this :) [01:16:35] we could feed all monitoring data into it if we hate ourselves [01:16:42] herald sounds like gerrit-reviewer-bot [01:17:29] so, did you try the code review part of phab? [01:17:47] * YuviPanda hasn't [01:17:54] I gave up trying to set that up after a while [01:18:07] well...I mean I have used it a lot since I used it for work before :) [01:18:21] <^demon|lunch> I've done it. [01:18:27] <^demon|lunch> Not bad.
[01:19:27] btw, here's another option, this is what we used at former work(tm) [01:19:30] http://docs.pnp4nagios.org/pnp-0.4/start [01:19:40] that is a nagios/icinga plugin to feed performance data into rrd [01:19:48] without the additional ganglia part [01:19:49] I used pnp before too, but it's definitely showing it's age [01:20:44] HEY GUYS LET'S WRITE OUR OWN IN GO AND SCALA [01:20:46] fair, it isn't bleeding edge for sure [01:21:49] the end of every awesome tool-selection process is to write a better one than all the ones you evaluated, so that the selection process is easier for the next guy :) [01:22:07] obligatory http://xkcd.com/927/ [01:22:32] bblack: :) after that you have the new IRC bot to report it [01:22:35] at least the character encoding one seems... somewhat fixed, for some definition of 'fixed' [01:23:17] ori: we could use Limn for dashboards. [01:23:18] unless you're trying to interoperate between languages that prefer internal UCS-32 vs UTF-16 vs UTF-8 [01:23:32] oh i forgot limn too [01:23:43] yeah, or using languages that think UCS-2 is UTF-16 [01:23:54] s/UCS-32/UCS-4/ heh [01:24:00] I can't even keep all the acronyms straight [01:24:05] bblack: UCS-32 sounds awesome! :D [01:24:09] or using languages that think they're languages that htink UCS-2 is UTF-16 (coco) [01:24:24] Limn, heh. [01:26:17] what we need is a new universal encoding that fixes all the legacy problems of unicode and subsumes them into a single character set. It will use 7 bits per character at minimum to save 12% file size on most documents and mirror the first 128 code values of UTF-8, then have this awesome encoding scheme that re-encodes the rest of UTF-8 using RLE compression. [01:26:22] we'll call it UTF-WIN [01:27:26] (your string libraries will need to bit-shift the 7-bit codes around to display them, but we've got plenty of processor power for that these days!) 
[01:27:41] someone start an IETF group, quick [01:27:56] YuviPanda: !monitoring is http://downforeveryoneorjustme.com/${instance} :p [01:27:59] gotta run, laters [01:28:03] mutante: :D [01:28:28] bblack: should just use bytes that when rendered in binary look like the character. [01:28:50] hah [01:28:58] our clear text will be crypto and our crypto will be clear text [01:29:25] come to think of it, with UCS-4 you could just do an 4x8 bitmap as a low-resolution font in the same bitspace. [01:30:13] every possible glyph ever, at a certain crappy resolution. no need for character sets [01:30:57] heh [01:31:01] it shall be UTF-BLOB [01:32:21] alright, I shall sleep [01:32:53] chasemp: do noodle on it more, but once I complete sending more metrics into graphite/diamond on toollabs (nginx error counts, grid engine stats) I'll do something about it :) [01:33:05] sounds good [01:33:34] chasemp: disk just got full on *7* of our hosts, and we got notified on IRC again :( [01:33:42] * YuviPanda mumbles about 2G /vars [01:33:55] ah [01:33:57] only 2 hosts [01:34:04] 03-10 doesn't mean 03 through 10 :) [01:46:59] (03PS1) 10Yuvipanda: diamond: Keep only 2 days of local logs around [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 [01:47:43] * YuviPanda adds a bunch of people to review ^. [01:48:07] YuviPanda: could we instead make it a param and keep prod as is? [01:48:19] chasemp: sure. [01:48:29] chasemp: if that's the case I can probably even trim labs down to 1. [01:48:35] sounds good [01:48:42] chasemp: let me amend [01:53:01] (03PS2) 10Yuvipanda: diamond: Keep only 1 day of logs for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 [01:53:01] chasemp: updated. [01:54:55] I hate to nitpick you but the integers...are really strings even tho puppet is loose with the logic [01:55:28] if you look at say port numbers we define, strings representing integers especially for templates [01:55:29] chasemp: heh, alright. good point, though. 
[01:55:49] these are python ints but puppet strings [01:56:39] (03PS3) 10Yuvipanda: diamond: Keep only 1 day of logs for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 [01:56:44] chasemp: updated again. [01:58:13] (03CR) 10Rush: [C: 032 V: 032] diamond: Keep only 1 day of logs for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143233 (owner: 10Yuvipanda) [01:58:18] chasemp: woot, ty [01:59:23] I have had it get weird in a template where an integer in ruby you mean to be a string gets some template logic that works out with dynamic typing [01:59:27] but is insane to debug [01:59:38] chasemp: :D [01:59:52] chasemp: yeah, esp it is dsl -> another dsl -> python [02:00:03] (03PS4) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [02:00:19] chasemp: ^ too if you're up for it. 'sok if it's too late :) [02:01:00] so for me to review collectors if you wouldn't mind I like to throw an example from the logs of what actual metrics they collect [02:01:34] I usually literally just paste from the log, but makes it clearer down the road, I haven't had time to look at what that outputs specifically and the docs in the header of the collector don't say [02:01:42] chasemp: sure. https://github.com/BrightcoveOS/Diamond/wiki/collectors-PuppetAgentCollector documents.
[02:02:07] chasemp: I can add a link to the commit message [02:02:09] ha that's fair, used to be in the __doc__ for collectors I guess they stopped [02:02:26] (03PS5) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [02:02:36] chasemp: it's in __doc__, they just have a generator [02:03:01] I looked but possible I'm also blind, I was doing 3 things at once :) [02:03:08] : [02:03:55] actually lots of good stuff there, some of it kinda awkward to get into graphite like last_run [02:04:07] but tracking total run times across hosts over time [02:04:09] yeah I like that [02:04:37] chasemp: yeah, and I want to update that to actually put in $CURTIME - last_run [02:04:53] chasemp: so that'll be puppet freshness graph, and if we monitor based on that I can get puppet freshness alerts that way [02:04:57] ah, there is a built in derivative function in the collector super class :) [02:05:08] method even [02:05:08] yeah, or that :) [02:05:38] have you done the math on releasing this for disk space on graphite? [02:05:38] chasemp: if you like this enough I can make this happen in prod too. [02:05:53] that's a lot of new metrics [02:05:55] chasemp: heh, no. I'm just keeping a close eye on it, seems ok so far. [02:06:12] can you tell me how many hosts this hits? [02:06:16] or how many hosts in graphite now? [02:06:19] in labs I mean [02:07:01] chasemp: unsure how to count...
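What the PuppetAgent collector roughly does is flatten puppet's `last_run_summary.yaml` into dotted graphite metric names; the `$CURTIME - last_run` idea above then replaces the raw timestamp with a freshness gauge you can graph and alert on. A sketch of both steps, with assumed metric names (the real collector's naming may differ):

```python
import time

def flatten(prefix, data, out=None):
    """Flatten nested dicts into dotted graphite metric names,
    e.g. {'time': {'file': 1.2}} -> {'puppetagent.time.file': 1.2}."""
    out = {} if out is None else out
    for key, value in data.items():
        name = "%s.%s" % (prefix, key)
        if isinstance(value, dict):
            flatten(name, value, out)
        else:
            out[name] = value
    return out

def with_freshness(metrics, now=None):
    """Replace the raw last_run timestamp with seconds-since-last-run,
    which is the number worth graphing and alerting on."""
    now = now if now is not None else time.time()
    out = dict(metrics)
    if "puppetagent.time.last_run" in out:
        out["puppetagent.freshness"] = now - out.pop("puppetagent.time.last_run")
    return out
```

This also shows why the metric count worried chasemp below: `time` in the summary has one entry per resource type, so every defined type on a host becomes another series.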
if something like this [02:07:03] servers.hostname.puppetagent.time.file [02:07:13] is going to be for all defined types (unsure if just for builtins) [02:07:16] you could be looking at a lot [02:07:25] servers.hostname.puppetagent.time.package [02:07:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [02:07:59] go to servers dir for whisper files [02:08:03] chasemp: am looking at the yaml file it gets its data from, moment [02:08:18] ah, yeah, running find on that should tell me [02:08:38] I use ls | wc -l [02:08:41] heh :) [02:08:56] chasemp: not on labs, since it is not all under servers [02:09:02] ah [02:09:02] each is namespaced by projectname [02:09:04] graphite.wmflabs.org [02:09:37] chasemp: 245 [02:10:00] how big are your whisper files? [02:10:39] idk I haven't done the math yet, and I don't know if the yaml file has stats on all defined types, but this could eat a lot of disk [02:10:54] chasemp: looking at the yaml file now [02:11:09] chasemp: it does report time for each resource type [02:11:28] I don't know how many we have, but it's more than the default [02:11:45] so take the original 20-ish metrics per host and add 10 say, maybe 30-40 per host [02:11:48] x whisper file size [02:11:50] x hosts [02:12:04] rough guess is mucho disk space [02:12:26] chasemp: hmm, it's currently 80G [02:12:53] which means I've to also cut out some other things I'm logging that I don't have a direct use for (NFS usage, user login usage) [02:13:08] I know the feeling of 'hey I can watch all this!'
:) [02:13:31] chasemp: :) [02:13:34] I think when I looked in prod we have sizeable whispers [02:13:49] could trim them down w/ less granularity over the long haul for labs [02:13:53] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 01 Jul 2014 00:13:05 UTC [02:14:05] chasemp: yeah, I think that's the way to go [02:14:05] (03PS1) 10Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/143236 [02:14:09] my feeling is 1 minute has not much context in labs more than 72 hours out [02:14:20] chasemp: yeah, I'd agree again. [02:14:25] and then even at 30 minute blocks for labs, depending on the host you are probably going to see what you want [02:14:35] or who knows what blocks [02:14:42] but yeah, probably have to tweak that before you push this out [02:14:45] chasemp: I'd say much less granularity in general, and toollabs / betalabs can override for slightly more [02:15:14] !log LocalisationUpdate completed (1.24wmf10) at 2014-07-01 02:14:11+00:00 [02:15:24] Logged the message, Master [02:16:03] chasemp: for labs, 1minute for 7 days, 30m for 30 days, 1h for 1 year, 1d for 5y? [02:16:19] let me ask you [02:16:28] how many labs hosts down the 1 hour a year ago is going to be useful [02:16:34] maybe a lot idk [02:16:55] chasemp: well, for toollabs it would definitely be, at least to spot trends in usage. [02:16:55] I would say do that and see how big the file is :) [02:16:58] chasemp: yeah :) [02:17:05] chasemp: if I change it, will graphite automatically compact? [02:17:08] I have a few guidelines hang on [02:17:10] no [02:17:12] you can do it [02:17:18] manually there is a whisper tool [02:17:18] chasemp: ok!
[02:17:21] but otherwise remove and recreate [02:17:25] (easier) [02:17:44] I haven't publicly announced graphite yet, so theoretically all this data can be thrown away [02:17:57] I made these notes before [02:17:58] but toolserver died today, so would be nice to have some data from before that to compare to how toollabs performs after. [02:17:59] #1.4MB per stat [02:17:59] #retentions = 10s:1d,1m:30d,15m:1y,1h:3y [02:17:59] #512KB per stat [02:18:01] #retentions = 5m:30d,15m:90d,1h:3y [02:18:35] is that 512KB per stat per host? [02:18:39] yes [02:18:40] guh, stupid question. [02:19:00] note that's aggressive for here, as we don't do 10s intervals [02:19:05] right, so at about 250 hosts, 250 * 1.4MB is under 400MB [02:19:46] anyways just a few numbers, you'll have to play [02:19:52] 400 MB per stat, so a 140GB disk would give me... [02:19:59] about 400 stats per host [02:20:06] err [02:20:06] 350 [02:20:09] (03CR) 10Rush: [C: 04-1] "let's get some numbers on how this affects disk space. especially since it seems to log a metric per defined type, per host :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 (owner: 10Yuvipanda) [02:20:19] I -1'd...but not because it's a bad idea [02:20:27] just because I don't know what happens disk space wise [02:20:29] chasemp: yeah, seems sane to figure out a retention strategy *before* [02:20:46] my suggestion is [02:20:50] chasemp: also labs has restrictions on disk sizes, I think.
can't really go 'but we can just get a bigger disk' [02:20:51] go broader for disk space, etc [02:21:06] this one's already got the biggest disk labs has [02:21:12] because you don't care about 1 minute disk usage usually, 5 minutes gets you there [02:21:23] I meant like poll disk growth at 5 and so you can store it more cheaply [02:21:30] chasemp: right, so setup a storage schema that's more granular [02:21:45] rather than .* [02:21:53] the ones I pasted at the top were cpu say [02:21:57] and the bottom was disk usage [02:22:27] I never said to myself, in what minute in this five minute period did this disk grow :) [02:22:32] but I suppose it could happen [02:22:37] nah, makes sense for disk [02:23:08] I'm also learning as I go, (not a sysadmin! :D), so haven't really considered these questions before [02:23:47] it's all good, I've done a few rollouts like this so glad to at least provide some useful thoughts [02:24:12] :) [02:24:28] another thing is there are some things that are by nature ephemeral, like you need 1 minute for you don't care if it failed a year ago [02:24:34] and then also last thought, in labs at least [02:24:40] make your default catchall very small [02:24:50] so when someone tries to write a collector and sends a million random strings [02:24:52] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-01 02:23:49+00:00 [02:24:53] right. CPU is probably ephemeral [02:24:55] you have a fighting chance [02:24:57] Logged the message, Master [02:25:14] idk cpu I like :) [02:25:24] but for labs maybe do percentage and not per item for cpu [02:25:25] ? [02:25:29] right. [02:25:43] I wonder if I can configure diamond to not send *all* the stats [02:25:50] like, I don't care about inode stats in disk usage, for example [02:26:00] so that's 4 extra stats per host I don't want/need [02:26:17] yeah I had to basically shadow the collector in puppet to overwrite with the more limited stats version before [02:26:24] right.
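chasemp's retention notes above can be sanity-checked directly: a whisper file stores roughly 12 bytes per datapoint (a 4-byte timestamp plus an 8-byte double), so file size follows from the retention string alone. A sketch that ignores the small per-file header:

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 31536000}

def seconds(spec):
    """Parse a duration like '15m' -> 900 or '1y' -> 31536000."""
    return int(spec[:-1]) * UNITS[spec[-1]]

def whisper_bytes(retentions):
    """Approximate whisper file size for a retention string such as
    '10s:1d,1m:30d,15m:1y,1h:3y': 12 bytes per datapoint per archive."""
    total_points = 0
    for archive in retentions.split(","):
        precision, retention = archive.split(":")
        total_points += seconds(retention) // seconds(precision)
    return total_points * 12
```

Run against the two schemas quoted in the log, this lands close to the figures given: `whisper_bytes("10s:1d,1m:30d,15m:1y,1h:3y")` is about 1.36 MB and `whisper_bytes("5m:30d,15m:90d,1h:3y")` about 0.52 MB, matching the "1.4MB per stat" and "512KB per stat" notes. Multiply by stats per host and host count to get the disk math discussed above.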
[02:26:26] some of them do tho [02:27:08] chasemp: right. I wonder if upstream will accept a generic setting that lets us discard unused stats at the diamond level [02:27:17] like a [02:27:19] don't log this [02:27:21] regex [02:27:27] actually the disk one has that now if I recall [02:27:31] yeah [02:27:45] kormoc is the main guy [02:27:48] #python-diamond [02:27:52] he's actually really nice [02:27:53] it'll also need to account for variants between prod and labs [02:28:00] since I guess prod doesn't care as much about disk space [02:28:18] honestly we just have a crap load of disk space [02:28:24] hehe [02:28:25] so I got away with being permissive so far [02:28:28] labs images, though. [02:28:51] it's up to andrewbogott_afk or Coren but in this case maybe you could get more disk? [02:28:52] idk [02:28:57] chasemp: possibly, yeah [02:28:58] it's kind of a labs public utility if used right [02:29:02] (03PS2) 10Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/143236 [02:29:15] chasemp: yeah, I guess I'll need to make a new image. [02:30:05] I don't actually know if NFS metric is useful, since it won't tell us which user is hitting NFS hard [02:30:06] hmm [02:30:18] chasemp: I'll think more about this and talk to Coren and andrewbogott_afk [02:30:29] sounds good, let me know how it's going [02:30:37] chasemp: if it is possible to get a bigger disk, I see no harm in getting, say, 500G of /srv and then not worry about this [02:30:44] chasemp: would graphite slow down with age if it has too much data? [02:31:00] well yes and no, carbon will need tweaking to keep up with large incoming [02:31:08] but the UI doesn't care about how many metrics you have [02:31:11] right [02:31:15] just how many it has to touch when you do a GET [02:31:41] yeah, I'll talk to those two about it [02:32:10] chasemp: I'll go sleep now, but thanks a lot! very useful / informative :) [02:32:23] good deal, night!
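The "don't log this" regex idea discussed above (dropping unwanted series, like the inode stats, before they ever reach graphite and cost whisper space) is only a few lines wherever metrics pass through. The pattern and metric names here are illustrative, not diamond's actual config keys:

```python
import re

def drop_unwanted(metrics, exclude_patterns):
    """Filter a dict of metric-name -> value, discarding any name that
    matches one of the exclude regexes."""
    compiled = [re.compile(p) for p in exclude_patterns]
    return {name: value for name, value in metrics.items()
            if not any(rx.search(name) for rx in compiled)}
```

For example, `drop_unwanted(collected, [r"\.inodes_"])` would strip the four-ish inode series per host mentioned above while leaving byte-usage metrics untouched, without having to shadow the whole collector in puppet.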
at the back of my mind something is telling me that I'm doing all this work so I can be woken up at night when toollabs goes down... :) [02:33:19] maybe a good thing. [02:33:33] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 1 02:33:27 UTC 2014 [02:33:37] it's good if the alternative is finding out 3 days later you lost a week's worth of work :) [02:33:43] chasemp: :D indeed. [02:33:56] chasemp: nobody's lost their work yet, though. but angry users happen often :) [02:34:04] for some definition of often, of course [02:34:09] but still too often for my liking [02:34:39] ok, sleep [02:34:42] it's 8AM :| [02:34:44] bye [02:51:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 1 02:50:05 UTC 2014 (duration 50m 4s) [02:51:16] Logged the message, Master [04:05:50] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 992.807696236 [04:32:03] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:11] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [06:13:00] !log springle Synchronized wmf-config/db-eqiad.php: depool db1060 during schema changes (duration: 00m 07s) [06:13:05] Logged the message, Master [06:29:11] !log ran fixInvalidStudent.php --wiki=enwiki --courseId=359 for bug 66624 [06:29:16] Logged the message, Master [06:31:24] !log springle Synchronized wmf-config/db-eqiad.php: repool db1060 (duration: 00m 06s) [06:31:27] Logged the message, Master [06:33:59] !log springle Synchronized wmf-config/db-eqiad.php: depool db1063 during schema changes (duration: 00m 06s) [06:34:04] Logged the message, Master [06:36:27] * matanya looks for _joe_ [06:38:15] springle: is it ok if I run a script to delete some corrupt entries on the centralauth db?
just want to make sure you're not doing anything related to it. (https://bugzilla.wikimedia.org/show_bug.cgi?id=66535) [06:38:47] legoktm: go ahead. i'm not touching s7 atm [06:39:24] ok, thanks [06:40:21] !log starting to run checkLocalNames.php and checkLocalUser.php for some wikivoyages to clean up bug 66535 [06:40:26] Logged the message, Master [06:55:35] nice [06:56:53] (03PS1) 10Springle: Reduce m[23] binlog expiry to 7 days. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143248 [06:58:02] !log springle Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 06s) [06:58:05] Logged the message, Master [06:58:28] (03CR) 10Springle: [C: 032] Reduce m[23] binlog expiry to 7 days. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143248 (owner: 10Springle) [06:59:24] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [07:01:24] (03PS2) 10Krinkle: dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 [07:06:05] !log springle Synchronized wmf-config/db-eqiad.php: depool db1067 during schema changes (duration: 00m 06s) [07:06:09] Logged the message, Master [07:16:45] !log springle Synchronized wmf-config/db-eqiad.php: repool db1067 (duration: 00m 12s) [07:16:50] Logged the message, Master [07:22:14] !log finished running checkLocalNames.php and checkLocalUser.php for some wikivoyages to clean up bug 66535 [07:22:19] Logged the message, Master [08:13:12] (03PS1) 10Matanya: mailrelay: convert 'true' into a real boolean [operations/puppet] - 10https://gerrit.wikimedia.org/r/143251 [08:22:11] (03PS1) 10Giuseppe Lavagetto: rcstream: add ipv6 addresses to backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/143252 [08:27:10] (03CR) 10Giuseppe Lavagetto: [C: 032] rcstream: add ipv6 addresses to backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/143252 (owner: 10Giuseppe Lavagetto) [08:47:23] !log ms-be3003 sdk1 disk to 0 weight 
[08:47:28] Logged the message, Master [08:49:32] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:49:46] (03PS3) 10Nemo bis: dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 (owner: 10Krinkle) [08:50:22] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.005 second response time [09:38:29] from linode http://pastie.org/9342856 [09:44:00] well trace from eqiad to your next hope 217.0.117.212 seems to work :-( [09:44:01] no clue [09:44:19] (03PS11) 10Giuseppe Lavagetto: Improve nginx TLS cipher list & session timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: 10JanZerebecki) [09:44:37] <_joe_> hey ho, let's go [09:45:21] <_joe_> in ~ 15 minutes, PFS would be enabled on wikipedias. [09:45:31] \o/ \o/ [09:45:33] \O/ [09:45:43] (no clue what PFS stands for, but that sounds exciting) [09:45:50] Pure Fast Speed [09:46:10] <_joe_> Perfect Forward Secrecy [09:46:20] <_joe_> http://en.wikipedia.org/wiki/Forward_secrecy if you trust wikipedia [09:46:23] <_joe_> :P [09:47:54] =DDDDDDDD [09:47:57] woot \o/ [09:53:02] I think we should just tell people hashar's version :P [09:57:59] <_joe_> legoktm: it would actually be slightly slower [09:58:15] <_joe_> in my evaluations, non-observably slower, but still [09:58:17] shshsh, don't tell them [09:58:36] <_joe_> matanya: I've been very transparent in ops@ [09:58:45] <_joe_> the benefit for users are really there, so... [10:00:04] _joe_: The time is nigh to deploy Ops (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1000) [10:00:17] <_joe_> jouncebot: thanks [10:00:29] _joe_: that is your cue I suppose :-) [10:00:36] <_joe_> (how do I notify this bot I've released?) 
[10:00:40] <_joe_> akosiaris: :) [10:01:21] you don't need to tell the bot anything, it just gives reminders [10:01:48] <_joe_> Oh I won't forget [10:01:57] <_joe_> it's a potentially fatal change :P [10:02:21] haha [10:02:45] we will see in a few minutes [10:02:51] https://github.com/mattofak/jouncebot [10:03:03] one way communication ? bah... [10:03:38] the turing test was (almost) beaten, we can have more :P [10:03:44] it just there to make sure you don't miss your deployment window and inform others something is going on [10:04:00] yeah I know, just making fun :-) [10:05:36] <_joe_> akosiaris: that damn dpkg-query for hadoop and the rest is making me nuts [10:06:34] i hinted yesterday about what i think is wrong [10:06:49] <_joe_> it is released but I'll have to restart nginxes by hand it seems [10:06:54] <_joe_> s/restart/reload/ [10:08:37] <_joe_> mh no actually this will need a restart [10:08:58] <_joe_> !log restarting nginx on ssl100* servers in sequence, to activate PFS [10:09:04] Logged the message, Master [10:09:17] _joe_: hmm ? let me read backlog [10:11:27] <_joe_> akosiaris: changing nginx.conf does not restart nginx on ssl100*/ssl300* [10:11:44] <_joe_> so that ops can do that gradually when they feel like it [10:11:55] cool [10:11:56] <_joe_> we can also depool those hosts from pybal before restarting [10:12:03] <_joe_> but I don't see a point honestly [10:12:28] I think you have a point there [10:12:28] <_joe_> it a less-than-a-second hiccup [10:12:52] <_joe_> not sure if nginx restarting is better or worse than depooling a server [10:12:59] well, assuming the config does not break the restart and nginx is capable of restarting [10:13:14] well reloading more like it [10:13:17] <_joe_> i already tested that with a server I depooled :) [10:13:23] <_joe_> reloading will not be enough [10:13:29] <_joe_> not in this case [10:14:01] needs to reinitialize the SSL socket ? 
[10:14:07] <_joe_> yes [10:14:34] <_joe_> needs to reload the correct chiphers as well and it's not something they can do by reloading [10:14:49] <_joe_> it's strange given how good is nginx at reloading configs [10:15:00] <_joe_> you know what? I'll depool the servers while doing the restart, it costs me nothing [10:15:32] not following on the dpkg-query/hadoop thingy btw [10:15:43] backlog was not really enlightening [10:15:45] <_joe_> see otto's email @ops [10:16:52] ah ok. Yeah I look into it [10:17:27] <_joe_> akosiaris: on trustys you see the full message which shows that is a dpkg-query [10:25:02] akosiaris: if you are interested in my bit on the hadoop thingy, poke [10:25:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is almost LGTM. Two minor issues and I think we are good to go" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [10:26:25] matanya: As soon as I clean up the queue of other things, will do. Thanks [10:30:32] <_joe_> !log all eqiad SSL terminators are now PFS enabled. Moving to rolling restarting esams [10:30:38] Logged the message, Master [10:32:39] <_joe_> https://www.ssllabs.com/ssltest/analyze.html?d=en.wikipedia.org :) [10:32:40] akosiaris: https://gerrit.wikimedia.org/r/#/c/139095/32/modules/cxserver/templates/logrotate.erb is syntax error without quotes. [10:32:46] <_joe_> (this points to eqiad) [10:35:20] kart_: responded privately [10:35:40] :) [10:38:00] <_joe_> !log esams restart finished, moving to ulsfo [10:38:06] Logged the message, Master [10:39:04] (03PS33) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [10:41:26] kart_: LGTM. Got any specific requirements about merging it ? Or can I do it ? [10:41:59] feel free :) [10:42:11] hurray [10:42:46] _joe_: did that include nginx on esams text caches, too? [10:42:57] btw. YAY! 
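An editor's aside on the reload-vs-restart exchange above: enabling PFS means putting ECDHE (forward-secret) suites at the front of the cipher list. A minimal nginx sketch of that kind of change follows — the suite names and timeout here are illustrative only; the list actually deployed is in gerrit change 132393 and is not reproduced here:

```nginx
# Prefer the server's ordering so the ECDHE (forward-secret) suites win.
ssl_prefer_server_ciphers on;
# Illustrative ECDHE-first list; NOT the exact production cipher string.
ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA:AES128-SHA';
ssl_session_timeout 5m;
```

As the log notes, a config reload was not enough to pick up the new ciphers on these terminators; a full nginx restart (optionally depooling the host from pybal first) was used. A handshake can be spot-checked with `openssl s_client -connect en.wikipedia.org:443`, which reports the negotiated cipher suite.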
[10:42:58] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [10:43:02] <_joe_> jzerebec1i: ulsfo you mean [10:43:11] <_joe_> ulsfo is going live now [10:43:32] no i meant if you already did the text caches in esams? [10:43:50] <_joe_> jzerebec1i: ssl terminates on ssl300* in esams AFAIK [10:44:35] akosiaris: Thank you :) [10:44:40] (03PS1) 10QChris: Make backup handle ensuring 'absent' [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143266 [10:45:00] I suppose we should keep an eye on https://gdash.wikimedia.org/dashboards/frontend/ [10:45:21] (03PS1) 10QChris: Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 [10:45:23] (03PS1) 10QChris: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 [10:45:36] <_joe_> Nemo_bis: yes, I'd be surprised if something happens [10:45:46] AES_128_CBC, TLS v1.1, ECDHE-RSA [10:45:56] kind of weird... [10:46:04] what's wrong with my chrome ? [10:46:36] (03CR) 10jenkins-bot: [V: 04-1] Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 (owner: 10QChris) [10:47:00] _joe_: it's mostly IE having problems right? 
maybe at 99th percentile something moves [10:47:03] (03CR) 10jenkins-bot: [V: 04-1] Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [10:47:08] (but not so far) [10:47:20] <_joe_> Nemo_bis: it's slightly slower [10:47:32] <_joe_> I think connection slowness will be much more influential [10:48:36] _joe_: once you are done, please let me know if i can purge is_puppet_master in site.pp [10:49:18] <_joe_> matanya: I don't think so, that has to do with labs puppetmasters I guess [10:49:51] <_joe_> !log nginx restarted on all ulsfo hosts as well, we should be PFS-enabled now [10:49:55] Logged the message, Master [10:50:06] (03CR) 10Milimetric: [C: 04-1] Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [10:50:39] ok _joe_just relaying a question by andrewbogott [10:52:05] <_joe_> matanya: I think I can take a look with andrew this evening [10:52:16] thank you [10:54:35] (03CR) 10QChris: Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [10:55:59] (03PS2) 10QChris: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 [10:56:01] (03PS2) 10QChris: Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 [10:57:12] (03CR) 10jenkins-bot: [V: 04-1] Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [11:15:40] _joe_: all prod is now puppet 3? including precise and below boxes ? [11:15:45] yes [11:15:57] well I guess not lucid [11:16:13] * matanya goes to purge some code ... :) [11:16:20] what code? [11:16:32] modules/apt/manifests/puppet.pp [11:16:42] all the 2.7 if part [11:17:03] <_joe_> lucid is not. 
[11:17:09] too bad [11:17:32] <_joe_> matanya: well add me as a reviewer, not sure it obvious what you can remove and what not [11:17:36] still can fix it do be lucid instead of precise [11:17:54] <_joe_> matanya: no need to do anything for lucids [11:18:09] not even asking about hardy [11:18:28] <_joe_> are you sure you understand correctly how and why we did that kind of pinning? [11:18:29] so i'll think what would be the best way to remove this part of code [11:18:36] i think i have [11:18:48] <_joe_> ok, so go on :) [11:19:01] <_joe_> sorry, lunch [11:19:37] to verify, we prefer wikimedia over ubuntu everywhere, but in case of precise with 2.7 we tell the bx to prefer ubuntu [11:19:45] *box [11:23:30] (03PS1) 10Matanya: apt: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/143271 [11:26:45] (03PS1) 10QChris: Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 [11:30:26] (03CR) 10jenkins-bot: [V: 04-1] Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 (owner: 10QChris) [11:33:54] (03PS1) 10QChris: Lint: Quote global [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143275 [11:37:24] (03PS2) 10QChris: Lint: Scope global [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143275 [11:40:55] (03PS1) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [11:41:07] (03CR) 10jenkins-bot: [V: 04-1] Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [11:47:46] !log reedy Synchronized database lists: (no message) (duration: 00m 16s) [11:47:51] Logged the message, Master [11:48:16] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 14s) [11:48:20] Logged the 
message, Master [11:48:42] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [11:48:47] Logged the message, Master [11:50:18] PROBLEM - Disk space on ms-be3003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdb3 74954 MB (4% inode=97%): /srv/swift-storage/sda3 88432 MB (4% inode=98%): /srv/swift-storage/sde1 106033 MB (5% inode=98%): /srv/swift-storage/sdh1 84818 MB (4% inode=98%): /srv/swift-storage/sdj1 93190 MB (4% inode=98%): /srv/swift-storage/sdl1 89435 MB (4% inode=98%): /srv/swift-storage/sdi1 81889 MB (4% inode=98%): /srv/swift-storage/ [11:51:32] (03CR) 10Reedy: [C: 032] Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [11:51:32] (03Merged) 10jenkins-bot: Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [11:51:47] (03PS2) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [11:53:00] !log Manually created wikimania2015wiki database on 10.64.16.18 [11:53:05] Logged the message, Master [11:55:22] !log reedy Synchronized database lists: (no message) (duration: 00m 28s) [11:55:27] Logged the message, Master [11:55:47] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 13s) [11:55:52] Logged the message, Master [12:00:48] !log Manually created Echo tables on extension1 [12:00:54] Logged the message, Master [12:01:28] (03CR) 10Hoo man: [C: 04-1] "Only had a quick look" (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:01:28] (03PS1) 10Reedy: Disable Echo on wikimania2015wiki till tables created [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/143278 [12:01:28] (03PS3) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [12:01:28] (03CR) 10Raimond Spekking: "Thanks Hoo man for the quick review. It's my first commit to labs :-)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:01:28] (03Abandoned) 10Reedy: Disable Echo on wikimania2015wiki till tables created [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143278 (owner: 10Reedy) [12:02:33] (03CR) 10QChris: "> I've added qchris as a reviewer so he can check the geowiki" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [12:03:37] !log reedy Synchronized wmf-config/interwiki.cdb: (no message) (duration: 00m 13s) [12:03:42] Logged the message, Master [12:05:35] (03CR) 10Hoo man: [C: 04-1] "Sorry if my comment was confusing" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:05:45] (03PS1) 10Reedy: sync-common-file is no more, use sync-file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143279 [12:06:02] (03PS4) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [12:06:02] (03CR) 10jenkins-bot: [V: 04-1] Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:08:38] (03PS5) 10Raimond Spekking: Enable RSS extension for dewiki on Labs for testing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 [12:13:24] (03CR) 10Hoo man: [C: 032] "Should be fine now :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:13:31] (03Merged) 10jenkins-bot: Enable RSS extension for dewiki on Labs for testing 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143276 (owner: 10Raimond Spekking) [12:14:18] Reedy: --^ as you're currently messing with stuff... can you sync that out while you're on it [12:15:04] (03PS1) 10Reedy: Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 [12:15:23] (03PS2) 10Reedy: Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 [12:15:29] (03CR) 10Reedy: [C: 032] Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 (owner: 10Reedy) [12:15:34] (03Merged) 10jenkins-bot: Update IW cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143280 (owner: 10Reedy) [12:16:07] !log reedy Synchronized wmf-config/InitialiseSettings-labs.php: (no message) (duration: 00m 20s) [12:16:12] Logged the message, Master [12:16:15] thx [12:21:03] (03CR) 10Whym: "@TTO: I asked MaxSem last week or so, and he seemed to be waiting just in case there are other opinions on the choice of the feed name, if" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [12:21:21] (03PS1) 10Reedy: Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 [12:21:37] (03PS2) 10Reedy: Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 [12:21:42] (03CR) 10Reedy: [C: 032] Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 (owner: 10Reedy) [12:21:48] (03Merged) 10jenkins-bot: Fix wgServer/wgCanonicalServer for noboards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143281 (owner: 10Reedy) [12:22:18] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 18s) [12:22:22] Logged the message, Master [12:22:45] (03PS7) 10Reedy: Remove remnants of . 
replaced with _ in "lang" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134948 [12:23:05] (03CR) 10Reedy: [C: 032] Remove remnants of . replaced with _ in "lang" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134948 (owner: 10Reedy) [12:23:13] (03Merged) 10jenkins-bot: Remove remnants of . replaced with _ in "lang" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134948 (owner: 10Reedy) [12:23:47] !log reedy Synchronized multiversion/: (no message) (duration: 00m 23s) [12:23:53] Logged the message, Master [12:28:34] _joe_: have moment for the apt::puppet change ? [12:29:22] <_joe_> matanya: in 10 minutes :) [12:29:27] thanks [12:41:06] _joe_: The time is nigh to hear what matanya has to say [12:41:48] <_joe_> matanya: :P [12:41:59] <_joe_> I had just opened your change [12:43:32] <_joe_> matanya: AmA [12:44:23] I didn't push any yet _joe_ [12:45:05] <_joe_> matanya: yes I got confused by the apt lint one [12:45:14] looking at the code, in modules/apt/manifests/puppet.pp I see if ($version == '2.7') [12:45:41] that part seems like it can be ditched, since no 2.7 should be preferred from ubuntu over the one from wikimedia [12:46:27] <_joe_> yes, just be sure no file would be kept around [12:46:38] <_joe_> it sould not [12:47:17] as for the else: trusty should follow ubuntu or wikimedia ? 
[12:47:21] (03PS1) 10Filippo Giunchedi: swift: add icehouse pinning to ms-be1* [operations/puppet] - 10https://gerrit.wikimedia.org/r/143282 [12:48:53] _joe_: my only doubt is the elsif which seems like it should stay [12:49:14] <_joe_> matanya: so remove the whole conditional [12:49:26] <_joe_> matanya: submit your change and I'll review it :) [12:49:33] sure [12:49:35] thanks [12:54:30] (03PS1) 10Matanya: apt::puppet: remove puppet 2.7 conditional [operations/puppet] - 10https://gerrit.wikimedia.org/r/143283 [12:57:16] (03CR) 10Matanya: "I would remove the entire file and just use the files from our apt repo, do we still need pinning for puppet?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143283 (owner: 10Matanya) [13:04:31] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We should:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143283 (owner: 10Matanya) [13:08:10] !log dist-upgrade and reboot boron [13:08:14] Logged the message, Master [13:15:14] (03CR) 10Giuseppe Lavagetto: [C: 031] Add apache::conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [13:16:10] !log dist-upgrade and reboot tellurium [13:16:15] Logged the message, Master [13:18:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I do agree with the principle, but please reimplement." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:31:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I will echo Giuseppe on that front" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:31:49] (03PS2) 10Giuseppe Lavagetto: add apache::param resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:32:44] ::param? 
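The conditional being removed in change 143283 implemented the pinning described earlier in the log: apt prefers the Wikimedia repo everywhere, except that precise hosts kept on puppet 2.7 were told to prefer Ubuntu's packages. A minimal apt preferences sketch of that exception — the path, package list, origin label and priority are all assumptions for illustration, not the real file:

```
# /etc/apt/preferences.d/puppet -- illustrative sketch only
Package: puppet puppet-common
Pin: release o=Ubuntu
Pin-Priority: 1001
```

With puppet 3 everywhere in prod (lucid aside), the exception — and, per the review comment above, possibly the whole fragment — can go.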
:) [13:33:23] <_joe_> paravoid: well, name sucks [13:33:38] <_joe_> ::envvar would make more sense probably [13:33:48] yes [13:33:54] <_joe_> doing that [13:33:59] <_joe_> ori will hate me :P [13:36:04] (03CR) 10Ottomata: "Great, ok, just wanted to double check. Probably just an old typo. Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [13:38:18] (03CR) 10Milimetric: [C: 031] "I made a mistake in the previous review, `ensure: absent` has been implemented properly." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:38:33] (03CR) 10Ottomata: [C: 032 V: 032] Make backup handle ensuring 'absent' [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143266 (owner: 10QChris) [13:38:44] (03PS3) 10Ottomata: Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 (owner: 10QChris) [13:39:31] (03CR) 10Ottomata: [C: 032 V: 032] Append hostname to wikimetrics backup target [operations/puppet] - 10https://gerrit.wikimedia.org/r/143267 (owner: 10QChris) [13:39:48] (03PS3) 10Ottomata: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:40:01] (03CR) 10Ottomata: [C: 032 V: 032] Lint: Scope global [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/143275 (owner: 10QChris) [13:40:02] 3 [13:40:24] Who is stealing window focus :-) [13:40:59] (03CR) 10Ottomata: Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:43:45] (03PS1) 10Alexandros Kosiaris: Remove puppet freshness check and all dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/143304 [13:43:47] (03PS1) 10Alexandros Kosiaris: Remove the snmptt user [operations/puppet] - 10https://gerrit.wikimedia.org/r/143305 [13:43:49] (03PS1) 10Alexandros Kosiaris: Remove the last 
resources of snmp on hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143306 [13:44:37] (03PS3) 10Giuseppe Lavagetto: add apache::envvar resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [13:45:29] (03PS4) 10QChris: Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 [13:46:20] (03CR) 10QChris: Have wikimetrics cleanup its backup upon absenting it (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [13:47:28] (03PS2) 10QChris: Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 [13:47:55] (03CR) 10Giuseppe Lavagetto: [C: 031] role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [13:48:05] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [13:49:25] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin [13:50:33] ottomata: ^ [13:50:36] (03PS1) 10Reedy: Non Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 [13:51:03] <_joe_> note to all opsens with merge rights: do NOT sudo puppet-merge [13:51:10] oh? [13:51:13] <_joe_> sudo -i; puppet-merge [13:51:22] oh that's the issue with strontium? [13:51:25] ack [13:51:26] <_joe_> paravoid: doing sudo makes the git-merge hook fail [13:51:26] sorry paravoid [13:51:28] <_joe_> it seems [13:51:29] <_joe_> :) [13:51:35] got it [13:51:39] I'm guilty of sudo puppet-merge [13:51:41] we should fix that though [13:51:43] <_joe_> paravoid: we may want to fix that, but i don't have time [13:51:53] * paravoid points to ottomata [13:51:58] to fix? 
[13:51:58] as the author of puppet-merge [13:51:59] ;) [13:51:59] :D [13:52:02] :D [13:52:05] yeah i'd be happy to do that [13:52:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [13:52:23] just on strontium? or everywhere? [13:52:25] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin [13:52:37] sudo puppet-merge on palladium fails? that's the issue? [13:54:27] <_joe_> ottomata: it does not fail, it's the post-merge hook that makes the other masters merge as well that fails in that case [13:55:02] hm [13:55:30] <_joe_> ottomata: and you just entered the login-shell-sudo env variables hell [13:55:32] <_joe_> good luck [13:55:36] oh because the post merge is an ssh thing [13:55:37] ? [13:56:18] <_joe_> check by yourself, I just got the error on strontium (git appears unconfigured) and the cause [13:56:23] yeah looking [13:56:51] <_joe_> I'm pretty sure there is an implicit ssh user somewhere or you using $USER or some other envvar [13:57:35] yeah, i think ther eprobably need to be some special sudo rules for this gituser or something [13:57:42] this does look quite annoying......... [13:57:58] <_joe_> told ya :) [13:58:07] sudo -s ; puppet-merge also works fine [13:58:21] a stopgap could be to just exit if it is known it'll fail [13:58:39] not a solution, but better than icinga-wm [13:58:45] ottomata: I figured the dpkg-query puppet thing, replying to your email [13:58:49] that would be a godo first step [13:58:53] oh thanks akosiaris! 
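The `sudo puppet-merge` breakage discussed above comes down to environment: plain `sudo` keeps the caller's `$USER`, so the post-merge hook's gitpuppet check fails, while `sudo -i` / `sudo -s` give a login shell where it passes. A defensive sketch of the stopgap floated in the log (detect the bad invocation and exit early rather than half-merge) — the function, messages and the `someadmin` placeholder are illustrative assumptions, not the actual hook:

```shell
#!/bin/sh
# Guard a hook step that must only run as gitpuppet. Under plain
# `sudo`, $USER can still name the invoking admin, so check explicitly
# and refuse to proceed instead of failing halfway through.
check_hook_user() {
    # $1: the user name the hook sees (e.g. "$USER")
    if [ "$1" = "gitpuppet" ]; then
        echo "ok"
    else
        echo "refusing: hook user is $1, expected gitpuppet"
        return 1
    fi
}

check_hook_user gitpuppet           # prints: ok
check_hook_user someadmin || true   # prints: refusing: hook user is someadmin, expected gitpuppet
```

Checking `id -un` instead of `$USER` would sidestep the sudo environment entirely; either way the point is to fail loudly up front rather than leave strontium unmerged and let icinga-wm complain.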
[14:02:31] (03CR) 10Ottomata: [C: 032 V: 032] Have wikimetrics cleanup its backup upon absenting it [operations/puppet] - 10https://gerrit.wikimedia.org/r/143268 (owner: 10QChris) [14:02:54] (03CR) 10Ottomata: [C: 032 V: 032] Document wikimetrics' use of the $::wikimetrics_backup global [operations/puppet] - 10https://gerrit.wikimedia.org/r/143273 (owner: 10QChris) [14:06:00] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 (owner: 10Krinkle) [14:09:12] (03CR) 10Alexandros Kosiaris: Add apache::conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [14:11:19] akosiaris: I'm still puzzled by that puppet icinga test. I thought I understood it briefly (that it has two modes and we're only using the freshness mode) so wrote https://gerrit.wikimedia.org/r/#/c/143161/ [14:11:37] …but now icinga is reporting 0 failures for a box that can't compile, so I think that patch is insufficient [14:11:51] which box ? [14:12:02] virt1008 [14:12:40] I believe the current test should be showing green (since puppet is running) but also saying that there are 99 errors. [14:12:43] OK: Puppet is currently enabled, last run 223 seconds ago with 99 failures [14:13:03] well, dammit, a second ago it said 0 failures [14:13:35] I guess I will wait until it does that again and then ping you :/ [14:13:42] meanwhile, does the above patch seem correct to you? [14:13:47] so, I think we should do away with freshness intervals and all that jazz and move to "if failures > 0" croak [14:13:50] (03PS1) 10Manybubbles: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 [14:14:03] in theory yes but I think we should not have 2 [14:14:16] Isn't freshness meaningful though? If puppet is not running at all [14:14:21] then we won't get any failure reporting [14:14:37] not running at all ? why ? [14:14:43] disabled ? 
we got a check for that [14:14:56] I don't know why! But I don't want to miss it if it happens [14:15:17] the post-merge check looks at $USER is probably why [14:15:19] akosiaris: locks for example [14:15:32] matanya: that is disabled [14:15:37] Or if puppet itself has a failure (rather than the puppet manifests) [14:15:42] it only updates strontium from palladium if $USER is gitpuppet [14:16:15] akosiaris: you mean Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists) [14:16:15] is disabled ? [14:16:45] matanya: not that is not disabled, but that does not matter it is a race, it will run normally the next interval [14:16:51] bblack@palladium:~$ sudo echo $USER [14:16:52] bblack [14:16:57] oops that's wrong [14:17:04] bblack: puppet-merge does su IIRC [14:17:08] akosiaris: not if the pid gets stuck [14:17:30] happened to me, that is why i bring it up [14:17:58] matanya: like the entire puppet agent run going CPU crazy and never ending and never doing anything ? [14:18:14] akosiaris: we could also add a second set of args to your test so that it can test for freshness and also for failures in the same run [14:18:17] so the process is there and in S state ? [14:18:25] But it seems just as well to have two separate line-items in icinga [14:18:31] that, or the pid isn't cleared after death/failed run/crash [14:18:51] maybe sudo -i puppet-merge ? [14:18:59] that is already checked IIRC matanya [14:19:06] then cool [14:19:29] andrewbogott: adding another 1000 checks to neon for this does not make me exactly happy [14:19:43] we should try to merge those checks into one [14:19:46] akosiaris: ok -- would you like me to modify the tool so it checks both? [14:20:01] how about all three ? [14:20:10] I started to do that yesterday and then I though, no, akosiaris clearly meant this to be two separate modes :) [14:20:12] disabled, failures, time ? 
[14:20:17] sure [14:20:36] it'll make for a long commandline but should be easy [14:20:39] akosiaris: any word on kafka 0.8.1.1? [14:21:04] ottomata: kafka 0.8.1 or gradle ? [14:21:18] uh, both? [14:21:20] (03PS1) 10Hoo man: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 [14:21:25] I can probably have a 0.8.1 this week. gradle.. no [14:21:29] 0.8.1.1 package feesibility [14:21:34] oh gradle schmadle, i don't care [14:21:41] i just want a pacakge :) [14:21:50] I looked at the changes, it seems it will be easy to upgrade to 0.8.1 [14:21:55] ok awesome [14:22:00] 0.8.1.1 btw! [14:22:28] ok :-) [14:22:38] (03PS2) 10Hoo man: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 [14:25:04] (03PS1) 10Manybubbles: Juggle shard counts for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143319 [14:27:18] !log Stopping Jenkins it has some corrupted threads [14:27:23] Logged the message, Master [14:41:49] (03PS2) 10Krinkle: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:43:41] akosiaris: ok, look at virt1008 now. 
When I run puppet agent -tv from the commandline it reports 99 errors but when it runs as part of a normal refresh (as it just now did) it reports 0 [14:43:53] …I think [14:44:07] (03PS3) 10Krinkle: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:44:11] (03PS4) 10Manybubbles: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 [14:44:51] (03PS5) 10Krinkle: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:45:02] sorry for commonet>commit [14:45:06] undone [14:45:09] nice word [14:46:17] (03CR) 10Manybubbles: Configure cache warmers (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [14:50:30] manybubbles: I'll let you SWAT your changes today [14:50:37] anomie: thanks! [14:56:39] andrewbogott: wat ? [14:56:49] that is weird... [14:56:55] yep! [14:58:50] andrewbogott: maybe a race ? [14:59:00] while it is running but it has not yet concluded ? 
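The check redesign discussed above — one plugin covering "disabled, failures, time", with any failure count above zero treated as critical — plus the suspected empty-summary race on virt1008 can be sketched as a single decision function. Everything here is illustrative: the real check reads puppet's state files (e.g. last_run_summary.yaml); this models only the logic, with plain arguments and an arbitrary 3600s freshness threshold:

```shell
#!/bin/sh
# Combined puppet check sketch: disabled > empty summary > freshness >
# failures, mirroring the "disabled, failures, time" proposal above.
# Arguments: disabled flag (1/0), seconds since last run, failure count
# ("" models the empty status file suspected in the virt1008 race).
check_puppet() {
    if [ "$1" = "1" ]; then
        echo "CRITICAL: puppet is disabled"; return 2
    fi
    if [ -z "$3" ]; then
        echo "UNKNOWN: empty run summary (run in progress?)"; return 3
    fi
    if [ "$2" -gt 3600 ]; then
        echo "CRITICAL: last run ${2}s ago"; return 2
    fi
    if [ "$3" -gt 0 ]; then
        echo "CRITICAL: ${3} failures"; return 2
    fi
    echo "OK: last run ${2}s ago with ${3} failures"
}

check_puppet 0 223 0           # prints: OK: last run 223s ago with 0 failures
check_puppet 0 223 99 || true  # prints: CRITICAL: 99 failures
```

Returning UNKNOWN rather than OK for an empty summary avoids the virt1008 symptom, where reading the status file mid-run reported 0 failures; folding all three conditions into one plugin also avoids adding another thousand per-host checks to neon.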
[15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1500) [15:00:45] akosiaris: yeah, probably it's checking the status file but the file is empty [15:01:13] (03CR) 10Manybubbles: [C: 032] Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [15:01:16] (03CR) 10Manybubbles: [C: 032] Juggle shard counts for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143319 (owner: 10Manybubbles) [15:02:59] (03Merged) 10jenkins-bot: Configure cache warmers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143317 (owner: 10Manybubbles) [15:03:03] (03Merged) 10jenkins-bot: Juggle shard counts for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143319 (owner: 10Manybubbles) [15:03:59] !log temporarily disabling puppet on hafnium to test an eventlogging alert [15:04:03] Logged the message, Master [15:04:31] !log manybubbles Synchronized wmf-config: SWAT - cirrus settings - cache warmers and shard counts (duration: 00m 06s) [15:04:35] Logged the message, Master [15:06:24] !log manybubbles Synchronized php-1.24wmf11/extensions/CirrusSearch/: SWAT code to set up cache warmers (duration: 00m 05s) [15:06:27] Logged the message, Master [15:07:45] (03PS1) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [15:09:01] !log done with SWAT deploy [15:09:08] Logged the message, Master [15:13:47] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [15:14:21] (03CR) 10Ottomata: [C: 032 V: 032] "Nuria and I tested this a bunch, and there are parts of upstart that are quite strange to us. However, we think this will work." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143258 (owner: 10Nuria) [15:14:26] nuria, s'ok? [15:14:27] ^ [15:14:42] nuria looking [15:15:43] ah, left WIP on there [15:15:46] oops [15:16:16] akosiaris: should we bother having a configuration threshold for # of failures? Or just regard any failures as critical? [15:16:33] the latter [15:20:36] (03PS2) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [15:23:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin [15:24:10] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [15:24:29] ottomata: ^ is that you? [15:24:35] yes ah [15:24:42] sorry too many chats and merges at the same time [15:25:03] andrewbogott: what's the check limit at right now? [15:25:04] 5 mins? [15:25:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [15:25:14] ottomata: no idea [15:25:15] (03PS2) 10PiRSquared17: Grant 'centralauth-rename' right to stewards [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139655 (owner: 10Gerrit Patch Uploader) [15:25:30] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin [15:27:34] <_joe_> ottomata: 10 mins [15:27:42] <_joe_> it's just to remember you [15:27:45] hah [15:27:51] <_joe_> I know it's annoying :) [15:28:00] ahhh, 10 is good ok [15:28:07] naw, 10 is good, was going to say 5 was too short [15:28:18] i usually remember better than i have been this morning, just have a lot of chats open! [15:28:33] ottomata i had some packaging questions -- should I just private message you? 
[15:28:51] naw, ask here dogeydogey :) [15:29:00] others might be able to answer better than me, and then they can see the context [15:29:56] so i tried to package as practice [15:30:10] whoops, I tried to package https://github.com/etsy/logster as practice [15:30:15] (03CR) 10Manybubbles: [C: 031] "Ran performance tests for this - didn't even cause a blip." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142606 (owner: 10Chad) [15:30:37] manybubbles: show off :) [15:30:55] Reedy: after we caused an outage last week, I'm running performance tests [15:30:58] every fucking time [15:31:12] <_joe_> manybubbles: good guy :) [15:31:26] because we don't (yet) have cache warmers, adding a bunch at a time can cause thrash [15:31:32] _joe_: I'm not good - just scared [15:31:41] <^d> Those cache warmers will make all the difference. [15:31:49] <_joe_> manybubbles: oh so you're molding slowly into an ops [15:31:49] I hope so [15:31:50] they should [15:32:01] _joe_: slowly slowly slowly [15:32:11] I did that at a 20 person startup after our ops left [15:32:15] but I was bad at it [15:32:15] PROBLEM - Check status of defined EventLogging jobs on graphite consumer on hafnium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/graphite [15:32:32] <_joe_> manybubbles: next step is realizing all software is crap and it only works by chance [15:32:47] _joe_: I'm so far beyond that [15:32:57] <_joe_> then you'll have the positive attitude all ops always have :) [15:32:59] I don't even believe I write good code [15:33:19] <_joe_> manybubbles: who does?
[15:33:21] scratch that, I especially believe I make stupid mistakes [15:33:30] ottomata I think I did everything correctly but when I try to build the package (debuild -us -uc) I get an error: http://pastie.org/9343892 [15:33:31] _joe_: code review helps so much [15:33:38] <_joe_> it does [15:33:54] <_joe_> it uniforms us to the same horrible mistakes we reach a consensus on :P [15:34:56] btw that eventlogging jobs thing is me... [15:35:01] (dogeydogey will get back to you in just a bit) [15:35:36] <^d> manybubbles: Speaking of code you wrote that's good...random question. I had someone asking me about insource: yesterday. Couldn't get it to find wiki tag thingies (like <ref> or <score>), just "ref" and "score" in the results (which is way over inclusive!). [15:36:09] <^d> How can we search for this? [15:36:11] <^d> I made sure we're not stripping the tags in the page text builder. [15:36:13] ^d: I imagine it is because insource:foo is preparsed. insource:// would do it [15:36:26] insource is preparsed [15:36:27] <^d> I tried that I thought... [15:36:34] and used the english parser [15:36:47] rather, uses the language parser for the wiki [15:37:15] RECOVERY - Check status of defined EventLogging jobs on graphite consumer on hafnium is OK: OK: All defined EventLogging jobs are running. [15:37:48] hey, can I work on these bugs: https://bugzilla.wikimedia.org/show_bug.cgi?id=51434, https://bugzilla.wikimedia.org/show_bug.cgi?id=51497, https://bugzilla.wikimedia.org/show_bug.cgi?id=54065 [15:37:49] * ^d twiddles thumbs while search does its thing [15:38:10] <^d> "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." [15:38:14] <^d> Aww :( [15:38:39] dogeydogey: [15:38:40] (03PS1) 10Andrew Bogott: Simplify check_puppetrun.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 [15:38:46] akosiaris: this doesn't fix the race condition, but it consolidates the tests ^ [15:38:48] q1: did you actually clone from etsy or from our repo? [15:39:07] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/logster [15:39:10] (03Abandoned) 10Andrew Bogott: Check for puppet failures as well as for puppet staleness. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143161 (owner: 10Andrew Bogott) [15:39:30] ottomata I cloned from etsy's github repo [15:39:55] ok, well that will be your first problem! there's no debian/ directory there :p [15:40:02] go with this one [15:40:03] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/logster [15:40:04] (03CR) 10Dzahn: [C: 031] apt: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/143271 (owner: 10Matanya) [15:40:08] it has a debian branch [15:40:11] also, when you build [15:40:14] use git-buildpackage [15:40:16] not debuild [15:40:33] ^d: damn it [15:40:51] andrewbogott: if failcount == 99. Is that documented somewhere ? [15:41:04] ottomata aren't I supposed to build my own debian directory? [15:41:10] the 99 that is [15:41:11] that's what i've been doing, building my own [15:41:45] akosiaris: the 99 is set earlier in the script [15:41:54] I could just have it error out there rather than use 99 as a flag... [15:43:23] ah yes. I completely forgot about that. Hmmm maybe it is better using a symbol [15:43:56] I am gonna try something [15:44:04] 'k [15:46:14] dogeydogey: not with git-buildpackage [15:46:23] i guess, if you are practicing and want to try it yourself, that's fine [15:46:31] sorry, thought you were just practicing building [15:46:41] but if you want to make it up on your own for practice, go right ahead [15:46:47] ottomata nah, so I guess I'm trying to take any code from github and package it [15:47:00] and that's what I did with logster, is that wrong? [15:47:03] but! 
i am actually only versed in git-buildpackage! since i have only built .debs for WMF and that is what we use [15:47:08] naw, its cool, if you are just practicing that's fine [15:47:23] i have two links for you though... [15:48:10] http://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.import.html#GBP.IMPORT.UPSTREAM-GIT [15:48:17] although, we usually do things a little differently here [15:48:26] debian branch is our --git-debian-branch (usually) [15:48:35] (03PS4) 10Filippo Giunchedi: update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 [15:48:36] and for packages without tags, you have to specify the proper cli args [15:48:40] you can see what we do in gbp.conf [15:48:51] https://github.com/wikimedia/operations-debs-logster/blob/debian/debian/gbp.conf [15:48:54] aaannnnd [15:49:17] for python packages, I wrote this (even though its not fai don's favorite): https://wikitech.wikimedia.org/wiki/Git-buildpackage#How_to_build_a_Python_deb_package_using_git-buildpackage [15:50:16] ottomata so maybe we can take a step back, the whole point of building a package is to install something that isn't available in the current apt-get repo right? whether it's the software as a whole or a specific version of it?
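For anyone following along, the gbp.conf ottomata links to encodes the branch layout he describes (packaging on a `debian` branch, upstream on `master`, no release tags). The snippet below is an illustrative sketch using standard git-buildpackage option names, not a copy of the logster file:

```ini
# Sketch of a WMF-style gbp.conf (illustrative values, not copied
# from the logster repo linked above).
[DEFAULT]
# packaging work lives on the 'debian' branch, not 'master'
debian-branch = debian
# upstream code is tracked on 'master'
upstream-branch = master
# no pristine-tar branch in this layout
pristine-tar = False
```

With a config like this in `debian/gbp.conf`, running `git-buildpackage` from the `debian` branch picks the right branches without extra CLI arguments.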
[15:50:53] so if i build a python app that just prints "hello universe" to the terminal, i can package that up and allow anyone to install it [15:53:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] update index page for wikimedia downloads [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [15:53:43] sure [15:54:30] dogeydogey: ja that's cool, but also the point of you building a package for wmf is learning how we prefer to do it, in spite of our terrible documentation and fragmented ways :p [15:54:41] there are so many different ways to do it, and I am constantly confused by it [15:54:52] oh i see [15:55:06] i've learned a very specific way to do it by now (via git buildpackage), and usually can get those packages reviewed and merged [15:55:10] let me read the links and i'll get back to you [15:55:18] so, if you want to learn to do them for WMF, then i'd go about learning the same way [15:55:26] check out other packages in operations/debs/ [15:55:29] there are a lot in there [15:55:32] ottomata also am I cool to work on these bugs: https://bugzilla.wikimedia.org/show_bug.cgi?id=51434, https://bugzilla.wikimedia.org/show_bug.cgi?id=51497, https://bugzilla.wikimedia.org/show_bug.cgi?id=54065 [15:55:40] probably best to stick with ones that have more recent commit dates [15:56:23] dogeydogey: probably? def sync up with the requesters of those bugs [15:56:28] especially YuviPanda [15:56:46] I DIDN'T DO IT [15:56:48] oh [15:56:52] :) [15:57:04] well, yuvi for 51434 and maybe 54065 [15:57:05] also, i wish the tickets had more info to get started [15:57:08] yeah [15:57:15] talk to the requesters and ask for that [15:58:10] paravoid: i see that mail aliases are handled by puppet but I can't find the file to update anywhere...where is it?
[16:00:14] (03PS3) 10Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/143236 [16:01:19] dogeydogey: have you seen that your latest patch broke some stuff? that is a good learning opportunity [16:04:51] (03PS1) 10Yuvipanda: toollabs: Reduce number of things being logged [operations/puppet] - 10https://gerrit.wikimedia.org/r/143335 [16:04:54] Coren: ^ [16:05:26] matanya which one? [16:05:31] Coren: should remove some unwanted logging that I over enthusiastically added :) [16:06:05] (03CR) 10coren: [C: 032] toollabs: Reduce number of things being logged [operations/puppet] - 10https://gerrit.wikimedia.org/r/143335 (owner: 10Yuvipanda) [16:06:11] Coren: ty [16:09:30] (03CR) 10Calak: [C: 031] Kill $wgEnableNewpagesUserFilter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: 10TTO) [16:12:23] apergos: lvilla@ sent us a new index page for dumps.wm.org and got merged in https://gerrit.wikimedia.org/r/#/c/141671/ however it seems that the files on dataset1001 are not updated via puppet? the ones in /data/xmldatadumps/public [16:13:20] dogeydogey: https://gerrit.wikimedia.org/r/#/c/142479/ [16:19:04] (03PS2) 10Alexandros Kosiaris: Simplify check_puppetrun. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 (owner: 10Andrew Bogott) [16:19:19] andrewbogott: wanna take a look ? ^ [16:19:47] I think it also incorporates the check_puppet_disabled check so we could ditch this one afterwards [16:19:56] s/I think// [16:23:23] (03CR) 10Andrew Bogott: [C: 031] Simplify check_puppetrun. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 (owner: 10Andrew Bogott) [16:24:03] akosiaris: i guess i'm going to remove the versions.rb facts and just hardcode the versions in the role...
[16:24:03] hm [16:26:03] (03CR) 10Anomie: [C: 031] "I note that the default for this config variable is true in DefaultSettings.php, so this would change things only for frwiki, nlwiki, and " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: 10TTO) [16:32:00] ottomata: hardcoding the versions doesn't sound right [16:32:35] maybe we can avoid this somehow? What is the name/dir of this file ? [16:32:48] * akosiaris expects >64 chars for a filename [16:33:39] (03CR) 10Alexandros Kosiaris: [C: 032] Simplify check_puppetrun. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143332 (owner: 10Andrew Bogott) [16:35:22] (03CR) 10Ori.livneh: [C: 031] "Awesome; this is exactly what I had in mind." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [16:35:57] akosiaris: [16:35:58] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hive.pp#L75 [16:36:03] and i was planning on doing [16:36:07] auxpath => "file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-${::hive_version}-${::cdh_version}.jar", [16:36:28] (the path has changed slightly in cdh5) [16:37:07] (03CR) 10Dzahn: "is this about apache-config or about mediawiki-config? the commit message kind of says both" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [16:38:38] ottomata: ldconfig seems so nice right now :-) [16:39:45] ah, well, except that its gonna be a little tricky to do this not as a fact i think [16:39:51] i was thikning inline_template could do it [16:39:59] but that will render on the puppet master, not on the agent, right? [16:40:24] i don't really want to set this auxpath in the module itself, as it is definitely a wmf specific usage [16:40:24] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:40:39] {failcount} failed :p [16:40:43] meh... 
[16:41:03] akosiaris: if I made it a function? would that run on the agent, but not unless it is called? [16:41:24] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:41:25] PROBLEM - puppet last run on dataset2 is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:41:43] oh no, not {failcount} failures! [16:41:48] ha [16:41:55] lol [16:41:58] we should fix it by applying {fix} [16:42:02] haha [16:42:15] #{fix} [16:42:48] ha, akosiaris, we could do a symlink with an exec :p [16:43:08] ottomata: still how would we determine the target ? [16:43:12] ln -s /path/to/hcatalog*.jar /path/to/hcatalog.jar [16:43:14] :p [16:43:48] btw auxpath => 'file:///usr/lib/hcatalog/share/hcatalog/*.jar' [16:43:55] wouldn't it work too ? [16:44:02] doubt it, this gets rendered in an xml config file [16:44:03] but you don't want all jars, right ? [16:44:23] https://github.com/wikimedia/puppet-cdh/blob/master/templates/hive/hive-site.xml.erb#L279 [16:44:34] no, and i doubt that shell wildcard would work here [16:44:40] (03PS1) 10Dzahn: fix failed failcount with fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/143345 [16:44:44] sigh [16:45:24] custom puppet function might work though, right? [16:45:28] facter is cleaner [16:45:29] mutante, that ^ is good, but the wikimetrics update module ? [16:45:40] ha, akosiaris, i could just redirect the dpkg-query output to dev null :p [16:45:40] ottomata: well yes but it sounds like a bad idea [16:45:56] what are you trying to do? [16:45:56] the custom function that is [16:45:58] i hate submodules [16:46:15] jgage: FWIW, sdk1 on ms-be3003 is almost emptied (set to 0 weight, 50G left) [16:47:30] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:49:08] ori, what am I trying to do? [16:49:13] you are asking me?
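The literal "{failcount}" in those alerts looks like a Ruby interpolation slip: inside double quotes Ruby only substitutes `#{...}`, so a missing `#` leaves the braces as plain text. A minimal sketch of the bug (illustrative only, not the actual check_puppetrun code):

```ruby
failcount = 3

# Missing '#': the braces are just literal characters, which is how an
# alert ends up reading "Puppet has {failcount} failures".
broken = "Puppet has {failcount} failures"

# With '#{...}' inside double quotes, Ruby interpolates the value.
fixed = "Puppet has #{failcount} failures"

puts broken
puts fixed   # prints "Puppet has 3 failures"
```

The same mistake in reverse (writing `#{...}` inside single quotes) also produces a verbatim placeholder, since single-quoted Ruby strings never interpolate.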
[16:49:37] yeah just curious, sounds like a fun problem [16:50:01] sigh, apache foundation uses confluence and jira [16:50:16] I was hoping they would see the error of their ways by now [16:50:36] they can't, because they use jira [16:50:52] hive.aux.jars.path The location of the plugin jars that contain implementations of user defined functions and serdes. [16:51:01] so... what do you understand from that ? [16:51:09] ori, trying to make the versions in this file path automated rather than hardcoded [16:51:09] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hive.pp#L75 [16:51:18] current solution is custom facts: [16:51:19] https://gerrit.wikimedia.org/r/#/c/142715/1/lib/facter/versions.rb [16:51:19] a dir ? a comma separated list ? a single file ? [16:51:35] but, facts are run on every node, even if the containing module isn't included [16:51:54] so we are seeing stupid (harmless) dpkg-query warnings on every puppet run: [16:52:08] No packages found matching hadoop. [16:52:09] etc. [16:52:40] akosiaris: would it be so bad if I redirected the dpkg-query output? [16:52:43] btw these are being stored in the database [16:52:51] hm [16:52:54] the facts? [16:52:57] yes [16:52:58] like for each node? [16:52:59] hgm [16:53:00] hm [16:53:05] ok not ideal then [16:53:10] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has {failcount} failures [16:53:21] mutante: ? [16:53:23] aside from being a slight misuse, what's so bad about the function idea? akosiaris? [16:53:35] i think mutante is wrestling with a submodule :/ [16:53:42] mutante, lemme know if I can help [16:53:59] ottomata: http://docs.puppetlabs.com/guides/custom_facts.html#confining-facts ? 
[16:54:26] (03PS1) 10Dzahn: fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 [16:54:37] HMMMMM [16:54:51] (03CR) 10Alexandros Kosiaris: [C: 032] fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 (owner: 10Dzahn) [16:54:54] could make a cdh_installed fact [16:54:57] (03CR) 10Dzahn: [C: 032] fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 (owner: 10Dzahn) [16:55:28] or no, we'd have to check each package [16:55:35] ottomata: http://docs.puppetlabs.com/guides/custom_facts.html#fact-precedence looks like it could be exploited here too [16:55:37] hadoop nodes that don't have hive would fail [16:55:46] the functions are not run in the agent scope ottomata, they are run on the puppetmaster [16:55:53] when compiling the catalog [16:55:55] (03CR) 10Dzahn: [V: 032] fix {failcount} fail in new puppetrun check [operations/puppet] - 10https://gerrit.wikimedia.org/r/143347 (owner: 10Dzahn) [16:56:33] (03Abandoned) 10Dzahn: fix failed failcount with fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/143345 (owner: 10Dzahn) [16:56:55] ah foo ok akosiaris [16:57:14] akosiaris: i could just add a conditional in there to check for the package in dpkg or whatever before running dpkg-query [16:57:18] ottomata: per http://docs.puppetlabs.com/facter/2.0/fact_overview.html#main-components-of-simple-resolutions , confine statements "can either match against the value of another fact or evaluate an arbitrary Ruby expression/block." [16:57:20] and return 'uninstalled' or something [16:57:36] or use this confine thing to do that [16:57:44] so, the facts would still be executed and in the puppet db [16:57:51] the confine sounds like the best solution [16:57:58] but, on nodes where packages don't exist [16:58:00] cdh_version would == 'uninstalled' [16:58:08] ? [16:58:22] or, maybe I can return null or something?
[16:58:22] use the confine so the facts will never get populated on hosts not having those packages [16:58:29] right [16:58:41] oh because I can do a custom ruby block [16:58:42] ...yeahhhhhh [16:58:43] ok ok [16:58:43] got it [16:58:45] i like [16:59:10] hm, i guess if confine returns false? [16:59:13] reading... [16:59:13] mutante: thanks [17:00:02] matanya I fixed the issue on https://gerrit.wikimedia.org/r/#/c/142479/ but it won't let me push it [17:00:04] manybubbles, ^d: The time is nigh to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1700) [17:00:04] thanks ori, will try that [17:00:08] gotta grab some lunch [17:00:17] ottomata: yes, if it returns false. e.g.: https://github.com/puppetlabs/facter/blob/978d2ef9390bf920f60af5355c9fe3b36154ad10/lib/facter/gce.rb#L7 [17:02:43] (03CR) 10Chad: [C: 032] Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142606 (owner: 10Chad) [17:02:49] (03PS1) 10Andrew Bogott: Revert "Intentionally break puppet compile for virt1008" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143348 [17:03:00] (03Merged) 10jenkins-bot: Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142606 (owner: 10Chad) [17:03:24] godog, the deployment process is in the middle of being changed, and I need to update the docs to reflect this, I'll have a look at that tomorrow though [17:03:25] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [17:03:35] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [17:03:47] !log demon Synchronized cirrus.dblist: Move remaining pool 4 lsearchd wikis (except commons) to Cirrus (duration: 00m 07s) [17:03:50] Logged the message, Master [17:04:33] btw ottomata, seems like hive.aux.jar.path can 
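The guard-before-query idea the channel settles on can be sketched in plain Ruby. `package_version` is a hypothetical helper; in a real custom fact the installed-check would live in the `confine do ... end` block (with the query in `setcode`), so the fact never resolves, and never reaches puppetdb, on hosts without the package:

```ruby
# Hypothetical sketch: only shell out to dpkg-query when the package is
# actually installed, so hosts without it emit no "No packages found"
# warnings and report no bogus version value.
def package_version(pkg)
  # `dpkg -s` exits non-zero when the package is not installed
  # (and `system` returns nil if dpkg itself is missing).
  return nil unless system("dpkg -s #{pkg} > /dev/null 2>&1")
  `dpkg-query -W -f='${Version}' #{pkg}`.strip
end
```

A real fact would also want to escape `pkg` before interpolating it into a shell command; the sketch skips that for brevity.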
be a directory [17:04:38] (03CR) 10Andrew Bogott: [C: 032] Revert "Intentionally break puppet compile for virt1008" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143348 (owner: 10Andrew Bogott) [17:05:14] apergos: ok! would it be a problem if I just copy the file over given that's in flux? [17:05:35] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:05:46] which could mean that nothing of all this is necessary. Maybe try that out before fighting with the facts? [17:06:06] 5 seconds ago ... [17:06:23] godog: I'd rather look at what happened and fix it correctly if you don't mind [17:06:48] (03PS1) 10Andrew Bogott: Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143350 [17:06:49] apergos: sure! thanks :)) [17:07:25] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:07:51] thanks for picking that up, I remember seeing the ticket and then it dropped off my radar :-/ [17:08:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data exceeded the critical threshold [500.0] [17:08:13] ^d: do any nodes actually have "gerrit::replicationdest" role? [17:08:25] !log restarted mysqld on db1046 m2 slave [17:08:31] Logged the message, Master [17:08:36] <^d> mutante: antimony, gallium, lanthanum possibly. [17:08:43] ^d: thanks [17:08:57] ^d: that's why i ask https://gerrit.wikimedia.org/r/#/c/138008/4/manifests/gerrit.pp [17:09:40] <^d> Hmm. 
[17:09:42] <^d> Should be ok [17:10:38] it should be noop, yea [17:11:23] (03CR) 10Andrew Bogott: [C: 032] Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143350 (owner: 10Andrew Bogott) [17:11:38] (03CR) 10Dzahn: [C: 032] gerrit - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138008 (owner: 10Rush) [17:13:10] Warning: /Stage[main]/Gitblit/File[/etc/apache2/sites-enabled/git.wikimedia.org]: Ensure set to :present but file type is link so no content will be synced [17:13:23] ori: ::rolleyes:: asset-check :) [17:13:55] ^d: no change on antimony, but noticed the unrelated issue above [17:14:17] <^d> :\ [17:15:25] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [17:15:34] akosiaris: success! ^ [17:15:45] COMPLETE PUPPET FAILURE [17:15:54] (03CR) 10Alexandros Kosiaris: Enable ContentTranslation extension on beta labs (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [17:15:55] so [17:15:58] epic puppet fail [17:16:02] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [17:16:14] paravoid: we don't worry about that [17:16:14] I considered 'epic puppet fail' but that seemed too dramatic :) [17:16:18] only misc services [17:16:20] root@antimony:/etc/apache2/sites-enabled# file svn [17:16:20] svn: ASCII English text [17:16:25] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:16:56] akosiaris: nevermind, still broken :( ^ [17:17:22] 
https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1hours&from=-1hours&until=now&width=1024&height=500&target=reqstats.5xx [17:17:28] so was that also broken by the lint change yesterday or is it a new one? [17:17:35] I'm going to leave it broken for a bit and see if the error message flaps [17:17:42] mutante: that's an intentional breakage to test monitoring [17:17:44] anyone want to look at that? [17:17:44] maybe we should stop deleting apache configs [17:17:56] ^d: Are you done? [17:18:00] andrewbogott: it's a different issue i keep trying to point out [17:18:13] <^d> hoo: Yep, all done. Just watching things now. [17:18:19] mutante: oh, ok, sorry [17:18:23] Ok, will push a typo fix, then [17:18:31] (03PS3) 10Hoo man: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 [17:18:38] mutante: I suspect that ori can explain about the apache config thing [17:19:14] It's not an "epic puppet fail" if it doesn't bring the site down [17:19:16] yesterday it was caused by that lint change.. [17:19:27] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [17:19:35] (03CR) 10Hoo man: [C: 032] "typo typo typo..." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 (owner: 10Hoo man) [17:19:40] andrewbogott: so definitely a race [17:19:54] akosiaris: yeah [17:19:59] andrewbogott: hm? [17:20:12] (03Merged) 10jenkins-bot: Fix extension WikiHiero typo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143318 (owner: 10Hoo man) [17:20:22] hoo, are you going to deploy ^^^ right now? [17:20:28] ori: um… mutante keeps finding broken apaches. I don't know any more than that. [17:20:36] MaxSem: Yeah, no-op, though [17:20:46] Anything blocking? [17:21:04] !log fixing svn.wikimedia.org apache site manually [17:21:09] Logged the message, Master [17:21:22] !log restarting apache on antimony [17:21:27] Logged the message, Master [17:21:32] MaxSem: ? 
[17:21:39] running it manually though I can not reproduce it [17:23:10] hmmm I think I have an idea about that but I will explore it further tomorrow andrewbogott [17:23:55] akosiaris: 'k [17:24:28] !log hoo Synchronized wmf-config/: Typos typos typso (duration: 00m 08s) [17:24:33] Logged the message, Master [17:24:36] !log antimony: git.wikimedia.org]: Ensure set to :present but file type is link so no content will be synced [17:24:43] Logged the message, Master [17:27:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [17:28:02] (03CR) 10Faidon Liambotis: [C: 04-1] "+1 on what Daniel said. mediawiki-config is the name we use for something entirely different and this mixup can be very confusing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [17:30:40] (03CR) 10Giuseppe Lavagetto: "I'll change that, I'm just terrible at picking names." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [17:30:51] Are such peaks normal? https://ganglia.wikimedia.org/latest/graph.php?r=week&z=large&h=ms-be3001.esams.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Swift+esams [17:31:25] (03CR) 10Faidon Liambotis: [C: 031] "While this was my idea and I kinda like it, I can't help but wonder: are these domains actually being used?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 (owner: 10Reedy) [17:31:34] Nemo_bis: yes [17:31:44] Nemo_bis: swift esams is not exactly in production [17:31:51] this is just some internal replication traffic [17:32:19] swift shuffling stuff? :) [17:32:28] yeah [17:32:34] there's a disk being emptied [17:32:39] earlier weeks looked calmer [17:32:40] ah [17:33:12] (03CR) 10Faidon Liambotis: [C: 032] swift: add icehouse pinning to ms-be1* [operations/puppet] - 10https://gerrit.wikimedia.org/r/143282 (owner: 10Filippo Giunchedi) [17:33:42] manybubbles: piiing [17:33:44] checkpoint? 
[17:34:22] (03CR) 10BryanDavis: "Any idea how we might be able to use this in beta (deployment-prep labs project) where we currently use the betacluster branch of operatio" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [17:35:17] (03PS1) 10Dzahn: fix apache site setup on git.wm.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 [17:35:27] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:37:18] (03PS4) 10Ori.livneh: add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 [17:38:12] ottomata I think I have packaging figured out but not exactly the wikimedia way [17:39:12] (03CR) 10Dzahn: [C: 032] "just fixing the current issue" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:39:20] can you or someone here give me an overview of the wikimedia-ops deployment process? [17:39:25] or is that online somewhere? [17:39:29] (03PS2) 10Dzahn: fix apache site setup on git.wm.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 [17:39:45] (03CR) 10Dzahn: "i figure apache_site is gone then?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:40:20] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:22] (03CR) 10Faidon Liambotis: [C: 04-1] "Inline comments; plus, the commit should only contain debian/. The upstream sources don't really belong here.
If you're planning to use gi" (036 comments) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [17:41:57] (03PS3) 10Ori.livneh: gitblit: use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:42:32] mutante: updated [17:42:37] http://smokeping.wikimedia.org/?target=ESAMS.Core.cr1-esams [17:42:38] fun fun fun [17:42:51] cause of the reqerrors [17:43:00] ongoing issues [17:43:05] not that I see anyone caring, but still :) [17:43:09] (dogeydogey in some meetings, will get back to you soon) [17:44:39] (03PS4) 10Dzahn: gitblit: use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 [17:45:05] (03PS6) 10Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 [17:45:25] (03CR) 10Dzahn: [C: 032] gitblit: use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:45:30] akosiaris: don't go anywhere, just about to update apache::conf :P [17:45:31] !log rebuilding cirrus index for commons to put it into fewer shards - it should be faster this way [17:45:36] Logged the message, Master [17:46:10] (03CR) 10jenkins-bot: [V: 04-1] Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [17:46:32] (03PS7) 10Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 [17:47:01] (03CR) 10Dzahn: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:49:12] (03PS1) 10Dr0ptp4kt: Restore OM tagging for 470-07. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143359 [17:50:20] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:34] (03PS1) 10Jforrester: Enable TemplateData GUI for English, French and Italian Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143360 (https://bugzilla.wikimedia.org/67376) [17:50:40] ori: Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/70-high-http-performance.conf]/ensure: removed [17:50:50] (03PS1) 10Nikerabbit: Fix string interpolation in cxserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/143361 [17:51:02] bblack: when you have a moment, would you please review and +2 https://gerrit.wikimedia.org/r/#/c/143359 ? the operator confirmed it's a go [17:51:08] ori: it fixed the issue though.. etc/apache2/sites-enabled/50-git-wikimedia-org.conf]: Scheduling refresh of Service[apache2 [17:51:20] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.896 second response time [17:51:21] (03CR) 10Nikerabbit: "This prevents cxserver from starting." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143361 (owner: 10Nikerabbit) [17:51:48] mutante: gitblit's high performance is endangered! :P [17:51:50] (03CR) 10Dzahn: "yep, that fixed it. etc/apache2/sites-enabled/50-git-wikimedia-org.conf]: Scheduling refresh of Service[apache2.. not sure about this one " [operations/puppet] - 10https://gerrit.wikimedia.org/r/143357 (owner: 10Dzahn) [17:52:07] mutante: i guess webserver::apache included the http performance sysctl [17:52:18] iirc paravoid was suspicious of it so not sure if it's worth reintroducing [17:52:31] ori: alright! [17:52:44] thanks for fixing [17:52:46] (03PS2) 10BBlack: Restore OM tagging for 470-07. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143359 (owner: 10Dr0ptp4kt) [17:52:56] (03CR) 10BBlack: [C: 032 V: 032] Restore OM tagging for 470-07. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143359 (owner: 10Dr0ptp4kt) [17:53:01] bblack: thx man [17:53:26] ori: of what? [17:53:46] akosiaris: I updated the patches [17:53:56] paravoid: ori: Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/70-high-http-performance.conf]/ensure: removed [17:54:06] np [17:54:14] sudo -i puppet-merge seems to have worked, btw [17:54:24] i'm not sure how descriptive "high http performance" is of gitblit too [17:54:30] !log demon Synchronized php-1.24wmf10/extensions/Elastica: Updating to master, fixes fatal error (duration: 00m 07s) [17:54:34] Logged the message, Master [17:54:36] outage again [17:55:46] paravoid: esams network issues? [17:55:51] yes [17:56:20] I think :) [17:57:05] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=76 ttl=61 time=653 ms [17:57:08] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=77 ttl=61 time=659 ms [17:57:11] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=80 ttl=61 time=88.5 ms [17:57:14] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_req=78 ttl=61 time=662 ms [17:57:17] definitely [17:57:31] consistently 650+ ms [17:57:41] what the hell [17:58:30] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:58:38] calling gtt [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T1800) [18:00:12] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [18:05:52] (03CR) 10BryanDavis: [C: 031] Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [18:06:19] (03PS2) 10Reedy: Non Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 [18:07:16] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.24wmf10 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 (owner: 10Reedy) [18:07:23] (03Merged) 10jenkins-bot: Non Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143309 (owner: 10Reedy) [18:08:24] 23 # FIXME: remove after the ganglia module migration [18:08:28] that did not happen yet, or? [18:08:35] confused by ganglia_new [18:09:28] if $::realm == 'labs' or ($::hostname in ['netmon1001'] or $::site == 'esams' or ($::site == 'pmtpa' and $cluster in ['cache_bits'])) { [18:12:45] (03CR) 10Dzahn: [C: 032] ganglia - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138010 (owner: 10Rush) [18:14:44] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf11 [18:14:49] Logged the message, Master [18:15:36] !log reedy Synchronized docroot and w: (no message) (duration: 00m 18s) [18:15:40] Logged the message, Master [18:19:52] (03CR) 1020after4: Packaging of php-mailparse from the pecl (032 comments) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:20:29] (03CR) 1020after4: "New patch incoming..." [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:22:39] (03PS3) 1020after4: Packaging of php-mailparse from the pecl [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 [18:23:05] (03PS4) 10Dzahn: dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 (owner: 10Rush) [18:23:56] ori: brain bounce q about versions idea [18:23:58] yt? 
[18:24:00] yep [18:24:12] (03PS3) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [18:24:17] ^ akosiaris [18:24:18] so, this will work, i can run dpkg-query --show and it exits 1 if the package isn't matched [18:24:26] Reedy: greg-g i'll add the new table for wikidata and populate it now, unless some reason not to do that yet [18:24:29] but, i'm not sure how to get the exit code from a facter exec [18:24:37] i can get and parse the output [18:24:37] but that's not as clean [18:24:49] i doubt that I can just do a plain ruby exec of some kind, can I? [18:24:49] this is for the confine, yes? [18:24:50] yes [18:24:59] you can just do a plain ruby exec [18:25:05] that won't be run on the puppetmaster? [18:25:23] i don't think so, but let me reread the docs [18:25:38] aude: All good from me I think [18:25:45] legoktm: What was that backport you wanted? [18:26:14] Reedy: ok [18:26:46] hm, ori, i can test in labs right now i think, i have a puppet client i can use [18:27:12] ottomata: which package are you searching for specifically? [18:27:27] hadoop is a fine example [18:27:40] does it have a top-level executable in $PATH? [18:27:54] yessssssssssss [18:28:04] what is it? [18:28:08] but not all do, and there are actually probably several executables [18:28:09] hm [18:28:10] hdfs [18:28:11] is one [18:28:25] just looking at https://github.com/puppetlabs/facter/blob/deff6048809da75409eb92f42a066093eb820669/lib/facter/dhcp_servers.rb#L22 [18:28:35] it calls Facter::Core::Execution.which('nmcli') [18:28:37] ye, hm [18:28:41] i see where you are doing... [18:28:42] going [18:28:45] seems a little less elegant [18:28:48] i agree [18:28:52] so still looking [18:28:54] but it'd work [18:29:07] which returns true/false based on .. well, you figured it out.
here's the source: https://github.com/puppetlabs/facter/blob/deff6048809da75409eb92f42a066093eb820669/lib/facter/core/execution/posix.rb [18:29:39] yeah [18:30:15] ottomata: wait, why can't you use Facter::Core::Execution.execute? [18:30:17] (03PS2) 10Aude: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 [18:30:51] !log aaron Synchronized wmf-config/PrivateSettings.php: removed obsolete swift tampa config (duration: 00m 07s) [18:30:54] Logged the message, Master [18:31:14] could, but it returns the string ori [18:31:16] and i'd parse it [18:31:19] rather than checking exitval [18:31:28] ruby's system() does what I want (returns false) [18:31:29] (03PS3) 10Aude: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 [18:31:34] checking if I can use that in labs now... [18:33:15] (03PS1) 10Aude: Enable property/entity suggester on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143365 [18:33:17] ready to enable it [18:33:24] shall be done quick as jenkins allows [18:33:35] ottomata: i don't get it; why not simple do: [18:33:35] Facter::Core::Execution.execute('dpkg-query --show hadoop') ~= 'No packages found' [18:33:58] (03CR) 10Aude: [C: 032] Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 (owner: 10Aude) [18:34:04] (03Merged) 10jenkins-bot: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142998 (owner: 10Aude) [18:34:12] because its less elegant than just checking exit val! 
:p [18:34:20] i MAY do that ori :p [18:34:23] Facter::Core::Execution.execute('dpkg-query --show hadoop').start_with('No packages') [18:34:32] err, start_with?('No packages') even [18:35:00] (03CR) 10Aude: [C: 032] Enable property/entity suggester on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143365 (owner: 10Aude) [18:35:10] (03Merged) 10jenkins-bot: Enable property/entity suggester on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143365 (owner: 10Aude) [18:38:11] (03PS1) 10Andrew Bogott: Make virt1008 a live compute node; make virt1009 the puppet canary. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143368 [18:38:23] !log aude Synchronized wmf-config/Wikibase.php: (no message) (duration: 00m 15s) [18:38:27] Logged the message, Master [18:38:53] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable property suggester on Wikidata (duration: 00m 10s) [18:38:56] Logged the message, Master [18:40:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [18:41:26] looks good :) [18:41:26] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Complete puppet failure [18:41:58] !log switching puppet canary from virt1008 to virt1009 [18:42:03] Logged the message, Master [18:42:05] !log adding virt1008 to labs compute pool [18:42:09] Logged the message, Master [18:42:15] (03CR) 10Andrew Bogott: [C: 032] Make virt1008 a live compute node; make virt1009 the puppet canary. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143368 (owner: 10Andrew Bogott) [18:44:47] "Enabled CirrusSearch as the default search backend on 30 more wikis - take five" [18:44:49] I'm slightly confused about the 'do not add new configuration here + readme' on the top of admins.pp Are we still supposed to change ssh keys there (make the old one absent and add the new one) or is that supposed to be in data.yaml somehow? [18:44:50] hmmm. [18:45:15] take five? 
That sounds like it was fun.... [18:45:33] jamesofur: data.yaml I believe. But chasemp is the authority on such things [18:45:48] twkozlowski: ^ [18:45:58] IRC #REDIRECT [18:46:17] ? [18:46:19] jamesofur: no, all in admins.yaml [18:46:52] oh, I was wrong, [18:46:54] jamesofur: modules/admin/data.yaml should already have the key you are looking for [18:47:10] jamesofur: if not, let us know which one [18:47:38] it has my old one yes, but I need to update it, do we still need to do an absent and present version there? Or just replace? [18:48:13] I 'have' the one that is there still but my laptop was stolen. Was encrypted (and the key is passworded) but better safe than sorry [18:48:20] jamesofur: just replace, it should purge anything that's not there. [18:48:22] ori, ottomata, sorry for lurking, but are you wanting to get the exit status of shell execution in ruby? [18:48:24] cool [18:48:26] thanks andrewbogott [18:48:29] * andrewbogott is pretty sure but still defers to chasemp [18:48:33] just replace unless you want to deactivate the entire user [18:48:46] * jamesofur will wait for clarification before submitting but makes edit now [18:48:51] mutante: heh ;) I do not wish to deactivate [18:49:02] marxarelli: nah, we know how to do that; facter expects you to execute commands using its wrappers, and they don't provide access to exit status [18:49:09] Just remember having to do the absent procedure the last time this came around for admins.pp [18:49:29] jamesofur: yes, just replace and i can check it removes it [18:49:46] facter uses %x{} i believe, which means you should have the exit status in $?
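The exit-status question ottomata and marxarelli are circling here can be sketched in plain Ruby (a minimal illustration; `true`, `false`, and `echo` stand in for the real `dpkg-query --show <pkg>` call a fact would run):

```ruby
# Sketch of getting a command's exit status from plain Ruby, per the
# discussion above. `true`/`false`/`echo` stand in for the real
# `dpkg-query --show <pkg>` call a fact would run.

# system() collapses the exit code to a boolean:
ok  = system('true')    # exit 0 -> true
bad = system('false')   # exit 1 -> false

# %x{} (backticks) captures stdout and stores a Process::Status in the
# global $? -- which is what makes it awkward.
out = %x{echo hadoop-2.0.0}
puts "output=#{out.strip} exit=#{$?.exitstatus} ok=#{ok} bad=#{bad}"
# -> output=hadoop-2.0.0 exit=0 ok=true bad=false
```

Facter's own `Facter::Core::Execution.execute` returns only the output string, which is why parsing the output (or using `which`) keeps coming up as the alternative.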
[18:50:35] ori: ^, it's a global however :/ [18:51:32] jamesofur: nothing new goes in admins.pp :) [18:51:36] all goes in data.yaml [18:51:40] * jamesofur nods [18:53:53] (03CR) 1020after4: Packaging of php-mailparse from the pecl (031 comment) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:54:08] (03PS1) 10Jalexander: updating ssh-key for jamesur [operations/puppet] - 10https://gerrit.wikimedia.org/r/143374 [18:55:04] fwiw yes the keys are absolute now, so anything not there won't be there on the end host [18:55:11] no more absent then waiting and then purging dance [18:55:13] perfect [18:56:15] (03CR) 10Dzahn: [C: 032] "it's the one from https://office.wikimedia.org/w/index.php?title=User:Jalexander&oldid=114435" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143374 (owner: 10Jalexander) [18:56:49] thanks mutante [19:00:35] jamesofur: it was replaced on bast1001 [19:00:56] perfect, and able to log in and confirm [19:01:02] :) [19:01:37] yeah ori, facter config with blocks not working for me... hmph [19:01:47] it works with other facts just fine [19:01:52] (03PS1) 10RobH: adding Dan Garry (deskana) to bastion and statistics-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143376 [19:01:58] even this [19:02:08] https://gist.github.com/ottomata/d51356d1a502282dd38a [19:02:10] doesn't work [19:03:55] hmm, maybe i'm doing it wrong [19:04:27] (03PS1) 10Ori.livneh: phabricator: use apache::site, not apache::vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/143378 [19:04:46] chasemp: ^ [19:04:51] ottomata: what doesn't work? [19:04:56] i.e., what are you seeing? [19:05:43] ottomata: and re: not working in vagrant, perhaps it's a puppet2/3 thing? 
[19:05:57] greg-g: https://bugzilla.wikimedia.org/show_bug.cgi?id=67243#c10 will need wmf backports asap [19:06:01] ori: ah, the regsubst I moved to the template as I need it to allow https as well, I will integrate this into what I've got for trusty so far [19:06:04] ja mabye...i think its not working in labs either... [19:06:25] greg-g: i'm not filing in a swat because i can't guarantee that i'll be there at the time, and someone is just going to ignore the patches then [19:06:41] MatmaRex: i can be on the hook [19:06:41] i'm just going to throw this here and hope someone picks up… [19:07:09] MatmaRex: i don't have context tho, what do you mean by 'asap'? is it fixing something that is currently broken? [19:07:28] chasemp: i can update the patch for that if you like (or just hand it off to you to do with as you please, whatever you prefer) [19:07:43] ori: yes, jquery.ui dialogs disappear because of a Blink rendering bug http://imgur.com/F9xFj1x,eoXHwJK [19:07:54] ori: I think I will just steal what you've got and include it w/ my 'trusty friendly phabricator' patch I've already got some stuff [19:08:05] chasemp: cool, totally fine by me [19:08:06] ori: the patch works around that bug, and generally uses a better way of accomplishing the thing [19:08:07] thanks for doing that, it was on my list today so all good [19:08:08] (03CR) 10Dzahn: [C: 031] "key matches prod key from https://office.wikimedia.org/w/index.php?title=User:DGarry_%28WMF%29&oldid=114197 , UID matches ldap, groups see" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143376 (owner: 10RobH) [19:08:31] hmm, yeah maybe so ori, I also get 'Could not retrieve cdh_version: uninitialized constant Facter::Core' on vagrant if I try to use that instead of just setcode "blabla" [19:08:32] hm [19:08:45] (03CR) 10Ori.livneh: [C: 04-2] "Chase will integrate the bits he likes from this patch to another patch he already has in progress." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143378 (owner: 10Ori.livneh) [19:09:06] ori here is a question, some stuff for perms (syntax) changed from apache 2.2 to 2.4 [19:09:15] yeah, require all granted [19:09:20] I have template logic that checks for trusty vs precise [19:09:27] but that's inferring the new apache and kind of crap [19:09:29] better way? [19:09:36] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data exceeded the critical threshold [500.0] [19:10:04] well, one of the decisions we made with the new apache module is to load mod_authz_compat (or whatever it's called) which allows the old-style allow from .. deny from .. directives to be used [19:10:23] wasn't doing it for me w/ the existing vhost stuff [19:10:29] maybe I needed apache::site? [19:10:43] i.e. perms failed w/ old style for me before I did the template version dance [19:10:51] when was this? [19:11:00] greg-g: unless you object i'll just sync MatmaRex's patch now [19:11:09] ori: I guess last week? middle of [19:11:30] ori: I don't think I object, if you reviewed the changes [19:11:40] greg-g: just doing that now [19:11:48] * greg-g nods [19:12:12] (03CR) 10RobH: [C: 032] "I've matched the same user rights as the other project managers doing similar roles and needing similar research rights.
This should be g" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143376 (owner: 10RobH) [19:12:20] ori: I'll circle back with you since you're busy :) it works now but could be better so can wait [19:12:20] yep [19:12:36] chasemp: thanks, this should be quick, sorry [19:14:19] !log ori Synchronized php-1.24wmf11/resources/src/jquery.ui-themes/vector/jquery.ui.core.css: Ib09928248: vector/jquery.ui.core.css: Update rule for .ui-helper-hidden-accessible (bug 67243) (duration: 00m 06s) [19:14:25] Logged the message, Master [19:14:26] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:14:34] !log ori Synchronized php-1.24wmf10/resources/src/jquery.ui-themes/vector/jquery.ui.core.css: Ib09928248: vector/jquery.ui.core.css: Update rule for .ui-helper-hidden-accessible (bug 67243) (duration: 00m 05s) [19:14:38] Logged the message, Master [19:14:46] ^ MatmaRex [19:15:27] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2623.46028763 [19:16:32] Do minor upgrades of packages (e.g. swift or nova) generally get scheduled on the deployment calendar? Or is that strictly for mediawiki? [19:16:36] greg-g ^ ? [19:16:36] ori: whee. thanks, verified fixed :) [19:17:09] chasemp: the only other thing i can think of is to use guards, but I think doing the conditionals in puppet/erb is nicer [19:17:21] andrewbogott: false dichotomy, but... [19:17:43] greg-g: but... [19:17:49] well, [19:18:04] andrewbogott: swift/nova aren't just "packages" they're services, no? will an upgrade require downtime? Will it require rolling-updates that imply lessened redundancy? [19:18:07] whut? :) [19:18:14] mutante: i'm just playing along [19:18:21] deployment calendar isn't "strictly" mediawiki in any sense of either word :) [19:18:57] greg: It should not result in downtime, but I'm going to schedule and warn labs-l just in case. 
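The trusty-vs-precise "template version dance" chasemp describes above would, in an ERB vhost template, look roughly like the following. This is a hypothetical fragment, not the actual patch: `@lsbdistcodename` is the standard Puppet fact, and Ruby's stdlib ERB renders it for a quick check:

```ruby
require 'erb'

# Hypothetical vhost fragment illustrating the trusty/precise branch:
# Apache 2.4 (trusty) wants "Require all granted", while Apache 2.2
# (precise) uses the old Order/Allow form.
TEMPLATE = <<~'ERB'
  <Directory /srv/app>
  <% if @lsbdistcodename == 'trusty' -%>
    Require all granted
  <% else -%>
    Order allow,deny
    Allow from all
  <% end -%>
  </Directory>
ERB

# Minimal stand-in for the scope a Puppet ERB template is evaluated in.
class FactScope
  def initialize(codename)
    @lsbdistcodename = codename
  end

  def render
    ERB.new(TEMPLATE, trim_mode: '-').result(binding)
  end
end

puts FactScope.new('trusty').render   # emits the 2.4-style directive
puts FactScope.new('precise').render  # emits the 2.2-style directives
```

As ori notes, loading Apache's compat module so the old `Order`/`Allow` directives keep working on 2.4 would avoid the branch entirely.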
[19:19:29] andrewbogott: see also: upgrades to the ElasticSearch cluster that manybubbles/^d do, those are scheduled because it's a rolling-upgrade that implies lessened availability/hard to rollback/etc. [19:19:47] andrewbogott: if you need to warn labs-l, then you need to schedule, is a good rule of thumb ;) [19:21:22] greg-g: hm, ok. [19:21:36] * andrewbogott tries to remember how to add something to the deployment calendar [19:21:42] andrewbogott: which also implies >= a week's notice [19:21:56] andrewbogott: lazy way: email me by Thursday the week before :) [19:22:33] ja ori i'm checking in labs with facter version 1.7.5 [19:22:42] the code for blocks passed to confine is not there [19:23:47] ori: how bout... I will get what i have together, and we can improve it as we go :) [19:23:47] ottomata: doesn't puppet 3 require facter 2+? [19:24:00] chasemp: +1 [19:24:10] greg-g: so i can't mail you today for an upgrade on Monday the 7th? Does that fall into the < a week rule, or the before thursday rule? [19:24:20] root@hadoop-d-master0:/usr/lib/ruby/vendor_ruby/facter/util# puppet --version [19:24:20] 3.4.3 [19:24:20] root@hadoop-d-master0:/usr/lib/ruby/vendor_ruby/facter/util# facter --version [19:24:20] 1.7.5 [19:24:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [19:24:34] http://apt.wikimedia.org/wikimedia/pool/main/f/facter/ [19:24:48] guess not? [19:25:04] andrewbogott: that's fine, "week before" meaning mostly "by thursday the week before the week of", if that makes sense [19:25:14] 'k [19:25:19] _joe_: any insight on that? [19:25:50] andrewbogott: ">= week before" implied a level of precision I didn't mean :) [19:25:57] ottomata done with meetings? [19:26:49] yes! [19:26:51] hi dogeydogey [19:26:57] so, jaj, what's your q? [19:27:48] ottomata is there an overview of the wiki-ops deployment process somewhere? [19:27:56] for debs? no, probably not [19:27:56] or someone willing to explain it?
[19:28:04] but, i can give you brief overview [19:28:16] let's go with something like logster, that didn't have a deb for it before i built one [19:28:28] i chatted with folks, said I wanted it for stuff [19:28:34] eventually got sortof some agreement [19:28:48] so, I went and created a debian/ dir using git buildpackage (and those instructions I pasted before) [19:28:52] and submitted that for review to gerrit [19:29:02] then that went through some rounds of review and changes [19:29:09] then eventually it was approved [19:29:11] and merged [19:29:18] once approved and merged, I built an official deb [19:29:27] and copied it to the apt server [19:29:33] and used reprepro commands to add it to our apt repo [19:29:47] then, I wrote puppet manifests to install and configure the deb [19:30:41] +1 to ottomata's process for a deb, same here :) [19:30:59] dogeydogey: https://gerrit.wikimedia.org/r/#/c/95556 [19:31:03] ottomata so everything is deployed using puppet? [19:31:06] yes [19:31:19] well, not *everything* but pretty much all ops infrastructure is [19:31:31] for wmf apps (like mediawiki, etc.) those have their own deploy process [19:31:34] using other tools [19:32:18] ottomata what about pushing to a staging infrastructure then pushing to prod? [19:32:31] lol, staging infrastructure [19:32:33] or deploying to staging first with puppet [19:32:35] :) [19:32:45] oh i see what you did there [19:32:49] dogeydogey: i think you mean labs [19:32:57] i develop the deb using a local vagrant VM [19:33:00] and also develop puppet there [19:33:06] once I get something working, I often test in labs [19:33:17] there once was a labs project for building [19:33:39] so when you have the approved deb copied to the apt server and the puppet manifests written up, you apply it to the labs, and if it works send it off to prod? [19:34:35] hm, naw, i would do the labs stuff before its approved and in apt [19:34:41] ottomata also labs is only like a small miniature environment no?
what about potential impacts of the change on other things not setup in the labs enviro? [19:34:42] either I would just dpkg -i the deb in labs [19:34:50] OR, there is a way to put a .deb in a special dir in labs [19:34:55] do we have a doc on using vagrant to test puppet changes for a given type of host? [19:34:56] i think labs instances have a local apt repo configured or something... [19:35:11] dogeydogey: labs is ad hoc VMs [19:35:17] you create and destroy instances as you see fit [19:35:23] 100% isolated from production [19:35:23] okay thanks, i think i got a better idea now :) [19:35:27] ottomata https://gerrit.wikimedia.org/r/#/c/142479/ -- I made changes to this but it won't push now, any idea what's going on? [19:35:31] bblack jgage wrote up something on the testing process I believe.....wiki'ed somewhere [19:35:53] bblack, no i doubt it, and def not for a 'given type of host' (not really sure what that means) [19:36:06] but, mediawiki-vagrant is pretty modular [19:36:12] you can write a role class to use modules [19:36:13] and then do [19:36:19] because I still more-or-less just try to get it right and hope jenkins catches stupid errors (unless I think the fallout could really kill something) [19:36:19] vagrant enable-role [19:36:21] vagrant provision [19:36:22] blabla [19:36:36] yeah, just use vagrant, you'll find it pretty easy [19:36:42] 'given type of host' I meant like a text cache or an lvs node [19:36:50] well, vagrant doesn't have the ops/puppet repo [19:36:53] we gave up on beta? [19:36:55] (hence my submodule pushes :p) [19:37:07] and also hence my patches to try and make this easier [19:37:11] why do we need submodules to use vagrant? [19:37:41] (03CR) 10Ottomata: "Talked with Mark about this last week. I need to write up docs with my intentions and a plan forward."
[operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/133695 (owner: 10Ottomata) [19:37:57] bblack, you don't, but if you want to share them between ops/puppet and vagrant, you do [19:38:06] if you want to just hack somethign locally that only you can use [19:38:15] you can manually symlink or copy directories around locally [19:38:30] but submodules let you just develop a generic module and share between multiple environments/repos [19:38:55] I guess I should look into vagrant more, because that doesn't sound like what I expect it to sound like :) [19:39:08] whatcha mean? [19:39:26] ori's mediawiki-vagrant has its own puppet manifests [19:39:30] you can use puppetmaster::self in actual labs too [19:39:33] not the operations/puppet repository [19:39:34] "share them between ops/puppet and vagrant"? I thought this could just provision stuff from an unmerged ops/puppet and test it? [19:39:46] naw, not in vagrant bblack [19:39:50] you can do that in labs [19:40:06] ottomata any idea on why I can't git review changes to this anymore? https://gerrit.wikimedia.org/r/#/c/142479 [19:40:09] kinda, if there's an applicably-roled host and the labs differences themselves don't matter [19:40:42] dogeydogey: it's already merged, you can still add comments though [19:40:54] I think since I haven't tried vagrant, I've been pinning hopes on it that it's the tool I wish it was :) [19:41:08] _joe_: updated apache::conf, apache::def [19:42:44] what I want is "puppet-tester cp4011.ulsfo.wmnet ", which spawns a little vm that thinks it's cp4011.ulsfo.wmnet and puppets itself using the pending unmerged changes, or something. [19:42:53] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:42:53] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:42:55] <_joe_> ori: if I'm needed to merge, I'll do that tomorrow morning [19:43:13] _joe_: just +1 please [19:43:28] bblack, hm, _joe_ has something kinda like that [19:43:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [19:43:34] the puppet catalog differ [19:43:38] _joe_: or not, if you're not up for it, 'sokay [19:43:39] that works in a vagrant instance [19:43:43] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 834 seconds ago with 0 failures [19:43:43] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [19:44:03] yeah I haven't tried that catalog differ yet. it sounds like it might be very useful though :) [19:44:07] <_joe_> ori: in a few mins [19:44:19] <_joe_> bblack: in a few more mins, I'll show you [19:44:27] poor _joe_ :) [19:44:39] <_joe_> although I guess it needs some maintenance post-varnish-submodule [19:46:53] _joe_, you are going to get bblack into vagrant and then he is going to see the submodule light and is going to move the varnish module back himself :p :p [19:47:17] somehow I doubt that [19:47:25] haha, actually, that's not true, catalog differ is hard to use with submodules :/ [19:47:26] <_joe_> I doubt that too [19:47:32] unless vagrant prints $100 bills every time I execute it [19:47:35] them maybe [19:47:36] haha [19:48:07] (03CR) 10Scottlee: "Whew, looks like someone else fixed it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [19:48:19] ottomata: to be totally honest, i am kinda exasperated with submodules too. 
i like code-sharing, but a README in the module dir that explains where the module is from might be better [19:48:38] ottomata: we could even have a script that updates shared modules in mediawiki-vagrant [19:48:51] by fetching the latest module dir from operations/puppet [19:49:08] it's annoying, but that way it's only annoying to one or two people rather than everyone always [19:49:23] the latest module dir? [19:50:06] (03CR) 10Dzahn: "Scottlee, yes, in https://gerrit.wikimedia.org/r/#/c/143172/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [19:50:22] yes, so, there'd be a small rake task for mwv that fetches operations/puppet and compares mediawiki/vagrant.git:puppet/modules/$modulename to operations/puppet.git:modules/$modulename [19:50:30] (03PS1) 10Aude: adjust $wgPropertySuggesterMinProbability setting for wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143390 [19:50:33] if the latter is different, it copies the changes and commits the result [19:51:07] greg-g: is anyone deplouing now?
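The module-sync task ori sketches above could look roughly like this — a hypothetical Ruby sketch, not an actual mediawiki-vagrant task; the checkout paths and the final `git commit` step are assumptions:

```ruby
require 'fileutils'
require 'digest'

# Sketch of ori's idea: compare a shared module between an
# operations/puppet checkout and a mediawiki-vagrant checkout,
# copying it over when the trees differ.

# Fingerprint a directory tree by hashing each file's relative path
# and contents, in sorted order for determinism.
def tree_digest(dir)
  digest = Digest::SHA1.new
  Dir.glob(File.join(dir, '**', '*')).sort.each do |path|
    next unless File.file?(path)
    digest << path.sub(dir, '') << File.read(path)
  end
  digest.hexdigest
end

def sync_module(name, ops_puppet:, vagrant:)
  src = File.join(ops_puppet, 'modules', name)
  dst = File.join(vagrant, 'puppet', 'modules', name)
  return :unchanged if File.directory?(dst) && tree_digest(src) == tree_digest(dst)

  FileUtils.rm_rf(dst)
  FileUtils.mkdir_p(File.dirname(dst))
  FileUtils.cp_r(src, dst)
  # a real task would now `git add`/`git commit` in the vagrant repo
  :updated
end
```

This only covers the one-way copy; the submodule approach it would replace also carries history, which a plain copy does not.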
[19:51:21] deploy* [19:51:27] ottomata: the way i see it, there are two use-cases for submodules: (1) to facilitate third-party reuse and contributions, and (2) to facilitate code-sharing across mediawiki-vagrant and operations/puppet [19:51:28] not that I know of [19:51:30] aude: ^ [19:51:33] i don't have a good solution for (1) [19:51:33] ok [19:51:42] based on feedback we want to adjust a setting [19:51:52] but for (2), i think the update thing could work [19:51:54] shall be quick :) [19:52:10] (03CR) 10Aude: [C: 032] adjust $wgPropertySuggesterMinProbability setting for wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143390 (owner: 10Aude) [19:52:16] (03Merged) 10jenkins-bot: adjust $wgPropertySuggesterMinProbability setting for wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143390 (owner: 10Aude) [19:53:05] ahh, ok [19:53:25] !log aude Synchronized wmf-config/Wikibase.php: adjust property suggester setting for wikidata (duration: 00m 11s) [19:53:27] done [19:53:30] Logged the message, Master [19:53:42] * aude rather not wait for swat this time [19:59:25] (03PS1) 10Ottomata: Remove versions.rb facts - this caused every node to print out dpkg-query warnings [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143392 [19:59:46] (03CR) 10Ottomata: [C: 032 V: 032] Remove versions.rb facts - this caused every node to print out dpkg-query warnings [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143392 (owner: 10Ottomata) [20:01:04] "Aufgrund von Serverproblemen ist das Fehler-Wiki zur Zeit nicht verfügbar. 
" [20:01:08] that's so meta:) [20:01:19] "because of server problems, error-wiki isn't available" [20:01:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [20:05:20] Reedy: I didn't need a backport, just wanted a merge, but now that you asked, it would be good if we could backport https://gerrit.wikimedia.org/r/#/c/143243/ so I can start running the script [20:06:56] legoktm: That change is horrible [20:07:00] I didn't backport anything yet... [20:07:08] if ( PHP_SAPI !== 'cli' ) [20:07:10] really [20:07:18] Just 11? or 10 too? [20:07:45] hoo: otherwise I'd have to pass the it as an argument through like 4 functions [20:07:59] Reedy: just 11 should be fine [20:08:12] Yeah... [20:08:16] I don't agree with that change either [20:09:32] ok [20:09:35] We have code like that in core [20:09:48] quite a lot [20:09:53] WP:OTHERSTUFF [20:09:54] I mean, is there ever a circumstance where we run a maint script and it *should* go to RC? [20:09:57] that doesn't really justify doing it again [20:10:20] legoktm: That's the wrong question [20:10:50] WP:OSE [20:11:16] ok, /me fixes [20:11:27] aude: :D didn't know that one [20:11:51] heh [20:14:08] <_joe_> ori: on your changes [20:14:24] on my changes [20:15:13] blah [20:15:18] this script isn't using the Maint class [20:15:48] (03PS1) 10Ottomata: Update cdh module with removal of custom versions facts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143398 [20:16:09] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with removal of custom versions facts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143398 (owner: 10Ottomata) [20:18:09] what's going on with 5xx? 
[20:18:53] I see traffic spikes on pdf servers and text caches too, donno if related [20:19:07] (03PS1) 10QChris: Make dbstore1002 handle s5 analytics queries [operations/dns] - 10https://gerrit.wikimedia.org/r/143399 (https://bugzilla.wikimedia.org/66068) [20:19:22] (03PS5) 10Giuseppe Lavagetto: add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:19:25] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-4hours&from=-4hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [20:19:33] (03CR) 10Giuseppe Lavagetto: [C: 031] add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:21:05] _joe_: thank you [20:21:25] (03CR) 10Ori.livneh: [C: 032] add apache::def resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:24:07] where did the 5xx tsv log end up? I've lost track of log hosts :) [20:25:29] if you mean 5xx that are the result of mediawiki fatals or exceptions, it's fluorine:/a/mw-log/{fatal,exception}.log [20:25:31] bblack: ^ [20:25:39] does flourine still exist? [20:25:44] does it ever [20:25:47] no fenari, fluorine [20:25:54] they're also plotted here: http://ur1.ca/edq1f and they don't seem high [20:26:10] I mean the machine "flourine" isn't even in our dns [20:26:19] fluorine, sorry [20:26:20] uo, not ou [20:26:29] can we just alias that so it works? :P [20:26:41] actually not sorry, your typo not mine! 
:P [20:26:52] * ori retracts apology [20:26:53] yeah [20:27:08] that's why I couldn't find the host after I looked on wikitech, I kept getting it backwards on the commandline :) [20:27:51] !log Adding cache warmers to all Cirrus indexes for group1 wikis with more then one shard except commons (commons is busy, it'll have to wait:) [20:27:55] Logged the message, Master [20:28:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Just found a minor issue; if resolved, count me as +1" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [20:29:06] <_joe_> ori: sorry, did not catch this in the first iteration. [20:29:15] seems like a lot of [2014-07-01 20:24:09] Fatal error: Maximum execution time of 180 seconds exceeded for lots of different things in the fatal log [20:29:19] but that's normal [20:30:07] What replaced apache-graceful? [20:30:36] <_joe_> bblack: 5xx from varnishes is on oxygen [20:30:50] <_joe_> errors from backends on fluorine [20:30:57] Can someone restart apache on m1217 as it's apparently spamming apc errors [20:31:50] !log restarting apache on mw1217 [20:31:54] Logged the message, Master [20:32:19] Jul 1 20:31:41 mw1217 kernel: [40380516.085263] apache2[23758]: segfault at 7f401aac1d40 ip 00007f401aac1d40 sp 00007f4007e6ee08 error 14 in mod_filter.so[7f401f821000+3000] [20:32:33] Nice [20:32:34] apache2 segfault? [20:33:18] I can't even reach oxygen (even ping) from iron [20:33:42] i can [20:33:49] mutante: maybe someone's found a new zero-day to try to exploit [20:33:49] _joe_: thanks for the reviews! [20:34:07] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:10] mutante: you can ping oxygen.eqiad.wmnet from iron? 
[20:34:31] bblack: i can ssh to oxygen from iron [20:34:46] oh it's the public you're hitting [20:34:47] bblack: oxygen.wikimedia.org [20:34:59] bblack: yea, i just did "oxygen" [20:35:22] oxygen.eqiad.wmnet is in DNS with a 10.64 addr that totally doesn't exist on the host [20:36:15] !log reedy Synchronized php-1.24wmf11/extensions/WikimediaMessages: bug 67387 (duration: 00m 15s) [20:36:19] Logged the message, Master [20:37:30] bblack: confirmed that [20:38:01] <_joe_> mmmh how's that even possible? [20:38:04] there is an eth1 though [20:38:12] that is DOWN [20:38:16] cable unplugged? [20:38:25] hmm [20:38:28] ip link show deve eth1 [20:38:41] yay Reedy! [20:38:53] mutante: it's not configured in /etc/network/interfaces, though [20:40:01] <_joe_> bblack: oxygen.wikimedia.org [20:40:38] i wonder if it ever had that IP [20:41:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [20:41:13] do we think it should have the second interface configured or not? [20:41:24] I think not [20:41:29] ok [20:41:42] then just remnant or error in DNS [20:41:51] I'm still digging, but git blame isn't very useful. why are there so many commits with the author root@wikimedia.org in the dns repo, that don't show up in gerrit? :P [20:42:07] :p svn history [20:42:13] github! [20:42:13] Reedy: so...backport https://gerrit.wikimedia.org/r/#/c/143419/ ? 
:) [20:42:14] heh [20:42:31] erg, it's not a straight cherry-pick [20:42:34] * legoktm does [20:44:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0] [20:44:21] (03PS1) 10Dzahn: remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 [20:45:21] Reedy: https://gerrit.wikimedia.org/r/#/c/143473/ [20:47:44] (03PS1) 10Dzahn: re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 [20:48:31] (03CR) 10jenkins-bot: [V: 04-1] re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 (owner: 10Dzahn) [20:49:06] (03PS2) 10Dzahn: re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 [20:50:20] (03PS3) 10Dzahn: re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 [20:52:06] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:53:39] (03PS1) 10Ottomata: Use CDH5 for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [20:56:57] (03PS2) 10Ottomata: Use CDH5 Hive for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [20:59:08] Reedy: er, are you going to deploy it now or should I add it to the SWAT deploy? [21:00:04] bsitu, spagewmf: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T2100) [21:00:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [21:04:21] _joe_: ok, so, your choice: [21:05:23] 1) add ignore => '[a-z]*' to the *-available file resources. 
apache::site / apache::conf -provisioned files all have numeric prefixes, so this would exempt package-provided conf files while retaining automatic cleanup of unprovisioned confs [21:06:31] 2) add ignore => '{security,other-vhosts-access-log}*', which would retain the only two that have (a) actual configs (most of the rest are just comments), (b) which we would plausibly want [21:07:12] 3) (my preferred option, and consistent with faidon's preference for sysctl.d / rsync.d) have the apache module provision a default apache::conf file that contains the following directives: [21:07:36] 4) replace apache with nginx [21:07:43] ServerTokens Prod, TraceEnable Off, CustomLog ${APACHE_LOG_DIR}/other_vhosts_access.log vhost_combined [21:08:13] TraceEnable Off comes from security.conf, CustomLog ... comes from other-vhosts-access-log.conf [21:08:28] ServerTokens is actually set to 'OS' by security.conf, but we don't want that [21:09:06] <_joe_> ori: ok, my only concern is, using 3) we will have to track the package for changes [21:09:28] <_joe_> but that is reasonable [21:09:51] the comments in conf-enabled/* suggest to me that the content of those files is "here are some things to think about" rather than "here are sensible defaults" (those go in apache2.conf) [21:10:03] that's my interpretation of the fact that, for example, charset.conf is entirely commented out [21:10:09] <_joe_> you know that, if not given tight constraints, I'll go with 4) :P [21:10:34] i'm with you on that [21:10:46] <_joe_> ori: yes it usually is. In I think hardy or lucid they added one file there to mitigate a CVE, but we track those anyway [21:10:50] i found out the other day that the version of apache in trusty doesn't support unix domain sockets :( it's one minor revision behind [21:11:08] mutante: if you got a unitedlayer access list audit email, you can ignore it as I am handling it. 
[21:11:18] you are listed as contact type notification on it, so i assume you may get it [21:11:21] _joe_: cool, so it's settled [21:11:46] _joe_: it seems good to set a generic default ServerAdmin too while we're at it which is why i was asking about {noc,root,webmaster}@wikimedia.org [21:12:14] <_joe_> ori: then webmaster@ I'd say [21:16:05] (03PS4) 10Jforrester: Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120171 [21:16:58] sounds good [21:19:14] (03CR) 10BBlack: [C: 031] remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 (owner: 10Dzahn) [21:22:11] eqiad-esams issue is fixed fwiw [21:28:50] (03PS4) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [21:29:36] (03PS5) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [21:29:38] How long should it take for an extension change to get pushed to the deployment-prep apaches? [21:31:16] hashar, do you know? [21:32:41] Krenair: roughly 15 minutes max iirc [21:33:02] Krenair: https://integration.wikimedia.org/dashboard/ list the jenkins jobs that updates beta [21:33:34] Krenair: once merged, Gerrit update the extension in mediawiki/extensions.git . That repo is pulled every 10 minutes or so [21:33:41] Krenair: then we run scap to deploy [21:33:48] it is being run right now [21:33:55] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/11741/console [21:34:24] Krenair: scap job triggered by an upstream job that pulled the extensions https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/13962/consoleFull [21:35:03] Krenair: see also https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated ! 
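[editor's note] The default conf file ori proposes as option 3, with the ServerAdmin that _joe_ agreed on, might look like the following. The file name and numeric prefix are hypothetical; the directives are the ones quoted in the discussion above.

```apache
# 00-defaults.conf -- hypothetical name; would be provisioned via apache::conf
# with a numeric prefix so it survives the purge of package-provided confs.
# Replaces the useful parts of the packaged security.conf and
# other-vhosts-access-log.conf snippets.
ServerTokens Prod
TraceEnable Off
ServerAdmin webmaster@wikimedia.org
CustomLog ${APACHE_LOG_DIR}/other_vhosts_access.log vhost_combined
```

One caveat, visible later in this same log: ${APACHE_LOG_DIR} is only defined (via envvars) by newer Apache packages, so on lucid-era hosts the literal path /var/log/apache2 has to be used instead.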
[21:35:04] I am off [21:35:07] Okay [21:35:09] Thank you hashar [21:35:36] (03CR) 10Ori.livneh: [C: 032] Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 (owner: 10Ori.livneh) [21:36:20] Krenair: feel free to talk about your finding on one of the lists hehe [21:36:45] Krenair: the more people knows about how beta work, the more they will use it to track bugs / tests [21:37:56] sleeps [21:39:13] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 5 failures [21:39:27] uh oh [21:39:33] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 2 failures [21:39:34] !log Set email for re-renamed dewiki account "Kolimak". Email and password got lost during a screwed rename. [21:39:39] Logged the message, Master [21:39:52] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 6 failures [21:39:52] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 4 failures [21:39:53] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 7 failures [21:40:02] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [21:40:12] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [21:40:12] PROBLEM - puppet last run on analytics1012 is CRITICAL: CRITICAL: Puppet has 7 failures [21:40:12] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 4 failures [21:40:23] aww man [21:40:28] icinga yer such a downer. [21:40:32] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 4 failures [21:40:32] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 1 failures [21:40:51] probably me [21:40:52] checking [21:40:52] PROBLEM - HTTP on nickel is CRITICAL: Connection refused [21:41:04] i was just looking through recent merge emails, heh [21:41:12] is the "puppet has X failures" new? 
[21:41:24] no, not me [21:41:29] Error: /Stage[main]/Admin/Admin::Hashuser[springle]/Admin::User[springle]/File[/home/springle]: Failed to generate additional resources using 'eval_generate': Connection reset by peer - SSL_connect [21:41:34] it was discussed that it was coming, so im guessing someone merged it sometime recently [21:41:41] robh: cool [21:41:43] its good that its happening though, means we check those systems [21:41:54] but now i feel obligated to check them! [21:41:56] boooooooo [21:41:57] hissssss [21:42:08] hm, no, wait [21:43:09] my change did trigger a refresh of the apache service on palladium, so that's a potential vector [21:43:42] but it didn't purge any files or enable new configs [21:43:48] so don't rest easy thinking this is figured out [21:43:52] PROBLEM - HTTP on aluminium is CRITICAL: Connection refused [21:45:13] ah yes, apache failed to start on aluminium [21:46:42] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:47:52] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:47:52] RECOVERY - HTTP on aluminium is OK: HTTP OK: HTTP/1.1 302 Found - 557 bytes in 0.001 second response time [21:47:54] ok, should all recover shortly [21:48:25] aluminum is on lucid and its version of apache doesn't set ${APACHE_LOG_DIR} [21:48:28] well, i ran puppet on cp3019 [21:48:34] cuz it said 4 failures, but ran without incident [21:48:40] so thats odd [21:49:04] i'm going to leave the next failure to see if it auto clears on systems next automated run [21:49:06] im curious [21:49:34] (the cp3X range ones are what i'll look at) [21:50:02] someone investigating why nickel/ganglia is down? 
[21:51:00] i will, since it's probably related to my change [21:51:31] it cant start service apache [21:51:36] same reason i think [21:51:41] (03PS1) 10Ori.livneh: Fix-up for I5bf6186d7: don't reference ${APACHE_LOG_DIR}; unavailable on lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/143490 [21:51:51] fixed by ^^ [21:51:52] yes nickel is lucid [21:52:44] (03CR) 10RobH: [C: 031] Fix-up for I5bf6186d7: don't reference ${APACHE_LOG_DIR}; unavailable on lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/143490 (owner: 10Ori.livneh) [21:52:52] RECOVERY - HTTP on nickel is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.002 second response time [21:52:52] thanks, apache is back up on that host [21:52:57] sorry bout that [21:53:07] (03CR) 10Ori.livneh: [C: 032] Fix-up for I5bf6186d7: don't reference ${APACHE_LOG_DIR}; unavailable on lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/143490 (owner: 10Ori.livneh) [21:53:31] paravoid: i'm responding to the guys at orange. is tech-emergency@wikimedia.org the correct email address for outage reporting (non-Wikipedia Zero) or is there a better one? i'm going to tell them to email wikipediazero@wikimedia.org for W0 stuff, but wanted to convey the correct contact info for other stuff. 
[21:53:50] dr0ptp4kt: it reached us, so I guess it's fine [21:54:05] paravoid: cool [21:54:10] english, though :) [21:54:12] PROBLEM - puppet last run on nickel is CRITICAL: CRITICAL: Puppet has 1 failures [21:54:22] or next time I'm replying in Greek :P [21:57:12] RECOVERY - puppet last run on analytics1012 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:57:12] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:57:22] PROBLEM - puppet last run on search1004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:57:32] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:57:32] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [21:57:52] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [21:57:52] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:58:18] paravoid: ha! 
[21:58:32] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:58:32] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:58:33] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [21:59:42] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:59:55] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [22:00:05] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:00:15] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:00:15] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:01:46] all of these were related to ${APACHE_LOG_DIR} causing aluminum's apache to fail to start [22:02:05] i'm forcing puppet runs on the remaining ones so the alerts don't linger [22:02:36] (03PS1) 10Yurik: Added a comment to remove dup config setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143497 [22:03:17] (03CR) 10Dzahn: [C: 032] remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 (owner: 10Dzahn) [22:04:35] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:04:48] (03PS2) 10Dzahn: remove oxygen.eqiad.wmnet, doesn't exist [operations/dns] - 10https://gerrit.wikimedia.org/r/143472 [22:05:15] RECOVERY - puppet last run on search1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:05:35] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:06:18] (03CR) 10Dzahn: [C: 032] 
re-activate addWiki function in maintenance [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/143475 (owner: 10Dzahn) [22:06:55] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:06:56] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Complete puppet failure [22:08:36] andrewbogott: ^ success [22:08:56] that's the intended puppet breakage one [22:09:00] mutante: sort of… it has a race so reporting is inconsistent. [22:09:08] ooh.. i see [22:10:22] andrewbogott: it just made me think of another way.. there are generic log checks for icinga.. so we can already check for any string in any log... [22:10:30] we could just check "Error 400 on SERVER" [22:10:39] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:12:01] mutante: how would it know when things were fixed? Like, does it grep the log or just notice changes? [22:15:35] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:18] andrewbogott: one can define a "recovery pattern" (say, "Finished catalog run") [22:16:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.074 second response time [22:16:30] it normally only checks all lines that have been added since last run [22:16:40] talking about http://labs.consol.de/nagios/check_logfiles/ [22:16:54] which is a bit more advanced than just grep [22:16:55] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:17:25] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [22:18:09] mutante: that might have potential [22:18:22] mws still getting errors? 
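[editor's note] The check_logfiles approach mutante describes — a critical pattern plus a recovery pattern, scanning only lines appended since the previous run — is configured with a small Perl-syntax config file. A sketch along the lines of the puppet-failure example discussed above (tag, log path, and seekfile directory are hypothetical):

```perl
# /etc/check_logfiles/puppet-agent.cfg -- hypothetical path and tag.
# check_logfiles keeps a per-log seek file here, so each invocation
# only examines lines appended since the last run.
$seekfilesdir = '/var/tmp/check_logfiles';

@searches = (
    {
      tag              => 'puppet_agent',
      logfile          => '/var/log/syslog',
      criticalpatterns => [ 'Error 400 on SERVER' ],
      okpatterns       => [ 'Finished catalog run' ],
    },
);
```

A line matching okpatterns clears a previously raised criticalpatterns state, which is how the check answers andrewbogott's question of "how would it know when things were fixed".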
[22:18:24] I'm not sure why akosiaris approached it the way he did [22:18:50] andrewbogott: It's all greek to me [22:19:00] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:19:00] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:19:30] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 7 failures [22:19:36] mutante: as you see, virt1009 is now reporting recovery but nothing has changed [22:19:55] hmm.. yea. i see [22:20:00] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 17 failures [22:20:10] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 12 failures [22:20:10] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [22:20:20] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Puppet has 13 failures [22:20:20] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Puppet has 8 failures [22:20:30] PROBLEM - puppet last run on analytics1009 is CRITICAL: CRITICAL: Puppet has 24 failures [22:20:30] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Puppet has 7 failures [22:20:30] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Puppet has 5 failures [22:20:30] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 29 failures [22:20:30] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 14 failures [22:20:40] who has most failures wins [22:20:40] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 28 failures [22:20:42] 29 failures, new record [22:20:50] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Puppet has 8 failures [22:20:53] :) [22:21:00] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:00] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 4 failures [22:21:00] PROBLEM - puppet last run on db1005 is 
CRITICAL: CRITICAL: Puppet has 13 failures [22:21:00] PROBLEM - puppet last run on zinc is CRITICAL: CRITICAL: Puppet has 6 failures [22:21:00] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Puppet has 1 failures [22:21:10] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 4 failures [22:21:10] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:10] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Puppet has 6 failures [22:21:10] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 2 failures [22:21:20] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 15 failures [22:21:20] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 7 failures [22:21:20] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: Puppet has 9 failures [22:21:21] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 26 failures [22:21:21] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:21:31] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:40] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 2 failures [22:21:50] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:00] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:10] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:10] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 4 failures [22:22:20] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 3 failures [22:22:20] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Puppet has 1 failures [22:23:00] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:23:00] PROBLEM - puppet last 
run on virt1009 is CRITICAL: CRITICAL: Complete puppet failure [22:25:24] um… ok, what changed? anything? [22:30:51] (03PS1) 10RobH: sam smith to deployers group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143500 [22:31:12] * greg-g looks up [22:31:40] ah right [22:32:43] someone +1 my change so im not being all self reviewy [22:32:46] =] [22:32:57] * robh is only going to wait a few minutes before he does it anyhow [22:33:10] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:33:24] (03CR) 10Andrew Bogott: [C: 031] sam smith to deployers group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143500 (owner: 10RobH) [22:33:36] andrewbogott: thx =] [22:34:00] (03CR) 10RobH: [C: 032] sam smith to deployers group [operations/puppet] - 10https://gerrit.wikimedia.org/r/143500 (owner: 10RobH) [22:35:00] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [22:35:18] ..... [22:35:23] crap i did that and now i regret it [22:35:27] sorry, should've done that for ya robh, I was just curious [22:35:31] why did no one make him acknowledge server access respoinsiblities? [22:35:36] =P [22:35:37] hah! 
[22:35:48] * greg-g retracks apology [22:35:51] you were quick checking that ssh key [22:35:53] and no one made sure that key wasn't labs key [22:36:01] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:36:03] i forgot that part, blaaaaah [22:36:06] now i gott acheck [22:36:30] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:36:43] ok, it doesnt match key in ldap [22:36:52] so yay [22:36:58] (labs key) [22:37:10] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:37:10] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [22:37:10] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:37:20] so whew. I can simply direct him to the responsibilities page in my resolution email [22:37:20] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:37:20] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:37:20] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on analytics1009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:37:30] RECOVERY - puppet last run on cp1066 is OK: OK: 
Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:37:40] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [22:37:40] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:37:50] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on db1005 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on zinc is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:38:00] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:38:01] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:38:14] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [22:38:20] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [22:38:20] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:38:20] 
RECOVERY - puppet last run on elastic1010 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:38:20] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:38:30] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:38:50] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:39:01] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:40:16] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:40:34] yay recoveries [22:45:19] hey, now that the icinga spam is over, ops should make one of these for us: https://monitor.archive.org/weathermap/weathermap.html [22:48:57] greg-g: guy who originally wrote that tool was an old colleague of us/alex [22:49:02] well, alex's boss at one point too :) [22:49:05] awesome :) [22:49:18] the perl version [22:49:26] it got rewritten in PHP by some person at some point [22:51:34] (03CR) 10Matanya: [C: 04-1] Modify nova role to better support labs uses. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141836 (owner: 10Andrew Bogott) [22:54:54] paravoid: Any idea where the source is? [22:57:30] cacti [22:58:33] Reedy, have you already deployed https://gerrit.wikimedia.org/r/#/c/143473/ ? [22:58:44] Nope [22:58:53] Didn't make an update commit for core [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140701T2300) [23:00:21] * MaxSem volunteers [23:02:49] to do swat, or the update? 
;) [23:03:01] the swat [23:04:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 80 data above and 9 below the confidence bounds [23:04:25] !log maxsem Synchronized php-1.24wmf10/resources/: https://gerrit.wikimedia.org/r/#/c/142975/ (duration: 00m 19s) [23:04:29] Logged the message, Master [23:05:26] !log maxsem Synchronized php-1.24wmf11/resources/: https://gerrit.wikimedia.org/r/#/c/142975/ (duration: 00m 05s) [23:05:31] Logged the message, Master [23:06:08] legoktm, yt? about to deploy your stuff [23:06:13] hey [23:06:21] I'm just doing the core commit... [23:06:53] Reedy, it's already in deployment branch [23:07:11] it needs a submodule bump [23:07:23] That's what I mean [23:07:28] https://gerrit.wikimedia.org/r/143505 [23:07:29] eh [23:08:05] I should probably sleep sometimes:) [23:09:30] !log maxsem Synchronized php-1.24wmf11/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/143473/ (duration: 00m 05s) [23:09:34] Logged the message, Master [23:09:57] thanks! [23:10:06] * MaxSem scratches head [23:10:12] nothing else to deploy? 
[23:13:44] (03CR) 10Dzahn: [C: 04-2] dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 (owner: 10Rush) [23:21:20] (03PS6) 10Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 [23:26:18] (03PS1) 10Rush: trusty friendly phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/143510 [23:31:54] (03CR) 10MaxSem: [C: 032] FeaturedFeeds for Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [23:32:02] (03Merged) 10jenkins-bot: FeaturedFeeds for Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [23:34:18] !log maxsem Synchronized wmf-config/FeaturedFeedsWMF.php: https://gerrit.wikimedia.org/r/#/c/136316/ (duration: 00m 04s) [23:34:22] Logged the message, Master [23:36:13] !log maxsem Synchronized wmf-config/FeaturedFeedsWMF.php: https://gerrit.wikimedia.org/r/#/c/136316/ now for realz (duration: 00m 04s) [23:36:18] Logged the message, Master [23:40:35] (03PS1) 10RobH: setting up the node definitions for francium as blog server [operations/puppet] - 10https://gerrit.wikimedia.org/r/143517 [23:42:22] (03CR) 10RobH: [C: 032] "self reviewing since its just applying already tested class to a new server (and no service changes yet)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143517 (owner: 10RobH) [23:43:16] !log any francium errors can be ignored, as the software doesn't fully deploy from puppet and its not in service [23:43:22] Logged the message, Master [23:43:56] (03CR) 10Dzahn: [C: 031] trusty friendly phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/143510 (owner: 10Rush) [23:59:32] (03PS2) 10Rush: trusty friendly phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/143510