[00:08:54] New review: Tim Starling; "The code implies that this error occurs when the search *query* is empty, not the search results. Th..." [operations/debs/lucene-search-2] (master) C: -1; - https://gerrit.wikimedia.org/r/60860 [00:16:53] paravoid: ping [00:16:58] Guest83333: pong [00:17:15] paravoid: May I PM? [00:17:35] sure [00:17:39] you don't have to ask first :) [00:24:03] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60911 [00:26:05] WTF [00:26:16] Did someone mess with host SSH keys? [00:26:25] what key? [00:26:44] I'm getting "The authenticity of host 'mw1' couldn't be established" for every host when I try sync-file [00:27:51] I get [00:27:52] mw76: Warning: Permanently added 'mw76,10.0.11.76' (RSA) to the list of known hosts. [00:27:53] mw77: Warning: Permanently added 'mw77,10.0.11.77' (RSA) to the list of known hosts. [00:27:53] mw75: Warning: Permanently added 'mw75,10.0.11.75' (RSA) to the list of known hosts. [00:27:53] mw81: Warning: Permanently added 'mw81,10.0.11.81' (RSA) to the list of known hosts. [00:28:25] heh, we must have different SSH options then [00:28:45] But doesn't this mean something must be screwed up if we both get this for *all* hosts? [00:28:45] related to Leslie upgrading them? [00:29:04] Probably not [00:29:05] were their addresses changed? [00:29:16] Because it's also doing it for hume, bast1001 and searchidx1001 [00:29:26] no, i don't think so [00:29:48] strict host key validation would be useful if we had some way for securely updating ssh_known_hosts [00:30:01] but last time I checked, we didn't, that's why I have it disabled [00:31:20] ha ha [00:31:22] -rw------- 1 root root 2562362 Apr 26 00:11 ssh_known_hosts [00:31:27] helps if you can read the file [00:31:31] did stuff like ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R bayle [00:31:32] *sigh* [00:31:41] to remove the one for bayle..f.e. [00:32:06] TimStarling: Mind if I make that 644? [00:32:10] I did it already [00:32:12] OK [00:32:14] seems to work now [00:32:25] Syncing now [00:32:37] !log catrope synchronized wmf-config/CommonSettings.php 'Remove wmgVisualEditorHide' [00:32:39] !log on fenari: fixed permissions on /etc/ssh/ssh_known_hosts [00:32:45] Logged the message, Master [00:32:53] !log catrope synchronized wmf-config/InitialiseSettings.php 'Clean up VisualEditor config' [00:32:53] Logged the message, Master [00:33:01] Logged the message, Master [00:35:46] New review: Dzahn; "just adding comments and makes lint check happy:)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60851 [00:35:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60851 [00:42:20] Ryan_Lane: ping [00:42:26] sup? [00:42:35] Ryan_Lane: http://techcrunch.com/2013/04/25/facebook-parse/ [00:42:59] is that what you've been working on? 
:) [00:43:13] Ryan_Lane: no [00:43:21] Ryan_Lane: that means that Ben Hartshorne is now at Facebook [00:43:27] oooohhhhh [00:43:28] right [00:44:09] I forgot he moved there [00:44:11] Ryan_Lane: see https://www.parse.com/about/team [00:44:41] Ryan_Lane: click on Ben's head and read his bio — it's pretty cool [00:46:18] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:46:18] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:46:18] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:46:28] heh [00:48:29] New patchset: Jforrester; "wikibugs: Set up #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [00:51:17] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60947 [00:52:08] > [00:52:08] To apply via API, send a POST to: [00:52:09] https://www.parse.com/jobs/apply [00:52:09] > [00:52:10] Heh. [00:53:20] !log catrope synchronized php-1.22wmf2/extensions/VisualEditor/ 'Update VisualEditor to master' [00:53:28] Logged the message, Master [00:54:01] Susan: ha ha ha you should apply [00:54:12] Susan: the Facebook deal isn't closed yet ;-) [00:57:14] !log kaldari synchronized wmf-config/InitialiseSettings.php 'temporarily disabling WikiLove per bug 47457' [00:57:22] Logged the message, Master [00:59:08] you broke WikiLove... [01:02:50] odder: Broke? It's been disabled. [01:03:30] Susan: I know. [01:03:40] What do you see that's broken? [01:04:05] see bug :) [01:05:04] https://bugzilla.wikimedia.org/show_bug.cgi?id=47457 ? [01:05:10] I'm not following you. [01:14:03] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:54:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [01:55:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 13 seconds [02:10:04] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [02:12:39] !log LocalisationUpdate completed (1.22wmf2) at Fri Apr 26 02:12:38 UTC 2013 [02:12:47] Logged the message, Master [02:14:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication d [03:45:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 26 03:45:44 UTC 2013 [03:45:51] Logged the message, Master [04:58:15] why don't we use NAT? [04:59:17] in what way is a SOCKS proxy more secure or otherwise better than a specific firewall rule on a NAT server? [04:59:57] i suppose you could log requests or limit to certain domains/paths or something [05:00:25] harder to use on the client and a little more flexibility for the man in the middle. i think [05:03:16] I'm just considering the options for yet another internal/external relay [05:03:23] in particular IRC [05:04:29] it just seems like a waste of time, if we could have a working default next hop [05:05:59] what's this relay for? 
[05:15:21] it's for logmsgbot [05:15:54] the deployment scripts on tin have to send messages to freenode [05:20:18] oh, right [05:20:49] i suppose you could use udp (i think you mentioned that before) or salt or kafka [05:21:24] but maybe i'm mixing up irc bots [05:24:56] i guess i'm undecided about whether it's worth blocking outbound [05:29:46] there are many options [05:30:15] it's one of those things where you have to decide what camp you belong to before you can choose a solution :) [05:30:18] * jeremyb_ was limiting to stuff already in use :) [05:31:18] I don't have an opinion about the network setup, but I think there's a high-level reason for preferring the use of some message bus, and that's to allow people write software to react to what is happening on the cluster in useful and interesting ways [05:31:28] yeah, I knew you would like zmq [05:31:42] in fact I have just been reading its documentation for that reason [05:32:08] oh. you have to make it past the whole idiotic "it's a radioactive isotope that cures AIDS" nonsense [05:33:50] New patchset: Stefan.petrea; "Added packages needed for Jenkins environment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60965 [05:34:02] of course IRC is also a message bus and a good one for certain applications, but it's also annoying to parse and annoyingly nontrivial to remain connected to [05:34:05] hello Operations [05:34:13] I'm from the Analytics team [05:34:15] please merge https://gerrit.wikimedia.org/r/60965 [05:34:23] I like that, straight & to the point [05:34:34] there are packages needed by the Jenkins environment on the Jenkins environment [05:34:56] hello average [05:34:59] hi TimStarling [05:34:59] do you have a name? [05:35:03] I do [05:35:07] Stefan Petrea [05:36:41] that's my name [05:36:47] if you wait another couple of hours, antoine will be awake and will be able to sort that out for you [05:36:57] or is it more urgent than that? [05:37:15] TimStarling: not very urgent, can wait [05:37:28] ori-l: for the IRC part of it, I've been looking at two solutions [05:37:36] just a bunch of packages that need to be present on the Jenkins [05:37:51] one is the python irclib package which we use already for this and various other things [05:38:06] average: You should add reviewers to your Gerrit changesets. [05:38:22] the other is supybot, which does a million things by default, but I've just confirmed that it can be configured to not do any of those default things, and just run a single plugin instead [05:39:20] What's zmq? [05:39:31] zeromq [05:39:32] yeah, I've written supybot plugins before. If you're just looking for "stay connected to IRC and behave yourself", you can look at Twisted's IRC lib, which IIRC supybot uses under the covers [05:39:35] supybot happens to have a plugin already which accepts IRC lines via a simple text protocol via TCP [05:39:35] it's a distributed message queue [05:39:39] Susan: ^^ [05:39:42] https://github.com/plq/zmq [05:39:47] That? [05:39:51] the protocol includes authentication [05:40:00] Susan: no [05:40:00] https://en.wikipedia.org/wiki/Zmq doesn't exist, so I'm kinda lost. [05:40:09] Susan: http://www.zeromq.org/ [05:40:13] Ah. [05:40:14] ottomata: hi sir :) [05:40:26] Susan wasn't that far off, that is a client for it [05:40:45] hi ori-l [05:40:53] hi average [05:41:01] anyway, either way, the IRC part is python [05:41:06] I thought everyone used XMPP. [05:41:23] now, you can do a simple TCP socket server in about 5 lines of code in python [05:41:35] Susan: uhm.. 
in the Analytics team we do IRC and Google Hangouts [05:41:35] probably a ZeroMQ listener is about the same number [05:41:45] Susan: not so much XMPP(or Gtalk) .. [05:41:50] they're equivalent from that perspective [05:42:11] the difference would mostly be in how hard they are to puppetize [05:42:17] and I think plain TCP would win there [05:42:18] ori-l: you know I though your isotope bit was sarcasm...and then it wasn't [05:42:33] average: but hangouts are XMPP? [05:42:37] maybe [05:42:38] *thought [05:43:13] Susan: I'm talking about a bot which is in this channel right now [05:43:18] it can't be trivially moved to XMPP [05:43:27] TimStarling: puppetizing zmq is package { 'python-zmq': ensure => present, } [05:44:18] but TCP works fine. the nice thing zero gives you is multiple subscribers and some in-memory buffering to smooth over brief disconnects [05:44:41] seems like overkill in this context [05:44:42] salt uses zeromq as its message bus and ryan deployed salt everywhere [05:44:46] i wonder how deploys get to graphite now [05:44:55] you could have graphite just be an extra listener [05:45:05] I am imagining a venn diagram of the features I need and the features various things will give me [05:45:28] TimStarling: as you said, it's the same python code. zmq mimics the familiar BSD socket API so they're not just both five lines, they're more or less the same five lines [05:45:39] and supybot and zeromq are a fair way towards overkill by that measure [05:46:43] supybot is, certainly [05:47:12] but I get paid if you use ZeroMQ, as the official Wikimedia cheerleader [05:47:37] then there is the question of whether to extend ircecho or to do my own new thing [05:47:51] what's ircecho? [05:47:57] ircecho annoys me in a number of ways, and I would be tempted to fix those things if I went my own way [05:48:04] Is the goal to get rid of the bots? [05:48:09] it's what logmsgbot uses at the moment [05:48:38] ori-l: https://wikitech.wikimedia.org/wiki/Ircecho [05:48:44] logmsgbot, gerrit-wm, and icinga-wm are all ircecho [05:48:47] it reads from a file like "tail -f" and writes to IRC [05:49:01] ircecho is just a way to pipe `tail -f` to a channel [05:49:04] it originally actually used tail -f | ircecho [05:49:09] Susan: thank you for adding Antoine to the ticket [05:49:17] but Ryan rewrote it to use inotify [05:49:21] No problem. [05:49:29] apparently not realising that tail -f also uses inotify now [05:49:31] I think getting rid of morebots would be nice. [05:49:32] that's actually kind of nice, I don't know why people hate pipes [05:49:40] If something could do all of that more ... directly. [05:49:48] But I guess there are different purposes. [05:50:22] anyway, it's got no configuration file [05:50:40] the configuration code that should be written in python is instead written in ruby [05:50:56] what's got no configuration file? [05:50:58] 40 lines of ruby code embedded in a puppet template [05:51:06] oh, the ruby i wrote? :) [05:51:07] ircecho [05:51:28] which eventually sets the command line parameters via the default file [05:52:09] i'm itching to write this now; are you sure you want to? [05:53:16] I am pretty sure that I don't want to, so go right ahead [05:53:38] hehe, TimStarling has people writing stuff for him left and right [05:53:45] first Coren, now ori-l [05:53:59] is Coren's thing deployed? [05:54:41] https://gerrit.wikimedia.org/r/#/c/58449/ [05:54:45] no, he still hasn't amended it since my review [05:56:25] has irclib been reliable? 
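The exchange above sketches the design ori-l and TimStarling are weighing: a "simple TCP socket server in about 5 lines of code in python" that hands received lines to an IRC client (irclib or a supybot plugin). The code below only illustrates that idea under assumed details; the port, the forward_to_irc stub and the pyzmq variant in the trailing comment are invented for the example, and this is neither ircecho nor the adminbot change (60972) Ori uploads later in this log.

    import socket

    def forward_to_irc(line):
        # Stub: the real bot would hand the line to an IRC client
        # (python irclib, or a supybot plugin) for delivery to the channel.
        print(line)

    # The "simple TCP socket server in about 5 lines" part:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('127.0.0.1', 9999))   # address and port chosen arbitrarily
    listener.listen(5)
    while True:
        conn, _addr = listener.accept()
        line = conn.makefile().readline().strip()
        if line:
            forward_to_irc(line)
        conn.close()

    # ori-l's point that ZeroMQ "mimics the familiar BSD socket API": the
    # pyzmq version is roughly the same handful of lines, e.g.
    #   import zmq
    #   sock = zmq.Context().socket(zmq.PULL)
    #   sock.bind('tcp://127.0.0.1:9999')
    #   while True:
    #       forward_to_irc(sock.recv())

Authentication and staying reliably connected to freenode are the parts the discussion leaves to irclib or supybot rather than to a bare listener like this.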
[05:56:40] yes, I think the ircbot part of it is quite reliable [05:57:03] ok, i'll just use that [05:57:31] and what else? zmq? [05:57:52] i'm contractually obliged to at least 'import zmq' in the header, but other than that i can use tcp [05:58:07] heh [06:04:02] TimStarling: you realize it was written in python before I made changes to it right? [06:04:10] and that it was broken in a number of ways [06:04:29] also, the ruby config thing I −1'd a number of times [06:04:30] ohi Ryan_Lane [06:04:39] I'd prefer a config file in yaml [06:04:49] stuck in /etc, installed by puppet [06:04:58] I saw you write that on another change, about YAML configuration [06:05:19] YAML did a bad thing to me when I was a child so my automatic reaction was to scoff at that [06:05:21] I chose to go the inotify route because I needed to read multiple files (gerrit) [06:05:24] hahaha [06:05:32] but then, I had to think, maybe configuration is the only thing YAML is actually useful for [06:05:34] well, you could go with json or paste, or a ini files, etc [06:05:45] I only like YAML for config [06:05:49] I use JSON for everything else [06:05:50] or XML [06:06:13] there used to be a YAML output format in the MW API, using a YAML library that had been imported without review [06:06:15] ini is a pain in the ass [06:06:26] paste is just a harder to use ini [06:06:28] and it turned out to be ridiculously full of bugs [06:06:39] heh [06:06:57] so I ripped it out, and now if you ask for format=yaml, it will give you JSON, since that is a strict subset [06:06:59] I remember this YAML library [06:07:22] I had to add it back into OpenStackManager because cloud-init only accepts YAML [06:07:29] maybe we're not using that anymore in it [06:08:09] it takes a lot of code to achieve beautiful output which actually round-trips correctly, that's my fundamental criticism of it [06:08:10] nah. still there. I hate cloud-init [06:08:40] JSON is a much better format [06:08:47] I like it for config because it's easier for humans to edit [06:09:08] it --> YAML [06:09:13] sorry [06:09:13] yes [06:10:07] anyway, yes, I can believe that YAML would work for configuration [06:10:08] anyway, I have no strong feeling about how tin -> a public host is implemented [06:10:28] I wrote on the gerrit change that we could just use socat [06:10:30] there's lots of ways to do any. none are necessarily superior to others [06:10:34] but that turns out to be wrong [06:10:38] Anyway, . [06:10:38] oh? [06:11:05] since that would additionally need a firewall to stop it from being an anonymous open proxy for connecting to freenode [06:11:33] right [06:11:44] but, I was just reading the socat man page, and it does actually have some authentication parameters in there [06:11:57] but it doesn't matter if ori-l is just going to do it in python [06:12:03] one thing I was doing in git deploy was writing the messages into redis to be picked up, since I was already storing things in redis [06:12:14] but in this case it's absurd overkill [06:13:07] yeah, if you're going to use something high-level, I wouldn't have thought redis would be at the top of the list [06:13:29] the positive about doing it in redis for git-deploy was that the messages wouldn't get lost [06:13:29] there's a lot of message queue servers [06:13:49] yeah, I was just reusing what was there, rather than using an additional thing [06:14:47] there's lots of better things to use otherwise [06:15:01] what do you think of my idea of using NAT? 
it would solve the general class of problems [06:15:05] zmq is an easy one since it's already installed everywhere [06:15:24] you'll have to get mark's opinion on that. I dislike NAT generally [06:15:45] I suppose he was the one who decided to not use it in the first place [06:16:16] I'm pretty sure we had NAT in the olden days [06:16:19] maybe on albert? [06:16:32] before my time :) [06:17:21] it's not very often the internal hosts need to connect to the outside world [06:17:33] this is the only really recent example I can think of [06:17:58] https://wikitech.wikimedia.org/w/index.php?title=Albert&oldid=12509 [06:18:35] maybe james day set it up... [06:19:34] who would be the best person to assign https://bugzilla.wikimedia.org/show_bug.cgi?id=30861 too? doesn't robh normally handle the domain names? [06:19:45] yep [06:19:52] there is an argument that a botnet client wouldn't be able to dial home if it has no network access [06:19:55] and sometimes mutante-away [06:20:15] but if it was a back door made for wikimedia, it could just dial home via one of the HTTP proxies [06:20:47] well, AFAIK our forward proxies only allow access to specific plaes [06:20:54] *places [06:20:58] this keyboard is killing me [06:21:08] p858snake|l: oh, *that* one [06:22:24] I can get any website out of brewster [06:22:36] hm. well, that's lame. [06:22:57] do you know the HTTP copy upload feature? I think that is enabled now [06:23:09] it really should only allow access to apt security, and a few other places [06:23:27] yeah, it's enabled on commons [06:23:33] ohhhhhh [06:23:38] I totally forgot about that [06:23:53] that's linne [06:23:56] yep [06:24:28] * Aaron|home dislikes the code for that [06:24:48] that's mostly there because we don't support large enough uploads, right? [06:25:27] or is it also there to get images and such from flickr and other locations? [06:26:48] I thought it was the later, the I guess it works for both [06:27:07] yeah, I'm not sure why it was enabled [06:27:47] does chunked upload work now? [06:28:20] it should [06:29:07] "a feature to import photographs from Flickr (started as a Google Summer of Code project)" is mentioned in the blog as a bullet point [06:29:23] !b 30861 | p858snake|l [06:29:23] p858snake|l: https://bugzilla.wikimedia.org/30861 [06:31:11] the only place that's enabled AFAIK is flickr [06:35:51] http://ankuranand.in/gsoc/wordpress/?p=83 [06:37:45] !log Restarted Jenkins. It was stuck in some while loop doing tons of stat() on unexisting file :( Would be up in roughly 40 minutes :( [06:37:53] Logged the message, Master [06:38:03] Hi hashar. [06:38:34] hashar: you've been added to a gerrit change by Susan for average [06:38:50] for average, [06:39:08] average_drifter of the analytics tea [06:39:09] m [06:39:23] it's me, average [06:39:26] sorry can't parse that, I am barely awake [06:39:33] hashar: my name's Stefan Petrea [06:39:48] bonjour. had coffee? 
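Earlier in this stretch Ryan_Lane says he would prefer the ircecho configuration as YAML "stuck in /etc, installed by puppet", and TimStarling allows that configuration may be the one thing YAML is actually useful for. A minimal sketch of what reading such a file could look like, assuming PyYAML and an invented path and key set:

    import yaml  # PyYAML

    # Hypothetical /etc/ircecho.yaml; the path and keys are made up for
    # illustration, puppet would install the real file:
    #
    #   server: chat.freenode.net
    #   port: 6667
    #   nick: logmsgbot
    #   channels:
    #     - "#wikimedia-operations"

    with open('/etc/ircecho.yaml') as f:
        config = yaml.safe_load(f)

    print(config['nick'], config['channels'])

yaml.safe_load sticks to plain mappings, lists and scalars, which is all a config file like this needs.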
[06:40:05] i also identify as "barely awake" [06:40:27] oh just saw Jenkins is being restarted [06:40:40] yeah had too sorry :( [06:40:54] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [06:40:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [06:41:04] it is causing me more and more headhaches [06:48:59] average: so I guess ops would be able to install the perl packages :-D [06:49:32] average: I don't have merge rights on ops/puppet [06:51:52] back to bed [06:53:46] * jeremyb_ figured the answer would be something like that... [07:17:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:18:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [07:31:19] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [08:12:59] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [08:55:44] New patchset: Hashar; "wikistats packages needed for Jenkins environment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60965 [08:57:13] New review: Hashar; "PS2: moved the packages from `contint` module to the new misc::wikistats::package class." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/60965 [09:31:57] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60943 [09:34:28] New patchset: Ori.livneh; "Add simple TCP->IRC forwarding bot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60972 [10:00:01] MaxSem: what's the reason for the current 666 / 302 redirect handling in varnish for m/zero? [10:00:08] why isn't that done in mediawiki? [10:00:21] * MaxSem LOOKS [10:00:29] ehm, caps [10:00:31] :) [10:06:02] I guess because it doesn't use cache space this way [10:06:40] aha [10:06:51] I recall some of this [10:07:20] so the problem is that some carriers allow only particular language domains for free [10:08:42] thus, redirection has to be done sooner rarther than later to prevent people from paying for visiting other domains [10:09:44] but yeah, I explained Adam and Yuri that most of it, if not all, can be done in PHP (requires m. and sero. domain portals) [10:09:55] yeah [10:10:16] I believe we should make the apache/mediawiki layer working directly with m and zero urls [10:10:23] without all the rewriting in varnish [10:10:31] now of course, our apache config is a huge mess [10:10:33] that doesn't help [10:10:36] the issue is further complicated that changing even some details requires coordination with a lot of carriers [10:12:41] btw mark, I don't see any changes in varnish hit rate so far [10:13:49] MaxSem: https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Mobile+caches+eqiad&h=cp1041.eqiad.wmnet&v=65608808&m=varnish.fetch_length&jr=&js=&vl=N%2Fs&ti=Fetch+with+Length [10:13:56] backend fetches have gone down a bit [10:14:03] unfortunately the monthly graphs are fucked :( [10:14:16] so you know what the problem with the hit rate is? 
[10:14:25] the stupid purges are taken into account as well [10:14:34] and on the mobile cluster, there are nearly as many purge requests as normal requests [10:14:36] d'oh [10:14:39] so that skews the hit rate heavily :( [10:15:03] with squid that's not the case [10:15:10] and on upload, we have far more requests, so it's not as obvious [10:16:15] what was the problem with https://gerrit.wikimedia.org/r/#/c/58729/ ? [10:16:38] i reverted it because the nagios monitoring triggered [10:16:44] as it got a 301 instead of the expected pattern [10:16:50] but then I corrected that and put it back in place [10:16:58] but now I think it's related to the new 301 issue you're seeing [10:17:33] well, it's not causing it, it was hiding it before ;) [10:42:51] apergos: does replag in (0,1) interval get reported as 0 or 1 on dbtree? [10:43:18] I don't know [10:44:41] :( [10:44:57] is there a way to get a more precise reporting? ganglia maybe? [10:46:57] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [10:46:57] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [10:46:57] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:47] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=mysql_slave_lag&mreg[]=^mysql_slave_lag%24&hreg[]=db10%2807|24|28%29&aggregate=1&hl=db1007.eqiad.wmnet|MySQL%20eqiad,db1024.eqiad.wmnet|MySQL%20eqiad,db1028.eqiad.wmnet|MySQL%20eqiad [10:57:11] uh [10:57:11] I don't know how either of those gets its information [10:57:11] would have to go look at the code for the checks [11:14:47] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [11:19:12] New patchset: Mark Bergsma; "Support labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60977 [11:20:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60977 [11:21:24] New patchset: Aude; "(bug 47620) Exclude user and talk pages from wikidata features in clients" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60978 [11:22:10] New patchset: Ori.livneh; "Add simple TCP->IRC forwarding bot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60972 [11:23:49] New patchset: Mark Bergsma; "Include the configuration class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60979 [11:24:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60979 [11:29:18] New patchset: Mark Bergsma; "Fix $override_hostname template lookup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60980 [11:30:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60980 [11:40:37] New patchset: Mark Bergsma; "Use the new ganglia monitor in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60982 [11:41:08] New patchset: Mark Bergsma; "Use the new ganglia module for monitors in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60982 [11:42:37] New patchset: Mark Bergsma; "Use the new ganglia module for monitors in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60982 [11:44:30] is mediawiki using Net_IPv4 or anything similar? 
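On the hit-rate question just above, mark's point is that HTCP purge requests get counted alongside real client requests on the mobile Varnish cluster, and since purges are nearly as numerous as normal requests they by themselves nearly halve the apparent hit ratio. A toy calculation with invented figures (these are not real varnishstat counters):

    # Invented numbers, only to show the skew mark describes.
    client_req = 2000000.0   # everything the frontend counted, purges included
    purges     =  950000.0   # HTCP purges, nearly as many as normal requests
    hits       =  840000.0   # cache hits

    naive     = hits / client_req              # 0.42: purges counted as traffic
    corrected = hits / (client_req - purges)   # 0.80: hit rate for real client traffic

    print("naive: %.1f%%  corrected: %.1f%%" % (100 * naive, 100 * corrected))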
[11:44:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60982 [11:58:40] Change abandoned: MaxSem; "Going the way of https://gerrit.wikimedia.org/r/#/c/60945/ instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60879 [12:10:07] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [12:21:55] New patchset: Mark Bergsma; "Disable the proxy for non pmtpa/eqiad hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60989 [12:23:27] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25002 [12:23:44] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38720 [12:25:06] New patchset: Mark Bergsma; "Disable the proxy for non pmtpa/eqiad hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60989 [12:25:49] alrighty, daily pleas for review time!:) [12:25:51] https://gerrit.wikimedia.org/r/60680 [12:26:00] https://gerrit.wikimedia.org/r/60945 [12:26:05] https://gerrit.wikimedia.org/r/60945 [12:26:12] pretty please:) [12:28:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60989 [12:31:22] New patchset: Mark Bergsma; "Create role::logging::eventlogging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60680 [12:32:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60680 [12:36:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60945 [12:38:30] New review: Mark Bergsma; "Why wasn't this put in the ::logging subclass?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/60923 [12:38:49] MaxSem: I hear ryan is open to review bribes [12:43:51] New patchset: MaxSem; "EventLogging Varnish setup for Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60923 [12:52:27] New patchset: Mark Bergsma; "Disable the generic proxy, and add proxying for specific apt repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60994 [12:53:33] MaxSem: just put the damn realm check inside the logging class [12:53:51] in fact [12:54:00] for eventlogging, only the ip is different [12:54:03] so you can put that as a selector there [13:00:42] ah [13:01:05] and then wrap the other two loggers in an if realm is production [13:04:30] New patchset: Faidon; "Initial Ceph module, role class & site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60997 [13:08:24] New patchset: MaxSem; "EventLogging Varnish setup for Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60923 [13:14:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60923 [13:14:31] thanks a lot, mark [13:14:58] ^demon: Gerrit died [13:15:03] gerrit died! [13:15:09] 503 [13:15:12] <^demon> Yeah, replication freaked out. [13:15:15] <^demon> Was about to !log [13:15:23] <^demon> !log gerrit replication freaked out, restarting [13:15:33] Logged the message, Master [13:15:46] ah, it's back :) [13:16:20] "In the third minute he shall rise again." [13:17:46] <^demon> The fact that keys are still loaded by core and not the replication plugin annoys me. [13:17:53] * ^demon whacks gerrit [13:20:53] morning! hashar, you there? wanna brain bounce a puppet design decision with me? 
:) [13:21:08] ottomata: sure :) [13:21:34] morning ottomata [13:21:37] morning! [13:21:45] ottomata: you're ops right ? [13:21:51] yup + analytics [13:22:00] hashar: ok so I run into this a lot with hadoopy stuff [13:22:00] ottomata: could you help me merge a patchset please ? [13:22:04] i have client and servers [13:22:05] average, sure! [13:22:28] but, for the most part, I want the same config files for both clients and servers to be on all my nodes [13:22:39] so, previously, what I was doing [13:22:51] ottomata: thanks ! https://gerrit.wikimedia.org/r/60965 [13:23:04] ottomata: some modules that are needed for jenkins basically [13:23:17] ottomata: that change is good to me [13:23:17] :D [13:23:31] though wikistats role class does not seem to be applied anywhere :D [13:24:01] yeah, looks like average didn't add that though [13:24:39] (ok, will do patchset first, hashar, i'll continue explaining my puppet stuff in a sec :) ) [13:24:45] so [13:24:46] what about [13:24:46] misc::wikistats::updates [13:24:47] ? [13:25:15] ah role::wikistats [13:25:15] hm [13:25:25] hm. [13:25:47] ottomata: no problem [13:25:50] any place is good [13:25:57] as long as teh packages get installed on jenkins [13:26:20] , no i'm just wondering why you added the inclusion of wikistats::packages in wikistats::updates [13:26:43] New patchset: Faidon; "Initial Ceph module, role class & site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60997 [13:27:18] uhm, not sure if I did the inclusion [13:27:20] maybe hashar ? [13:27:49] class misc::wikistats::updates { [13:27:49] » include misc::wikistats::packages [13:27:55] it just wasn't there before [13:27:56] its probably fine [13:28:33] ok [13:30:24] New review: Mark Bergsma; "Minor comments inline" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60997 [13:31:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60965 [13:32:05] average, done. [13:32:58] ottomata: thank you :) [13:43:31] ottomata: lucene search has a a similar system. It has clients and a server (the indexers) [13:43:42] ottomata: maybe you can get some inspiration from there [13:44:11] ok lemme look (i'm writing up examples for you too, which is actually helping me sort out my thoughts) [13:44:13] New patchset: Faidon; "Initial Ceph module, role class & site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60997 [13:44:40] hashar, where is that? (i'm looking in ops/puppet) [13:44:50] manifests/search.pp iirc [13:44:57] or: git grep -i lucene:: [13:44:58] :) [13:45:27] haha, yeah [13:45:40] no lucene module though, not very clean. that sooortta does one of my options [13:47:33] !log Moved udpmcast from hooft to nescio [13:47:39] Logged the message, Master [13:47:42] !log dist upgrade of hooft to precise [13:47:51] Logged the message, Master [13:47:52] New patchset: Faidon; "Initial Ceph module, role class & site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60997 [13:49:05] !log Rebooting hooft [13:49:14] Logged the message, Master [13:49:19] ottomata: is there a way to know when the packages get installed ? [13:49:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60997 [13:49:59] puppet just needs to run, want me to run it manually somewhere? [13:51:54] New patchset: BBlack; "Work-In-Progress vhtcpd code." 
[operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/60390 [13:53:25] PROBLEM - SSH on hooft is CRITICAL: Connection refused [13:54:06] ottomata: if you can that would be awesome, I am hitting rebuild button jenkins [13:54:30] ottomata: i will [13:54:36] danke [13:55:26] RECOVERY - SSH on hooft is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:56:13] it is awfully slow :( [13:57:16] hashar, i might be working this out all by myself by writing this up, it is helping me clear up my thoughts [13:57:21] here's what i'm working on, if you are interestd [13:57:22] http://codeshare.io/NjftI [13:57:23] :) [14:01:05] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:15] PROBLEM - Host ms-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:55] ottomata: ether pad with syntax highlight ? [14:01:59] ottomata: luuuuve that [14:02:21] i know, awesome right! [14:02:25] i foudn that on reddit the other day [14:02:30] this is the first chance i've had to use it! [14:05:40] average: the perl packages for wikistats are installed on Jenkins server :) [14:06:15] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 4.62 ms [14:06:26] hashar: wow, nice !! [14:06:29] hashar , ottomata thank you [14:06:45] p858snake|l: did you look at the bug? [14:06:47] ottomata: I will go with option C :D [14:07:02] ottomata: seems nice to me to have booger::client / booger::server .. [14:07:15] yeah i think so too, that's what I'm looking at now [14:07:16] jeremyb_: My crystal ball is currently in the store getting serviced. [14:07:31] i think there was a reason I didn't like this before though, but typing this out is helping me clear out my brain [14:07:34] p858snake|l: what model? [14:07:36] going to try that for now and see how it goes [14:07:42] if i run into reasons why i'll come back and we will brain bounce [14:07:43] ottomata: with the package {} / common config in a booger::common or something similar [14:07:43] thank you [14:07:44] p858snake|l: anyway, did you look? [14:08:22] * p858snake|l goes to look for a facepalm picture… [14:08:32] yeahhhhhh that could work better than :;config [14:08:37] cause then i could put the package there [14:08:48] i think I don't like having to pass all of the parameters around all the time [14:08:49] hmmm [14:09:07] ottomata: can't you get default parameters ? [14:11:41] ha, this thing doesn't work as well as etherpad when we both edit [14:11:49] nawww inherits isn't evil! [14:11:49] oops [14:11:58] hashar, yeah [14:12:08] can't you include ? [14:12:09] i probably will use ::params class with any of these [14:12:11] to store defaults [14:12:31] New patchset: Faidon; "Fix order between $ganglia_aggregator and include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61002 [14:13:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61002 [14:13:45] so hashar, i think i've done option D and liked it before too [14:13:48] no inheritance [14:14:00] don't need to pass the args around all over the place [14:14:06] but, you do need to manually include the common class [14:20:49] ottomata: maybe that will do it [14:21:04] ottomata: will you always need a client? [14:21:24] it depends, i'm trying to work this out for a buncha different purposes [14:21:30] p858snake|l: i highlighted you with a bug #. 
AFAIK you never responded [14:21:33] its not always a 'client' per say, but sometimes a service [14:21:54] uhh, sorry that didn't make sense [14:22:13] jeremyb_: than I probably did read and had no reason to respond [14:22:27] sometimes they are clients+servers (nodemananger, for example, it is a client of a diferent server: the resourcemanager), but is also a server daemon) [14:22:31] p858snake|l: well now i'm asking for a response :) [14:22:40] New patchset: Mark Bergsma; "Split gmetad hosts and aggregator hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61003 [14:26:14] ugh, odder just sent a mail sending peoples to etherpad :( [14:26:15] New patchset: Mark Bergsma; "Disable the generic proxy, and add proxying for specific apt repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60994 [14:26:15] New patchset: Mark Bergsma; "Split gmetad hosts and aggregator hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61003 [14:26:53] odder: don't expect it to be backed up or accessible 10 mins after you last viewed the pad [14:27:08] New patchset: Mark Bergsma; "Disable the generic proxy, and add proxying for specific apt repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60994 [14:27:08] New patchset: Mark Bergsma; "Split gmetad hosts and aggregator hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61003 [14:27:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61003 [14:28:34] ? [14:28:44] odder: ? [14:28:50] what. [14:31:38] jeremyb_: actually, Etherpad works all right for our purposes, much better than the one on Labs [14:31:54] New patchset: Mark Bergsma; "Make hooft a Ganglia aggregator host for esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61004 [14:31:58] odder: ok. just don't assume it won't break with no warning [14:32:14] odder: and don't assume your text will be recoverable [14:33:27] New patchset: Mark Bergsma; "Disable the generic proxy, and add proxying for specific apt repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60994 [14:33:27] New patchset: Mark Bergsma; "Make hooft a Ganglia aggregator host for esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61004 [14:33:47] now that i finally loaded the ticket maybe Thehelpfulone or LeslieCarr want to comment on that ^^^ (etherpad) [14:34:00] RT 4979 [14:34:11] jeremyb_: I have never seen it break compared to the Labs instance. [14:34:22] Isarra: Ping. [14:34:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60994 [14:34:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61004 [14:35:28] Thehelpfulone wrote (re a different user): [14:35:30] > I'll tell them not to use etherpad in the future, promise! [14:36:17] jeremyb_: sorry, there is nothing that fits our purposes better than Etherpad right now. [14:36:35] odder: ok... just don't expect it to not disappear [14:36:40] I guess we could switch to PiratePad, of course, but it's just so nice to keep our things at *.wikimedia.org [14:36:52] odder: you could e.g. set a 5 minutely cron to back it up to a place you control [14:36:59] * aude wonders who *we* is [14:37:07] aude: WLM [14:37:12] ah [14:37:20] aude: 26 14:26:14 < jeremyb_> ugh, odder just sent a mail sending peoples to etherpad :( [14:38:19] odder: What? 
[14:38:29] it's trivial to install our own etherpad-lite :) [14:38:44] jeremyb_: I don't understand, are you recommending people a labs instance as more stable than a production something? [14:39:00] Isarra: I see that Sven_Manguard already told you what :-) [14:39:02] Nemo_bis: i didn't say to use the labs instance [14:39:27] New patchset: Mark Bergsma; "Move the dirty Ganglia migration logic into ganglia.pp, add hooft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61006 [14:39:33] just save it to the wiki every now and then.... [14:40:40] odder: Please stop bothering me. [14:40:48] jeremyb_: so why is odder constantly saying it's better than the one on labs? :) [14:41:50] Isarra: OK, I will. [14:43:42] New patchset: Mark Bergsma; "Move the dirty Ganglia migration logic into ganglia.pp, add hooft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61006 [14:55:26] New patchset: Mark Bergsma; "Move the dirty Ganglia migration logic into ganglia.pp, add hooft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61006 [14:57:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61006 [15:00:38] New patchset: Faidon; "Adjust description for media storage LVS config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61010 [15:19:33] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61010 [15:21:41] hey it looks like the new ceph module broke icinga [15:21:47] oh? [15:22:03] something with the host group not being found [15:22:04] I've learned to ignore the ALERT icinga mails :/ [15:22:08] haven't looked a tthe module [15:22:09] haha [15:22:28] looking [15:23:53] duuh [15:23:53] New patchset: Faidon; "Fix capitalization of ceph monitor_group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61012 [15:24:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61012 [15:30:24] :0 [15:30:27] :) i meant [15:32:00] puppet still running on neon [15:32:00] gah [15:33:53] ok, fixed [15:33:55] thanks LeslieCarr [15:33:57] PROBLEM - NTP on ms-fe1004 is CRITICAL: NTP CRITICAL: No response from NTP server [15:34:07] PROBLEM - RAID on ms-fe1004 is CRITICAL: Connection refused by host [15:34:27] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [15:34:27] PROBLEM - DPKG on ms-fe1004 is CRITICAL: Connection refused by host [15:34:37] PROBLEM - DPKG on ms-fe1003 is CRITICAL: Connection refused by host [15:34:37] PROBLEM - Disk space on ms-fe1004 is CRITICAL: Connection refused by host [15:34:47] PROBLEM - Disk space on ms-fe1003 is CRITICAL: Connection refused by host [15:36:47] PROBLEM - HTTP on ms-fe1004 is CRITICAL: Connection refused [15:37:07] PROBLEM - RAID on ms-fe1003 is CRITICAL: Connection refused by host [15:37:54] New patchset: Mark Bergsma; "Move esams over to the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61016 [15:39:31] New patchset: Mark Bergsma; "Move esams over to the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61016 [15:40:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61016 [15:48:57] PROBLEM - NTP on ms-fe1003 is CRITICAL: NTP CRITICAL: No response from NTP server [15:55:37] !log jenkins: enabled ExtraSettings.php loading {{gerrit|61022}}. 
That enables a bunch of MediaWiki settings for all mediawiki related jobs, namely displaying errors, moving the temp dir to tmpfs, showing exceptions details and enabling development warnings. [15:55:44] Logged the message, Master [16:10:49] New patchset: Mark Bergsma; "Readd class ganglia to standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61024 [16:15:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61024 [16:26:57] New patchset: Mark Bergsma; "Include .pyconf config files as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61027 [16:29:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61027 [16:41:18] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [16:41:18] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [16:42:36] New patchset: Mark Bergsma; "Enable the python module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61029 [16:44:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 206 seconds [16:45:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [16:49:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61029 [16:59:43] New patchset: Ram; "Dump entire global config to a file." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/61033 [17:03:24] average: scrum? [17:05:24] New review: Ram; "This is the rebased version of 55841 which has merge conflicts. I also added the new field 'disabled..." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/61033 [17:08:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 188 seconds [17:08:51] New review: Ram; "The merge conflicts are resolved in https://gerrit.wikimedia.org/r/#/c/61033/" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/55841 [17:09:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [17:12:39] New patchset: Demon; "Puppetize gitblit.properties" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61036 [17:15:12] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: Connection refused [17:15:22] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: Connection timed out [17:15:27] working on it [17:15:31] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: Connection timed out [17:15:32] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.183 second response time [17:15:32] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.189 second response time [17:15:32] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: Connection timed out [17:15:41] PROBLEM - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:42] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:45] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.183 second response time [17:15:51] PROBLEM - LVS HTTP IPv4 
on wikibooks-lb.esams.wikimedia.org is CRITICAL: Connection timed out [17:15:51] PROBLEM - Frontend Squid HTTP on amssq37 is CRITICAL: Connection timed out [17:15:52] PROBLEM - Frontend Squid HTTP on knsq28 is CRITICAL: Connection timed out [17:16:01] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: Connection timed out [17:16:15] need anything? [17:16:21] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 13192 bytes in 1.796 second response time [17:16:31] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 63952 bytes in 0.359 second response time [17:16:32] RECOVERY - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 49356 bytes in 0.445 second response time [17:16:32] PROBLEM - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.178 second response time [17:16:41] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 46822 bytes in 0.367 second response time [17:16:42] RECOVERY - Frontend Squid HTTP on amssq37 is OK: HTTP OK: HTTP/1.0 200 OK - 1417 bytes in 0.182 second response time [17:16:42] RECOVERY - Frontend Squid HTTP on knsq28 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 0.180 second response time [17:16:51] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 40956 bytes in 0.366 second response time [17:16:56] resolving seems broken in esams [17:17:11] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 49832 bytes in 0.726 second response time [17:17:21] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 69161 bytes in 0.448 second response time [17:17:31] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 59103 bytes in 0.355 second response time [17:17:32] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 97469 bytes in 0.541 second response time [17:17:41] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 63952 bytes in 0.353 second response time [17:18:22] RECOVERY - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 63952 bytes in 0.350 second response time [17:18:22] RECOVERY - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 63952 bytes in 0.356 second response time [17:20:11] RECOVERY - Recursive DNS on 91.198.174.6 is OK: DNS OK: 0.094 seconds response time. www.wikipedia.org returns 91.198.174.225 [17:20:47] argh [17:21:05] prefix list LVS-service-IPs included 91.198.174.0/25 [17:31:02] New patchset: MaxSem; "Fix EventLogging fail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61039 [17:31:15] can someone review ^^ please [17:32:01] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:34:29] <^d> MaxSem: +1'd. 
[17:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [17:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:07] New patchset: Faidon; "Ceph: fix radosgw monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61043 [17:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [17:56:12] New patchset: Faidon; "Ceph: fix radosgw monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61043 [17:57:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61043 [18:09:45] LeslieCarr: around? [18:09:49] New patchset: Faidon; "Ceph: add missing requires for radosgw" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61044 [18:13:51] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [18:25:21] RECOVERY - NTP on ms-fe1003 is OK: NTP OK: Offset -0.03247845173 secs [18:26:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.176 second response time [18:28:16] mutante: I reopened https://rt.wikimedia.org/Ticket/Display.html?id=4932 because it needs a +2 on https://gerrit.wikimedia.org/r/#/c/60893/ [18:29:27] mutante: that gerrit change was missed in our list between lwelling and bsitu [18:30:49] mutante: sorry about that. [18:32:21] tewwy: i think mutante's afk [18:32:30] mutante, nopeter: ping [18:32:31] as of 20ish mins ago [18:32:48] jeremyb_: Thanks. I'll send by e-mail also just in case. [18:33:07] k [18:38:07] PROBLEM - HTTP on ms-fe1003 is CRITICAL: Connection refused [18:40:12] New patchset: Faidon; "Ceph: fix a bunch of dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61044 [18:42:38] New patchset: Ori.livneh; "Add simple TCP->IRC forwarding bot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60972 [18:44:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61044 [18:45:37] New patchset: Ori.livneh; "Add simple TCP->IRC forwarding bot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60972 [18:46:51] oooh, it's an ori-l [18:52:17] New patchset: Demon; "Set X-Forwarded-Port so Gitblit knows its on SSL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61050 [18:52:35] ^demon: and proto? [18:52:54] <^demon> Don't need both, one or other is enough. [18:53:32] just thinking about how protoproxy works [18:54:43] New patchset: Demon; "Set X-Forwarded-(Port|Proto) so Gitblit knows its on SSL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61050 [18:55:00] <^demon> Just set both. Can't hurt. 
[18:55:55] New patchset: Demon; "Set X-Forwarded-(Port|Proto) so Gitblit knows its on SSL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61050 [18:56:04] <^demon> And probably helps to enable mod_headers ;-) [19:08:13] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [19:10:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [19:18:37] !log jenkins died again sorry :( But this time I most probably found out the reason why it is slowed down: a web spider attempt to crawl all our builds! [19:18:44] marktraceur: ^^^^ :-) [19:18:45] Logged the message, Master [19:19:04] Aha. [19:19:13] hashar: That would certainly do it [19:19:21] that has driven me nuts for a few days now [19:19:28] I had a java thread doing 100% I/O [19:20:13] now I need to write a robots.txt :) [19:20:32] *nod* [19:20:43] Disallow: /ci [19:20:46] I guess that is enough [19:21:08] Disallow: / ? :) [19:21:27] na we have our main page https://integration.wikimedia.org [19:25:35] I need a easter egg [19:25:50] what color? [19:27:34] jeremyb_: https://gerrit.wikimedia.org/r/#/c/61053/ :D [19:28:06] also, i thinks easter is over [19:29:24] https://integration.wikimedia.org/robots.txt [19:29:28] hopefully that is good enough for now [19:31:23] paravoid: here now [19:33:41] hashar: robots.txt: > The character encoding of the plain text document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the file needs to be declared in the transfer protocol or file needs to use a byte order mark as an encoding signature. [19:34:02] jeremyb_: ah [19:34:09] binasher: https://ishmael.wikimedia.org/sample/?hours=2&host=db1056&sort=count is empty :( [19:35:19] jeremyb_: I will happily ignore that warning and assume softs assume UTF-8 by default nowadays :-D [19:35:43] hashar: i wish. for some reason iceweasel did iso8859-1 [19:37:55] New patchset: Ottomata; "Updating README and comments to indicate that mod_proxy_http must be enabled." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61056 [19:38:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61056 [19:38:47] LeslieCarr: icinga.pp installs a systemuser icinga, and adds groups membeship for group "nagios" [19:39:04] that group doesn't get added anywhere, nagios-nrpe-server creates it [19:39:17] adding a group { } or a require is trivial, but I'm not sure why we need the icinga user on all boxes? [19:39:56] paravoid: it's going to run the backdoor daemon [19:40:01] duh! [19:40:17] but in reality, i'm not sure - i [19:40:24] i'm pretty sure that is cruft from the old nagios config [19:40:30] and there probably was a reason at some point.... [19:40:43] as long as anything running nagios-nrpe-server has it ... 
[19:41:59] mutante: you can ignore the IIX port ops ticket that's abotu to come in (or assign it to me) [19:53:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [19:54:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [19:55:44] New patchset: Vogone; "(bug 46864) Disabling default user option toc-floated per default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:01:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [20:05:44] New patchset: Legoktm; "(bug 46864) Disabling default user option toc-floated per default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:06:01] New patchset: Legoktm; "(bug 46864) Disabling default user option toc-floated per default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:08:00] New patchset: Vogone; "(bug 46864) Disabling default user option toc-floated per default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:09:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 199 seconds [20:10:02] what is he doing??? [20:10:28] marktraceur: I have killed a node process on gallium [20:10:40] marktraceur: I guess some daemon that has gone standalone when jenkins died. [20:10:58] !log killed a node process on gallium. Probably a daemon that has gone standalone when jenkins died [20:11:05] Logged the message, Master [20:11:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [20:11:54] New patchset: Vogone; "(bug 46864) Disabling default user option toc-floated per default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:14:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 187 seconds [20:17:46] New patchset: Vogone; "(bug 46864) Disabling user option toc-floated by default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:18:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 212 seconds [20:18:52] New patchset: Legoktm; "(bug 46864) Disabling user option toc-floated by default on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [20:20:03] New review: Dzahn; "looks like just a typo in the path to the script indeed. 
per RT-4932 and mail from Terry" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60893 [20:20:12] New patchset: Dzahn; "Fix echo digest email cron script path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60893 [20:22:13] New review: Dzahn; "apparently this should have been merged before gerrit 52745" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60893 [20:22:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60893 [20:22:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [20:23:29] bsitu: lwelling: command => "/usr/local/bin/foreachwikiindblist /usr/local/apache/common/echowikis.dblist extensions/Echo/maintenance/processEchoEmailBatch.php" done [20:23:48] that simply added the "maintenance" part in the path [20:25:50] mutante: thanks [20:26:49] yw [20:29:22] LeslieCarr: so, this icinga thing is breaking initial puppet runs on new installs; are you going to take a stab it or should I just hit it with a hammer? :) [20:30:27] can you "hulk smash" it? [20:31:55] thanks mutante [20:42:06] hrm, looks like eqiad varnishes aren't purging or something ? though esams is [20:47:15] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [20:47:15] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:47:16] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [20:53:44] ottomata: ana1010,1012,1020 have a dpkg issue, saw in monitoring. "dpkg was interrupted". want me to attempt fixing it? [20:54:13] hmm [20:54:16] lemme check [20:54:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 199 seconds [20:54:50] apt-get -s upgrade [20:54:50] E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the problem. [20:55:38] would do just that [20:55:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [20:57:55] broke when it tried to purge nscd a couple days ago [20:57:57] Start-Date: 2013-04-22 15:55:56 [20:57:57] Commandline: apt-get --yes purge nscd [20:57:57] Purge: nscd:amd64 () [20:57:57] Error: Sub-process /usr/bin/dpkg exited unexpectedly [20:58:22] but that seems to be all [20:58:59] ayyee cool, am fixing, i would need to fix anyway, cause I need that purge to happen [20:59:09] cool, thanks [20:59:12] that is removing leftover ldap confs that we don't want on the cluster, they were conflicting with puppet user managment [20:59:19] i removed it everywhere, looks like those 3 failed [20:59:23] aye, that was my guess [20:59:25] RECOVERY - DPKG on analytics1012 is OK: All packages OK [20:59:31] thanks for fixing [21:00:35] RECOVERY - DPKG on analytics1010 is OK: All packages OK [21:01:30] yup! thanks for notice! [21:01:36] RECOVERY - DPKG on analytics1020 is OK: All packages OK [21:05:38] New patchset: Ottomata; "Fixing NARA filter on emery." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61067 [21:05:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61067 [21:09:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 198 seconds [21:12:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [21:15:26] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:27:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:16] New patchset: Andrew Bogott; "Add robots.txt and a privacy policy to mediawiki_singlenode." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61069 [21:28:30] gerrit died :-( [21:28:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [21:31:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:31:49] New review: Odder; "I was thinking it about doing it a bit differently:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61058 [21:32:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [21:33:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [21:36:18] binasher: Nice work on the mention in the O'Reilly Programming Newsletter [21:36:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [21:37:11] * preilly e.g., "Maria, I've Just Run Queries on Maria — Oracle and MariaDB may not be the Jets and the Sharks, but there's drama aplenty as they fight it out for open source database supremacy. MariaDB has just won the heart of the Wikimedia Foundation, who are moving Wikipedia from MySQL to MariaDB." [21:41:38] New patchset: Dzahn; "add secure.wikimedia.org (old SSL site) redirects to cluster." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/60934 [21:42:36] New review: Dzahn; "Faidon's comments reflected in PS2" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/60934 [21:43:16] wth @ jenkins output [21:43:23] New review: awjrichards; "(1 comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/60885 [21:43:38] wikimedia-ssl-backend.conf: No such file or directory .. yea, but thats what i would want to delete [21:44:00] oh, of course, needs to go from all.conf as well,, nv [21:44:17] turns that into "thanks jenkins":) [21:45:10] New patchset: Dzahn; "add secure.wikimedia.org (old SSL site) redirects to cluster." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/60934 [21:53:40] New review: Dzahn; "waiting period is over and no concerns have been raised. 
welcome" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60457 [21:53:46] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds [21:54:02] New patchset: Dzahn; "add account and key for ebernhardson and include on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60457 [21:55:29] New review: Dzahn; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60457 [21:58:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60457 [21:58:46] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds [22:04:59] New patchset: Ori.livneh; "Add 'tcpircbot' Puppet class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61078 [22:06:05] ori-l: you're coding faster than I can review [22:06:12] Change abandoned: Ori.livneh; "Did it in operations/puppet instead I6d4f661b7" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60972 [22:06:21] paravoid: that one is for tim to worry about [22:06:26] :) [22:07:06] * ori-l runs [22:10:11] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [22:10:12] New patchset: Faidon; "Ceph: fix more radosgw order of dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61079 [22:11:03] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61079 [22:13:21] RECOVERY - HTTP on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.002 second response time [22:32:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [22:39:11] PROBLEM - Puppet freshness on vanadium is CRITICAL: No successful Puppet run in the last 10 hours [22:43:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds [22:44:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [22:48:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [22:50:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [22:51:57] Ryan_Lane: maybe do a Cask run at 4:30? [22:52:02] sure [22:52:07] * AaronSchulz has a meeting in 8 [22:52:15] * Ryan_Lane nods [22:52:22] * AaronSchulz is in a rummy mood [22:52:43] I'm more in the mood for spades, or cards against humanity [22:55:52] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:56:42] RECOVERY - Disk space on mc15 is OK: DISK OK [22:57:43] binasher: do you know what's up with ishmael? [23:03:19] grr [23:06:39] AaronSchulz: the problem was that db1001 which is replacing db9 was read-only.. 
i thought notpeter said it was ready to go to move apps over to it [23:06:54] fixed now, but you'll have to wait for new data [23:24:28] New patchset: Asher; "sitting new m1 shard primary site to mw_primary (eqiad)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61150 [23:29:21] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61150 [23:41:13] New patchset: Faidon; "Ceph: more radosgw changes, drop apache::mod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61153 [23:41:13] New patchset: Faidon; "Ceph: add meaningful frontend/backend nagios checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61154 [23:42:52] New review: Faidon; "I wouldn't say it looks good but…" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/61153 [23:42:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61153 [23:43:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61154 [23:44:14] New review: Dzahn; "deploying this now that the ./ should be mostly over" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60426 [23:44:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60426 [23:50:50] New review: Dzahn; "before: Verify return code: 21 (unable to verify the first certificate)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60426 [23:52:29] RECOVERY - HTTP on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [23:52:56] !log make SSL on the Blog use SSLCACertificatePath and c_rehash certs on kaulen [23:53:04] Logged the message, Master [23:53:27] s/kaulen/marmontel.damn [23:55:18] PROBLEM - Host ms-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [23:56:19] RECOVERY - DPKG on ms-fe1003 is OK: All packages OK [23:56:29] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [23:56:29] RECOVERY - RAID on ms-fe1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:56:38] RECOVERY - Disk space on ms-fe1003 is OK: DISK OK [23:56:39] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [23:57:29] RECOVERY - Disk space on ms-fe1004 is OK: DISK OK [23:57:29] RECOVERY - DPKG on ms-fe1004 is OK: All packages OK [23:57:29] !log upgrading mysql packages on marmontel (blog) [23:57:37] Logged the message, Master [23:57:39] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms [23:59:18] PROBLEM - HTTP on ms-fe1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 758 bytes in 0.001 second response time
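On the SSLCACertificatePath !log entry above: that mod_ssl directive points Apache at a directory of CA certificates, and the directory must contain the hash-named symlinks that OpenSSL's c_rehash generates, which is why the two steps go together. A hedged puppet sketch of the pattern follows; the directory and file paths are invented for illustration, and per the log the actual change on marmontel/kaulen was done by hand rather than puppetized.

```puppet
# Sketch: ship a CA/intermediate certificate into the directory that
# SSLCACertificatePath will point at, and rebuild the hash symlinks
# (e.g. 1a2b3c4d.0 -> blog-intermediate.pem) whenever it changes.
file { '/etc/apache2/ssl-ca':
    ensure => directory,
    owner  => 'root',
    group  => 'root',
    mode   => '0755',
}

file { '/etc/apache2/ssl-ca/blog-intermediate.pem':
    ensure => file,
    owner  => 'root',
    group  => 'root',
    mode   => '0444',
    source => 'puppet:///files/ssl/blog-intermediate.pem',
    notify => Exec['rehash blog CA directory'],
}

exec { 'rehash blog CA directory':
    command     => '/usr/bin/c_rehash /etc/apache2/ssl-ca',
    refreshonly => true,
}

# The matching vhost directive would then be:
#   SSLCACertificatePath /etc/apache2/ssl-ca
```

Compared to a single SSLCertificateChainFile, the directory approach makes it easier to add or rotate CA certificates without touching the vhost, at the cost of having to re-run c_rehash after every change, which the refreshonly exec covers here.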