[00:43:00] (PS1) Ebe123: Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112409
[00:49:51] (PS1) SPQRobin: Enable VisualEditor on Wikimedia Incubator [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112410
[01:09:10] (CR) PiRSquared17: [C: 1] Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112409 (owner: Ebe123)
[01:13:55] (PS2) Ebe123: Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112409
[01:40:00] (CR) PiRSquared17: [C: 1] Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112409 (owner: Ebe123)
[02:03:16] (PS1) Springle: depool db1002 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112411
[02:06:04] (CR) Springle: [C: 2] depool db1002 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112411 (owner: Springle)
[02:06:12] (Merged) jenkins-bot: depool db1002 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112411 (owner: Springle)
[02:07:15] !log springle synchronized wmf-config/db-eqiad.php 's2 depool db1002 schema changes'
[02:07:24] Logged the message, Master
[02:08:15] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-10 02:08:14+00:00
[02:08:23] Logged the message, Master
[02:16:00] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-10 02:15:59+00:00
[02:16:06] Logged the message, Master
[02:31:09] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-10 02:31:08+00:00
[02:31:17] Logged the message, Master
[02:31:31] (PS1) Springle: repool db1002, depool db1060 schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112412
[02:32:11] (CR) Springle: [C: 2] repool db1002, depool db1060 schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112412 (owner: Springle)
[02:32:20] (Merged) jenkins-bot: repool db1002, depool db1060 schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112412 (owner: Springle)
[02:33:10] !log springle synchronized wmf-config/db-eqiad.php 's2 repool db1002, depool db1060 schema changes'
[02:33:17] Logged the message, Master
[02:43:07] (PS1) Springle: repool db1060 warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112414
[02:43:32] (CR) Springle: [C: 2] repool db1060 warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112414 (owner: Springle)
[02:43:40] (Merged) jenkins-bot: repool db1060 warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112414 (owner: Springle)
[02:44:32] !log springle synchronized wmf-config/db-eqiad.php 's2 repool db1060 warm up'
[02:44:41] Logged the message, Master
[03:00:41] (PS1) Springle: prepare for s2 master rotation db1036 to db1024 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112415
[03:01:45] (CR) Springle: [C: 2] prepare for s2 master rotation db1036 to db1024 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112415 (owner: Springle)
[03:01:53] (Merged) jenkins-bot: prepare for s2 master rotation db1036 to db1024 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112415 (owner: Springle)
[03:03:02] !log springle synchronized wmf-config/db-eqiad.php 'prepare for s2 master rotation db1036 to db1024 (eqiad)'
[03:03:09] Logged the message, Master
[03:03:43] !log springle synchronized wmf-config/db-pmtpa.php 'prepare for s2 master rotation db1036 to db1024 (pmtpa)'
[03:03:50] Logged the message, Master
[03:11:39] PROBLEM - mysqld processes on db1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:14:40] (PS1) Springle: s2 switch master to db1024 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112416
[03:15:08] (CR) Springle: [C: 2] s2 switch master to db1024 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112416 (owner: Springle)
[03:15:16] (Merged) jenkins-bot: s2 switch master to db1024 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112416 (owner: Springle)
[03:16:11] !log springle synchronized wmf-config/db-eqiad.php 's2 switch master to db1023 (eqiad)'
[03:16:19] Logged the message, Master
[03:16:49] !log springle synchronized wmf-config/db-pmtpa.php 's2 switch master to db1023 (pmtpa)'
[03:16:56] Logged the message, Master
[03:22:35] (PS1) Springle: update dns for s2 master switch to db1024 [operations/dns] - https://gerrit.wikimedia.org/r/112417
[03:23:12] (CR) Springle: [C: 2] update dns for s2 master switch to db1024 [operations/dns] - https://gerrit.wikimedia.org/r/112417 (owner: Springle)
[03:34:42] (PS1) Springle: update coredb topology for s2 master switch [operations/puppet] - https://gerrit.wikimedia.org/r/112418
[03:35:16] (CR) jenkins-bot: [V: -1] update coredb topology for s2 master switch [operations/puppet] - https://gerrit.wikimedia.org/r/112418 (owner: Springle)
[03:40:39] !log springle synchronized wmf-config/db-eqiad.php 's2 switch master to db1023 (eqiad)'
[03:40:46] Logged the message, Master
[03:41:14] !log springle synchronized wmf-config/db-pmtpa.php 's2 switch master to db1023 (eqiad)'
[03:41:19] * springle sigh
[03:41:22] Logged the message, Master
[04:04:31] (CR) Swalling: [C: -1] "See my comments." (2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx)
[04:08:57] (PS2) Springle: update coredb topology for s2 master switch [operations/puppet] - https://gerrit.wikimedia.org/r/112418
[04:10:49] (CR) Springle: [C: 2 V: 2] update coredb topology for s2 master switch [operations/puppet] - https://gerrit.wikimedia.org/r/112418 (owner: Springle)
[04:37:50] mark, ping me when you're up and/or have a minute? I have yet another networking puzzle
[05:33:09] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:59] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[06:26:52] springle, still working? I'm stumped by something that an op with even the most modest apache skills can probably sort out in 5 minutes :(
[06:28:01] andrewbogott: what's up?
[06:28:27] there ought to be a wiki hosted by virt1000, but 'Firefox can't establish a connection to the server at virt1000.wikimedia.org'
[06:28:35] I can access services on other ports on that box...
[06:28:44] And sites-enabled &c look right to me.
[06:28:46] See anything obvious?
[06:29:19] (At some point Ryan intentionally disabled the site, so it's possible there's something actively blocking http(s). Although I /think/ I've already fixed the thing that he changed.)
[06:29:43] I disabled it, yeah
[06:29:46] is ferm allowing it?
[06:30:00] is apache running and configured
[06:30:13] most importantly, is mediawiki up to date there? :)
[06:30:15] Ryan_Lane: when you disabled, what did you do? Just break the link to the wiki?
[06:30:27] I removed the symlink that apache uses
[06:30:34] Ah, yep -- I replaced that.
[06:30:37] but you should still be able to get to apache
[06:30:38] And apache is running
[06:30:45] what ip is it running on?
[06:31:05] why are you bringing up mediawiki there?
[06:31:14] I think virt1000 is 208.80.154.18
[06:31:31] …in anticipation of wikitech moving there.
[06:31:39] oh, we're going to do that first?
[06:31:41] And, to test OSM and such.
[06:31:52] I'd think that would be a relatively late step
[06:32:00] It is.
[06:32:09] I'm just blocked on other tasks, was going to give OSM + havana a whirl.
[06:32:10] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[06:32:24] you may want to consider adding a feature to OSM to let you test other regions, but limit that to a set of users
[06:32:33] sure.
[06:32:43] but I guess it's good to get some initial testing done :)
[06:33:49] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.34 ms
[06:36:03] I don't know anything about ferm, is there a way to query what ports are open? Or should I just dig in puppet?
[06:45:14] andrewbogott: port 80 and 443 are closed according to nmap. mite virt1001 need something like webserver::apache in puppet?
[06:45:19] * springle guessing a bit
[06:45:37] springle: are we talking about 1000 or 1001?
[06:45:41] er virt1000
[06:45:43] sorry
[06:46:06] It definitely includes a webserver class, but...
[06:46:16] well, I'll look around, not sure what would block (or unblock) the port.
[06:51:46] oh, springle, you may be right -- looks like the class that includes the webserver is missing.
[06:51:54] Must've been there, and then removed… I'll replace it.
[06:53:01] (PS1) Andrew Bogott: Added role::nova::manager to virt1000, so I can debug OSM. [operations/puppet] - https://gerrit.wikimedia.org/r/112420
[06:55:51] (CR) Andrew Bogott: [C: 2] Added role::nova::manager to virt1000, so I can debug OSM. [operations/puppet] - https://gerrit.wikimedia.org/r/112420 (owner: Andrew Bogott)
[07:02:19] hm… necessary but not sufficient :(
[07:05:40] PROBLEM - HTTP on virt1000 is CRITICAL: Connection refused
[07:05:58] well, that's not an improvement :(
[07:06:10] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:20] PROBLEM - MySQL InnoDB on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:07:00] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[07:07:10] RECOVERY - MySQL InnoDB on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[07:08:43] andrewbogott: ferm just updates iptables, so if there is a rule to block some port, it should probably be visible in "iptables -L".
[07:10:59] andrewbogott: If Apache is running, there should be an entry in "netstat -at | less" with ":http" or ":80".
[07:12:02] it looks to me like http is working but not https
[07:13:00] hm, maybe not
[07:14:19] well… I don't see any iptables rules about http on virt0 either, and I know it's working there.
[07:16:02] And is Apache listening on :80 or :443?
[07:17:08] in netstat you mean? Yes on virt0 and no on virt1000.
[07:17:37] but I don't know where to go from there. Apache is clearly running.
[07:18:03] andrewbogott: this means your have the service runnign in ps-ef |grep apache
[07:18:24] yes, and I can see it coming up in the logs.
[07:18:26] but not listening on 80 or 443 with netstat -an ?
[07:18:44] seems so, unless I'm looking at the wrong thing.
[07:19:05] can you telnet to 127.0.0.1 50?
[07:19:09] *80
[07:19:22] nope
[07:19:24] i guess connection refused
[07:19:31] yep
[07:19:49] and in the apache config, is it configured to listen on *:80 ?
[07:20:44] Is that a thing you have to specify? Surely that's the default...
[07:21:02] wouldn't hurt to check
[07:21:08] Anyway, we're talking about /etc/apache2/apache2.conf right?
[07:21:16] yes
[07:21:49] No.
[07:22:03] by 'the default' I mean -- it does it even if nothing is specified in the conf
[07:22:03] missing one conf there
[07:22:09] On WMF, it's /etc/apache2/sites-enabled/*, IIRC.
[07:22:36] (Symlinks to ../sites-available.)
[07:22:43] Oh, yeah, /that/ is definitely there, same as virt0. From puppet.
[07:22:56] 80 is a rewrite to https
[07:26:11] andrewbogott: what is the output of netstat -plnt ?
[07:27:09] https://dpaste.de/J3iO
[07:27:19] Apache config is here: https://git.wikimedia.org/blob/operations%2Fpuppet/8cf6cd3dca1e9cb45bea2c7be26c2ec91cbc2279/templates%2Fapache%2Fsites%2Fwikitech.wikimedia.org.erb
[07:29:28] andrewbogott: apache is listening on ipv6, i see. is it ?
[07:29:37] port 8140
[07:30:06] looks like
[07:30:21] did you config that?
[07:30:47] and what do you get when you try to telnet/netcat to there?
[07:31:07] I didn't hand configure anything about this box. It ought to be puppetized as a web server.
[07:31:18] If I telnet to 8140? Just a hang.
[07:31:48] 8140? that is puppet
[07:31:56] that is what i thought
[07:32:02] wait, what?
[07:32:06] well more like apache passenger configured to serve puppet
[07:32:17] Ah, yes, this is a puppet master as well.
[07:32:20] morning btw
[07:32:21] :-)
[07:32:25] 'morning!
[07:32:34] yeah, hi akosiaris :)
[07:32:53] intersting thing here andrew
[07:32:56] akosiaris: my dumb question of the morning is "Why doesn't virt1000 serve https?"
[07:33:45] matanya: yes?
[07:34:07] andrewbogott: it is
[07:34:18] on port 8140
[07:34:18] true.
[07:34:20] you want it on 443 ?
[07:34:23] you say apache is running, but doesn't open any connection on port 80/443
[07:34:26] and some other virtual host ?
[07:34:28] Yes, it should be a wikitech mirror.
[07:34:32] then ports.conf :-)
[07:34:43] heh... that is going to be difficult
[07:34:58] difficult why?
[07:35:14] I mean, not a live mirror. Just another box running the same puppet config.
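[Editor's note] The manual check walked through above (telnet to 127.0.0.1 on port 80, then netstat to see which ports Apache is bound to) can be approximated with a short script. This is only a minimal sketch, not something anyone in the channel ran: the function names port_accepts_connections and closed_ports are hypothetical, and the ports come from the discussion (80 and 443 expected for the wiki, 8140 for the puppet master).

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    A refused connection (nothing listening) or a timeout returns False,
    which is roughly what 'telnet host port' reports by hand.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host, ports, timeout=2.0):
    """Return the subset of ports on host that do not accept connections."""
    return [p for p in ports if not port_accepts_connections(host, p, timeout)]


# Example (not executed here): on a host where Apache only binds 8140,
# closed_ports("127.0.0.1", [80, 443, 8140]) would be expected to
# report 80 and 443 as closed.
```

A successful connect only shows that something is listening; it does not say which process owns the port, which is why the netstat -plnt output was still needed in the conversation.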
[07:35:15] cause ports.conf seems to be managed by puppetmaster class on this machine
[07:35:29] Oh. Well… it works on virt0, lemme see what's happening there.
[07:35:37] well you are not confined to ports.conf now that I think about it
[07:35:46] the directives can be included in any file
[07:35:51] yeah, virt0 is the same, just 8140
[07:36:00] but you are not going to make any friends that way
[07:36:14] wait.. isn't wikitech on virt0 ?
[07:36:23] yes
[07:36:39] So, virt1000 is in eqiad, it will be the wikitech host when we turn off pmtpa
[07:37:02] It was, until somewhat recently, running a verison of wikitech. I don't know why it's stopped working.
[07:37:10] conf.d/ports-wikitech.conf
[07:37:13] on virt0
[07:37:16] that would explain it
[07:37:33] which I suppose is not puppetized...
[07:37:41] i think you suppose correctly
[07:37:50] * andrewbogott sighs
[07:38:05] heh... try logging into mchenry :P
[07:38:17] I did that on Friday
[07:38:20] never again!
[07:38:24] akosiaris: any list of not puppertized stuff?
[07:38:32] matanya: I 'd wish
[07:38:59] anyway, email is a huge project of paravoid. If he wants help he 'll let us know
[07:39:33] akosiaris: that was totally it, on virt1000. Thank you!
[07:39:38] * andrewbogott puppetizes quick before he forgets
[07:39:40] RECOVERY - HTTP on virt1000 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.011 second response time
[07:39:42] as well as admins.pp
[07:39:45] woo!
[07:39:58] andrewbogott: thank you too for puppetizing that :-)
[07:41:21] it is much easier to dubug with shell access :)
[07:46:07] matanya: yeah, true.
[07:46:08] (PS1) Andrew Bogott: puppetize ports-wikitech.conf on wikitech hosts. [operations/puppet] - https://gerrit.wikimedia.org/r/112421
[07:46:19] In my case not easy enough though.
[07:46:22] akosiaris, ^
[07:46:45] oops, doublequotes!
[07:47:17] and no single quotes
[07:47:20] andrewbogott: require package php5 ?
[07:47:44] akosiaris: copying the pattern from above, for sites-available.
[07:47:57] I guess I'm trusting that whoever wrote that knew what they were doing...
[07:48:00] this file needs love
[07:48:20] Then please require that file
[07:48:22] please!!!
[07:48:36] hm? require which file?
[07:48:47] the sites-available
[07:48:55] matanya: it does, but it's also subject to rapid development so not a great candidate for linting this week.
[07:49:00] it is making semantically sense
[07:49:17] you need the site for those ports to actually serve something
[07:49:23] hm, true.
[07:49:29] and it is better than php5
[07:49:41] andrewbogott: not lint, modulrize
[07:49:53] yes, needs both!
[07:49:58] true
[07:50:27] yeah, however finishing up with whatever development first might be prudent
[07:50:29] (PS2) Andrew Bogott: puppetize ports-wikitech.conf on wikitech hosts. [operations/puppet] - https://gerrit.wikimedia.org/r/112421
[07:50:30] matanya: better quotemarks now?
[07:50:54] yes, better fix now, than million fixes later
[07:50:56] andrewbogott: debug OSM ?
[07:51:09] OSM = "OpenStackManager" in this case.
[07:51:12] aaaah
[07:51:19] not open street map
[07:51:28] thanks. I was flumoxed for a moment
[07:51:33] Bah, we were here first, should have dibs on the acronym!
[07:51:41] But, I'm going to have to start spelling it out I guess.
[07:52:10] ahahaha
[07:52:15] (CR) Ori.livneh: [C: 2] logstash: Add normalized_message field to all events [operations/puppet] - https://gerrit.wikimedia.org/r/112149 (owner: BryanDavis)
[07:52:22] akosiaris: i will need your help in fixing your -1's on my patches
[07:52:35] Anyway… what with that patch going on a production box, can I get a +2?
[07:53:36] matanya: aahh i was thinking yesterday about the ganglia/nagios one. I am thinking the best way forward would be to first ask if anyone uses that notes_url thing and kill it if no-one does. I sure have not in all these months
[07:53:41] andrewbogott: why is that an erb file, anyway?
[07:54:05] ori: Because there's not a good organizational place for it otherwise.
[07:54:10] akosiaris: unfortunately I don't have that much time today. I need to finish up on some openstreetmap stuff
[07:54:14] files andrewbogott
[07:54:20] Felt better sticking it next to its cohorts than making a new diretctory with just that one file in it.
[07:54:30] andrewbogott: that's silly; just make a directory
[07:54:38] matanya: that is..^
[07:54:41] * andrewbogott scowls
[07:54:46] i wonder why i am speaking to myself...
[07:55:11] yeah, ok akosiaris thanks. one day we might find time for stuff :)
[07:55:42] andrewbogott: this point is actully valid, erb takes more time to compile etc
[07:55:56] (CR) Alexandros Kosiaris: [C: 2] puppetize ports-wikitech.conf on wikitech hosts. [operations/puppet] - https://gerrit.wikimedia.org/r/112421 (owner: Andrew Bogott)
[07:55:56] :)
[07:56:25] oh... i am not on RT duty anymore :-)
[07:57:35] no you are not
[07:58:09] if ori had not changed the subject I would have completely forgotten it until someone asked me
[07:59:12] andrewbogott: you know,
[07:59:27] you could also just do content => "Listen 80\nListen 443"
[07:59:48] please avoid that pattern
[08:00:15] but not erb files that contain no erb?
[08:00:30] not that it does not work, but there are already two ways to populate files, let's not add a third one
[08:01:07] ori: you are right there but that thing needs love anyway and I am pretty sure that one andrewbogott is done, matanya is going to really really show his appreciation for it.
[08:01:08] four! inline_template('<%= @ports.map("Listen #{port}").join("\n") %>')
[08:01:30] and we haven't even gotten to custom puppet ruby functions! :P
[08:01:31] So it doesn't really cause harm right now
[08:01:39] ori: true true :-)
[08:01:50] ori: what are your feelings on stevedore and cliffy?
[08:02:14] are those pokemon?
[08:02:18] :D
[08:02:20] i don't know either
[08:02:21] python libraries
[08:02:25] * ori googles
[08:02:41] also that map thing was wrong above, but nevermind
[08:02:43] stevedore is meant to be a consistent approach to the driver/hook/extension model
[08:03:17] (PS1) Andrew Bogott: Move ports-wikitech.conf out of templates and into files. [operations/puppet] - https://gerrit.wikimedia.org/r/112422
[08:03:19] oh, that's not bad, i could've used that
[08:03:26] * ori thanks andrewbogott
[08:03:29] cliffy is meant to be a sane way to handle cli actions, subactions, sub sub actions, etc. with drivers and extensions and such
[08:03:45] would've been a lot easier if I hadn't tried to amend an already-merged patch :(
[08:03:48] twisted provides both, but it only makes sense to use it if you're writing a twisted network daemon
[08:03:59] yep
[08:04:38] akosiaris: i count 4 already, source (as puppet file server) content (as templates) content (as "some text") and inline functions
[08:04:47] I'm considering them for trigger
[08:04:57] so that I don't use a custom driver/extension/hook model
[08:05:05] i went with https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L39
[08:05:09] matanya: yeah ori mentioned the fourth one as well.
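[Editor's note] The idea behind the content => "Listen 80\nListen 443" and inline_template suggestions above is simply one Apache Listen directive per port. A minimal sketch of that rendering, written in Python rather than Puppet or ERB, is below. The function name render_listen_directives is hypothetical, and the actual contents of ports-wikitech.conf are not shown in the log, so the two ports are taken from the quoted suggestion only.

```python
def render_listen_directives(ports):
    """Return Apache 'Listen' lines, one per port, each newline-terminated.

    int() rejects anything that is not a plain port number, so a stray
    string cannot end up in the generated config fragment.
    """
    return "".join(f"Listen {int(port)}\n" for port in ports)


if __name__ == "__main__":
    # Prints the two-line fragment suggested in the channel.
    print(render_listen_directives([80, 443]), end="")
```

Whether such a fragment is better shipped as a static file, a template, or an inline string is exactly the style question argued over in the channel; the sketch takes no position on it.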
[08:05:20] yep
[08:05:30] oh, skipped that
[08:05:34] but it looks like stevedore goes a lot farther
[08:05:38] I'm doing something relatively similar
[08:05:39] dunno, looks neat at a glance
[08:06:00] the names suck btw
[08:06:09] I like that it's somewhat standard
[08:06:20] they remind me of the two old guys at the puppets
[08:06:21] it would make adapting other people's code easier, when necessary
[08:06:26] akosiaris: haha
[08:06:39] and makes it easier for people to understand a codebase
[08:06:49] akosiaris: :D
[08:06:52] akosiaris: that's tim and domas usually
[08:06:58] hahahaha
[08:07:02] hahahahahahaha
[08:07:35] well, I'll investigate them for trigger to see if it's worthwhile
[08:07:54] I'm not a huge fan of how I'm handling action loading (and I know you aren't either ;) )
[08:09:09] let me know if they're good libs, i'd be interested
[08:09:13] * Ryan_Lane nods
[08:09:14] will do
[08:22:20] PROBLEM - MySQL InnoDB on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:10] RECOVERY - MySQL InnoDB on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[08:28:12] good morning
[08:29:21] (PS1) Matanya: webserver: fixing duplicate declaration of apache-mpm [operations/puppet] - https://gerrit.wikimedia.org/r/112423
[08:29:32] akosiaris: this one is for you ^
[08:29:36] hi hashar
[08:33:51] (PS2) Matanya: webserver: fixing duplicate declaration of apache-mpm [operations/puppet] - https://gerrit.wikimedia.org/r/112423
[08:51:34] (PS1) ArielGlenn: module for releases webserver (mobile and mw tarballs) [operations/puppet] - https://gerrit.wikimedia.org/r/112424
[08:57:01] (CR) Matanya: [C: -1] "I don't like the layout if this module. I think it would be better if you move the system role and the monitoring into a role inside manif" [operations/puppet] - https://gerrit.wikimedia.org/r/112424 (owner: ArielGlenn)
[08:58:17] * yuvipanda pokes hashar with https://gerrit.wikimedia.org/r/#/c/111765/ :)
[09:02:09] matanya: what do you want to see parameterized in the webserver class?
[09:02:37] apergos: docroot, server_admin
[09:02:42] ok thanks
[09:02:43] the site name
[09:02:44] yuvipanda: follow up with maxsem :-]
[09:02:56] yuvipanda: anyone can review / +2 that change to get it deployed on beta.
[09:02:57] ?
[09:03:00] hashar: you are okay with him +2ing when he's around?
[09:03:01] ok
[09:03:02] MaxSem: https://gerrit.wikimedia.org/r/#/c/111765/
[09:03:18] MaxSem: that repo has a lot more merges now :)
[09:04:25] * MaxSem reads code
[09:04:43] yuvipanda: yes :-]
[09:04:47] WTF http://git.wikimedia.org/blob/mediawiki%2Fextensions%2FPopups.git/7cadfed0be8e0311b645eac63b0ff370a7e9e4ea/Popups.hooks.php
[09:04:49] hashar: ok :)
[09:05:11] MaxSem: ?
[09:05:17] yuvipanda: I am merely responsible for maintaining the beta infrastructure which is really a service to the developers to stage code before it lands in production.
[09:05:19] you're mixing WMF conf stuff with extension conf
[09:05:19] there's a patch adding a betafeatures check to that there.
[09:05:31] MaxSem: what do you mean?
[09:05:37] $wgEnablePopups is unnecessary, misleading, evil
[09:05:45] MaxSem: what. that's just a feature flag
[09:06:07] it's not
[09:06:11] explain
[09:06:13] how it is not?
[09:06:24] if that is set to false, then absolutely nothing happens.
[09:06:32] feature flag allows to disable a part of extension's functionality
[09:07:11] WMF config enables/disables extensions explicitly
[09:07:43] MaxSem: how does that have anything to do with enabling or disabling it on betalabs?
[09:09:00] (CR) Matanya: "proposed fix in https://gerrit.wikimedia.org/r/#/c/112423/" [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya)
[09:11:06] MaxSem: ?
[09:11:08] this extension's present state is a good indication that it wasn't reviewed by anyone who should've reviewed it
[09:11:18] meh
[09:11:31] do apply to revoke all the reviewers' +2s
[09:13:57] MaxSem: do you have a list of people who you considered 'people who should have reviewed it', since apparently that is different from the general +2 list?
[09:14:11] Chris Steipp?
[09:14:50] for *betalabs*?
[09:14:53] meh. fine.
[09:14:57] * yuvipanda goes to jump through another hoop
[09:20:54] yuvipanda, https://gerrit.wikimedia.org/r/112430
[09:21:32] MaxSem: looking
[09:23:07] MaxSem: merged.
[09:24:33] (CR) MaxSem: [C: 2] Deploy Extension:Popups on betalabs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111765 (owner: Yuvipanda)
[09:24:42] (Merged) jenkins-bot: Deploy Extension:Popups on betalabs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111765 (owner: Yuvipanda)
[09:24:47] MaxSem: :) thanks!
[09:25:34] MaxSem: I've poked the people who are responsible for this extension (Design team), they should poke chris earlier today or somesuch.
[09:25:46] should definitely go through one before getting deployed deployed, I guess
[09:26:48] how much code growth will there be before it's ready for WMF?
[09:28:00] MaxSem: I think there are 3 patches under review, and that's about it.
[09:28:08] hmm
[09:28:16] MaxSem: hmm, 4 actually. 3 CSS changes and a betafeatures hook
[09:28:32] eek, so no more JS?
[09:28:58] MaxSem: might be applying a class here or there
[09:29:26] my main fear was not current JS but that it's in active development and after we deploy something small and safe it will quickly turn into a behemoth
[09:30:04] MaxSem: that's true for everything, no? At least this is much better than the current gadget used
[09:30:09] (much smaller too, but eh)
[09:30:12] heh
[09:30:52] MaxSem: I would've liked it to be in VectorBeta, but then people pointed out how this works across skins...
[09:36:08] (PS2) Hashar: Tools: Install package supybot [operations/puppet] - https://gerrit.wikimedia.org/r/112202 (owner: Tim Landscheidt)
[09:36:12] (PS3) Tim Landscheidt: Tools: Install package supybot [operations/puppet] - https://gerrit.wikimedia.org/r/112202
[09:36:17] (CR) Hashar: [C: 1] Tools: Install package supybot [operations/puppet] - https://gerrit.wikimedia.org/r/112202 (owner: Tim Landscheidt)
[09:46:31] morning
[09:48:35] hi paravoid
[09:48:47] (PS2) ArielGlenn: module for releases webserver (mobile and mw tarballs) [operations/puppet] - https://gerrit.wikimedia.org/r/112424
[09:54:19] (CR) Matanya: [C: 1] module for releases webserver (mobile and mw tarballs) [operations/puppet] - https://gerrit.wikimedia.org/r/112424 (owner: ArielGlenn)
[10:01:32] (CR) Faidon Liambotis: [C: -1] "Please don't do access controls inside the module. It's hard to fix admins.pp enough as it is, having to fix it across the board is going " [operations/puppet] - https://gerrit.wikimedia.org/r/112424 (owner: ArielGlenn)
[10:03:26] paravoid: where would you want me to move them, to the role::releases class? and do you mean the groups as well as the accounts?
[10:05:55] good point paravoid i think you should just call accounts from the current admins.pp
[10:08:05] apergos: yeah, I guess role class or site.pp
[10:08:08] apergos: https://gerrit.wikimedia.org/r/#/c/107848/ btw
[10:10:28] paravoid: you should get a raise when this is mreged
[10:10:37] *merged
[10:10:40] lol
[10:16:51] (CR) Faidon Liambotis: [C: -1] Torrus: add turros to netmon1001 (6 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/108314 (owner: Matanya)
[10:25:00] .names
[10:25:06] ha!
[10:27:33] (CR) Matanya: Torrus: add turros to netmon1001 (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/108314 (owner: Matanya)
[10:28:19] (PS5) Matanya: Torrus: add torrus to netmon1001 [operations/puppet] - https://gerrit.wikimedia.org/r/108314
[10:31:51] (CR) Hashar: "Thank you!" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111917 (owner: BryanDavis)
[10:31:58] paravoid: are you looking for a review on the admins module, are it is not ready yet for that?
[10:32:39] (PS3) ArielGlenn: module for releases webserver (mobile and mw tarballs) [operations/puppet] - https://gerrit.wikimedia.org/r/112424
[10:32:40] (hint, i think you should use other approach )
[10:34:14] I'm looking for a review of the concept, so far
[10:34:17] ideas are welcome
[10:34:35] apergos: there's an icinga alarm for snapshot1003's disk space
[10:37:01] ok, I'll have a look, thanks
[10:41:03] (PS3) Hashar: retab realm.pp [operations/puppet] - https://gerrit.wikimedia.org/r/104808
[10:41:10] (PS3) Hashar: realm.pp puppet lint fixes [operations/puppet] - https://gerrit.wikimedia.org/r/104809
[10:41:24] (CR) Matanya: "I think a better way to handle the keys would be to use puppetdb with puppet query, (https://forge.puppetlabs.com/dalen/puppetdbquery). I " [operations/puppet] - https://gerrit.wikimedia.org/r/107848 (owner: Faidon Liambotis)
[10:44:38] no, puppetdb is not suitable for this purpose, I have security concerns
[10:44:54] ok
[10:45:00] keys should be in git, code-reviewable and authenticated
[10:45:26] puppetdb, anyone can run a query
[10:47:02] could I get three changes in for contint please ? https://gerrit.wikimedia.org/r/#/q/status:open+topic:contint,n,z :D
[10:52:37] (Abandoned) Matanya: realm: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/109074 (owner: Matanya)
[11:03:40] (PS3) Hashar: contint: browsers for testing + xvfb for headless [operations/puppet] - https://gerrit.wikimedia.org/r/111209
[11:06:16] (CR) ArielGlenn: [C: 2] contint: browsers for testing + xvfb for headless [operations/puppet] - https://gerrit.wikimedia.org/r/111209 (owner: Hashar)
[11:06:59] (PS3) Hashar: contint: slave-scripts are deployed via git-deploy [operations/puppet] - https://gerrit.wikimedia.org/r/111446
[11:08:33] (CR) Phuedx: [C: 1] add shell account for phuedx and add to mortals [operations/puppet] - https://gerrit.wikimedia.org/r/112150 (owner: Dzahn)
[11:08:48] (CR) ArielGlenn: [C: 2] contint: slave-scripts are deployed via git-deploy [operations/puppet] - https://gerrit.wikimedia.org/r/111446 (owner: Hashar)
[11:09:28] (PS3) Hashar: contint: fix slave-scripts deployment on labs [operations/puppet] - https://gerrit.wikimedia.org/r/111447
[11:09:37] (PS8) Dzahn: turn planet into a module [operations/puppet] - https://gerrit.wikimedia.org/r/108674
[11:10:19] apergos: thanks :)
[11:10:34] one left
[11:12:08] (CR) ArielGlenn: [C: 2] contint: fix slave-scripts deployment on labs [operations/puppet] - https://gerrit.wikimedia.org/r/111447 (owner: Hashar)
[11:13:25] (PS4) ArielGlenn: module for releases webserver (mobile and mw tarballs) [operations/puppet] - https://gerrit.wikimedia.org/r/112424
[11:18:20] (CR) ArielGlenn: [C: 2] module for releases webserver (mobile and mw tarballs) [operations/puppet] - https://gerrit.wikimedia.org/r/112424 (owner: ArielGlenn)
[11:33:10] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:34:20] PROBLEM - MySQL InnoDB on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:37:00] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[11:37:10] RECOVERY - MySQL InnoDB on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[11:39:55] (PS1) ArielGlenn: caesium as releases webserver (mw tarballs etc) [operations/puppet] - https://gerrit.wikimedia.org/r/112433
[11:42:54] (CR) ArielGlenn: [C: 2] caesium as releases webserver (mw tarballs etc) [operations/puppet] - https://gerrit.wikimedia.org/r/112433 (owner: ArielGlenn)
[11:45:06] (PS2) Andrew Bogott: Move ports-wikitech.conf out of templates and into files. [operations/puppet] - https://gerrit.wikimedia.org/r/112422
[11:45:26] apergos: can it be we a hav a broken table?
[11:45:29] *e
[11:46:05] db1016?
[11:46:49] no, there is a user with a right that doesn't exist
[11:47:05] see : https://meta.wikimedia.org/w/index.php?title=Special:GlobalUsers&group=steward
[11:47:21] the groups pathoschild is member of
[11:47:25] (PS3) Andrew Bogott: Move ports-wikitech.conf out of templates and into files. [operations/puppet] - https://gerrit.wikimedia.org/r/112422
[11:47:48] (CR) Alexandros Kosiaris: [C: 2] add shell account for phuedx and add to mortals [operations/puppet] - https://gerrit.wikimedia.org/r/112150 (owner: Dzahn)
[11:47:50] nice :-D
[11:48:21] any hint what the heck is this? all i found was: 21:11, 6 April 2009 Pathoschild (Talk | contribs | block) changed global group membership for User:Pathoschild from steward to steward, Cabal (test old bug)
[11:48:36] pretty sure it's not a broken table
[11:49:11] (CR) Andrew Bogott: [C: 2] Move ports-wikitech.conf out of templates and into files. [operations/puppet] - https://gerrit.wikimedia.org/r/112422 (owner: Andrew Bogott)
[11:49:37] but given that's from 2009, no, no ideas, I'd have to hunt around a bit
[11:50:17] ok, should i open a bug?
[11:51:16] why not ask pathoschild?
[11:51:37] still a steward, might remember what it was all about
[11:51:50] yeah, makes sense
[11:51:54] thanks
[11:52:04] Oh, I remember that thing
[11:52:19] please, fill me in
[11:52:19] (PS1) Andrew Bogott: Switch nova-network vlan from 1118 to 1102. [operations/puppet] - https://gerrit.wikimedia.org/r/112434
[11:52:41] matanya: It was some bug with removing all rights from a global group
[11:52:58] https://meta.wikimedia.org/wiki/MediaWiki:Grouppage-Cabal
[11:52:59] the global group then vanished but still was assigned to people
[11:53:02] something like that
[11:53:59] (PS1) Nikerabbit: Update ULS config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112435
[11:54:28] https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/Cabal
[11:54:30] (CR) Andrew Bogott: [C: 2] Switch nova-network vlan from 1118 to 1102. [operations/puppet] - https://gerrit.wikimedia.org/r/112434 (owner: Andrew Bogott)
[11:54:50] I think it was me who fixed that bug in CentralAuth, but the whole global groups thingy is still a little awry
[11:55:11] or maybe it's not even fixed... at least it's a known one and nothing to worry about :P
[11:55:42] thanks hoo
[11:58:51] matanya: it's the cabal, there is noting to see *flashes MIB memory eraser*
[11:59:08] matanya: If you assign a permission to the group, you should be able to remove it from Pathos and Drini again
[11:59:15] | 271287 | Cabal |
[11:59:31] i will do
[11:59:32] that one is probably a deleted global account
[11:59:43] | 99 | Cabal |
[11:59:45] that one, I meant
[12:00:02] those aren't getting removed on deletion :/
[12:06:41] (CR) JanZerebecki: turn planet into a module (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn)
[12:08:54] (PS9) JanZerebecki: turn planet into a module [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn)
[12:12:36] (CR) JanZerebecki: turn planet into a module (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn)
[12:14:34] (CR) JanZerebecki: [C: -1] "3 questions remaining in PS8. Everything else is fine."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/108674 (owner: 10Dzahn) [12:15:17] (03PS1) 10ArielGlenn: releases.wm.o for new web server for mw/mobile tarballs [operations/dns] - 10https://gerrit.wikimedia.org/r/112437 [12:16:28] (03CR) 10ArielGlenn: [C: 032] releases.wm.o for new web server for mw/mobile tarballs [operations/dns] - 10https://gerrit.wikimedia.org/r/112437 (owner: 10ArielGlenn) [12:20:03] !log Jenkins: deleted /srv/slave-scrips from old jenkins servers, everything should now use /srv/deployment/integration/slave-scripts [12:20:11] Logged the message, Master [12:22:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:55] ^ me [12:24:14] !jenkins mwext-DonationInterface-runtests [12:24:15] https://integration.wikimedia.org/ci/job/mwext-DonationInterface-runtests [12:26:28] (03PS1) 10ArielGlenn: caesium added as backend for varnish misc web cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/112438 [12:27:05] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [12:28:53] !jenkins mwext-DonationInterface-testextensions-master [12:28:53] https://integration.wikimedia.org/ci/job/mwext-DonationInterface-testextensions-master [12:38:35] (03CR) 10ArielGlenn: [C: 032] caesium added as backend for varnish misc web cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/112438 (owner: 10ArielGlenn) [12:39:53] (03PS3) 10Faidon Liambotis: Gzip SVGs on front & back upload varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/108484 (owner: 10Ori.livneh) [12:41:35] PROBLEM - NTP on virt1001 is CRITICAL: NTP CRITICAL: Offset unknown [12:44:06] I just did my sheet in Diederik's spreadsheet [12:44:10] I highly recommend everyone to do it :) [12:44:45] it's just 5-15' of your time, might help manage our RT queue better [12:45:04] apergos / akosiaris [12:46:35] RECOVERY - NTP on virt1001 is OK: NTP OK: Offset 2.241134644e-05 secs [13:00:00] paravoid: ? 
[13:12:57] (03CR) 10Amire80: [C: 031] Update ULS config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112435 (owner: 10Nikerabbit) [13:18:05] (03PS1) 10Alexandros Kosiaris: Removed notes_url from nagios host extra info [operations/puppet] - 10https://gerrit.wikimedia.org/r/112441 [13:19:40] (03CR) 10Matanya: [C: 031] Removed notes_url from nagios host extra info [operations/puppet] - 10https://gerrit.wikimedia.org/r/112441 (owner: 10Alexandros Kosiaris) [13:26:28] so many tickets... [13:28:28] (03CR) 10Alexandros Kosiaris: "This is here mostly to serve as a poll on whether we want this or not and if we do want it, a quick discussion on how to solve this cleanl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112441 (owner: 10Alexandros Kosiaris) [13:59:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:27:56] " paravoid: you should get a raise when this is mreged": Apparently no merge needed :-). Congrats, paravoid! Well deserved. [14:30:12] (03PS1) 10Gilles: Start sampling detailed network performance for Multimedia Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112452 [14:37:10] scfc_de: heh, thanks :) [14:39:03] :D [14:56:14] (03CR) 10ArielGlenn: "It looks like admins::restricted is enough to allow you to run mysql queries from e.g. bast1001; let's do that for a start." [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 (owner: 10Dzahn) [14:57:47] (03CR) 10Hoo man: "I'll talk to Erik about this one again... if there's no way, I'll give up the mortals/deploy access, but it would certainly be great to ha" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 (owner: 10Dzahn) [15:00:33] (03CR) 10Aude: [C: 031] "hoo is already quite helpful with deploys and investigating bugs. I think shell access allows him to be even more helpful." [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 (owner: 10Dzahn) [15:03:04] Coren: are you about? 
Can you join the canonical call? [15:04:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:22:06] (03PS4) 10Hoo man: Add shell account for hoo, admins restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 (owner: 10Dzahn) [15:23:36] (03CR) 10Hoo man: "Changed to admin::restricted for now, after I got used to the cluster, this will be changed to mortals. (~1 month or so)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 (owner: 10Dzahn) [15:28:13] (03PS5) 10Dzahn: Add shell account for hoo, admins restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 [15:28:26] !log reindexing phase 0 wikis after Cirrus deploy last Thursday [15:28:33] Logged the message, Master [15:29:14] (03PS1) 10Hashar: contint: move jenkins-deploy user home to /dev/vdb [operations/puppet] - 10https://gerrit.wikimedia.org/r/112458 [15:29:40] (03CR) 10ArielGlenn: [C: 032] Add shell account for hoo, admins restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/112168 (owner: 10Dzahn) [15:32:15] congrats hoo :) can we now bug you for all DB inconsistency doubts we have? :P [15:32:44] Nemo_bis: Sure, I'm going to kill all the data... empty table are never inconsistent with anything ;) :D [15:33:22] * tables :P [15:33:46] hoo: they let you hit the dbs? nice. [15:36:44] (03PS1) 10Hashar: contint: .gitconfig for jenkins-deploy user [operations/puppet] - 10https://gerrit.wikimedia.org/r/112461 [15:53:06] (03PS1) 10Faidon Liambotis: install-server: add Ubuntu 14.04/Trusty Tahr [operations/puppet] - 10https://gerrit.wikimedia.org/r/112465 [15:53:08] (03PS1) 10Faidon Liambotis: Switch copper to Ubuntu trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112466 [15:53:58] !log reindex went well. performing a links recount so we can push more code changes next week safely. [15:54:04] Logged the message, Master [15:54:49] paravoid: did we get a 14.04 image in labs ? 
[15:54:52] (03CR) 10Faidon Liambotis: [C: 032] install-server: add Ubuntu 14.04/Trusty Tahr [operations/puppet] - 10https://gerrit.wikimedia.org/r/112465 (owner: 10Faidon Liambotis) [15:55:01] (03CR) 10Faidon Liambotis: [C: 032] Switch copper to Ubuntu trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112466 (owner: 10Faidon Liambotis) [15:55:03] ha no [15:55:07] hashar: no [15:55:11] ok ok :) [15:56:29] (03PS2) 10Hashar: contint: move jenkins-deploy user home to /dev/vdb [operations/puppet] - 10https://gerrit.wikimedia.org/r/112458 [15:57:06] \O/ [15:57:58] (03CR) 10ArielGlenn: [C: 032] contint: move jenkins-deploy user home to /dev/vdb [operations/puppet] - 10https://gerrit.wikimedia.org/r/112458 (owner: 10Hashar) [15:58:06] !log Jenkins: migrating labs jenkins-deploy user homedir from /home/jenkins-deploy (GlusterFS) to local directories under /mnt/home/jenkins-deploy to avoid GlusterFS and race conditions between instances. {{bug|61144}} [15:58:11] apergos: thanks :-) [15:58:14] Logged the message, Master [15:58:24] (03PS2) 10Hashar: contint: .gitconfig for jenkins-deploy user [operations/puppet] - 10https://gerrit.wikimedia.org/r/112461 [15:58:35] apergos: and I rebased the other one https://gerrit.wikimedia.org/r/#/c/112461/ :D [15:58:38] /mnt/home? 
[15:58:39] ewwww [15:58:43] me neither [15:58:51] that is for contint guys :D [15:58:59] nothing to worry about :-] [15:59:08] seriously, ew [16:00:33] I had an even crazier idea which is to add in a .pp a include /data/project/puppet/*.pp :D [16:00:42] (03CR) 10ArielGlenn: [C: 032] contint: .gitconfig for jenkins-deploy user [operations/puppet] - 10https://gerrit.wikimedia.org/r/112461 (owner: 10Hashar) [16:02:10] (03PS1) 10Hashar: contint: file{} needs 'owner' not 'user' [operations/puppet] - 10https://gerrit.wikimedia.org/r/112468 [16:02:16] apergos: and I made a typo sorry :-( [16:02:27] rats, I read it and didn't see it [16:02:28] file { user => } isn't valid, needs to be "owner" [16:02:28] sorry [16:02:32] yeah hard to catch [16:02:36] I have read a few times myself [16:02:37] yeah I should have checked all the params [16:03:39] (03CR) 10ArielGlenn: [C: 032] contint: file{} needs 'owner' not 'user' [operations/puppet] - 10https://gerrit.wikimedia.org/r/112468 (owner: 10Hashar) [16:08:25] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [16:08:34] that would be me [16:08:40] no operational impact [16:18:55] RECOVERY - Host copper is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [16:25:35] PROBLEM - RAID on copper is CRITICAL: Connection refused by host [16:25:45] PROBLEM - puppet disabled on copper is CRITICAL: Connection refused by host [16:25:45] PROBLEM - Disk space on copper is CRITICAL: Connection refused by host [16:25:55] PROBLEM - DPKG on copper is CRITICAL: Connection refused by host [16:25:55] PROBLEM - SSH on copper is CRITICAL: Connection refused [16:28:35] !log done with links count update for cirurs [16:28:43] Logged the message, Master [16:28:50] !log correction: done with link count update for cirrus [16:28:58] Logged the message, Master [16:30:05] (03PS1) 10Faidon Liambotis: install-server: disable biosdevname on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112475 [16:30:37] (03CR) 10Faidon Liambotis: [C: 032] 
install-server: disable biosdevname on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112475 (owner: 10Faidon Liambotis) [16:31:06] (03CR) 10Faidon Liambotis: [V: 032] install-server: disable biosdevname on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112475 (owner: 10Faidon Liambotis) [16:37:55] PROBLEM - NTP on copper is CRITICAL: NTP CRITICAL: No response from NTP server [16:52:43] akosiaris: yT? [16:53:01] ottomata: ? [16:53:18] hey, wanted to discuss kafkatee init stuff with ya [16:53:46] sure. shoot [16:54:34] so, you were asking why I didn't let kafkatee daemonize itself [16:54:44] and 1. i haven't yet been able to get that to work correclty with upstart yet [16:54:45] and 2. [16:54:50] upstart docs recommend not doing it [16:54:52] if possible [16:54:59] yeah I read that [16:55:11] true, cause upstart does all the mumbo jumbo [16:55:23] but i think you might need an expect there ? [16:55:33] an expect statement that is [16:55:34] if kafkatee doesn't daemon [16:55:38] ize [16:55:40] then we don't need an expect [16:55:47] because it will run in foreground [16:55:54] and the upstart exec stanza will background it [16:56:08] that part was not clear on the docs [16:56:18] so if you don't put an expect there [16:56:25] what happens ? [16:56:39] what is the default I mean [16:56:42] If you do not specify the expect stanza, Upstart will track the life cycle of the first PID that it executes in the http://upstart.ubuntu.com/cookbook/#exec or http://upstart.ubuntu.com/cookbook/#script stanzas. [16:57:55] so it is the forks 0 no expect [16:58:13] so fine by me... 
not really an upstart expert tbh [16:58:41] aye me neither, just reading the docs [16:58:50] i mean, i could do an init.d script and use start-stop-daemon if you prefer [16:58:59] but i like upstart better, and it think for kafkatee upstart is fine [16:59:06] i hope ubuntu will also move to systemd, like debian [16:59:11] niah don't waste your time [17:03:52] ok well, akosiaris, anything else on that change? [17:04:10] apart from the copyright file ? nothing [17:04:21] if you got that we are good to go [17:04:38] i think i fixed that [17:04:47] https://gerrit.wikimedia.org/r/#/c/110620/10/debian/copyright [17:04:50] ja? [17:05:06] LGTM [17:05:14] want me to +2 ? [17:05:19] yus please [17:05:42] feel free to merge at your will :-) [17:05:46] danke! :0 [17:06:09] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [17:15:44] see you later [17:16:35] RECOVERY - Host copper is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [17:20:55] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_6.4p1 Ubuntu-2 (protocol 2.0) [17:31:19] the cur tables are still around :P [17:32:50] (03PS1) 10Faidon Liambotis: install-server: fix late_command for trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112482 [17:33:14] (03PS2) 10Faidon Liambotis: install-server: fix late_command for trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112482 [17:34:09] (03CR) 10Faidon Liambotis: [C: 032] install-server: fix late_command for trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112482 (owner: 10Faidon Liambotis) [17:35:35] RECOVERY - RAID on copper is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:35:45] RECOVERY - puppet disabled on copper is OK: OK [17:35:46] RECOVERY - Disk space on copper is OK: DISK OK [17:35:56] RECOVERY - DPKG on copper is OK: All packages OK [17:39:44] (03PS1) 10Faidon Liambotis: reprepro: add i386 to trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112484 [17:40:32] (03CR) 10Faidon Liambotis: [C: 032 V: 
032] reprepro: add i386 to trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/112484 (owner: 10Faidon Liambotis) [17:41:25] Reedy: can you please merge: https://gerrit.wikimedia.org/r/#/c/111985/ [17:52:17] (03PS1) 10Faidon Liambotis: salt: remove pinning to specific package version [operations/puppet] - 10https://gerrit.wikimedia.org/r/112485 [17:52:19] (03PS1) 10Faidon Liambotis: ssh: use provider => 'upstart' [operations/puppet] - 10https://gerrit.wikimedia.org/r/112486 [17:52:40] (03CR) 10Faidon Liambotis: [C: 032] salt: remove pinning to specific package version [operations/puppet] - 10https://gerrit.wikimedia.org/r/112485 (owner: 10Faidon Liambotis) [17:53:05] (03PS2) 10Faidon Liambotis: ssh: use provider => 'upstart' [operations/puppet] - 10https://gerrit.wikimedia.org/r/112486 [17:53:45] RECOVERY - NTP on copper is OK: NTP OK: Offset -0.0100055933 secs [17:55:24] (03CR) 10Faidon Liambotis: [C: 032] ssh: use provider => 'upstart' [operations/puppet] - 10https://gerrit.wikimedia.org/r/112486 (owner: 10Faidon Liambotis) [18:03:43] paravoid: getting ready for trusty? 
[18:11:45] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [18:16:56] RECOVERY - Host copper is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [18:19:35] PROBLEM - RAID on copper is CRITICAL: Connection refused by host [18:19:45] PROBLEM - puppet disabled on copper is CRITICAL: Connection refused by host [18:19:46] PROBLEM - Disk space on copper is CRITICAL: Connection refused by host [18:19:55] PROBLEM - DPKG on copper is CRITICAL: Connection refused by host [18:19:55] PROBLEM - SSH on copper is CRITICAL: Connection refused [18:25:38] (03PS1) 10Faidon Liambotis: install-server: allow /sbin/ip for early_command [operations/puppet] - 10https://gerrit.wikimedia.org/r/112489 [18:25:56] (03CR) 10Faidon Liambotis: [C: 032 V: 032] install-server: allow /sbin/ip for early_command [operations/puppet] - 10https://gerrit.wikimedia.org/r/112489 (owner: 10Faidon Liambotis) [18:26:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:40:25] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [18:45:35] RECOVERY - Host copper is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [18:52:15] so happy to see paravoid promoted :) i guess my comments from this morning were heard [18:52:38] Congratulations paravoid ! [18:55:45] :) [19:00:01] PROBLEM - NTP on copper is CRITICAL: NTP CRITICAL: No response from NTP server [19:02:50] commonswiki: HTTP 404 (Not Found) in 'SwiftFileBackend::doStoreInternal' [19:02:58] * AaronSchulz scratches head [19:07:07] (03PS4) 10Phuedx: Enable the GettingStarted extension on non-enwiki wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 [19:09:38] (03CR) 10Swalling: [C: 031] "LGTM" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 (owner: 10Phuedx) [19:10:32] 'non-enwiki wikis' Interesting title... 
[19:12:21] JohnLewis: at one point i'll get "wiki wiki wild wild west" [19:12:24] give me time… [19:12:36] phuedx :) [19:12:56] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [19:13:14] phuedx: I'll expect 'none wmf wikis' soon ;) [19:13:17] copper, eh [19:17:37] ACKNOWLEDGEMENT - Host copper is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn i hear paravoid uses it [19:17:51] yeah, sorry about that [19:17:55] so, copper. I saw pa-ravoid doin..... [19:17:56] nvm [19:18:49] it's the trusty test host [19:18:51] paravoid: i disabled notifications for all services on it, just needs reactivation if it should [19:18:59] thanks :) [19:21:05] (03CR) 10Phuedx: Enable the GettingStarted extension on non-enwiki wikis. (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 (owner: 10Phuedx) [19:23:25] RECOVERY - Host copper is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [19:25:19] (host is not a service on host:) but i turned that off to now, reboot as much as you like)..(should) [19:25:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [19:29:07] (03PS1) 10Ottomata: Adding new research leila, giving access to stat1 and stat1002 and bastions [operations/puppet] - 10https://gerrit.wikimedia.org/r/112497 [19:33:10] (03CR) 10Ottomata: [C: 032 V: 032] Adding new research leila, giving access to stat1 and stat1002 and bastions [operations/puppet] - 10https://gerrit.wikimedia.org/r/112497 (owner: 10Ottomata) [19:40:48] (03CR) 10Dzahn: "thanks! 
formey:~# ldaplist -l passwd leila = uidNumber: 3963" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112497 (owner: 10Ottomata) [20:08:38] (03PS1) 10Ori.livneh: chown scap checkout to mwdeploy user [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 [20:09:38] !log harmon - removing from puppet stored configs, complete decom, unused Tampa spare [20:10:21] Logged the message, Master [20:14:15] !log harmon - revoke puppet cert,disable puppet,disable icinga notifications, shutting down [20:14:42] (03PS2) 10Dzahn: decom 'harmon' - rm from site/dsh/partman/dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/112171 [20:14:57] Logged the message, Master [20:15:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3802 MB (10% inode=95%): /srv 546850 MB (37% inode=99%): [20:18:20] paravoid: I wonder HEADs to containers sometimes 503...maybe there are exceptions in some log? [20:19:09] yoo greg-g, you there? [20:20:08] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3802 MB (10% inode=95%): /srv 546850 MB (37% inode=99%): [20:21:08] I bet the 404s on store are the same problem [20:21:10] * AaronSchulz looks at http://lists.openstack.org/pipermail/openstack/2013-December/004154.html [20:22:07] hrm, well http://lists.openstack.org/pipermail/openstack/2013-December/004155.html only explains the PUT problem [20:22:13] (03CR) 10Dzahn: [C: 032] decom 'harmon' - rm from site/dsh/partman/dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/112171 (owner: 10Dzahn) [20:24:30] Connection to harmon closed. 
[20:24:30] one less [20:25:06] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3801 MB (10% inode=95%): /srv 546850 MB (37% inode=99%): [20:30:06] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 7664 MB (21% inode=95%): /srv 546850 MB (37% inode=99%): [20:36:12] ottomata: I am now [20:37:24] ah, hey, see email i just sent [20:37:24] JVM App Deployment [20:37:25] tell me what you think [20:38:48] kk [20:39:49] ottomata: interesting, I think I'll need a little more to give a true +1/-1 [20:40:09] but, "git-annex just within the deploy system, not on our personal dev machines" sounds reasonable, I think... [20:40:48] * greg-g needs to read that entire thread, lost it over the weekend [20:40:55] s/lost/didn't/ [20:43:06] yeah so [20:43:20] i've been hacking together a little config file solution [20:43:26] to just download and checksum files from urls [20:43:30] i think that works [20:43:36] then, we could use git annex once the files are there [20:43:47] to add them to the deploy host's repository [20:44:10] and then git-annex sync and git-annex get on each of the targets to pull in the files from the deploy host [20:44:11] OR [20:44:12] i mean [20:44:22] the other idea was just to commit them to deploy host's working copy [20:44:40] to the working deploy branch [20:44:40] and use regular ol' git just like it works now [20:44:45] to fetch and pull them [20:44:54] but that would clutter up the git repos on the cluster [20:47:13] sorry, was replying to one of the emails [20:48:41] eeek, I think we want to avoid binaries in git as much as possible, I mean, absolutely. 
[20:48:48] ok cool [20:48:49] in the deploy case it isn't so bad, as it would never be checked into a remote that we care about [20:48:49] but yeah, i can imagine tin getting bogged down with crap after a while [20:49:07] yeah, you really really don't want to commit binaries of any real size :) [20:53:54] greg-g: i think that this is a good idea [20:53:56] because [20:54:04] if people DO want to use git-annex locally to manage the files [20:54:30] it should still work…as long as deploy host knows how to get those files from an annex somewhere [20:54:30] e.g. with addurl [20:54:30] or something [20:54:30] * greg-g nods [20:54:35] yeah [20:54:47] doing git-annex as deploy system but config files to manage what files are around [20:54:57] solves our review and git-annex remote messiness problem [20:55:12] word [20:55:12] as we wouldn't use git annex to do that part [20:55:19] right [20:55:20] just plain ol git [20:55:22] aye [20:56:19] sorry I've been quiet on this thread so far, I should have jumped in earlier [20:56:47] heheh, yeah sounds like you have more opinions than anybody, nobody seems to be that interested [20:56:53] bd808: you there? [20:57:06] ottomata: pong [20:57:08] yeah, I think most people are like "whatever works?" 
[20:57:12] yeah [20:57:20] * bd808 reads scrollback [20:57:42] I think git-annex will give us neat stuff in the future with all of the infra he's written into it, but it won't be useful at the beginning (or will be scary complex) [20:57:46] so, i'd like to start playing with this, but i'm kinda still grokking git deploy and what the methods in deploy.py do and how they are used [20:57:52] well, not git-deploy, [20:57:52] * greg-g nods [20:57:59] i guess this is sartoris (since it is in ops/puppet) [20:58:12] yeah [20:58:24] so, bd808, wondering if you could brain bounce with me for a sec to figure out what I should do next [20:58:33] not sure how much more you understand than me though :p [20:59:00] hehe 'brain bounce' [20:59:06] squishy [20:59:16] * yuvipanda makes mental picture of two people co-juggling brains like juggler's balls [21:01:50] ottomata: I can listen. I really haven't taken the time to poke into the git-deploy guts yet. I've actually kind of been holding off on the assumption that Ryan will be switching us over to trebuchet-trigger on the frontend soon. [21:02:04] * greg-g nods [21:02:07] that's what the plan is [21:02:19] from what I saw of his and ori's convo this weekend [21:02:39] (not sure of timeline) [21:03:44] well [21:03:46] right so [21:03:53] the backend i'm looking at is trebuchet [21:04:03] just the stuff that is committed directly to ops/puppet [21:04:13] ottomata: re your email -- I think that's what my brain was thinking of when I suggested the configuration file approach. Managing the binaries should just be a deploy system concern. There will need to be a script of something to manage them for developers too. [21:04:42] script *or* something to manage [21:04:45] Ryan told me to go ahead and code whatever i'm doing in ops/puppet [21:04:52] and he'd upstream anything useful in trebuchet [21:04:57] e.g. 
git-annex support [21:05:37] (03CR) 10Catrope: [C: 032] Fix missing entries in VisualEditor-default.dblist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111141 (owner: 10Jforrester) [21:06:04] (03Merged) 10jenkins-bot: Fix missing entries in VisualEditor-default.dblist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111141 (owner: 10Jforrester) [21:06:21] Sounds reasonable I guess. I'm assuming that we will find a way to consume his upstream repo at some point but we can cross that bridge later. [21:07:26] So the things you are thinking off all live in the trebuchet layer and not half there and half in the frontend script? (Which is the perl version of git-deploy in out env as far as I know) [21:07:38] Gah I can't type today [21:07:42] *thinking of [21:09:27] right [21:09:28] yeah [21:09:31] ottomata: So walk me through what would happen on tin when a change to the config script was fetched [21:09:38] so [21:09:41] tin: [21:09:51] if artifacts.json exists [21:09:52] read it [21:09:54] for each artifact there [21:10:09] download file and checksum; remove any mismatched checksum files [21:10:12] then [21:10:15] (not sure about this part) [21:10:24] git annex add path/to/artifact [21:10:46] and i guess whatever needs to happen to remove any unwanted artifacts from annex [21:11:06] then [21:11:10] when each of the deploy targets is uhh deploying [21:11:10] they will [21:11:17] get fetch [21:11:19] git* [21:11:20] !log catrope synchronized visualeditor-default.dblist 'fix missing entries' [21:11:28] Logged the message, Master [21:11:43] git checkout; git annex sync (to sync info about annex remotes around) [21:11:43] git annex get (to get all files) [21:11:58] the git annex get command should pull the files from tin [21:12:10] * greg-g nods [21:12:24] i'm actually not sure at what step those commands need to be run, but it would be done at the same point that submodules are synced... 
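The artifacts.json step described above (read the manifest, checksum each file, remove any file whose checksum does not match) could look roughly like the sketch below. It is only an illustration under stated assumptions: the manifest layout (an "artifacts" list with "path" and "sha256" keys) and the function names are hypothetical, since the log never shows the real file format, and the download step is left out so the example stays self-contained.

```python
import hashlib
import json
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifacts(manifest_path, root):
    """Check files under `root` against a manifest (hypothetical layout).

    Returns three lists of relative paths: verified, missing, and removed.
    A file whose checksum does not match is deleted so that a later
    download step can fetch a clean copy.
    """
    with open(manifest_path) as handle:
        manifest = json.load(handle)

    verified, missing, removed = [], [], []
    for entry in manifest["artifacts"]:
        target = os.path.join(root, entry["path"])
        if not os.path.exists(target):
            missing.append(entry["path"])
        elif sha256_of(target) == entry["sha256"]:
            verified.append(entry["path"])
        else:
            os.remove(target)  # mismatched checksum: drop the file
            removed.append(entry["path"])
    return verified, missing, removed


if __name__ == "__main__":
    # Tiny demonstration in a throwaway directory.
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "app.jar"), "wb") as handle:
        handle.write(b"example bytes")
    manifest = {"artifacts": [
        {"path": "app.jar",
         "sha256": hashlib.sha256(b"example bytes").hexdigest()},
    ]}
    manifest_path = os.path.join(root, "artifacts.json")
    with open(manifest_path, "w") as handle:
        json.dump(manifest, handle)
    print(check_artifacts(manifest_path, root))
```

Anything reported as missing or removed would then be downloaded again before the git annex add step mentioned in the discussion.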
[21:12:24] hm [21:12:36] this actually sounds a little complicated at this point, if we are trying to keep deploys atomic [21:12:37] the git-annex sync/get? [21:12:44] yeah [21:12:50] because it might take a few minutes for the get to complete [21:12:54] well, the things after "uhh deploying" are all automagic [21:12:58] but the get can't run unless the repo has been merged [21:13:12] i don't know how this works on the targets? [21:13:15] i understand [21:13:22] * greg-g nods [21:13:23] git fetch everywhere…wait..git checkout [21:13:43] but with this we'd need something like [21:14:01] git fetch everywhere, checkout to temp location, git annex get…wait…ok git checkout [21:14:29] Ryan has mentioned that there is a facility to checkout a temporary version on the servers that are being deployed so that things like this can happen [21:14:52] ay hmmm, ok [21:14:54] I'm not sure if we have those changes in the puppet repo yet or if they only live in the github version [21:14:57] aye [21:15:20] right, so: git-fetch, git-checkout (in some temp dir), git-annex sync, git-annex get ., flop over to live dir [21:15:33] But that was part of the changes he was working on to solve the l10n binary problem [21:16:12] aye, yeah, same problem [21:16:28] remind me to tell you about "dropunused" in the future if we actually do use git-annex [21:17:32] (it'll just be a cronjob, no worries ;) ) [21:18:23] ha, ok [21:18:31] hm so bd808, ok, I really just want something that works here [21:18:47] it doesn't sound like we will be moving to upstream trebuchet anytime soon, right? [21:18:52] I think the puppet repo trebuchet has that checkout capability [21:18:59] oh, the checkout temp dir? 
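The "flop over to live dir" step from the sequence above is commonly done by repointing a symlink, so the slow git annex get can finish in a staging directory before anything live changes. The sketch below shows that general technique only. It is not taken from trebuchet or sartoris, whose actual mechanism the log does not describe, and the function name and directory layout are hypothetical.

```python
import os
import tempfile


def activate_release(release_dir, live_link):
    """Point the symlink `live_link` at `release_dir` in one step.

    Assumes a POSIX filesystem and that `live_link` is either absent or
    already a symlink; it will fail if `live_link` is a real directory.
    """
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(release_dir, tmp_link)
    # Renaming over the old link is atomic, so readers see either the
    # old target or the new one, never a missing path.
    os.replace(tmp_link, live_link)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    staged = os.path.join(base, "checkout-new")
    os.mkdir(staged)  # stands in for the fully fetched temp checkout
    live = os.path.join(base, "live")
    activate_release(staged, live)
    print(os.readlink(live))
```

The design point is that all slow work happens before the rename, which keeps the switch itself close to instantaneous on each target.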
[21:19:01] I found this in my scrollback from the weekend [21:19:04] (03PS1) 10MaxSem: Deploy MobileApp on test and test2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112575 [21:19:06] (03PS1) 10MaxSem: Enable MobileApp everywhere [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112576 [21:19:16] "[21:31] other than trigger and ricochet the most current code is in puppet" [21:19:27] oh ok [21:19:30] hm [21:19:40] i wish there were more function docs in deploy.py [21:19:44] have to read all the code to understand it [21:20:22] ottomata: Ryan did tell us to bug him about specific docs that need updating... [21:20:23] Yeah [21:20:38] I'd ping him via email. I bet he'll get back to you overnight [21:20:38] oooo [21:20:38] ok [21:20:56] ok bd808 [21:21:08] what is the diff between runners/deploy.py and modules/deploy.py? [21:21:18] * bd808 shrugs [21:21:34] :) [21:21:37] !log catrope synchronized wmf-config/InitialiseSettings.php 'touch' [21:21:44] Logged the message, Master [21:21:52] ori: Can you help ottomata with some trebuchet questions? ^^ [21:25:04] greg-g, any objections if I deployed https://gerrit.wikimedia.org/r/112139 during my window? [21:25:28] ottomata: It looks to me like runners/deploy is the bit that runs on tin from the git-deploy hooks [21:25:59] OHHhhHh [21:26:21] ok and are these salt commands specifically? [21:26:26] MaxSem: sure [21:26:28] If I'm reading it right it's the part that tells salt what to do [21:26:30] i guess so but maybe that doesn't make a diff [21:26:38] oh i see yeah [21:26:39] ok cool [21:26:39] so [21:26:46] on tin git deploy probably does something like [21:26:47] (03PS1) 10Chad: Lower search cache expiry to 12 hours on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112577 [21:26:57] salt call deploy.fetch [21:27:09] and that triggers the targets to fetch? 
[21:27:15] yeah ok [21:27:16] hm [21:27:25] a little confusing that they have the same module.function name [21:27:26] but ja [21:27:31] ottomata: I hope you mean salt-call. ;-) [21:27:35] ^d: oh man, I misread (missed the "Search" bit) and was about to kick you [21:27:41] i have no idea what I mean! [21:27:44] ottomata: Yay! [21:27:51] i have run about 2 salt commands ever! :p [21:28:00] <^d> greg-g: Lower all cache expiries! [21:28:03] (Sorry, I'm new here, figured I'd start chiming in on Salt stuff) [21:28:07] ^d: :) [21:28:13] please do! [21:28:24] oh Corey and Coren eh? [21:28:32] i was wondering if Coren had just gotten cuter or something [21:28:34] It figures Coren would be here too. [21:28:52] I contribute to Salt. Coren contributes to typos. [21:28:55] It's a trap! [21:29:10] hmm, ok, so bd808, Ryan said I could test stuff in the sartoris labs project [21:29:16] haha [21:29:45] i should find these .py files somewhere there? can I just edit them to develop stuff? [21:30:14] ottomata: In theory. I think he uses a local puppet master in that project [21:30:30] So the files should be in … let me look [21:30:32] k that would be even better [21:31:21] (03CR) 10Manybubbles: [C: 031] "Good for when there is window." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112577 (owner: 10Chad) [21:31:30] ottomata: A labs local puppet master puts its files in /var/lib/git/operations/puppet [21:31:59] That dir will be a clone of operations/puppet that you can hack on [21:32:31] (03PS1) 10Jforrester: Enable VisualEditor in opt-in mode for French Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 [21:32:54] yes ja [21:33:03] that bit I know well! but which machine? [21:33:08] all of them? 
[21:33:33] ah sartoris-server [21:33:34] i see it [21:33:34] cool [21:34:36] ottomata: i'm going to make a change to remove "locke" [21:35:23] ok awesome, yeah that can be done away with anytime [21:35:26] back in Nov you said can be decom'ed, there is some file_mover@locke stuff though, but we can handle via gerrit [21:35:34] cool! [21:42:46] PROBLEM - swift-object-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:42:46] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:42:56] PROBLEM - swift-account-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:42:57] PROBLEM - swift-object-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:07] PROBLEM - SSH on ms-be1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:43:07] PROBLEM - swift-account-reaper on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:07] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:07] PROBLEM - RAID on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:16] PROBLEM - DPKG on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:16] PROBLEM - swift-container-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:16] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:16] PROBLEM - swift-account-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:16] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:26] PROBLEM - swift-container-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:36] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:44:54] hmmm, bd808, i'm not so sure I was correct [21:45:01] when I said that all this would be in the trebuchet backend code [21:45:02] not sure [21:45:26] i mean, it could be [21:45:30] i'm really not sure where to put stuff [21:45:33] so, for the stuff on tin [21:45:41] i need to read this config file and download files [21:45:55] not sure if that should be triggered from deploy.checkout [21:45:55] or [21:46:01] from git-deploy/hooks/deploylib.py [21:46:15] there's stuff in there that does submodule stuff [21:46:16] hmmmm [21:46:24] Or a raw git hook in the local repo? [21:46:50] HMMMM [21:47:24] that is very simple, keeps it out of trebuchet's hair, at least for now [21:47:29] and trebuchet could just do annex stuff [21:47:30] hmm [21:47:53] I was thinking about a post-checkout hook but I was thinking caveman simple too [21:48:17] * bd808 doesn't mean to offend any cavemen in the audience [21:49:08] The git docs on post-checkout say "you can use it to set up your working directory properly for your project environment. This may mean moving in large binary files that you don’t want source controlled, auto-generating documentation, or something along those lines." [21:51:04] (03PS1) 10Dzahn: remove locke from puppet,dsh,dhcpd,ud2log filter [operations/puppet] - 10https://gerrit.wikimedia.org/r/112587 [21:51:44] any roots around that can help us work around a trebuchet bug? [21:52:57] ottomata: can you look at https://rt.wikimedia.org/Ticket/Display.html?id=6168 ? [21:53:13] to be able to actually deploy we need to nuke the parsoid checkout on each node and re-fetch it: https://gist.github.com/gwicke/bb5e58ae2a4bcd47baac [21:54:56] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server [21:55:21] (03PS2) 10Jforrester: Move VisualEditor to opt-in status on eswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/109639 (owner: 10TTO) [21:56:14] so, do we care about those icinga errors?
[21:56:36] AaronSchulz: ^ [21:57:03] greg-g: you must be new here. :> [21:57:13] matanya: done [21:57:17] !log unsuccessful Parsoid deploy as trebuchet failed to update the submodule with the parsoid source, need trebuchet bug fix [21:57:22] thanks ottomata [21:57:24] Logged the message, Master [21:58:02] ok, i'll push patches to remove it [21:58:12] (03PS1) 10Jforrester: Disable 'beta' label in tab for the VE opt-in wiki (enwiki) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112590 [21:58:54] gwicke: that's all on _each_ node, so salt -G 'deployment_target:parsoid' cmd.run ...? [21:59:26] mutante, yes [21:59:28] gwicke: delete on all and then run salt-call _on_ each node via salt itself? [21:59:31] ok [21:59:36] ideally not all at once [21:59:41] the 10% thing? [21:59:46] yes [21:59:53] or even one manually first [22:00:02] MatmaRex: :P [22:00:06] okay, my window. beware, mortals! [22:00:10] i'm picing wtp1016 [22:00:16] picking [22:00:20] chatting with Ryan in the security channel, he is at work though [22:00:28] oh,ok [22:00:48] gwicke, any production breakage? 
[22:01:00] afaik no [22:01:18] our deploy was a no-op again as trebuchet did not actually update the code [22:01:31] (03PS1) 10Matanya: locke: decom, remove udp2log filters [operations/puppet] - 10https://gerrit.wikimedia.org/r/112591 [22:02:34] !log wtp1016 - delete deployment/parsoid, salt-call fetch/checkout.., restart parsoid [22:02:39] gwicke: done on wtp1016 [22:02:42] Logged the message, Master [22:03:23] ottomata , mutante please : https://gerrit.wikimedia.org/r/112591 [22:03:36] mutante, looks good so far [22:04:43] (03CR) 10MaxSem: [C: 032] Deploy MobileApp on test and test2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112575 (owner: 10MaxSem) [22:05:34] (03CR) 10Ottomata: [C: 032 V: 032] locke: decom, remove udp2log filters [operations/puppet] - 10https://gerrit.wikimedia.org/r/112591 (owner: 10Matanya) [22:05:51] thanks matanya [22:05:56] PROBLEM - Disk space on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:14] i have a few coming in ottomata, i'll poke you around [22:06:16] PROBLEM - puppet disabled on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:05] (03Merged) 10jenkins-bot: Deploy MobileApp on test and test2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112575 (owner: 10MaxSem) [22:08:22] (03PS1) 10Matanya: locke: decom, remove logging doc comments [operations/puppet] - 10https://gerrit.wikimedia.org/r/112593 [22:09:46] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:56] PROBLEM - MySQL Recent Restart on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:56] PROBLEM - MySQL Idle Transactions on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:56] PROBLEM - Full LVS Snapshot on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:57] PROBLEM - MySQL Processlist on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:09:57] PROBLEM - mysqld processes on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:06] PROBLEM - RAID on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:06] ha? ^ [22:10:16] PROBLEM - SSH on db1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:16] PROBLEM - MySQL disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:16] PROBLEM - MySQL InnoDB on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:16] PROBLEM - MySQL Slave Running on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:28] Gee, I wonder if a server fell over? [22:10:31] !log maxsem started scap: Extension:MobileApp deployment [22:10:36] PROBLEM - DPKG on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:36] PROBLEM - puppet disabled on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:40] Logged the message, Master [22:10:46] PROBLEM - Disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:46] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:49] db1050 went nuts [22:11:09] * matanya is looking for a dba [22:11:29] matanya, it's just NRPE [22:11:50] why would nrpe do this? [22:12:14] why would it fail to connect to a socket? [22:12:26] no, that is obvious [22:12:33] see springle got here :) [22:12:37] !log fixing broken parsoid deploy on wtp*, one by one [22:12:46] Logged the message, Master [22:12:57] matanya: firewalling? [22:12:57] hm i can't reach wikitech [22:12:59] springle: did db1050 go bad?
[22:13:26] looking at it now [22:13:37] jgage: only you [22:14:09] PROBLEM - Apache HTTP on mw1200 is CRITICAL: Connection timed out [22:14:10] springle: matanya , yea, looks it did [22:14:17] * matanya points MaxSem ^ [22:14:29] matanya thanks, looks like perhaps office connectivity or dns [22:14:36] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:46] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:12] please don't die all :) [22:15:36] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.970 second response time [22:15:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [22:16:21] (03PS1) 10Chad: Revoke key per request [operations/puppet] - 10https://gerrit.wikimedia.org/r/112594 [22:16:38] <^d> mutante: ^ key revocation [22:16:56] PROBLEM - Apache HTTP on mw1198 is CRITICAL: Connection timed out [22:17:06] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:06] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:13] ottomata: https://gerrit.wikimedia.org/r/112593 [22:17:19] !log springle synchronized wmf-config/db-eqiad.php 'db1050 crashed, depool' [22:17:26] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:26] Hi [22:17:27] Logged the message, Master [22:17:30] I am getting Wikimedia Errors [22:17:34] I assume you already know [22:17:46] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.096 second response time [22:17:46] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.346 second response time [22:17:46] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [22:17:47] PROBLEM - Apache HTTP on 
mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:50] yes huh [22:17:51] huh: A list of apache errors says yes :p [22:17:56] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [22:17:56] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [22:17:56] (03CR) 10Dzahn: [C: 032] Revoke key per request [operations/puppet] - 10https://gerrit.wikimedia.org/r/112594 (owner: 10Chad) [22:18:16] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.181 second response time [22:18:17] Looks like they're coming back up? [22:18:24] Hilariously greg-g is incommunicado [22:18:26] PROBLEM - Apache HTTP on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.056 second response time [22:18:35] Hilarious. [22:18:47] Yay he's back [22:18:48] oh, hi greg [22:18:52] (03PS2) 10Matanya: locke: decom, remove logging doc comments [operations/puppet] - 10https://gerrit.wikimedia.org/r/112593 [22:18:57] (03CR) 10Ottomata: [C: 032 V: 032] locke: decom, remove logging doc comments [operations/puppet] - 10https://gerrit.wikimedia.org/r/112593 (owner: 10Matanya) [22:18:59] hey, so, what's up? 
[22:19:26] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.328 second response time [22:19:28] (I can't get to my colo'd box with irssi on it, don't have scrollback) [22:19:32] not my scap - it was still updating proxies when errors started [22:19:36] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [22:19:41] oh, recoveries, yay [22:20:36] PROBLEM - Apache HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.057 second response time [22:20:46] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.059 second response time [22:20:46] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:47] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:47] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:47] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:47] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:47] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:48] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:49] Cannot contact the database server: Too many connections (10.64.16.144) [22:20:56] PROBLEM - Apache HTTP on mw1087 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 9.488 second response time [22:20:56] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:56] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:56] PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.057 
second response time [22:20:57] PROBLEM - Apache HTTP on mw1196 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.057 second response time [22:20:57] PROBLEM - Apache HTTP on mw1075 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 9.532 second response time [22:20:57] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1799 bytes in 0.337 second response time [22:20:57] someone ping springle [22:21:00] PROBLEM - Apache HTTP on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 4.742 second response time [22:21:06] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1760 bytes in 0.077 second response time [22:21:10] PROBLEM - Apache HTTP on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.066 second response time [22:21:10] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1798 bytes in 0.667 second response time [22:21:12] So, who spoiled the coffee this time? [22:21:12] MaxSem: what did you deploy? 
[22:21:16] PROBLEM - Apache HTTP on mw1083 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.066 second response time [22:21:17] greg-pidgin: i believe he is already on [22:21:21] greg-pidgin: he is here [22:21:27] springle: see above, please :) [22:21:36] 10.64.16.144 is db1049 [22:21:36] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [22:21:36] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.194 second response time [22:21:37] PROBLEM - Apache HTTP on mw1043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.311 second response time [22:21:38] ori, ext:MobileApp - just a RL module [22:21:46] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.639 second response time [22:21:46] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [22:21:47] PROBLEM - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1796 bytes in 0.310 second response time [22:21:47] (and of course, I'm still having network issues here at WMF office) [22:21:52] not even included in page views [22:21:55] enwiki slave [22:21:56] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.381 second response time [22:21:57] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69604 bytes in 0.410 second response time [22:22:00] logstash showing a big dberror spike. 
"Error connecting" to various ips from various ips [22:22:06] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.136 second response time [22:22:07] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69604 bytes in 0.454 second response time [22:22:12] yeah, springle, let me know if you're looking or not [22:22:18] yes, looking [22:22:19] yep, I see DB -related fatals [22:22:20] thans [22:22:21] as i said :) [22:22:26] sorry, pidgin sucks [22:22:36] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [22:22:36] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [22:22:37] oh sorry you weren't on [22:22:45] getting a 503 when I try to go to meta [22:22:46] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.082 second response time [22:22:46] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.084 second response time [22:22:46] PROBLEM - NTP on db1050 is CRITICAL: NTP CRITICAL: No response from NTP server [22:22:47] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69575 bytes in 0.503 second response time [22:22:48] * MaxSem blames greg-g for lunching without a laptop:P [22:22:48] your nick color is all lame and dim [22:22:50] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:56] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:56] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:56] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:57] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[22:23:03] 10-Feb-2014 22:22:05] Fatal error: Call to a member function getSlavePos() on a non-object at /usr/local/apache/common-local/php-1.23wmf12/includes/db/LoadBalancer.php on line 294 [22:23:06] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.123 second response time [22:23:06] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69531 bytes in 0.025 second response time [22:23:10] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.057 second response time [22:23:10] PROBLEM - Apache HTTP on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.072 second response time [22:23:10] PROBLEM - Apache HTTP on mw1054 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.068 second response time [22:23:10] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:10] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:10] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.449 second response time [22:23:10] PROBLEM - Apache HTTP on mw1102 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.062 second response time [22:23:11] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:11] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.056 second response time [22:23:12] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1760 bytes in 0.092 second response time [22:23:13] PROBLEM - Apache HTTP on mw1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 3.781 second response time [22:23:16] PROBLEM - Apache 
HTTP on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 7.533 second response time [22:23:16] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:16] PROBLEM - Apache HTTP on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.064 second response time [22:23:16] PROBLEM - Apache HTTP on mw1069 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.067 second response time [22:23:16] PROBLEM - Apache HTTP on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.048 second response time [22:23:17] wmf 12? [22:23:35] that's not new code :/ [22:23:36] PROBLEM - Apache HTTP on mw1174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.051 second response time [22:23:36] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.067 second response time [22:23:36] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.074 second response time [22:23:36] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.073 second response time [22:23:36] PROBLEM - Apache HTTP on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.065 second response time [22:23:37] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [22:23:37] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.086 second response time [22:23:43] back up now [22:23:46] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [22:23:46] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [22:23:46] 
RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [22:23:46] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [22:23:46] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [22:23:50] !log big dberror spike. "Error connecting" to various ips from various ips [22:23:56] what's up? [22:23:56] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.094 second response time [22:23:57] PROBLEM - Apache HTTP on mw1219 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.054 second response time [22:23:57] PROBLEM - Apache HTTP on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.056 second response time [22:23:57] PROBLEM - Apache HTTP on mw1088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.070 second response time [22:23:57] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.085 second response time [22:23:57] PROBLEM - Apache HTTP on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.097 second response time [22:23:57] Logged the message, Master [22:24:07] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.081 second response time [22:24:07] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.083 second response time [22:24:07] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.359 second response time [22:24:15] matanya: Brilliant log message :D [22:24:16] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [22:24:16] 
RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [22:24:16] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [22:24:16] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.175 second response time [22:24:18] mark: db related things, not sure yet, ori pasted some errors [22:24:30] I pasted [10-Feb-2014 22:22:05] Fatal error: Call to a member function getSlavePos() on a non-object at /usr/local/apache/common-local/php-1.23wmf12/includes/db/LoadBalancer.php on line 294 [22:24:34] Explosions in the sky time. [22:24:35] https://gdash.wikimedia.org/dashboards/reqerror/ [22:24:36] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [22:24:36] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [22:24:36] PROBLEM - Apache HTTP on mw1042 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.063 second response time [22:24:36] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.086 second response time [22:24:46] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [22:24:46] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [22:24:51] Most of the errors in the dberror log are "Too many connections" [22:24:57] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [22:24:57] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.272 second response time [22:25:06] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 808 bytes in 0.062 second response time [22:25:07] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [22:25:07] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [22:25:07] PROBLEM - Apache HTTP on mw1082 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.065 second response time [22:25:07] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.377 second response time [22:25:08] eg "Error connecting to 10.64.32.21: :real_connect(): (HY000/1040): Too many connections" [22:25:13] !log maxsem scap aborted: Extension:MobileApp deployment (duration: 14m 41s) [22:25:16] PROBLEM - Apache HTTP on mw1160 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.045 second response time [22:25:20] meh [22:25:21] Logged the message, Master [22:25:25] bd808: I also got Wikimedia Errors [22:25:26] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:31] something caused thousands of broken connections to db slaves, leaving /many/ sleepers in process list [22:25:36] PROBLEM - Apache HTTP on mw1154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.050 second response time [22:25:36] PROBLEM - Apache HTTP on mw1217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.057 second response time [22:25:36] PROBLEM - Apache HTTP on mw1182 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.061 second response time [22:25:36] PROBLEM - Apache HTTP on mw1183 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.070 second response time [22:25:36] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [22:25:37] PROBLEM - Apache 
HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.056 second response time [22:25:46] PROBLEM - Apache HTTP on mw1057 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.066 second response time [22:25:46] PROBLEM - Apache HTTP on mw1106 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.068 second response time [22:25:46] PROBLEM - Apache HTTP on mw1179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 2.506 second response time [22:25:47] PROBLEM - Apache HTTP on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 5.955 second response time [22:25:56] PROBLEM - Apache HTTP on mw1155 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.073 second response time [22:25:56] PROBLEM - Apache HTTP on mw1214 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.059 second response time [22:25:56] PROBLEM - Apache HTTP on mw1124 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.082 second response time [22:25:57] PROBLEM - Apache HTTP on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.066 second response time [22:25:57] PROBLEM - Apache HTTP on mw1061 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.072 second response time [22:25:57] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [22:25:57] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.060 second response time [22:26:07] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:07] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:07] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [22:26:07] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [22:26:16] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [22:26:16] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.212 second response time [22:26:36] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [22:26:36] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [22:26:36] PROBLEM - Apache HTTP on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.059 second response time [22:26:36] PROBLEM - Apache HTTP on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.073 second response time [22:26:36] PROBLEM - Apache HTTP on mw1066 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.069 second response time [22:26:37] PROBLEM - Apache HTTP on mw1120 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.064 second response time [22:26:37] PROBLEM - Apache HTTP on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.063 second response time [22:26:38] PROBLEM - Apache HTTP on mw1049 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.076 second response time [22:26:38] PROBLEM - Apache HTTP on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.051 second response time [22:26:39] PROBLEM - Apache HTTP on mw1144 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.088 second response time [22:26:39] PROBLEM - Apache HTTP on mw1098 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.064 second response time 
[22:26:40] PROBLEM - Apache HTTP on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.057 second response time [22:26:40] PROBLEM - Apache HTTP on mw1059 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.153 second response time [22:26:46] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.262 second response time [22:26:46] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [22:26:46] PROBLEM - Apache HTTP on mw1171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.063 second response time [22:26:47] PROBLEM - Apache HTTP on mw1051 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 8.149 second response time [22:26:47] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:47] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:47] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:48] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:56] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [22:26:57] PROBLEM - Apache HTTP on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.080 second response time [22:26:57] PROBLEM - Apache HTTP on mw1063 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.065 second response time [22:26:57] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:57] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [22:26:57] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently 
- 808 bytes in 0.061 second response time [22:27:07] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [22:27:07] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:27:07] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [22:27:18] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:27:33] my route to iron is dying at cr1-eqiad [22:27:36] PROBLEM - Apache HTTP on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.069 second response time [22:27:36] PROBLEM - Apache HTTP on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.057 second response time [22:27:36] PROBLEM - Apache HTTP on mw1184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.058 second response time [22:27:36] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [22:27:36] PROBLEM - Apache HTTP on mw1039 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.068 second response time [22:27:37] PROBLEM - Apache HTTP on mw1174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.055 second response time [22:27:37] PROBLEM - Apache HTTP on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.065 second response time [22:27:38] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.064 second response time [22:27:39] PROBLEM - Apache HTTP on mw1151 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.063 second response time [22:27:39] PROBLEM - Apache HTTP on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.064 second 
response time [22:27:39] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [22:27:46] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.345 second response time [22:27:46] PROBLEM - Apache HTTP on mw1038 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.053 second response time [22:27:46] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [22:27:46] PROBLEM - Apache HTTP on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 5.566 second response time [22:27:46] PROBLEM - Apache HTTP on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1568 bytes in 0.065 second response time [22:27:46] PROBLEM - Apache HTTP on mw1075 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 0.064 second response time [22:27:47] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.319 second response time [22:27:47] PROBLEM - Apache HTTP on mw1092 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1567 bytes in 7.375 second response time [22:27:48] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:27:57] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [22:27:57] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [22:27:57] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.112 second response time [22:27:57] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.099 second response time [22:27:57] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 808 bytes in 0.083 second response time [22:27:57] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.091 second response time [22:27:57] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.094 second response time [22:27:58] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.096 second response time [22:27:58] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [22:28:00] so is this related to the MobileApp extension deploy we think? [22:28:04] Sorry! This site is experiencing technical difficulties. [22:28:07] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [22:28:12] Yes. [22:28:13] mark: unsure [22:28:15] mark: I think MaxSem hadn't gotten anything out yet [22:28:23] !log killed thousands of broken connections on s1 slaves in Sleep state [22:28:29] Logged the message, Master [22:28:36] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [22:28:36] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [22:28:36] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [22:28:36] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [22:28:36] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [22:28:36] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.081 second response time [22:28:37] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 
second response time [22:28:37] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.093 second response time [22:28:38] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [22:28:38] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [22:28:38] rdwrer: SAL says he started a scap [22:28:39] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [22:28:39] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [22:28:40] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [22:28:41] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [22:28:41] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [22:28:41] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.105 second response time [22:28:42] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [22:28:42] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [22:28:46] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.043 second response time [22:28:46] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.100 second response time [22:28:46] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [22:28:46] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 808 bytes in 0.059 second response time [22:28:46] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [22:28:47] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.217 second response time [22:28:47] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:48] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.259 second response time [22:29:12] they're just reconnecting [22:29:13] greg-pidgin, my scap started seconds after the first notice [22:29:22] oh, after? [22:29:23] huh [22:29:27] I've gotten the error: PHP fatal error in /usr/local/apache/common-local/php-1.23wmf12/includes/db/LoadBalancer.php line 294: [22:29:27] Call to a member function getSlavePos() on a non-object [22:29:28] * huh ? [22:29:36] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [22:29:36] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [22:29:36] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [22:29:36] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.282 second response time [22:29:41] huh: change your nick [22:29:46] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.169 second response time [22:29:51] :) [22:29:53] done, sorry [22:29:56] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [22:29:57] Oh, is it a downtime? 
[22:30:06] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.929 second response time [22:30:07] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.790 second response time [22:30:25] now Cannot contact the database server: Too many connections (10.64.32.21) [22:30:32] so, I don't have access to my backlogs, did that mediastorage issue get diagnosed? [22:30:46] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.440 second response time [22:30:46] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [22:30:56] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.095 second response time [22:31:06] hello [22:31:17] Ross_Hill: we know :) [22:31:18] (03CR) 10Dzahn: "that broke https://gerrit.wikimedia.org/r/#/c/112587/ i did purposely not touch these because they aren't wrong "used ot be on locke in th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112593 (owner: 10Matanya) [22:31:29] did someone change topic in -tech? [22:31:33] (03PS2) 10Dzahn: remove locke from puppet,dsh,dhcpd,ud2log filter [operations/puppet] - 10https://gerrit.wikimedia.org/r/112587 [22:31:36] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [22:31:36] Alex did [22:31:40] (yes) [22:31:40] grrrit-wm: they did [22:31:41] thanks [22:31:46] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.833 second response time [22:31:52] Krenair: thanks [22:31:57] what's the bot? [22:31:59] !log pt-kill jobs on s1 slaves killing anything sleeping longer than 10s [22:32:00] yw [22:32:06] greg-g: first thing reported was PROBLEM - Disk space on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:32:07] Logged the message, Master [22:32:18] and after that PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds [22:32:31] Ross_Hill: "There is an icinga-wm bot in #wikimedia-operations that will echo whatever Icinga alerts on (see below)" [22:32:38] matanya: yeah, not sure if they're related, but that's the last thing I remember seeing before my net went wonky [22:32:39] and a few more, and then 1050 was reported crashed, and was depooled [22:32:42] https://wikitech.wikimedia.org/wiki/Icinga [22:32:46] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:46] PROBLEM - Apache HTTP on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:46] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:47] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:47] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:47] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:47] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:48] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:48] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:50] then the storm started [22:32:52] uh oh [22:32:57] PROBLEM - Disk space on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:33:36] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [22:33:36] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [22:33:36] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [22:33:36] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [22:33:36] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [22:33:37] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [22:33:37] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.074 second response time [22:33:38] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [22:33:46] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.690 second response time [22:33:56] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:57] (03PS1) 10MaxSem: Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112597 [22:34:06] (03CR) 10MaxSem: [C: 032] Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112597 (owner: 10MaxSem) [22:34:07] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:13] (03Merged) 10jenkins-bot: Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112597 (owner: 10MaxSem) [22:34:16] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:16] PROBLEM 
- Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:23] just to keep the cluster in consistent state [22:34:31] MaxSem: makes sense [22:34:42] MaxSem: sync it, IMO [22:34:47] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:47] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:47] !log Power cycled ms-be1001 [22:34:56] Logged the message, Master [22:34:56] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:35:07] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:35:57] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.365 second response time [22:36:07] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:07] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.225 second response time [22:36:36] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [22:36:36] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [22:36:36] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [22:36:45] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/112597' [22:36:46] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.084 second response time [22:36:46] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [22:36:46] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:36:54] Logged the message, Master [22:36:54] mw1163: ssh: connect to host mw1163 port 22: Connection timed out [22:36:57] 
RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [22:36:57] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [22:37:00] MaxSem: unrelated [22:37:07] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.136 second response time [22:37:18] dberror log rate is dropping fast. [22:37:21] well, it could've been in swapdeath [22:37:38] Now seeing quite a few "Error connecting to 10.64.16.145" messages [22:37:46] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [22:37:47] RECOVERY - swift-object-server on ms-be1001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:37:56] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:37:57] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:37:57] RECOVERY - swift-account-reaper on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:37:57] RECOVERY - RAID on ms-be1001 is OK: OK: optimal, 14 logical, 14 physical [22:37:57] RECOVERY - swift-container-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:38:01] I got that one [22:38:09] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [22:38:09] RECOVERY - puppet disabled on ms-be1001 is OK: OK [22:38:09] RECOVERY - swift-container-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:38:09] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:38:09] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:38:09] 
RECOVERY - swift-account-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:38:15] ~15 minutes ago :-P [22:38:16] RECOVERY - swift-container-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:38:26] RECOVERY - swift-object-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:38:33] doesn't seem to be related to the mobile deploy in any way [22:38:36] RECOVERY - swift-object-auditor on ms-be1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:38:36] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:38:46] RECOVERY - swift-account-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [22:38:56] PROBLEM - Host db1050 is DOWN: PING CRITICAL - Packet loss = 100% [22:39:03] We've only had recoveries since MaxSem synced the revert [22:39:06] Oh, lies. [22:39:11] Spoke too soon [22:39:11] !log springle synchronized wmf-config/db-eqiad.php 'move s1 vslow dump' [22:39:19] Logged the message, Master [22:39:53] rdwrer: since he started the deploy after the first alert, i suspect it is related [22:40:12] *it is not [22:40:24] matanya, by the time apaches started melting, code wasn't on them [22:40:33] that is my point [22:40:44] so, mediastorage? related? [22:41:11] that was the first error + RT https://rt.wikimedia.org/Ticket/Display.html?id=6804 [22:41:19] that points me there [22:41:23] ~3500 connection errors to mysql host 10.64.16.145 in last 5 minutes [22:41:29] I think the cluster melts every time someone says "no-op" in the channel (22.01 UTC) [22:41:49] mmm, media storage causing PHP timeouts causing dead mysql connections? [22:41:58] Nemo_bis: Well don't say it then!
:) [22:42:07] (03PS1) 10Springle: db1050 was hot-depooled from tin after lockup. now do it properly. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112600 [22:42:07] rdwrer: whoops :P [22:42:34] MaxSem: that's my only current theory [22:42:35] MaxSem: that is my guess, though doesn't sound so likely [22:42:38] (03CR) 10Springle: [C: 032] db1050 was hot-depooled from tin after lockup. now do it properly. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112600 (owner: 10Springle) [22:42:49] I'm waiting for sean to come back from hot fixing :) [22:43:01] (03Merged) 10jenkins-bot: db1050 was hot-depooled from tin after lockup. now do it properly. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112600 (owner: 10Springle) [22:44:15] mutante: you stole my thunder [22:44:19] !log springle synchronized wmf-config/db-eqiad.php 'sync proper non-hot depool db1050' [22:44:27] Logged the message, Master [22:44:36] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [22:45:09] I hope parsoid can accept a no as answer [22:45:32] !log restarting parsoid on wtp1008 [22:45:33] db1050 is powered off and depooled. don't know what locked it yet [22:45:36] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [22:45:36] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [22:45:40] Logged the message, Master [22:45:41] matanya: i don't know which one [22:45:45] why would mw1010 be special and be the one that has a massive spike while others didn't? 
[22:45:48] locke one mutante [22:45:53] i'm still doing the parsoid stuff [22:46:02] can't [22:46:05] something about that caused thousands of sleeping connections to slave dbs, which then hit max_connections even though doing very little [22:46:26] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.011 second response time [22:47:11] matanya, mw1010 was the first scap victim [22:47:14] springle: you seem to have the best current grasp of things, I'll look to you in a bit for some thoughts :) [22:47:28] MaxSem: innnnntestting [22:47:53] MaxSem: how far did it get? [22:47:56] matanya: actually i think you got something merged 2 minutes before my change, that was 1 line in a comment but made mine needing a rebase :p [22:47:59] https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 see 1010 [22:48:17] 5 hosts finished, some others were rsyncing [22:48:18] so unsure about stealing thunder :p [22:48:30] ok, i owe you one mutante :P [22:49:26] Nemo_bis, except that parsoid is a different cluster & only interacts with PHP code through the API cluster [22:49:50] which ones MaxSem ?
[22:49:52] matanya: btw, db9 NRPE: Command 'check_mysql_disk_space' not defined , that is probably because the new puppet way fails there [22:50:07] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 617 seconds [22:50:13] but not the same issue with the whitespace [22:50:16] yeas [22:50:26] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 630 seconds [22:50:38] gwicke: what I would say in _security but the wifi here sucks is: right, I should have stopped MaxSem while you all were still going, because being down one person, while we're already down so many ops, isn't good during an outage like this [22:50:41] ah, my bad - mw1010 is a scap proxy so it was busy pushing changes to multiple other hosts [22:50:47] PROBLEM - MySQL Idle Transactions on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 657 seconds [22:50:47] PROBLEM - MySQL Idle Transactions on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 659 seconds [22:50:57] MaxSem: interesting [22:51:17] that's why load on it fell after I hit Ctrl-C, not when I rolled back [22:51:33] MaxSem: you can see that from the network graphs [22:51:52] gwicke: jokes don't need serious answers :) [22:52:14] ori: thoughts on that? (ie: mw1010 being the fan out for scap, falling over and possibly taking things with it?) [22:52:35] Nemo_bis, touché ;) [22:52:45] well, MobileApp is not present on all apaches [22:52:56] but the wmf-config change to require it was synced [22:53:01] i put my money it is not mobile [22:53:07] and it shouldn't be sending any requests as the app that calls it isn't in the appstore yet, right?
[22:53:10] matanya: #wikimedia-gabling please [22:53:26] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:53:33] well, if you require_once "$IP/extensions/MobileApp/MobileApp.php", and that file doesn't exist, that's an issue [22:53:52] RECOVERY - MySQL Idle Transactions on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:53:52] RECOVERY - MySQL Idle Transactions on db1040 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:54:08] RECOVERY - MySQL InnoDB on db1040 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:54:11] ori: ah [22:54:12] there was no such fatal [22:54:43] you should have synced the MobileApp directory first and done the config change separately, regardless [22:55:00] Things we still need answers for: 1) MediaStorage flapping 2)what was making the calls that killed the dbs? [22:55:26] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [22:55:47] greg-pidgin, 3) when I can attempt another scap:P [22:56:12] MaxSem: luckily, you're the last person on the list for today [22:56:39] greg-pidgin: db1050 was the first death i think. 2) might be: what causes many sleeping apache connections to db slaves when one slave dies [22:56:55] to which i have no answer so far [22:56:56] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [22:57:18] yes springle 1050 was first [22:57:56] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [22:58:26] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [22:59:17] !log restarting db1050 for investigation [22:59:26] Logged the message, Master [22:59:41] !log aaron synchronized php-1.23wmf13/includes/db/LoadBalancer.php '8f6471e04ce0f33c64c090cbe5561deed82f60ee' [22:59:49] Logged the message, Master [22:59:58] AaronSchulz: que es? [23:00:30] !log mw1185 segfaulting starting at 22:39Z. 
~240 occurrences in last 20 minutes [23:00:35] just a fatal fix [23:00:38] Logged the message, Master [23:00:42] AaronSchulz: k [23:01:12] AaronSchulz: did you per chance see what was going on with the mediastorage hosts earlier today? [23:01:50] I didn't look at it too much [23:02:02] may have been related to container 503s...don't know [23:02:29] AaronSchulz: https://rt.wikimedia.org/Ticket/Display.html?id=6804 too [23:02:59] !log all parsoid machines redeployed per gwicke's [23:03:06] Logged the message, Master [23:03:11] mutante, thanks a bunch! [23:03:12] eh, that https://gist.github.com/gwicke/bb5e58ae2a4bcd47baac [23:03:16] sure, np [23:03:18] anyway, shall I attempt another scap? [23:03:22] dberror shows this message, starting at 22:10: Mon Feb 10 22:17:04 UTC 2014 mw1063 enwiki Error connecting to 10.64.16.145: :real_connect(): (HY000/2003): Can't connect to MySQL server on '10.64.16.145' (110) [23:03:36] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [23:03:43] MaxSem: not yet, please [23:03:53] at 22:17, the error changes to Mon Feb 10 22:17:16 UTC 2014 mw1209 enwiki Error connecting to 10.64.16.144: :real_connect(): (HY000/1040): Too many connections [23:04:07] and there are some odd ones in the middle [23:04:09] mhurd, yeah [23:04:21] ori: define "odd"? [23:04:36] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.009 second response time [23:04:48] :real_connect(): (HY000/2013): Lost connection to MySQL server at 'reading initial communication packet', system error: 104 [23:04:56] gwicke: ^ some need a second restart, unsure why not all, but then they are fine [23:05:26] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [23:06:11] ori: That looks like a mysql being shut down; does that match when springle shut down and depooled 1050?
[23:06:31] started at 22:14 [23:07:22] mutante: verified that all are updated with dsh; maybe removing the repo before restarting caused some weirdness [23:07:40] springle: bind-address or firewall issue? [23:07:43] * ori will bbiab [23:07:52] gwicke: cool, yea, thx [23:15:03] I know ori is busy, but is anyone else actively diagnosing this still? [23:15:49] greg-pidgin: i'm writing an email now [23:15:57] PROBLEM - Host db1050 is DOWN: PING CRITICAL - Packet loss = 100% [23:16:25] thanks springle [23:17:37] springle: post mortem on wikitech would be nice too for email-less people [23:18:20] matanya: I'll get it there in due time [23:18:46] usually process is email to ops@ and/or engineering@, I sanitize and post to wikitech, and file bugs/rt tickets as appropriate (that's the ideal, at least) [23:18:49] i'm sure greg-g [23:19:44] only 4 days late with this one: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140206-Math [23:19:47] ;) [23:19:57] (just posted [23:19:58] ) [23:20:04] thanks greg-g!
[23:20:05] (PS3) Dzahn: remove locke from puppet,dsh,dhcpd,ud2log filter [operations/puppet] - https://gerrit.wikimedia.org/r/112587 [23:20:18] was waiting for this one :) [23:20:40] sorry :) [23:20:44] weekend got in the way [23:21:07] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [23:21:39] (CR) Matanya: [C: 1] remove locke from puppet,dsh,dhcpd,ud2log filter [operations/puppet] - https://gerrit.wikimedia.org/r/112587 (owner: Dzahn) [23:21:54] no need for sorry :) [23:24:36] RECOVERY - Disk space on db1050 is OK: DISK OK [23:24:36] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [23:24:36] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 0 seconds [23:24:36] (CR) Dzahn: turn planet into a module (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn) [23:24:46] RECOVERY - MySQL Recent Restart on db1050 is OK: OK seconds since restart [23:24:46] RECOVERY - MySQL Idle Transactions on db1050 is OK: OK longest blocking idle transaction sleeps for seconds [23:24:46] RECOVERY - Full LVS Snapshot on db1050 is OK: OK no full LVM snapshot volumes [23:24:56] RECOVERY - RAID on db1050 is OK: OK: optimal, 1 logical, 2 physical [23:25:06] RECOVERY - SSH on db1050 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:25:07] RECOVERY - MySQL disk space on db1050 is OK: DISK OK [23:25:07] RECOVERY - MySQL Slave Running on db1050 is OK: OK replication [23:25:26] RECOVERY - DPKG on db1050 is OK: All packages OK [23:25:26] RECOVERY - puppet disabled on db1050 is OK: OK [23:25:51] wow, so many checks on the db [23:26:47] PROBLEM - Apache HTTP on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:52] many checks [23:26:56] so db [23:27:06] so coverage [23:27:07] wow [23:27:20] anywho, why's 1184 having issues now? 
[23:28:07] * bd808 proposes the quick and complete death of the doge meme [23:28:36] bd808: such euthanasia [23:28:38] very sad [23:28:40] no way [23:29:01] (CR) Ottomata: [C: 1] "Filter file can totally go" [operations/puppet] - https://gerrit.wikimedia.org/r/112587 (owner: Dzahn) [23:30:11] (CR) Dzahn: turn planet into a module (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn) [23:37:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [23:38:12] * paravoid catches up with backlog [23:38:16] everything ok? [23:38:23] it is now [23:38:25] yeah [23:38:30] can someone send an outage report? :) [23:38:35] springle's on it [23:38:37] shortly [23:38:39] oh, cool [23:38:48] anything I can do? [23:39:24] don't think so right now [23:39:30] k [23:39:33] thanks [23:39:51] you can help the cluster by having a good sleep right now;) [23:41:30] what Max said [23:41:33] btw, MaxSem, I'm just waiting for springle's report/OK to go ahead. [23:42:46] PROBLEM - MySQL Idle Transactions on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 1139 seconds [23:42:55] springle, ^ [23:43:01] yep [23:43:07] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 1157 seconds [23:43:47] RECOVERY - MySQL Idle Transactions on db1040 is OK: OK longest blocking idle transaction sleeps for 0 seconds [23:44:07] RECOVERY - MySQL InnoDB on db1040 is OK: OK longest blocking idle transaction sleeps for 0 seconds [23:45:27] (CR) Matanya: "adding to the inline comments: there is no point in putting all the feeds in erb files, if you hard code the only two changing lines "name" (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn) [23:46:37] Working ... he said [23:46:49] grrrit-wm: [23:50:12] hey, I'm observing a lot of segfaults from mw1185 and a bit from mw1094 - can someone investigate? 
[23:52:41] MaxSem: On it [23:54:00] (PS1) Ryan Lane: Fix submodule fetching in trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/112605 [23:54:13] greg-g: emailed ops@. need discussion and input before moving to wikitech [23:54:54] springle: kk [23:55:20] springle: thoughts on letting MaxSem try to redeploy? Was it bad timing for him, or do you think it could have been related? [23:55:53] greg-g: i suspect it was just bad timing. go ahead, i'll watch [23:56:01] kk [23:56:04] MaxSem: ^^ go ahead [23:56:07] db1050 is still sidelined [23:56:07] greg-g: I'm pretty sure it was just bad timing. [23:56:13] * greg-g nods [23:56:17] wee [23:56:23] * MaxSem cancels sleep [23:56:28] he [23:56:29] h [23:56:29] anybody seen a doc on "deployment-prep"? perhaps it also has another name? [23:56:43] (PS1) MaxSem: Deploy MobileApp on test and test2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112606 [23:56:43] jgage: that's the Labs group that hosts the Beta Cluster [23:56:48] deployment-prep is beta.wmflabs.org [23:56:51] eg http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [23:56:59] thanks greg-g [23:57:23] (CR) MaxSem: [C: 2] Deploy MobileApp on test and test2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112606 (owner: MaxSem) [23:57:33] (Merged) jenkins-bot: Deploy MobileApp on test and test2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112606 (owner: MaxSem) [23:58:13] !log maxsem started scap: MobileApp deployment [23:58:20] Logged the message, Master [23:58:35] MaxSem: I don [23:58:57] I don't see a pattern except that most seem to be preceded by the OOM killer going on a rampage. [23:59:26] ugh