[00:01:08] !log repooling ssl1002 (upgrade complete) [00:01:12] !log depooling ssl1001 [00:01:15] Logged the message, Master [00:01:21] Logged the message, Master [00:04:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:06:36] okay, scapping [00:07:00] MaxSem: Reedy yes, those are in a half-up state [00:07:05] not in rotation at the moment [00:07:56] PROBLEM - NTP on ssl4 is CRITICAL: NTP CRITICAL: No response from NTP server [00:19:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.662 seconds [00:21:34] !log maxsem Started syncing Wikimedia installation... : Black-deployment of ext:Solarium, not enabled anywhere yet [00:21:41] Logged the message, Master [00:22:20] PROBLEM - HTTPS on ssl4 is CRITICAL: Connection refused [00:23:15] !log repooling ssl4 (upgrade complete) [00:23:22] Logged the message, Master [00:23:23] !log depooling ssl2 for upgrade to precise [00:23:29] Logged the message, Master [00:23:59] RECOVERY - HTTPS on ssl4 is OK: OK - Certificate will expire on 08/22/2015 22:23. [00:31:29] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds [00:33:35] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:33:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:33:35] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [00:33:35] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:34:47] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.020 seconds [00:37:38] PROBLEM - HTTPS on ssl1001 is CRITICAL: Connection refused [00:42:53] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [00:44:14] RECOVERY - HTTPS on ssl1001 is OK: OK - Certificate will expire on 10/27/2015 12:00. [00:44:20] Dear opsen, i'm looking for the ganglia config files. [00:44:47] awight: it's in gerrit [00:44:54] in the operations/puppet repo [00:45:02] likely in ganglia.pp [00:45:16] yep, in ganglia.pp [00:45:17] i found that, but it doesn't contain all the instrumentation [00:45:26] that's in files and templates [00:45:39] you'll need to read the manifest to find those files [00:45:44] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.005 seconds [00:45:51] likely files/ganglia [00:45:53] and templates/ganglia [00:46:10] !log repooling ssl1001 (upgrade complete) [00:46:17] Logged the message, Master [00:46:17] !log all eqiad https hosts upgraded to precise [00:46:24] Logged the message, Master [00:48:09] grrr, scap appears to hang on wikiversions sync [00:49:09] where's that new deployment system already? [00:49:10] oh [00:49:10] right [00:49:11] PROBLEM - Host ssl2 is DOWN: PING CRITICAL - Packet loss = 100% [00:49:56] RECOVERY - Host ssl2 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:53:23] PROBLEM - HTTPS on ssl2 is CRITICAL: Connection refused [00:53:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:56:16] Ryan_Lane, TBH, it's hanging ssh'ing one of the half-baked servers. will git-deploy not use ssh at all? 
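
An aside on the ganglia exchange above, before the conversation moves on: puppet manifests point at their payload files via puppet:/// source URLs and template() calls, so grepping the manifest is the quickest way to locate the instrumentation under files/ganglia and templates/ganglia. A minimal sketch, assuming a local clone of operations/puppet; the clone URL and manifest path are illustrative, per the "likely in ganglia.pp" guess in the log.

    # Sketch: find which files/ and templates/ entries a manifest pulls in.
    git clone https://gerrit.wikimedia.org/r/p/operations/puppet.git
    cd puppet
    grep -n 'puppet:///' manifests/ganglia.pp   # source => ... entries map to files/ganglia
    grep -n 'template('  manifests/ganglia.pp   # template(...) calls map to templates/ganglia
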
[00:56:25] it will not [00:56:35] 0mq [00:56:36] via salt [00:57:08] RECOVERY - NTP on srv278 is OK: NTP OK: Offset -0.05188822746 secs [00:57:11] not saying it won't hang, but it'll hang waiting for all minions to return (and that has a timeout setting) [00:57:21] and every host will get the command in parallel [00:57:26] s/host/minion/ [00:57:32] I really should use consistent terms [00:58:03] like the commanding entity=overlord?:P [01:00:44] RECOVERY - NTP on srv267 is OK: NTP OK: Offset -0.04328072071 secs [01:02:28] !log Scap hung and had to be aborted. Since what was being deployed wasn't enabled, no clusters were harmed. [01:02:36] Logged the message, Master [01:08:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [01:09:59] !log depooling ssl3003 to upgrade to precise [01:12:04] Logged the message, Master [01:13:20] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [01:14:05] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [01:21:22] RECOVERY - HTTPS on ssl2 is OK: OK - Certificate will expire on 08/22/2015 22:23. [01:21:22] !log repooling ssl2 (upgrade complete) [01:21:25] !log depooling ssl1 [01:21:30] Logged the message, Master [01:21:36] Logged the message, Master [01:27:22] RECOVERY - NTP on srv279 is OK: NTP OK: Offset -0.03760004044 secs [01:27:40] RECOVERY - NTP on srv268 is OK: NTP OK: Offset -0.04951989651 secs [01:40:07] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 327 seconds [01:40:25] RECOVERY - NTP on srv269 is OK: NTP OK: Offset -0.0414069891 secs [01:40:25] RECOVERY - NTP on srv280 is OK: NTP OK: Offset -0.03783047199 secs [01:40:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 345 seconds [01:42:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:43:25] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:47:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:51:13] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:57:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.661 seconds [01:59:37] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 231 seconds [01:59:55] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 246 seconds [02:07:57] mutante: !log repooling srv290-srv295 [02:08:27] er [02:08:32] that wasn't just for mutante [02:08:37] !log repooling srv290-srv295 [02:08:44] Logged the message, notpeter [02:09:59] !log depooling srv296-srv301 for upgrades to precise [02:10:07] Logged the message, notpeter [02:23:37] PROBLEM - HTTPS on ssl1 is CRITICAL: Connection refused [02:23:46] PROBLEM - Host srv296 is DOWN: PING CRITICAL - Packet loss = 100% [02:24:14] New review: Aaron Schulz; "Looks fine." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34251 [02:24:41] PROBLEM - Host srv297 is DOWN: PING CRITICAL - Packet loss = 100% [02:25:16] PROBLEM - Host srv298 is DOWN: PING CRITICAL - Packet loss = 100% [02:25:52] PROBLEM - Host srv299 is DOWN: PING CRITICAL - Packet loss = 100% [02:26:46] PROBLEM - Host srv300 is DOWN: PING CRITICAL - Packet loss = 100% [02:27:13] PROBLEM - Host srv301 is DOWN: PING CRITICAL - Packet loss = 100% [02:28:20] !log LocalisationUpdate completed (1.21wmf4) at Tue Nov 20 02:28:20 UTC 2012 [02:28:27] Logged the message, Master [02:29:28] RECOVERY - Host srv296 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [02:30:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:22] RECOVERY - Host srv297 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [02:30:58] RECOVERY - Host srv298 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [02:31:34] RECOVERY - Host srv299 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [02:32:28] RECOVERY - Host srv300 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [02:32:55] RECOVERY - Host srv301 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [02:33:04] PROBLEM - Apache HTTP on srv296 is CRITICAL: Connection refused [02:33:13] PROBLEM - SSH on srv296 is CRITICAL: Connection refused [02:34:16] PROBLEM - SSH on srv297 is CRITICAL: Connection refused [02:34:25] PROBLEM - Apache HTTP on srv297 is CRITICAL: Connection refused [02:34:34] PROBLEM - SSH on srv298 is CRITICAL: Connection refused [02:35:01] PROBLEM - Apache HTTP on srv299 is CRITICAL: Connection refused [02:35:19] PROBLEM - Apache HTTP on srv298 is CRITICAL: Connection refused [02:35:55] PROBLEM - SSH on srv299 is CRITICAL: Connection refused [02:36:40] PROBLEM - Apache HTTP on srv300 is CRITICAL: Connection refused [02:36:40] PROBLEM - SSH on srv300 is CRITICAL: Connection refused [02:36:58] PROBLEM - Apache HTTP on srv301 is CRITICAL: Connection refused [02:37:52] PROBLEM - SSH on srv301 is CRITICAL: Connection refused [02:38:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [02:41:01] RECOVERY - SSH on srv298 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:41:19] RECOVERY - SSH on srv296 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:42:32] RECOVERY - SSH on srv297 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:43:34] RECOVERY - SSH on srv300 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:44:10] RECOVERY - SSH on srv299 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:44:19] RECOVERY - SSH on srv301 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:52:09] !log LocalisationUpdate completed (1.21wmf3) at Tue Nov 20 02:52:09 UTC 2012 [02:52:17] Logged the message, Master [02:53:10] PROBLEM - NTP on srv296 is CRITICAL: NTP CRITICAL: No response from NTP server [02:54:13] PROBLEM - NTP on srv297 is CRITICAL: NTP CRITICAL: No response from NTP server [02:54:40] PROBLEM - NTP on srv298 is CRITICAL: NTP CRITICAL: No response from NTP server [02:55:25] PROBLEM - NTP on srv299 is CRITICAL: NTP CRITICAL: No response from NTP server [02:56:01] PROBLEM - NTP on srv301 is CRITICAL: NTP CRITICAL: No response from NTP server [02:56:29] PROBLEM - NTP on srv300 is CRITICAL: NTP CRITICAL: No response from NTP server [03:05:55] PROBLEM - Squid on brewster is CRITICAL: Connection refused [03:31:11] New patchset: Tim Starling; "Updates for Score 
deployment" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/34255 [03:36:58] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Tue Nov 20 03:36:46 UTC 2012 [03:44:28] !log on brewster: root partition is full, removing some useless squid logs [03:44:35] Logged the message, Master [03:46:19] why do people make servers with microscopic root partitions and then configure them to store gigabytes of logs? [03:46:52] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [03:47:01] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:47:55] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080 [03:48:00] i think brewster had a full / recently? less than 2 weeks i guess [03:48:08] but idk if it was squid logs [03:49:43] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [03:51:57] !log on brewster: reduced rotate count for squid logs to zero and ran logrotate -f [03:52:04] Logged the message, Master [04:03:40] RECOVERY - NTP on srv301 is OK: NTP OK: Offset -0.001162052155 secs [04:04:17] is there any way to install a package without installing the packages it recommends? [04:04:41] !log repooling ssl1 (upgrade complete) [04:04:44] RECOVERY - HTTPS on ssl1 is OK: OK - Certificate will expire on 08/22/2015 22:23. [04:04:48] Logged the message, Master [04:04:56] TimStarling: yes, not that I can remember how off the top of my head [04:05:06] TimStarling: --no-install-recommends [04:05:17] or do you mean via puppet? [04:06:21] via puppet or wikimedia-task-appserver [04:06:33] puppet is going to be more difficult [04:07:19] http://projects.puppetlabs.com/issues/1766 [04:07:59] I added timidity and lilypond to wikimedia-task-appserver [04:08:17] it will already be going out to lucid apaches since I just pushed it into the lucid-wikimedia repo [04:08:21] why are we still using wikimedia-task-appserver? [04:08:29] I thought we were pulling the package dependencies out [04:08:30] well, I did ask [04:08:50] does anyone have opinions on wikimedia-task-appserver and whether it should continue to exist? [04:08:57] it has dependencies for most of the packages that MW needs, except for 4 which have been added directly to the puppet class [04:08:58] I think it should not exist [04:09:04] it's probably a bit easier to update puppet than to update the task package [04:09:07] I think puppet should handle this [04:09:34] wikimedia-task-appserver had not been updated since august [04:09:39] I've been trying to kill all the configuration packages for a while now [04:10:01] I could have sworn I did so when we upgraded to lucid, except for a few scripts [04:10:34] we had branches of the package, I wonder if mine simply disappeared [04:10:45] wikimedia-task-appserver had 53 packages and puppet had 5 [04:10:54] so I figured wikimedia-task-appserver was the preferred solution [04:11:20] nope. I'm nearly positive I stripped almost all of them out at some point in a branched version for lucid [04:11:33] did you commit it anywhere? [04:11:34] * Ryan_Lane checks [04:12:27] I just imported it into git half an hour ago, in case you're looking there [04:12:39] ah. no. I was looking for a branch in svn [04:12:49] would the import have gotten the branches too? [04:13:15] no [04:13:26] * Ryan_Lane grumbles [04:13:39] hrmmm, still no rdns on WMF IPv6 ? 
[04:13:49] let me see what the ticket # was [04:14:00] http://svn.wikimedia.org/viewvc/mediawiki/branches/hardy/debs/wikimedia-task-appserver/debian/control?r1=83002&r2=85389 [04:14:23] that looks like I did the opposite? [04:14:25] that's 1.5 years ago [04:14:34] and in hardy [04:14:36] hrmmmm, well it does have rdns now... [04:14:39] 2620:0:861:1::2 [04:14:54] the revision says "Updating package for lucid." [04:15:05] http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/wikimedia-task-appserver/debian/control?view=log&pathrev=85389 [04:15:11] in a branch called hardy? [04:15:18] you had branched it [04:15:23] and reverted my changes [04:15:31] > Received: from [2620:0:861:1::2] (port=48253 helo=lists.wikimedia.org) by mchenry.wikimedia.org with esmtp (Exim 4.69) (envelope-from ) id baz for info@wikipedia.org; Mon, 19 Nov 2012 19:19:24 +0000 [04:15:47] though it looks like I added dependencies [04:15:50] not removed. [04:16:11] seems I'm insane. ignore me. [04:16:42] done [04:16:54] either way, I despise the configuration packages and would much prefer that things were done in puppet [04:17:33] I'm pretty sure the only one that doesn't feel that way is Jeff_Green [04:17:39] and either way, there is no way to avoid installing recommended packages? [04:17:56] yes and no [04:18:10] can change the apt configuration so that the default is to not install recommended [04:18:53] timidity recommends timidity-daemon, which is some kind of hardware emulation thing for ALSA [04:19:02] ugh [04:19:16] and lilypond recommends a couple of hundred MB of docs [04:19:56] let me see if it's possible in the package definition [04:21:37] seems not [04:21:39] fucking puppet [04:21:53] so, can change the apt configuration [04:21:59] may be able to do it per package [04:23:16] Change merged: Tim Starling; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/34255 [04:23:31] already deployed, so about time I merged it [04:23:56] /etc/apt/apt.conf [04:24:00] APT::Install-Recommends "0"; [04:24:18] of course, would be nicer to do for specific packages [04:24:28] maybe it's possible to do similar to pinning [04:24:31] I could add a Conflicts line [04:24:39] to wikimedia-task-appserver [04:24:42] that would likely cause problems [04:25:36] !log repooling srv258-srv280 [04:25:44] Logged the message, notpeter [04:27:03] or I could just disable the timidity-daemon service via puppet, then it would be pretty harmless [04:27:15] could. yeah [04:27:25] it's absurd that puppet can't handle this [04:27:49] I wonder what difference it would make if we disabled recommends by default [04:28:03] hard to say without reinstalling the server [04:28:08] yep [04:29:14] you could clean install 2 servers (one each with and without reccomends) and diff their installed package lists [04:29:24] yeah. could do it in labs [04:29:26] * Ryan_Lane shrugs [04:29:37] the recommendation is to not turn recommends off (of course) [04:29:48] where? [04:30:24] Ryan_Lane: so is it at all possible to get tim tams locally? [04:30:40] I've heard yes, but I haven't seen them anywhere [04:31:04] is it OK to put this in mediawiki::packages or do I need to add a dozen layers of abstraction? [04:31:12] we have stroopwafel in NYC [04:31:17] why is it always a dozen? [04:31:24] never thirteen or eleven [04:31:32] TimStarling: the packages? [04:31:41] should be fine to put into mediawiki::packages [04:31:52] disabling the timidity-daemon service [04:31:56] ah [04:32:04] hm. 
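
The two remedies floated above, spelled out as a sketch. The first makes "no recommends" the system default, using the apt.conf line quoted in the log but dropped into a conf.d snippet (the filename here is made up); the second is the "disable the timidity-daemon service via puppet" idea as a one-off puppet apply, where in production the service resource would of course live in a manifest rather than on the command line.

    # System-wide default, equivalent to the /etc/apt/apt.conf line above:
    echo 'APT::Install-Recommends "0";' > /etc/apt/apt.conf.d/99no-recommends

    # Disable the unwanted recommended daemon without removing the package:
    puppet apply -e 'service { "timidity-daemon": ensure => stopped, enable => false }'
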
[04:32:26] one sec [04:33:08] hey tim, whatever you do, can you put it in both the mediawii class and the mediawiki_new module? alsmost done migrating, I promise! [04:33:12] I'd say it's fine there [04:33:45] * jeremyb hands notpeter a k [04:33:50] it's a side-effect of the package [04:33:59] it's 830, and I've been drinking ;) [04:34:03] I think it's a good idea to put it there and document why it's there [04:34:07] I can lose all the letters I want! [04:34:28] * jeremyb gets a beer [04:37:21] there's no mediawiki class [04:40:30] interesting how wikimedia-task-appserver and the other packages MW needs are in different puppet classes in the new hierarchy [04:40:41] despite the fact that they do the same thing [04:40:58] mediawiki::packages [04:40:59] sorry [04:41:21] we should really split this off into a module eventually [04:41:25] notpeter: is that in the plans? :) [04:41:55] well, let's see if ssl3003 comes back up [04:42:09] sure. it's all in the plans [04:42:10] !log rebooting ssl3003 (upgraded via ssh, not console) [04:42:12] :D [04:42:16] Logged the message, Master [04:42:36] well, if ssl3003 doesn't come back up I guess I'll need to stop till mark gets back in [04:44:27] \o/ it came back up [04:45:25] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [04:46:25] New patchset: Tim Starling; "Changes for Score extension deployment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34262 [04:48:14] !log repooling ssl3003 (upgrade finished) [04:48:20] Logged the message, Master [04:48:22] !log depooling ssl3002 for upgrade to precise [04:48:29] Logged the message, Master [04:50:15] someone want to review that last change? [04:50:57] name isn't necessary in the service definition [04:51:13] though it also won't hurt anything [04:51:21] yeah, I thought it probably wasn't, but there was nothing to say it wasn't in the docs [04:51:37] at least, not in the section I was reading [04:51:50] if it isn't specified, it takes the title [04:52:17] hm. well I can't really review the rewrite code [04:52:25] TimStarling: "lilypond"? [04:52:54] New review: Ryan Lane; "+1 on puppet and varnish. Someone else will need to approve for rewrite." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/34262 [04:53:15] thanks Aaron|home [04:53:31] amending for both [04:54:00] New patchset: Tim Starling; "Changes for Score extension deployment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34262 [04:56:09] I grepped for "math", that's how I found that varnish change [04:58:04] ahh, right, we are using varnish now [04:58:47] hopefully there will be no switching back [05:00:19] ok, I'll push that out now, if there are no more comments [05:00:36] sec :) [05:00:38] looking [05:01:09] paravoid: you are still awake? [05:01:10] heh [05:01:11] crazy [05:01:28] no, I just woke up [05:01:56] ah [05:02:09] TimStarling: two things [05:02:22] wow, paravoid waking up in the morning and sleeping at night [05:02:31] yeah, unexpected [05:02:40] a) I'm running a patched rewrite.py on ms-fe1 right now for testing, so I'd have to patch this by hand [05:03:00] but I think I'm just going to roll-out all of my rewrite.py changes to all servers anyway [05:03:05] b) this needs a squid change too [05:03:16] we're not running exclusively on varnish yet [05:03:22] squid configuration is not in puppet, is it? [05:03:27] nope [05:03:28] nope [05:03:44] paravoid: where are we using squid for uploads? 
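
A quick check of the name-versus-title point above: when a puppet service resource carries no explicit name, the title is used as the name, so the two declarations below are equivalent. --noop makes puppet report what it would do without changing anything; timidity-daemon is just the example at hand.

    puppet apply --noop -e 'service { "timidity-daemon": ensure => stopped }'
    puppet apply --noop -e 'service { "timidity-daemon": name => "timidity-daemon", ensure => stopped }'
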
[05:03:45] I want to push this out first, before I install timidity on any more servers [05:04:13] then I want to update wikimedia-task-appserver on precise [05:04:34] are we still using that? [05:04:35] then it will be time for squid [05:04:40] yes, see above [05:05:16] paravoid: I had basically the same reaction [05:06:00] someone was working on ditching this [05:06:02] I think hashar [05:06:06] oh well [05:06:19] Ryan thought he was working on ditching it too [05:06:29] but apparently nobody has succeeded [05:06:53] heh [05:06:59] so, want some help Tim? [05:07:16] if you like [05:07:19] sure [05:08:03] thanks [05:08:53] the MW configuration change is here, in case you need it for reference: https://gerrit.wikimedia.org/r/#/c/34251/ [05:09:04] status draft until the ops changes are done [05:09:07] doing squid now [05:09:19] and I already +1'ed that change [05:09:25] for the rewrite.py part :) [05:09:59] Ryan reviewed the rest of it [05:10:03] yeah [05:10:04] I'll deploy it now [05:10:15] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34262 [05:10:22] Aaron|home: pmtpa still runs on squid and is the backend for esams [05:11:48] !log rebooting ssl3002 [05:11:55] Logged the message, Master [05:11:56] I'm wondering if we should put the squid configs to gerrit [05:12:03] paravoid: meg [05:12:04] err [05:12:05] meh [05:12:11] we'll be rid of it soon enough [05:12:16] well, that was my initial reaction too [05:12:19] and we'd need to clean it [05:12:29] but people have needed it a number of times since then [05:12:34] which means we'd need to split part of it out into private configs [05:12:34] wikivoyage, this etc. [05:12:39] you know there's a reason I put the password in a separate file [05:12:52] it was meant to be public from the start, but I was shouted down [05:12:56] the IPs we are blocking would need to be split out too [05:13:03] bleh. I wish it was public [05:13:11] it's made life annoying in labs [05:13:35] IP blocks are public when they are made via the web interface [05:13:54] there's no real reason why they can't be public when they are made via configuration [05:13:59] except we're putting in blocks against DoS [05:14:23] maybe [05:14:30] !log deploying squid config for score [05:14:37] Logged the message, Master [05:14:55] I think it's likely more trouble than it's worth at this point, though [05:15:19] I'd rather put more effort in DNS being public than squid [05:16:51] New patchset: Faidon; "swift: also handle URLErrors from imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33490 [05:16:51] New patchset: Faidon; "swift: passthrough all imgscalers errors as-is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33491 [05:16:51] New patchset: Faidon; "swift: fix https for short thumb URL redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33492 [05:16:52] New patchset: Faidon; "swift: use WSGIContext properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33510 [05:16:52] New patchset: Faidon; "swift: removed code to hide the ETag." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/23392 [05:16:52] New patchset: Faidon; "swift: removed copy2() and friends from rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25410 [05:16:52] New patchset: Faidon; "swift: remove unreferenced code/variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33651 [05:16:53] New patchset: Faidon; "swift: add CORS support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33652 [05:16:57] rebase fun [05:17:01] * Ryan_Lane twitches [05:17:01] heh [05:18:12] PROBLEM - HTTPS on ssl3002 is CRITICAL: Connection refused [05:19:15] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23392 [05:19:25] !log repooling ssl3002 (upgrade complete) [05:19:31] !log depooling ssl3001 [05:19:33] Logged the message, Master [05:19:33] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25410 [05:19:39] Logged the message, Master [05:19:51] RECOVERY - HTTPS on ssl3002 is OK: OK - Certificate will expire on 08/22/2015 22:23. [05:20:17] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33490 [05:20:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33491 [05:21:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33492 [05:21:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33510 [05:21:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33651 [05:22:08] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33652 [05:23:08] New patchset: Faidon; "swift: remove support for container sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33653 [05:23:38] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33653 [05:29:17] New patchset: Faidon; "swift: remove more unreferenced config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34263 [05:29:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34263 [05:31:15] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Tue Nov 20 05:30:59 UTC 2012 [05:31:27] TimStarling: running puppet on swift proxies and restarting them via rollover now [05:36:30] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [05:36:39] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [05:36:57] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [05:37:06] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [05:37:33] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [05:38:42] Ryan_Lane: ok, then what prereqs are there for public DNS? [05:39:06] someone doing it? :) [05:39:29] i only have a very basic idea of what the current state is [05:39:30] heh [05:39:33] looks like I created a few appservers from scratch with my dsh -g apaches apt-get install wikimedia-task-appserver [05:39:38] I did a bit of a DNS work a few weekends ago [05:39:48] incl. 
rewriting our powerdns to work with libgeoip [05:39:51] yeah, those ones that just went OK [05:40:00] TimStarling: that's possible that dsh group is likely bad [05:40:11] there's a period somewhere in that sentence [05:40:26] you know, puppet has an export feature [05:40:40] what is rollover? [05:40:44] @@, like we use to generate the nagios configuration [05:40:57] yes, it does [05:41:01] jeremyb: depooling a server, restarting it, pooling it back. rinse, repeat [05:41:09] I'm going to install salt this week [05:41:10] TimStarling: I don't think you want us to go there [05:41:10] so mark says that puppet can't generate dsh node groups, but it's not obvious to me why not [05:41:14] so, what's the point? [05:41:25] I've used it heavily in the past [05:41:29] it's not easy to have puppet generate it [05:41:30] it has several problems [05:41:37] it's very easy [05:41:52] paravoid: oh, i was thinking there was a script or something called rollover [05:41:52] well, not very maybe [05:42:04] honestly, the solution is to switch to a modern remote execution app [05:42:20] like salt, right [05:43:05] and how will salt get groups? [05:43:32] we can have puppet configure "grains" on the systems [05:44:06] so the individual systems know which groups they're in not the central brain [05:44:11] yes [05:44:14] it's pub/sub [05:49:54] anyway, I can't deploy this without an apaches node group that is correct [05:50:08] so does anyone know what changes have been made recently that would affect it? [05:50:18] New patchset: Faidon; "swift: add CORS on just-generated thumbs too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34264 [05:50:19] no idea [05:50:46] but a wrong apache groups sounds like much more severe than just score, doesn't it? [05:51:00] okay, ESYNTAX [05:51:28] a wrong apache group sounds severe as affects all MW deployments, doesn't it? [05:51:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34264 [05:51:46] mediawiki-installation is the group for mediawiki deployments [05:52:00] I could use that one instead, I guess [05:52:12] oh [05:52:24] if that is wrong, things really will go bad [05:52:42] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [05:53:41] TimStarling: that seems to include fenari, hume etc. though [05:54:12] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [05:54:30] yes [05:54:43] !log rebooting ssl3001 [05:54:46] they probably need timidity anyway [05:54:49] Logged the message, Master [05:55:18] note that srv296-300 must still be in puppet, because they were still being monitored [05:55:37] unless puppet is broken on spence, I guess it wouldn't be the first time [05:55:39] diff -u <(sort /etc/dsh/group/mediawiki-installation) <(sort /etc/dsh/group/apaches) [05:55:43] is... 
interesting [05:55:53] so "apaches" doesn't have imagescalers or tmh [05:56:27] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100% [05:56:33] imagescalers is deliberately missing, it has its own group [05:56:51] apaches is traditionally just the main cluster [05:57:07] theoretically equivalent to /home/wikipedia/conf/pybal/pmtpa/apaches [05:57:21] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [05:57:30] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [05:57:39] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [05:58:06] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [05:58:33] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [05:59:41] what is going on? [05:59:52] nothing [05:59:54] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:00:30] RECOVERY - swift-account-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:00:39] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:00:40] we really need to fix our nagios setup dammit [06:00:46] too slow [06:00:49] RECOVERY - swift-account-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:00:49] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 119.22 ms [06:01:55] yeah, why do you think I was running apt-get install manually? [06:02:18] PROBLEM - Lucene on search13 is CRITICAL: Connection timed out [06:02:23] "waiting for puppet", it's the ops equivalent of "my code's compiling" [06:03:33] there's more room for fencing on the chairs in the office now. haven't seen that happen yet, but we've only had our current config a day now [06:03:36] * Ryan_Lane grumbles [06:03:41] ssl3001 won't come back up [06:04:24] PROBLEM - SSH on ssl3001 is CRITICAL: Connection refused [06:04:33] PROBLEM - HTTPS on ssl3001 is CRITICAL: Connection refused [06:06:20] and I can't get into the damn console [06:06:22] oh well [06:06:29] guess we'll be down another https server in esams [06:06:57] that's not very good [06:07:09] meh [06:07:11] well, the packages are all installed, but it looks like rewrite.py isn't updated yet? [06:07:22] we can handle all of the current traffic with a single https host if needed [06:07:26] TimStarling: it should be [06:07:52] TimStarling: have an example URL for me? [06:07:52] ah right, maybe it has [06:08:15] well, I'll enable the extension on test2 then we can get MW to make some test files [06:08:39] but the error message has changed so I guess it has been deployed [06:09:42] the "token may have timed out" basically means that no container has been created [06:10:20] Ryan_Lane: so that's how many up / down in esams now? [06:10:34] 2/2 [06:10:38] k [06:11:25] it was a risk upgrading them without console access [06:11:41] which is why I did one at a time [06:11:50] one has been broken for months [06:11:58] sure. 
just didn't realize there was one still broken forever [06:12:03] why no console access? [06:12:06] no clue [06:12:37] either they aren't connected or their network config is screwed up or they are in the wrong vlan or some other reason [06:12:52] you tried through hooft, right? [06:13:00] why through hooft? [06:13:35] you can use any esams host [06:13:46] well, yeah [06:13:56] but yeah, tried in esams ;) [06:14:05] yeah that was the question :) [06:14:16] all of the ssl hosts and a couple of the cp boxes are inaccessible [06:14:25] oh hrm [06:14:26] strange [06:14:39] yep [06:14:51] worked at some point. otherwise I couldn't have installed them [06:16:51] RECOVERY - Lucene on search13 is OK: TCP OK - 0.002 second response time on port 8123 [06:18:43] New patchset: Faidon; "Bug 41304 Add X-Content-Duration to allowed_headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29768 [06:18:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29768 [06:21:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33202 [06:22:53] TimStarling: so, all is well? [06:23:00] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:23:47] the code for the extension wasn't deployed yet, I'm adding it [06:23:59] mostly involves waiting for git [06:24:27] okay, breakfast time [06:24:30] see you in a bit [06:26:24] bye [06:29:27] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:33:51] !log tstarling synchronized php-1.21wmf3/extensions/Score [06:33:57] Logged the message, Master [06:34:28] !log tstarling synchronized php-1.21wmf4/extensions/Score [06:34:34] Logged the message, Master [06:36:00] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34251 [06:36:34] !log tstarling synchronized wmf-config/filebackend.php [06:36:41] Logged the message, Master [06:36:51] !log tstarling synchronized wmf-config/InitialiseSettings.php [06:36:57] Logged the message, Master [06:37:08] !log tstarling synchronized wmf-config/CommonSettings.php [06:37:13] back [06:37:16] Logged the message, Master [06:39:05] almost there [06:39:22] :) [06:40:33] PROBLEM - Lucene on search13 is CRITICAL: Connection timed out [06:41:11] search13 is lonely, wants apergos [06:41:14] New patchset: Tim Starling; "Added Score to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34267 [06:41:30] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34267 [06:41:56] probably [06:43:18] ok looks like I have to run scap [06:43:25] how long does that take? [06:43:33] RECOVERY - Lucene on search13 is OK: TCP OK - 0.011 second response time on port 8123 [06:43:36] yeah, my own fault, I know [06:47:09] localization update, I'm assuming? 
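
Picking up the earlier salt thread (grains written by puppet, commands fanned out over 0mq rather than ssh), here is a hedged sketch of how that targeting looks. The grain name and value ("cluster: appserver") are hypothetical, not an actual WMF convention from the log.

    # On each minion, written by puppet, so the host knows its own group:
    echo 'cluster: appserver' >> /etc/salt/grains

    # On the master: every matching minion gets the command in parallel,
    # and the call returns when minions answer or the timeout expires.
    salt -G 'cluster:appserver' cmd.run 'uptime' --timeout=30

The design point from the conversation is that this is pub/sub: the individual systems know which groups they are in, rather than a central file like /etc/dsh/group/apaches having to be kept correct.
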
[06:47:14] probably a while [06:48:26] OggHandler is disabled, that breaks it [06:52:17] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [06:52:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:53:27] oh rats [06:54:14] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:00:41] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:00:55] hmm [07:00:56] strange. [07:01:52] !log tstarling Started syncing Wikimedia installation... : [07:02:01] Logged the message, Master [07:02:20] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:06:24] apergos: any progress with ms7 cruft btw? [07:06:46] I'm going to take a look at the grub issue today [07:06:50] not yet [07:06:52] hopefully this should unblock you [07:06:56] can I follow along somehow? [07:07:08] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:07:42] hmm I don't like this [07:08:38] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:16:44] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:25:26] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: Connection refused by host [07:25:26] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: Connection refused by host [07:25:38] (that's me) [07:25:44] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: Connection refused by host [07:26:02] PROBLEM - swift-object-server on ms-be5 is CRITICAL: Connection refused by host [07:26:02] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: Connection refused by host [07:26:20] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: Connection refused by host [07:26:29] PROBLEM - swift-container-server on ms-be5 is CRITICAL: Connection refused by host [07:26:38] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: Connection refused by host [07:26:38] PROBLEM - swift-account-server on ms-be5 is CRITICAL: Connection refused by host [07:26:53] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: Connection refused by host [07:26:53] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: Connection refused by host [07:28:46] huh [07:28:48] that's strange [07:28:52] 53G used and nothing uses them [07:29:54] very strange [07:32:58] !log rebooting ms-be5, strange df output for / [07:33:06] Logged the message, Master [07:35:17] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [07:39:29] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:39:29] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:39:29] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:39:38] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [07:39:56] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [07:40:06] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [07:40:14] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:40:15] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:40:15] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:40:15] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:40:23] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [07:40:23] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:40:56] huh [07:43:07] where are those 51G?!? [07:43:54] New patchset: Hashar; "zuul: learn the ability to set push_change_refs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34268 [07:45:21] * hashar gives 1GB and a cookie to paravoid [07:49:34] this is very strange [07:49:47] I see what you mean [07:53:01] commuting to coworking place … brb. [08:00:45] ha, very interesting [08:01:01] /proc/self/mountinfo is current. df returns what's in /etc/mtab which can be old (stale) (according to google) [08:01:49] you won't find anything there [08:01:53] it's something with the filesystem [08:02:00] maybe directory leak [08:06:27] fsck show anything? [08:08:21] no [08:08:32] awesome [08:08:33] but I'm going to retry [08:08:39] rebooting again [08:08:47] good luck [08:08:57] no point in both of us looking at it [08:09:16] oh, I'm already off [08:12:02] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [08:15:23] !log tstarling Finished syncing Wikimedia installation... : [08:15:31] Logged the message, Master [08:15:38] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [08:16:32] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:23:08] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [08:34:13] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [08:41:52] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [08:45:19] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:45:19] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:45:19] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
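
A footnote to the df-versus-mountinfo point above: df of this vintage trusts /etc/mtab, which can go stale, while /proc/self/mountinfo is the kernel's live per-process view. A quick way to compare the two, plus the du invocation used in the log; the loose ' / ' grep pattern is just for eyeballing the root filesystem's entry.

    grep ' / ' /proc/self/mountinfo   # what the kernel says is mounted on /
    grep ' / ' /etc/mtab              # what df believes
    df -h /
    du --one-file-system -hs /        # stay on the root filesystem, as above
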
[08:45:28] PROBLEM - SSH on ms-be5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:46:04] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: Connection refused by host [08:46:04] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: Connection refused by host [08:46:04] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: Connection refused by host [08:46:04] PROBLEM - swift-object-server on ms-be5 is CRITICAL: Connection refused by host [08:46:04] PROBLEM - swift-container-server on ms-be5 is CRITICAL: Connection refused by host [08:46:22] PROBLEM - swift-account-server on ms-be5 is CRITICAL: Connection refused by host [08:46:31] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: Connection refused by host [08:46:49] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: Connection refused by host [08:47:07] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: Connection refused by host [08:47:10] I'm getting tired of this. [08:48:01] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:48:10] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:48:28] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:48:29] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:48:37] RECOVERY - SSH on ms-be5 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:48:37] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:49:13] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:49:13] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:49:13] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:49:13] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:49:13] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:49:14] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:49:22] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:52:30] what is "this"? 
[08:52:40] hey mark [08:52:44] New patchset: Faidon; "swift: set keep_cache_size to 5G" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34269 [08:52:55] ms-be5 [08:52:58] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:53:03] Filesystem Size Used Avail Use% Mounted on [08:53:05] /dev/md0 56G 53G 0 100% / [08:53:29] root@ms-be5:~# du --one-file-system -hs / [08:53:29] 1.9G / [08:53:40] persistent across reboots, so it's not unlinked files or anything [08:53:48] persists even after /forcefsck [08:55:15] it's kinda academic, since we're going to rebuild that box soonish [08:56:35] that's really bizarre [08:56:40] strange [08:57:27] very [08:58:07] just that box? [08:58:18] seems so [08:58:22] didn't check all of them [08:58:28] but a few of them that I did weren't like that [09:04:33] Change merged: Nikerabbit; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34027 [09:05:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:05:45] oh heh [09:06:00] mark: saw that swift commit above? [09:06:10] effectively I'm disabling the fadvise behavior completely [09:06:20] we can set a threshold to whatever we want to, but I set it at 5G [09:06:26] (max file size in swift atm) [09:09:17] New patchset: Nemo bis; "(bug 31859) Enable Narayam by default in or wiki projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34270 [09:10:31] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:10:41] ok [09:10:48] i still don't know what the point would be anyway [09:11:20] of what? fadvise? [09:11:26] why they have that? 
[09:11:36] to avoid cache pollution I think [09:11:43] New patchset: Nemo bis; "(bug 31859) Enable Narayam by default in or wiki projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34270 [09:11:55] but yeah, I don't think there's any point for our workload [09:12:12] Change merged: Nikerabbit; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34270 [09:18:05] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'i18n deploy' [09:18:12] Logged the message, Master [09:18:55] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:22:48] !log upgrading Jenkins plugins [09:22:55] Logged the message, Master [09:23:00] okay, going to transfer 53g to fenari now [09:26:25] PROBLEM - swift-object-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:26:25] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:26:34] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:26:45] PROBLEM - swift-container-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:26:52] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:26:52] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:27:19] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:27:19] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:27:19] PROBLEM - swift-account-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:27:19] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:27:37] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:29:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34226 [09:36:29] mark: hi :) Timo also needs to be added to the jenkins user group on gallium. That is apparently done by editing /etc/group . RT is https://rt.wikimedia.org/Ticket/Display.html?id=3942 [09:36:40] yes [09:36:51] i was already on it :P [09:37:13] nice!! thanks a ton :-] [09:37:47] !log restarting Jenkins [09:37:53] Logged the message, Master [09:40:10] mark: and if you are in the mood for some review, I have a simple change pending https://gerrit.wikimedia.org/r/#/c/34268/ , it adds a parameter to a bunch of classes which is then used in a template expansion. 
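
For context on the fadvise discussion above: paravoid's patchset raises swift's keep_cache_size, the threshold at or below which the object server leaves a served file in the page cache instead of dropping it with posix_fadvise(DONTNEED). A sketch of how that could look in a stock object-server.conf of this swift era; the production files are generated by puppet, so the exact layout and section here are assumptions.

    [app:object-server]
    use = egg:swift#object
    # Files at or below this size stay in the page cache rather than being
    # dropped via fadvise; 5G (5368709120 bytes) covers the max file size
    # in swift at the time, effectively disabling the drop behavior.
    keep_cache_size = 5368709120
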
(for zuul, not in production yet) [09:41:14] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:41:24] ok [09:41:24] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:41:24] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:41:24] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:41:50] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:41:50] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:41:50] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:42:17] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:42:17] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:42:17] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:42:44] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:46:33] mark: so, are we planning to get a single certificate with SANs *.m.wikipedia.org, *.m.wiktionary.org etc. for all the projects? [09:53:34] found it! [09:53:35] yay! [09:54:16] ms-be5 [09:54:20] so obvious [09:54:30] it's wild goose chase week [09:56:59] !log Upgrading Jenkins plugin xUnit which introduce a potential back compatibility issue renamed to . Will update configs files. [09:57:06] Logged the message, Master [10:10:38] i don't know if that's possible [10:10:42] but if it's not, it can't work [10:10:53] because of separate squid and varnish clusters :/ [10:11:37] yeah [10:11:47] that's what I said yesterday too [10:12:01] so, found the du/df issue [10:12:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34268 [10:12:20] /srv/swift-storage/sde1 on / had 50G of files there [10:12:25] but /dev/sde1 was mounted on top of it [10:12:26] doh! 
[10:13:20] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [10:16:06] mark: thanks for the merge :-] [10:17:48] hehe [10:20:15] ahahaha [10:20:34] I had looked at the one with nothing mounted on it (but of course it was empty) [10:34:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:34:20] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:34:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:49:33] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34278 [10:59:52] New patchset: MF-Warburg; "abusefilter-log-detail right from sysop to autoconfirmed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32681 [11:00:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34278 [11:07:23] New patchset: MaxSem; "Update tests, now with Wikivoyage too" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/29895 [11:10:35] New patchset: Faidon; "Hardcode GeoIP netmask to 24 in Varnish's GeoIP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34280 [11:10:41] mark: ^^^ [11:11:21] New patchset: MaxSem; "Update redirection rules for Wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34281 [11:11:24] ... [11:11:41] what? [11:11:45] yes, I know C too you know ;-) [11:11:52] why do you do that while I'm working on it [11:11:56] New review: MaxSem; "Tests are at https://gerrit.wikimedia.org/r/29895" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/34281 [11:12:10] sorry, thought you said you were going to put traffic first [11:12:30] just trying to help, don't bite :) [11:14:53] Warning: opendir(/mnt/upload6/private/ExtensionDistributor/mw-snapshot/trunk/extensions) [function.opendir]: failed to open dir: [11:14:53] No such file or directory in /usr/local/apache/common-local/php-1.21wmf4/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80 [11:14:54] Grrr [11:16:09] New review: Nemo bis; "It should be ok now, but I've not reviewed the existing configuration of all wikis so I can't +1 mys..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32681 [11:16:59] Reedy: can you rebase https://gerrit.wikimedia.org/r/#/c/25737/ so that jenkins is run on it (or whatever it's needed for that)? [11:17:22] It needs rebasing onto master and the conflicts resolving [11:17:58] Reedy: meaning that your magical button is not enough? [11:18:04] Nope [11:18:38] aww [11:19:02] By the look of your changes, chances are it's trivial enough [11:19:14] Reedy: and who could I hope to get https://gerrit.wikimedia.org/r/#/c/33713/ reviewed from, sooner or later? [11:19:25] hmm ok so maybe I'll learn that and try [11:19:35] git review -d 25737 [11:19:41] git rebase origin [11:19:45] it'll tell you what to do from there [11:29:44] Reedy: CONFLICT (content): Merge conflict in wmf-config/InitialiseSettings.php etc. 
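
Reedy's answer follows, but for completeness, here is the usual continuation of the "git review -d / git rebase origin" recipe above once a conflict like this one is reported. The change number is from the conversation; everything else is the standard git-review rebase flow.

    git review -d 25737
    git rebase origin
    # CONFLICT (content): Merge conflict in wmf-config/InitialiseSettings.php
    $EDITOR wmf-config/InitialiseSettings.php   # resolve the <<<<<<< ... >>>>>>> hunks by hand
    git add wmf-config/InitialiseSettings.php
    git rebase --continue
    git review                                  # upload the rebased patchset
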
[11:29:50] Yup [11:29:52] Edit the file [11:29:56] It'll show you what the conflicts are [11:41:55] New patchset: Nemo bis; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25737 [11:43:35] hmpf, a tab got lost [11:51:53] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [11:54:53] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [11:55:56] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [11:55:56] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [11:55:56] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [11:56:59] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [11:56:59] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [12:33:36] Reedy, Warning: opendir(/mnt/upload6/private/ExtensionDistributor/mw-snapshot/trunk/extensions) No such file or directory in /usr/local/apache/common-local/php-1.21wmf4/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80 [12:37:32] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:41] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [13:12:59] MaxSem: has the extension distributor ever worked? [13:13:29] I heard it did at times over the last few years [13:13:40] :P [13:13:53] MaxSem: ooh such instances should be recorded [13:45:58] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:53:46] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:07] PROBLEM - Puppet freshness on srv301 is CRITICAL: Puppet has not run in the last 10 hours [14:02:05] New patchset: Dereckson; "(bug 42280) Enable WebFonts on fa.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34300 [14:11:10] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:12:18] New patchset: Hashar; "zuul: update url_pattern on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34303 [14:13:23] mark: another single click for you, changes the value of a variable ;-] https://gerrit.wikimedia.org/r/34303 [14:13:39] though you seem busy hacking some stack trace so feel free to skip [14:14:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34303 [14:14:49] \O/ [14:22:10] MaxSem: Yup, I know... [14:30:27] New patchset: Hashar; "zuul: fix url_pattern on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34305 [14:46:58] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:52:28] New review: Siebrand; "Don't merge yet. It will probably make sense to also deploy to the other Farsi projects. That's an o..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34300 [15:02:18] New review: Dereckson; "Per last Siebrand comment." 
[15:53:33] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[16:07:48] RECOVERY - Puppet freshness on analytics1002 is OK: puppet ran at Tue Nov 20 16:07:20 UTC 2012
[16:14:01] !log depooling srv281-srv289 for upgrades to precise
[16:14:08] Logged the message, notpeter
[16:23:33] PROBLEM - Host srv282 is DOWN: PING CRITICAL - Packet loss = 100%
[16:23:51] PROBLEM - Host srv283 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:36] PROBLEM - Host srv285 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:36] PROBLEM - Host srv286 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:36] PROBLEM - Host srv287 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:36] PROBLEM - Host srv284 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:36] PROBLEM - Host srv296 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:37] PROBLEM - Host srv297 is DOWN: PING CRITICAL - Packet loss = 100%
[16:25:21] RECOVERY - Host srv281 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[16:25:33] New patchset: Jgreen; "remove class 'base' from civicrm build, clean up misc/fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34327
[16:26:24] PROBLEM - Host srv289 is DOWN: PING CRITICAL - Packet loss = 100%
[16:26:24] PROBLEM - Host srv298 is DOWN: PING CRITICAL - Packet loss = 100%
[16:26:24] PROBLEM - Host srv288 is DOWN: PING CRITICAL - Packet loss = 100%
[16:27:27] PROBLEM - Host srv299 is DOWN: PING CRITICAL - Packet loss = 100%
[16:28:12] RECOVERY - NTP on analytics1002 is OK: NTP OK: Offset -0.03115904331 secs
[16:28:21] PROBLEM - Host srv300 is DOWN: PING CRITICAL - Packet loss = 100%
[16:28:57] PROBLEM - Host srv301 is DOWN: PING CRITICAL - Packet loss = 100%
[16:29:06] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused
[16:29:15] RECOVERY - Host srv282 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms
[16:29:33] RECOVERY - Host srv283 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:29:51] PROBLEM - SSH on srv281 is CRITICAL: Connection refused
[16:30:10] New patchset: Jgreen; "remove class 'base' from civicrm build, clean up misc/fundraising.pp (typo)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34327
[16:30:18] RECOVERY - Host srv285 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[16:30:18] RECOVERY - Host srv287 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[16:30:18] RECOVERY - Host srv286 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[16:30:18] RECOVERY - Host srv296 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[16:30:18] RECOVERY - Host srv297 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[16:30:27] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:39] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 118.14 ms
[16:32:06] RECOVERY - Host srv289 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[16:32:06] RECOVERY - Host srv288 is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms
[16:32:06] RECOVERY - Host srv298 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[16:32:33] PROBLEM - Apache HTTP on srv282 is CRITICAL: Connection refused
[16:32:42] PROBLEM - SSH on srv282 is CRITICAL: Connection refused
[16:33:09] RECOVERY - Host srv299 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[16:33:36] PROBLEM - Apache HTTP on srv287 is CRITICAL: Connection refused
[16:33:45] PROBLEM - Apache HTTP on srv283 is CRITICAL: Connection refused
[16:33:45] PROBLEM - Apache HTTP on srv285 is CRITICAL: Connection refused
[16:33:45] PROBLEM - SSH on srv287 is CRITICAL: Connection refused
[16:33:46] PROBLEM - SSH on srv286 is CRITICAL: Connection refused
[16:33:54] PROBLEM - SSH on srv283 is CRITICAL: Connection refused
[16:33:54] PROBLEM - SSH on srv285 is CRITICAL: Connection refused
[16:34:03] PROBLEM - Memcached on srv287 is CRITICAL: Connection refused
[16:34:03] PROBLEM - Memcached on srv282 is CRITICAL: Connection refused
[16:34:03] PROBLEM - Memcached on srv283 is CRITICAL: Connection refused
[16:34:03] RECOVERY - Host srv300 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[16:34:12] PROBLEM - Apache HTTP on srv286 is CRITICAL: Connection refused
[16:34:12] PROBLEM - Apache HTTP on srv297 is CRITICAL: Connection refused
[16:34:21] PROBLEM - SSH on srv296 is CRITICAL: Connection refused
[16:34:30] PROBLEM - Apache HTTP on srv296 is CRITICAL: Connection refused
[16:34:30] PROBLEM - Memcached on srv285 is CRITICAL: Connection refused
[16:34:30] PROBLEM - Memcached on srv286 is CRITICAL: Connection refused
[16:34:39] RECOVERY - Host srv301 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[16:34:48] PROBLEM - SSH on srv297 is CRITICAL: Connection refused
[16:34:56] New patchset: Jgreen; "remove class 'base' from civicrm build, clean up misc/fundraising.pp (typo^2)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34327
[16:35:33] PROBLEM - Apache HTTP on srv298 is CRITICAL: Connection refused
[16:35:34] PROBLEM - SSH on srv289 is CRITICAL: Connection refused
[16:35:34] PROBLEM - Memcached on srv288 is CRITICAL: Connection refused
[16:35:51] PROBLEM - Apache HTTP on srv288 is CRITICAL: Connection refused
[16:36:00] PROBLEM - Apache HTTP on srv289 is CRITICAL: Connection refused
[16:36:18] PROBLEM - SSH on srv288 is CRITICAL: Connection refused
[16:36:27] PROBLEM - Memcached on srv289 is CRITICAL: Connection refused
[16:36:36] PROBLEM - Apache HTTP on srv299 is CRITICAL: Connection refused
[16:36:54] PROBLEM - SSH on srv298 is CRITICAL: Connection refused
[16:36:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34327
[16:37:12] RECOVERY - SSH on srv283 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:37:21] PROBLEM - SSH on srv300 is CRITICAL: Connection refused
[16:37:21] PROBLEM - SSH on srv299 is CRITICAL: Connection refused
[16:37:30] RECOVERY - SSH on srv282 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:37:57] RECOVERY - SSH on srv281 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:38:06] PROBLEM - Apache HTTP on srv300 is CRITICAL: Connection refused
[16:38:24] PROBLEM - SSH on srv301 is CRITICAL: Connection refused
[16:38:42] RECOVERY - SSH on srv286 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:38:43] RECOVERY - SSH on srv285 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:39:09] PROBLEM - Apache HTTP on srv301 is CRITICAL: Connection refused
[16:39:36] RECOVERY - SSH on srv288 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:40:12] RECOVERY - SSH on srv287 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:40:21] RECOVERY - SSH on srv289 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:41:24] RECOVERY - SSH on srv297 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:42:27] RECOVERY - SSH on srv296 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:43:21] RECOVERY - SSH on srv298 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:43:48] RECOVERY - SSH on srv299 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:43:48] RECOVERY - SSH on srv300 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:44:51] RECOVERY - SSH on srv301 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:45:28] New patchset: Pyoungmeister; "setting last of srv servers to use applicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34328
[16:48:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34328
[16:49:39] PROBLEM - NTP on srv281 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:51:48] New patchset: Jgreen; "grr. File attribute defaults scope is unpredictable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34332
[16:53:06] PROBLEM - NTP on srv282 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:53:33] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[16:53:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:53:33] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[16:53:33] PROBLEM - NTP on srv285 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:53:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34332
[16:54:45] PROBLEM - NTP on srv286 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:55:03] PROBLEM - NTP on srv287 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:58:21] PROBLEM - NTP on srv301 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:01:39] PROBLEM - NTP on srv283 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:03:19] PROBLEM - NTP on srv288 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:04:21] PROBLEM - NTP on srv289 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:08:51] RECOVERY - Puppet freshness on srv281 is OK: puppet ran at Tue Nov 20 17:08:34 UTC 2012
[17:08:51] RECOVERY - Puppet freshness on srv296 is OK: puppet ran at Tue Nov 20 17:08:35 UTC 2012
[17:09:18] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds
[17:10:03] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.013 seconds
[17:22:22] RECOVERY - NTP on srv281 is OK: NTP OK: Offset -0.006912469864 secs
[17:23:51] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.001 seconds
[17:24:18] RECOVERY - Puppet freshness on srv297 is OK: puppet ran at Tue Nov 20 17:24:12 UTC 2012
[17:25:57] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds
[17:36:27] RECOVERY - NTP on srv282 is OK: NTP OK: Offset -0.0072286129 secs
[17:37:30] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds
[17:38:58] New review: Hashar; "Stepping out of this change. Reedy I think you want to abandon that change." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/27830
[17:39:36] RECOVERY - Puppet freshness on srv298 is OK: puppet ran at Tue Nov 20 17:39:19 UTC 2012
[17:39:38] Change abandoned: Reedy; "THINK OF ALL MY HARD WORK TO DO THIS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27830
[17:40:03] RECOVERY - NTP on srv296 is OK: NTP OK: Offset -0.03867936134 secs
[17:41:15] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds
[17:50:15] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds
[17:54:00] RECOVERY - NTP on srv297 is OK: NTP OK: Offset -0.03966116905 secs
[17:55:21] RECOVERY - Puppet freshness on srv299 is OK: puppet ran at Tue Nov 20 17:55:09 UTC 2012
[17:56:24] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds
[18:01:58] New review: Dereckson; "Per bug 42280 comment 3, this change only concerns fa.wikipedia and not other Persian projects, whic..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34300
[18:02:33] RECOVERY - NTP on srv285 is OK: NTP OK: Offset 0.09365105629 secs
[18:03:27] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK HTTP/1.1 200 OK - 456 bytes in 0.012 seconds
[18:05:42] RECOVERY - NTP on srv283 is OK: NTP OK: Offset -0.03600215912 secs
[18:09:54] RECOVERY - NTP on srv298 is OK: NTP OK: Offset -0.04400169849 secs
[18:11:06] RECOVERY - Puppet freshness on srv300 is OK: puppet ran at Tue Nov 20 18:10:58 UTC 2012
[18:11:42] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds
[18:17:15] RECOVERY - NTP on srv286 is OK: NTP OK: Offset 0.06504952908 secs
[18:17:51] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds
[18:22:48] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[18:26:15] RECOVERY - NTP on srv300 is OK: NTP OK: Offset 0.04560732841 secs
[18:26:24] RECOVERY - NTP on srv299 is OK: NTP OK: Offset -0.0379383564 secs
[18:27:09] RECOVERY - Puppet freshness on srv301 is OK: puppet ran at Tue Nov 20 18:27:02 UTC 2012
[18:28:03] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds
[18:28:48] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[18:30:00] RECOVERY - NTP on srv287 is OK: NTP OK: Offset -0.1149597168 secs
[18:32:15] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds
[18:49:12] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds
[18:58:03] RECOVERY - NTP on srv301 is OK: NTP OK: Offset -0.04851830006 secs
[18:58:38] New patchset: Jgreen; "logrotate config for aluminium/grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34343
[19:00:18] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34343
[19:01:30] RECOVERY - NTP on srv288 is OK: NTP OK: Offset -0.05046093464 secs
[19:04:25] New patchset: Jgreen; "fix owner/privs for fundraising logrotate conf file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34344
[19:04:46] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34344
[19:06:09] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[19:16:30] RECOVERY - NTP on srv289 is OK: NTP OK: Offset -0.03718757629 secs
[19:23:22] New patchset: Dzahn; "move misc::graphite out of misc-servers into misc/graphite.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34350
[19:26:28] New review: Dzahn; "how about misc::noc-wikimedia? it sets up the apache site for graphite on noc, but is specific. so i..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34350
[19:29:07] New review: Dzahn; "and how about misc::noc-wikimedia? it sets up the apache site for graphite on noc, but is specific. ..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/34350
[19:29:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34350
[19:31:21] RECOVERY - NTP on srv205 is OK: NTP OK: Offset -0.04544866085 secs
[19:32:25] New patchset: Ori.livneh; "Enable event logging for mobile beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32864
[19:38:25] New patchset: Dzahn; "move misc::nfs-server::home / misc::nfs-server::home:rsync from misc-servers to nfs.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34353
[19:41:52] New review: Dzahn; "this is also to get stuff moved out of misc-servers." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34353
[19:42:45] !log scribbled over /dev/sda partition table on ms-be7 so I could watch the install break, left in shell form installer
[19:42:52] Logged the message, Master
[19:43:08] !change 34353 | apergos
[19:43:08] apergos: https://gerrit.wikimedia.org/r/#q,34353,n,z
[19:43:39] !change 33511 | apergos
[19:43:39] apergos: https://gerrit.wikimedia.org/r/#q,33511,n,z
[19:44:09] (it's half just fyi, and half a question where to move stuff)
[19:45:24] looks fine to me (download.pp)
[19:46:12] misc::nfs-server::home::rsyncd -> nfs.pp
[19:46:29] don't have much to say about that one
[19:46:30] but class misc::images::rsyncd -> ? rsync.pp ?
[19:46:35] ok
[19:47:04] as usual mark may have other thoughts about how to organize this stuff
[19:47:33] so I think it would be good to collect the rsync related stuff somehow. but we also have an rsync module now that I'm not totally in love with
[19:47:42] yea, but there seems to be consensus for moving things out of misc-servers.pp one way or another
[19:48:12] cool
[19:48:15] and its likely that some are not used or should be used in role classes etc etc
[19:48:29] so i just consider this the first step.. move existing stuff out of one file..then see
[19:48:42] makes sense to me
[19:48:59] and if you move em out then someone else can always decide they want to move some bit to a different location later
[19:49:39] exactly, at least it makes it visible and easier to see, unlike searching spaghetti misc-servers.pp
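A minimal sketch of the recon step for this kind of refactor, assuming a local operations/puppet checkout; the class name is the one from the change above, and the commands are illustrative rather than the documented procedure:

    # find every manifest that still references the class, so includes don't break on the move
    git grep -l 'misc::graphite' manifests/
    # gauge how much is still tangled up in the monolithic file
    grep -c '^class ' manifests/misc-servers.pp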
[19:50:12] heh heh
[19:50:53] ok I have my copy of the install syslog for tomorrow morning's reading
[19:51:16] having stared at all the relevant scripts and pretty sure of the few code paths that it can be
[19:51:20] (for ms-be7)
[19:51:37] this will just let me verify it and think about how we can work around it without it majorly sucking
[19:53:33] RECOVERY - NTP on srv265 is OK: NTP OK: Offset -0.04022908211 secs
[19:53:51] RECOVERY - NTP on srv213 is OK: NTP OK: Offset -0.03467869759 secs
[19:54:36] RECOVERY - NTP on srv277 is OK: NTP OK: Offset -0.05993068218 secs
[19:57:01] apergos: it's gonna be a real PITA
[19:58:07] we'll see
[19:58:42] you can just leave it at the shell for now, I'm going to check it out tomorrow now that I know that is going on underneath the hood
[20:01:20] okay
[20:01:20] New patchset: Ryan Lane; "Add salt to production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34356
[20:02:43] afk for the night (it's already late)
[20:02:47] have a good rest of the day
[20:03:26] New patchset: Kaldari; "Allowing Commons sysops to use flickr uploading via UploadWizard" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34357
[20:03:54] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[20:04:38] New patchset: Kaldari; "Allowing Commons sysops to use flickr uploading via UploadWizard" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34357
[20:05:43] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34357
[20:08:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34356
[20:09:18] !log kaldari synchronized wmf-config/InitialiseSettings.php 'turning on flickr uploading for sysops on Commons'
[20:09:24] Logged the message, Master
[20:19:14] New patchset: Ryan Lane; "Add master_finger configuration option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34359
[20:19:14] New patchset: Ryan Lane; "Add master_finger fingerprint for salt masters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34360
[20:21:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34359
[20:21:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34360
[20:21:36] PROBLEM - Apache HTTP on srv289 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error
[20:23:32] !log running change_tag index migrations on the 757 wikis with <7k ct rows
[20:23:39] Logged the message, Master
[20:23:44] -_-
[20:23:56] well, *that* puppet change didn't work :(
[20:26:37] oh puppet. I hate you so
[20:26:44] we've got problem on srv289: require(/usr/local/apache/common-local/php-1.21wmf4/includes/WebStart.php) [function.require]: failed to open stream: Permission denied in /usr/local/apache/common-local/php-1.21wmf4/index.php on line 55
[20:26:52] notpeter: ^^^^
[20:26:57] yes
[20:27:03] that specific one is mildly fucked
[20:27:08] woo
[20:27:14] but it's alos depooled
[20:27:18] *also
[20:27:26] depooling fail?
[20:27:52] 127 errors among latest 1000 in apache.log
[20:28:09] prob not a depooling fail then
[20:28:46] maxsem@fenari:~$ tail /home/wikipedia/syslog/apache.log
[20:29:09] Nov 20 20:27:58 10.0.8.39 apache2[2878]: PHP Warning: require(/usr/local/apache/common-local/php-1.21wmf4/includes/WebStart.php) blah blah
[20:29:29] yeah, thanks MaxSem
[20:29:37] depooled servers are still monitored by pybal and nagios
[20:30:16] 155/1000
[20:30:24] and soon more!
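The error-rate spot check being quoted ("127 among latest 1000", then "155/1000") is easy to reproduce as a one-liner; the log path appears verbatim above, while the grep pattern is an assumption about what was being counted:

    tail -n 1000 /home/wikipedia/syslog/apache.log | grep -c 'failed to open stream: Permission denied'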
[20:31:39] aaaa
[20:32:09] binasher: shit, what's happening?
[20:33:47] i depooled it so hard, it took ALL the traffic
[20:33:59] 1/0
[20:34:56] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:34:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[20:34:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[20:36:35] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.060 second response time
[20:37:56] !log switching payments back to eqiad
[20:38:03] Logged the message, Master
[20:38:38] Is Chris Steipp in the channel?
[20:38:50] PROBLEM - mysqld processes on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:38:51] PROBLEM - MySQL Slave Running on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:38:56] csteipp, ^^^
[20:39:08] PROBLEM - MySQL disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:39:29] guess i should have looked lol csteipp - what username do you want for OTRS?
[20:39:35] PROBLEM - MySQL Recent Restart on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:39:51] RD: FYI, seems like csteipp is out for lunch.
[20:39:56] OK
[20:40:01] Thanks
[20:44:32] PROBLEM - Host es1004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:45:10] New patchset: Demon; "Remove "resolves RT" feature that nobody uses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34362
[20:47:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34362
[20:59:47] !log depooling mw62-mw65 for upgrade to precise
[20:59:53] Logged the message, notpeter
[21:01:36] okay, mobile window
[21:01:47] New patchset: Pyoungmeister; "more apache upgrades" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34418
[21:02:51] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32864
[21:02:52] !log depooling mw17-mw19 and mw50-mw54 for upgrade to precise
[21:02:58] Logged the message, notpeter
[21:12:08] PROBLEM - Host mw18 is DOWN: PING CRITICAL - Packet loss = 100%
[21:12:08] PROBLEM - Host mw17 is DOWN: PING CRITICAL - Packet loss = 100%
[21:12:26] PROBLEM - Host mw63 is DOWN: PING CRITICAL - Packet loss = 100%
[21:12:35] PROBLEM - Host mw64 is DOWN: PING CRITICAL - Packet loss = 100%
[21:13:56] PROBLEM - Host mw19 is DOWN: PING CRITICAL - Packet loss = 100%
[21:15:26] PROBLEM - SSH on mw62 is CRITICAL: Connection refused
[21:15:35] PROBLEM - Host mw52 is DOWN: PING CRITICAL - Packet loss = 100%
[21:15:53] PROBLEM - Host mw53 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:29] PROBLEM - Apache HTTP on mw62 is CRITICAL: Connection refused
[21:17:50] RECOVERY - Host mw18 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[21:17:51] RECOVERY - Host mw17 is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms
[21:18:08] RECOVERY - Host mw63 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[21:18:17] RECOVERY - Host mw64 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[21:18:35] PROBLEM - Apache HTTP on mw65 is CRITICAL: Connection refused
[21:18:44] PROBLEM - Apache HTTP on mw51 is CRITICAL: Connection refused
[21:18:53] PROBLEM - SSH on mw65 is CRITICAL: Connection refused
[21:19:20] PROBLEM - SSH on mw51 is CRITICAL: Connection refused
[21:19:38] PROBLEM - Memcached on mw51 is CRITICAL: Connection refused
[21:19:38] RECOVERY - Host mw19 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[21:19:56] PROBLEM - Apache HTTP on mw54 is CRITICAL: Connection refused
[21:20:14] PROBLEM - Memcached on mw54 is CRITICAL: Connection refused
[21:20:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34418
[21:21:08] PROBLEM - SSH on mw54 is CRITICAL: Connection refused
[21:21:26] RECOVERY - Host mw52 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[21:21:44] PROBLEM - Apache HTTP on mw17 is CRITICAL: Connection refused
[21:21:44] PROBLEM - SSH on mw63 is CRITICAL: Connection refused
[21:21:44] RECOVERY - Host mw53 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms
[21:22:11] PROBLEM - Apache HTTP on mw64 is CRITICAL: Connection refused
[21:22:20] PROBLEM - Apache HTTP on mw63 is CRITICAL: Connection refused
[21:22:20] PROBLEM - Memcached on mw17 is CRITICAL: Connection refused
[21:22:29] PROBLEM - SSH on mw17 is CRITICAL: Connection refused
[21:22:29] PROBLEM - Memcached on mw18 is CRITICAL: Connection refused
[21:22:38] PROBLEM - SSH on mw18 is CRITICAL: Connection refused
[21:22:39] PROBLEM - Apache HTTP on mw18 is CRITICAL: Connection refused
[21:22:39] PROBLEM - SSH on mw64 is CRITICAL: Connection refused
[21:22:54] RD: I'm back
[21:23:41] PROBLEM - Memcached on mw19 is CRITICAL: Connection refused
[21:23:41] PROBLEM - Apache HTTP on mw19 is CRITICAL: Connection refused
[21:23:59] PROBLEM - SSH on mw19 is CRITICAL: Connection refused
[21:24:55] csteipp: Hi - was going to create your account, just need a username
[21:25:10] RD: Oh great! csteipp
[21:25:20] PROBLEM - Memcached on mw53 is CRITICAL: Connection refused
[21:25:38] PROBLEM - Memcached on mw52 is CRITICAL: Connection refused
[21:25:56] PROBLEM - SSH on mw52 is CRITICAL: Connection refused
[21:26:05] PROBLEM - SSH on mw53 is CRITICAL: Connection refused
[21:26:14] PROBLEM - Apache HTTP on mw52 is CRITICAL: Connection refused
[21:26:32] PROBLEM - Apache HTTP on mw53 is CRITICAL: Connection refused
[21:26:41] RECOVERY - SSH on mw63 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:26:41] RECOVERY - SSH on mw62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:26:59] RECOVERY - SSH on mw65 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:27:10] Can some gerrit op abandon this: https://gerrit.wikimedia.org/r/#/c/28296/
[21:27:17] RECOVERY - SSH on mw19 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:27:17] RECOVERY - SSH on mw17 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:27:23] and this https://gerrit.wikimedia.org/r/#/c/28944/
[21:27:26] RECOVERY - SSH on mw18 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:27:35] RECOVERY - SSH on mw64 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:27:36] these 2 are broken for a long time, and are not in master, incorrect submission to branch.
[21:28:54] Krinkle: done
[21:28:56] RECOVERY - SSH on mw51 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:29:14] RECOVERY - SSH on mw52 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:29:23] RECOVERY - SSH on mw53 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:30:44] RECOVERY - SSH on mw54 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:34:29] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds
[21:34:38] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds
[21:34:47] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds
[21:36:36] All done csteipp
[21:36:39] You have a couple emails
[21:36:45] RD: Thank you!
[21:36:53] PROBLEM - NTP on mw62 is CRITICAL: NTP CRITICAL: Offset unknown
[21:38:26] No prob
[21:38:42] PROBLEM - NTP on mw65 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:39:00] PROBLEM - NTP on mw51 is CRITICAL: NTP CRITICAL: Offset unknown
[21:41:44] PROBLEM - NTP on mw17 is CRITICAL: NTP CRITICAL: Offset unknown
[21:41:44] PROBLEM - NTP on mw18 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:43:23] PROBLEM - NTP on mw64 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:46:32] RECOVERY - NTP on mw51 is OK: NTP OK: Offset 0.04408299923 secs
[21:46:41] PROBLEM - NTP on mw53 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:47:35] RECOVERY - NTP on mw62 is OK: NTP OK: Offset 0.07466650009 secs
[21:48:11] RECOVERY - NTP on mw17 is OK: NTP OK: Offset -0.005201101303 secs
[21:48:11] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds
[21:48:56] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds
[21:49:23] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 2.996 seconds
[21:50:26] PROBLEM - NTP on mw63 is CRITICAL: NTP CRITICAL: Offset unknown
[21:51:29] PROBLEM - NTP on mw19 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:53:08] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:53:08] PROBLEM - NTP on mw52 is CRITICAL: NTP CRITICAL: Offset unknown
[21:54:47] PROBLEM - NTP on mw54 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:56:08] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours
[21:59:54] RECOVERY - NTP on mw18 is OK: NTP OK: Offset -0.0756663084 secs
[22:00:03] RECOVERY - NTP on mw63 is OK: NTP OK: Offset 0.03009176254 secs
[22:00:47] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.012 seconds
[22:01:27] New patchset: Ottomata; "Removing apache analytics proxy configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34430
[22:01:59] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.001 seconds
[22:02:17] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds
[22:03:58] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34430
[22:12:47] RECOVERY - NTP on mw19 is OK: NTP OK: Offset -0.05096757412 secs
[22:14:26] RECOVERY - NTP on mw53 is OK: NTP OK: Offset 0.02249991894 secs
[22:15:56] RECOVERY - NTP on mw52 is OK: NTP OK: Offset -0.01121246815 secs
[22:22:48] Can someone please fix the permissions on the 1.21wmf4 git objects directory please?
[22:22:53] chmod -R g+w /home/wikipedia/common/php-1.21wmf4/.git/objects
[22:23:12] maybe RoanKattouw
[22:23:22] I hear that guy likes to fix git permissions
[22:24:37] * RoanKattouw fixes
[22:25:10] Thanks
[22:25:36] Done
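Beyond the one-off chmod, git has a knob for keeping a shared checkout group-writable so new object files are created with the right mode in the first place. A hypothetical preventive step under that assumption, not what was actually run here:

    cd /home/wikipedia/common/php-1.21wmf4
    git config core.sharedRepository group   # future objects/packs get group write automatically
    chmod -R g+w .git/objects                # repair what already exists (as done above)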
[22:28:59] RECOVERY - NTP on mw64 is OK: NTP OK: Offset -0.009249091148 secs
[22:32:42] New patchset: Asher; "now only 30 wikis that need wgOldChangeTagsIndex = true, changing default to false [bug 40867]" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34435
[22:33:32] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34435
[22:34:14] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.014 seconds
[22:34:23] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds
[22:35:00] !log asher synchronized wmf-config/InitialiseSettings.php 'now only 30 wikis that need wgOldChangeTagsIndex = true, changing default to false'
[22:35:06] Logged the message, Master
[22:41:21] New patchset: Kaldari; "Turning off Flickr uploading on Commons until author bug is fixed." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34437
[22:41:40] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34437
[22:44:03] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment
[22:44:09] Logged the message, Master
[22:45:28] !log kaldari synchronized wmf-config/InitialiseSettings.php 'disabling experimental flickr uploading on commons until some bugs are fixed'
[22:45:34] Logged the message, Master
[22:56:47] !log reedy synchronized php-1.21wmf3/
[22:56:53] Logged the message, Master
[23:00:19] !log reedy synchronized php-1.21wmf4/
[23:00:27] Logged the message, Master
[23:01:59] RECOVERY - NTP on mw65 is OK: NTP OK: Offset -0.01026904583 secs
[23:01:59] RECOVERY - NTP on mw54 is OK: NTP OK: Offset -0.01452946663 secs
[23:03:54] notpeter: Do the apaches you're reinstalling have /mnt/upload6 mounted post install?
[23:05:03] Reedy, what are you pushing?
[23:05:16] Reedy: doesn't look like it
[23:05:17] why?
[23:05:25] nas1-a.pmtpa.wmnet:/vol/thumbs on /mnt/thumbs2 type nfs (rw,bg,intr)
[23:05:28] nas1-a.pmtpa.wmnet:/vol/originals on /mnt/upload7 type nfs (rw,bg,intr)
[23:05:50] 17 Warning: opendir(/mnt/upload6/private/ExtensionDistributor/mw-snapshot/trunk/extensions) [function.opendir]: failed to open dir:
[23:05:51] No such file or directory in /usr/local/apache/common-local/php-1.21wmf4/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80
[23:06:04] uuuuuuhhhhhh
[23:06:54] * Reedy hides from paravoid
[23:07:44] Reedy: what box is that?
[23:08:12] numerous
[23:08:12] i don't think anything should have upload6 mounted still
[23:08:27] yeah, I habeebed that that was deprecated at this point
[23:08:29] though i see it's still referenced in CommonSettings :/
[23:08:30] upload6 is still used by Extdist...
[23:08:39] oh good, paravoid is here
[23:08:42] binasher: I changed that over again
[23:08:46] as upload.wm.o isn't using nas
[23:08:56] we have no web server in front of the netapp
[23:09:00] so the files it was adding to NFS weren't available to grab via http
[23:09:19] and extdist wants appservers to write stuff to be immediately served by upload.wm.org
[23:09:31] and hasn't been ported to mw-store
[23:09:35] yay for legacy
[23:09:59] paravoid: did you redo nfs.pp?
[23:10:19] I broke it yesterday
[23:10:20] fixing
[23:10:27] 3048774239bb9e59d5cfb48e79aefdfebc31ee95
[23:10:44] I slipped an absent by mistake
[23:11:00] notpeter: nfs::upload looks like it would do the thing that needs doing
[23:11:15] yep
[23:11:19] I removed /mnt/thumbs
[23:11:47] and by accident unmounted upload6
[23:11:51] that was today
[23:12:07] ah, gotcha
[23:12:42] binasher: that's what I thought. and apparently I thought right!
[23:13:37] New patchset: Faidon; "Re-add upload6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34443
[23:13:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34443
[23:14:05] Reedy: apologies
[23:14:19] heh, don't worry about it
[23:14:23] thanks for fixing it
[23:14:26] no one has complained yet
[23:14:33] paravoid: yep! thank you :)
[23:15:01] I've asked apergos to look at that remaining ms7 cruft
[23:15:13] update the wiki page, then we can file a few bugzilla items
[23:15:35] like "port ext-dist to mw-store or getridofitpleaseprettyplease"
[23:16:14] I wonder how much work it would be..
[23:16:25] I think captcha is also like that
[23:16:44] haha, no one complains when captcha is broken ;)
[23:16:45] yeah...
[23:17:37] upload.wm.org means to me "user uploaded content" not "random file storage place where pieces of our infrastructure use"
[23:23:36] any ideas on how to properly dereference URLs with VCL? like /wikipedia/en/./foo or /wikipedia//en//foo
[23:23:55] ideally it should work with /../ too, so a regexp won't cut it
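The reason a regexp won't cut it: collapsing // and /./ is a plain per-match substitution, but /../ has to delete the preceding path segment, which takes iterating to a fixed point. A small shell sketch of that approach, for illustration only (not VCL; leading /../ and dot-prefixed segments are deliberately out of scope):

    normalize() {
      printf '%s\n' "$1" | sed -E -e ':a' \
        -e 's#//+#/#g; s#/\./#/#g; s#/[^/.][^/]*/\.\./#/#' \
        -e 'ta'   # loop back to :a until no substitution fired
    }
    normalize '/wikipedia//en/./foo/../bar'   # -> /wikipedia/en/bar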
[23:28:50] srv203 looks broken somehow
[23:29:31] 146703 messages from it in memcached-serious.log.. the next highest server has 634
[23:30:50] netstat -i doesn't show anything in the err cols, but tcp retrans keep climbing..
[23:30:58] LeslieCarr: can you check srv203's switch port?
[23:31:54] ok
[23:32:02] !log depooled srv203
[23:32:09] Logged the message, Master
[23:32:38] AaronSchulz: I remember you adding file backends support to captcha, but I don't see any captcha-related containers
[23:32:48] what's the status?
[23:33:02] binasher: yeah it has autonegotiated to 10Mbit
[23:33:42] glad to see it has good negotiating skills.
[23:33:58] binasher: i can imagine that would be saturated just a little bit quickly ;)
[23:34:18] paravoid: no one touched it lately, still using nfs
[23:35:05] AaronSchulz: okay, thanks.
[23:46:24] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34269
[23:47:07] A few people have written in within the past 18 hours about their NTP server/clients getting set to the year 2000.  The cause of this behavior is that an NTP server at the US Naval Observatory (pretty much the authoritative time source in the US) was rebooted and somehow reverted to the year 2000.  This, then, propagated out for a limited time and downstream time sources also got this value.  It's a transient problem and should alr
[23:47:14] lol.
[23:49:57] So that problem is affecting huge numbers of users all over the US, paravoid?
[23:50:52] that's what it says, but I haven't seen anything that affects us so far
[23:52:40] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment
[23:52:47] Logged the message, Master
[23:53:35] phewww
[23:54:07] paravoid: :)
[23:54:31] glad that the navy is so on top of their shit
[23:58:04] New patchset: Asher; "enabling http_stub_status_module on protoproxy nginx servers, for use with ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34447
[23:58:23] the us navy needs chronology protector
[23:59:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34447
[23:59:18] working on SSL binasher?