[00:20:33] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [00:27:36] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:27:36] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:28:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:33:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.024 seconds [01:07:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.942 seconds [01:47:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:58] PROBLEM - Disk space on mw61 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [01:52:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.303 seconds [02:28:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:33] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:45] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:34:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [02:36:03] wtf: http://ganglia.wikimedia.org/latest/?c=Bits%20caches%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [02:36:21] <^demon> The hell...? [02:36:39] can't ssh to either of them [02:36:43] ah. can now [02:36:54] checking for packet loss [02:37:15] none [02:40:01] not swapping [02:41:04] well, let's restart varnish on them [02:41:23] fuck, is it varnish, or varnish3? [02:42:55] seems it's varnish [02:42:58] ack! [02:43:46] why so much system cpu all of a sudden? [02:44:09] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 625 bytes in 0.267 seconds [02:44:16] !log restarted varnish on niobium and arsenic [02:44:20] Logged the message, Master [02:44:45] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 627 bytes in 0.143 seconds [02:44:56] well, that seemed to have helped for now. let's see if it dies again soon [02:45:11] i hope so! [02:45:30] you hope it dies again? ;) [02:45:36] hope it's fixed [02:45:38] heh [02:45:39] !!! [02:45:48] thought it was my sucky internet connection [02:46:05] if I have to take a guess, it's going to die again [02:46:18] it works now [02:46:56] I wonder if there was brief packet loss to pmtpa from eqiad [02:47:06] that could trigger this [02:47:26] it still doesn't look great: http://ganglia.wikimedia.org/latest/?c=Bits%20caches%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [02:48:05] i can't save to wikimania wiki [02:48:16] yikes! [02:48:59] it saved but took forever [02:50:02] ah. packet loss [02:50:18] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:14] paged mark and leslie [02:51:31] I'm going to move traffic to pmtpa [02:51:45] hm. 
I wonder if the bits cache there is totally cold [02:52:50] hey Ryan_Lane [02:52:54] hey [02:53:03] i think the SMS notifier is dead again [02:53:14] I'm considering moving bits traffic to pmtpa [02:53:23] but… it's possible the cache there is totally cold [02:53:30] ok [02:53:40] can you check into the packet loss? [02:53:44] trying to get into eqiad routers now [02:53:48] ok [02:53:56] !log moving bits traffic to pmtpa [02:53:59] Logged the message, Master [02:54:08] if this makes it worse I'll move it back [02:56:02] cache being doesn't really matter for bits [02:56:06] it's a tiny cache [02:56:13] ah. ok [02:56:16] so, I moved it [02:56:58] hrm, so not seeing the p-loss in my mtr's right now yet [02:57:13] packet loss may not be due to network, but due to the system being very overloaded [02:58:52] this is at least good timing as my Ryan just asked me to clean up around the house a bit ;) [03:00:19] so i see on arsenic a big network drop at 02:25 UTC [03:00:27] Feb 26 05:09:15 asw-b-eqiad chassisd[965]: CHASSISD_SNMP_TRAP6: SNMP trap generated: Power Supply failed (jnxContentsContainerIndex 2, jnxContentsL1Index 4, jnxContentsL2Index 2, jnxContentsL3Index 0, jnxContentsDescr Power Supply 1, jnxOperatingState/Temp 6) [03:00:31] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=arsenic.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1330311572&g=network_report&z=large&c=Bits%20caches%20eqiad [03:00:42] ah [03:00:57] that's no good [03:01:19] LeslieCarr: yeah, but it's also associated with a very high load [03:01:49] mark@asw-b-eqiad> show chassis alarms [03:01:49] 1 alarms currently active [03:01:49] Alarm time Class Description [03:01:50] 2011-12-15 18:46:15 UTC Major FPC 3 PEM 1 is not powered [03:02:00] that's a while ago though [03:02:23] holy shit, [03:02:24] root@niobium:~# uptime [03:02:24] 03:02:12 up 110 days, 9:39, 3 users, load average: 1.55, 277.47, 2274.78 [03:02:37] !log restarting varnish on arsenic [03:02:39] Logged the message, Master [03:03:02] mark: unless you want to strace the process... [03:03:25] not really [03:03:31] * Ryan_Lane nod [03:04:11] it's pretty annoying that once varnish starts crapping itself that it won't recover [03:04:14] and neither one is on that switch, at least :) [03:04:15] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 627 bytes in 0.062 seconds [03:06:02] huh? [03:06:05] one of the two should be [03:06:26] niobium is [03:06:43] on that stack, but on rack 4, not 3 [03:06:49] was thinking physical single switch [03:06:54] sure [03:07:06] but any one switch can affect the entire stack of course [03:07:13] yeah [03:07:18] I was saying that because observium is seeing asw-b-eqiad flapping a lot [03:07:38] and with varnish, any packet loss whatsoever will cause carnage [03:07:57] even if it's brief [03:08:33] ok. eqiad is happy again [03:08:43] want me to move traffic back, or leave it in pmtpa for a while? [03:09:04] leave it there [03:09:07] * Ryan_Lane nods [03:09:39] whatever went wrong over there will likely happen again [03:09:51] yes and it's nicely on 4 hosts in pmtpa [03:09:59] yep [03:10:02] i'm deploying upload on bits in eqiad this week [03:10:06] i'll setup two more hosts as well [03:10:09] er [03:10:12] upload on varnish in eqiad [03:10:15] i'm not fully awake yet :P [03:10:17] oh. cool [03:10:19] heh [03:10:42] ah. 
crap it's 4am there [03:10:52] I should have checked before I paged you [03:11:22] it's ok [03:14:26] so i'm not seeing errors on the links … i think we're good for now [03:14:28] so one of the two netapp controllers in eqiad seems down [03:14:50] that is kinda weird [03:14:55] indeed [03:15:02] but since it's unused, I kinda don't wanna look at it right now [03:16:49] i'm gonna get going and finish cooking dinner [03:16:56] yeah. me too [03:17:01] i'll go back to sleep ;) [03:17:04] nice to have redundancy! [03:17:06] g'night :) [03:17:08] yes it is ! [03:17:10] good night [03:17:14] goodnight! [03:17:15] mark: night! sorry for the page ;) [03:17:21] no problem [03:17:24] see ya [03:18:21] PROBLEM - MySQL disk space on db12 is CRITICAL: DISK CRITICAL - free space: / 287 MB (3% inode=90%): [03:19:33] PROBLEM - Disk space on db12 is CRITICAL: DISK CRITICAL - free space: / 287 MB (3% inode=90%): [03:34:05] PROBLEM - Disk space on srv249 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [03:38:24] I'm looking for Russ Nelson. is he in here? [03:38:49] I need to add his email address to the openstack swift authors file, and I don't have it [03:53:42] hi notmyname [03:53:48] hi [03:54:00] know how I can contact russ? [03:54:18] i'm pinging him... not sure if he's available now [03:54:40] his nick is nelson [03:54:52] thanks. I can camp in here (yay, IRC bouncers!) and you can ping me when you have info [03:55:37] ok [03:59:51] he seems not around right now [04:00:30] no worries [04:01:33] ok [05:03:40] RECOVERY - Disk space on db12 is OK: DISK OK [05:03:49] RECOVERY - MySQL disk space on db12 is OK: DISK OK [06:45:20] RECOVERY - Disk space on srv249 is OK: DISK OK [06:48:02] RECOVERY - Disk space on mw61 is OK: DISK OK [08:07:05] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 16.3681021053 (gt 8.0) [08:07:42] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [08:12:08] RECOVERY - udp2log processes on locke is OK: OK: all filters present [08:24:08] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [08:30:08] RECOVERY - udp2log processes on locke is OK: OK: all filters present [08:32:59] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [08:38:59] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [08:38:59] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [08:56:41] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.90896460177 [09:29:09] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours [10:21:35] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [10:29:32] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:29:32] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [13:32:05] PROBLEM - Disk space on mw61 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [14:09:18] apergos: some debugging process on mw61 is eating all disk space :-) [14:09:38] * apergos guess gmond at random [14:10:12] well I am just guessing [14:10:33] cause of the debugfs filename in nagios notification [14:15:17] nah [14:15:37] looks like there's just enough php warnings in 
the logs that it's getting fuller [14:15:44] I wish we made these root partitions a bit bigger [14:17:35] mw61 apache2[7296]: PHP Warning: filemtime() [function.filemtime]: stat failed for /usr/local/apache/common-local/php-1.17/extensions/WikiEditor/modules/./images/toc/close.png in /usr/local/apache/common-local/php-1.18/includes/resourceloader/ResourceLoaderFileModule.php on line 380 [14:17:41] lots of messages like these [14:19:00] I gzipped messages.1 and syslog.1 a bit ahead of schedule [14:19:27] I wonder what is running 1.17 [14:19:56] RECOVERY - Disk space on mw61 is OK: DISK OK [14:21:09] apergos: /var should probably be a different partition :-) [14:21:17] so whenever it fills up, you don't have the whole system going foobar [14:21:28] just like /tmp [14:21:49] sure, but even better would be to have some room for logs that have a lot of extra crap in them [14:21:54] instead of pretending that's never going to happen [14:22:00] do you have the date of such messages? [14:22:07] today [14:22:11] dooh [14:22:24] the 1.17 errors are related to some cache somewhere [14:22:31] well [14:22:42] that need to be cleared. Roan said he cleared them all but that does not seem to be the case :-( [14:24:02] guess not [14:24:12] yeah, will have to ping him about it [14:24:58] in the meantime I bought you guys a little more time [14:25:49] ohhhh [14:25:57] STUPID ROAN EMIGRATED TO USA !!!!!!!!!!!!!!!!!! [14:26:05] we can't join him anymore during the day :-( [14:26:17] :-D [14:26:23] he'll be around later [14:29:05] really? bugzilla? [14:29:08] https://bugzilla.wikimedia.org/show_bug.cgi?id=34752 [14:29:09] yeah [14:29:16] easier than email [14:29:20] I mean, I'll be on here when people are on later [14:29:27] we have an ops meeting (*grumble*) [14:29:41] we got platform engineering on monday too [14:29:47] 11pm- midnight local time .. [14:29:48] oh yeah? [14:29:51] ugh [14:30:13] you need to have more people living over here so they can speak up about the time slo [14:30:14] t [14:31:01] Rob is the director. Is team in SF is made of only one person (Aaron) :-)))))))))))))))))) [14:31:23] so that was the compromise time? eeewww [14:31:40] where are you guys all then? [14:31:47] then all the other are Remote workers [14:32:04] Tim in Australia, Sam Reed in UK, me in France, Chad somewhere in the U.S.A [14:32:34] + non MediaWiki folks, Chris McMahon in USA somewhere, the analytic team being in N/Y and NL [14:32:37] that's a lot of timezones [14:32:38] well that is a bit messy [14:33:02] anyway, the MediaWiki weekly meeting is at 8am for Tim, 11pm for me, 10pm for Reedy and in the afternoon for US fok [14:33:19] any other hour mean it will be too early for tim or to late for me [14:33:26] you and tim are the problem, clearly [14:33:32] OR  at night for the USA folks [14:33:36] one of you will have to get reassigned :-P [14:33:52] just need to migrate all the american staff back in Europe [14:33:54] where they belong [14:34:27] :-D [14:34:28] http://en.wikipedia.org/wiki/European_colonization_of_the_Americas \o/ [14:35:06] you know, that argument only works for americans of european descent :-P [14:35:31] the other one will be back to Mexico then back to Spain :-D [14:35:51] african americans [14:36:16] * hashar realize that all of sudden americas will be free from human being 8) [14:36:17] and what about asian americans? 
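A minimal sketch of the stopgap described above for mw61: when the root partition fills up with PHP-warning spam, compress the already-rotated syslog files a little ahead of logrotate's schedule and check how much room that buys. The paths are the stock Ubuntu ones and are an assumption about the host's layout.

    # Check free space on /, compress rotated-but-not-yet-compressed logs, recheck.
    df -h /
    gzip -v /var/log/messages.1 /var/log/syslog.1
    df -h /

This only buys time until the next rotation; the longer-term fixes discussed above are clearing the stale php-1.17 cache references and giving /var (or /) more headroom.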
[14:36:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.8520634211 (gt 8.0) [14:36:39] no, the native americans will have a bunch of land again [14:39:19] anyway [14:39:39] apergos: I have been ranting about swift logs polluting the main syslog file [14:39:59] I thought that got worked out [14:40:01] I have did a change for syslog-ng configuration https://gerrit.wikimedia.org/r/#change,2673 [14:40:04] anyways these weren't swift eerrors [14:40:07] ben reviewed it [14:40:20] ok [14:40:21] so we might have a solution approaching soon :-) [14:40:26] good for that! [14:43:10] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.23771982301 [14:49:01] New patchset: Hashar; "send Swift syslogs to their own file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2673 [15:11:22] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.13589707965 (gt 8.0) [16:12:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.1901229204 (gt 8.0) [16:36:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.2623781579 (gt 8.0) [16:49:35] apergos: do we have a list of server fingerprint somewhere ? [16:50:04] I mean the ssh finger print [16:50:45] I don;t know of one [16:55:17] !log blog updated to newest version [16:55:19] Logged the message, RobH [16:57:03] New patchset: Hashar; "modifying testfile again" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2791 [16:58:27] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/27/ (1/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2791 [16:58:28] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/39/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2791 [17:01:50] New patchset: RobH; "require unzip for blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2792 [17:03:02] New review: RobH; "need unzip package to update blog plugins" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2792 [17:03:03] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2792 [17:03:21] aude: did you get in contact with russ? 
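No central list of SSH host-key fingerprints turns up in the exchange above, so the following is only an illustrative way to build one with stock OpenSSH tools: run it on each server (via dsh or a plain ssh loop) and collect the output. The key path assumes the default OpenSSH layout.

    # Print this host's name and the fingerprint of its RSA host key.
    hostname -f
    ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub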
[17:04:39] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2791 [17:04:39] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2791 [17:05:57] New patchset: Hashar; "add 'topic' feature" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2793 [17:06:58] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3481768421 (gt 8.0) [17:07:07] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/40/ (1/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2793 [17:07:07] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/28/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2793 [17:07:07] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2793 [17:07:08] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2793 [17:10:58] !log blog plugins updated, blog puppet config updated to support unzip package [17:11:00] Logged the message, RobH [17:11:27] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on srv199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:28] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:28] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:29] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:04] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:04] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:05] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:05] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:07] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:07] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:08] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:08] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:09] PROBLEM - Apache HTTP on srv233 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [17:12:09] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:10] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:10] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:12] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:21] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:12:22] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:22] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:47] wt... [17:12:48] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:57] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:57] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:57] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:58] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:58] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:58] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:07] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - 
Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:16] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:16] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:17] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:17] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:18] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:18] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:19] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:19] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:34] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:34] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:35] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:35] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:36] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:36] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:37] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:37] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:38] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:38] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:39] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:39] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:40] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:40] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on 
srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Frontend Squid HTTP on cp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:09] New patchset: RobH; "hooper is no longer a blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2794 [17:14:09] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:11] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:11] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:18] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:18] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:27] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:27] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2794 [17:15:27] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2794 [17:16:24] PROBLEM - Router interfaces on br1-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.245 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [17:17:18] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:36] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:36] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:45] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] 
RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.794 second response time [17:18:12] PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:21] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:21] RECOVERY - Router interfaces on br1-knams is OK: OK: host 91.198.174.245, interfaces up: 10, down: 0, dormant: 0, excluded: 0, unused: 0 [17:18:39] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:57] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.582 second response time [17:18:57] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.301 second response time [17:19:06] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.766 second response time [17:19:15] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:42] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.894 second response time [17:20:45] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.934 second response time [17:20:45] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.416 second response time [17:21:03] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.181 second response time [17:21:09] DDoS in progress, please be silent [17:21:12] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.978 second response time [17:21:30] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.047 second response time [17:21:39] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:21:39] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.656 second response time [17:21:39] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.156 second response time [17:21:39] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.356 second response time [17:21:39] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.815 second response time [17:21:40] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.663 second response time [17:21:48] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.700 second response time [17:21:57] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.825 second response time [17:22:06] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.171 second response time [17:22:24] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.655 second response time [17:22:25] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.624 second response time [17:22:25] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.842 second response time [17:22:25] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.554 second response time [17:22:33] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.467 second response time [17:22:42] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 
Moved Permanently - 0.024 second response time [17:22:42] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:22:42] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:22:42] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.982 second response time [17:22:42] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.289 second response time [17:22:51] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:00] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [17:23:00] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [17:23:00] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:00] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [17:23:00] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.467 second response time [17:23:01] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:23:01] RECOVERY - Apache HTTP on srv227 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:23:01] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:23:02] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.018 second response time [17:23:02] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:04] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:23:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.242 second response time [17:23:04] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.849 second response time [17:23:18] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [17:23:36] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:36] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:23:36] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:23:36] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:36] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:23:37] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:23:37] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.130 second response time [17:23:45] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:23:45] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [17:23:46] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response 
time [17:23:46] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:23:46] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:23:46] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:23:46] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:23:54] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.018 second response time [17:23:54] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [17:24:03] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.188 second response time [17:24:03] RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.874 second response time [17:24:03] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.585 second response time [17:24:12] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [17:24:12] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:24:13] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:24:13] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [17:24:13] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:24:21] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.757 second response time [17:24:21] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.880 second response time [17:24:21] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.992 second response time [17:24:30] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64220 bytes in 0.114 seconds [17:24:31] RECOVERY - Apache HTTP on srv199 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:24:31] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.163 second response time [17:24:31] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.173 second response time [17:24:39] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.049 second response time [17:24:40] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:24:40] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.883 second response time [17:24:40] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.244 second response time [17:24:40] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.234 second response time [17:24:48] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.868 second response time [17:24:58] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [17:24:58] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.362 second response time [17:24:58] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - 
HTTP/1.1 301 Moved Permanently - 0.467 second response time [17:24:58] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.805 second response time [17:24:58] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.315 second response time [17:25:06] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.562 second response time [17:25:06] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.117 second response time [17:25:15] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:25:15] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [17:25:15] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.021 second response time [17:25:15] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.358 second response time [17:25:15] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.736 second response time [17:25:16] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.673 second response time [17:25:24] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.954 second response time [17:25:24] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.047 second response time [17:25:33] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.896 second response time [17:25:33] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.013 second response time [17:25:34] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.887 second response time [17:25:34] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.928 second response time [17:25:42] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:25:42] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [17:25:42] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.984 second response time [17:25:42] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.284 second response time [17:25:51] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:25:52] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:26:00] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:26:00] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:26:00] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [17:26:00] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.165 second response time [17:26:00] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.510 second response time [17:26:00] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.202 second response time [17:26:09] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.02932451327 [17:26:09] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 
301 Moved Permanently - 0.032 second response time [17:26:09] RECOVERY - Frontend Squid HTTP on cp1018 is OK: HTTP OK HTTP/1.0 200 OK - 27672 bytes in 0.161 seconds [17:26:14] !log Denying POST / requests on frontend squids [17:26:17] Logged the message, Master [17:26:18] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.477 second response time [17:26:36] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:26:36] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:26:36] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.163 second response time [17:26:36] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.498 second response time [17:26:36] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [17:26:37] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:26:37] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:26:38] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [17:26:38] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.852 second response time [17:26:39] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.123 second response time [17:27:03] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.529 second response time [17:27:12] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [17:27:21] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.176 second response time [17:35:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2673 [17:35:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2673 [17:37:49] May this somehow be related to the morning bits outage? [17:39:08] vvv: is bits fixed? [17:40:21] jeremyb: apparently [17:40:59] IIRC it was switched from equid to pmtpa [17:42:03] eqiad* [17:42:19] yep [17:52:06] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.15823333333 (gt 8.0) [17:53:09] PROBLEM - Disk space on srv192 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [17:53:23] New review: Mark Bergsma; "Please just make that whole list one single file resource, with "recurse => remote"... and use mode ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2782 [17:55:48] New review: Ottomata; "(no comment)" [analytics/reportcard] (andre/mobile); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2770 [17:56:10] New review: Ottomata; "(no comment)" [analytics/reportcard] (andre/mobile); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2770 [18:00:12] RECOVERY - MySQL Slave Running on db34 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [18:01:36] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2788 [18:04:15] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 1102134 seconds [18:27:48] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.55816752212 (gt 8.0) [18:34:37] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [18:40:37] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [18:40:37] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [18:51:25] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 13.3243047368 (gt 8.0) [18:57:01] Ryan_Lane: Do you know if the ops meeting is happening? Or is everyone too busy plugging and unplugging network cables? [18:57:14] oh [18:57:18] the internets went down [18:57:25] oh. it's going on now! [18:57:34] What # do I call? [18:57:44] it should be in like 2 minute [18:57:44] s [18:57:47] 2002 [18:57:51] x2002 [18:58:27] Actually, the extension is the part that I know. [18:58:33] d'you know what # that's an extension for? [18:58:37] ah [18:58:55] +1-415-839-6885 [18:59:24] thanks! [18:59:36] yw [19:07:08] New patchset: Asher; "s5 master changing from db45 to db35" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2795 [19:23:22] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.85951903509 [19:30:34] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours [19:31:19] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3746984211 (gt 8.0) [19:42:53] !log adjusted threshholds for ps1-b4-sdtpa.mgmt.pmtpa.wmnet again, bottom sensor set to high [19:42:56] Logged the message, RobH [20:11:20] http://gdash.wikimedia.org/dashboards/reqerror/ [20:11:48] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.57119877193 (gt 8.0) [20:22:54] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [20:24:17] i'm getting ready to switch the s5 (dewiki) mysql master [20:30:14] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.4255640708 [20:31:00] !log new s5 master pos - MASTER_LOG_FILE='db35-bin.000011', MASTER_LOG_POS=374074061 [20:31:03] Logged the message, Master [20:31:52] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:31:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:32:37] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2795 [20:32:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2795 [20:33:20] PROBLEM - Disk space on srv249 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [20:34:37] New review: 
Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2746 [20:34:39] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2746 [20:34:55] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 513 seconds [20:35:05] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: CRIT replication delay 517 seconds [20:35:05] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 518 seconds [20:35:19] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 529 seconds [20:35:28] maplebed: aforementioned corrupt image: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/MediaWiki_flyer_student2012.svg/2000px-MediaWiki_flyer_student2012.svg.png [20:35:40] tnx. [20:36:30] Ryan_Lane: can you look at https://gerrit.wikimedia.org/r/#change,2748 and tell me whether that approach will work as intended? [20:36:58] yeah, it should be fine [20:37:06] it's a totally new account, and his old account is being deleted [20:37:13] ok, I'll merge. [20:37:16] ok [20:37:21] tnx. [20:37:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2748 [20:37:38] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2748 [20:38:01] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay seconds [20:38:01] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay seconds [20:38:09] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay seconds [20:38:18] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [20:38:41] hii [20:38:55] ottomata: two questions about https://gerrit.wikimedia.org/r/#change,2754 [20:39:06] first, will 'provider => pip' work? [20:39:16] second, do you have to get the mysql-python stuff from pip instead of the debian package? [20:39:39] RECOVERY - MySQL Recent Restart on db1006 is OK: OK seconds since restart [20:39:48] RECOVERY - RAID on db1006 is OK: OK: State is Optimal, checked 2 logical device(s) [20:39:48] RECOVERY - Host db1006 is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [20:39:57] RECOVERY - SSH on db1006 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:39:58] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay seconds [20:40:06] RECOVERY - MySQL Slave Delay on db1006 is OK: OK replication delay seconds [20:40:07] RECOVERY - DPKG on db1006 is OK: All packages OK [20:40:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:42] RECOVERY - MySQL Slave Running on db1006 is OK: OK replication [20:40:42] RECOVERY - Disk space on db1006 is OK: DISK OK [20:40:51] RECOVERY - Full LVS Snapshot on db1006 is OK: OK no full LVM snapshot volumes [20:40:51] RECOVERY - MySQL disk space on db1006 is OK: DISK OK [20:41:18] RECOVERY - MySQL Idle Transactions on db1006 is OK: OK longest blocking idle transaction sleeps for seconds [20:42:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.031 seconds [20:42:53] !log powercycled db1006 after finding nothing on the serial console. booted without issue, then started mysql. [20:42:55] Logged the message, Master [20:45:06] New patchset: Siebrand; "Break long line." 
[mediawiki/core] (master) - https://gerrit.wikimedia.org/r/2796 [20:46:06] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CRIT replication delay 646809 seconds [20:46:06] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CRIT replication delay 646808 seconds [20:54:43] New patchset: QChris; "Migration to upstream git" [operations/dumps/test] (master) - https://gerrit.wikimedia.org/r/2797 [20:59:11] !log preparing to switch s6 master [20:59:13] Logged the message, Master [20:59:14] New patchset: Ottomata; "Made Dygraphs loader smarter. Observations now will not record instances of a trait_set if any of the trait properties are None. This allows transform callbacks to reject a trait based on its value." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2798 [20:59:34] maplebed: commit amended. [21:00:11] * maplebed looks [21:01:16] ottomata: did you convince yourself that 'provide => pip' will work or you want me to just try and see? [21:01:59] !log new s6 master pos - MASTER_LOG_FILE='db43-bin.000027', MASTER_LOG_POS=577074024 [21:02:02] Logged the message, Master [21:02:09] PROBLEM - Disk space on srv248 is CRITICAL: DISK CRITICAL - free space: / 283 MB (3% inode=63%): /var/lib/ureadahead/debugfs 283 MB (3% inode=63%): [21:02:11] i'm curious, at least. [21:02:36] i believe I amended in the comment commit [21:02:37] to remove that [21:02:53] i think the .dev will be less annoying for now [21:02:56] .deb* [21:02:57] !log db1006 (s6-secondary) is still slaving from db47 - it's very behind post hw failure. need to manually swap to db43 once caught up [21:03:00] Logged the message, Master [21:03:07] ottomata: the mysql stuff no longer comes from pip but the pywurfl stuff still does. [21:03:11] i'm not sure what you need to do in puppet to set that up (other than have pip installed) [21:03:16] oh right right [21:03:42] my guess is something like https://github.com/rcrowley/puppet-pip [21:03:46] New patchset: Asher; "new s6 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2799 [21:04:07] naw, the pip provider is now included in puppet [21:04:07] or [21:04:11] at least it is in the type ref [21:04:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2799 [21:04:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2799 [21:04:37] ottomata: I need to remove your account from stat1 with the same commit since you're ensuring it's absent elsewhere. [21:04:38] http://docs.puppetlabs.com/references/stable/type.html#package [21:04:51] ? [21:04:56] the commit should do that, no? [21:05:03] but, you can remove whatever you need to [21:05:04] i've hardly used it yet [21:05:08] nothing there I need to save [21:05:10] New review: Siebrand; "Blah." [mediawiki/core] (master) C: 1; - https://gerrit.wikimedia.org/r/2796 [21:06:03] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 333 seconds [21:06:12] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 340 seconds [21:06:23] New patchset: Ottomata; "Adding page_views_pipeline.py." 
[21:04:51] ?
[21:04:56] the commit should do that, no?
[21:05:03] but, you can remove whatever you need to
[21:05:04] i've hardly used it yet
[21:05:08] nothing there I need to save
[21:05:10] New review: Siebrand; "Blah." [mediawiki/core] (master) C: 1; - https://gerrit.wikimedia.org/r/2796
[21:06:03] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 333 seconds
[21:06:12] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 340 seconds
[21:06:23] New patchset: Ottomata; "Adding page_views_pipeline.py." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2800
[21:06:38] New patchset: Bhartshorne; "removing aotto from stat1 so it doesn't barf when it hits the account ensure => absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2801
[21:06:48] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 379 seconds
[21:06:57] ottomata: 2801 is what I meant.
[21:07:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2801
[21:07:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2801
[21:07:18] New review: Diederik; "OK." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2800
[21:07:47] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2798
[21:07:48] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2800
[21:07:48] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2798
[21:08:44] Ryan_Lane: I've gotten this error twice so far this morning when doing a 'git fetch' on sockpuppet: error: RPC failed; result=22, HTTP code = 503
[21:08:54] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay seconds
[21:09:24] maplebed: the gerrit server is being slaughtered
[21:09:30] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay seconds
[21:09:36] every mediawiki dev is trying to clone right now
[21:09:57] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay seconds
[21:10:01] New patchset: Bhartshorne; "Revert "removing aotto from stat1 so it doesn't barf when it hits the account ensure => absent"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2802
[21:10:06] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay seconds
[21:10:06] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay 0 seconds
[21:10:06] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay seconds
[21:15:19] New patchset: Asher; "pt-heartbeat should use REPLACE instead of UPDATE" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2803
[21:15:42] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2802
[21:15:42] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2802
[21:15:47] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2803
[21:15:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2803
[21:16:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:09] RECOVERY - Disk space on srv248 is OK: DISK OK
[21:20:53] * Jeff_Green looks forward to php-1.18 being dropped
[21:21:34] Change merged: QChris; [operations/dumps/test] (master) - https://gerrit.wikimedia.org/r/2797
[21:22:33] RECOVERY - Disk space on srv249 is OK: DISK OK
[21:22:52] Jeff_Green: why?
[21:23:36] it's huge and the apache boxes aren't partitioned well to handle the extra footprint
[21:24:03] /dev/sda1 7.4G 6.8G 254M 97% /
[21:24:26] 1.1G php-1.18
[21:24:26] 1.1G php-1.19
[21:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds
[21:26:53] oh, that makes sense. I didn't look beyond /usr/ holding all the stuff.
[21:27:43] ya, i only thought to look there because I ran into this the other day on some other host I now forget
[21:28:40] it would be really spiffy if php-1.18 and php-1.19 shared the l10n cache, I'm not sure if that's possible
[21:29:45] New patchset: Bhartshorne; "giving andrew a different real name so his two user accounts don't conflict." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2804
[21:30:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2804
[21:30:54] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2804
[21:31:24] Jeff_Green: No, that's explicitly not possible
[21:31:41] One of the /features/ of het deploy is that they can have separate l10n caches
[21:31:41] ottomata: one more change to try and make it work... ^^^^
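Editor's note: changes 2748, 2801, 2802 and 2804 above all deal with replacing one person's shell account with a brand new one: the old account is ensured absent while the new one is added, and per 2804 the two still conflicted in the site-specific account definitions until the new account was given a different real name (stock Puppet user resources do not clash on that field). Those manifests are not reproduced in this log; the sketch below only illustrates the absent/present pattern with Puppet's built-in user type, and the new username, uid and comment are invented for illustration:

    # Illustrative sketch only; production uses its own account definitions,
    # and the new username, uid and comment here are invented.
    user { 'aotto':
        ensure => absent,            # retire the old account wherever it still exists
    }

    user { 'aotto2':
        ensure     => present,
        uid        => '1001',                         # invented
        comment    => 'Andrew Otto (new account)',    # a distinct real name, cf. change 2804
        shell      => '/bin/bash',
        managehome => true,
    }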
[21:32:57] RoanKattouw: ok. then I think we should seriously consider adjusting the way we build webservers to accommodate the increased footprint
[21:33:23] That sounds like a good idea
[21:33:28] we could probably do without this empty 59G /a :-)
[21:33:41] were you the one who posted the RT ticket re. moving apache's /tmp to /a/tmp ?
[21:35:28] We should also make sure that the scalers have sufficient /tmp space
[21:35:36] Right now their /tmp lives on / which is just as tiny
[21:37:34] RoanKattouw: would it be feasible to move the mw cache to /tmp if we were built such that /tmp had sufficient capacity?
[21:37:54] I mean without horrible symlink madness :-)
[21:38:13] Well
[21:38:20] You're talking about cache/l10n ?
[21:38:31] That stuff is eventually not gonna be pushed to the Apaches at all, I think
[21:38:36] Ask Tim about that
[21:38:41] ohrly
[21:38:52] i thought it was built locally
[21:38:58] No
[21:39:03] cache/l10n is built centrally and pushed out
[21:39:10] /tmp/l10ncache-* is built locally
[21:39:11] oic
[21:39:23] I think the plan is to build the latter centrally and push it out, that obviates the need to distribute the former
[21:41:22] interesting, and the latter is also much smaller
[21:53:40] robla: that image you gave me (the Mediawiki flyer) was modified Thu, 02 Feb 2012 - aka before the fix went live.
[21:53:53] !log preparing to swap enwiki master, it will be read only for a couple minutes
[21:53:56] Logged the message, Master
[21:56:28] I'm not sure what to do with the apache server partitioning scheme--how did we arrive at these partitions and sizes?
[21:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:58] maplebed: do you have a good automated way of cleaning up images like those?
[21:58:29] robla: well, the method I ran last time did clear out a large number of images (I haven't actually counted how many)
[21:58:38] but clearly it didn't catch them all.
[21:58:43] I haven't written a better one yet
[21:59:02] New patchset: Asher; "s1 master swap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2805
[21:59:14] I do want to write something (and I think apergos actually already has this tool that I only need to modify) to make it easier to clear out specific thumbs (rather than all thumbs for an image)
[21:59:34] * robla needs to drop off of IRC for a sec
[21:59:36] robla: so I think the full answer is "sort of, with better tools coming."
[21:59:42] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2805
[21:59:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2805
[22:00:54] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 272 seconds
[22:00:54] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 274 seconds
[22:02:51] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds
[22:03:00] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[22:03:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.368 seconds
[22:09:42] !log new s1 (enwiki) master pos - MASTER_LOG_FILE='db38-bin.000129', MASTER_LOG_POS=255719721
[22:09:45] Logged the message, Master
[22:14:26] !log running 1.19 schema migration script to get former s5, s6, s1 masters (db45, db47, db36)
[22:14:29] Logged the message, Master
[22:16:04] !log cadmium locked up, rebooting
[22:16:06] Logged the message, RobH
[22:19:30] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 31.07 ms
[22:21:26] binasher - when u have a chance, pls rotate s4 master as well
[22:21:56] yep, that's next on the list. when will someone be able to replace the failed drive?
[22:22:21] PROBLEM - MySQL Slave Delay on db45 is CRITICAL: CRIT replication delay 284 seconds
[22:22:37] chris will do it prolly when it is rotated out as master
[22:23:42] binasher: my understanding is we have the drive on site to swap
[22:23:51] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: CRIT replication delay 375 seconds
[22:23:55] we just didnt wanna do it while it was master, as chris is having trouble identifying the drive
[22:24:06] yeah :/
[22:24:42] New patchset: Siebrand; "Add magic word translations for Dutch." [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2806
[22:24:48] ok, rotating it in a few minutes. db22 is prob going to need to be retired within 6 months due to only having 450GB of space
[22:29:28] !log switching s4 master to db31
[22:29:30] Logged the message, Master
[22:29:44] New patchset: Asher; "switching s4 master to db31" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2807
[22:31:06] !log new s4 master pos - MASTER_LOG_FILE='db31-bin.000253', MASTER_LOG_POS=457980068
[22:31:09] Logged the message, Master
[22:31:38] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2807
[22:31:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2807
[22:39:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:40:23] New patchset: Nikerabbit; "Breaking stuff" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2808
[22:43:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.358 seconds
[22:45:12] !log dns update for manganese server
[22:45:15] Logged the message, RobH
[22:51:00] New review: Hashar; "You are breaking stuff! :-D" [test/mediawiki/extensions/examples] (master) C: -1; - https://gerrit.wikimedia.org/r/2808
[22:56:54] Change abandoned: Siebrand; "This sux." [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2806
[22:58:16] Change restored: Hashar; "Restoring. Just amend your change!" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2806
[22:58:55] New review: Nikerabbit; "(no comment)" [test/mediawiki/extensions/examples] (master) C: 1; - https://gerrit.wikimedia.org/r/2806
[23:02:19] New review: Varnent; "It's possible this change will bring about the end of humankind as we know it..." [test/mediawiki/extensions/examples] (master) C: -1; - https://gerrit.wikimedia.org/r/2808
[23:03:32] New review: Siebrand; "Stubborn me. I'll approve anyway." [test/mediawiki/extensions/examples] (master) C: 1; - https://gerrit.wikimedia.org/r/2808
[23:03:49] New review: Siebrand; "Stubborn me. I'll approve anyway." [test/mediawiki/extensions/examples] (master) C: 2; - https://gerrit.wikimedia.org/r/2808
[23:03:51] New review: Varnent; "Maybe this is terrible after all" [test/mediawiki/extensions/examples] (master) C: -1; - https://gerrit.wikimedia.org/r/2808
[23:03:53] Change merged: Varnent; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2808
[23:09:49] New patchset: Lcarr; "Commenting out dual definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2809
[23:10:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2809
[23:10:47] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2809
[23:10:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2809
[23:18:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:24:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.959 seconds
[23:30:57] New patchset: Lcarr; "changing some nagios3 config files to uniques" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2810
[23:31:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2810
[23:37:40] New patchset: Lcarr; "adding new config file + commenting out old bits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2811
[23:38:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2811
[23:38:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2810
[23:38:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2810
[23:38:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2811
[23:38:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2811
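Editor's note: changes 2809 through 2811 above clean up duplicate Puppet resource declarations in the Nagios 3 configuration ("Commenting out dual definitions", "changing some nagios3 config files to uniques"). The affected manifests are not shown in this log; the sketch below, with invented class and file names, only illustrates the general failure mode being fixed: declaring the same resource title twice aborts catalog compilation with a duplicate definition error, so one declaration has to go, or the resources need unique titles and file names.

    # Invented illustration of the general problem, not the actual nagios3 manifests.
    class nagios::base {
        file { '/etc/nagios3/conf.d/checks.cfg':
            ensure => file,
            source => 'puppet:///modules/nagios/checks.cfg',
        }
    }

    class nagios::extra {
        # Declaring File['/etc/nagios3/conf.d/checks.cfg'] again here would fail the
        # compile with a duplicate definition error, so the second copy is commented
        # out and a uniquely named file is used instead.
        # file { '/etc/nagios3/conf.d/checks.cfg': ... }
        file { '/etc/nagios3/conf.d/checks-extra.cfg':
            ensure => file,
            source => 'puppet:///modules/nagios/checks-extra.cfg',
        }
    }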
[23:45:52] New patchset: Cmcmahon; "trying commit before review" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2812
[23:50:59] New patchset: Ryan Lane; "Making the default repo channel #mediawiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2813
[23:51:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2813
[23:51:45] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2813
[23:51:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2813
[23:59:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds