[00:20:33] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [00:27:36] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:27:36] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:28:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:33:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.024 seconds [01:07:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.942 seconds [01:47:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:58] PROBLEM - Disk space on mw61 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [01:52:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.303 seconds [02:28:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:33] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:45] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:34:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [02:36:03] wtf: http://ganglia.wikimedia.org/latest/?c=Bits%20caches%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [02:36:21] <^demon> The hell...? [02:36:39] can't ssh to either of them [02:36:43] ah. can now [02:36:54] checking for packet loss [02:37:15] none [02:40:01] not swapping [02:41:04] well, let's restart varnish on them [02:41:23] fuck, is it varnish, or varnish3? [02:42:55] seems it's varnish [02:42:58] ack! [02:43:46] why so much system cpu all of a sudden? [02:44:09] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 625 bytes in 0.267 seconds [02:44:16] !log restarted varnish on niobium and arsenic [02:44:20] Logged the message, Master [02:44:45] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 627 bytes in 0.143 seconds [02:44:56] well, that seemed to have helped for now. let's see if it dies again soon [02:45:11] i hope so! [02:45:30] you hope it dies again? ;) [02:45:36] hope it's fixed [02:45:38] heh [02:45:39] !!! [02:45:48] thought it was my sucky internet connection [02:46:05] if I have to take a guess, it's going to die again [02:46:18] it works now [02:46:56] I wonder if there was brief packet loss to pmtpa from eqiad [02:47:06] that could trigger this [02:47:26] it still doesn't look great: http://ganglia.wikimedia.org/latest/?c=Bits%20caches%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [02:48:05] i can't save to wikimania wiki [02:48:16] yikes! [02:48:59] it saved but took forever [02:50:02] ah. packet loss [02:50:18] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:14] paged mark and leslie [02:51:31] I'm going to move traffic to pmtpa [02:51:45] hm. 
I wonder if the bits cache there is totally cold [02:52:50] hey Ryan_Lane [02:52:54] hey [02:53:03] i think the SMS notifier is dead again [02:53:14] I'm considering moving bits traffic to pmtpa [02:53:23] but… it's possible the cache there is totally cold [02:53:30] ok [02:53:40] can you check into the packet loss? [02:53:44] trying to get into eqiad routers now [02:53:48] ok [02:53:56] !log moving bits traffic to pmtpa [02:53:59] Logged the message, Master [02:54:08] if this makes it worse I'll move it back [02:56:02] cache being doesn't really matter for bits [02:56:06] it's a tiny cache [02:56:13] ah. ok [02:56:16] so, I moved it [02:56:58] hrm, so not seeing the p-loss in my mtr's right now yet [02:57:13] packet loss may not be due to network, but due to the system being very overloaded [02:58:52] this is at least good timing as my Ryan just asked me to clean up around the house a bit ;) [03:00:19] so i see on arsenic a big network drop at 02:25 UTC [03:00:27] Feb 26 05:09:15 asw-b-eqiad chassisd[965]: CHASSISD_SNMP_TRAP6: SNMP trap generated: Power Supply failed (jnxContentsContainerIndex 2, jnxContentsL1Index 4, jnxContentsL2Index 2, jnxContentsL3Index 0, jnxContentsDescr Power Supply 1, jnxOperatingState/Temp 6) [03:00:31] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=arsenic.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1330311572&g=network_report&z=large&c=Bits%20caches%20eqiad [03:00:42] ah [03:00:57] that's no good [03:01:19] LeslieCarr: yeah, but it's also associated with a very high load [03:01:49] mark@asw-b-eqiad> show chassis alarms [03:01:49] 1 alarms currently active [03:01:49] Alarm time Class Description [03:01:50] 2011-12-15 18:46:15 UTC Major FPC 3 PEM 1 is not powered [03:02:00] that's a while ago though [03:02:23] holy shit, [03:02:24] root@niobium:~# uptime [03:02:24] 03:02:12 up 110 days, 9:39, 3 users, load average: 1.55, 277.47, 2274.78 [03:02:37] !log restarting varnish on arsenic [03:02:39] Logged the message, Master [03:03:02] mark: unless you want to strace the process... [03:03:25] not really [03:03:31] * Ryan_Lane nod [03:04:11] it's pretty annoying that once varnish starts crapping itself that it won't recover [03:04:14] and neither one is on that switch, at least :) [03:04:15] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 627 bytes in 0.062 seconds [03:06:02] huh? [03:06:05] one of the two should be [03:06:26] niobium is [03:06:43] on that stack, but on rack 4, not 3 [03:06:49] was thinking physical single switch [03:06:54] sure [03:07:06] but any one switch can affect the entire stack of course [03:07:13] yeah [03:07:18] I was saying that because observium is seeing asw-b-eqiad flapping a lot [03:07:38] and with varnish, any packet loss whatsoever will cause carnage [03:07:57] even if it's brief [03:08:33] ok. eqiad is happy again [03:08:43] want me to move traffic back, or leave it in pmtpa for a while? [03:09:04] leave it there [03:09:07] * Ryan_Lane nods [03:09:39] whatever went wrong over there will likely happen again [03:09:51] yes and it's nicely on 4 hosts in pmtpa [03:09:59] yep [03:10:02] i'm deploying upload on bits in eqiad this week [03:10:06] i'll setup two more hosts as well [03:10:09] er [03:10:12] upload on varnish in eqiad [03:10:15] i'm not fully awake yet :P [03:10:17] oh. cool [03:10:19] heh [03:10:42] ah. 
crap it's 4am there [03:10:52] I should have checked before I paged you [03:11:22] it's ok [03:14:26] so i'm not seeing errors on the links … i think we're good for now [03:14:28] so one of the two netapp controllers in eqiad seems down [03:14:50] that is kinda weird [03:14:55] indeed [03:15:02] but since it's unused, I kinda don't wanna look at it right now [03:16:49] i'm gonna get going and finish cooking dinner [03:16:56] yeah. me too [03:17:01] i'll go back to sleep ;) [03:17:04] nice to have redundancy! [03:17:06] g'night :) [03:17:08] yes it is ! [03:17:10] good night [03:17:14] goodnight! [03:17:15] mark: night! sorry for the page ;) [03:17:21] no problem [03:17:24] see ya [03:18:21] PROBLEM - MySQL disk space on db12 is CRITICAL: DISK CRITICAL - free space: / 287 MB (3% inode=90%): [03:19:33] PROBLEM - Disk space on db12 is CRITICAL: DISK CRITICAL - free space: / 287 MB (3% inode=90%): [03:34:05] PROBLEM - Disk space on srv249 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [03:38:24] I'm looking for Russ Nelson. is he in here? [03:38:49] I need to add his email address to the openstack swift authors file, and I don't have it [03:53:42] hi notmyname [03:53:48] hi [03:54:00] know how I can contact russ? [03:54:18] i'm pinging him... not sure if he's available now [03:54:40] his nick is nelson [03:54:52] thanks. I can camp in here (yay, IRC bouncers!) and you can ping me when you have info [03:55:37] ok [03:59:51] he seems not around right now [04:00:30] no worries [04:01:33] ok [05:03:40] RECOVERY - Disk space on db12 is OK: DISK OK [05:03:49] RECOVERY - MySQL disk space on db12 is OK: DISK OK [06:45:20] RECOVERY - Disk space on srv249 is OK: DISK OK [06:48:02] RECOVERY - Disk space on mw61 is OK: DISK OK [08:07:05] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 16.3681021053 (gt 8.0) [08:07:42] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [08:12:08] RECOVERY - udp2log processes on locke is OK: OK: all filters present [08:24:08] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [08:30:08] RECOVERY - udp2log processes on locke is OK: OK: all filters present [08:32:59] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [08:38:59] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [08:38:59] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [08:56:41] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.90896460177 [09:29:09] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours [10:21:35] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [10:29:32] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:29:32] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [13:32:05] PROBLEM - Disk space on mw61 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [14:09:18] apergos: some debugging process on mw61 is eating all disk space :-) [14:09:38] * apergos guess gmond at random [14:10:12] well I am just guessing [14:10:33] cause of the debugfs filename in nagios notification [14:15:17] nah [14:15:37] looks like there's just enough php warnings in 
the logs that it's getting fuller [14:15:44] I wish we made these root partitions a bit bigger [14:17:35] mw61 apache2[7296]: PHP Warning: filemtime() [function.filemtime]: stat failed for /usr/local/apache/common-local/php-1.17/extensions/WikiEditor/modules/./images/toc/close.png in /usr/local/apache/common-local/php-1.18/includes/resourceloader/ResourceLoaderFileModule.php on line 380 [14:17:41] lots of messages like these [14:19:00] I gzipped messages.1 and syslog.1 a bit ahead of schedule [14:19:27] I wonder what is running 1.17 [14:19:56] RECOVERY - Disk space on mw61 is OK: DISK OK [14:21:09] apergos: /var should probably be a different partition :-) [14:21:17] so whenever it fills up, you don't have the whole system going foobar [14:21:28] just like /tmp [14:21:49] sure, but even better would be to have some room for logs that have a lot of extra crap in them [14:21:54] instead of pretending that's never going to happen [14:22:00] do you have the date of such messages? [14:22:07] today [14:22:11] dooh [14:22:24] the 1.17 errors are related to some cache somewhere [14:22:31] well [14:22:42] that need to be cleared. Roan said he cleared them all but that does not seem to be the case :-( [14:24:02] guess not [14:24:12] yeah, will have to ping him about it [14:24:58] in the meantime I bought you guys a little more time [14:25:49] ohhhh [14:25:57] STUPID ROAN EMIGRATED TO USA !!!!!!!!!!!!!!!!!! [14:26:05] we can't join him anymore during the day :-( [14:26:17] :-D [14:26:23] he'll be around later [14:29:05] really? bugzilla? [14:29:08] https://bugzilla.wikimedia.org/show_bug.cgi?id=34752 [14:29:09] yeah [14:29:16] easier than email [14:29:20] I mean, I'll be on here when people are on later [14:29:27] we have an ops meeting (*grumble*) [14:29:41] we got platform engineering on monday too [14:29:47] 11pm- midnight local time .. [14:29:48] oh yeah? [14:29:51] ugh [14:30:13] you need to have more people living over here so they can speak up about the time slo [14:30:14] t [14:31:01] Rob is the director. Is team in SF is made of only one person (Aaron) :-)))))))))))))))))) [14:31:23] so that was the compromise time? eeewww [14:31:40] where are you guys all then? [14:31:47] then all the other are Remote workers [14:32:04] Tim in Australia, Sam Reed in UK, me in France, Chad somewhere in the U.S.A [14:32:34] + non MediaWiki folks, Chris McMahon in USA somewhere, the analytic team being in N/Y and NL [14:32:37] that's a lot of timezones [14:32:38] well that is a bit messy [14:33:02] anyway, the MediaWiki weekly meeting is at 8am for Tim, 11pm for me, 10pm for Reedy and in the afternoon for US fok [14:33:19] any other hour mean it will be too early for tim or to late for me [14:33:26] you and tim are the problem, clearly [14:33:32] OR  at night for the USA folks [14:33:36] one of you will have to get reassigned :-P [14:33:52] just need to migrate all the american staff back in Europe [14:33:54] where they belong [14:34:27] :-D [14:34:28] http://en.wikipedia.org/wiki/European_colonization_of_the_Americas \o/ [14:35:06] you know, that argument only works for americans of european descent :-P [14:35:31] the other one will be back to Mexico then back to Spain :-D [14:35:51] african americans [14:36:16] * hashar realize that all of sudden americas will be free from human being 8) [14:36:17] and what about asian americans? 
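A minimal sketch of the stopgap described above for mw61: when the root partition fills up with PHP-warning spam, compress the already-rotated syslog files a little ahead of logrotate's schedule and check how much room that buys. The paths are the stock Ubuntu ones and are an assumption about the host's layout.

    # Check free space on /, compress rotated-but-not-yet-compressed logs, recheck.
    df -h /
    gzip -v /var/log/messages.1 /var/log/syslog.1
    df -h /

This only buys time until the next rotation; the longer-term fixes discussed above are clearing the stale php-1.17 cache references and giving /var (or /) more headroom.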
[14:36:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.8520634211 (gt 8.0) [14:36:39] no, the native americans will have a bunch of land again [14:39:19] anyway [14:39:39] apergos: I have been ranting about swift logs polluting the main syslog file [14:39:59] I thought that got worked out [14:40:01] I have did a change for syslog-ng configuration https://gerrit.wikimedia.org/r/#change,2673 [14:40:04] anyways these weren't swift eerrors [14:40:07] ben reviewed it [14:40:20] ok [14:40:21] so we might have a solution approaching soon :-) [14:40:26] good for that! [14:43:10] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.23771982301 [14:49:01] New patchset: Hashar; "send Swift syslogs to their own file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2673 [15:11:22] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.13589707965 (gt 8.0) [16:12:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.1901229204 (gt 8.0) [16:36:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.2623781579 (gt 8.0) [16:49:35] apergos: do we have a list of server fingerprint somewhere ? [16:50:04] I mean the ssh finger print [16:50:45] I don;t know of one [16:55:17] !log blog updated to newest version [16:55:19] Logged the message, RobH [16:57:03] New patchset: Hashar; "modifying testfile again" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2791 [16:58:27] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/27/ (1/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2791 [16:58:28] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/39/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2791 [17:01:50] New patchset: RobH; "require unzip for blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2792 [17:03:02] New review: RobH; "need unzip package to update blog plugins" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2792 [17:03:03] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2792 [17:03:21] aude: did you get in contact with russ? 
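No central list of SSH host-key fingerprints turns up in the exchange above, so the following is only an illustrative way to build one with stock OpenSSH tools: run it on each server (via dsh or a plain ssh loop) and collect the output. The key path assumes the default OpenSSH layout.

    # Print this host's name and the fingerprint of its RSA host key.
    hostname -f
    ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub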
[17:04:39] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2791 [17:04:39] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2791 [17:05:57] New patchset: Hashar; "add 'topic' feature" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2793 [17:06:58] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3481768421 (gt 8.0) [17:07:07] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/40/ (1/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2793 [17:07:07] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/28/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2793 [17:07:07] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2793 [17:07:08] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2793 [17:10:58] !log blog plugins updated, blog puppet config updated to support unzip package [17:11:00] Logged the message, RobH [17:11:27] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on srv199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:27] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:28] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:28] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:29] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:03] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:04] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:04] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:05] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:05] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:07] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:07] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:08] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:08] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:09] PROBLEM - Apache HTTP on srv233 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [17:12:09] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:10] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:10] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:12] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:21] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:12:22] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:22] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:30] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:31] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:47] wt... [17:12:48] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:48] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:57] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:57] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:57] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:58] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:58] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:58] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:06] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:07] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - 
Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:15] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:16] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:16] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:17] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:17] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:18] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:18] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:19] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:19] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:33] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:34] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:34] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:35] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:35] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:36] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:36] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:37] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:37] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:38] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:38] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:39] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:39] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:40] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:40] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:51] PROBLEM - Apache HTTP on 
srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:00] PROBLEM - Frontend Squid HTTP on cp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:09] New patchset: RobH; "hooper is no longer a blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2794 [17:14:09] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:10] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:11] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:11] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:18] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:18] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:19] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:27] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:27] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2794 [17:15:27] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2794 [17:16:24] PROBLEM - Router interfaces on br1-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.245 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [17:17:18] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:36] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:36] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:45] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] 
RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.794 second response time [17:18:12] PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:21] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:21] RECOVERY - Router interfaces on br1-knams is OK: OK: host 91.198.174.245, interfaces up: 10, down: 0, dormant: 0, excluded: 0, unused: 0 [17:18:39] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:57] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.582 second response time [17:18:57] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.301 second response time [17:19:06] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.766 second response time [17:19:15] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:42] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.894 second response time [17:20:45] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.934 second response time [17:20:45] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.416 second response time [17:21:03] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.181 second response time [17:21:09] DDoS in progress, please be silent [17:21:12] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.978 second response time [17:21:30] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.047 second response time [17:21:39] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:21:39] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.656 second response time [17:21:39] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.156 second response time [17:21:39] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.356 second response time [17:21:39] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.815 second response time [17:21:40] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.663 second response time [17:21:48] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.700 second response time [17:21:57] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.825 second response time [17:22:06] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.171 second response time [17:22:24] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.655 second response time [17:22:25] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.624 second response time [17:22:25] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.842 second response time [17:22:25] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.554 second response time [17:22:33] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.467 second response time [17:22:42] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 
Moved Permanently - 0.024 second response time [17:22:42] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:22:42] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:22:42] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.982 second response time [17:22:42] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.289 second response time [17:22:51] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:00] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [17:23:00] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [17:23:00] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:00] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [17:23:00] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.467 second response time [17:23:01] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:23:01] RECOVERY - Apache HTTP on srv227 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:23:01] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:23:02] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.018 second response time [17:23:02] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:04] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:23:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.242 second response time [17:23:04] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.849 second response time [17:23:18] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [17:23:36] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:36] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:23:36] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:23:36] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:23:36] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:23:37] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:23:37] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.130 second response time [17:23:45] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:23:45] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [17:23:46] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response 
time [17:23:46] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:23:46] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:23:46] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:23:46] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:23:54] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.018 second response time [17:23:54] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [17:24:03] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.188 second response time [17:24:03] RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.874 second response time [17:24:03] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.585 second response time [17:24:12] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [17:24:12] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:24:13] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:24:13] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [17:24:13] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:24:21] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.757 second response time [17:24:21] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.880 second response time [17:24:21] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.992 second response time [17:24:30] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64220 bytes in 0.114 seconds [17:24:31] RECOVERY - Apache HTTP on srv199 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:24:31] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.163 second response time [17:24:31] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.173 second response time [17:24:39] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.049 second response time [17:24:40] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:24:40] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.883 second response time [17:24:40] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.244 second response time [17:24:40] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.234 second response time [17:24:48] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.868 second response time [17:24:58] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [17:24:58] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.362 second response time [17:24:58] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - 
HTTP/1.1 301 Moved Permanently - 0.467 second response time [17:24:58] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.805 second response time [17:24:58] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.315 second response time [17:25:06] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.562 second response time [17:25:06] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.117 second response time [17:25:15] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:25:15] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [17:25:15] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.021 second response time [17:25:15] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.358 second response time [17:25:15] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.736 second response time [17:25:16] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.673 second response time [17:25:24] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.954 second response time [17:25:24] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.047 second response time [17:25:33] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.896 second response time [17:25:33] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.013 second response time [17:25:34] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.887 second response time [17:25:34] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.928 second response time [17:25:42] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:25:42] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [17:25:42] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.984 second response time [17:25:42] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.284 second response time [17:25:51] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [17:25:52] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:26:00] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:26:00] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:26:00] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [17:26:00] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.165 second response time [17:26:00] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.510 second response time [17:26:00] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.202 second response time [17:26:09] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.02932451327 [17:26:09] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 
301 Moved Permanently - 0.032 second response time [17:26:09] RECOVERY - Frontend Squid HTTP on cp1018 is OK: HTTP OK HTTP/1.0 200 OK - 27672 bytes in 0.161 seconds [17:26:14] !log Denying POST / requests on frontend squids [17:26:17] Logged the message, Master [17:26:18] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.477 second response time [17:26:36] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:26:36] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:26:36] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.163 second response time [17:26:36] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.498 second response time [17:26:36] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [17:26:37] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:26:37] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:26:38] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [17:26:38] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.852 second response time [17:26:39] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.123 second response time [17:27:03] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.529 second response time [17:27:12] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [17:27:21] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.176 second response time [17:35:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2673 [17:35:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2673 [17:37:49] May this somehow be related to the morning bits outage? [17:39:08] vvv: is bits fixed? [17:40:21] jeremyb: apparently [17:40:59] IIRC it was switched from equid to pmtpa [17:42:03] eqiad* [17:42:19] yep [17:52:06] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.15823333333 (gt 8.0) [17:53:09] PROBLEM - Disk space on srv192 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [17:53:23] New review: Mark Bergsma; "Please just make that whole list one single file resource, with "recurse => remote"... and use mode ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2782 [17:55:48] New review: Ottomata; "(no comment)" [analytics/reportcard] (andre/mobile); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2770 [17:56:10] New review: Ottomata; "(no comment)" [analytics/reportcard] (andre/mobile); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2770 [18:00:12] RECOVERY - MySQL Slave Running on db34 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [18:01:36] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2788 [18:04:15] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 1102134 seconds [18:27:48] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.55816752212 (gt 8.0) [18:34:37] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [18:40:37] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [18:40:37] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [18:51:25] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 13.3243047368 (gt 8.0) [18:57:01] Ryan_Lane: Do you know if the ops meeting is happening? Or is everyone too busy plugging and unplugging network cables? [18:57:14] oh [18:57:18] the internets went down [18:57:25] oh. it's going on now! [18:57:34] What # do I call? [18:57:44] it should be in like 2 minute [18:57:44] s [18:57:47] 2002 [18:57:51] x2002 [18:58:27] Actually, the extension is the part that I know. [18:58:33] d'you know what # that's an extension for? [18:58:37] ah [18:58:55] +1-415-839-6885 [18:59:24] thanks! [18:59:36] yw [19:07:08] New patchset: Asher; "s5 master changing from db45 to db35" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2795 [19:23:22] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.85951903509 [19:30:34] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours [19:31:19] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3746984211 (gt 8.0) [19:42:53] !log adjusted threshholds for ps1-b4-sdtpa.mgmt.pmtpa.wmnet again, bottom sensor set to high [19:42:56] Logged the message, RobH [20:11:20] http://gdash.wikimedia.org/dashboards/reqerror/ [20:11:48] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.57119877193 (gt 8.0) [20:22:54] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [20:24:17] i'm getting ready to switch the s5 (dewiki) mysql master [20:30:14] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.4255640708 [20:31:00] !log new s5 master pos - MASTER_LOG_FILE='db35-bin.000011', MASTER_LOG_POS=374074061 [20:31:03] Logged the message, Master [20:31:52] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:31:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:32:37] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2795 [20:32:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2795 [20:33:20] PROBLEM - Disk space on srv249 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=63%): /var/lib/ureadahead/debugfs 284 MB (3% inode=63%): [20:34:37] New review: 
Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2746 [20:34:39] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2746 [20:34:55] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 513 seconds [20:35:05] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: CRIT replication delay 517 seconds [20:35:05] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 518 seconds [20:35:19] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 529 seconds [20:35:28] maplebed: aforementioned corrupt image: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/MediaWiki_flyer_student2012.svg/2000px-MediaWiki_flyer_student2012.svg.png [20:35:40] tnx. [20:36:30] Ryan_Lane: can you look at https://gerrit.wikimedia.org/r/#change,2748 and tell me whether that approach will work as intended? [20:36:58] yeah, it should be fine [20:37:06] it's a totally new account, and his old account is being deleted [20:37:13] ok, I'll merge. [20:37:16] ok [20:37:21] tnx. [20:37:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2748 [20:37:38] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2748 [20:38:01] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay seconds [20:38:01] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay seconds [20:38:09] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay seconds [20:38:18] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [20:38:41] hii [20:38:55] ottomata: two questions about https://gerrit.wikimedia.org/r/#change,2754 [20:39:06] first, will 'provider => pip' work? [20:39:16] second, do you have to get the mysql-python stuff from pip instead of the debian package? [20:39:39] RECOVERY - MySQL Recent Restart on db1006 is OK: OK seconds since restart [20:39:48] RECOVERY - RAID on db1006 is OK: OK: State is Optimal, checked 2 logical device(s) [20:39:48] RECOVERY - Host db1006 is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [20:39:57] RECOVERY - SSH on db1006 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:39:58] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay seconds [20:40:06] RECOVERY - MySQL Slave Delay on db1006 is OK: OK replication delay seconds [20:40:07] RECOVERY - DPKG on db1006 is OK: All packages OK [20:40:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:42] RECOVERY - MySQL Slave Running on db1006 is OK: OK replication [20:40:42] RECOVERY - Disk space on db1006 is OK: DISK OK [20:40:51] RECOVERY - Full LVS Snapshot on db1006 is OK: OK no full LVM snapshot volumes [20:40:51] RECOVERY - MySQL disk space on db1006 is OK: DISK OK [20:41:18] RECOVERY - MySQL Idle Transactions on db1006 is OK: OK longest blocking idle transaction sleeps for seconds [20:42:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.031 seconds [20:42:53] !log powercycled db1006 after finding nothing on the serial console. booted without issue, then started mysql. [20:42:55] Logged the message, Master [20:45:06] New patchset: Siebrand; "Break long line." 
[mediawiki/core] (master) - https://gerrit.wikimedia.org/r/2796 [20:46:06] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CRIT replication delay 646809 seconds [20:46:06] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CRIT replication delay 646808 seconds [20:54:43] New patchset: QChris; "Migration to upstream git" [operations/dumps/test] (master) - https://gerrit.wikimedia.org/r/2797 [20:59:11] !log preparing to switch s6 master [20:59:13] Logged the message, Master [20:59:14] New patchset: Ottomata; "Made Dygraphs loader smarter. Observations now will not record instances of a trait_set if any of the trait properties are None. This allows transform callbacks to reject a trait based on its value." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2798 [20:59:34] maplebed: commit amended. [21:00:11] * maplebed looks [21:01:16] ottomata: did you convince yourself that 'provide => pip' will work or you want me to just try and see? [21:01:59] !log new s6 master pos - MASTER_LOG_FILE='db43-bin.000027', MASTER_LOG_POS=577074024 [21:02:02] Logged the message, Master [21:02:09] PROBLEM - Disk space on srv248 is CRITICAL: DISK CRITICAL - free space: / 283 MB (3% inode=63%): /var/lib/ureadahead/debugfs 283 MB (3% inode=63%): [21:02:11] i'm curious, at least. [21:02:36] i believe I amended in the comment commit [21:02:37] to remove that [21:02:53] i think the .dev will be less annoying for now [21:02:56] .deb* [21:02:57] !log db1006 (s6-secondary) is still slaving from db47 - it's very behind post hw failure. need to manually swap to db43 once caught up [21:03:00] Logged the message, Master [21:03:07] ottomata: the mysql stuff no longer comes from pip but the pywurfl stuff still does. [21:03:11] i'm not sure what you need to do in puppet to set that up (other than have pip installed) [21:03:16] oh right right [21:03:42] my guess is something like https://github.com/rcrowley/puppet-pip [21:03:46] New patchset: Asher; "new s6 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2799 [21:04:07] naw, the pip provider is now included in puppet [21:04:07] or [21:04:11] at least it is in the type ref [21:04:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2799 [21:04:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2799 [21:04:37] ottomata: I need to remove your account from stat1 with the same commit since you're ensuring it's absent elsewhere. [21:04:38] http://docs.puppetlabs.com/references/stable/type.html#package [21:04:51] ? [21:04:56] the commit should do that, no? [21:05:03] but, you can remove whatever you need to [21:05:04] i've hardly used it yet [21:05:08] nothing there I need to save [21:05:10] New review: Siebrand; "Blah." [mediawiki/core] (master) C: 1; - https://gerrit.wikimedia.org/r/2796 [21:06:03] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 333 seconds [21:06:12] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 340 seconds [21:06:23] New patchset: Ottomata; "Adding page_views_pipeline.py." 
[21:04:51] ?
[21:04:56] the commit should do that, no?
[21:05:03] but, you can remove whatever you need to
[21:05:04] i've hardly used it yet
[21:05:08] nothing there I need to save
[21:05:10] New review: Siebrand; "Blah." [mediawiki/core] (master) C: 1; - https://gerrit.wikimedia.org/r/2796
[21:06:03] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 333 seconds
[21:06:12] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 340 seconds
[21:06:23] New patchset: Ottomata; "Adding page_views_pipeline.py." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2800
[21:06:38] New patchset: Bhartshorne; "removing aotto from stat1 so it doesn't barf when it hits the account ensure => absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2801
[21:06:48] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 379 seconds
[21:06:57] ottomata: 2801 is what I meant.
[21:07:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2801
[21:07:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2801
[21:07:18] New review: Diederik; "OK." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2800
[21:07:47] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2798
[21:07:48] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2800
[21:07:48] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2798
[21:08:44] Ryan_Lane: I've gotten this error twice so far this morning when doing a 'git fetch' on sockpuppet: error: RPC failed; result=22, HTTP code = 503
[21:08:54] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay seconds
[21:09:24] maplebed: the gerrit server is being slaughtered
[21:09:30] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay seconds
[21:09:36] every mediawiki dev is trying to clone right now
[21:09:57] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay seconds
[21:10:01] New patchset: Bhartshorne; "Revert "removing aotto from stat1 so it doesn't barf when it hits the account ensure => absent"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2802
[21:10:06] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay seconds
[21:10:06] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay 0 seconds
[21:10:06] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay seconds
[21:15:19] New patchset: Asher; "pt-heartbeat should use REPLACE instead of UPDATE" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2803
[21:15:42] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2802
[21:15:42] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2802
[21:15:47] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2803
[21:15:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2803
[21:16:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:09] RECOVERY - Disk space on srv248 is OK: DISK OK
[21:20:53] * Jeff_Green looks forward to php-1.18 being dropped
[21:21:34] Change merged: QChris; [operations/dumps/test] (master) - https://gerrit.wikimedia.org/r/2797
[21:22:33] RECOVERY - Disk space on srv249 is OK: DISK OK
[21:22:52] Jeff_Green: why?
[21:23:36] it's huge and the apache boxes aren't partitioned well to handle the extra footprint
[21:24:03] /dev/sda1 7.4G 6.8G 254M 97% /
[21:24:26] 1.1G php-1.18
[21:24:26] 1.1G php-1.19
[21:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds
[21:26:53] oh, that makes sense. I didn't look beyond /usr/ holding all the stuff.
[21:27:43] ya, i only thought to look there because I ran into this the other day on some other host I now forget
[21:28:40] it would be really spiffy if php-1.18 and php-1.19 shared the l10n cache, I'm not sure if that's possible
[21:29:45] New patchset: Bhartshorne; "giving andrew a different real name so his two user accounts don't conflict." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2804
[21:30:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2804
[21:30:54] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2804
[21:31:24] Jeff_Green: No, that's explicitly not possible
[21:31:41] One of the /features/ of het deploy is that they can have separate l10n caches
[21:31:41] ottomata: one more change to try and make it work... ^^^^
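Editor's note: changes 2748, 2801, 2802 and 2804 above all deal with replacing one person's shell account with a brand new one: the old account is ensured absent while the new one is added, and per 2804 the two still conflicted in the site-specific account definitions until the new account was given a different real name (stock Puppet user resources do not clash on that field). Those manifests are not reproduced in this log; the sketch below only illustrates the absent/present pattern with Puppet's built-in user type, and the new username, uid and comment are invented for illustration:

    # Illustrative sketch only; production uses its own account definitions,
    # and the new username, uid and comment here are invented.
    user { 'aotto':
        ensure => absent,            # retire the old account wherever it still exists
    }

    user { 'aotto2':
        ensure     => present,
        uid        => '1001',                         # invented
        comment    => 'Andrew Otto (new account)',    # a distinct real name, cf. change 2804
        shell      => '/bin/bash',
        managehome => true,
    }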
[21:32:57] RoanKattouw: ok. then I think we should seriously consider adjusting the way we build webservers to accommodate the increased footprint
[21:33:23] That sounds like a good idea
[21:33:28] we could probably do without this empty 59G /a :-)
[21:33:41] were you the one who posted the RT ticket re. moving apache's /tmp to /a/tmp ?
[21:35:28] We should also make sure that the scalers have sufficient /tmp space
[21:35:36] Right now their /tmp lives on / which is just as tiny
[21:37:34] RoanKattouw: would it be feasible to move the mw cache to /tmp if we were built such that /tmp had sufficient capacity?
[21:37:54] I mean without horrible symlink madness :-)
[21:38:13] Well
[21:38:20] You're talking about cache/l10n ?
[21:38:31] That stuff is eventually not gonna be pushed to the Apaches at all, I think
[21:38:36] Ask Tim about that
[21:38:41] ohrly
[21:38:52] i thought it was built locally
[21:38:58] No
[21:39:03] cache/l10n is built centrally and pushed out
[21:39:10] /tmp/l10ncache-* is built locally
[21:39:11] oic
[21:39:23] I think the plan is to build the latter centrally and push it out, that obviates the need to distribute the former
[21:41:22] interesting, and the latter is also much smaller
[21:53:40] robla: that image you gave me (the Mediawiki flyer) was modified Thu, 02 Feb 2012 - aka before the fix went live.
[21:53:53] !log preparing to swap enwiki master, it will be read only for a couple minutes
[21:53:56] Logged the message, Master
[21:56:28] I'm not sure what to do with the apache server partitioning scheme--how did we arrive at these partitions and sizes?
[21:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:58] maplebed: do you have a good automated way of cleaning up images like those?
[21:58:29] robla: well, the method I ran last time did clear out a large number of images (I haven't actually counted how many)
[21:58:38] but clearly it didn't catch them all.
[21:58:43] I haven't written a better one yet
[21:59:02] New patchset: Asher; "s1 master swap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2805
[21:59:14] I do want to write something (and I think apergos actually already has this tool that I only need to modify) to make it easier to clear out specific thumbs (rather than all thumbs for an image)
[21:59:34] * robla needs to drop off of IRC for a sec
[21:59:36] robla: so I think the full answer is "sort of, with better tools coming."
[21:59:42] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2805
[21:59:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2805
[22:00:54] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 272 seconds
[22:00:54] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 274 seconds
[22:02:51] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds
[22:03:00] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[22:03:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.368 seconds
[22:09:42] !log new s1 (enwiki) master pos - MASTER_LOG_FILE='db38-bin.000129', MASTER_LOG_POS=255719721
[22:09:45] Logged the message, Master
[22:14:26] !log running 1.19 schema migration script to get former s5, s6, s1 masters (db45, db47, db36)
[22:14:29] Logged the message, Master
[22:16:04] !log cadmium locked up, rebooting
[22:16:06] Logged the message, RobH
[22:19:30] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 31.07 ms
[22:21:26] binasher - when u have a chance, pls rotate s4 master as well
[22:21:56] yep, that's next on the list. when will someone be able to replace the failed drive?
[22:22:21] PROBLEM - MySQL Slave Delay on db45 is CRITICAL: CRIT replication delay 284 seconds
[22:22:37] chris will do it prolly when it is rotated out as master
[22:23:42] binasher: my understanding is we have the drive on site to swap
[22:23:51] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: CRIT replication delay 375 seconds
[22:23:55] we just didnt wanna do it while it was master, as chris is having trouble identifying the drive
[22:24:06] yeah :/
[22:24:42] New patchset: Siebrand; "Add magic word translations for Dutch." [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2806
[22:24:48] ok, rotating it in a few minutes. db22 is prob going to need to be retired within 6 months due to only having 450GB of space
[22:29:28] !log switching s4 master to db31
[22:29:30] Logged the message, Master
[22:29:44] New patchset: Asher; "switching s4 master to db31" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2807
[22:31:06] !log new s4 master pos - MASTER_LOG_FILE='db31-bin.000253', MASTER_LOG_POS=457980068
[22:31:09] Logged the message, Master
[22:31:38] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2807
[22:31:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2807
[22:39:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:40:23] New patchset: Nikerabbit; "Breaking stuff" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2808
[22:43:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.358 seconds
[22:45:12] !log dns update for manganese server
[22:45:15] Logged the message, RobH
[22:51:00] New review: Hashar; "You are breaking stuff! :-D" [test/mediawiki/extensions/examples] (master) C: -1; - https://gerrit.wikimedia.org/r/2808
[22:56:54] Change abandoned: Siebrand; "This sux." [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2806
[22:58:16] Change restored: Hashar; "Restoring. Just amend your change!" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2806
[22:58:55] New review: Nikerabbit; "(no comment)" [test/mediawiki/extensions/examples] (master) C: 1; - https://gerrit.wikimedia.org/r/2806
[23:02:19] New review: Varnent; "It's possible this change will bring about the end of humankind as we know it..." [test/mediawiki/extensions/examples] (master) C: -1; - https://gerrit.wikimedia.org/r/2808
[23:03:32] New review: Siebrand; "Stubborn me. I'll approve anyway." [test/mediawiki/extensions/examples] (master) C: 1; - https://gerrit.wikimedia.org/r/2808
[23:03:49] New review: Siebrand; "Stubborn me. I'll approve anyway." [test/mediawiki/extensions/examples] (master) C: 2; - https://gerrit.wikimedia.org/r/2808
[23:03:51] New review: Varnent; "Maybe this is terrible after all" [test/mediawiki/extensions/examples] (master) C: -1; - https://gerrit.wikimedia.org/r/2808
[23:03:53] Change merged: Varnent; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2808
[23:09:49] New patchset: Lcarr; "Commenting out dual definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2809
[23:10:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2809
[23:10:47] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2809
[23:10:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2809
[23:18:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:24:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.959 seconds
[23:30:57] New patchset: Lcarr; "changing some nagios3 config files to uniques" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2810
[23:31:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2810
[23:37:40] New patchset: Lcarr; "adding new config file + commenting out old bits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2811
[23:38:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2811
[23:38:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2810
[23:38:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2810
[23:38:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2811
[23:38:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2811
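Editor's note: changes 2809 through 2811 above clean up duplicate Puppet resource declarations in the Nagios 3 configuration ("Commenting out dual definitions", "changing some nagios3 config files to uniques"). The affected manifests are not shown in this log; the sketch below, with invented class and file names, only illustrates the general failure mode being fixed: declaring the same resource title twice aborts catalog compilation with a duplicate definition error, so one declaration has to go, or the resources need unique titles and file names.

    # Invented illustration of the general problem, not the actual nagios3 manifests.
    class nagios::base {
        file { '/etc/nagios3/conf.d/checks.cfg':
            ensure => file,
            source => 'puppet:///modules/nagios/checks.cfg',
        }
    }

    class nagios::extra {
        # Declaring File['/etc/nagios3/conf.d/checks.cfg'] again here would fail the
        # compile with a duplicate definition error, so the second copy is commented
        # out and a uniquely named file is used instead.
        # file { '/etc/nagios3/conf.d/checks.cfg': ... }
        file { '/etc/nagios3/conf.d/checks-extra.cfg':
            ensure => file,
            source => 'puppet:///modules/nagios/checks-extra.cfg',
        }
    }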
[23:45:52] New patchset: Cmcmahon; "trying commit before review" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2812
[23:50:59] New patchset: Ryan Lane; "Making the default repo channel #mediawiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2813
[23:51:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2813
[23:51:45] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2813
[23:51:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2813
[23:59:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds