[00:04:33] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [00:26:17] That "Error deleting file: Could not delete lock file for "mwstore://local-backend/local-public/1/10"." error is still happening, if that's not known. [00:26:20] enwiki [00:28:15] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:48:21] PROBLEM - MySQL Idle Transactions on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:15] PROBLEM - MySQL Slave Running on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:50:18] PROBLEM - MySQL Replication Heartbeat on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:51:21] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:52:24] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:52:24] PROBLEM - MySQL Recent Restart on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:53:45] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay seconds [00:53:45] RECOVERY - MySQL Recent Restart on db31 is OK: OK 6144517 seconds since restart [00:54:12] RECOVERY - MySQL Idle Transactions on db31 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:54:50] looking at it [00:56:00] RECOVERY - MySQL Replication Heartbeat on db31 is OK: OK replication delay 0 seconds [00:56:18] RECOVERY - MySQL Slave Running on db31 is OK: OK replication [00:57:03] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:20:27] PROBLEM - Host db46 is DOWN: PING CRITICAL - Packet loss = 100% [01:22:51] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:30:03] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [01:40:53] !log on professor: restarted udpprofile collector [01:40:56] Logged the message, Master [01:42:12] RECOVERY - Host db46 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [01:42:48] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 240 seconds [01:51:50] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 19 seconds [01:54:23] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.38:11000 (timeout) 10.0.11.32:11000 (timeout) [01:55:44] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:02:56] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.36:11000 (timeout) [02:04:26] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:31:44] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:31:44] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.39:11000 (timeout) [02:33:05] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:40:17] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:46:53] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Thu Apr 26 02:46:42 UTC 2012 [04:01:44] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:13:26] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:57:21] PROBLEM - LVS HTTP on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:21] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:23] PROBLEM - LVS HTTP on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:32] PROBLEM - LVS HTTP on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:32] PROBLEM - LVS HTTP on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:41] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:50] PROBLEM - LVS HTTP on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - LVS HTTP on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:54] that looks bad [05:57:58] mark? [05:57:59] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:59] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:59] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:59] PROBLEM - LVS HTTP on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - LVS HTTP on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:17] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:17] PROBLEM - LVS HTTP on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:20] oh joy [05:59:20] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:49] apergos - the site seems ok here [05:59:57] how's it there? [06:00:00] this is esams only [06:00:03] woosters: all of the alerts are esams [06:00:10] ya, i noticed [06:00:18] and both #-tech complainants are euro [06:00:20] that is why aspergos can validate [06:00:23] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:00:30] now a 3rd [06:00:50] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:01:04] oh, some good news for a change [06:02:53] fwiw, http://www.internetpulse.net/ seems bad atm [06:03:02] socket buffers [06:04:18] I don't know how to clear that [06:04:38] tried restarting varnish on cp3001, that was not sufficient [06:05:11] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3976 bytes in 8.921 seconds [06:05:12] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:29] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:30] apergos: why single out that box? or it's arbitrary? [06:05:32] I have varnish stopped on cp3001 for a minute [06:05:36] there's two bits caches [06:05:46] I chose the first one, they both are unhappy [06:06:38] oh, only 2 huh [06:06:44] ok I"m goign to start it back up now [06:06:49] but then why all the alerts for not bits? [06:08:20] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 9.757 second response time [06:09:16] this looks better [06:09:21] let me do the same on cp3002 [06:10:04] hmm I don't see thatmessage there [06:10:34] it is ok I guess, looking at the graphs [06:11:03] or at least not in as bad a situation [06:12:59] PROBLEM - Host bits.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [06:15:14] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:15:41] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:15:44] i have no packet loss to ve7.te-8-1.csw1-esams.wikimedia.org and then plenty of loss to wikipedia-lb.esams.wikimedia.org [06:15:47] from US [06:15:58] (consecutive hops) [06:18:13] I don't know how to fix this :-( [06:18:26] i paged mark and asher [06:18:30] I am seeing out of socket memory again on cp3001 [06:18:32] woosters: leslie? [06:18:37] asher should be on in 5 minutes [06:18:41] RECOVERY - Host bits.esams.wikimedia.org is UP: PING WARNING - Packet loss = 86%, RTA = 154.51 ms [06:18:43] there he is [06:18:46] hi asher [06:19:00] hey [06:19:05] 26 06:15:43 < jeremyb> i have no packet loss to ve7.te-8-1.csw1-esams.wikimedia.org and then plenty of loss to wikipedia-lb.esams.wikimedia.org [06:19:09] 26 06:15:45 < jeremyb> from US [06:19:11] 26 06:15:57 < jeremyb> (consecutive hops) [06:19:15] (still) [06:19:34] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [06:19:42] on cps001 we have out of socket memory issues [06:19:44] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:19:45] jeremyb - leslie is sick :-( , not going to disturb her [06:19:48] restarting varnish did not clear them up [06:19:51] woosters: ;( [06:20:06] cp3001 [06:20:32] failover bits to pmtpa is an option if bits is the issue [06:21:54] jeremyb: where are you observing packet loss from? [06:22:27] pasting [06:22:46] binasher: http://dpaste.com/738014/plain/ [06:23:38] oh yeah, major packet loss to lvs in esams for me too [06:23:53] ok [06:24:20] binasher: also, maybe relevant: http://www.internetpulse.net/ is kinda unhealthy [06:24:23] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 8.462 second response time [06:24:32] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:24:44] although then why is it only at the last hop [06:25:26] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:26:39] !log shifting all traffic out of esams [06:26:42] Logged the message, Master [06:27:01] hmm I see, it's not really the buffer memory, even though that's the kernel message. ugh [06:27:57] how fast should that change propagate? [06:28:16] (i.e. i don't know if it's a DNS thing or BGP or what) [06:28:44] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:29:19] looks like the solution the last few times has been "reboot the box". yuck [06:29:24] it's a dns thing [06:29:30] apergos: are you looking a separate issue? [06:29:45] no. it's the same issue but a different piece of it [06:29:56] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:30:06] apergos: ? [06:30:11] on cp3001 (one of the bits varnish caches) we are seeing the out of socket buffer memory issue [06:30:19] which no doubt caused it to nosedive [06:30:39] that's not surprising if outbound network is failing [06:31:06] outbound but not in? [06:31:06] probably not worth troubleshooting at this point [06:31:10] ok [06:31:44] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 9.371 second response time [06:32:38] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:32:56] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:33:17] i'm getting 100% packet loss to wikipedia-lb.esams.wikimedia.org from comcast, 64% from pmtpa [06:34:05] jeremyb: dns site failover can take a while even with very low ttls due to many browsers not respecting them [06:34:25] more importantly recursors not respecting? [06:34:44] i think at least one of google/opendns doesn't respect them entirely [06:35:56] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:36:48] i'm looking at a one of four bits servers in pmtpa, the former esams traffic should be all thats pointed to it and its getting 3500 reqs/sec.. up from 2000 around a minute ago [06:37:08] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:17] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43571 bytes in 7.884 seconds [06:37:25] we had a few anecdotal reports that it's fixed [06:37:34] well, de wiki loading again including bits. [06:38:14] it isn't going to be a bit for some users but should be for everyone with a current browser [06:39:50] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:40:08] RECOVERY - LVS HTTP on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 50120 bytes in 3.778 seconds [06:41:38] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:42:26] te-8-2.csw1-esams.wikimedia.org [06:42:41] RECOVERY - LVS HTTP on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62470 bytes in 1.539 seconds [06:42:55] what is that.. router? switch? i wish i knew our network device naming scheme better [06:43:08] core switch [06:43:11] csw [06:44:01] ah, makes sense. what does the te-8-2 portion mean? [06:44:05] amslvs1 [06:44:20] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:38] PROBLEM - LVS HTTP on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:52] csw1-esams [06:47:33] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3954 bytes in 7.245 seconds [06:47:50] !log restarting pybal on amslvs2 [06:47:53] Logged the message, Master [06:48:12] on amslvs2? why? [06:48:18] PROBLEM - LVS HTTP on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:49:03] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:49:12] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3982 bytes in 3.585 seconds [06:50:08] apergos: i can ping amslvs2 from fenari but not amslvs1 [06:50:28] but why restart it on amslvs2? [06:50:50] amslvs1 is the one with the issue, afaict [06:53:24] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:43] our routers are still sending wikipedia-lb.esams.wikimedia.org traffic to amslvs1 [06:56:06] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:56:32] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=amslvs1.esams.wikimedia.org&m=load_one&r=2hr&s=by%20name&hc=4&mc=2&st=1335423339&g=network_report&z=large&c=Miscellaneous%20esams [06:56:47] amslv1 is not sending out traffic [06:56:51] ok yay I am on amslvs1 via mgmt after hopping ovr to amssq35 to get there [06:56:53] sheesh [06:56:58] !log [06:56:59] oops [06:57:22] !log restarted pybal on amlvs2 with bgp enabled [06:57:24] Logged the message, Master [06:57:32] !log restart pybal on amlvs1 with bgp disabled [06:57:34] Logged the message, Master [06:58:12] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 9.393 second response time [06:58:38] that seemed to transfer the problem to amlvs2, interesting [06:59:05] now getting 80% packetloss to amlvs2 from fenari, previously was getting none [06:59:06] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:11] woosters: is mark on vacation? [06:59:16] no [06:59:21] i sms'd him [06:59:25] ok [06:59:25] but he has not replied [06:59:30] let me try again [06:59:42] i'm not sure how critical it is at the moment, since the site is failed over [06:59:42] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:00:54] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: host 91.198.174.247, sessions up: 4, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [07:01:03] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:05] could be that foundry switch giving problem ... [07:01:12] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:12] but network diagnostics is needed at the core switch [07:01:27] that bgp alert is pybal related [07:01:34] observium showing anything interesting? [07:01:58] hey [07:02:04] hello [07:02:10] hi [07:02:11] what's the summary so far? [07:02:17] hey [07:02:24] i failed esams to pmtpa [07:02:26] plenty of packetloss to LVS from me but only on the last hop [07:02:33] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:44] (to esams) [07:02:53] i had 100% packetloss to mediawiki-lb.esams.wikimedia.org from home, 80% from fenari to there [07:03:15] i've been bumping between 70-90% [07:03:16] traceoute ended at te-8-2.csw1-esams.wikimedia.org [07:03:25] to which i get no packet loss [07:03:29] hmm [07:03:44] so both amslvs1 and amslvs2 see the issue? [07:03:45] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 1.007 seconds [07:03:46] i got 100% to amlvs1 but none to amlvs2 [07:03:49] i also have 0% loss to ve7.te-8-1.csw1-esams.wikimedia.org [07:03:54] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3959 bytes in 5.551 seconds [07:03:57] but could get into 1 from 2 [07:04:24] i in-place disabled pybal bgp on 1, enabled on 2, and restarted pybal on both [07:04:43] it should already be enabled on 2? [07:04:47] now i can't connect to 2 but can connect to 1 [07:04:51] hmm [07:05:06] on 2, it was enabled in global but disabled in all of the individual settings [07:05:15] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 3.306 second response time [07:05:24] PROBLEM - SSH on amslvs2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:40] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=amslvs1.esams.wikimedia.org&m=n_object&r=hour&s=by%20name&hc=4&mc=2&st=1335423912&g=network_report&z=large&c=Miscellaneous%20esams [07:05:45] hehe [07:06:12] that's a lot of inbound traffic [07:06:19] that's way too much inbound traffic [07:06:23] ddos? [07:07:07] check out kern.log on amlvs1 [07:07:22] Apr 26 07:11:10 amslvs1 kernel: [6637421.265034] UDP: bad checksum. From 64.56.147.178:27905 to 91.198.174.225:80 ulen 1033 [07:07:22] Apr 26 07:11:10 amslvs1 kernel: [6637421.314976] UDP: bad checksum. From 64.56.147.178:27905 to 91.198.174.225:80 ulen 1033 [07:07:23] Apr 26 07:11:15 amslvs1 kernel: [6637425.683207] net_ratelimit: 7 callbacks suppressed [07:07:27] that's normal [07:07:28] lots of that, but just from a few ips [07:07:32] ah [07:08:01] udp to port 80 looked weird to me [07:08:19] that is weird indeed [07:08:42] but I mean, that general message we see all the time [07:09:36] RECOVERY - SSH on amslvs2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:09:45] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:11:15] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.801 second response time [07:12:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [07:12:27] binasher: hm, if it was ddos, shouldn't it be either still pounding on the same boxes, OR pound the boxes in tampa now? [07:13:41] binasher: can you restore whatever you changed? [07:13:54] T3rminat0r: depends, such a thing may or may not follow dns changes. but no evidence that there is a ddos [07:13:56] yup [07:14:06] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:14:07] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:15:17] binasher: yea, that's what I meant, if it was a ddos, it would still be going, on either the ip or the host, unless the guy running it read that you switched to tampa ...so I'd guess no ddos ... [07:15:36] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:15:45] Well, if it is a botnet, it would located completely in Europe [07:15:56] Which is not really typical of botnets [07:16:48] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [07:16:57] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3981 bytes in 8.564 seconds [07:18:18] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 653 bytes in 0.219 seconds [07:18:27] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.456 seconds [07:19:30] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3966 bytes in 3.840 seconds [07:19:39] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 3.950 second response time [07:19:57] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [07:21:18] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:39] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:15] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:33] RECOVERY - LVS HTTP on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43364 bytes in 8.002 seconds [07:23:51] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [07:25:40] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 1.708 seconds [07:25:48] RECOVERY - LVS HTTP on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43568 bytes in 1.351 seconds [07:25:57] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 1.349 seconds [07:26:15] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 62476 bytes in 5.493 seconds [07:26:24] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52892 bytes in 5.478 seconds [07:27:00] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.405 second response time [07:27:45] RECOVERY - LVS HTTP on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38924 bytes in 0.437 seconds [07:27:45] RECOVERY - LVS HTTP on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 50120 bytes in 0.437 seconds [07:27:45] RECOVERY - LVS HTTP on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 52886 bytes in 0.545 seconds [07:27:45] RECOVERY - LVS HTTP on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62470 bytes in 0.547 seconds [07:27:45] RECOVERY - LVS HTTP on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60179 bytes in 0.545 seconds [07:27:46] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43370 bytes in 0.776 seconds [07:27:46] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50126 bytes in 0.773 seconds [07:27:47] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43572 bytes in 0.785 seconds [07:28:01] whoooo [07:28:03] RECOVERY - LVS HTTP on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 80032 bytes in 0.660 seconds [07:28:04] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60182 bytes in 0.895 seconds [07:28:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.47:11000 (timeout) [07:28:21] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 38930 bytes in 0.702 seconds [07:28:21] RECOVERY - LVS HTTP on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 70126 bytes in 0.547 seconds [07:28:39] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3971 bytes in 0.441 seconds [07:28:57] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 70132 bytes in 0.818 seconds [07:28:57] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80039 bytes in 0.876 seconds [07:29:33] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [07:35:24] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:36:00] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:43:39] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [07:50:16] wtf has been going on? [07:56:42] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.38:11000 (timeout) 10.0.11.33:11000 (timeout) 10.0.8.39:11000 (timeout) [07:57:00] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:58:03] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:00:27] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:21:00] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.40:11000 (Connection timed out) 10.0.11.37:11000 (timeout) 10.0.8.39:11000 (timeout) [08:21:36] PROBLEM - Host mw40 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:37] !log Power cycled mw40 [08:27:40] Logged the message, Master [08:31:21] RECOVERY - Host mw40 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [08:32:15] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:35:33] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused [08:50:51] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [08:52:12] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [09:01:03] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [09:45:54] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:19:46] New patchset: Mark Bergsma; "Set file size at 100G for all upload caches now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5876 [10:20:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5876 [10:20:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5876 [10:20:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5876 [10:50:23] New patchset: Mark Bergsma; "Manage tftpboot from Puppet, import PXE config files from Precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5878 [10:50:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5878 [11:02:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5878 [11:02:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5878 [11:08:34] New patchset: Mark Bergsma; "Put files in the correct place" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5880 [11:08:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5880 [11:09:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5880 [11:09:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5880 [11:27:11] New patchset: Mark Bergsma; "Add Wikimedia specific boot configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5881 [11:27:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5881 [11:30:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5881 [11:30:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5881 [11:30:36] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [12:06:23] New patchset: Mark Bergsma; "Fix broken indentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5882 [12:06:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5882 [12:06:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5882 [12:06:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5882 [12:17:24] New patchset: Nikerabbit; "Cron entries for TranslationNotifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5783 [12:17:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5783 [12:18:10] New review: Nikerabbit; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5783 [12:18:20] New review: Nikerabbit; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5783 [12:34:39] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:47:42] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [12:56:33] !log Created precise-wikimedia APT distribution [12:56:36] Logged the message, Master [12:58:22] New patchset: Mark Bergsma; "Use ocg3 as Precise install test host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5883 [12:58:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5883 [12:58:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5883 [12:58:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5883 [13:03:57] !log (re)starting innobackupex from db1017 to db59 for new s1 slave [13:03:59] Logged the message, notpeter [13:31:48] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:18] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [13:34:39] RECOVERY - mysqld processes on db60 is OK: PROCS OK: 1 process with command name mysqld [13:37:03] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [13:37:21] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:21] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:30] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:48] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:48] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:57] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 56524 seconds [13:37:57] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:57] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:57] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:06] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:06] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:06] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:15] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:24] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:24] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:24] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:33] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:33] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - LVS HTTP on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:51] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:00] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:00] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:00] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:09] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:09] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:54] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.580 second response time [13:40:39] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.311 second response time [13:41:33] RECOVERY - LVS HTTP on api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2241 bytes in 9.187 seconds [13:46:03] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection timed out [13:48:54] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 9.020 second response time on port 8123 [13:50:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [13:50:15]