[00:04:33] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [00:26:17] That "Error deleting file: Could not delete lock file for "mwstore://local-backend/local-public/1/10"." error is still happening, if that's not known. [00:26:20] enwiki [00:28:15] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:48:21] PROBLEM - MySQL Idle Transactions on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:15] PROBLEM - MySQL Slave Running on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:50:18] PROBLEM - MySQL Replication Heartbeat on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:51:21] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:52:24] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:52:24] PROBLEM - MySQL Recent Restart on db31 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:53:45] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay seconds [00:53:45] RECOVERY - MySQL Recent Restart on db31 is OK: OK 6144517 seconds since restart [00:54:12] RECOVERY - MySQL Idle Transactions on db31 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:54:50] looking at it [00:56:00] RECOVERY - MySQL Replication Heartbeat on db31 is OK: OK replication delay 0 seconds [00:56:18] RECOVERY - MySQL Slave Running on db31 is OK: OK replication [00:57:03] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:20:27] PROBLEM - Host db46 is DOWN: PING CRITICAL - Packet loss = 100% [01:22:51] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:30:03] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [01:40:53] !log on professor: restarted udpprofile collector [01:40:56] Logged the message, Master [01:42:12] RECOVERY - Host db46 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [01:42:48] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 240 seconds [01:51:50] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 19 seconds [01:54:23] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.38:11000 (timeout) 10.0.11.32:11000 (timeout) [01:55:44] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:02:56] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.36:11000 (timeout) [02:04:26] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:31:44] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:31:44] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.39:11000 (timeout) [02:33:05] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:40:17] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:46:53] 
RECOVERY - Puppet freshness on gallium is OK: puppet ran at Thu Apr 26 02:46:42 UTC 2012 [04:01:44] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:13:26] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:57:21] PROBLEM - LVS HTTP on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:21] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:23] PROBLEM - LVS HTTP on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:32] PROBLEM - LVS HTTP on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:32] PROBLEM - LVS HTTP on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:41] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:50] PROBLEM - LVS HTTP on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:51] PROBLEM - LVS HTTP on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:54] that looks bad [05:57:58] mark? [05:57:59] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:59] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:59] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:59] PROBLEM - LVS HTTP on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - LVS HTTP on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:17] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:17] PROBLEM - LVS HTTP on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:20] oh joy [05:59:20] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:49] apergos - the site seems ok here [05:59:57] how's it there? 
[06:00:00] this is esams only [06:00:03] woosters: all of the alerts are esams [06:00:10] ya, i noticed [06:00:18] and both #-tech complainants are euro [06:00:20] that is why aspergos can validate [06:00:23] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:00:30] now a 3rd [06:00:50] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:01:04] oh, some good news for a change [06:02:53] fwiw, http://www.internetpulse.net/ seems bad atm [06:03:02] socket buffers [06:04:18] I don't know how to clear that [06:04:38] tried restarting varnish on cp3001, that was not sufficient [06:05:11] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3976 bytes in 8.921 seconds [06:05:12] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:29] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:30] apergos: why single out that box? or it's arbitrary? [06:05:32] I have varnish stopped on cp3001 for a minute [06:05:36] there's two bits caches [06:05:46] I chose the first one, they both are unhappy [06:06:38] oh, only 2 huh [06:06:44] ok I"m goign to start it back up now [06:06:49] but then why all the alerts for not bits? [06:08:20] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 9.757 second response time [06:09:16] this looks better [06:09:21] let me do the same on cp3002 [06:10:04] hmm I don't see thatmessage there [06:10:34] it is ok I guess, looking at the graphs [06:11:03] or at least not in as bad a situation [06:12:59] PROBLEM - Host bits.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [06:15:14] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:15:41] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:15:44] i have no packet loss to ve7.te-8-1.csw1-esams.wikimedia.org and then plenty of loss to wikipedia-lb.esams.wikimedia.org [06:15:47] from US [06:15:58] (consecutive hops) [06:18:13] I don't know how to fix this :-( [06:18:26] i paged mark and asher [06:18:30] I am seeing out of socket memory again on cp3001 [06:18:32] woosters: leslie? [06:18:37] asher should be on in 5 minutes [06:18:41] RECOVERY - Host bits.esams.wikimedia.org is UP: PING WARNING - Packet loss = 86%, RTA = 154.51 ms [06:18:43] there he is [06:18:46] hi asher [06:19:00] hey [06:19:05] 26 06:15:43 < jeremyb> i have no packet loss to ve7.te-8-1.csw1-esams.wikimedia.org and then plenty of loss to wikipedia-lb.esams.wikimedia.org [06:19:09] 26 06:15:45 < jeremyb> from US [06:19:11] 26 06:15:57 < jeremyb> (consecutive hops) [06:19:15] (still) [06:19:34] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [06:19:42] on cps001 we have out of socket memory issues [06:19:44] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:19:45] jeremyb - leslie is sick :-( , not going to disturb her [06:19:48] restarting varnish did not clear them up [06:19:51] woosters: ;( [06:20:06] cp3001 [06:20:32] failover bits to pmtpa is an option if bits is the issue [06:21:54] jeremyb: where are you observing packet loss from? 
[06:22:27] pasting [06:22:46] binasher: http://dpaste.com/738014/plain/ [06:23:38] oh yeah, major packet loss to lvs in esams for me too [06:23:53] ok [06:24:20] binasher: also, maybe relevant: http://www.internetpulse.net/ is kinda unhealthy [06:24:23] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 8.462 second response time [06:24:32] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:24:44] although then why is it only at the last hop [06:25:26] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:26:39] !log shifting all traffic out of esams [06:26:42] Logged the message, Master [06:27:01] hmm I see, it's not really the buffer memory, even though that's the kernel message. ugh [06:27:57] how fast should that change propagate? [06:28:16] (i.e. i don't know if it's a DNS thing or BGP or what) [06:28:44] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:29:19] looks like the solution the last few times has been "reboot the box". yuck [06:29:24] it's a dns thing [06:29:30] apergos: are you looking a separate issue? [06:29:45] no. it's the same issue but a different piece of it [06:29:56] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:30:06] apergos: ? [06:30:11] on cp3001 (one of the bits varnish caches) we are seeing the out of socket buffer memory issue [06:30:19] which no doubt caused it to nosedive [06:30:39] that's not surprising if outbound network is failing [06:31:06] outbound but not in? [06:31:06] probably not worth troubleshooting at this point [06:31:10] ok [06:31:44] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 9.371 second response time [06:32:38] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:32:56] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:33:17] i'm getting 100% packet loss to wikipedia-lb.esams.wikimedia.org from comcast, 64% from pmtpa [06:34:05] jeremyb: dns site failover can take a while even with very low ttls due to many browsers not respecting them [06:34:25] more importantly recursors not respecting? [06:34:44] i think at least one of google/opendns doesn't respect them entirely [06:35:56] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:36:48] i'm looking at a one of four bits servers in pmtpa, the former esams traffic should be all thats pointed to it and its getting 3500 reqs/sec.. up from 2000 around a minute ago [06:37:08] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:17] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43571 bytes in 7.884 seconds [06:37:25] we had a few anecdotal reports that it's fixed [06:37:34] well, de wiki loading again including bits. 
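The "out of socket memory" condition apergos keeps hitting on cp3001 is the kernel reporting TCP memory pressure rather than a varnish fault, which is why restarting varnish alone did not clear it. A quick way to see how close a box is to that ceiling, using standard Linux interfaces (the values you would see are of course box-specific, not reproduced here):

    # pages currently committed to TCP sockets, plus in-use/orphan/timewait counts
    cat /proc/net/sockstat
    # the low / pressure / max thresholds, also in pages
    sysctl net.ipv4.tcp_mem
    # orphaned sockets count against a separate limit that triggers the same message
    sysctl net.ipv4.tcp_max_orphans

If the "mem" figure in sockstat sits at or above the third tcp_mem value, or orphans exceed tcp_max_orphans, the kernel starts logging "Out of socket memory" and dropping connections, which matches the symptoms described above.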
[06:38:14] it isn't going to be a bit for some users but should be for everyone with a current browser [06:39:50] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:40:08] RECOVERY - LVS HTTP on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 50120 bytes in 3.778 seconds [06:41:38] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:42:26] te-8-2.csw1-esams.wikimedia.org [06:42:41] RECOVERY - LVS HTTP on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62470 bytes in 1.539 seconds [06:42:55] what is that.. router? switch? i wish i knew our network device naming scheme better [06:43:08] core switch [06:43:11] csw [06:44:01] ah, makes sense. what does the te-8-2 portion mean? [06:44:05] amslvs1 [06:44:20] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:38] PROBLEM - LVS HTTP on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:52] csw1-esams [06:47:33] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3954 bytes in 7.245 seconds [06:47:50] !log restarting pybal on amslvs2 [06:47:53] Logged the message, Master [06:48:12] on amslvs2? why? [06:48:18] PROBLEM - LVS HTTP on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:49:03] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:49:12] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3982 bytes in 3.585 seconds [06:50:08] apergos: i can ping amslvs2 from fenari but not amslvs1 [06:50:28] but why restart it on amslvs2? [06:50:50] amslvs1 is the one with the issue, afaict [06:53:24] PROBLEM - SSH on amslvs1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:43] our routers are still sending wikipedia-lb.esams.wikimedia.org traffic to amslvs1 [06:56:06] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:56:32] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=amslvs1.esams.wikimedia.org&m=load_one&r=2hr&s=by%20name&hc=4&mc=2&st=1335423339&g=network_report&z=large&c=Miscellaneous%20esams [06:56:47] amslv1 is not sending out traffic [06:56:51] ok yay I am on amslvs1 via mgmt after hopping ovr to amssq35 to get there [06:56:53] sheesh [06:56:58] !log [06:56:59] oops [06:57:22] !log restarted pybal on amlvs2 with bgp enabled [06:57:24] Logged the message, Master [06:57:32] !log restart pybal on amlvs1 with bgp disabled [06:57:34] Logged the message, Master [06:58:12] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 9.393 second response time [06:58:38] that seemed to transfer the problem to amlvs2, interesting [06:59:05] now getting 80% packetloss to amlvs2 from fenari, previously was getting none [06:59:06] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:11] woosters: is mark on vacation? 
[06:59:16] no [06:59:21] i sms'd him [06:59:25] ok [06:59:25] but he has not replied [06:59:30] let me try again [06:59:42] i'm not sure how critical it is at the moment, since the site is failed over [06:59:42] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:00:54] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: host 91.198.174.247, sessions up: 4, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [07:01:03] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:05] could be that foundry switch giving problem ... [07:01:12] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:12] but network diagnostics is needed at the core switch [07:01:27] that bgp alert is pybal related [07:01:34] observium showing anything interesting? [07:01:58] hey [07:02:04] hello [07:02:10] hi [07:02:11] what's the summary so far? [07:02:17] hey [07:02:24] i failed esams to pmtpa [07:02:26] plenty of packetloss to LVS from me but only on the last hop [07:02:33] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:44] (to esams) [07:02:53] i had 100% packetloss to mediawiki-lb.esams.wikimedia.org from home, 80% from fenari to there [07:03:15] i've been bumping between 70-90% [07:03:16] traceoute ended at te-8-2.csw1-esams.wikimedia.org [07:03:25] to which i get no packet loss [07:03:29] hmm [07:03:44] so both amslvs1 and amslvs2 see the issue? [07:03:45] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 1.007 seconds [07:03:46] i got 100% to amlvs1 but none to amlvs2 [07:03:49] i also have 0% loss to ve7.te-8-1.csw1-esams.wikimedia.org [07:03:54] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3959 bytes in 5.551 seconds [07:03:57] but could get into 1 from 2 [07:04:24] i in-place disabled pybal bgp on 1, enabled on 2, and restarted pybal on both [07:04:43] it should already be enabled on 2? [07:04:47] now i can't connect to 2 but can connect to 1 [07:04:51] hmm [07:05:06] on 2, it was enabled in global but disabled in all of the individual settings [07:05:15] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 3.306 second response time [07:05:24] PROBLEM - SSH on amslvs2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:40] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=amslvs1.esams.wikimedia.org&m=n_object&r=hour&s=by%20name&hc=4&mc=2&st=1335423912&g=network_report&z=large&c=Miscellaneous%20esams [07:05:45] hehe [07:06:12] that's a lot of inbound traffic [07:06:19] that's way too much inbound traffic [07:06:23] ddos? [07:07:07] check out kern.log on amlvs1 [07:07:22] Apr 26 07:11:10 amslvs1 kernel: [6637421.265034] UDP: bad checksum. From 64.56.147.178:27905 to 91.198.174.225:80 ulen 1033 [07:07:22] Apr 26 07:11:10 amslvs1 kernel: [6637421.314976] UDP: bad checksum. 
From 64.56.147.178:27905 to 91.198.174.225:80 ulen 1033 [07:07:23] Apr 26 07:11:15 amslvs1 kernel: [6637425.683207] net_ratelimit: 7 callbacks suppressed [07:07:27] that's normal [07:07:28] lots of that, but just from a few ips [07:07:32] ah [07:08:01] udp to port 80 looked weird to me [07:08:19] that is weird indeed [07:08:42] but I mean, that general message we see all the time [07:09:36] RECOVERY - SSH on amslvs2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:09:45] PROBLEM - LVS HTTP on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:11:15] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.801 second response time [07:12:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [07:12:27] binasher: hm, if it was ddos, shouldn't it be either still pounding on the same boxes, OR pound the boxes in tampa now? [07:13:41] binasher: can you restore whatever you changed? [07:13:54] T3rminat0r: depends, such a thing may or may not follow dns changes. but no evidence that there is a ddos [07:13:56] yup [07:14:06] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:14:07] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:15:17] binasher: yea, that's what I meant, if it was a ddos, it would still be going, on either the ip or the host, unless the guy running it read that you switched to tampa ...so I'd guess no ddos ... [07:15:36] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:15:45] Well, if it is a botnet, it would located completely in Europe [07:15:56] Which is not really typical of botnets [07:16:48] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [07:16:57] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3981 bytes in 8.564 seconds [07:18:18] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 653 bytes in 0.219 seconds [07:18:27] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.456 seconds [07:19:30] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3966 bytes in 3.840 seconds [07:19:39] RECOVERY - LVS HTTP on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 3.950 second response time [07:19:57] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [07:21:18] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:39] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:15] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:33] RECOVERY - LVS HTTP on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43364 bytes in 8.002 seconds [07:23:51] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [07:25:40] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 1.708 seconds [07:25:48] RECOVERY 
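Regarding the kern.log excerpts pasted above: a quick way to check binasher's observation that the bad-checksum UDP traffic comes from only a few addresses, rather than a broad flood, is to tally the source IPs straight out of the log (stock Ubuntu log path assumed):

    grep 'UDP: bad checksum' /var/log/kern.log \
        | grep -o 'From [0-9.]*' | sort | uniq -c | sort -rn | head

A short list of sources with very high counts points at a handful of misbehaving clients rather than a distributed attack, which is consistent with the conclusion reached here that there was no evidence of a DDoS.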
- LVS HTTP on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43568 bytes in 1.351 seconds [07:25:57] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 1.349 seconds [07:26:15] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 62476 bytes in 5.493 seconds [07:26:24] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52892 bytes in 5.478 seconds [07:27:00] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.405 second response time [07:27:45] RECOVERY - LVS HTTP on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38924 bytes in 0.437 seconds [07:27:45] RECOVERY - LVS HTTP on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 50120 bytes in 0.437 seconds [07:27:45] RECOVERY - LVS HTTP on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 52886 bytes in 0.545 seconds [07:27:45] RECOVERY - LVS HTTP on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62470 bytes in 0.547 seconds [07:27:45] RECOVERY - LVS HTTP on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60179 bytes in 0.545 seconds [07:27:46] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43370 bytes in 0.776 seconds [07:27:46] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50126 bytes in 0.773 seconds [07:27:47] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43572 bytes in 0.785 seconds [07:28:01] whoooo [07:28:03] RECOVERY - LVS HTTP on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 80032 bytes in 0.660 seconds [07:28:04] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60182 bytes in 0.895 seconds [07:28:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.47:11000 (timeout) [07:28:21] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 38930 bytes in 0.702 seconds [07:28:21] RECOVERY - LVS HTTP on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 70126 bytes in 0.547 seconds [07:28:39] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3971 bytes in 0.441 seconds [07:28:57] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 70132 bytes in 0.818 seconds [07:28:57] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80039 bytes in 0.876 seconds [07:29:33] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [07:35:24] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:36:00] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:43:39] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [07:50:16] wtf has been going on? 
[07:56:42] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.38:11000 (timeout) 10.0.11.33:11000 (timeout) 10.0.8.39:11000 (timeout) [07:57:00] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:58:03] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:00:27] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:21:00] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.40:11000 (Connection timed out) 10.0.11.37:11000 (timeout) 10.0.8.39:11000 (timeout) [08:21:36] PROBLEM - Host mw40 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:37] !log Power cycled mw40 [08:27:40] Logged the message, Master [08:31:21] RECOVERY - Host mw40 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [08:32:15] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:35:33] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused [08:50:51] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [08:52:12] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [09:01:03] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [09:45:54] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:19:46] New patchset: Mark Bergsma; "Set file size at 100G for all upload caches now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5876 [10:20:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5876 [10:20:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5876 [10:20:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5876 [10:50:23] New patchset: Mark Bergsma; "Manage tftpboot from Puppet, import PXE config files from Precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5878 [10:50:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5878 [11:02:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5878 [11:02:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5878 [11:08:34] New patchset: Mark Bergsma; "Put files in the correct place" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5880 [11:08:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5880 [11:09:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5880 [11:09:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5880 [11:27:11] New patchset: Mark Bergsma; "Add Wikimedia specific boot configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5881 [11:27:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5881 [11:30:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5881 [11:30:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5881 [11:30:36] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [12:06:23] New patchset: Mark Bergsma; "Fix broken indentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5882 [12:06:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5882 [12:06:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5882 [12:06:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5882 [12:17:24] New patchset: Nikerabbit; "Cron entries for TranslationNotifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5783 [12:17:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5783 [12:18:10] New review: Nikerabbit; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5783 [12:18:20] New review: Nikerabbit; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5783 [12:34:39] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:47:42] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [12:56:33] !log Created precise-wikimedia APT distribution [12:56:36] Logged the message, Master [12:58:22] New patchset: Mark Bergsma; "Use ocg3 as Precise install test host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5883 [12:58:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5883 [12:58:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5883 [12:58:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5883 [13:03:57] !log (re)starting innobackupex from db1017 to db59 for new s1 slave [13:03:59] Logged the message, notpeter [13:31:48] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:18] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [13:34:39] RECOVERY - mysqld processes on db60 is OK: PROCS OK: 1 process with command name mysqld [13:37:03] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [13:37:21] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:21] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:30] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:48] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:48] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:57] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 56524 seconds [13:37:57] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:57] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:57] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:06] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:06] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:06] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:15] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:24] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:24] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:24] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:33] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:33] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - LVS HTTP on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:42] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:51] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:00] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:00] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:00] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:09] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:09] 
PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:54] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.580 second response time [13:40:39] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.311 second response time [13:41:33] RECOVERY - LVS HTTP on api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2241 bytes in 9.187 seconds [13:46:03] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection timed out [13:48:54] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 9.020 second response time on port 8123 [13:50:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [13:50:15] PROBLEM - LVS HTTP on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:09] PROBLEM - Apache HTTP on srv250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:09] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:36] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:45] PROBLEM - Apache HTTP on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:45] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:03] PROBLEM - Apache HTTP on srv215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:39] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:48] PROBLEM - Apache HTTP on srv251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:57] PROBLEM - Apache HTTP on srv214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:15] PROBLEM - Apache HTTP on srv255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:33] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:33] PROBLEM - Apache HTTP on srv218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:35] good morning! (east coast time here) [13:53:55] apergos and mark [13:54:10] so, since we added stat1 to the list of allowed exports yesterday [13:54:24] the nfs mounts to 10.0.5.8 nfs1 work fine now [13:54:27] so. [13:54:29] this is a bad time. [13:54:37] ahhhh phooey :0 [13:54:38] :) [13:54:39] we have a problem with the api servers. 
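For reference on ottomata's point about the exports: once a client host is added to the allow list on the NFS server, the simplest sanity check is from the client side with standard tools. The export path and mount point below are placeholders for illustration, not the actual entries from the ticket:

    # ask nfs1 (10.0.5.8) what it exports and to whom
    showmount -e 10.0.5.8
    # hypothetical export and mount point, named only for the example
    mount -t nfs 10.0.5.8:/export/statdata /mnt/nfs1
    df -h /mnt/nfs1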
[13:54:43] oh yikes ok [13:54:44] please wait til it's sorted out [13:54:47] !log restartin lucene on search1017 and search1018 [13:54:50] Logged the message, notpeter [13:54:53] no probs, thanks [13:55:30] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.537 second response time [13:55:39] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [13:55:39] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.748 second response time [13:55:39] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [13:55:39] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [13:55:48] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:55:49] I sure don't see anything inthe sampled log, and there's too much crap inthe log on a given api server to tell what's noise and what isn't [13:56:06] yeah [13:56:06] RECOVERY - Apache HTTP on srv255 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.872 second response time [13:56:06] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.702 second response time [13:56:15] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.086 second response time [13:56:24] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.647 second response time [13:57:00] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.896 second response time [13:57:00] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.496 second response time [13:57:25] !log installing some (security) upgrades on fenari (apt,cron,samba,...) [13:57:28] Logged the message, Master [13:57:54] so, I restarted the search prefix hosts [13:58:03] and things seem to be clearing up now [13:58:11] they weren't getting crazy number of requests or anything [13:59:38] nope still messed up [14:00:00] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:09] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:09] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:18] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:27] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:45] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:45] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:47] notpeter: those Apache problems are related to search? 
[14:00:53] I don't think so [14:00:57] oh [14:01:09] I mean, sure, in that all of our systems are related [14:01:16] but the problem looks to be upstream of that [14:01:21] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:39] PROBLEM - Apache HTTP on srv214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:40] because search boxes look fine, logs look normal in terms of requests [14:04:37] !log stopping lucene on search1017 and 1018 to take that out of the equation [14:04:39] Logged the message, notpeter [14:04:57] RECOVERY - Apache HTTP on srv218 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.586 second response time [14:04:57] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.267 second response time [14:04:57] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.506 second response time [14:04:57] RECOVERY - Apache HTTP on srv215 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.537 second response time [14:04:57] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.671 second response time [14:05:06] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.120 second response time [14:05:06] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.676 second response time [14:05:07] it does appear related ;) [14:05:12] indeed [14:05:15] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [14:05:16] *sigh* [14:05:24] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [14:05:24] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [14:05:33] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [14:05:33] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [14:05:33] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [14:05:33] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [14:05:33] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [14:05:34] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [14:05:34] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [14:05:35] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:05:42] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [14:05:51] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:05:51] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:05:51] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [14:05:51] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:05:51] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [14:06:00] RECOVERY - 
Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [14:06:00] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [14:06:00] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [14:06:09] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [14:06:09] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [14:06:09] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [14:06:09] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [14:06:09] RECOVERY - Apache HTTP on srv257 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [14:06:10] RECOVERY - LVS HTTP on api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2241 bytes in 0.104 seconds [14:06:10] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:06:18] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [14:07:21] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection refused [14:07:25] notpeter: heh, a full page of recoveries should cheer you up [14:07:48] not even slightly [14:08:02] that's a result of turning off autocomplete on search [14:08:29] i see..:( [14:11:33] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused [14:13:03] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.028 second response time on port 8123 [14:13:03] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [14:17:42] ottomata: nice @ mount works for stat1 (besides the general NFS subject) [14:18:32] Jeff_Green: so.... MaxBackupIndex is not a thing that works in log4j currently it would seem [14:18:44] oh java, you never cease to amaze me [14:18:59] notpeter: what's it suppose do do? roll off old logs? [14:19:08] ottomata: ok to resolve the ticket then? (had done some other stuff in it and it was just open for the mount) [14:19:18] !log cleaned log space on search1017 and search1018 and started lucene [14:19:20] Logged the message, notpeter [14:19:29] Jeff_Green: it's the part of the system that deletes old logs [14:19:33] oic [14:19:44] so all that noise was from full logs [14:19:48] logrotate can probably do just that part [14:19:57] or just a cron that rms [14:20:42] what mount works? 
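On the log4j point above: since MaxBackupIndex was not actually pruning anything, the fix notpeter merges later in this log pairs log4j's own rotation with a cron job that deletes old rotated files. A minimal sketch of that kind of entry follows; the path, filename pattern and 7-day retention are assumptions for illustration, not what the actual patchset contains:

    # /etc/cron.d/search-log-cleanup (illustrative)
    # remove rotated lucene logs that log4j leaves behind; the live log,
    # which carries no rotation suffix under this naming assumption, is untouched
    30 3 * * * root find /a/search/log -name '*.log.*' -mtime +7 -delete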
[14:21:03] ottomata> the nfs mounts to 10.0.5.8 nfs1 work fine now [14:21:11] < ottomata> so, since we added stat1 to the list of allowed exports yesterday [14:21:17] I thought we were getting rid of that mount [14:21:20] for the stat1 server [14:21:29] read the ticket [14:22:44] mutante, not quite [14:22:47] yeah i want to ask you guys about that [14:23:01] i took the mounts out of puppet [14:23:03] but it is woooorking [14:23:12] it doesn't matter that it's working or not [14:23:14] and i was discussing this more with apergos yesterday [14:23:18] we were getting rid of them for other reasons [14:23:21] yeah i know [14:23:26] we were trying to find a solution that satisfied erik's needs [14:23:29] and it was getting late [14:23:33] so we decided to bring it up again now [14:23:50] the NFS mount satisfies his needs, an rsync like deploy type thing will be complicated [14:23:57] doable but annoying and complicated [14:24:10] example: he updates the files in htdocs every 15 minutes as his jobs run [14:24:19] so he can see the status on stats.wikimedia.org [14:24:28] yes but htdocs will be on stat1 [14:24:31] no [14:24:35] why not? [14:24:44] was on #2162 "config changes for stat1" . didnt see that yet about getting rid of it [14:24:48] htdocs is gping to be served from spence by an nfs mount of stats1? [14:24:51] really? [14:24:55] from SPENCE? [14:24:56] right [14:24:58] yeah that's how it is now [14:25:02] stats.wikimedia.org [14:25:03] what the fuck [14:25:06] that certainly needs to change [14:25:07] right [14:25:27] see this is where I began to headdesk yesterday [14:25:36] can someone give me the summary now? [14:25:37] how about a web server, as in an actual one [14:25:38] so he generates files on stat1, and needs to have updates often show up on stats.wikimedia.org [14:25:44] so. [14:25:46] yes [14:25:48] however you guys want to solve that is fine with me :) [14:25:51] i can do whatever [14:25:52] so why not put stats.wikimedia.org on stat1? [14:25:58] that's what we discussed yesterday [14:26:07] i think because it is meant to be a computation server, [14:26:15] if he is doing heavy stuff he doesn't want it to bog down the site [14:26:20] well spence is meant to be a monitoring server [14:26:26] certainly not a server to host statistics [14:26:58] do we not have a misc web server box? [14:27:11] let's not [14:27:15] ja, wherever is fine, but we'd still have to solve the NFS vs. no NFS prob [14:27:16] let's just put a web server on stat1 [14:27:26] if needed we can put it behind varnish misc cluster [14:28:10] hm, ok. [14:28:23] mark, what if I told you that the analytics team wants to deprecate all of this stuff eventually [14:28:24] ? [14:28:33] I'm sure they do [14:28:38] and this is just a temporary change, to help erik be able to generate his stats on a more powerful machine [14:28:39] but this we can easily fix now can't we? [14:28:55] I would say that temporary is always long term around here [14:28:58] i guess, i mean, i don't mind hosting it from stat1 myself, i guess I should ask erik though [14:29:00] heheh, yeah i know [14:29:04] and probably true in this case too :) [14:29:33] to do so, what is required? just copying htdocs over, setting up apache…and then? [14:29:40] some proxy rule somewhere? 
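What mark and ottomata agree on here boils down to one more vhost on stat1 itself. A minimal sketch of what such a vhost could look like is below; the docroot follows the /srv convention discussed later, and the password-protected location is only a guess based on the htpasswd.stats file that gets copied over from spence further down, so none of this is the definition that was actually deployed:

    <VirtualHost *:80>
        ServerName stats.wikimedia.org
        DocumentRoot /srv/stats.wikimedia.org/htdocs

        <Directory /srv/stats.wikimedia.org/htdocs>
            Options FollowSymLinks
            AllowOverride None
        </Directory>

        # hypothetical restricted area reusing the htpasswd file from spence
        <Location /private>
            AuthType Basic
            AuthName "stats"
            AuthUserFile /etc/apache2/htpasswd.stats
            Require valid-user
        </Location>
    </VirtualHost>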
[14:29:41] yes indeed [14:29:47] no proxy rule needed yet [14:29:55] oh cause it has a public IP [14:29:56] ja [14:30:00] if stat1 ever runs in trouble due to web servering (I highly doubt it) we'll put varnish in front [14:30:03] yeah [14:30:04] ok [14:30:13] could you do me a favor then? [14:30:17] i don't have access to spence [14:30:18] yes? [14:30:21] could you get me the vhost file? [14:30:27] i could just make one [14:30:32] but he might have special stuff in there? [14:30:41] probably making a new one from similar ones is a better idea [14:30:44] but yeah we can get you that [14:30:49] yeah, i'll doa new one [14:30:50] via puppet [14:30:59] but it'd be good to see what he has just in case there is something id on't know about [14:31:01] cool [14:31:03] yep [14:31:04] otto@wikimedia.org [14:31:19] ok, i'll double check the hosting from stat1 with erik before I do it [14:31:22] but i think it should be ok [14:32:09] here it is: http://p.defau.lt/?HtKBW_mfNTfo8fu_ySi6Bw [14:32:14] I don't think that's up to erik :) [14:33:15] it certainly shouldn't be where it is now (spence), and I don't see any reason to complicate it by putting the web serving on a different host [14:33:23] aye, ok [14:33:30] it's not exactly a high traffic web site, and it shouldn't matter for that host at all [14:33:32] well, if this thing does get super loaded with analytics scripts [14:33:35] we'll worry about that then [14:33:38] yes [14:33:47] then we can help get it moved [14:33:49] cool [14:34:02] varnish caching in front should work well too, it's all static [14:34:16] aye [14:34:25] except for his job status pages, but yeah [14:34:34] hey, there are two VirtualHosts in this file with the same ServerName? [14:34:46] oh wikipedia [14:34:48] naw, got it [14:34:50] sorry, thanks [14:35:06] ok, on it, thanks mark and apergos [14:35:12] thanks for working on it [14:36:23] yes, thanks for doing all the legwork [14:36:26] ah, so, one more thing [14:36:39] just heard from erik about read only access to dataset2 [14:36:42] he needs write access [14:36:44] s'ok to change? [14:36:52] I thought we talked about this already [14:36:53] no? [14:37:02] well, we thought he only needed read only [14:37:16] i just asked him if he had what he needed there [14:37:20] he respnoded [14:37:20] /mnt/data/xmldatadumps/public/other/pagecounts-ez I thought this was going to be done by rsync [14:37:38] oh to write? [14:37:44] but read from /mnt/data is ok? [14:37:46] yes, to write [14:37:49] read from there is ok [14:38:09] ahhh, sorry, guess i missed that…but um, if the mount is already there, is there a reason not to mount it rw? [14:38:40] New patchset: Pyoungmeister; "adding a cron to compliment log4j log rotation on search hosts. log4j rotation is good because it is easier on the system, but it does not yet have the ability to delete old logs :/ someday..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5886 [14:38:56] if we rsync [14:38:57] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5886 [14:39:02] we're going to have to worry about users and access to dataset2 [14:39:18] apergos can deal with that in the rsyncd [14:39:42] no need to have full write access to all of dataset2's data [14:40:25] hmm, ok, [14:40:29] i think he needs write to /mnt/data/xmldatadumps/public/other/pagecounts-ez/ [14:40:40] (or wherever that is on dataset2) [14:41:00] yes, that's his directory for various things [14:41:35] mark: actually useful commit message ^^ :) [14:41:48] notpeter: yay! [14:42:04] now next time put a short summary line as the first line, then the details in a separate paragraph :P [14:42:10] so gerrit doesn't show that as the summary [14:42:26] ok, so apergos, can you set up an rsync module that I can write to on dataset2? [14:42:36] ah, that'sreasonable. I am trying to improve my practices [14:42:51] for you to write to? [14:42:56] sorry [14:42:57] for erik [14:42:59] I thought we want erik to be able to write to it [14:43:02] yeah [14:43:06] uh huh [14:43:07] i'll be setting it up though :p [14:43:16] huh? [14:43:19] maybe will create a script that makes it easy for him [14:43:21] dunno [14:43:23] i'll talk with him about it [14:43:26] I see [14:43:40] but as long as he can write to that directory via rsync, we shoudl be good [14:48:09] PROBLEM - LVS HTTP on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:49:39] RECOVERY - LVS HTTP on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.297 seconds [14:52:48] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 60513 seconds [15:02:18] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5886 [15:02:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5886 [15:05:00] mark [15:05:02] i will also need [15:05:20] /etc/apache2/htpasswd.stats [15:05:25] from spence [15:07:34] right [15:07:57] i'll copy it to stats1 [15:08:10] danke [15:08:55] it's in /root now [15:13:59] thanks! [15:14:04] New patchset: ArielGlenn; "rsync access for pagecount and related files for stats1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5887 [15:14:20] woosters: update quote with SSDs included in #1961 [15:14:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5887 [15:14:24] updated* [15:29:41] who knows about stuff in webserver.pp? [15:29:47] there are a few webserver::apache type classes [15:29:51] hard to know what I should use [15:30:03] we're in a transition [15:30:08] you can use the old style I guess [15:30:12] we can always convert to new style some day [15:30:21] so the top stuff, what's used most in other manifests [15:30:31] top of the file [15:31:24] so, webserver::apache2 [15:31:28] can I use apache_site define? [15:31:31] with that? [15:32:13] yes [15:32:15] mmk [15:32:42] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [15:43:39] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.121 seconds response time. www.wikipedia.org returns 208.80.154.225 [15:53:16] so i see TCP Connection Thread died because of STL error: Timeout reading data in ns2 logs [15:54:02] it's a known bug [15:54:09] fixed too [15:54:14] just we haven't deployed it yet [15:54:36] ah, what is the bug and could it be causing the pdns issues ? 
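The dataset2 write access discussed above ends up as an rsync daemon module scoped to the pagecounts tree, rather than a writable mount of the whole filesystem. A rough sketch of such a module follows; the module name, uid/gid and on-disk path are placeholders, and the real definition is in ArielGlenn's patchset (gerrit change 5887):

    # /etc/rsyncd.conf on dataset2 -- illustrative module only
    [pagecounts-ez]
        path = /data/xmldatadumps/public/other/pagecounts-ez
        comment = pagecount aggregates pushed from stat1
        read only = no
        hosts allow = stat1.wikimedia.org
        uid = datasets
        gid = datasets

With a module like this, a push from stat1 would look roughly like rsync -av outdir/ dataset2::pagecounts-ez/, and nothing outside that one directory is writable.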
some quick googling suggests just restrarting the server when seeing this [15:54:50] it IS the pdns issue [15:55:00] i'll roll a newer build when I overhaul the auth dns setup [15:55:04] with precise too [15:55:07] :) [15:55:11] hardy -> precise [15:55:14] it's the future! [15:55:20] lucid in fact [15:55:23] just dobson is hardy [15:56:43] oh did you see my local pref changes? [15:58:54] yeah [15:58:56] makes sense I guess [15:59:02] on one hand we prefer private over public [15:59:05] on other hand, not nice to public peers ;) [16:00:31] mark, have you ever tried to require a define by name? [16:00:35] in puppet? [16:00:42] yes [16:00:49] Apache_site["name"] [16:00:52] can you do it with defines that have :: in the name? [16:00:53] what about [16:01:03] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) 10.0.8.39:11000 (timeout) [16:01:08] require -> Webserver::apache::modules['rewrite'] [16:01:18] then you would use: [16:01:48] Webserver::Apache::Modules['rewrite'] -> Class["class:name:which:requires"] [16:01:51] but isn't that the newer stuff? [16:01:53] you shouldn't use that yet [16:02:22] i shouldn't use it? what if it works? I didn't realize that the generic_vhost.erb thing was newer stuff still I already started doing it [16:02:26] i seem to be getting what I need [16:02:28] no good? [16:03:33] ok [16:03:41] you can use it if you like, just not well tested yet [16:03:45] ok :) [16:03:48] i can help test it [16:03:54] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [16:03:57] i kinda want to make the site define better too [16:04:19] i might look into making vhost.erb be fancy with AuthUserFile stuff [16:04:21] mayyyybe [16:04:24] just using custom => right now [16:06:12] ok, thanks, haven't used puppet this new before , I was previously stuck on an older version because of some custom facts I had written. relationship chaining coooool [16:07:26] ah yeah require => Webserver::Apache::Modules['rewrite'] works too [16:08:16] ohh all caps? [16:08:17] hmmmm [16:08:19] lemme try that too [16:08:44] yes [16:15:05] yeah that works, i'll just do that [16:15:11] cool, danke [16:35:23] mark, seeing as the default docroot on this new define is in /srv [16:35:32] is that the preferred place to put hosted site document roots? [16:35:40] yes [16:36:27] ok cool, fyi, i get this if I do not specify docroot manually [16:36:39] Error 400 on SERVER: Cannot reassign variable docroot at /etc/puppet/manifests/webserver.pp:244 [16:44:06] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 0 seconds [16:44:24] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [16:48:36] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [16:56:47] ahhhh crap crackers [16:56:56] not enough space on / on stat1 for htdocs [16:56:57] oh yeah [16:56:59] because it is 32 G [16:57:34] then it should be on a separate partition/filesystem [16:57:41] yeah [16:57:46] i can symlink to /a [16:57:50] there's 8.9T avail there [16:57:53] probably using LVM [16:58:00] a reinstall would be good :P [16:58:14] if it can wait a few days you can use precise too [16:58:14] booo [16:58:16] instead of lucid [16:58:24] can I just use /a ? [16:58:26] and symlink? [16:58:51] alright [17:02:58] whoa...we're actually considering upgrading production machines to Precise before Tepid Tapeworm comes out? ;-) [17:03:47] yeah... as always? 
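The require-versus-chaining exchange between ottomata and mark above can be summed up in a few lines of puppet; both spellings establish the same dependency. The file resource below is only a stand-in so the example is self-contained, while the capitalised reference and the arrow form are exactly the ones quoted in the conversation:

    # reference form, inside the dependent resource
    file { '/srv/stats.wikimedia.org':
        ensure  => directory,
        require => Webserver::Apache::Modules['rewrite'],
    }

    # equivalent relationship-chaining form, declared separately
    Webserver::Apache::Modules['rewrite'] -> File['/srv/stats.wikimedia.org']

Each namespace segment of the defined type is capitalised in the reference, which is the capitalisation detail ottomata asks about just before it works for him.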
[17:04:03] just apaches we rarely care about, not important part of our infra ;) [17:06:19] this shows the Precise upgrade train starting this time next year: http://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals#Milestones_by_quarter [17:06:52] uhh... [17:06:58] I was actually planning to deploy eqiad apaches with precise [17:07:07] and everything else we touch from now on [17:07:23] woosters: ^ [17:07:40] ya [17:07:54] although mediawiki and newer php releases are a bit difficult to test for us [17:08:02] New patchset: preilly; "Thursday 10am-11:30am PST: Uganda, Ivory Coast" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5891 [17:08:09] other than "just push it in production and wait for people to complain" ;) [17:08:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5891 [17:08:21] Man why couldn't Tepid be the LTS. I'd have loved to be able to say our site runs on tapeworm [17:08:28] Would confuse some non-tech people :) [17:08:44] we can deploy eqiad apaches with lucid as well if that's problematic [17:08:52] but everything else we'll start upgrading now [17:09:14] Oh, wait, Tepid IS the next LTS. Sweet! [17:09:42] mark: Why don't we make precise available in labs first, and start upgrading stuff there [17:10:37] RoanKattouw: it should become available in labs too [17:10:49] LeslieCarr: can you do this one today https://gerrit.wikimedia.org/r/5891 [17:11:20] well, oneiric instances in labs get easily broken [17:11:32] (probably some incompatibility with puppet rules) [17:11:43] so yes, better test them deeply before ;) [17:11:53] preilly: when do you need it ? [17:11:59] LeslieCarr: ASAP [17:12:08] well puppet is so fucked up in labs that I can't be bothered for most things [17:12:11] easier to test and fix in production [17:12:12] LeslieCarr: and this is the last time I'll do this to you [17:12:17] (without traffic ;) [17:12:22] arthur better pour me more whiskey next time we're in the same city :p [17:13:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5891 [17:13:51] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5891 [17:14:06] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.44:11000 (timeout) [17:14:43] LeslieCarr: ha ha ah [17:15:46] merging the changes now [17:16:57] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [17:17:01] !log reloaded varnish on mobile caches [17:17:04] Logged the message, Mistress of the network gear. [17:18:39] hola [17:18:50] mark: ^ [17:18:57] hi [17:19:12] no strong opinion overall, but it would be nice if we could de-uglify mailman a bit [17:19:19] it is ugly, that's true [17:19:54] I don't know how much things will break with future updates [17:20:09] it's possible it's very minor and can be fixed by the community maintaining the templates after upgrades [17:20:23] or it might break functionality during upgrade, which would actually suck [17:20:47] yeah .. if tho can do his work in labs perhaps he can write a little script to revert to the default templates prior to an upgrade, and then apply the changes in labs to see if they break anything? [17:21:02] is it feasible to get a mailman instance up in labs for him to play with? 
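The /srv-versus-/a problem above can be handled the way the channel converges on: keep the document root path under /srv (the stated convention) but back it with the large /a filesystem via a symlink, and pass docroot explicitly to avoid the "Cannot reassign variable docroot" error. The sketch below is illustrative; the directory names and the exact apache_site parameters are assumptions.

    class misc::statistics::htdocs {
        file { '/a/stats-htdocs':
            ensure => directory,
        }

        # Serve from /srv per convention, but keep the data on /a
        # (8.9T available) instead of the 32G root filesystem.
        file { '/srv/stats.wikimedia.org':
            ensure => link,
            target => '/a/stats-htdocs',
        }

        # Specifying docroot manually sidesteps the
        # "Cannot reassign variable docroot" failure seen with the default.
        apache_site { 'stats':
            name    => 'stats.wikimedia.org',
            docroot => '/srv/stats.wikimedia.org',
            require => File['/srv/stats.wikimedia.org'],
        }
    }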
[17:21:10] that would be feasible [17:21:26] the mail part of that is a bit hard [17:21:33] but at least the web interface side is easy [17:22:03] Thehelpfulone, any progress getting an instance for this? if you have a small instance perhaps you can do most of the legwork of proof-of-concepting it [17:22:32] Eloquence: well I asked Ryan_Lane about an instance and he said the ops decided against the mailman listinfo update [17:22:38] so that kind of killed the request for the instance [17:22:44] er [17:22:44] heh [17:22:52] we said we preferred it didn't happen [17:22:56] and someone would talk :) [17:23:28] preilly: lemme know when to revert [17:23:37] LeslieCarr: okay will do [17:24:08] having a labs mailman instance would be useful for its own sake ... tho, I'll try to grab a couple mins of ryan's time later to see if he has principled reservations [17:24:12] I think you should be able to get an instance regardlessly [17:24:13] yes [17:25:42] Eloquence: thanks [17:26:03] hey, you guys probably already know about this, but it seems images are not working on wikipedia right now? or maybe just for me? [17:26:26] http://cl.ly/1c1M1Y0S3O1X2B2p2W2N [17:26:58] hmm it works for me ottomata [17:27:01] do you have images blocked? [17:27:26] ottomata, what happens if you go to http://upload.wikimedia.org/wikipedia/commons/6/63/Wikipedia-logo.png ? [17:27:45] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused [17:27:47] (that's the image you should have get in the center) [17:28:00] all of them show up for me too [17:29:15] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 5349 bytes in 0.001 seconds [17:29:24] 404 [17:29:33] checked all my browsers [17:29:42] hm, weird [17:29:44] ohohoohoohoho [17:29:45] doh [17:29:50] interesting, do a host on upload.wikimedia.org ? [17:29:57] do not listen to me [17:30:02] i have been playing with my /etc/hosts file [17:30:07] oh [17:30:13] sorry for the false alarm [17:44:42] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [17:50:07] https://bugzilla.wikimedia.org/show_bug.cgi?id=35900 <- we're still getting reports of this [17:50:20] (session problems on enwiki, probably others) [17:54:09] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 193 seconds [17:54:45] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 206 seconds [17:56:00] New patchset: Ottomata; "Setting up virtual host for stats.wikimedia.org on stat1. RT 2162." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5896 [17:56:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5896 [17:57:18] mark or LeslieCarr mutante, review please :) [17:59:09] New patchset: MarkAHershberger; "Bug #36164 ? rewrite rules for ShortURL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5433 [17:59:27] New patchset: MarkAHershberger; ".conf files from noc, with w/s removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [17:59:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5433 [17:59:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [18:00:18] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [18:00:34] New review: MarkAHershberger; "updated bug# and changed rewrite to be more in line with the others as well as switching to /s/ whi..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5433 [18:01:21] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [18:04:58] robla: so, anyway [18:05:10] I can't replicate the problem, so I haven't put in a bug [18:05:28] also, I have another session related issue due to LdapAuth, so I haven't put in a bug [18:05:37] that said, the two aren't related [18:05:46] thanks for the update Ryan_Lane [18:06:21] The thing where memc returns the correct values according to tcpdump but not according to MW is something that I would have to see with my own eyes [18:06:28] when using php sessions on disk on labsconsole my sessions last *much* longer [18:06:32] It seems really strange to me, and it seems like something that wouldn't suddenly become a problem [18:06:56] also, memcache on that system isn't evicting anything [18:07:00] so, I know it isn't that [18:08:16] I wonder if anything has changed in the memc client recently [18:08:27] Has anything changed in memc recently? Like did you guys upgrade or tweak it? [18:08:33] no [18:08:46] I did notice it become unstable last week [18:08:50] +has [18:09:46] New patchset: Ottomata; "Renaming mediawiki_clone to mediawiki::clone." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5897 [18:09:48] hm. maybe I should put sessions back on disk [18:09:53] on labsconsole [18:10:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5897 [18:23:49] mark: here or you're done for the day? [18:27:09] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.11087582677 (gt 8.0) [18:27:45] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [18:28:50] duh oh, packet loss on emery [18:29:06] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:33:20] nimish is running python reportOneCountry.py [18:33:28] taking a buncha cpu [18:35:48] on emery? [18:35:57] binasher: db60 is good to go and db59 is in the process of catching up [18:36:08] let the rotation begin! [18:36:10] I thought only the filters were supposed to run there? [18:36:13] ottomata: ^^ [18:36:22] yeah [18:36:28] dunno [18:36:31] he just IMed [18:36:31] also... trying to point the dump at /root doesn't work. just fyi ;) [18:36:41] tell him to use irc [18:37:12] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.13176515873 [18:37:17] Hah, mw1 went down? [18:38:42] emery monitoring ftw [18:39:25] robla: email notifiction is working correctly now, yes? [18:39:31] notpeter: yes [18:39:34] cool [18:39:34] yup :) [18:39:39] and ottomata is working on it [18:39:53] notpeter: any news about oxygen? [18:40:47] not so far. I'm going to email tim and see if he can review the nginx patch [18:41:03] notpeter: yup [18:41:49] notpeter: thx [18:59:40] notpeter: link on the nginx patch? [19:02:16] on the way to that link [19:02:19] can someone review this for me? [19:02:19] https://gerrit.wikimedia.org/r/#change,5896 [19:03:00] robla: was in an email to the ops team. I can forward to you [19:03:13] ah man! [19:03:16] i don't have sudo on bayes [19:04:32] woosters: may I? 
[19:10:16] ottomata - put in a ticet and i will work on it [19:14:12] New patchset: Pyoungmeister; "giving dario shell on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5907 [19:14:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5907 [19:15:12] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5907 [19:15:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5907 [19:16:10] wooters: done [19:16:18] RT 2866 [19:16:24] http://rt.wikimedia.org/Ticket/Display.html?id=2866 [19:18:11] !log enabled NewUserMessage on labsconsole [19:18:13] Logged the message, Master [19:18:55] !log made LiquidThreads disabled by default on labsconsole, now users must add the special string to a page to enable it there. [19:18:58] Logged the message, Master [19:20:55] !log restarting puppet on db59 [19:20:57] Logged the message, notpeter [19:26:27] woosters: can I get the package "python-dateutil" installed on storage3? [19:29:45] RoanKattouw, are the tcpdump of memcached available somewhere? [19:30:26] I find very strange that our implementation suddenly starts failing [19:39:57] fun pages, what happened with upload [19:48:02] Anyone here with access to gallium (integration.mediawiki.org/testswarm) [19:48:04] the database is fried [19:48:07] testswarm down [19:48:12] Not connected: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) [19:48:26] the db is at the same host? [19:48:43] heck do I know [19:48:45] I guess [19:48:48] :D [19:49:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:25] Not connected: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) [19:49:29] http://integration.mediawiki.org/testswarm/user/mediawiki/ [19:50:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.723 seconds [20:04:40] Platonides: They're not, but Asher will try to reproduce later [20:24:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.968 seconds [20:34:15] New patchset: Ryan Lane; "Change the automount timeout to 2 hours" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5927 [20:34:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5927 [20:51:06] hiii help meh! [20:51:06] https://gerrit.wikimedia.org/r/#change,5896 [20:53:05] poking time. [20:53:14] scrolling down the nickname list, let's see. [20:56:01] ottomata: I can help in the harassment process, but first.... [20:56:07] yesh? [20:56:26] do you know if the nginx code is in git/svn anywhere? [20:56:42] hmmm, [20:56:50] i think…unless i'm thinking of something else, one sec [20:57:25] oh yeah [20:57:44] http://svn.mediawiki.org/viewvc/mediawiki/trunk/debs/nginx/modules/nginx-udplog/ [20:58:40] ok...cool. the patch that Faidon wrote; that only exists in email and other random places, right? 
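The python-dateutil request above is the kind of thing that can go through Puppet rather than a one-off install, along the lines of the sketch below. The class name and the node's fully qualified name are illustrative guesses, not an actual manifest.

    class misc::statistics::packages {
        package { 'python-dateutil':
            ensure => installed,
        }
    }

    # Assumed node name; the real storage3 FQDN may differ.
    node 'storage3.pmtpa.wmnet' {
        include misc::statistics::packages
    }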
[20:58:52] i've only heard it talked about in here [20:58:59] * robla forwards to ottomata [20:59:53] dun [21:00:39] got it [21:01:38] * robla mentally associates "paravoid" with "Faidon", now that he's gone over and asked him [21:02:54] though unfortunately, paravoid is having problems with his monitor right now [21:03:09] haha [21:03:48] where is Leslie? [21:03:50] it should be in svn, I think [21:03:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:49] ottomata: anyway, here's what I think should happen. paravoid should get that checked into svn, ottomata should do an initial review on it (commenting in CodeReview), and then maybe we still have Tim do the final review on it [21:05:06] can do [21:05:15] so, hi [21:05:42] so, svn was desynced from what we actually had in production [21:05:45] so I fixed that yesterday [21:05:56] oh, right, that problem [21:05:56] oh, cool, ah that makes sense [21:05:58] the patch was applying weird [21:06:01] lemme update [21:06:07] hmm [21:06:12] no change? [21:06:16] that is the right place, right? [21:06:17] the deb? [21:06:22] mediawiki/trunk/debs/nginx/modules/nginx-udplog [21:06:22] ? [21:06:29] try mediawiki/trunk/debs/nginx/ [21:06:42] the module was ok, but the diff under debian/patches was not on par [21:06:47] an debian/changelog was two versions behind [21:07:14] so, how does commiting my patch now sounds like? [21:07:28] oh hm [21:07:48] having multiple people review it sounds an overkill tbh [21:07:48] i've never done a code review via svn [21:07:58] there's a tool we use, right? [21:08:03] and we can always try it into one of the servers for a few days, just in case [21:08:11] ottomata: yup...one sec [21:08:24] psst, Ryan_Lane: https://gerrit.wikimedia.org/r/#change,5896 [21:08:28] help meh! [21:08:32] heh [21:08:52] i'd better be careful [21:08:57] i'm going to teach people not to help meh [21:09:00] because if they do I will remember [21:09:03] and bug them more often [21:09:14] meh, you're writting puppet stuff for us [21:09:22] we don't mind being bugged for that [21:09:24] so: good? [21:09:24] yay! [21:09:27] i will write so much puppet [21:09:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.275 seconds [21:10:14] ottomata: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/ [21:10:18] really need to give you a way to test this stuff in labs [21:10:27] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5896 [21:10:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5896 [21:10:34] (a lot harder to find that link now that the sidebar on mediawiki.org points to gerrit) [21:11:31] i'm testing this stuff on my local [21:11:31] would be SO annoying to have to go through gerrit and labs to test it [21:11:31] got a local vm set up with puppet [21:11:31] paravoid: ^^ [21:11:31] heh [21:11:31] daily complain about puppet in labs :) [21:11:38] *complaint [21:13:37] plus locally I can just save and run pupppet [21:13:40] test test test [21:13:44] get it just right [21:13:46] then commit [21:16:11] are you just running individual things? [21:16:15] using puppet apply? [21:16:19] no [21:16:21] puppetd —test [21:16:28] running a local puppetmaster? 
[21:16:29] i have un committed file for my local vm node [21:16:31] yes [21:16:34] * Ryan_Lane nods [21:16:36] oh cool [21:16:46] that's one of the solutions paravoid suggested [21:16:51] someone already did what I had on my todo [21:17:04] well, kind of [21:17:08] yeah :) [21:17:09] that's in his local vm [21:17:12] on his own system [21:17:20] but I wanted to test how that would work first [21:17:24] * Ryan_Lane nods [21:17:25] what did you do with private? [21:17:27] its kinda weird [21:17:28] ottomata: ^ [21:17:31] i had to comment out stuff [21:17:39] i just kept commenting out what I didn't care about [21:17:42] until I could test my node [21:17:48] so I have another puppet clone [21:17:50] that I use for my vm [21:17:53] and a local branch [21:18:14] it isn't ideal, we couldn't distribute it like that [21:18:44] I had an idea how to get rid of the star certificates at least [21:18:46] but it would be awesome to have a packaged up VM for people to download [21:18:55] nah [21:18:59] we'd prefer people to work in labs [21:19:04] yeah, makes sense [21:21:34] !log started deletion script on ms-be4 [21:21:37] Logged the message, Mistress of the network gear. [21:24:30] LeslieCarr: can you roll that back now [21:26:08] preilly: it should be better now [21:26:56] Ryan_Lane: cool [21:27:10] preilly: ok [21:28:11] New patchset: Lcarr; "reverting per preilly Revert "Thursday 10am-11:30am PST: Uganda, Ivory Coast"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5930 [21:28:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5930 [21:31:39] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [21:32:42] hey LeslieCarr [21:32:47] did you write the git::clone define? [21:33:19] ottomata: i don't think so .. i do remember modifying/using it a bit [21:33:32] aye, you are the last on git blame, twas why I asked [21:33:43] want to change something about it, thought i'd ask the author first [21:34:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5930 [21:34:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5930 [21:35:53] ottomata: be bold [21:37:56] hah, ok, unless you know how to make git blame tell me who first added it [21:37:58] i will just do it [21:38:08] always just do it [21:38:17] preilly: done [21:38:20] ha k [21:42:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:42:51] LeslieCarr: thanks! [21:47:06] ottomata, robla: so, svn now has 4 distinct changes over what we have in production [21:47:28] 1) the escaped user agent which was already there and Ryan told me it has been reviewed before [21:47:45] 2) my segfault patch [21:48:28] 3-4) Ubuntu had released in the meantime two updates, one of them a security fix(!) and the other one a serious issue when doing a configuration reload with ipv6 enabled [21:48:48] (the latter is important since nginx will also serve as our ipv6 proxy soon) [21:49:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.933 seconds [21:50:41] the code is built and installed in labs and seems to pass my rudimentary test case (openssl s_client + netcat) [21:50:57] I don't really have any method for testing for e.g. 
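For anyone wanting to reproduce the local testing workflow described here (and confirmed just below): a local clone of operations/puppet served by a local puppetmaster, an uncommitted node definition for the test VM, and repeated puppetd --test runs until the manifest is right. The node name, included class and puppetmaster hostname in this sketch are placeholders.

    # Uncommitted, local-only node definition kept in the local clone.
    node 'puppet-testvm.local' {
        # Pull in only the classes under review; anything that depends on the
        # private repo stays commented out while testing.
        include misc::statistics::site
    }

    # On the test VM, apply against the local puppetmaster and iterate:
    #   puppetd --test --server puppetmaster.local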
memory leaks though (afaik) [21:51:15] I guess a code review + a staged deployment will make sure everything's okay [21:52:41] config reload with ipv6 enabled? [21:52:51] I wonder if this is what's causing reload to fail [21:53:23] we noticed this issue a while back. we just always restart right now [21:53:39] https://bugs.launchpad.net/ubuntu/+source/nginx/+bug/902223 [21:54:09] paravoid, were you able to test with > 2 udp log sockets? [21:54:22] that sounds like the error we're seeing [21:54:34] ottomata: you didn't need > 2 *working* udp log sockets, just defining them made it crash [21:54:47] so, yes, I've put 3 sockets into the config in labs [21:54:49] and it doesn't crash [21:54:54] ok cool [21:55:00] and I've verified that I'm getting log data by listening to one of them [21:55:07] aye cool [21:55:29] now we just need that sequence number thing fixed and that logging module will actually be usable :D [21:55:36] right now the https logs are being ignored [21:55:55] not totally, right? they are still going to locke and emery? [21:55:57] just not oxygen [21:55:59] they go there [21:56:04] but they are filtered out [21:56:13] ah [21:56:16] since there's no way to verify that there's no packet loss [21:57:07] we could go the hackish way short term and append the pid to the hostname [21:57:23] eh? [21:57:34] and what about threads? [21:57:45] I believe it works properly with threads [21:57:51] just not processes [21:57:52] or you mean pid + thread id? [21:58:59] don't the threads share the same memory space? [21:59:09] the global counter should work there [21:59:14] (it does in squid) [22:24:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.396 seconds [22:35:42] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [22:48:36] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [23:03:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:00] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [23:09:36] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [23:12:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.018 seconds [23:13:21] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [23:30:18] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [23:44:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:51:45] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:53:15] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.029 seconds