[01:30:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[02:00:54] PROBLEM - HTTP on holmium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 699 bytes in 0.002 second response time
[02:06:54] RECOVERY - HTTP on holmium is OK: HTTP OK: HTTP/1.1 200 OK - 67882 bytes in 0.003 second response time
[02:09:42] !log LocalisationUpdate completed (1.23wmf18) at 2014-03-23 02:09:42+00:00
[02:09:53] Logged the message, Master
[02:17:16] !log LocalisationUpdate completed (1.23wmf19) at 2014-03-23 02:17:16+00:00
[02:17:22] Logged the message, Master
[02:41:24] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Mar 23 02:41:21 UTC 2014 (duration 41m 20s)
[02:41:30] Logged the message, Master
[04:19:17] (PS1) 01tonythomas: Update Mingle URL in "See Also" field extension [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/120341
[04:31:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[05:05:40] (PS1) Tim Landscheidt: Tools: Remove unused syslog role [operations/puppet] - https://gerrit.wikimedia.org/r/120347
[05:35:28] (PS1) Tim Landscheidt: Tools: Remove obsolete webserver class [operations/puppet] - https://gerrit.wikimedia.org/r/120348
[07:32:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[10:33:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[12:33:53] Hi, I'm in Germany, and *.wikimedia.org and *.wikipedia.org websites aren't loading for me.
[12:34:24] Pings come through OK,
[12:34:52] but the websites stop at "Read bits.wikimedia.org"
[12:35:13] Pinging bits.wikimedia.org works fine.
[12:36:14] works for me
[12:36:19] (also Germany)
[12:36:27] How should I try to diagnose it?
[12:36:34] hi David
[12:37:19] Hi Jyothis
[12:37:36] I tried to look up your nick, but I'm having trouble accessing Wikimedia servers :-/
[12:37:40] slashme: No packet loss?
[12:37:44] None.
[12:37:53] I only pinged with three packets, though.
[12:38:03] Maybe I'll try again with bigger packets, and 10 of them.
[12:38:19] yeah :)
[12:38:25] * LordOfLight waves back
[12:39:11] Oh, that's not working so well.
[12:39:13] david@davidexternal:~$ ping -c10 -s2048 bits.wikimedia.org
[12:39:13] PING bits-lb.esams.wikimedia.org (91.198.174.202) 2048(2076) bytes of data.
[12:39:13] 2056 bytes from bits-lb.esams.wikimedia.org (91.198.174.202): icmp_req=10 ttl=56 time=59.3 ms
[12:39:13] --- bits-lb.esams.wikimedia.org ping statistics ---
[12:39:13] 10 packets transmitted, 1 received, 90% packet loss, time 8999ms
[12:39:14] rtt min/avg/max/mdev = 59.382/59.382/59.382/0.000 ms
[12:39:30] versus
[12:39:30] david@davidexternal:~$ ping -c3 bits.wikimedia.org
[12:39:30] PING bits-lb.esams.wikimedia.org (91.198.174.202) 56(84) bytes of data.
[12:39:30] 64 bytes from bits-lb.esams.wikimedia.org (91.198.174.202): icmp_req=1 ttl=56 time=27.6 ms
[12:39:30] 64 bytes from bits-lb.esams.wikimedia.org (91.198.174.202): icmp_req=2 ttl=56 time=30.1 ms
[12:39:31] 64 bytes from bits-lb.esams.wikimedia.org (91.198.174.202): icmp_req=3 ttl=56 time=30.0 ms
[12:40:08] ... try to traceroute
[12:40:46] It just ends up with lots of lines with * * *
[12:41:20] yeah, that's fine
[12:41:37] After going via 88.79.10.221 92.79.210.53 and 92.79.213.138
[12:42:08] So you only see 3 hops?
[12:42:32] no, first my router and my IP's box: dslb-188-110-168-001.pools.arcor-ip.net
[12:42:38] and then those three.
[12:43:08] and then 25 lines of asterisks.
[12:43:16] nothing more after 92.79?
[12:43:23] No.
[12:43:26] that's an Arcor backbone IP
[12:43:50] Yeah... if you used a custom (big) packet length, you might want to lower it
[12:44:18] In which context?
[12:44:25] traceroute
[12:44:38] No, there I just used the default.
[12:44:48] 60 byte packets
[12:45:54] I guess there's just a server in between that doesn't honour TTL, so traceroute only shows a certain part of the network.
[12:46:05] ok... so it's hard to tell who to blame here, might be that your provider got linking issues and thus those packets don't make it out of its backbone
[12:46:51] possible, but rather uncommon
[12:47:00] wouldnt editing /etc/hosts help here, temporarily? just a blind shot
[12:47:24] Ah, traceroute-t comes through.
[12:47:33] If he can reach eqiad or ulsfo fine, that would fix it, yeah
[12:47:42] I mean, traceroute -T
[12:47:58] david@davidexternal:~$ sudo traceroute -T bits.wikimedia.org
[12:47:58] [sudo] password for david:
[12:47:58] traceroute to bits.wikimedia.org (91.198.174.202), 30 hops max, 60 byte packets
[12:47:58] 1 easy.box (192.168.2.1) 1.272 ms 1.449 ms 4.067 ms
[12:47:58] 2 dslb-188-110-168-001.pools.arcor-ip.net (188.110.168.1) 19.819 ms 23.855 ms 25.215 ms
[12:47:59] 3 88.79.10.221 (88.79.10.221) 26.697 ms 28.697 ms *
[12:47:59] 4 * * *
[12:48:00] 5 * * *
[12:48:00] 6 * * *
[12:48:01] 7 * * *
[12:48:01] 8 * bits-lb.esams.wikimedia.org (91.198.174.202) 30.101 ms 33.897 ms
[12:48:13] ok, now try it with the bigger packets as well
[12:49:48] I just did, and it came through OK, which is strange. Repeating the ping test now.
[12:49:57] david@davidexternal:~$ sudo traceroute -T bits.wikimedia.org 2048
[12:49:57] traceroute to bits.wikimedia.org (91.198.174.202), 30 hops max, 60 byte packets
[12:49:57] 1 easy.box (192.168.2.1) 1.236 ms 1.453 ms 2.108 ms
[12:49:58] 2 dslb-188-110-168-001.pools.arcor-ip.net (188.110.168.1) 21.477 ms 23.147 ms 24.480 ms
[12:49:58] 3 88.79.10.221 (88.79.10.221) 31.085 ms * *
[12:49:58] 4 92.79.210.53 (92.79.210.53) 32.670 ms * *
[12:49:58] 5 * * *
[12:49:59] 6 * * *
[12:49:59] 7 * * *
[12:50:00] 8 * * bits-lb.esams.wikimedia.org (91.198.174.202) 29.762 ms
[12:50:00] david@davidexternal:~$ ping -c10 -s2048 bits.wikimedia.org
[12:50:01] PING bits-lb.esams.wikimedia.org (91.198.174.202) 2048(2076) bytes of data.
[12:50:43] Whoops, just noticed, the size was still 60 bytes. Back to the man page.
[12:54:29] I can't figure out how to change the packet size in traceroute.
[12:54:45] According to the man page, and examples I see on the internet, you just put the size at the end of the command,
[12:54:52] but that isn't working for me.
[12:57:55] slashme: That's because tcp only opens a connection, but isn't sending real data
[12:58:06] either use traceroute -I which probably also works
[12:58:20] or traceroute -N tcpconn which really sends data AFAIR
[12:58:34] tcpconn should be able to work with larger packets
[12:59:05] Away for a bit
[13:00:22] Duh, right.
[13:00:24] Thanks!
[13:03:32] Still, traceroute is coming through OK.
[13:03:42] Anyway, I'll try again in an hour or two.
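Annotation (not part of the original log): the two ping runs above differ in more than size. A 56-byte payload fits in a single IP packet, while a 2048-byte payload has to be fragmented on a typical Ethernet/DSL path. The sketch below only reproduces the arithmetic behind ping's "56(84)" and "2048(2076)" figures and counts the fragments. The 1500-byte MTU and the helper name `ipv4_fragments` are assumptions for illustration, and the log does not establish that fragmentation caused the packet loss.

```python
import math

def ipv4_fragments(icmp_payload, mtu=1500, ip_header=20, icmp_header=8):
    """Return (total IP packet size, number of fragments) for one echo request.

    mtu=1500 is an assumed path MTU; the real value on the user's
    DSL line is not stated anywhere in the log.
    """
    ip_payload = icmp_payload + icmp_header   # ICMP header + data
    total = ip_payload + ip_header            # size ping prints in parentheses
    if total <= mtu:
        return total, 1
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    return total, math.ceil(ip_payload / per_fragment)

small = ipv4_fragments(56)     # default ping size
large = ipv4_fragments(2048)   # ping -s2048
print("default ping:", small)
print("ping -s2048: ", large)
```

Under these assumptions the default ping travels as one 84-byte packet, whereas each `-s2048` ping is split into two fragments, both of which must arrive for a reply to come back.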
[13:03:52] Need to make sure that I stay married ;-]
[13:34:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[15:07:49] The API seems to be handing out bad tokens.
[16:35:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[19:36:55] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[21:45:54] PROBLEM - HTTP on holmium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 699 bytes in 0.001 second response time
[22:31:26] any opens around to poke at blog.wikimedia.org?
[22:37:54] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[22:54:52] ping bblack apergos akosiaris Coren jgage_
[22:55:05] Pong?
[22:55:09] * Coren checks scrollback.
[22:55:15] Hey Coren - blog's down, mind taking a look?
[22:55:23] Sure. Gimme a minute.
[22:56:54] RECOVERY - HTTP on holmium is OK: HTTP OK: HTTP/1.1 200 OK - 67873 bytes in 0.494 second response time
[22:57:04] Kicking apache in the bits woke it up; looking into why now.
[23:20:42] paravoid, mark, if either of you happens to be awake… I'm running into a labs network problem.
[23:21:18] I /think/ that the issue is that any instance outside of 10.68.16.0/24 is shut out
[23:21:31] even though the actual IP range we're using is /21
[23:22:09] Instance IPs just rolled over from 10.68.16 to 10.68.17 and new instances have stopped working
[23:22:37] Eloquence: The logs are completely silent. As far as I can tell, it was just Apache being completely wedged up; but I didn't keep it alive to investigate opting instead to restore service since it had been down for a while.
[23:23:16] Coren: There were a zillion apache2 processes running right before you kicked it. Is that normal? e.g does apache fork for every client?
[23:23:18] Eloquence: We probably will want to investigate more thoroughly if it happens again though. I'll keep an eye on the blog tonight.
[23:23:39] I guess that would count as 'wedged up'
[23:24:00] andrewbogott: It's setup as worker processes, so yes -- but it's the normal symptom of every worker wedging up until the server runs out.
[23:24:22] ok, makes sense.
[23:25:18] btw, coren, ^^ might be of interest to you as well… unless packet filtering is as much a mystery to you as it is to me.
[23:27:14] andrewbogott: It's not, as a rule, but I speak Cisco more than Juniper so I'm not all that comfortable poking around our routers without Faidon or Mark being around. But what you describe sounds like there might be a stray /24 in one of the rules; it seems like a plausible typo to me and wouldn't have been evident until we started running outside that range.
[23:27:34] Yep, it fits the data so far.
[23:28:08] If I want to do a bit of grepping, do you know where those rules live?
[23:28:20] I definitely wouldn't mess with them w/out adult supervision
[23:28:59] ... no. :-( Last time I spoke network with Leslie, she told me that those weren't in source control but rather backed up from the routers. She did mention that it would be "a good thing" to fix that. :-)
[23:29:24] ok then
[23:52:00] Coren, [belatedly], thanks for kicking it back up.
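Annotation (not part of the original log): the "stray /24" hypothesis discussed above can be checked with a little prefix arithmetic. The sketch below uses Python's standard `ipaddress` module to show why an address in 10.68.17.x would fall outside a rule written for 10.68.16.0/24 while still being inside the intended 10.68.16.0/21. The two instance addresses are hypothetical examples, and the real router rules were never shown in the log, so this illustrates the suspected cause only.

```python
import ipaddress

# Ranges taken from the discussion: the suspected typo and the intended range.
suspected_rule = ipaddress.ip_network("10.68.16.0/24")
intended_range = ipaddress.ip_network("10.68.16.0/21")

# Hypothetical instance addresses, one before and one after the rollover.
old_instance = ipaddress.ip_address("10.68.16.5")
new_instance = ipaddress.ip_address("10.68.17.5")

for addr in (old_instance, new_instance):
    print(
        addr,
        "in /24:", addr in suspected_rule,
        "in /21:", addr in intended_range,
    )

# The /21 spans 10.68.16.0 through 10.68.23.255, eight times the /24.
print(intended_range[0], "-", intended_range[-1],
      f"({intended_range.num_addresses} addresses)")
```

If a filter really did carry the /24, every instance created before the rollover would pass it and every later one would be dropped, which matches the symptom reported in the channel.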