[00:12:24] hi tim
[00:12:32] hi, I just got your text message
[00:13:16] asher's last assessment is here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-March/059304.html
[00:16:55] there's not much to see in the graphs
[00:23:14] I see asher is finding a rise in the 99th percentile service time, not in the average
[00:23:41] the reports sound more like a network issue to me
[00:23:59] this is the mobilefrontend domparse spike: https://graphite.wikimedia.org/render/?title=Top%2010%20MobileFrontend%20Methods%20by%20Max%20Average%20Time%20(ms)%20log(2)%20-1week&from=-1week&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(substr(highestMax(ExtMobileFrontend.*.tavg,10),1,2))
[00:24:38] usually when some people say wikipedia is slow and some people say it is fast, it is because of packet loss
[00:25:00] that 99th percentile is for large article parses, not for any old page view
[00:25:48] from a user point of view, the page stalls in the middle. which I can pretty reproducibly get on large pages like http://en.wikipedia.org/wiki/Wikipedia:VPL
[00:26:21] yeah, that's pretty typical of a network issue
[00:26:36] it can't be on the backend side of squid because squid buffers the whole page
[00:27:22] has Leslie looked at this?
[00:27:49] she's the best person for network stuff in a US timezone, right?
[00:28:11] not yet. I'll ping her
[00:28:15] maplebed is around as well
[00:28:42] (I am, however I have to leave momentarily to cook dinner for my parents)
[00:31:45] Eloquence: I get WP:VPL in 5.4 seconds with curl
[00:32:17] can you PM me your IP so I can ping it?
[00:37:10] doing some more analysis in chrome console .. looks like onload only triggers at about 15 secs, but page is fully loaded at about 7.3. logged out pageview renders fast.
[00:37:22] so let me turn off all unusual stuff on my account and see if I can still get it.
[00:39:03] the traceroute via he.net looks weird
[00:39:43] http://paste.tstarling.com/p/jdTvso.html
[00:40:15] LeslieCarr: what do you make of this traceroute out of eqiad: http://paste.tstarling.com/p/jdTvso.html
[00:40:24] hey
[00:40:57] the situation is that some people say wikipedia is slow, with pages loading halfway and then stalling, and some say it's fast
[00:41:05] is that going out and into pmtpa ?
[00:41:11] which to me usually indicates that one of our upstream providers is losing packets
[00:41:44] where are you tracing to ?
[00:41:52] that's from lvs1001 to Eloquence's home IP, I will PM it
[00:41:56] ok
[00:42:25] the 10.0 is messing me up...
[00:42:36] fwiw, out of a bunch of curls to en:VPL, they take from 2 to 30 seconds (from my home)
[00:42:44] anyways, first let me do a quick checkout of the state of the routers
[00:43:21] hrm
[00:43:37] ok, there's something weird going on
[00:43:43] on cr2
[00:43:56] looks like all the interface bundles went up (they didn't go down before)
[00:44:33] as an example, on one pageload, it spent about 5.5 secs in waiting state and 4.4 secs in receiving state for a 44K resourceloader request
[00:44:38] it's stopped now but a bit concerning
[00:44:39] hrm
[00:44:53] it's not always resourceloader .. gets stuck at different points
[00:46:50] It felt like network lag to me.
[00:48:44] it could be some random lag, i am definitely concerned about these messages i'm seeing on cr1/2 which could indicate some net issue ...
[00:48:51] give me about 5 more minutes to inspect all the interfaces ?
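A note on the measurement being described in the curls above: what separates packet loss from a generally slow backend is the spread of repeated fetch times, not their average. A rough Python sketch of that check, standard library only (the URL, repeat count and User-Agent header are illustrative, not the exact commands run here):

    import time
    import urllib.request

    # Example target from the discussion above; any large page works.
    URL = "http://en.wikipedia.org/wiki/Wikipedia:VPL"

    def fetch_seconds(url, timeout=60.0):
        """Time one full download of the page body."""
        req = urllib.request.Request(url, headers={"User-Agent": "latency-probe/0.1 (example)"})
        start = time.monotonic()
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
        return time.monotonic() - start

    samples = sorted(fetch_seconds(URL) for _ in range(10))
    print("min %.1fs  median %.1fs  max %.1fs"
          % (samples[0], samples[len(samples) // 2], samples[-1]))
    # A clean path gives a tight spread; a lossy link shows up as a long tail
    # (a few fetches taking many times the median), not as a higher average.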
[00:49:23] oh
[00:49:27] there's definitely a bad link here
[00:49:49] cr2 to asw-a - admin'ing that off
[00:50:02] here's an example of a slow direct curl request to bits: http://pastebin.com/EPexqPF9
[00:50:12] and switching vrrp to cr1 mastership
[00:51:50] okay, can you try again ?
[00:51:57] lots of curl'ing and clicking around ?
[00:52:07] sure
[00:52:25] seems ok to me now
[00:53:16] consistent curl times: http://paste.tstarling.com/p/zKgfbM.html
[00:53:48] previously I got a couple of connections that seemed to get throttled way back, they went for over a minute before I killed them
[00:54:08] same here, consistent now at least for the resourceloader request
[00:54:12] dropped packets due to l2 errors :(
[00:54:32] when rob is in on monday we can troubleshoot the link - lemme double check i don't see any other big errors on other links
[00:54:50] but, at least eqiad is designed with redundancy :)
[00:55:47] thanks Leslie
[00:56:48] ok, i think everything else is ok
[00:56:56] asher looked at it and dug up some 99th percentile backend service time graphs that showed a slight rise two days ago, and concluded that it must have been my fault ;)
[00:57:57] heh
[00:58:03] hehehe
[00:58:07] well, i'm not saying it's not ;)
[00:58:27] when you get a chance you might want to educate him about what network issues usually look like
[00:58:35] okay
[00:58:43] when he gets back from malaysia
[00:58:47] * LeslieCarr is jealous
[00:58:50] all the other links look good
[00:58:51] thanks leslie, looking at a few page loads I'm not able to reproduce the stalling at this point.
[00:59:23] !log admin down asw-a-eqiad xe-1/1/2 and cr2-eqiad xe-5/0/0 due to framing errors causing packet loss and sporadic lacp timeouts. source of the issue
[00:59:27] Logged the message, Mistress of the network gear.
[00:59:31] i'll write up the thread saying what was wrong
[00:59:36] LeslieCarr: was this affecting purely external connections?
[01:00:05] it would be anything which happened to be on row a of ashburn and had the bad luck to go over that one link
[01:00:23] so maybe up to 1/4 of all packets were at risk, sourced from row a ?
[01:00:37] hmm, we've gotten a few messages that aluminium couldn't connect to db1008
[01:00:48] obviously not 1/4 of packets being dropped
[01:00:56] was it load balanced by destination port or something?
[01:00:57] hrm, checking which row that's on
[01:01:03] extremely sporadic though
[01:01:08] source/dest port/ip
[01:01:16] that would make sense
[01:02:30] aluminium and db1008 are in the same row, so this issue shouldn't have affected traffic between those two
[01:02:45] all of that traffic should have gone over the switch "backplane"
[01:03:09] okay, thanks for checking. we'll continue to sit and hope it goes away, at least until monday
[01:03:31] if there's nothing else you need me for, I'm going to go enjoy the rest of my sunday
[01:03:41] want me to check it out pgehres ?
[01:03:46] have a nice sunday tim !
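On the load-balancing question above (01:00-01:01): aggregated links typically pick a member link per flow by hashing source/destination IP and port, so only the flows that happened to hash onto the faulty cr2-to-asw-a member saw the loss, while everything else crossed the bundle untouched. A minimal Python sketch of that idea, using a made-up hash and a hypothetical four-link bundle rather than the routers' actual algorithm:

    import hashlib

    # Hypothetical member links of a 4-way bundle; real hardware uses its own hash.
    LINKS = ["member-0", "member-1", "member-2", "member-3"]

    def pick_link(src_ip, dst_ip, src_port, dst_port):
        """Map one flow onto one member link, so all its packets take the same path."""
        key = "%s|%s|%d|%d" % (src_ip, dst_ip, src_port, dst_port)
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return LINKS[digest % len(LINKS)]

    # Placeholder addresses: two connections between the same pair of hosts can land
    # on different member links, which is why the loss looked "extremely sporadic"
    # instead of affecting every packet from row a.
    print(pick_link("10.64.0.21", "10.64.16.11", 51712, 3306))
    print(pick_link("10.64.0.21", "10.64.16.11", 51713, 3306))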
[01:03:46] thanks tim :)
[01:03:59] bye
[01:04:27] LeslieCarr: if you've nothing better to do, we've only had 4 alerts over the last 2 days
[01:04:36] for something that runs every 3 minutes
[01:04:42] i'll give it a quick checkout, see about iperf
[01:09:57] LeslieCarr: whatever you did seems to have fixed the reporting from the misc servers in eqiad to ganglia: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=load_one&s=by+name&c=Miscellaneous+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 now looks okay in the 20m view
[01:11:13] oh man
[01:11:27] i can't believe we didn't see that :( fail
[01:11:59] yeah, it's strikingly obvious if one looks
[01:13:43] so right now i'm not seeing any packet loss between aluminium and db1008
[01:14:16] and seeing good bandwidth results on a tcp test
[01:14:18] okay, thanks for looking. will let you know if it kicks back up again
[01:14:23] thanks
[01:14:45] idk if you've heard this one, but I thought of you when I saw it: "If you're ever lost in the woods, bury a length of fiber in the ground and wait for the backhoe to show up"
[01:14:59] hehehe
[01:15:09] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 748 MB (3% inode=92%):
[01:15:10] damn, i wish i had heard that before the conference this week :)
[01:15:14] would have been a good one
[01:18:28] okay, i will keep checking my email to see if any more issues are reported
[01:18:42] same here, have a good rest of the weekend
[01:18:45] bye for now :)
[03:25:12] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 106 MB (1% inode=61%): /var/lib/ureadahead/debugfs 106 MB (1% inode=61%):
[03:40:40] RECOVERY - Disk space on srv224 is OK: DISK OK
[04:03:28] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[04:05:25] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:28] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:28] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:28] RECOVERY - Disk space on stafford is OK: DISK OK
[04:35:17] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4009 MB (3% inode=99%):
[04:35:17] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4009 MB (3% inode=99%):
[04:56:26] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4277 MB (3% inode=99%):
[04:58:41] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4033 MB (3% inode=99%):
[06:02:34] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4012 MB (3% inode=99%):
[06:02:34] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4013 MB (3% inode=99%):
[06:15:19] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4013 MB (3% inode=99%):
[06:37:53] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4166 MB (3% inode=99%):
[06:55:52] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[07:21:31] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[07:26:29] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3815 MB (3% inode=99%):
[07:36:59] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3914 MB (3% inode=99%):
[07:59:56] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[11:40:02] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:41:59] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:28:41] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[12:44:26] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: commonswiki (11362)
[12:53:56] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time
[13:25:35] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[13:29:11] PROBLEM - Puppet freshness on mw19 is CRITICAL: Puppet has not run in the last 10 hours
[13:31:17] PROBLEM - Puppet freshness on ssl2 is CRITICAL: Puppet has not run in the last 10 hours
[13:32:11] PROBLEM - Puppet freshness on mw33 is CRITICAL: Puppet has not run in the last 10 hours
[13:35:11] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours
[13:37:17] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[13:42:14] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[13:44:11] PROBLEM - Puppet freshness on mw59 is CRITICAL: Puppet has not run in the last 10 hours
[13:46:17] PROBLEM - Puppet freshness on mw1073 is CRITICAL: Puppet has not run in the last 10 hours
[13:49:17] PROBLEM - Puppet freshness on mw45 is CRITICAL: Puppet has not run in the last 10 hours
[13:55:17] PROBLEM - Puppet freshness on mw30 is CRITICAL: Puppet has not run in the last 10 hours
[13:55:17] PROBLEM - Puppet freshness on mw72 is CRITICAL: Puppet has not run in the last 10 hours
[13:55:17] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[14:07:16] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[14:07:16] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[14:17:37] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[14:17:37] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[14:56:10] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 14.0073177586 (gt 8.0)
[15:02:28] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 1.96235805085
[15:35:48] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.15920394737 (gt 8.0)
[15:42:06] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.6930475
[16:07:41] !log updated firmware successfully on ps1-a8-eqiad, if it has observium alarms now then there are bigger issues.
[16:07:46] Logged the message, RobH
[16:09:18] !log updating firmware on ps1-s1-eqiad and ps1-b1-sdtpa
[16:09:21] Logged the message, RobH
[16:13:59] !log ps1-a1-eqiad firmware updated successfully
[16:14:02] Logged the message, RobH
[16:14:26] !log ps1-b1-sdtpa firmware updated successfully
[16:14:30] Logged the message, RobH
[16:15:59] !log updating firmware on ps1-a1-sdtpa
[16:16:02] Logged the message, RobH
[16:18:24] PROBLEM - Host ps1-a1-sdtpa is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2060.24 ms
[16:21:06] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.95 ms
[16:22:09] !log ps1-a1-sdtpa firmware update complete
[16:22:12] Logged the message, RobH
[16:26:30] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.88808470085 (gt 8.0)
[16:32:48] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.604784869565
[16:45:04] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[17:23:01] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[18:01:32] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[19:06:29] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds
[19:07:23] RECOVERY - MySQL Slave Delay on db52 is OK: OK replication delay 0 seconds
[20:12:56] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 2303.25 ms
[20:13:11] !log firmware updated on all power strips in row a eqiad.
[20:13:15] Logged the message, RobH
[20:14:28] !log to fellow ops, you can disregard those observium errors, as I caused them
[20:14:32] Logged the message, RobH
[20:14:44] RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 29.23 ms
[21:33:47] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 281 MB (3% inode=61%): /var/lib/ureadahead/debugfs 281 MB (3% inode=61%):
[21:46:23] RECOVERY - Disk space on srv221 is OK: DISK OK
[21:55:50] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[22:16:33] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time
[22:18:57] !log firmware updates on servertechs in row b eqiad, disregard alarms
[22:19:01] Logged the message, RobH
[22:24:12] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 2238.17 ms
[22:26:18] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 29.42 ms
[22:26:51] !log row b servertech firmware in eqiad all updated, should clear alarms as they come back online
[22:26:54] Logged the message, RobH
[23:31:15] PROBLEM - Puppet freshness on mw19 is CRITICAL: Puppet has not run in the last 10 hours
[23:35:22] PROBLEM - Puppet freshness on mw33 is CRITICAL: Puppet has not run in the last 10 hours
[23:35:22] PROBLEM - Puppet freshness on ssl2 is CRITICAL: Puppet has not run in the last 10 hours
[23:36:34] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours
[23:38:40] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[23:43:46] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[23:45:34] PROBLEM - Puppet freshness on mw59 is CRITICAL: Puppet has not run in the last 10 hours
[23:47:40] PROBLEM - Puppet freshness on mw1073 is CRITICAL: Puppet has not run in the last 10 hours
[23:50:40] PROBLEM - Puppet freshness on mw45 is CRITICAL: Puppet has not run in the last 10 hours
[23:56:40] PROBLEM - Puppet freshness on mw30 is CRITICAL: Puppet has not run in the last 10 hours
[23:56:40] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[23:56:40] PROBLEM - Puppet freshness on mw72 is CRITICAL: Puppet has not run in the last 10 hours
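For reference, the quick bandwidth check Leslie ran between aluminium and db1008 (the iperf/tcp test mentioned around 01:04-01:14 above) can also be improvised with a short Python script run on each end; the port, hostnames and command-line handling below are placeholders and not the actual test that was performed:

    import socket
    import sys
    import time

    PORT = 5201            # placeholder port (iperf's default); any free port works
    CHUNK = b"\0" * 65536  # 64 KiB per send
    SECONDS = 5            # how long the sender streams for

    def serve():
        """Run on one host: accept a single connection and discard whatever arrives."""
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                while conn.recv(len(CHUNK)):
                    pass

    def send(host):
        """Run on the other host: stream data for a few seconds and report a rough rate."""
        with socket.create_connection((host, PORT)) as sock:
            sent, start = 0, time.monotonic()
            while time.monotonic() - start < SECONDS:
                sock.sendall(CHUNK)
                sent += len(CHUNK)
        print("%.1f Mbit/s" % (sent * 8 / SECONDS / 1e6))

    if __name__ == "__main__":
        # usage: python tcptest.py server         (on the receiving host)
        #        python tcptest.py <receiver-ip>  (on the sending host)
        serve() if sys.argv[1] == "server" else send(sys.argv[1])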