[00:02:01] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[00:46:40] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[00:46:53] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[01:13:42] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Sun Mar 17 01:13:30 UTC 2013
[01:14:42] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:32:50] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 193 seconds
[01:32:50] PROBLEM - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 197 seconds
[01:34:51] RECOVERY - MySQL Replication Heartbeat on db71 is OK: OK replication delay 22 seconds
[01:35:53] RECOVERY - MySQL Slave Delay on db71 is OK: OK replication delay 0 seconds
[02:26:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[02:31:16] !log LocalisationUpdate completed (1.21wmf11) at Sun Mar 17 02:31:16 UTC 2013
[02:31:24] Logged the message, Master
[02:33:50] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 230 seconds
[02:34:00] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 237 seconds
[02:38:12] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[02:48:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds
[02:48:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds
[02:51:41] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[02:51:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[03:11:50] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds
[03:12:00] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[03:49:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds
[03:50:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 202 seconds
[04:02:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:05:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.012 second response time
[04:10:00] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 182 seconds
[04:10:00] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 183 seconds
[04:22:00] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[04:22:00] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[04:24:43] Change abandoned: J; "https://gerrit.wikimedia.org/r/#/c/52707/ has all the discussions and the same patch." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44008
[04:33:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 6 seconds
[04:33:52] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[04:38:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:39:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.269 second response time
[05:36:41] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[05:37:50] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms
[05:39:50] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused
[06:51:02] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[06:59:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:02:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.221 second response time
[07:06:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:07:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time
[07:39:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:46:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.402 second response time
[08:01:31] Ryan_Lane: Do you have experience in creating debian packages? (specifically for a ruby gem?) https://bugzilla.wikimedia.org/show_bug.cgi?id=46236
[08:01:39] (or know how in operations may be able to help me with this)
[08:01:50] who*
[08:08:01] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[08:10:01] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours
[08:21:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.745 second response time
[08:38:00] PROBLEM - Puppet freshness on mw1061 is CRITICAL: Puppet has not run in the last 10 hours
[08:55:02] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[09:21:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:21:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:21:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[09:21:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[10:03:01] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[12:27:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[12:33:00] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 193 seconds
[12:33:10] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds
[12:36:00] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[12:36:10] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[12:39:00] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[14:04:11] PROBLEM - Puppet freshness on mc1008 is CRITICAL: Puppet has not run in the last 10 hours
[14:27:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:34:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.147 second response time
[14:39:00] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 13.4757371318 (gt 8.0)
[15:07:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:09:00] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.82499815385 (gt 8.0)
[15:13:01] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.5410048201 (gt 8.0)
[15:19:59] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds
[15:20:31] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 196 seconds
[15:21:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.070 second response time
[15:26:00] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 22 seconds
[15:26:30] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds
[15:28:31] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds
[15:28:32] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 197 seconds
[15:33:29] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[15:33:30] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[15:40:00] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[15:40:33] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds
[15:44:00] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 21 seconds
[15:44:31] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 3 seconds
[15:53:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:04:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.681 second response time
[16:09:01] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.1173771901 (gt 8.0)
[16:11:38] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds
[16:12:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 195 seconds
[16:37:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:39:05] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.07876040984 (gt 8.0)
[16:39:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[16:41:06] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 17 seconds
[16:41:27] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 13 seconds
[16:49:04] @replag
[16:49:06] jeremyb_: [s2] db1009: 1s, db1018: 1s; [s6] db1022: 1s
[16:52:05] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:52:56] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 12.2702961789 (gt 8.0)
[17:09:03] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3917755833 (gt 8.0)
[17:23:31] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Sun Mar 17 17:23:21 UTC 2013
[17:28:31] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 198 seconds
[17:28:31] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 202 seconds
[17:28:39] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: CRIT replication delay 210 seconds
[17:28:50] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 219 seconds
[17:29:02] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 228 seconds
[17:29:02] PROBLEM - MySQL Replication Heartbeat on db1039 is CRITICAL: CRIT replication delay 228 seconds
[17:29:02] PROBLEM - MySQL Replication Heartbeat on db1026 is CRITICAL: CRIT replication delay 232 seconds
[17:45:29] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds
[17:45:30] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds
[17:45:39] RECOVERY - MySQL Replication Heartbeat on db45 is OK: OK replication delay 0 seconds
[17:45:51] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds
[17:45:59] RECOVERY - MySQL Replication Heartbeat on db1039 is OK: OK replication delay 0 seconds
[17:46:00] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 0 seconds
[17:46:00] RECOVERY - MySQL Replication Heartbeat on db1026 is OK: OK replication delay 0 seconds
[17:46:29] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds
[17:46:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds
[18:04:00] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[18:09:01] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[18:39:00] PROBLEM - Puppet freshness on mw1061 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:32] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[18:40:33] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds
[18:44:04] * jeremyb_ wonders if shard prefixes (x, m, s, rdb, etc. have a glossary somewhere)
[18:53:58] where?
[18:55:59] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[18:59:00] PROBLEM - MySQL Slave Delay on db1049 is CRITICAL: CRIT replication delay 643 seconds
[18:59:00] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 640 seconds
[19:07:09] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 29 seconds
[19:08:01] RECOVERY - MySQL Slave Delay on db1049 is OK: OK replication delay 0 seconds
[19:12:52] hi folks, anyone around who might be able to see why oxygen is freaking out right now?
[19:12:54] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=oxygen.wikimedia.org&v=13.3593481395&m=packet_loss_average&r=hour&z=default&jr=&js=&st=1363547247&vl=%25&z=large
[19:21:00] PROBLEM - SSH on lvs6 is CRITICAL: Server answer:
[19:21:17] this is interesting: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=oxygen.wikimedia.org&m=cpu_report&r=day&s=by%20name&hc=4&mc=2&st=1363547967&g=mem_report&z=large&c=Miscellaneous%20eqiad
[19:22:00] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:22:01] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:22:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[19:22:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[19:22:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:22:04] kraigparkinson: this graph here in particular: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&h=oxygen.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Miscellaneous+eqiad
[19:22:53] seems to correlate with this graph: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Miscellaneous+eqiad&h=oxygen.wikimedia.org&v=13.0938467692&m=packet_loss_average&jr=&js=&vl=%25
[19:23:54] implication is that we should see this as a daily event, and that we should be ready to pounce when it happens tomorrow
[19:24:18] ah, interesting.
[19:25:18] something happening daily Sunday through Thursday our time
[19:25:41] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=oxygen.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Miscellaneous+eqiad
[19:27:18] seems like roughly six hours at a time, right?
[19:27:27] (just making sure I'm reading this properly)
[19:28:00] yeah, that's about right
[19:28:24] one easier way to read the graphs is to go here: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=oxygen.wikimedia.org&m=cpu_report&r=day&s=by%20name&hc=4&mc=2&st=1363547967&g=mem_report&z=large&c=Miscellaneous%20eqiad
[19:28:35] ....then click "inspect" on the one you want to look at
[19:29:27] you get a Javascript/Canvas rendered graph of some sort (I don't remember which one it uses....I think Flot or jqplot)
[19:30:35] this is all pacific time, right?
[19:30:39] UTC
[19:31:24] many opsen get cranky when you refer to times other than UTC :-)
[19:32:18] actually, it's probably easier to find an opsen by trolling them on timezones than it is by asking for help ;-)
[19:32:44] lol
[19:35:49] anyway, on this issue, looks like it's a daily occurrence that we can probably catch tomorrow. today looks more severe than normal to me, but assuming this one subsides shortly, we can deal with it during normal business hours
[19:36:14] also, some metrics are dead: http://gdash.wikimedia.org/dashboards/indexpager/
[19:36:29] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds
[19:36:30] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds
[19:36:31] should be due to come down soon given the previous duration and the time if I'm reading it correctly.
[19:36:34] related?
[19:38:49] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:39:42] MaxSem: I'm not sure how that stuff is working, so I can't say
[19:40:05] it also uses UDP so...
[19:41:17] hmmmm.....could be a problem with the multicast gateway maybe (maybe?)
[19:42:06] nah, the memory usage on oxygen seems to imply it's resource bound somewhere on that machine
[19:59:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:04:09] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[20:05:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.984 second response time
[20:06:50] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[20:07:02] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 20.0615116949 (gt 8.0)
[20:37:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:39:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.238 second response time
[21:07:09] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.48789547009 (gt 8.0)
[21:14:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:14:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.449 second response time
[21:21:08] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 15.9764174138 (gt 8.0)
[21:26:34] anyone around that can help look into the packetloss issues on locke? can't seem to find ottomata
[21:29:44] hi kraigparkinson, it looks like the locke packet loss is headed down. we must have slightly higher traffic than normal today
[21:30:31] just saw the warnings coming up repeatedly now; got me worried.
[21:30:51] we've gotta figure out what's going on with locke in the coming days, but I strongly suspect it's one of the Fundraising filters, since they've got it cranked way up
[21:31:13] actually, Fundraising plus the fact we haven't moved anything else off that machine
[21:31:51] the solution I think Diederik and Andrew are pushing is to just replace locke with modern hardware, since it's pretty old hardware
[21:33:13] not much we can do in the very short term but disable filteres
[21:34:42] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds
[21:34:43] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds
[21:34:52] even though the loss is about as bad on locke as it is on oxygen, I'm more worried about oxygen because that's more anomalous
[21:36:24] bbiab
[21:41:13] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.29704848921
[21:43:05] cool, thanks for the context robla
[21:48:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:51:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.209 second response time
[21:54:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:04] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.6458436879 (gt 8.0)
[21:59:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[22:06:12] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:09:12] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 7.302 second response time
[22:11:11] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.74624251799
[22:17:10] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 3.08066815385
[22:28:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[22:30:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:40:01] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[22:45:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[23:01:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[23:02:39] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[23:14:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 27 seconds
[23:14:53] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 22 seconds
[23:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:16:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds
[23:17:10] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds
[23:17:59] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 200 seconds
[23:18:12] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 201 seconds
[23:20:00] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 28 seconds
[23:20:13] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds
[23:24:28] robla: I don't know if that's the cause or not, but looking at /var/log/udp2log i see quite a lot of this: terminate called after throwing an instance of 'libc_error' what(): Socket::Bind: Address already in use
that's the cause or not, but looking at /var/log/udp2log i see quite a lot of this: terminate called after throwing an instance of 'libc_error' what(): Socket::Bind: Address already in use [23:25:24] the log file is written to by multiple processes and the start-up log lines are not timestamped, which makes it hard to pinpoint when this is happening the most [23:26:25] ori-l: that's on locke or oxygen? [23:26:28] err, multiple processes or threads -- at any rate the timestamps are not monotonic [23:26:30] oxygen [23:27:08] it then restarts multiple times rapidly while trying to bind the port [23:28:06] the error is often adjacent (right before or right after) Opened pipe with factor 1: /bin/grep 'SearchResults:' >> /a/log/lucene/lucene.log [23:30:01] it could be a red herring, actually, since that's a separate udp2log instance listening on another port (51234) [23:30:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.853 second response time [23:31:12] ori-l: possibly. I think I'm going to rely on Andrew and crew looking at this tomorrow as it (presumably) starts happening again [23:32:31] robla: ok -- keep me posted if you find out. oxygen relays eventlogging stuff from europe to vanadium so it affects that too :/ [23:33:06] okee doke. [23:33:12] have a good rest of the day [23:34:39] you too! [23:34:48] thanks for poking at this! [23:35:40] not that anything useful came out of it, but sure :) [23:39:28] ori-l: actually, one thing I'd like to ask about. was there anything that changed between Thursday and today that would increase traffic to oxygen? [23:40:37] with EventLogging? no. I'll look at the server admin log and see if there's anything else that looks relevant. [23:40:49] k...thanks [23:43:38] * robla goes afk for a bit [23:47:20] robla: nothing unusual that I could see [23:48:11] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [23:48:52] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 191 seconds [23:58:17] robla: i think it's just not keeping up with peak (but normal) load [23:58:19] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&dg=1&tab=m&vn=&hreg%5B%5D=oxygen&mreg%5B%5D=packet_loss_average%7Cbytes_in%7Ccpu_user [23:59:10] cpu kind of flattens at 20%. it has 8 cores, so that's probably one core (12%) at 100% and the others (7 x 1%) mostly idling
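
A quick back-of-the-envelope check of the reading in the final message above: on an 8-core host, one core pegged at 100% only contributes 1/8 = 12.5 points to the host-wide cpu_user average, so a plateau near 20% is consistent with a single saturated udp2log pipeline plus roughly one point from each of the other seven cores. The minimal Python sketch below only reproduces that arithmetic; it assumes the Ganglia cpu_user figure is averaged over all cores (an assumption, not something stated in the log), and the numbers are illustrative, not measurements from oxygen.

    # One saturated core on an 8-core host shows up as 1/8 = 12.5 points
    # in the host-wide cpu_user average.
    NUM_CORES = 8
    pegged_core = 100.0 / NUM_CORES        # contribution of the one busy core: 12.5 points
    background = (NUM_CORES - 1) * 1.0     # ~1 point each from the other 7 cores, per the log's estimate
    print(f"expected host-wide cpu_user ~= {pegged_core + background:.1f}%")  # ~19.5%, close to the observed ~20% plateau

In other words, the flat ~20% line does not mean the machine has plenty of headroom; it is what a single-threaded, CPU-bound process looks like when the metric is averaged across eight cores.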