[00:38:42] New patchset: Platonides; "Enhance account throttling" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12185 [00:47:29] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [00:50:20] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [01:04:06] mark: pls take a look on toolserver channel - it's been flooded by tsnag since the restart. i assume some network issues from what it writes. thanks [01:42:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 288 seconds [01:44:01] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 373 seconds [01:48:04] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 616s [01:49:34] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:51:49] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [01:52:34] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [01:52:43] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 10 seconds [01:56:07] New review: Helder.wiki; "(no comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9136 [03:04:05] New review: Krinkle; "We should make the bots smarter, and spread out to relevant channels. But I don't think a bot channe..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/12388 [03:37:08] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [03:45:05] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [04:18:02] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [04:56:46] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [04:57:31] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [05:01:07] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [05:11:10] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [05:19:53] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [05:22:53] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [08:15:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:21:22] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 221 seconds [08:21:22] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 221 seconds [09:15:04] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 5 seconds [09:15:22] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 23 seconds [10:01:41] mark: ping? [10:13:07] @replag [10:13:09] Krinkle: [s2] db53: 1s, db54: 1s, db57: 1s [10:14:14] @externals update [10:14:36]

Your branch is behind 'origin/master' by 187 commits, and can be fast-forwarded. [10:14:37] wow [10:14:43] Krinkle: Successfully updated externals! [10:14:49] @externals [10:14:49] Krinkle: [operations/mediawiki-config.git] Checked out HEAD: d9439d6473d5469d0e27e5c11248947af1e18007 - https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=commit;h=d9439d6473d5469d0e27e5c11248947af1e18007 [10:15:38] ah, ts quota blocked the crontab too [10:15:58] anyway, up to date now [10:48:36] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [10:51:36] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [11:52:32] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [12:41:57] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:43:27] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:51:06] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 181 seconds [12:51:42] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 214 seconds [12:53:57] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [12:54:33] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 4 seconds [13:20:57] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [13:21:06] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [13:21:06] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:06] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:06] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:06] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:15] PROBLEM - SSH on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:15] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:15] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:15] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:15] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:16] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [13:21:16] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:24] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:24] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:24] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:24] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:24] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:25] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:33] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:33] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:33] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:33] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:33] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:34] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.38:11000 (timeout) [13:21:42] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:42] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:42] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:42] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:42] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:43] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:43] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:44] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:44] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:45] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:51] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:51] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:51] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:52] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:52] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:52] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:52] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:53] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [13:22:00] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:00] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:00] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:00] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:00] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:01] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:01] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:02] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:02] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:09] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:09] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:09] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:09] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:09] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:10] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:10] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:11] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:11] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:12] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:12] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:13] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:13] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:14] PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:14] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:18] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:18] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:18] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:18] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:18] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:19] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:19] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:20] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:20] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:27] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:27] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:28] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:28] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:28] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:28] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:28] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:29] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:29] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:30] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:30] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:31] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:31] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:32] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:32] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:33] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:36] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:46] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:54] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:00] Reedy: ^ [13:23:03] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [13:23:03] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:21] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:21] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:21] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:30] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:30] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:30] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:30] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:30] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:33] the wikis are unavailable, too [13:23:39] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:48] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:48] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:48] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:57] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [13:24:02] Platonides: yeah [13:24:21] ssearch is going throug hthe roof [13:24:24] and it's across all clusters [13:24:29] srv288 [13:24:32] well I should say "went through" [13:25:30] looks like the mc issue again.. [13:25:44] I wonder how one of these can take us out [13:25:50] let's see if that's really true first [13:26:01] srv288 is swapping and loads of wait cpu [13:26:10] yeah, I'm going to have a look at it [13:26:30] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:30] Do you want to swap to another spare? [13:26:47] thankgod the new mc boxes are in the process of coming online [13:26:55] no, I am just going to powercycle it I think [13:27:33] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:49] hmm I have a login prompt (and the tail end of some message form the console but only the timestamp) [13:28:02] don't think it's going to let me get on the box, too bad [13:28:51] apergos: reboot? [13:28:54] !log powrcycling srv288, swap death etc, some message to mgmt console but only the timestamp so couldn't see the issue, also couldn't get past the login prompt [13:29:00] yes, that's what I am doing [13:29:00] Logged the message, Master [13:29:12] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:29:23] in which ganglia category are the memcached? [13:29:50] I guess they are app servers [13:29:50] they aren't seperated [13:29:54] https://noc.wikimedia.org/conf/highlight.php?file=mc.php [13:29:59] they should be soon [13:30:15] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:25] *sigh* [13:30:28] yep, srv288 is there [13:30:36] Last heartbeat 117s ago [13:30:50] login timed out, that's the last message I got from the console, it's rebooting now [13:31:28] could be some DoS [13:31:44] I find very strange that it is happning so often now [13:32:05] it could be just someone crawling us with some expensive queries [13:32:16] my bet is on some runaway search queries [13:32:21] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60343 bytes in 5.272 seconds [13:32:39] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:46] and srv288 is back up [13:32:48] let's see [13:32:57] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:02] seems ok now [13:33:06] RECOVERY - SSH on srv288 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:33:06] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39007 bytes in 0.214 seconds [13:33:06] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.494 second response time [13:33:06] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.743 second response time [13:33:06] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.877 second response time [13:33:07] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.681 second response time [13:33:14] gotta let things settle for a couple minutes [13:33:15] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.973 second response time [13:33:15] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.833 second response time [13:33:15] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [13:33:15] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [13:33:15] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.084 second response time [13:33:16] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.081 second response time [13:33:16] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.676 second response time [13:33:17] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.841 second response time [13:33:19] probably, there's a big bump in Search pmtpa which started at 13:10 [13:33:19] apergos: do you actually need some apache up to read the log? they are being logged on another machine anyway [13:33:21] and there they come [13:33:22] using udp2log [13:33:24] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.372 second response time [13:33:24] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [13:33:24] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.434 second response time [13:33:24] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46919 bytes in 9.540 seconds [13:33:24] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.444 second response time [13:33:33] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:33:33] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [13:33:33] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60160 bytes in 0.195 seconds [13:33:33] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [13:33:33] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.900 second response time [13:33:34] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [13:33:34] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.107 second response time [13:33:35] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [13:33:35] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.507 second response time [13:33:43] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [13:33:43] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:33:43] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [13:33:43] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:33:50] I didn't bother to look at the log, priority was getting the site up [13:33:51] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [13:33:51] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [13:33:51] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [13:33:51] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [13:33:51] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:33:52] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [13:33:52] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [13:33:53] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [13:33:53] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:33:54] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [13:33:54] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [13:33:55] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:33:55] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:33:56] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [13:33:56] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [13:33:57] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [13:33:57] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60151 bytes in 0.215 seconds [13:33:58] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60913 bytes in 0.225 seconds [13:33:58] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.602 second response time [13:33:59] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39199 bytes in 1.041 seconds [13:33:59] it was obvious which host was out to lunch, so... [13:34:00] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:34:00] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [13:34:00] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [13:34:01] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:34:01] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [13:34:02] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [13:34:02] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [13:34:03] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [13:34:03] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [13:34:04] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:34:04] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:34:05] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [13:34:05] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [13:34:06] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:34:06] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60350 bytes in 0.915 seconds [13:34:07] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47112 bytes in 0.943 seconds [13:34:11] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:34:11] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [13:34:11] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:34:11] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:34:11] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [13:34:11] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:34:11] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:34:11] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:34:11] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [13:34:16] who thought putting bot output in a work channel was a good idea again? [13:34:18] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [13:34:18] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:34:18] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [13:34:18] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [13:34:18] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [13:34:23] What's srv288 and how its outage caused so many Apaches to hang? [13:34:27] thank god it's gone [13:34:33] Memcached? [13:34:40] hey maplebed [13:34:51] vvv: seemingly, yes [13:34:54] looks like we're back in business again [13:35:15] that means I have perfect timing. [13:35:58] apche and job runner too [13:36:03] Quitted with Excess flood? [13:36:07] yes. [13:36:12] is that the problem Freenode givin' us? [13:36:16] A freenode server killed it [13:36:29] yes maplebed, it would have been even more perfect if you hadn't come on (since I think things are ok now) [13:36:34] so explains the complains about it on lists.wm.o [13:36:34] can we bribe asher to work the weekends? [13:36:39] it's not a problem at all [13:36:47] I'm fine with the bot being flooded out for things like this [13:36:48] Hydriz, that was a separate issue [13:37:10] its about bots on IRC anyway :P [13:37:29] what does search use memcached for? [13:37:48] Hydriz? wut? I thought it was about other non-wikimedia wikis trying to use freenode [13:37:59] apergos: which server was it that you had to kick to clear the problem? [13:38:09] not sure, there are like 2-3 threads if I am not wrong [13:38:18] and happen to see it somewhere [13:38:36] srv288 [13:39:30] searchidx2.pmtpa.wmnet went from 0 to 80MB/s network usage 10 minutes before the outage began [13:39:44] syslog is void of anything useful [13:39:52] I am sure that dmesg had good things in it but it's gone now [13:40:08] It's always different server? [13:40:13] can you look at what searchidx2 was doing? [13:40:41] let me finish poking around srv228 first [13:40:49] then I'll look at search reqs, they were next on the list [13:42:11] apergos: what's the signal that tells you which server needs kicking? [13:42:32] ganglia showing it exremely unhappy [13:42:43] ssh in fails [13:42:49] login on console times out [13:43:01] why doesn't Wikipedia move to akamai? [13:43:25] Why would it? [13:43:31] Yay, proprietory 3rd parties [13:43:55] it's cheaper and more realiable [13:44:02] *reliable [13:44:15] Don't tell me it's clouds [13:44:21] Aren't they really just a CDN? [13:44:46] there's no reason memcached would need to swap [13:44:47] it's purpose is precisely that it would start forgetting data when full [13:44:57] without needing to manually clean their minds with a reboot :P [13:45:01] ah. so scanning http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 makes it clear. [13:45:13] Platonides: there's also an Apache and a job runner there [13:45:25] at least I don't think a single memcached overloading can bring the whole akamai platform down ;) [13:45:42] I thought memcache servers stopped sharing boxes with apaches time ago [13:45:48] no [13:45:48] Vito: do you even have an idea how does it happen? [13:45:52] Vito: the actual technical reason is the speed of selective cache clearing on akamai is insufficient. the cultural reason is different. a CDN wouldn't help this problem though. [13:46:01] They're supposed to not run on the same boxes as job runners, but that's a different issue [13:46:13] This should be a null point in the very near future [13:46:28] maplebed: that's a good point [13:47:50] yeah I don't see anything helpful on srv288, atop shows swap death, there's a surprise :-/ [13:48:45] vvv: by this side it has no value, I'm dealing with the toughness of our infrastructure [13:50:22] apergos: so, this is actually a swap death? [13:50:45] On what sort of hosts did it previously happen? All memcached? All job runners? [13:51:02] they've all been job runner and memcached AFAIK [13:51:04] http://gdash.wikimedia.org/dashboards/apimethods/ [13:51:12] trying to divine things from these graphs but not getting much [13:51:18] off to sampled_1000 log [13:51:24] Vito: in order to resolve the problem with the toughness of our infrastructure, you must understand how the problem manifests. (in this case, one of the issues is how the memcached library handles timeouts) [13:52:22] So, do I understand correctly that it causes Apache death because of memcached timeouts? [13:52:26] api mobileview rised a lot at that time [13:52:40] a search engine crawling a .m.wikipedia.org site? [13:53:09] but that wouldn't affect search :S [13:54:13] Should search engines be allowed there at all? [13:55:35] maplebed: would automatic memcached fallback solve that problem? [13:55:47] http://en.m.wikipedia.org/robots.txt is the same as http://en.wikipedia.org/robots.txt there doesn't seem to be a blocking rule [13:56:02] vvv, it would need to be automatic memcached rotation [13:56:05] vvv: the problem is the way we distribute mc stuff [13:56:05] vvv he's gone back tosleep [13:56:13] key/hash to host index [13:56:15] it's baaad [13:56:24] or a watchdog which rebooted the box when it's unresponsive :) [13:56:50] Reedy: what's alternative? [13:57:17] something that doesn't suck ;) [13:57:17] I suggest that we setup a fallback server and switch bad nodes to it once some memcached box is down [13:57:25] That's what we do now [13:57:28] Look at the list of spares [13:57:32] Reedy: oh I see, it should be 20% cooler [13:57:39] The memcached client we use sucks also [13:58:00] tim has added a pecl memcached client [13:58:21] it was indeed memcached btw cause you can see in the logs where at 13;17 there's a flood of rejects [13:58:29] we're also waiting on a new memcached cluster to be installed [13:58:31] Why does the client suck except being written in PHP? [13:58:37] I can't remember [13:58:47] so memcached won't share its world with apache boxes [13:58:56] Reedy: if we had a fallback set up, those Apaches would not fall down [13:59:01] I mean an automatic fallback [13:59:02] they complained the other day about the timeouts not being short enough [13:59:16] but we have special code in php to deal with that :/ [14:00:12] jobs only had enotify and refreshlinks2 going [14:00:57] Something like "if memcahed server X is down, switch to backup cluster" [14:01:21] well it should be "if x memcached server is down we don't notice" [14:01:27] unfortunately ... it ain't like that [14:01:47] trying to wade through the sampled log now [14:01:48] apergos: well, if it does switch to backup cluster, we will not notice [14:07:24] Reedy: try think of the good sides of the hash-based balancing. It does not have a balancer as a SPOF [14:08:12] true [14:08:43] Very rarely is anything perfect [14:09:12] What I would do is introduce a fallback data and some cross-process flag set indicating whether given servers are down [14:09:50] let's store a list of unavailable servers on memcached! [14:10:05] That may be a nice idea btw [14:10:10] A SPOFy one, though [14:10:14]