[00:00:12] and why didn't nagios catch this already?
[00:00:23] it's 3am and I'm getting seriously grumpy :)
[00:02:50] can I just get rid of logs?
[00:06:55] woosters: around?
[00:08:39] paravoid: on the ssls?
[00:08:40] yes
[00:08:42] !log esams SSL outage, working on it
[00:08:43] kill the logs with fire
[00:08:47] Logged the message, Master
[00:08:54] I just created a new 50G LV and moving them on it
[00:08:58] SSL just went down?
[00:09:04] PROBLEM - HTTPS on ssl3001 is CRITICAL: Connection refused
[00:09:07] yep
[00:09:18] I could have sworn I disabled the logs ages ago
[00:09:25] was it access or error?
[00:09:30] both?
[00:09:38] access is what filled it up
[00:09:43] damn it
[00:09:49] they should be disabled
[00:09:49] ya?
[00:09:54] woosters: see above, jfyi
[00:09:58] lemme look in puppet
[00:10:16] Ryan_Lane: I'm putting logs in LVs, no worries
[00:10:21] Ryan_Lane: btw, any idea why ssl3004 is down?
[00:10:25] lemme see
[00:10:25] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14.
[00:10:39] yes
[00:10:40] hardware
[00:10:41] Ryan_Lane: also, if you want to do something while I'm working on the rest of them, check why nagios didn't pick up a thing
[00:10:43] it's depooled
[00:10:52] no pages
[00:10:55] don't we have df output?
[00:11:01] er, checks?
[00:11:05] wow. wtf
[00:11:08] it's like it just died
[00:11:24] and why didn't ssl3001 just get depooled?
[00:11:36] I stopped it to move the logs
[00:11:44] I know why nagios didn't alert
[00:11:45] and now I just stopped ssl3002 and ssl3003
[00:11:52] because we're using the sh scheduler
[00:12:09] ah shit
[00:12:12] yep
[00:12:20] shared ssl cache would rock
[00:15:23] either way, a single downed node shouldn't cause an outage
[00:15:28] all of them are full
[00:15:40] 100%?
[00:16:01] 97%, which I presume is the space reserved for root
[00:16:14] nginx died on all of them?
[00:16:31] I didn't check but it doesn't die, it just returns 500s
[00:16:42] didn't check all of them
[00:16:46] that was at least the ssl3001 behavior
[00:17:00] I think only ssl3001 actually died
[00:17:03] but...
[00:17:06] that *will* cause a small outage
[00:17:13] since we're using sh
[00:17:32] failover isn't immediate, when using that scheduler
[00:17:47] note how these servers are also IPv6
[00:17:53] which uses wrr
[00:17:56] it's less of a problem for ipv6
[00:18:03] that's a separate service
[00:18:09] yes
[00:18:12] it'll get depooled separately
[00:18:20] so, ipv6 will switch immediately
[00:18:22] ssl will not
[00:18:34] I know
[00:18:48] are the access logs only for ipv6?
[00:18:54] or are they also for ipv4?
[00:19:10] error logs are enabled, but access logs should be disabled
[00:19:15] both
[00:19:19] they aren't explicitly disabled for ipv6
[00:19:23] well, that's a problem
[00:19:36] error_log /var/log/nginx/<%= name %>.error.log;
[00:19:36] access_log off;
[00:19:40] oh hm, no
[00:19:44] just ipv6 error logs
[00:19:56] sorry, it's getting late :)
[00:19:57] no access logs?
[00:20:13] no ipv6 access
[00:20:15] it's ok. I can also just look myself :)
[00:20:18] they do have ssl access
[00:20:27] well, wtf
[00:20:28] and have both ssl & ipv6 error
[00:20:30] they are disabled
[00:20:36] yeah. error log makes sense
[00:20:49] oh
[00:20:52] again, wrong
[00:21:01] there's just an "access.log"
[00:21:04] access is only ipv6
[00:21:08] damn it
[00:21:08] which has ipv6 addresses
[00:21:09] right
[00:21:16] how did I miss that?
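For reference, a minimal sketch of the stopgap paravoid describes above: a dedicated 50G logical volume for the nginx logs, so a runaway access log can't fill / again. The volume group name, filesystem, and mount point below are assumptions, not taken from the log, and the log notes further down that this LV move was rolled back once the access logs were disabled.

```bash
# Hedged sketch, not the actual commands run: give nginx logs their own 50G LV
# so a runaway access log cannot fill the root filesystem.
# The volume group name (vg0) and the mount point are assumptions.
lvcreate -L 50G -n nginxlogs vg0
mkfs.ext4 /dev/vg0/nginxlogs

service nginx stop                       # the node is stopped (and depooled) first
mkdir -p /srv/nginx-logs
mount /dev/vg0/nginxlogs /srv/nginx-logs
rsync -a /var/log/nginx/ /srv/nginx-logs/
rm -rf /var/log/nginx
ln -s /srv/nginx-logs /var/log/nginx     # keep the path the nginx config expects
service nginx start
```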
[00:21:20] oh
[00:21:22] I know why
[00:21:31] we had access logs enabled for IPv6 on purpose before
[00:21:36] because we had stats for it
[00:21:58] I'm fixing it
[00:22:03] I got it
[00:22:33] New patchset: Ryan Lane; "Disable access logs for ipv6 nginx" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11721
[00:22:38] ah, I had it too
[00:23:02] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11721
[00:23:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11721
[00:23:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11721
[00:23:24] so...
[00:23:32] we don't have notify set for this
[00:23:37] err
[00:23:41] wait
[00:23:43] ?
[00:24:15] what's "ipv6and4"?
[00:24:18] it still has access logs
[00:24:27] yes
[00:24:32] that was what we had the stats running on
[00:24:41] let's not mess with that just yet
[00:24:49] it's small
[00:24:56] okay
[00:25:01] force-running puppet on the servers
[00:25:11] puppet won't restart nginx
[00:25:19] !log SSL ipv6 access logs disabled; force running puppet and rm'ing access.logs on esams
[00:25:24] Logged the message, Master
[00:25:25] it does for the main config, but not the sub-configs
[00:25:33] will do
[00:25:37] I'm logged in all of them anyway
[00:25:41] * Ryan_Lane nods
[00:25:55] I did that in the past because I was worried about puppet bringing them all down :)
[00:26:15] we can likely change that now
[00:26:33] woosters: renew ssl on ssl3001  - Certificate will expire on 07/19/2016 16:14.
[00:26:35] .....
[00:26:39] that's 3 years from now
[00:26:54] probably don't need a ticket for that ;)
[00:27:04] the message sent by nagios is normal for when a server comes back up
[00:27:33] lo
[00:27:35] l
[00:28:33] yea
[00:28:43] i noticed that and cancelled that ticket ;-P
[00:28:46] heh
[00:28:59] thought it was 2012
[00:29:06] :D
[00:31:05] why do we use 250GB disks and allocate 9GB for /
[00:31:08] how stupid is that?!?!
[00:31:22] it was some fucked up partmap recipe
[00:31:29] Dude, don't get me started, hume has a 5GB partition for /usr/local/apache
[00:31:51] also, the logs were supposed to be disabled ;)
[00:32:11] Even though we've made a point of putting it on a separate partition on all app servers so we could grow it from ~7GB to ~100GB
[00:32:20] it matters on the app servers
[00:32:29] it shouldn't matter at all on reverse proxies
[00:32:52] we likely would have had the same result, but a few weeks from now
[00:33:32] I'd be perfectly happy having those be diskless
[00:34:00] !log esams SSL should be back up
[00:34:05] Logged the message, Master
[00:34:13] now let's force run puppet and cleanup on pmtpa and eqiad :)
[00:34:26] yes, indeed
[00:34:46] I blame ipv6
[00:34:47] :)
[00:34:51] hahaha
[00:35:01] oh
[00:35:02] right
[00:35:08] I get all the blame for that anyways
[00:35:10] do we not have df checks?
[00:35:25] I have no clue why we wouldn't
[00:35:36] maybe we don't have a standard set of checks for all nodes?
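For reference, the per-host disk check being asked about above exists as a stock Nagios plugin (check_disk); a hedged sketch follows. The thresholds, mount points, and plugin path are illustrative assumptions, not the actual check definitions used here.

```bash
# Hypothetical disk-space check along the lines discussed above: warn below 10%
# free, go critical below 5%, on / and /var/log.
# The plugin path is the Debian/Ubuntu nagios-plugins location.
/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p / -p /var/log
```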
[00:36:45] you doing the puppet force run?
[00:36:49] yes
[00:36:51] ok
[00:36:52] and rm access.log
[00:36:59] we have an ssl dsh group, btw
[00:36:59] and restart nginx
[00:37:22] and rolling back the separate LV for logs that I had begun deploying :)
[00:37:57] Thehelpfulone: thanks for noticing and notifying us
[00:38:03] indeed
[00:38:10] Thehelpfulone: it would have probably gone undetected for hours if it weren't for you
[00:38:15] he was probably one of the unlucky 1/3 that was affected
[00:38:28] no problem :)
[00:38:35] well, it would have failed over to the other nodes, which would have then died quickly, then we'd have been paged :)
[00:38:44] while we were sleeping
[00:38:49] so indeed, thank you :)
[00:38:50] why would it?
[00:38:56] does pybal check for 500s too?
[00:39:00] yes
[00:39:04] on random URLs?
[00:39:14] no, but it would get a 500
[00:39:17] not everything was failing, just URLs with contents
[00:39:21] I'm not sure if / would fail
[00:39:27] it would have content
[00:39:31] wait.
[00:39:40] do we do more than a connection check there?
[00:39:42] we don't :(
[00:39:48] but
[00:39:53] the ssh check would have failed
[00:39:56] so, undetected :)
[00:39:58] why?
[00:40:03] it tries to write
[00:40:13] oh, nice
[00:40:13] if it can't it fails
[00:40:15] yes
[00:40:19] you mean our check?
[00:40:20] or sshd?
[00:40:23] ours
[00:40:26] oh, coool
[00:40:32] why didn't it fail already though?
[00:40:38] of course, that's assuming we have the same check on that one
[00:40:51] 3001 was full when I started troubleshooting
[00:40:55] what we really need is a content check via ssh
[00:41:04] did you see if it was pooled or not?
[00:41:09] I'm betting it was depooled
[00:41:33] but, just because it's depooled doesn't mean much
[00:41:36] if it was, how would Thehelpfulone be getting 500s? :-)
[00:41:37] it'll still have connections open
[00:41:42] sh scheduler
[00:41:52] read about its failure mode
[00:41:59] oh, right
[00:42:01] dammit
[00:42:10] it takes a bit longer
[00:42:22] we need to work on getting stud :)
[00:42:31] nginx just released SPDY support :)
[00:42:34] like, hours ago
[00:42:35] \o/
[00:42:42] wait.
[00:42:43] no
[00:42:47] no?
[00:42:48] http://nginx.org/patches/spdy/ :)
[00:42:51] that actually makes me less willing to switch :(
[00:42:58] exactly :)
[00:43:00] that and the 1.1 support
[00:43:06] there's a SSL memcached patch for nginx floating around
[00:43:12] yeah
[00:43:16] it needs a shit-ton of work
[00:43:23] yep
[00:43:27] and the guy who wrote it
[00:43:34] but it might be worth the effort
[00:43:34] switched from an IT career
[00:43:38] yeah
[00:43:40] to being a train driver
[00:43:43] we'd need to maintain it ourselves
[00:43:46] wtf is up with that :)
[00:43:51] they make more money
[00:43:57] and it's less stressful
[00:44:39] and come on, you get to drive a train, that sounds fun to me :)
[00:44:54] (actually, it sounds incredibly boring)
[00:45:53] i've looked for that patch!
[00:46:10] it's not very good :(
[00:46:19] but. we can fix it
[00:46:22] * Ryan_Lane shrugs
[00:46:52] this is the kind of magazine I'd expect to find in "The good hotel" http://25.media.tumblr.com/tumblr_m5myygZnf01r2fm6fo1_500.jpg
[00:46:54] i'm surprised that stud broadcasts ssl session data to all other instances instead of using memcached
[00:47:02] the train driver gets to blow his horn ;-P
[00:47:08] binasher: yeah, same
[00:47:35] binasher: what happens when one of the ssl servers dies, and its memcache goes away?
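As an aside on the shared-session-cache question above: whether a TLS endpoint resumes sessions at all can be observed from the outside with openssl s_client. A rough sketch, using a public hostname purely as a placeholder; "Reused" lines mean the server answered a resumption attempt from its session cache, "New" lines mean a full handshake.

```bash
# Reconnect several times with the same session ID and report whether each
# connection was resumed ("Reused") or fully renegotiated ("New").
# The hostname is only a placeholder endpoint, not taken from the log.
echo | openssl s_client -connect en.wikipedia.org:443 -reconnect 2>/dev/null \
  | grep -E '^(New|Reused),'
```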
[00:47:59] the hashing algorithm stores the cache info on another?
[00:48:44] 1/nth of clients would have to re-negotiate
[00:48:56] until the node came back up?
[00:49:03] or just once?
[00:49:04] no, that wouldn't work
[00:49:09] binasher: there was some explanation on the bug tracker that I don't remember
[00:49:13] let me find it
[00:49:32] broadcasting means there's no renegotiation necessary in case of failure
[00:50:26] "Storing a new session is done asynchronously. However, getting a session from memcached is done synchronously. Therefore, there is a major performance drawback. Unfortunately, there is no way to register an asynchronous callback with OpenSSL. Either we could use some threads (eeek) or insert remote sessions into the local cache. I don't know if memcached allows something like this. We may switch to something that would broadcast insertions
[00:50:33] from https://github.com/bumptech/stud/pull/29
[00:50:36] binasher: ^^^
[00:50:41] ah
[00:50:50] that makes sense
[00:52:48] are the licenses compatible? can we just repurpose stud's implementation?
[00:53:27] let's make stud+varnish appliances and call it studly
[00:53:31] :D
[00:53:41] binasher: I want the SPDY support in nginx :(
[00:54:20] there's a spedye thing too
[00:54:28] which is inspired by stud but is a spdy proxy
[00:54:32] also has X-F-F
[00:54:39] no SSL session cache though :)
[00:54:44] bleh
[00:54:47] also, I think that SPDY also has some session information
[00:54:53] i.e. keeps state
[00:55:02] yes, of course it does
[00:55:05] https://github.com/pquerna/spedye
[00:55:16] oh paravoid already mentioned
[00:55:31] no ssl session cache would be the fail
[00:55:38] "based upon the ideas of stud"
[00:55:40] uh huh
[00:56:44] Ryan_Lane: i.e. if we do spdy, we need more than shared ssl session cache to get rid of sh
[00:56:49] we need shared spdy state too
[00:56:52] good luck with that
[00:57:20] or we could give up and hash on client ip
[00:57:20] true
[00:57:34] binasher: that's what sh does
[00:57:55] ah, source hash?
[00:57:59] which is what we're using now
[00:58:00] yeah
[00:58:10] we're like to switch to wrr
[00:58:22] *we'd
[00:58:30] (it's late :) )
[01:00:49] ah. spdy support in nginx isn't really usable right now anyway
[01:00:59] it's draft 2 implementation
[01:03:57] yes
[01:04:17] but we have to take its development into account :)
[01:04:26] yes. true
[01:04:29] before deciding on switching to stud (which we can't right now anyway)
[01:04:35] yep
[01:04:56] we should figure out if it makes more sense to spend dev time on one or the other
[01:04:56] also,
[01:05:10] with chrome + firefox 13+ supporting spdy by default
[01:05:18] the 4*3 servers won't be nearly enough
[01:05:47] we'd want it on every server
[01:05:51] which also means that we'd need "sh" for half of our traffic
[01:06:03] we've thought about it in the past
[01:06:13] we wanted to make sure ssl wouldn't eat up too much memory
[01:06:41] though we could always patch openssl for that anyway
[01:06:55] (I wanted to avoid patching openssl, if possible)
[01:07:02] patch it for what?
[01:07:22] there's a patch google made that dramatically lowers the amount of memory used
[01:07:48] I think it's upstreamed, but it's in a much, much newer version of openssl from what I remember
[01:08:08] much newer than precise too?
[01:08:19] I believe so. lemme see
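A hedged sketch of the kind of check implied here: confirming whether the memory-saving OpenSSL mode being discussed (named just below as SSL_MODE_RELEASE_BUFFERS) is present on a given host. It assumes the libssl-dev headers are installed; the paths are Ubuntu's.

```bash
# Print the installed OpenSSL version, then check whether the memory-saving
# mode is declared in its headers (requires libssl-dev for ssl.h).
openssl version
grep -n 'SSL_MODE_RELEASE_BUFFERS' /usr/include/openssl/ssl.h \
  && echo "SSL_MODE_RELEASE_BUFFERS is available on this host"
```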
[01:09:00] hmm, apparently you can do SPDY without SSL too
[01:09:07] oh
[01:09:12] it's in precise
[01:09:33] :-)
[01:09:46] it's SSL_MODE_RELEASE_BUFFERS
[01:11:19] we need to upgrade the ssl servers :)
[01:11:44] it'll get us that and HTTP 1.1 support
[01:12:17] anyway, I need to go to bed
[01:12:26] twitter does spdy already
[01:12:28] damn them :)
[01:12:35] fuckers
[01:12:43] well, we beat them to ipv6 anyway
[01:12:47] hahaha
[01:12:53] it's 2-1 though
[01:12:59] ssl, spdy vs. ipv6
[01:13:11] ipv6 is more important
[01:13:43] ok. off to bed
[01:13:44] * Ryan_Lane waves
[01:13:46] night
[01:13:53] that was an intense day...
[01:41:36] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 288 seconds
[01:42:12] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 235 seconds
[01:47:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds
[01:49:06] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 650s
[01:51:57] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 11s
[01:52:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds
[01:58:06] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[01:58:07] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[02:13:42] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 192 seconds
[02:13:42] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 193 seconds
[02:41:28] New patchset: Dereckson; "(bug 37644) Enable subpages on be.wikimedia.org for NS 0 and 4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11723
[02:41:34] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11723
[02:43:27] New review: SPQRobin; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/11723
[02:55:36] New patchset: Dereckson; "(bug 37644) Enable subpages on be.wikimedia.org for NS 0 and 4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11723
[02:55:43] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11723
[02:56:18] New review: Dereckson; "Trailing space missing." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11723
[02:57:09] New review: Dereckson; "Added trailing space." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11723
[02:59:46] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 28 seconds
[03:00:49] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 20 seconds
[03:00:49] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[03:41:36] New review: SPQRobin; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/11723
[03:58:27] New patchset: SPQRobin; "Bug 37614 - Change project talk namespace for bjnwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11583
[03:58:34] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11583
[03:59:47] New patchset: SPQRobin; "Bug 37614 - Change project talk namespace for bjnwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11583
[03:59:53] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11583
[04:03:17] New review: SPQRobin; "* Good idea, I have done that now" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11583
[04:07:59] New review: SPQRobin; "(I was thinking though, if the sitename is changed, then it will also affect the project talk namesp..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11583
[04:09:29] RECOVERY - Puppet freshness on professor is OK: puppet ran at Sat Jun 16 04:09:19 UTC 2012
[04:46:14] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[05:58:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[06:49:41] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours
[11:59:26] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[12:11:17] paravoid: still no change
[12:11:45] so, I'm going to change it to labs-ns0/1
[12:12:00] or should we just keep operating like now and wait till monday?
[12:12:18] nothing is currently broken, so it may be better to only need to make a single change
[13:01:59] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[13:41:53] PROBLEM - MySQL Idle Transactions on db52 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:47:44] RECOVERY - MySQL Idle Transactions on db52 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[14:46:56] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[15:23:14] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:39] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[16:50:39] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours
[18:12:01] New patchset: Ryan Lane; "Initial commit of adminbot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/11732
[18:12:29] New review: Ryan Lane; "(no comment)" [operations/debs/adminbot] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11732
[18:12:32] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/11732
[22:00:39] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[23:02:37] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours