[00:00:11] !log completed user.user_email index migrations [00:00:22] Logged the message, Master [00:02:45] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [00:02:45] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:02:45] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:02:45] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:06:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:08:45] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:10:45] New patchset: Tim Starling; "On ptwiki set wgCategoryCollation to uca-default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21024 [00:11:12] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21024 [00:20:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [00:25:52] binasher: There's no active query killer? [00:26:46] 5 days is excessive [00:27:06] 5 minutes is excessive [00:27:08] binasher: Mind sending me the full SQL of that query so I can figure out why it ran for 5 days and fix the code so it doesn't run those any more? [00:27:18] Heck, 5 seconds is excessive on the web [00:27:28] Brooke: there isn't currently, there should be [00:28:04] I would support having a query killer for wikiuser (i.e. MediaWiki), but not for wikiadmin (i.e. manual queries) [00:29:17] domas previously had one in place that ran from fenari.. i've been thinking about reimplementing with pt-kill on each local db but haven't gotten around to it, or defining exact parameters [00:31:25] RoanKattouw: the full query was [00:31:36] SELECT /* ApiQueryLogEvents::execute */ /*! STRAIGHT_JOIN */ log_type, log_action, log_timestamp, log_deleted, log_id, page_id, log_user, user_name, log_namespace, log_title, log_comment, log_params FROM `logging` FORCE INDEX (times) JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title)) INNER JOIN `change_tag` FORCE INDEX (ct_tag) ON ((log_id=ct_log_id)) WHERE [00:31:37] (log_type != 'suppress') AND ct_tag = 'posible vandalismo' ORDER BY log_timestamp DESC LIMIT 11 [00:33:02] Blegh change tagging [00:33:14] I'll poke in a minute [00:34:51] change_tag has an interesting schema [00:53:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [01:20:13] Deadlock found when trying to get lock; try restarting transaction (10.0.6.48) UPDATE `page` SET page_touched = '20120822005443' WHERE page_namespace = '0' AND page_title = 'Mitt_Romney\'s_tax_returns' [01:20:17] heh [01:20:26] if anything should be a deadlock.. 
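A minimal sketch of the watchdog being discussed above (the per-host pt-kill idea), limited to the MediaWiki web user so manual wikiadmin queries are left alone. Everything here (the MySQLdb module, the watchdog account, the thresholds) is an illustrative assumption, not the tooling that actually ran; it also assumes an account with enough privilege to see and kill other sessions.

    import time
    import MySQLdb   # assumption: python-mysqldb is available on the host

    MAX_WEB_QUERY_SECONDS = 300       # "5 minutes is excessive"
    TARGET_USER = 'wikiuser'          # MediaWiki web user only; leave wikiadmin alone

    def kill_long_queries(conn):
        cur = conn.cursor()
        cur.execute(
            "SELECT id, time, info FROM information_schema.PROCESSLIST"
            " WHERE user = %s AND command = 'Query' AND time > %s",
            (TARGET_USER, MAX_WEB_QUERY_SECONDS))
        for pid, secs, sql in cur.fetchall():
            print("killing %s after %ss: %.80s" % (pid, secs, sql or ''))
            cur.execute("KILL QUERY %d" % int(pid))

    if __name__ == '__main__':
        db = MySQLdb.connect(host='127.0.0.1', user='watchdog', passwd='...')
        while True:
            kill_long_queries(db)
            time.sleep(30)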
[01:20:55] hhaha [01:22:37] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [01:32:40] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:32:40] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:31] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 234 seconds [01:42:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 289 seconds [01:49:01] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 685s [01:54:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds [01:57:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:59:31] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 10s [01:59:49] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds [02:27:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [05:27:41] New patchset: Tim Starling; "Enable Scribunto on www.mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21027 [05:28:04] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21027 [05:47:29] !log preparing to switch s1 master to db63 (upgrade to 96GB ram and precise) in a minute [05:47:38] Logged the message, Master [05:53:30] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 232 seconds [05:53:30] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 232 seconds [05:53:30] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 232 seconds [05:53:30] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CRIT replication delay 232 seconds [05:53:48] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 250 seconds [05:54:15] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 277 seconds [05:54:15] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 277 seconds [05:54:15] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 278 seconds [05:55:27] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:55:34] New patchset: Asher; "db63 is the new s1 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21028 [05:55:45] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [05:55:45] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [05:56:17] these heartbeat warnings are ok and will clear shortly [05:56:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21028 [05:56:30] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [05:56:30] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [05:56:30] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 0 seconds [05:56:30] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [05:56:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21028 [05:56:48] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [05:57:15] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 0 seconds [06:01:00] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [06:01:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:01:01] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:01:02] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [06:01:03] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:01:03] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:01:03] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:02:09] !log shutting down mysql on db38 and rebooting for kernel upgrade [06:02:18] Logged the message, Master [06:03:51] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [06:14:52] !log new s1 repl position: MASTER_LOG_FILE='db63-bin.000049', MASTER_LOG_POS=646903534 [06:15:02] Logged the message, Master [06:24:21] binasher: I didn't kill long running queries [06:24:29] I killed code that was running long running queries [06:24:29] :) [06:25:01] hah! much better [06:35:42] binasher: can you look what is the problem here? https://commons.wikimedia.org/wiki/File:Pelican3.jpg [06:42:18] matanya: what am i looking for? [06:42:37] the file has a page, but no file [06:44:01] https://upload.wikimedia.org/wikipedia/commons/2/2f/Pelican3.jpg doesn't load for you? [06:44:30] now it does [06:44:32] thanks [06:45:21] hmm.. [07:31:22] New review: Tim Starling; "Can be merged after the migration job running on ms7 is finished (screen -r 3619.pts-6.ms7)." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/20844 [07:45:55] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [07:56:25] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:02:25] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [08:05:25] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:45:06] morning [08:45:34] apergos: what do you mean by obvious? [08:45:49] er, obvious? :-D [08:45:58] why would originals to swift put additional load on ms5? [08:46:13] there were an additional 5 gets per second. [08:46:28] not much, but prolly enough to push it over the edge given how fragile it now is [08:47:02] ben's right that there were already a few nfs timeout incidents here and there [08:47:48] my proposal is going ot be to call off tonight's deployment attempt and reschedule when we have a decent chunk of space back [09:21:28] isn't today's window about the 1.5 upgrade? [09:21:54] I'm more worried about needing Ben for that than the originals switch. [09:27:50] his email seemed to indicate otherwise [09:27:54] "We have a window to try again tomorrow morning at 9am. " [09:28:16] anyways I think we cannot move forward with the originals switch [09:28:38] for a few days at least (hope we get it done while here's here though) [09:29:46] I don't think we should do random attempts on switching without knowing exactly what went wrong [09:30:09] I said so yesterday too [09:31:13] is someone suggesting doing a randm attempt? [09:31:33] I'm not, you're not, and I don't read Ben's email that way either [09:31:51] anyways, tonight's window might as well be used for the upgrade, I don't know what prep is needed for that though [09:42:34] well, we don't know what's wrong yet. attempting to re-deploy and see what will happen is what Ben suggested as far as I understand [09:42:40] I disagree with that, that's all I'm saying [09:43:42] we know one of the things that's wrong, and it will take at least a few days to fix it... [09:43:48] probably we're both saying the same thing [09:44:09] so can you tell me a little about the upgrade? I know nothing about this [09:44:30] I still don't see why would the originals switch to swift put additional load on ms5 [09:44:46] and how is it changing /any/ access pattern that would affect NFS [09:46:15] more i/o on the disks [09:46:22] not much more but not much is needed at this point [09:46:38] I don't know why there were get requests to ms5, but there were [09:54:17] also we are going to have to kickstart the discussion about thumb purging generally; the idea thaat we generate all sizes on demand and keep them forever is not sustainable [09:55:05] this has come up on the wikitech mailing list before but it needs followthrough til a decision is reached [09:57:47] more i/o on the disks *why*? our change had nothing to do with thumbs [09:57:53] let alone NFS [09:58:45] in theory it had nothing to do with thumbs [09:59:37] in practice too, remember how we didn't even change the swift_thumbs acl [09:59:43] in practice there were gets [09:59:47] recorded in the error log [09:59:51] and on the ganglia graphs [10:00:06] those stopped dead after the revert [10:00:20] (and tcp dump shows there are none, I watched it for quite a while) [10:00:25] do you se the problem here? 
:) [10:00:39] the first problem is the nfs issue [10:00:40] we have requests in a backend system that our design did not intend to [10:00:46] that's the real design issue [10:00:50] and the space issue, that is extremely urgent [10:01:09] the next issue is where the gets come from; I raised this last night repeatedly [10:01:22] ben also mentioned it in his email I believe [10:02:34] (iterally: they were coming from the squids. but what generated them, no idea, and I would sure like to know) [10:03:10] from the squids? there were no references to ms5 on the squid configs [10:03:32] I'm sure of that [10:03:35] from the squids. that's right. [10:03:55] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [10:03:55] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:03:55] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:03:55] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:04:04] our new config that got pushed to sq51 had no reference to ms5 at all [10:04:16] the rest of them have ms5 but under an ACL that never matches [10:04:18] mostly sq41 but a few others [10:04:37] aha, I think I know what happened there [10:04:44] oh? [10:04:45] I have a theory [10:04:57] a theory that has to do with sq41? I am fascinated :-D [10:05:03] :-) [10:05:39] so [10:05:57] the "old" config, the one that's deployed now [10:06:13] and was in production yesterday in every squid but sq51 [10:06:32] has both swift and ms5, with essentially the same ACL with different names [10:06:38] (and ms7 as a gateway of last resort) [10:06:41] right. [10:07:06] which already makes me nervous (that they have ms5 in there) [10:07:11] anyways, you were saying [10:08:02] we (rightfully) have maxconns and connect-timeout in those cache-peer definitions [10:08:12] at one point swift became completely unresponsive [10:08:16] s you think there was failover? [10:08:20] that was ben's theory too [10:08:21] so some squids falled back on ms5 [10:08:28] makes absolute sense [10:08:31] I don't know enough about squid config to know if that's how it works [10:08:42] (read: I know very little about squid config) [10:08:50] well, if squid was taking 40s to respond to a thumb (we know /that/ happened) [10:09:06] it's easy to imagine those 200 max connections getting filled up [10:09:31] well one thing that could be done is to pull that acl [10:09:37] it really shouldn't be in there [10:10:03] that's easy to do [10:10:10] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:10:24] however, note that if that was in effect yesterday, we'd be returning 500s to clients [10:10:29] instead of gracefully falling back to ms5 [10:10:30] yes we would [10:10:36] so in a sense, this saved us from a bigger outage [10:10:49] I'd rather return 500s tbh [10:11:23] incidentally mark just sent a mail that describes that [10:13:12] uh huh [10:13:46] https://www.youtube.com/watch?v=ByY3CgR7-4E [10:21:45] mark: ping? [10:29:00] back in a little bit, must get cat food/litter before the place closes [11:02:30] yes [11:02:38] hey [11:03:08] just sent a second mail [11:03:15] so... [11:03:47] I also see a bit of a deeper design flaw, if you're up to discussing it [11:05:50] yes [11:06:51] so... [11:07:37] a client requests a thumb from swift. 
swift's rewrite.py checks if it's there and if it's not it fetches it from the image scalers
[11:08:13] mid-flight/synchronously, i.e. the client waits while swift's request to an imagescaler is happening, to get back the result
[11:08:23] yes
[11:08:54] now, the imagescalers check if the thumb is there with a HEAD (bypassing rewrite.py), and if it's not (the case here) they fetch the original from swift with a GET
[11:09:13] that already can be optimized, but go on
[11:09:13] this is not a real loop, due to the bypass, but I think it's a *performance* loop
[11:09:16] yes
[11:09:24] if swift becomes slow for whatever reason
[11:09:59] the problem will cascade to the imagescalers, which will have a lot of i/o wait
[11:10:05] i don't see the i/o wait
[11:10:13] waiting on swift is not i/o wait, waiting on nfs is
[11:10:13] waiting for Swift to respond
[11:10:16] but sure
[11:10:38] it's waiting on a network response anyway, so almost the same
[11:10:43] network i/o I meant.
[11:10:46] alright
[11:11:08] so, imagescalers will wait for swift a lot and gain a larger outstanding queue
[11:11:16] making it worse
[11:11:17] of requests waiting for a scale to happen
[11:11:36] then this will cascade to swift again, which in addition to the original problem
[11:11:49] it will now have a larger outstanding queue of thumbs
[11:12:07] waiting on the imagescalers
[11:12:11] which will wait on swift
[11:12:13] and so on.
[11:12:14] i think i've seen this happen too, 2 days ago
[11:12:33] but then I didn't understand completely what was going on
[11:12:37] but I've seen the same in top
[11:12:55] lots of apache processes, few/no scaling happening
[11:12:59] and later it looked different
[11:13:01] a cycle
[11:13:16] so i'll buy this
[11:13:36] so:
[11:13:44] even if this hasn't happened before, it will happen at some point
[11:13:58] 1. that HEAD check needs to go. It can do a GET rightaway if it doesn't know whether a thumb exists or not
[11:13:59] it's a cascading problem on a loop
[11:14:09] 2. IF rewrite.py already determined a thumb doesn't exist, we can pass that as a parameter
[11:14:18] then mediawiki can go straight on with scaling
[11:14:25] (with some authentication so outsiders can't do this)
[11:14:37] 3. we need strong control on timeouts for all fetches
[11:14:37] I thought of (2), but how you deal with concurrency?
[11:14:54] how does the current code deal with it?
[11:14:59] does it?
[11:15:10] poolcounter can do this
[11:15:18] I don't know, but expanding the race across systems doesn't sound good
[11:15:31] i don't see how it's different from now?
[11:15:59] it's not very different, no
[11:16:10] perhaps the race is slightly bigger, that's all
[11:16:29] right. fetch/fetch/write never solves a fetch/write race :)
[11:16:37] indeed
[11:16:59] where is this code?
[11:17:04] it sounds like you have read it
[11:17:08] which code?
[11:17:19] rewrite.py?
[11:17:22] or PHP?
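mark's three fixes above (drop the scalers' HEAD pre-check, pass a "known missing" hint when Swift has already 404'd, and put hard timeouts on every fetch) could look roughly like this on the Swift side. This is a hypothetical sketch, not the real rewrite.py: the rendering.svc.pmtpa.wmnet LVS name appears elsewhere in this log, while the wmf-thumb-missing parameter, the secret header, and the timeout value are invented for illustration.

    import urllib2

    SCALER_URL = 'http://rendering.svc.pmtpa.wmnet'  # scaler LVS name seen elsewhere in this log
    FETCH_TIMEOUT = 30                               # point 3: hard timeout in seconds (assumed value)
    SHARED_SECRET = 'not-the-real-secret'            # point 2: so outsiders can't claim "missing"

    def fetch_thumb_from_scaler(thumb_path):
        # Called only once Swift has already 404'd on the thumb, so the scaler
        # can skip its own HEAD existence check (point 1) and render right away.
        url = '%s%s?wmf-thumb-missing=1' % (SCALER_URL, thumb_path)   # invented parameter name
        req = urllib2.Request(url, headers={'X-Wmf-Scaler-Secret': SHARED_SECRET})
        try:
            resp = urllib2.urlopen(req, timeout=FETCH_TIMEOUT)
            return resp.getcode(), resp.read()
        except urllib2.HTTPError as e:
            return e.code, ''
        except urllib2.URLError:
            # Fail fast instead of letting requests pile up inside the Swift proxy.
            return 504, ''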
[11:17:39] php [11:18:30] I've briefly seen it but my observations are result of symptoms and discussions with Aaron [11:18:37] ok [11:18:51] I grepped through a random MW box /usr/local/apache [11:18:53] so I also want explicit timeout control [11:18:59] otherwise this solution is just as bad as NFS [11:19:26] and probably some form of locking through poolcounter, if it's not already there [11:19:34] even wit those three -with which I don't disagree at all- [11:19:40] the dependency cycle still stands [11:20:21] perhaps rewrite.py could POST the original to mediawiki or something [11:20:26] although that feels a bit dirty too [11:20:36] it's also not like rewrite.py has the data readily available [11:20:43] it translates into backend requests inside swift and such [11:21:19] hm, could we http redirect? [11:21:57] it would be ugly for clients, and squid internally resolving that would be ugly too [11:22:09] (if at all possible) [11:22:29] i don't see what you mean [11:22:43] redirecting squid to thumb.php? [11:23:05] if swift determines the thumb is not there, redirecting to imagescalers instead of fetching it synchronously [11:23:31] might work with squid, would need testing [11:23:48] feels dirty too though [11:24:03] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [11:24:03] it does not help that much [11:24:27] it reduces the amount of resources blocked inside the swift proxy I guess [11:24:33] yeah, that was my point [11:24:41] it wouldn't help that much though [11:24:55] it was nice how thumbs and originals were in different systems with NFS [11:25:12] we could use different proxies for them in swift [11:25:27] we have 6 or so? [11:25:35] (ridiculous amount for the task if you ask me) [11:25:48] 4 + 4 from what I can see [11:25:51] heh :) [11:25:55] wow [11:26:39] 4 per DC I mean [11:27:06] right [11:27:20] well, we could do 2+2 per dc and assume that in emergencies the other cluster can do the exact same tasks anyway [11:27:31] but does this really help us much? [11:27:59] wouldn't just increasing some bottleneck in the swift proxy do the same, except not fix it conceptually [11:28:22] the frontend proxies don't look loaded at all in ganglia [11:28:28] but again, that doesn't say much with swift :P [11:28:36] heh [11:30:48] another interesting data point: the imagescalers are still not on the pre-cache flush load levels [11:31:14] that remains weird [11:31:32] also note that yesterday we did not flush any cache at all [11:31:40] yeah [11:31:47] btw, i have the nfs /home migration in a few hours [11:31:50] but I might cancel that [11:31:51] apergos: ping [11:31:54] that's right, the symptoms were noticed when the cutover took place [11:32:01] mark: ponnngggg [11:32:02] apergos: are you aware of this, the /home migration? [11:32:11] you're one of the main users of /home mounts [11:32:30] and we just redirected one single squid *with* its cache to swift originals (without touching thumbs) [11:32:39] that's just nothing in traffic increase [11:32:40] imagine what had happened with all squids [11:32:42] I forgot when it was [11:32:49] apergos: in 2.5 hours [11:33:13] no problem [11:33:20] i might cancel, not sure yet [11:33:27] I assume I can be on bastion1001 without any issues? [11:33:27] i'm behind in preparation due to other stuff [11:33:29] (like swift) [11:33:31] yes [11:33:34] (I'm not relying on the mount for anything) [11:33:39] you're not? [11:33:42] you're not using /home? 
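The "could we http redirect?" branch of the conversation, sketched as a WSGI-style response: instead of holding a Swift proxy worker while the scaler renders, answer the miss with a 302 to the scaler cluster. Hypothetical only, and as mark says, whether Squid would resolve such a redirect internally would need testing; the endpoint name is an assumption.

    SCALER_BASE = 'http://rendering.svc.pmtpa.wmnet'   # assumed internal scaler endpoint

    def thumb_miss_response(environ, start_response):
        # Answer the thumb miss with a redirect instead of proxying the render
        # synchronously, so no Swift proxy worker is tied up while scaling runs.
        location = SCALER_BASE + environ['PATH_INFO']
        start_response('302 Found', [('Location', location), ('Content-Length', '0')])
        return [b'']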
[11:33:42] no [11:33:45] I'm in it [11:33:49] but I don't need to be [11:34:06] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:34:06] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:35:28] right [11:35:33] i thought your dump stuff used NFS [11:35:37] noooo [11:35:39] but i no longer see them in the client list [11:35:39] gj [11:35:41] it absolutely does not [11:36:08] thnks for checkign though [11:36:09] hm. [11:36:15] what's port 2049 in ms5 used for? [11:36:23] dunno [11:36:26] ah that's nfs [11:36:28] never mind me [11:36:40] doh [11:37:04] so, I can easily remove ms5 from squids completely [11:37:08] I'm not sure if I should though [11:37:34] i don't know why that wasn't done during the deployment [11:37:37] if anything, ms5 *helped* yesterday, by picking up the load that swift couldn't handle [11:37:43] sure [11:37:56] but in the desired case where swift actually works as desired... ;) [11:37:59] we can't have ms5 [11:38:13] I'd rather have the failure (and have it be obvious). we want ms5 gone in the end anyways [11:38:42] !log Removed /homewmf NFS mount on kaulen [11:38:52] Logged the message, Master [11:38:53] look at it this way, with it out, it's one less variable in the mix [11:40:37] how nice [11:40:43] we only have a few /home mount clients left [11:40:49] mostly fenari, spence [11:40:50] and stat1/bayes [11:40:54] and hume [11:41:07] and bast1001 soon? [11:41:12] no [11:41:17] not bast1001 [11:41:36] sure, bast1001 right now has a readonly replica of /home mounted from the eqiad netapp [11:41:40] but even that's gonna be discussed [11:41:45] \o/ (for not bast1001) [11:41:46] and it's not /home there [11:41:47] oh? I thought it was going to be fenari's equivalent [11:41:56] what makes you think I want to clone fenari :P [11:41:56] ah, you mean with the bastion/deployment split [11:42:00] yes [11:42:04] right yes [11:42:57] so why is stat1 still using NFS [11:43:12] *sigh* [11:43:32] brb [11:52:06] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [11:54:03] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [11:54:03] PROBLEM - Puppet freshness on srv238 is CRITICAL: Puppet has not run in the last 10 hours [11:58:46] New patchset: Mark Bergsma; "Let's try using the standard Linux NFS mount performance options again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21036 [11:59:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21036 [11:59:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21036 [12:00:23] paravoid: do you know anything about the swift upgrade procedure? 
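Since part of the thread above is about hunting down the last /home NFS clients (and recognising port 2049 as NFS), here is a small client-side helper that just parses /proc/mounts. It is a sketch for illustration; the server-side client list mark mentions is the authoritative source.

    def nfs_mounts_from(server, mounts_file='/proc/mounts'):
        # /proc/mounts fields: device mountpoint fstype options dump pass
        hits = []
        with open(mounts_file) as f:
            for line in f:
                device, mountpoint, fstype = line.split()[:3]
                if fstype.startswith('nfs') and device.startswith(server + ':'):
                    hits.append((device, mountpoint))
        return hits

    # e.g. nfs_mounts_from('ms5') -- match on whatever name or address the
    # mount was actually declared with in fstab/puppet.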
[12:11:04] meh [12:11:08] tons of dependencies on /home on hume [12:11:21] including some jobs by aaron for swift I think [12:11:33] perhaps i'll move it to next week, i need to announce it better [12:12:39] apergos: I'm afraid I don't, no [12:12:59] well this should be exciting then [12:13:14] I don't either, and I was hoping to try to think things though ahead of time :-( [12:13:15] everything I know is what Ben told me in that conversation that you saw [12:16:23] I know that rewrite.py need to be changed [12:18:11] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-5/2/1 (FPL/Level3, CV71028) [10Gbps wave]BR [12:18:37] * apergos rereads through the conversation again slowly [12:24:21] hmmm? [12:26:08] so we have one wave down again [12:26:25] how nice [12:26:43] unless that's leslie/chris but I don't think so [12:36:19] so, mark/apergos, should I disable ms5 now? [12:36:45] I have mixed feelings, but if you both are in favor, I'll do it now [12:36:45] seems like a fine idea to me [12:41:40] onsite in /topic is stale i guess [12:53:16] s3 compatibility layer? really? sweet [12:53:33] ah but it's a separate project now. meh [12:55:41] apergos: you don't like the swift api? ;) [12:56:02] I prefer not to have a dozen apis for distributed storage [12:56:10] one decent one is good enough [12:56:58] idk... we're ditching ec2. where else do we use s3 already? [12:57:41] well it's more about who else uses it. archive has an s3-like interface, same with google storage [13:01:14] archive.org you mean? [13:01:24] i know nothing about google storage [13:01:47] is there a contact for free storage/compute at google? [13:02:17] I don't know that there is generally such a thing [13:03:54] i thought wikimedia had some amount of free storage? [13:04:07] * jeremyb was thinking wikilovesmonuments could use some free compute [13:04:28] (e.g. the dumps at google?) [13:05:41] well I don't know if that's something they would do or not [13:05:42] mark: rebooting fenari I see? [13:05:48] perhaps we should install updates first then? [13:06:16] I could get you in touch with my contact at google and you guys could chat anyways [13:06:35] * jeremyb just was thinking it was worth a shot [13:07:31] ok, remind me in a few days when swift and media is off our backs [13:09:01] well, there's not *that* much time before we need it. but i am already getting it setup elsewhere (with puppet) so it shouldn't be so painful to move to or add capacity with them [13:09:20] (september is the big month but want to be ready before then) [13:09:29] ah ok [13:10:03] * jeremyb puts it on his calendar, danke ;) [13:10:23] what would wikilovesmonuments put there anyways? [13:10:47] wikilovesmonuments.us ; the website itself mostly [13:10:52] mark: nevermind, I pushed updates to fenari already. [13:10:54] ok [13:11:27] apergos: europe got 7.5 mil hits for the month last year i think? obviously we expect to break that this year but not really sure how that translates to US [13:11:46] what happens to the logs written to /home/w/log in the meantime, I wonder [13:12:11] * apergos has no idea how hosting a website on google storage would work [13:12:47] apergos: well they have compute now too. so we could spin up nodes to serve the site [13:13:02] ok [13:13:52] apergos: anyway, will poke friday i guess [13:13:59] cool [13:19:04] need a short nap. 
anticipating that tonight will be another of those nights [13:42:53] New review: Tim Starling; "It's finished." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/20844 [14:17:20] hey opsies, i got a brain bounce q, its a simple one [14:17:28] trying to make a decision: to organize or to leave things be [14:17:43] paravoid, you round? (this is an easy q) [14:17:50] just want a quick brain bounce [14:18:27] hi andrew [14:18:34] hiya! [14:20:14] you working/can I bother you with a brain bounce? [14:20:23] I am, you can :) [14:20:25] shoot! [14:20:28] sosoos [14:20:31] udp2log directories [14:20:41] on the 3 log machines (locke, emery, oxygen) [14:20:46] are really inconsistent [14:21:01] particularly, right now I am concerned with emery [14:21:05] there are 2 udp2log instances running there [14:21:11] the main one, which logs in /var/log/squid [14:21:16] * apergos lurks. I had a q about these yesterday [14:21:18] (which I hate btw, because they are not all squid logs) [14:21:22] and [14:21:30] the AFT/clicktracking instance [14:21:31] which logs in [14:21:37] /var/log/aft [14:21:38] cmjohnson1: ping? [14:21:56] i'm only even messing with this right now because we want to rsync the aft logs to stat1 [14:22:00] I have an rsync module setup to allow this [14:22:11] but it is only currently set up for /var/log/squid/archive [14:22:25] i was planning on changing the rsync module to allow rsync from /var/log/squid [14:22:27] and then either: [14:22:38] 1. symlinking /var/log/aft from /var/log/squid/aft [14:22:40] or [14:22:56] 2. changing the udp2log aft instance configs to log into /var/log/aft [14:23:15] and, if I did 2, while I'm at it, I'd have to urge to clean things up in general [14:23:24] jeremyb: heya [14:23:24] and maybe make all udp2log instance log in /var/log/udp2log [14:23:43] AND [14:23:47] if I went that far [14:23:52] cmjohnson1: woot, bot's fixed. you showed up at the end of http://bots.wmflabs.org/~petrb/logs/%23wikimedia-operations/20120822.txt [14:23:57] I could go farther and make all the log directories on the 3 machines be consistent [14:24:15] which would allow me to abstract some defaults out of the puppet misc::udp2log::instance define [14:24:34] would make site.pp a bit simpler [14:24:45] so I guess I'm looking for advice on how far I should go [14:25:02] option 1. symlink is the path of least resistence [14:25:07] but is the ugliest [14:25:14] jeremyb: way to finish strong [14:32:11] phew, sorry paravoid, i'm having internet troubles [14:32:24] just a sec [14:32:51] np [14:34:16] so [14:34:23] I'm a big fan of cleanups :) [14:34:56] I think in (2) you mean change it to log to /var/log/squid/aft, right? [14:37:17] hey peeps - Management access in tampa is gonna be going down [14:37:46] LeslieCarr: hi [14:37:54] LeslieCarr: are you aware that we lost one wave earlier? 
[14:38:00] no [14:38:16] oh look at that [14:38:37] well, first i fix management, then i call and scream [14:38:39] again [14:38:41] yet again [14:39:00] * Damianz finds the ear plugs and happy cake for later [14:39:01] heh [14:40:11] paravoid, yeah [14:40:13] in 2 yeah [14:40:20] unless I change everything to /var/log/udp2log/aft [14:40:20] but yeah [14:40:28] it is too bad that squid != squid [14:42:31] squid == calamari [14:43:02] i had calamari in my salad for lunch yesterday [14:45:03] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [14:45:03] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [14:45:03] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [14:45:30] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [14:45:30] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [14:45:30] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [14:45:39] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host ps1-c2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host ps1-a3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [14:45:48] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [14:45:48] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [14:45:48] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [14:45:48] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [14:45:48] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [14:45:48] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [14:45:57] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [14:46:33] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [14:46:33] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [14:46:42] PROBLEM - Host ps1-a4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.4) [14:49:10] that is all the management network [14:49:13] you can ignore [14:49:45] * jeremyb wonders if that paged [14:51:03] it didn't [15:08:36] I hunted around for upgrade instructions but didn't find them [15:08:54] http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_upgrade_notes_2012-08 but I didn't search the User namespace [15:09:14] I think I told faidon about that but perhaps didn't mail about it. [15:09:19] there's also https://gerrit.wikimedia.org/r/#/c/18264/ [15:09:39] ok, going to read all this now [15:09:54] I've been looking a little at the swift install docs but obviously that's pretty different than an upgrade [15:10:28] I'd much rather we focus on getting originals onto swift, if we have to choose. [15:10:40] ok well [15:10:53] we cannot move forward with anything on that if ms5 is still going to be written to [15:11:04] space has to be cleared first [15:11:16] I have a list of 200k images for purging I can hand you (er, directories) [15:11:35] this is the standard "not in use, added since last run" [15:11:38] I'm confused though, why you now think it is ms5 causing things to be slow when before all signs (except MW profiling) said it wasn't. 
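For ottomata's option 1 above (exposing the AFT/clicktracking logs under the rsync-able /var/log/squid tree), the path of least resistance is a symlink; a rough sketch follows. The link direction and both paths are assumptions, and in production this would be a puppet file resource with ensure => link rather than a script.

    import os

    REAL_DIR = '/var/log/aft'          # where the AFT udp2log instance writes today
    LINK = '/var/log/squid/aft'        # assumed link location inside the rsync-able tree

    if not os.path.islink(LINK):
        if os.path.exists(LINK):
            raise RuntimeError('%s exists and is not a symlink' % LINK)
        os.symlink(REAL_DIR, LINK)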
[15:11:47] I know that ms5 had problems [15:11:55] and it has to be taken care of, regardless of other issues [15:12:16] I'm not ruling out other issues by sany means [15:12:16] New patchset: Ottomata; "Pointing udp2log machine rsync modules at ./squid/ rather than ./squid/archive" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21043 [15:12:18] *any [15:12:39] maplebed: I'd rather focus on the 1.5 upgrade rather than originals tbh [15:12:43] (morning btw) [15:12:50] 200k dirs won't get us much but I'll work up another list tomorrow to get us a reasnable amount of breathing room [15:12:59] New patchset: Ottomata; "Moving AFT logs from /var/log/aft into /var/log/squid/aft." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21044 [15:13:03] we can delete anything we want on ms5. [15:13:10] we can? [15:13:11] and btw, I've prepared squid changes to remove ms5 from the squid configs completely [15:13:25] haven't deployed it yet [15:13:28] that's the whole thing aaron deployed last thursday - reads come from swift, not ms5. [15:13:41] New patchset: Ottomata; "misc/statistics.pp - setting up cronjob to rsync clicktracking logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21045 [15:14:02] you don't stil lrun your swift cleaner any more? [15:14:03] the reason we couldn't delete everything earlier was that mediawiki used ms5 as a source for choosing which thumbnails to purge. that list now comes from swift. [15:14:13] no, the swiftcleaner's stopped. [15:14:18] ahhh [15:14:19] maplebed: squids still have ms5, so when swift started failing yesterday they started sending traffic to ms5 [15:14:22] well [15:14:24] New patchset: Ottomata; "Changing custom filters to use udp-filter instead. There should be no change in content." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20919 [15:14:36] when the 200 maxconn limit was reached presumably [15:14:47] since swift was waiting for imagescalers and hence taking a while to respond [15:14:53] once paravoid's change makes it in then we might be good to toss a bunch of stuff [15:15:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21043 [15:15:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21044 [15:15:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21045 [15:15:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20919 [15:15:56] apergos: the change should be no-op under normal circumstances [15:16:00] uh huh [15:16:09] it's the abnormal ones we get to worry about [15:16:33] so we can get ms5 out of the reads but not out of the writes? [15:16:39] yes. [15:16:57] s/can get/have gotten/ (minus the squid thing) [15:17:12] irritating [15:17:43] yeah, the squid thing is really really minor [15:17:51] it only manifests when swift is unable to serve [15:18:04] which shouldn't happen [15:18:22] jorn: you've been flapping for hours, could you fix your client/network? 
[15:19:30] paravoid: oups, sorry at a conference (campus-party.eu) and bad reception, forgot my irc client, sorry [15:19:36] thanks for telling me [15:19:44] thanks for the quick reply :) [15:20:09] jorn: the channel's logged btw, if you want to catch backlog or anything [15:20:31] maplebed, apergos: so what do you think, shall we context switch to originals? [15:20:34] er, to 1.5 [15:20:37] instead of originals? [15:20:42] no, I don't think we should. [15:20:50] we have focus and should keep it. [15:20:59] well [15:21:10] do we know what caused the image scalers to overload [15:21:12] ? [15:21:24] I don't [15:21:26] there's one thing that was going to the scalers yesterday that hasn't ever before - requests for originals. [15:21:41] that bug is now fixed. it seems worth while to me to try again. [15:21:48] do we have any proof that this would be a problem? [15:22:07] I don't like guessing and trying things at random until it works. [15:22:10] no, but it's the only thing that is actually different from what we've been doing since forever. [15:22:33] see, the sqiuds would only request from ms5 if they weren't getting from swift. that says there was a problem before ms5's nfs issues (assumign we aren't overlooking something) [15:22:42] sorry for the typos [15:22:53] apergos: I disagree. [15:23:00] please explain [15:23:09] once ms5 starts having issues, the scalers back up, then swift backs up, then the squids overflow to ms5. [15:23:30] I still don't understand why the ms5 would have issues *because we switched originals to swift* [15:24:03] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [15:24:03] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [15:24:10] management access back up [15:24:12] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms [15:24:12] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [15:24:12] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [15:24:12] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [15:24:12] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.51 ms [15:24:12] no, I don't either. but I also don't think we're going to figure it out. [15:24:21] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.98 ms [15:24:21] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [15:24:21] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [15:24:21] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 4.11 ms [15:24:21] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.87 ms [15:24:21] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [15:24:30] maplebed: let us figure it out first then, and then proceed [15:24:30] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [15:24:30] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [15:24:30] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.37 ms [15:24:30] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.67 ms [15:24:30] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.98 ms [15:24:34] it's also the case that nfs is going away soon. spending ages trying to figure out why it's broken then as soon as we fix it taking it out of rotation is a waste of time. 
[15:24:50] I don't know why that would happen either, but I know there are a lot of little ratholes in the code that make things like this hard to keep track of [15:24:51] what NFS has to do with anything [15:25:02] we don't *know* if it's related or not [15:25:11] how can you say that NFS is the problem? [15:25:15] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [15:25:37] paravoid: the mediawiki profiling. that's how. [15:25:49] from ms5's perspective nothing changed yesterday [15:26:00] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [15:26:32] excetp that random broken requests were coming in to the image scalers, which heavily use ms5. [15:26:47] that's not entirely true, it got a small amount of http traffic, as we were just saying [15:26:51] it's either a coincidence or something else we haven't thought of and we're likely to repeat if we retrace our steps. [15:27:29] well, I don't really see us getting more insight by waiting around and doing something else either... [15:27:55] here's teh thing. [15:27:58] if we had infinite time, I'd suggest trying to debug this further over the course of this and the next week [15:27:59] yes, we don't know why it's breaking. [15:28:07] but we've made a change that is entirely relevant, [15:28:09] but we don't, and we need you more on the 1.5 upgrade than this imho [15:28:21] and we can get more good info by trying again and seeing if that makes a difference. [15:28:47] I'm also not a big fan of getting info by trying on production. [15:28:49] the 1.5 upgrade is a piece of cake in comparison. [15:29:03] well, we don't have any other environment in which we can test this, so... [15:29:07] I'd rather prefer to test the "random broken requests" in labs [15:29:23] if the alternatives are test in production or just sit around and wait for us to feel like something else is different, I'll go with teh former. [15:29:29] or even in production but manually, with a limited amount of requests, rather than production traffic [15:30:09] labs doesn't have an environment that has the pieces we need for a meaningful test. [15:30:24] of course if it's strictly a performance issue, that won't net us much. but it won't hurt either [15:30:36] the 1.5 upgrade might be a piece of cake for you but it's still unknown territory for both me and Ariel [15:30:48] that's why I'd like us to do it before you leave [15:30:53] when is your last day again? [15:31:04] a week and a couple of days? [15:31:08] I'm still struggling to not test a change that is really the only thing that's different from how we're doing everything now. [15:31:10] paravoid, if you've got more time for me, could you help me review my java.pp change? [15:31:13] apergos: next tuesday. [15:31:16] ugh [15:31:23] that's why I want to test this now. [15:31:34] maplebed: a *possibly service affecting* change [15:31:50] a) I don't like breaking the site every day, [15:32:00] b) we don't know why the load dropped yesterday [15:32:03] yes, we have the possibility of making it so that thumbnail requests are delayed. [15:32:26] and we don't know at what point it becomes unrecoverable either [15:32:33] New patchset: Pyoungmeister; "moving srv193 to new apache role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21049 [15:32:40] I agree we don't understand the problem. [15:32:53] !log rebooting srv193, upgrade to precise [15:33:03] Logged the message, notpeter [15:33:09] so why can't we do this: 1) test the change manually. 
2) delete some crap off of ms5 (need to delete serious piles of crap from each of the 256 subdirs commons/x/xx/) 3) see where we are at that point. [15:33:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21049 [15:33:31] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21049 [15:33:33] apergos: we tested it manually yesterday too and it appeared it worked. [15:33:40] testing it manually again won't give us new info. [15:33:41] we don't understand the problem = risking by making changes [15:33:52] the change you made afterwards, I mean [15:34:38] see what the logs look like with it, what errors the scalers still have [15:34:42] either we should context-switch to upgrade (which I prefer) or we should try to think of the problem, dig in the code with Aaron etc. [15:34:47] not try random things in production. [15:35:15] maybe we will get some leads. [15:35:29] I can't agree with reproducing yesterday's conditions which resulted in a semi-outage, without understanding why that happened [15:35:41] is that to me or to ben? [15:35:47] generally [15:35:52] everyone [15:36:51] ok well I"m suggesting we do: I'd rather prefer to test the "random broken requests" in labs or even in production but manually, with a limited amount of requests, rather than production traffic [15:36:55] New patchset: Pyoungmeister; "upgrading srv193 to new role class/precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21050 [15:37:04] after making sure ms5 will not get any reads that is [15:37:29] (sorry, I quoted you paravoid and your name didn't make it in the copy paste of that line) [15:37:36] I don't mind [15:37:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21050 [15:38:00] ben said he tried it yesterday and didn't see any problem with that [15:38:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21050 [15:38:18] I'd say let's ask Aaron to trace that codepath and see if he finds any possible issues [15:39:16] which code path? [15:39:54] broken requests. [15:41:00] PROBLEM - Host srv193 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:23] I'm pretty sure we're not going to find a smoking gun... even if they do interact with ms5, what does that tell us? just that we need to test again to recreate the problem. [15:42:38] but sure. we can ask. [15:42:45] he did spend a long time looking through the code yesterday, [15:42:57] but more to verify that the profiling in fact does only surround nfs [15:43:03] and not something else that could have stuck things up. [15:43:42] recreating the problem is unacceptable imho. [15:46:30] could you give an example of such a broken request? [15:46:35] paravoid: do you understand which section of our users were affected by yesterday's event? (no judgement, just checking) [15:46:42] RECOVERY - Host srv193 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [15:46:54] I have one example, but I'm not sure of how many different types of broken requetss there are. [15:49:25] what times were we live? 19:00, right? [15:50:18] PROBLEM - Memcached on srv193 is CRITICAL: Connection refused [15:51:04] ah. 18:58 to 19:26 [15:51:18] does it matter? [15:51:23] the section of our users [15:51:53] well, yes. we should treat service degredations differently when they affect 90% of our users vs. 2%. [15:52:22] (vs. 
0.001%) [15:52:51] it's not a matter of amount of users but of the amount of risk involved [15:53:12] not only that is [15:53:24] if there's a 95% chance that we'll break the site for 2%, we shouldn't do it. [15:53:49] even if that break only actually affects 30 people and gets us additional information? [15:53:59] is it 30? [15:54:01] that seems rather extreme, no? [15:54:30] no, I'm just trying to establish that there is a continuum, and that understanding how many users it affects is part of the equation. [15:54:35] and what makes you think it will manifest just as we deploy and not two hours later? or two days later? [15:54:51] what would you call as a success of the test? [15:55:16] I think apergos' thershold yesterday was pretty good. 30 minutes. [15:55:21] (the problem appeared long after we deployed the squid change) [15:55:30] like 20' after [15:55:55] (and actually, that's an argument in favor of it not actually being related to our change.) [15:55:57] :P [15:56:07] are you suggesting it was a coincidence? [15:56:09] oh I disagree [15:56:16] that's a hell of a coincidence. [15:56:22] how lilely is it that two days in aa row when we deploy we get hit with this and at no other time? [15:56:30] that defies common sense [15:56:38] I agree. [15:57:25] no, I'm not suggesting that it is. just that increasing the time between change and failure to the order of days decreases the culpability of the change and makes it more likely that it's something else. That's why 30 minutes seems reasonable - that's an appropriate amount of time for a change to have an effect (when we're dealing with piling up queries) [15:57:52] the only thing that has changed since yesterday is your 4 line change that affects only very specific requests [15:57:59] let's try those without a full deploy [15:58:15] can we agree on that much? [15:58:22] so for example broken requests, I'm getting them from the swift log, on ms-fe1: grep "^Aug 21 19:10" syslog.1 | grep -v thumb | grep " 404 " [15:58:42] the log lines then need to be transformed from the swift format back to an upload style url [15:59:24] and btw, when are we going to do the 1.5 upgrade? [15:59:39] how long do we expect that to take anyways? [15:59:46] and I didn't get to read the docs yet! [15:59:55] apergos: how long did we expect origs to take that long either [15:59:57] well, I was hoping to do it this week, but I'd really like to see us keep focus. so it's not currently scheduled. [16:00:11] paravoid: before you get all twitchy, I was going to add in buffer time [16:00:17] cause that's how I roll [16:01:07] I didn't have any expectation that we would get originals working the first day though, since you mention it. 
however I am considered to be the most conservative person on the team with respect to deployments [16:01:13] morning robla [16:01:23] howdy [16:01:33] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [16:01:34] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [16:01:34] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:01:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:01:35] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [16:01:36] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [16:01:36] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [16:01:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:01:45] maplebed: I think we got a lot of input from you these past few days on this matter [16:02:10] while we haven't gotten the same for the 1.5 upgrade and I'd like us to do this together. [16:02:36] I think it's much more risky to attempt a 1.5 upgrade without you around [16:02:38] our window (for whatever we do) is starting now? and we have two hours? [16:02:46] we only have 1 hour today. [16:02:50] oh gee [16:02:52] ok well [16:02:58] we don't have to use it. [16:02:59] I think we won't do either of those things today. [16:03:01] it's just here if we want it. [16:03:17] that would be my vote. nothing will get done in an hour [16:03:21] please try to get the Squid switch to happen [16:03:38] that's really important work that needs to happen [16:03:40] what, to use originals? [16:03:43] yes [16:03:57] robla: we've been discussing this for a while now [16:04:03] I don't want to be the bearer of bad news but [16:04:22] even with a two hour window we are as likely as not to bring down the site again [16:04:27] ok....I have meetings all morning. should I cancel to convince you guys to do this? [16:04:42] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [16:04:42] robla: we still don't know the circumstances of yesterday's incident and I don't like us shooting in the dark in production. [16:04:46] it's not a matter of convincing. it's a matter of not having the info we need to fix the issue [16:05:08] I'd rather focus on either a post-mortem for yesterday or a 1.5 upgrade that also needs to happen ASAP [16:05:28] and that only Ben knows the system in and out to be able to do it without much more preparation [16:05:42] the 1.5 upgrade isn't as important [16:05:43] (while I feel I'm equally prepared as Ben for the originals switch at this point) [16:06:10] maplebed: isn't 1.5 a blocker for eqiad? [16:06:25] yes. [16:06:33] is 1.5 the last blocker for eqiad, or one of many? [16:06:40] !log ms-be6 powering down for memory testing [16:06:49] Logged the message, Master [16:06:53] one of many. 
but Ben's here for a limited amount of time. [16:07:01] either way: we have one week (5 days) of Ben left. we need to get as much brain dump out of him as possible. [16:07:02] yep [16:07:10] agreed [16:07:11] (for refenece; ms-be6 is already out of rotation; powering it down won't have any effect on the cluster) [16:07:20] yeah...basically, this switch *must* happen before Wiki Loves Monuments [16:07:25] we got enough of a braindump on the origs switch, we need more data on the 1.5 switch [16:07:28] what's their date? [16:07:44] robla: how come? I was never told of this [16:07:48] we need *any* data on the 1.5 switch. (yes I can and will read the docs, but...) [16:07:50] paravoid: that's my fault [16:07:50] sometime early September [16:07:57] I knew of the correlation but forgot. [16:08:17] the reason is that WLM will upload a huge chunk of stuff and ms7 is getting full. [16:08:22] if you can get us that hard deadline information that would be great [16:08:26] yes it is. [16:08:31] um....could you look it up:? [16:08:39] back to my meeting now [16:08:59] the zero partner testing shouldn't have any interaction with the swift stuff, can you guys do overlapping ? [16:09:10] ok, I suggest pushing the origs switch for next Monday and focus today and tomorrow on 1.5 [16:09:19] LeslieCarr: it's looking likely we're going to skip our window today. so no problem. [16:09:28] ah [16:09:37] thanks for checking. [16:09:39] PROBLEM - NTP on srv193 is CRITICAL: NTP CRITICAL: Offset unknown [16:09:55] and Friday and whatever remains of tomorrow on the post-mortem [16:11:09] * apergos goes looking for jeremyb [16:11:16] yah [16:11:18] this timeline on meta is pretty vague [16:11:30] yes, would you know when the rush to upload for wiki loves monuments is supposed to start? [16:11:41] the exact date? [16:11:42] hah [16:11:48] uhh, #wikilovesmonuments ? [16:11:53] * jeremyb doesn't really have a clue [16:11:55] there's a channel? [16:11:58] yes [16:12:08] * apergos goes to ask [16:12:16] photos are only valid for the contest if uploading during september [16:12:25] but can be taken any time [16:12:28] http://www.wikilovesmonuments.org/contest/ says "upload during september 2012" [16:12:42] "september" is a bit defined differently per country sometimes [16:13:02] so in terms of timing, I'd expect a spike at the beginning of the month, a fall off, then most of them during the latter half peaking at the end. [16:13:10] but that's just a swag. [16:13:16] sorry, brb [16:13:20] New patchset: Andrew Bogott; "Add a configurable timeout to the git-clone exec." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20913 [16:13:45] off to call FPL [16:14:03] maplebed: you've kinda planned for us to do the 1.5 alone and I don't feel comfortable with that [16:14:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20913 [16:19:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20913 [16:22:10] ok, back. [16:22:23] ok [16:22:26] morning AaronSchulz [16:22:43] since we do have a window [16:22:43] RECOVERY - NTP on srv193 is OK: NTP OK: Offset -0.03837430477 secs [16:22:44] hello [16:22:52] and we're sitting doing nothing [16:23:08] I'm going to deploy the minor squid change that removes ms5 [16:23:15] sounds good [16:23:17] +1 [16:23:28] do we have anything else pending? the reqrite.py changes already went, right? last ngiht? [16:23:35] apergos: yeah, they did. 
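Earlier, maplebed notes that the 404'd "broken request" lines pulled from the Swift proxy log still need to be transformed back into upload-style URLs before they can be inspected or replayed. A rough helper for that step, assuming the /v1/AUTH_<account>/<container>/<object> path layout and the <project>-<wiki>-local-public.<shard> container naming of that era (both are assumptions about the setup, so treat this as a sketch):

    def swift_path_to_upload_url(path):
        # e.g. '/v1/AUTH_xxxx/wikipedia-commons-local-public.2f/2/2f/Pelican3.jpg'
        parts = path.split('/', 4)             # ['', 'v1', 'AUTH_xxxx', container, object]
        container, obj = parts[3], parts[4]
        name = container.rsplit('.', 1)[0]     # drop the '.2f' shard suffix
        project, wiki = name.split('-')[:2]    # 'wikipedia', 'commons'
        return 'https://upload.wikimedia.org/%s/%s/%s' % (project, wiki, obj)

    # swift_path_to_upload_url('/v1/AUTH_xxxx/wikipedia-commons-local-public.2f/2/2f/Pelican3.jpg')
    #   -> 'https://upload.wikimedia.org/wikipedia/commons/2/2f/Pelican3.jpg'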
[16:23:43] ok [16:23:48] AaronSchulz: do you have anything you want to push during our window? the auth caching stuff maybe? [16:23:53] (if it's ready) [16:24:13] that was already done yesterday, though it only applies to wmf10 [16:24:16] otherwise as soon as paravoid is done we can hand back the leftover time to the WPZero folks. [16:24:22] that'd be nice, if anything to facilitate debugging [16:24:24] it will become effective as more wikis switch over [16:24:34] is wmf10 already out on some wikis? [16:25:00] a few, commons will have it after 11 today [16:25:06] nice! [16:25:22] great, thanks so much [16:25:56] we also had a brief chat with mark earlier [16:26:37] PROBLEM - Apache HTTP on srv193 is CRITICAL: Connection refused [16:26:44] sec., let me deploy. [16:27:04] multitasking is bad in this case :) [16:27:10] +1 [16:27:52] hey someone from dev team, maybe platform... maybe AaronSchulz... could you take a quick peek at srv193 and tell me if you can use test.w.o [16:28:07] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.067 second response time [16:28:19] I can load test.w.o [16:28:23] so that's good [16:28:29] but another pair of eyes would be nice [16:28:31] <^demon> Same. [16:28:46] ^demon: woo! [16:32:28] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:44] !log deploy new squid config that removes ms5 from a swift fallback for thumbs [16:32:54] bugzilla seems down [16:32:54] Logged the message, Master [16:32:55] !! [16:32:59] !log adding dns for pay-lvs100[12] [16:33:08] notpeter: seems ok [16:33:08] Logged the message, Master [16:33:29] AaronSchulz: cool! thanks. [16:33:37] sent an email as well [16:33:46] garg. ns1faceplantsadcries [16:33:49] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 3.686 seconds [16:33:50] please kick bugzilla [16:34:02] maplebed, apergos: that's full deployment, please confirm [16:34:25] paravoid: you mean you've just done full deploy and want us to test or you want confirmation you should deploy all? [16:34:45] I did full deployment after testing with sq51 first [16:34:50] aude: bz == kaulen [16:34:51] wfm [16:34:52] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:56] ok, will test. [16:35:06] jeremyb: thanks [16:35:09] !log restarted ns1 b/c it fell down after authdns-update [16:35:10] I think it's minor and very very safe [16:35:14] * aude can't fix any more bugs [16:35:20] Logged the message, Master [16:35:21] so I don't think it needs many eyes [16:35:26] but since we're in this together :) [16:35:37] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.009 seconds response time. www.wikipedia.org returns 208.80.154.225 [16:36:01] hmmm? what happened? [16:36:13] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:36:17] seems to have made it around ok [16:36:29] paravoid: it used to be that the cache_peer for thumbs was above the default in the squid config. now it's below. do you know if that order matters? [16:36:38] yay! [16:36:39] given the acls, I'd say no, it doesn't. [16:36:50] maplebed: order matters, but they have mutually exclusive ACLs [16:36:57] so in this case, it doesn't [16:37:33] what happens now if it hits the 200 max-conn limit or the timeout limit? squid returns a 500? [16:37:38] yes [16:37:42] btw, re bugzilla/kaulen flapping above: there's a bug triage going now for i18n. they can't get much done with it down.
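For readers following the cache_peer ordering question: when each peer is gated by its own cache_peer_access rules, routing is decided by the ACLs, not by the order of the cache_peer lines. A rough illustration of the shape being discussed, with peer names, hostnames and the URL regex invented for this sketch (the real config is generated from the production squid templates, and the actual max-conn value is the 200 mentioned above):

    # Illustrative only -- not the production config.
    acl thumbs urlpath_regex ^/[^/]+/[^/]+/thumb/

    # Thumbnails go to the swift frontends; cap concurrent connections.
    cache_peer swift-fe.example.wmnet parent 80 0 no-query originserver max-conn=200 name=swift_thumbs
    cache_peer_access swift_thumbs allow thumbs
    cache_peer_access swift_thumbs deny all

    # Everything else (originals etc.) goes to the default image backend.
    cache_peer ms7.example.wmnet parent 80 0 no-query originserver name=ms7_default
    cache_peer_access ms7_default deny thumbs
    cache_peer_access ms7_default allow all

With mutually exclusive rules like these, a thumb request that hits the max-conn cap or times out has no other peer to fall back to, which is why the question above about squid handing back a 500 matters.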
(but sounds like it's back up for them for now) [16:37:50] <^demon> !log bugzilla went down, but came back up. magic? [16:38:00] Logged the message, Master [16:38:04] 8ball? [16:40:54] paravoid: would you join #swiftstack? [16:40:57] sure. [16:48:58] paravoid: my squid tests passed too. [16:49:04] I think we're good on that change. [16:49:16] I'd like to stat removing my list of dirs from ms5 [16:49:22] 200k of em [16:49:23] +1 apergos [16:49:26] *start [16:49:34] any objections? [16:49:44] nope [16:50:47] !log removing thumbs on ms5 unused on any project, for june-july uploads. running in screen as root [16:50:56] Logged the message, Master [16:51:40] maplebed: so, plan? [16:51:59] that won't buy us much, tomorrow I'll figure out some large balanced subset of the files in the commons shards I can toss [16:52:00] yes. stick to this (noisy) channel or go find a quieter place to talk? [16:52:11] this is public which is why I prefer it [16:52:23] (ie everyone can follow along from the community) [16:52:50] agreed [16:52:56] I agree, but the cost is that we'll have to talk over other conversations and ignore the bots. [16:53:17] I don't mind [16:53:19] I know. it could be worse, it could be wikimedia-tech (which I *still* prefer :-P) [16:54:28] for a few minutes I need to watch ms5 and a random scaler [16:55:06] okay, while apergos does that, maplebed have you seen RT 3446? [16:55:13] sounds swift-related [16:55:19] looking. [16:55:20] even those deletes (in serial) were enough to cause nfs timeouts to appear [16:55:21] dang [16:55:57] this is going to be painful [16:56:10] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:19] nice [16:56:31] I suspended the job immediately [16:57:16] paravoid: I hadn't seen that ticket, but I have seen all those bugzilla bugs. it's a mediawiki thing and a permissions bug on ms7. aaron's on it. [16:57:18] and there is a sleep after every 100 deletes too [16:57:23] * apergos stabs [16:57:35] maplebed: ah, great, thanks. [16:58:00] maplebed: here's your data point! [16:58:02] :-) [16:58:04] \o/ [16:58:15] umm... errr... /o\ [16:58:19] haha [16:58:39] I've never used that one before. I like it. [16:58:41] so we can't even clean up over there. and we can't turn off writes. [16:58:43] wf [16:58:46] er *wtf [16:58:58] AaronSchulz: so, about that hack to stop writing to ms5... [16:59:04] why would that be a hack? [16:59:10] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61575 bytes in 0.190 seconds [16:59:11] isn't that the target? [16:59:22] goal even [16:59:26] paravoid: the hackish part is stopping writes to ms5 while still writing to ms7. [16:59:27] yes but it's supposed to be at the same time as ms7 [16:59:33] right [16:59:41] the multiwrite backend doesn't have thumb/original differentiation at the moment. [16:59:59] !log oh, for people following along, shot the delete job. apparently even deletes is enough to push ms5 over the edge, yay [17:00:00] okay, stupid question: why do we need either? [17:00:13] Logged the message, Master [17:00:21] what's up? [17:00:23] we need writes to ms7 for a while [17:00:28] (well, it does for write location (eg /mnt/thumb vs. /mnt/upload) but not for yes write and no don't write.)
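The throttled cleanup described above, batches of deletes with a pause in between, which still overloaded ms5's NFS, follows a pattern along these lines. This is a sketch, not the actual job that was running; the list file, batch size and sleep interval are placeholders (only the "100 deletes, then sleep" shape comes from the log):

    #!/usr/bin/env python
    # Illustrative sketch of a throttled delete job: remove a list of unused
    # thumbnail directories in small batches, pausing between batches so the
    # NFS server gets some breathing room. (Even this proved too much for ms5.)
    import shutil
    import sys
    import time

    BATCH_SIZE = 100       # deletes per batch, as mentioned above
    PAUSE_SECONDS = 5      # placeholder; the real pause length isn't in the log

    def delete_in_batches(list_file):
        with open(list_file) as f:
            dirs = [line.strip() for line in f if line.strip()]
        for i, path in enumerate(dirs, start=1):
            shutil.rmtree(path, ignore_errors=True)
            if i % BATCH_SIZE == 0:
                time.sleep(PAUSE_SECONDS)

    if __name__ == '__main__':
        delete_in_batches(sys.argv[1])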
it's not "deploy, looks ok, now let's burn all bridges right away" [17:00:52] we could just pull NFS out of multiwrite and I can run the sync script to keep ms7 up to date [17:00:59] mark - apergos was trying to delete content from ms5 to make it less likely to tip over when we try originals again and the simple act of deleting content was enough to tip it over and make the scalers fall over. [17:01:26] AaronSchulz: that's an intriguing idea. [17:01:28] 100 deletes, sleep, 100 deletes, sleep [17:01:34] AaronSchulz: that would mean doing NFS asynchronously, which I like a lot. [17:01:35] but if client reads come from swift, you might get 404s briefly after creating a file... [17:01:51] ok [17:01:52] I mean NFS [17:01:54] AaronSchulz: I don't understand that part [17:01:58] got a pointer to your sync stuff, AaronSchulz? [17:01:58] oh. [17:02:13] maplebed: "eventual consistency" ;) [17:02:43] so the confusion would come from [17:02:44] * upload a file [17:02:44] * look for the full rez version of the uploaded file (which comes from ms7) [17:02:44] * it's missing because it hasn't been synced yet [17:03:21] yeah, that would be a problem [17:03:54] (though I would expect people only look at the thumbnail after uploading, but it still kinda sucks.) [17:04:09] and that confusion point only happens until we point squids to swift. [17:04:44] sync script pointer? pretty please? [17:04:51] AaronSchulz: what would happen if we mount /dev/null on /mnt/thumbs [17:04:56] (or something to that effect) [17:05:05] er what? [17:05:18] or we could change multiwrite to have a list of containers that only write to the main backend [17:05:33] prefer that [17:05:41] how much work would that be? [17:05:46] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::4 [17:05:55] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 [17:06:12] hm?! [17:06:23] no idea [17:06:26] apergos: I can't imagine that hard [17:06:31] RECOVERY - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [17:06:41] * apergos waits for the recovery page [17:06:42] probably a hiccup because of the wave being up again [17:07:03] AaronSchulz: that sounds like a good idea. [17:07:11] even turning off writes for just commons and enwiki will probably be plenty. [17:07:38] though we could use the top 20 list (or however many there are) that we used for sharding [17:07:56] I have a top n list as well, probably identical to yours [17:08:37] ACKNOWLEDGEMENT - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT-3425 - memory errors [17:08:40] so [17:08:43] is any deployment going on? [17:08:56] no. we're out of the window anyways [17:09:01] mark: WP Zero has a window now. [17:09:20] preilly: are you deploying anything at the moment? [17:09:41] so, maybe mark could solve our tie between continuing to debug this or switch to the 1.5 upgrade. [17:09:49] :-) [17:09:56] nah, you've beaten me into agreement. [17:10:00] :-D [17:10:03] yay! [17:10:09] ah! [17:10:19] didn't mean to beat you into anything :) [17:10:26] apergos' suggestions seem like a plan - clean off ms5 then try again. or get aaron's change in then try again. [17:10:29] cause honestly as I look at all this stuff for the upgrade I'm like, er, how does that bit work? at several places [17:10:44] get aaron's change in. I want ms5 completely decoupled.
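The "list of containers that only write to the main backend" idea above amounts to a dispatch rule in the multi-write layer: normal writes fan out to every backend, while writes for the listed containers (or, as Aaron later prefers, "quick" operations such as thumbnails) go only to the primary. A toy model of that logic follows; it is not MediaWiki's actual FileBackendMultiWrite code, and the class, method and container names are invented for illustration:

    # Toy illustration of the proposed write-splitting rule; backend objects,
    # method names and container names are all made up for this sketch.
    class MultiWriteSketch:
        def __init__(self, primary, replicas, primary_only_containers=()):
            self.primary = primary                        # e.g. the swift backend
            self.replicas = list(replicas)                # e.g. the NFS backend(s)
            self.primary_only = set(primary_only_containers)

        def write(self, container, path, data):
            # Writes for the listed containers hit only the primary, so a slow
            # replica like an overloaded NFS server cannot stall them; everything
            # else is still written to every backend.
            self.primary.put(container, path, data)
            if container not in self.primary_only:
                for backend in self.replicas:
                    backend.put(container, path, data)

    # Usage sketch (names hypothetical):
    # store = MultiWriteSketch(swift, [nfs], primary_only_containers={'commons-thumbs'})
    # store.write('commons-thumbs', 'a/ab/Example.jpg', thumb_bytes)

Splitting by thumbs vs. originals rather than by wiki keeps originals flowing to ms7 while taking the thumbnail write load off ms5, which is exactly the decoupling being asked for above.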
(aaron's change will make more of a difference than cleaning off ms5, so I prefer that route) [17:10:55] maplebed: I'll make you a deal you can use my slot if someone flushes the mobile cache for me [17:11:00] that [17:11:08] and I want the other design decisions looked at as well [17:11:26] the things we raised in the emails [17:11:50] /usr/share/pyshared/eventlet/wsgi.py: for hook, args, kwargs in self.environ['eventlet.posthooks']: [17:11:50] /usr/share/pyshared/eventlet/wsgi.py: env['eventlet.posthooks'] = [] [17:11:56] mark: when you have a minute can you help me make sense of a pdns question? [17:12:01] I don't think we need a slot right now [17:12:04] whoops. pastebomb [17:12:12] but I can nevertheless flush the cache for you preilly. [17:12:35] Jeff_Green: yes [17:12:50] paravoid: thanks [17:12:51] preilly: do you want to do that now? [17:12:58] paravoid: yes right now please [17:13:03] okay [17:13:46] ok, so current status - we're gonna talk through the 1.5 stuff and AaronSchulz is going to work on putting in a config to only write selected containers to the main filestore. once he's done we'll find a window to try again. (we have one tomorrow 9am but can schedule one next week too). sound right? apergos paravoid [17:13:49] preilly: done [17:14:01] paravoid: thanks! [17:14:10] um [17:14:11] god damnit [17:14:15] someone just bumped my car [17:14:17] bbl [17:14:21] oops [17:14:39] we will look at the head/get request cycle mark mentioned in his email (which probably means aaron will look at it) and see if that can be optimized away [17:15:15] maplebed: btw, the tests pass on my lenovo laptop, except for 1 bogus item in a listing [17:15:31] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:43] maplebed: agreed. when should we schedule the 1.5 upgrade? or are we going to decide this later? [17:15:44] paravoid: could you do me a favor and check that supervisor and puppet are stopped on silver and zhen [17:15:44] (against copper) [17:16:01] paravoid: let's make that call in a few hours after talking through what it entails. [17:16:05] fair enough [17:16:31] AaronSchulz: which is your lenovo laptop? the one that had problems or the one that didn't? [17:16:32] :P [17:16:34] preilly: they are on silver but not on zhen. stopping. [17:16:41] paravoid: and that no supervisord jobs are running [17:16:46] maplebed: the new one [17:16:52] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:17:01] ok so for now it's go over the 1.5 upgrade procedure, figure out what's needed? [17:17:15] preilly: what do you mean by that? I'm unfamiliar with that setup [17:17:38] apergos: can you set us up an etherpad? [17:17:39] paravoid: oh just that supervisor isn't running anything [17:17:47] apergos: I see that maplebed is talking with the swiftstack folks, so do that in a minute? [17:17:59] almost done. [17:18:08] preilly: it runs some twisted stuff, do I kill them? [17:18:16] sure, "for now" = as soon as we are all freed up again [17:18:44] paravoid: kill anything that's running yes [17:18:55] paravoid: and make sure that puppet is stopped too [17:18:58] paravoid: thanks [17:19:05] !log stopped supervisord & puppet on silver/zhen per preilly's request [17:19:14] Logged the message, Master [17:21:39] alright. on to upgrade discussion. [17:22:11] * maplebed looks for the gerrit change [17:22:17] it's in the pad [17:22:49] oh, so it is.
I'm going to be much more slack about tonight's discussion (wandering off to arrange food and hydration at regular intervals) [17:23:11] so if you don't get an instant response to something, that will be why [17:23:15] ok. [17:23:31] so the upgrade consists of a few different parts [17:23:39] there's a set of new debian packages [17:23:47] a set of config changes [17:23:53] and a few changes to rewrite.py [17:24:22] there's some init scripts to be dealt with too it looks like? [17:24:50] hmm, maybe a container list is trickier than I thought...I'll just go with the doOperations/doQuickOperations distinction [17:25:04] of the new features present in the upgrade, we're leaving one important area unconfigured on initial deploy, with the intent of enabling it in a separate step (the statsd stuff). [17:25:22] AaronSchulz: what does that mean practically speaking? [17:25:49] right [17:25:50] apergos: service starting and stopping is managed by puppet, so I think we don't need to worry about the init script stuff. whatever comes with the packages is fine. [17:25:54] probably nothing, we only use doQuickOperations() for things like thumbnails [17:26:07] so it draws the line we want [17:26:12] apergos: it means splitting NFS on thumbs vs. originals rather than on specific project/wiki combos. [17:26:26] ok. sounds great [17:27:17] there's a note here in your email, maplebed, that says we need to check for swift sysv-style init scripts [17:27:21] that's why I brought it up [17:27:23] the statsd stuff is one of the features I'm really excited about, but it can safely be deployed disabled and enabled later, so I'd like to do it that way. (there were some issues in testing that aren't yet resolved) [17:27:25] RECOVERY - udp2log log age for oxygen on vanadium is OK: OK: all log files active [17:27:31] (ok)
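On the statsd feature being left unconfigured at first: Swift's StatsD support stays off unless the log_statsd_* options are set, so shipping the upgraded packages without them and adding them in a later step is a natural way to stage it. A sketch of the kind of stanza that would eventually go into proxy-server.conf; the host, port, prefix and sample rate here are placeholders, not the values that were (or would be) deployed:

    # Illustrative only; leaving these options out keeps metrics disabled.
    [DEFAULT]
    log_statsd_host = statsd.example.wmnet
    log_statsd_port = 8125
    log_statsd_default_sample_rate = 1
    log_statsd_metric_prefix = swift.proxy

Because the metrics are inert without these settings, enabling them later is purely a config change and matches the deploy-disabled, enable-later plan described above.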