[00:00:11] !log completed user.user_email index migrations [00:00:22] Logged the message, Master [00:02:45] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [00:02:45] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:02:45] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:02:45] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:06:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:08:45] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:10:45] New patchset: Tim Starling; "On ptwiki set wgCategoryCollation to uca-default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21024 [00:11:12] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21024 [00:20:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [00:25:52] binasher: There's no active query killer? [00:26:46] 5 days is excessive [00:27:06] 5 minutes is excessive [00:27:08] binasher: Mind sending me the full SQL of that query so I can figure out why it ran for 5 days and fix the code so it doesn't run those any more? [00:27:18] Heck, 5 seconds is excessive on the web [00:27:28] Brooke: there isn't currently, there should be [00:28:04] I would support having a query killer for wikiuser (i.e. MediaWiki), but not for wikiadmin (i.e. manual queries) [00:29:17] domas previously had one in place that ran from fenari.. i've been thinking about reimplementing with pt-kill on each local db but haven't gotten around to it, or defining exact parameters [00:31:25] RoanKattouw: the full query was [00:31:36] SELECT /* ApiQueryLogEvents::execute */ /*! STRAIGHT_JOIN */ log_type, log_action, log_timestamp, log_deleted, log_id, page_id, log_user, user_name, log_namespace, log_title, log_comment, log_params FROM `logging` FORCE INDEX (times) JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title)) INNER JOIN `change_tag` FORCE INDEX (ct_tag) ON ((log_id=ct_log_id)) WHERE [00:31:37] (log_type != 'suppress') AND ct_tag = 'posible vandalismo' ORDER BY log_timestamp DESC LIMIT 11 [00:33:02] Blegh change tagging [00:33:14] I'll poke in a minute [00:34:51] change_tag has an interesting schema [00:53:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [01:20:13] Deadlock found when trying to get lock; try restarting transaction (10.0.6.48) UPDATE `page` SET page_touched = '20120822005443' WHERE page_namespace = '0' AND page_title = 'Mitt_Romney\'s_tax_returns' [01:20:17] heh [01:20:26] if anything should be a deadlock.. 
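A minimal sketch of the watchdog being discussed above (the per-host pt-kill idea), limited to the MediaWiki web user so manual wikiadmin queries are left alone. Everything here (the MySQLdb module, the watchdog account, the thresholds) is an illustrative assumption, not the tooling that actually ran; it also assumes an account with enough privilege to see and kill other sessions.

    import time
    import MySQLdb   # assumption: python-mysqldb is available on the host

    MAX_WEB_QUERY_SECONDS = 300       # "5 minutes is excessive"
    TARGET_USER = 'wikiuser'          # MediaWiki web user only; leave wikiadmin alone

    def kill_long_queries(conn):
        cur = conn.cursor()
        cur.execute(
            "SELECT id, time, info FROM information_schema.PROCESSLIST"
            " WHERE user = %s AND command = 'Query' AND time > %s",
            (TARGET_USER, MAX_WEB_QUERY_SECONDS))
        for pid, secs, sql in cur.fetchall():
            print("killing %s after %ss: %.80s" % (pid, secs, sql or ''))
            cur.execute("KILL QUERY %d" % int(pid))

    if __name__ == '__main__':
        db = MySQLdb.connect(host='127.0.0.1', user='watchdog', passwd='...')
        while True:
            kill_long_queries(db)
            time.sleep(30)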
[01:20:55] hhaha [01:22:37] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [01:32:40] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:32:40] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:31] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 234 seconds [01:42:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 289 seconds [01:49:01] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 685s [01:54:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds [01:57:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:59:31] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 10s [01:59:49] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds [02:27:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [05:27:41] New patchset: Tim Starling; "Enable Scribunto on www.mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21027 [05:28:04] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21027 [05:47:29] !log preparing to switch s1 master to db63 (upgrade to 96GB ram and precise) in a minute [05:47:38] Logged the message, Master [05:53:30] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 232 seconds [05:53:30] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 232 seconds [05:53:30] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 232 seconds [05:53:30] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CRIT replication delay 232 seconds [05:53:48] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 250 seconds [05:54:15] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 277 seconds [05:54:15] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 277 seconds [05:54:15] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 278 seconds [05:55:27] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:55:34] New patchset: Asher; "db63 is the new s1 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21028 [05:55:45] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [05:55:45] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [05:56:17] these heartbeat warnings are ok and will clear shortly [05:56:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21028 [05:56:30] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [05:56:30] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [05:56:30] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 0 seconds [05:56:30] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [05:56:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21028 [05:56:48] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [05:57:15] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 0 seconds [06:01:00] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:01:00] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [06:01:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:01:01] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:01:02] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [06:01:03] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:01:03] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:01:03] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:02:09] !log shutting down mysql on db38 and rebooting for kernel upgrade [06:02:18] Logged the message, Master [06:03:51] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [06:14:52] !log new s1 repl position: MASTER_LOG_FILE='db63-bin.000049', MASTER_LOG_POS=646903534 [06:15:02] Logged the message, Master [06:24:21] binasher: I didn't kill long running queries [06:24:29] I killed code that was running long running queries [06:24:29] :) [06:25:01] hah! much better [06:35:42] binasher: can you look what is the problem here? https://commons.wikimedia.org/wiki/File:Pelican3.jpg [06:42:18] matanya: what am i looking for? [06:42:37] the file has a page, but no file [06:44:01] https://upload.wikimedia.org/wikipedia/commons/2/2f/Pelican3.jpg doesn't load for you? [06:44:30] now it does [06:44:32] thanks [06:45:21] hmm.. [07:31:22] New review: Tim Starling; "Can be merged after the migration job running on ms7 is finished (screen -r 3619.pts-6.ms7)." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/20844 [07:45:55] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [07:56:25] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:02:25] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [08:05:25] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:45:06] morning [08:45:34] apergos: what do you mean by obvious? [08:45:49] er, obvious? :-D [08:45:58] why would originals to swift put additional load on ms5? [08:46:13] there were an additional 5 gets per second. [08:46:28] not much, but prolly enough to push it over the edge given how fragile it now is [08:47:02] ben's right that there were already a few nfs timeout incidents here and there [08:47:48] my proposal is going ot be to call off tonight's deployment attempt and reschedule when we have a decent chunk of space back [09:21:28] isn't today's window about the 1.5 upgrade? [09:21:54] I'm more worried about needing Ben for that than the originals switch. [09:27:50] his email seemed to indicate otherwise [09:27:54] "We have a window to try again tomorrow morning at 9am. " [09:28:16] anyways I think we cannot move forward with the originals switch [09:28:38] for a few days at least (hope we get it done while here's here though) [09:29:46] I don't think we should do random attempts on switching without knowing exactly what went wrong [09:30:09] I said so yesterday too [09:31:13] is someone suggesting doing a randm attempt? [09:31:33] I'm not, you're not, and I don't read Ben's email that way either [09:31:51] anyways, tonight's window might as well be used for the upgrade, I don't know what prep is needed for that though [09:42:34] well, we don't know what's wrong yet. attempting to re-deploy and see what will happen is what Ben suggested as far as I understand [09:42:40] I disagree with that, that's all I'm saying [09:43:42] we know one of the things that's wrong, and it will take at least a few days to fix it... [09:43:48] probably we're both saying the same thing [09:44:09] so can you tell me a little about the upgrade? I know nothing about this [09:44:30] I still don't see why would the originals switch to swift put additional load on ms5 [09:44:46] and how is it changing /any/ access pattern that would affect NFS [09:46:15] more i/o on the disks [09:46:22] not much more but not much is needed at this point [09:46:38] I don't know why there were get requests to ms5, but there were [09:54:17] also we are going to have to kickstart the discussion about thumb purging generally; the idea thaat we generate all sizes on demand and keep them forever is not sustainable [09:55:05] this has come up on the wikitech mailing list before but it needs followthrough til a decision is reached [09:57:47] more i/o on the disks *why*? our change had nothing to do with thumbs [09:57:53] let alone NFS [09:58:45] in theory it had nothing to do with thumbs [09:59:37] in practice too, remember how we didn't even change the swift_thumbs acl [09:59:43] in practice there were gets [09:59:47] recorded in the error log [09:59:51] and on the ganglia graphs [10:00:06] those stopped dead after the revert [10:00:20] (and tcp dump shows there are none, I watched it for quite a while) [10:00:25] do you se the problem here? 
:) [10:00:39] the first problem is the nfs issue [10:00:40] we have requests in a backend system that our design did not intend to [10:00:46] that's the real design issue [10:00:50] and the space issue, that is extremely urgent [10:01:09] the next issue is where the gets come from; I raised this last night repeatedly [10:01:22] ben also mentioned it in his email I believe [10:02:34] (iterally: they were coming from the squids. but what generated them, no idea, and I would sure like to know) [10:03:10] from the squids? there were no references to ms5 on the squid configs [10:03:32] I'm sure of that [10:03:35] from the squids. that's right. [10:03:55] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [10:03:55] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:03:55] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:03:55] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:04:04] our new config that got pushed to sq51 had no reference to ms5 at all [10:04:16] the rest of them have ms5 but under an ACL that never matches [10:04:18] mostly sq41 but a few others [10:04:37] aha, I think I know what happened there [10:04:44] oh? [10:04:45] I have a theory [10:04:57] a theory that has to do with sq41? I am fascinated :-D [10:05:03] :-) [10:05:39] so [10:05:57] the "old" config, the one that's deployed now [10:06:13] and was in production yesterday in every squid but sq51 [10:06:32] has both swift and ms5, with essentially the same ACL with different names [10:06:38] (and ms7 as a gateway of last resort) [10:06:41] right. [10:07:06] which already makes me nervous (that they have ms5 in there) [10:07:11] anyways, you were saying [10:08:02] we (rightfully) have maxconns and connect-timeout in those cache-peer definitions [10:08:12] at one point swift became completely unresponsive [10:08:16] s you think there was failover? [10:08:20] that was ben's theory too [10:08:21] so some squids falled back on ms5 [10:08:28] makes absolute sense [10:08:31] I don't know enough about squid config to know if that's how it works [10:08:42] (read: I know very little about squid config) [10:08:50] well, if squid was taking 40s to respond to a thumb (we know /that/ happened) [10:09:06] it's easy to imagine those 200 max connections getting filled up [10:09:31] well one thing that could be done is to pull that acl [10:09:37] it really shouldn't be in there [10:10:03] that's easy to do [10:10:10] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:10:24] however, note that if that was in effect yesterday, we'd be returning 500s to clients [10:10:29] instead of gracefully falling back to ms5 [10:10:30] yes we would [10:10:36] so in a sense, this saved us from a bigger outage [10:10:49] I'd rather return 500s tbh [10:11:23] incidentally mark just sent a mail that describes that [10:13:12] uh huh [10:13:46] https://www.youtube.com/watch?v=ByY3CgR7-4E [10:21:45] mark: ping? [10:29:00] back in a little bit, must get cat food/litter before the place closes [11:02:30] yes [11:02:38] hey [11:03:08] just sent a second mail [11:03:15] so... [11:03:47] I also see a bit of a deeper design flaw, if you're up to discussing it [11:05:50] yes [11:06:51] so... [11:07:37] a client requests a thumb from swift. 
swift's rewrite.py checks if it's there and if it's not it fetches it from the image scalers
[11:08:13] mid-flight/synchronously, i.e. the client waits while swift's request to an imagescaler is happening, to get back the result
[11:08:23] yes
[11:08:54] now, the imagescalers check if the thumb is there with a HEAD (bypassing rewrite.py), and if it's not (the case here) they fetch the original from swift with a GET
[11:09:13] that already can be optimized, but go on
[11:09:13] this is not a real loop, due to the bypass, but I think it's a *performance* loop
[11:09:16] yes
[11:09:24] if swift becomes slow for whatever reason
[11:09:59] the problem will cascade to the imagescalers, which will have a lot of i/o wait
[11:10:05] i don't see the i/o wait
[11:10:13] waiting on swift is not i/o wait, waiting on nfs is
[11:10:13] waiting for Swift to respond
[11:10:16] but sure
[11:10:38] it's waiting on a network response anyway, so almost the same
[11:10:43] network i/o I meant.
[11:10:46] alright
[11:11:08] so, imagescalers will wait for swift a lot and gain a larger outstanding queue
[11:11:16] making it worse
[11:11:17] of requests waiting for a scale to happen
[11:11:36] then this will cascade to swift again, which in addition to the original problem
[11:11:49] it will now have a larger outstanding queue of thumbs
[11:12:07] waiting on the imagescalers
[11:12:11] which will wait on swift
[11:12:13] and so on.
[11:12:14] i think i've seen this happen too, 2 days ago
[11:12:33] but then I didn't understand completely what was going on
[11:12:37] but I've seen the same in top
[11:12:55] lots of apache processes, few/no scaling happening
[11:12:59] and later it looked different
[11:13:01] a cycle
[11:13:16] so i'll buy this
[11:13:36] so:
[11:13:44] even if this hasn't happened before, it will happen at some point
[11:13:58] 1. that HEAD check needs to go. It can do a GET rightaway if it doesn't know whether a thumb exists or not
[11:13:59] it's a cascading problem on a loop
[11:14:09] 2. IF rewrite.py already determined a thumb doesn't exist, we can pass that as a parameter
[11:14:18] then mediawiki can go straight on with scaling
[11:14:25] (with some authentication so outsiders can't do this)
[11:14:37] 3. we need strong control on timeouts for all fetches
[11:14:37] I thought of (2), but how you deal with concurrency?
[11:14:54] how does the current code deal with it?
[11:14:59] does it?
[11:15:10] poolcounter can do this
[11:15:18] I don't know, but expanding the race across systems doesn't sound good
[11:15:31] i don't see how it's different from now?
[11:15:59] it's not very different, no
[11:16:10] perhaps the race is slightly bigger, that's all
[11:16:29] right. fetch/fetch/write never solves a fetch/write race :)
[11:16:37] indeed
[11:16:59] where is this code?
[11:17:04] it sounds like you have read it
[11:17:08] which code?
[11:17:19] rewrite.py?
[11:17:22] or PHP?
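mark's three fixes above (drop the scalers' HEAD pre-check, pass a "known missing" hint when Swift has already 404'd, and put hard timeouts on every fetch) could look roughly like this on the Swift side. This is a hypothetical sketch, not the real rewrite.py: the rendering.svc.pmtpa.wmnet LVS name appears elsewhere in this log, while the wmf-thumb-missing parameter, the secret header, and the timeout value are invented for illustration.

    import urllib2

    SCALER_URL = 'http://rendering.svc.pmtpa.wmnet'  # scaler LVS name seen elsewhere in this log
    FETCH_TIMEOUT = 30                               # point 3: hard timeout in seconds (assumed value)
    SHARED_SECRET = 'not-the-real-secret'            # point 2: so outsiders can't claim "missing"

    def fetch_thumb_from_scaler(thumb_path):
        # Called only once Swift has already 404'd on the thumb, so the scaler
        # can skip its own HEAD existence check (point 1) and render right away.
        url = '%s%s?wmf-thumb-missing=1' % (SCALER_URL, thumb_path)   # invented parameter name
        req = urllib2.Request(url, headers={'X-Wmf-Scaler-Secret': SHARED_SECRET})
        try:
            resp = urllib2.urlopen(req, timeout=FETCH_TIMEOUT)
            return resp.getcode(), resp.read()
        except urllib2.HTTPError as e:
            return e.code, ''
        except urllib2.URLError:
            # Fail fast instead of letting requests pile up inside the Swift proxy.
            return 504, ''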
[11:17:39] php [11:18:30] I've briefly seen it but my observations are result of symptoms and discussions with Aaron [11:18:37] ok [11:18:51] I grepped through a random MW box /usr/local/apache [11:18:53] so I also want explicit timeout control [11:18:59] otherwise this solution is just as bad as NFS [11:19:26] and probably some form of locking through poolcounter, if it's not already there [11:19:34] even wit those three -with which I don't disagree at all- [11:19:40] the dependency cycle still stands [11:20:21] perhaps rewrite.py could POST the original to mediawiki or something [11:20:26] although that feels a bit dirty too [11:20:36] it's also not like rewrite.py has the data readily available [11:20:43] it translates into backend requests inside swift and such [11:21:19] hm, could we http redirect? [11:21:57] it would be ugly for clients, and squid internally resolving that would be ugly too [11:22:09] (if at all possible) [11:22:29] i don't see what you mean [11:22:43] redirecting squid to thumb.php? [11:23:05] if swift determines the thumb is not there, redirecting to imagescalers instead of fetching it synchronously [11:23:31] might work with squid, would need testing [11:23:48] feels dirty too though [11:24:03] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [11:24:03] it does not help that much [11:24:27] it reduces the amount of resources blocked inside the swift proxy I guess [11:24:33] yeah, that was my point [11:24:41] it wouldn't help that much though [11:24:55] it was nice how thumbs and originals were in different systems with NFS [11:25:12] we could use different proxies for them in swift [11:25:27] we have 6 or so? [11:25:35] (ridiculous amount for the task if you ask me) [11:25:48] 4 + 4 from what I can see [11:25:51] heh :) [11:25:55] wow [11:26:39] 4 per DC I mean [11:27:06] right [11:27:20] well, we could do 2+2 per dc and assume that in emergencies the other cluster can do the exact same tasks anyway [11:27:31] but does this really help us much? [11:27:59] wouldn't just increasing some bottleneck in the swift proxy do the same, except not fix it conceptually [11:28:22] the frontend proxies don't look loaded at all in ganglia [11:28:28] but again, that doesn't say much with swift :P [11:28:36] heh [11:30:48] another interesting data point: the imagescalers are still not on the pre-cache flush load levels [11:31:14] that remains weird [11:31:32] also note that yesterday we did not flush any cache at all [11:31:40] yeah [11:31:47] btw, i have the nfs /home migration in a few hours [11:31:50] but I might cancel that [11:31:51] apergos: ping [11:31:54] that's right, the symptoms were noticed when the cutover took place [11:32:01] mark: ponnngggg [11:32:02] apergos: are you aware of this, the /home migration? [11:32:11] you're one of the main users of /home mounts [11:32:30] and we just redirected one single squid *with* its cache to swift originals (without touching thumbs) [11:32:39] that's just nothing in traffic increase [11:32:40] imagine what had happened with all squids [11:32:42] I forgot when it was [11:32:49] apergos: in 2.5 hours [11:33:13] no problem [11:33:20] i might cancel, not sure yet [11:33:27] I assume I can be on bastion1001 without any issues? [11:33:27] i'm behind in preparation due to other stuff [11:33:29] (like swift) [11:33:31] yes [11:33:34] (I'm not relying on the mount for anything) [11:33:39] you're not? [11:33:42] you're not using /home? 
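The "could we http redirect?" branch of the conversation, sketched as a WSGI-style response: instead of holding a Swift proxy worker while the scaler renders, answer the miss with a 302 to the scaler cluster. Hypothetical only, and as mark says, whether Squid would resolve such a redirect internally would need testing; the endpoint name is an assumption.

    SCALER_BASE = 'http://rendering.svc.pmtpa.wmnet'   # assumed internal scaler endpoint

    def thumb_miss_response(environ, start_response):
        # Answer the thumb miss with a redirect instead of proxying the render
        # synchronously, so no Swift proxy worker is tied up while scaling runs.
        location = SCALER_BASE + environ['PATH_INFO']
        start_response('302 Found', [('Location', location), ('Content-Length', '0')])
        return [b'']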
[11:33:42] no [11:33:45] I'm in it [11:33:49] but I don't need to be [11:34:06] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:34:06] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:35:28] right [11:35:33] i thought your dump stuff used NFS [11:35:37] noooo [11:35:39] but i no longer see them in the client list [11:35:39] gj [11:35:41] it absolutely does not [11:36:08] thnks for checkign though [11:36:09] hm. [11:36:15] what's port 2049 in ms5 used for? [11:36:23] dunno [11:36:26] ah that's nfs [11:36:28] never mind me [11:36:40] doh [11:37:04] so, I can easily remove ms5 from squids completely [11:37:08] I'm not sure if I should though [11:37:34] i don't know why that wasn't done during the deployment [11:37:37] if anything, ms5 *helped* yesterday, by picking up the load that swift couldn't handle [11:37:43] sure [11:37:56] but in the desired case where swift actually works as desired... ;) [11:37:59] we can't have ms5 [11:38:13] I'd rather have the failure (and have it be obvious). we want ms5 gone in the end anyways [11:38:42] !log Removed /homewmf NFS mount on kaulen [11:38:52] Logged the message, Master [11:38:53] look at it this way, with it out, it's one less variable in the mix [11:40:37] how nice [11:40:43] we only have a few /home mount clients left [11:40:49] mostly fenari, spence [11:40:50] and stat1/bayes [11:40:54] and hume [11:41:07] and bast1001 soon? [11:41:12] no [11:41:17] not bast1001 [11:41:36] sure, bast1001 right now has a readonly replica of /home mounted from the eqiad netapp [11:41:40] but even that's gonna be discussed [11:41:45] \o/ (for not bast1001) [11:41:46] and it's not /home there [11:41:47] oh? I thought it was going to be fenari's equivalent [11:41:56] what makes you think I want to clone fenari :P [11:41:56] ah, you mean with the bastion/deployment split [11:42:00] yes [11:42:04] right yes [11:42:57] so why is stat1 still using NFS [11:43:12] *sigh* [11:43:32] brb [11:52:06] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [11:54:03] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [11:54:03] PROBLEM - Puppet freshness on srv238 is CRITICAL: Puppet has not run in the last 10 hours [11:58:46] New patchset: Mark Bergsma; "Let's try using the standard Linux NFS mount performance options again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21036 [11:59:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21036 [11:59:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21036 [12:00:23] paravoid: do you know anything about the swift upgrade procedure? 
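Since part of the thread above is about hunting down the last /home NFS clients (and recognising port 2049 as NFS), here is a small client-side helper that just parses /proc/mounts. It is a sketch for illustration; the server-side client list mark mentions is the authoritative source.

    def nfs_mounts_from(server, mounts_file='/proc/mounts'):
        # /proc/mounts fields: device mountpoint fstype options dump pass
        hits = []
        with open(mounts_file) as f:
            for line in f:
                device, mountpoint, fstype = line.split()[:3]
                if fstype.startswith('nfs') and device.startswith(server + ':'):
                    hits.append((device, mountpoint))
        return hits

    # e.g. nfs_mounts_from('ms5') -- match on whatever name or address the
    # mount was actually declared with in fstab/puppet.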
[12:11:04] meh [12:11:08] tons of dependencies on /home on hume [12:11:21] including some jobs by aaron for swift I think [12:11:33] perhaps i'll move it to next week, i need to announce it better [12:12:39] apergos: I'm afraid I don't, no [12:12:59] well this should be exciting then [12:13:14] I don't either, and I was hoping to try to think things though ahead of time :-( [12:13:15] everything I know is what Ben told me in that conversation that you saw [12:16:23] I know that rewrite.py need to be changed [12:18:11] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-5/2/1 (FPL/Level3, CV71028) [10Gbps wave]BR [12:18:37] * apergos rereads through the conversation again slowly [12:24:21] hmmm? [12:26:08] so we have one wave down again [12:26:25] how nice [12:26:43] unless that's leslie/chris but I don't think so [12:36:19] so, mark/apergos, should I disable ms5 now? [12:36:45] I have mixed feelings, but if you both are in favor, I'll do it now [12:36:45] seems like a fine idea to me [12:41:40] onsite in /topic is stale i guess [12:53:16] s3 compatibility layer? really? sweet [12:53:33] ah but it's a separate project now. meh [12:55:41] apergos: you don't like the swift api? ;) [12:56:02] I prefer not to have a dozen apis for distributed storage [12:56:10] one decent one is good enough [12:56:58] idk... we're ditching ec2. where else do we use s3 already? [12:57:41] well it's more about who else uses it. archive has an s3-like interface, same with google storage [13:01:14] archive.org you mean? [13:01:24] i know nothing about google storage [13:01:47] is there a contact for free storage/compute at google? [13:02:17] I don't know that there is generally such a thing [13:03:54] i thought wikimedia had some amount of free storage? [13:04:07] * jeremyb was thinking wikilovesmonuments could use some free compute [13:04:28] (e.g. the dumps at google?) [13:05:41] well I don't know if that's something they would do or not [13:05:42] mark: rebooting fenari I see? [13:05:48] perhaps we should install updates first then? [13:06:16] I could get you in touch with my contact at google and you guys could chat anyways [13:06:35] * jeremyb just was thinking it was worth a shot [13:07:31] ok, remind me in a few days when swift and media is off our backs [13:09:01] well, there's not *that* much time before we need it. but i am already getting it setup elsewhere (with puppet) so it shouldn't be so painful to move to or add capacity with them [13:09:20] (september is the big month but want to be ready before then) [13:09:29] ah ok [13:10:03] * jeremyb puts it on his calendar, danke ;) [13:10:23] what would wikilovesmonuments put there anyways? [13:10:47] wikilovesmonuments.us ; the website itself mostly [13:10:52] mark: nevermind, I pushed updates to fenari already. [13:10:54] ok [13:11:27] apergos: europe got 7.5 mil hits for the month last year i think? obviously we expect to break that this year but not really sure how that translates to US [13:11:46] what happens to the logs written to /home/w/log in the meantime, I wonder [13:12:11] * apergos has no idea how hosting a website on google storage would work [13:12:47] apergos: well they have compute now too. so we could spin up nodes to serve the site [13:13:02] ok [13:13:52] apergos: anyway, will poke friday i guess [13:13:59] cool [13:19:04] need a short nap. 
anticipating that tonight will be another of those nights [13:42:53] New review: Tim Starling; "It's finished." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/20844 [14:17:20] hey opsies, i got a brain bounce q, its a simple one [14:17:28] trying to make a decision: to organize or to leave things be [14:17:43] paravoid, you round? (this is an easy q) [14:17:50] just want a quick brain bounce [14:18:27] hi andrew [14:18:34] hiya! [14:20:14] you working/can I bother you with a brain bounce? [14:20:23] I am, you can :) [14:20:25] shoot! [14:20:28] sosoos [14:20:31] udp2log directories [14:20:41] on the 3 log machines (locke, emery, oxygen) [14:20:46] are really inconsistent [14:21:01] particularly, right now I am concerned with emery [14:21:05] there are 2 udp2log instances running there [14:21:11] the main one, which logs in /var/log/squid [14:21:16] * apergos lurks. I had a q about these yesterday [14:21:18] (which I hate btw, because they are not all squid logs) [14:21:22] and [14:21:30] the AFT/clicktracking instance [14:21:31] which logs in [14:21:37] /var/log/aft [14:21:38] cmjohnson1: ping? [14:21:56] i'm only even messing with this right now because we want to rsync the aft logs to stat1 [14:22:00] I have an rsync module setup to allow this [14:22:11] but it is only currently set up for /var/log/squid/archive [14:22:25] i was planning on changing the rsync module to allow rsync from /var/log/squid [14:22:27] and then either: [14:22:38] 1. symlinking /var/log/aft from /var/log/squid/aft [14:22:40] or [14:22:56] 2. changing the udp2log aft instance configs to log into /var/log/aft [14:23:15] and, if I did 2, while I'm at it, I'd have to urge to clean things up in general [14:23:24] jeremyb: heya [14:23:24] and maybe make all udp2log instance log in /var/log/udp2log [14:23:43] AND [14:23:47] if I went that far [14:23:52] cmjohnson1: woot, bot's fixed. you showed up at the end of http://bots.wmflabs.org/~petrb/logs/%23wikimedia-operations/20120822.txt [14:23:57] I could go farther and make all the log directories on the 3 machines be consistent [14:24:15] which would allow me to abstract some defaults out of the puppet misc::udp2log::instance define [14:24:34] would make site.pp a bit simpler [14:24:45] so I guess I'm looking for advice on how far I should go [14:25:02] option 1. symlink is the path of least resistence [14:25:07] but is the ugliest [14:25:14] jeremyb: way to finish strong [14:32:11] phew, sorry paravoid, i'm having internet troubles [14:32:24] just a sec [14:32:51] np [14:34:16] so [14:34:23] I'm a big fan of cleanups :) [14:34:56] I think in (2) you mean change it to log to /var/log/squid/aft, right? [14:37:17] hey peeps - Management access in tampa is gonna be going down [14:37:46] LeslieCarr: hi [14:37:54] LeslieCarr: are you aware that we lost one wave earlier? 
[14:38:00] no [14:38:16] oh look at that [14:38:37] well, first i fix management, then i call and scream [14:38:39] again [14:38:41] yet again [14:39:00] * Damianz finds the ear plugs and happy cake for later [14:39:01] heh [14:40:11] paravoid, yeah [14:40:13] in 2 yeah [14:40:20] unless I change everything to /var/log/udp2log/aft [14:40:20] but yeah [14:40:28] it is too bad that squid != squid [14:42:31] squid == calamari [14:43:02] i had calamari in my salad for lunch yesterday [14:45:03] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [14:45:03] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [14:45:03] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [14:45:30] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [14:45:30] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [14:45:30] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [14:45:39] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host ps1-c2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host ps1-a3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [14:45:48] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [14:45:48] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [14:45:48] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [14:45:48] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [14:45:48] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [14:45:48] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [14:45:57] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [14:46:33] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [14:46:33] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [14:46:42] PROBLEM - Host ps1-a4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.4) [14:49:10] that is all the management network [14:49:13] you can ignore [14:49:45] * jeremyb wonders if that paged [14:51:03] it didn't [15:08:36] I hunted around for upgrade instructions but didn't find them [15:08:54] http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_upgrade_notes_2012-08 but I didn't search the User namespace [15:09:14] I think I told faidon about that but perhaps didn't mail about it. [15:09:19] there's also https://gerrit.wikimedia.org/r/#/c/18264/ [15:09:39] ok, going to read all this now [15:09:54] I've been looking a little at the swift install docs but obviously that's pretty different than an upgrade [15:10:28] I'd much rather we focus on getting originals onto swift, if we have to choose. [15:10:40] ok well [15:10:53] we cannot move forward with anything on that if ms5 is still going to be written to [15:11:04] space has to be cleared first [15:11:16] I have a list of 200k images for purging I can hand you (er, directories) [15:11:35] this is the standard "not in use, added since last run" [15:11:38] I'm confused though, why you now think it is ms5 causing things to be slow when before all signs (except MW profiling) said it wasn't. 
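For ottomata's option 1 above (exposing the AFT/clicktracking logs under the rsync-able /var/log/squid tree), the path of least resistance is a symlink; a rough sketch follows. The link direction and both paths are assumptions, and in production this would be a puppet file resource with ensure => link rather than a script.

    import os

    REAL_DIR = '/var/log/aft'          # where the AFT udp2log instance writes today
    LINK = '/var/log/squid/aft'        # assumed link location inside the rsync-able tree

    if not os.path.islink(LINK):
        if os.path.exists(LINK):
            raise RuntimeError('%s exists and is not a symlink' % LINK)
        os.symlink(REAL_DIR, LINK)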
[15:11:47] I know that ms5 had problems [15:11:55] and it has to be taken care of, regardless of other issues [15:12:16] I'm not ruling out other issues by sany means [15:12:16] New patchset: Ottomata; "Pointing udp2log machine rsync modules at ./squid/ rather than ./squid/archive" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21043 [15:12:18] *any [15:12:39] maplebed: I'd rather focus on the 1.5 upgrade rather than originals tbh [15:12:43] (morning btw) [15:12:50] 200k dirs won't get us much but I'll work up another list tomorrow to get us a reasnable amount of breathing room [15:12:59] New patchset: Ottomata; "Moving AFT logs from /var/log/aft into /var/log/squid/aft." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21044 [15:13:03] we can delete anything we want on ms5. [15:13:10] we can? [15:13:11] and btw, I've prepared squid changes to remove ms5 from the squid configs completely [15:13:25] haven't deployed it yet [15:13:28] that's the whole thing aaron deployed last thursday - reads come from swift, not ms5. [15:13:41] New patchset: Ottomata; "misc/statistics.pp - setting up cronjob to rsync clicktracking logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21045 [15:14:02] you don't stil lrun your swift cleaner any more? [15:14:03] the reason we couldn't delete everything earlier was that mediawiki used ms5 as a source for choosing which thumbnails to purge. that list now comes from swift. [15:14:13] no, the swiftcleaner's stopped. [15:14:18] ahhh [15:14:19] maplebed: squids still have ms5, so when swift started failing yesterday they started sending traffic to ms5 [15:14:22] well [15:14:24] New patchset: Ottomata; "Changing custom filters to use udp-filter instead. There should be no change in content." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20919 [15:14:36] when the 200 maxconn limit was reached presumably [15:14:47] since swift was waiting for imagescalers and hence taking a while to respond [15:14:53] once paravoid's change makes it in then we might be good to toss a bunch of stuff [15:15:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21043 [15:15:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21044 [15:15:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21045 [15:15:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20919 [15:15:56] apergos: the change should be no-op under normal circumstances [15:16:00] uh huh [15:16:09] it's the abnormal ones we get to worry about [15:16:33] so we can get ms5 out of the reads but not out of the writes? [15:16:39] yes. [15:16:57] s/can get/have gotten/ (minus the squid thing) [15:17:12] irritating [15:17:43] yeah, the squid thing is really really minor [15:17:51] it only manifests when swift is unable to serve [15:18:04] which shouldn't happen [15:18:22] jorn: you've been flapping for hours, could you fix your client/network? 
[15:19:30] paravoid: oups, sorry at a conference (campus-party.eu) and bad reception, forgot my irc client, sorry [15:19:36] thanks for telling me [15:19:44] thanks for the quick reply :) [15:20:09] jorn: the channel's logged btw, if you want to catch backlog or anything [15:20:31] maplebed, apergos: so what do you think, shall we context switch to originals? [15:20:34] er, to 1.5 [15:20:37] instead of originals? [15:20:42] no, I don't think we should. [15:20:50] we have focus and should keep it. [15:20:59] well [15:21:10] do we know what caused the image scalers to overload [15:21:12] ? [15:21:24] I don't [15:21:26] there's one thing that was going to the scalers yesterday that hasn't ever before - requests for originals. [15:21:41] that bug is now fixed. it seems worth while to me to try again. [15:21:48] do we have any proof that this would be a problem? [15:22:07] I don't like guessing and trying things at random until it works. [15:22:10] no, but it's the only thing that is actually different from what we've been doing since forever. [15:22:33] see, the sqiuds would only request from ms5 if they weren't getting from swift. that says there was a problem before ms5's nfs issues (assumign we aren't overlooking something) [15:22:42] sorry for the typos [15:22:53] apergos: I disagree. [15:23:00] please explain [15:23:09] once ms5 starts having issues, the scalers back up, then swift backs up, then the squids overflow to ms5. [15:23:30] I still don't understand why the ms5 would have issues *because we switched originals to swift* [15:24:03] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [15:24:03] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [15:24:10] management access back up [15:24:12] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms [15:24:12] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [15:24:12] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [15:24:12] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [15:24:12] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.51 ms [15:24:12] no, I don't either. but I also don't think we're going to figure it out. [15:24:21] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.98 ms [15:24:21] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [15:24:21] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [15:24:21] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 4.11 ms [15:24:21] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.87 ms [15:24:21] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [15:24:30] maplebed: let us figure it out first then, and then proceed [15:24:30] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [15:24:30] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [15:24:30] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.37 ms [15:24:30] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.67 ms [15:24:30] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.98 ms [15:24:34] it's also the case that nfs is going away soon. spending ages trying to figure out why it's broken then as soon as we fix it taking it out of rotation is a waste of time. 
[15:24:50] I don't know why that would happen either, but I know there are a lot of little ratholes in the code that make things like this hard to keep track of [15:24:51] what NFS has to do with anything [15:25:02] we don't *know* if it's related or not [15:25:11] how can you say that NFS is the problem? [15:25:15] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [15:25:37] paravoid: the mediawiki profiling. that's how. [15:25:49] from ms5's perspective nothing changed yesterday [15:26:00] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [15:26:32] excetp that random broken requests were coming in to the image scalers, which heavily use ms5. [15:26:47] that's not entirely true, it got a small amount of http traffic, as we were just saying [15:26:51] it's either a coincidence or something else we haven't thought of and we're likely to repeat if we retrace our steps. [15:27:29] well, I don't really see us getting more insight by waiting around and doing something else either... [15:27:55] here's teh thing. [15:27:58] if we had infinite time, I'd suggest trying to debug this further over the course of this and the next week [15:27:59] yes, we don't know why it's breaking. [15:28:07] but we've made a change that is entirely relevant, [15:28:09] but we don't, and we need you more on the 1.5 upgrade than this imho [15:28:21] and we can get more good info by trying again and seeing if that makes a difference. [15:28:47] I'm also not a big fan of getting info by trying on production. [15:28:49] the 1.5 upgrade is a piece of cake in comparison. [15:29:03] well, we don't have any other environment in which we can test this, so... [15:29:07] I'd rather prefer to test the "random broken requests" in labs [15:29:23] if the alternatives are test in production or just sit around and wait for us to feel like something else is different, I'll go with teh former. [15:29:29] or even in production but manually, with a limited amount of requests, rather than production traffic [15:30:09] labs doesn't have an environment that has the pieces we need for a meaningful test. [15:30:24] of course if it's strictly a performance issue, that won't net us much. but it won't hurt either [15:30:36] the 1.5 upgrade might be a piece of cake for you but it's still unknown territory for both me and Ariel [15:30:48] that's why I'd like us to do it before you leave [15:30:53] when is your last day again? [15:31:04] a week and a couple of days? [15:31:08] I'm still struggling to not test a change that is really the only thing that's different from how we're doing everything now. [15:31:10] paravoid, if you've got more time for me, could you help me review my java.pp change? [15:31:13] apergos: next tuesday. [15:31:16] ugh [15:31:23] that's why I want to test this now. [15:31:34] maplebed: a *possibly service affecting* change [15:31:50] a) I don't like breaking the site every day, [15:32:00] b) we don't know why the load dropped yesterday [15:32:03] yes, we have the possibility of making it so that thumbnail requests are delayed. [15:32:26] and we don't know at what point it becomes unrecoverable either [15:32:33] New patchset: Pyoungmeister; "moving srv193 to new apache role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21049 [15:32:40] I agree we don't understand the problem. [15:32:53] !log rebooting srv193, upgrade to precise [15:33:03] Logged the message, notpeter [15:33:09] so why can't we do this: 1) test the change manually. 
2) delete some crap off of ms5 (need to delete serious piles of crap from each of the 256 subdirs commons/x/xx/) 3) see where we are at that point. [15:33:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21049 [15:33:31] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21049 [15:33:33] apergos: we tested it manually yesterday too and it appeared it worked. [15:33:40] testing it manually again won't give us new info. [15:33:41] we don't understand the problem = risking by making changes [15:33:52] the change you made afterwards, I mean [15:34:38] see what the logs look like with it, what errors the scalers still have [15:34:42] either we should context-switch to upgrade (which I prefer) or we should try to think of the problem, dig in the code with Aaron etc. [15:34:47] not try random things in production. [15:35:15] maybe we will get some leads. [15:35:29] I can't agree with reproducing yesterday's conditions which resulted in a semi-outage, without understanding why that happened [15:35:41] is that to me or to ben? [15:35:47] generally [15:35:52] everyone [15:36:51] ok well I"m suggesting we do: I'd rather prefer to test the "random broken requests" in labs or even in production but manually, with a limited amount of requests, rather than production traffic [15:36:55] New patchset: Pyoungmeister; "upgrading srv193 to new role class/precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21050 [15:37:04] after making sure ms5 will not get any reads that is [15:37:29] (sorry, I quoted you paravoid and your name didn't make it in the copy paste of that line) [15:37:36] I don't mind [15:37:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21050 [15:38:00] ben said he tried it yesterday and didn't see any problem with that [15:38:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21050 [15:38:18] I'd say let's ask Aaron to trace that codepath and see if he finds any possible issues [15:39:16] which code path? [15:39:54] broken requests. [15:41:00] PROBLEM - Host srv193 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:23] I'm pretty sure we're not going to find a smoking gun... even if they do interact with ms5, what does that tell us? just that we need to test again to recreate the problem. [15:42:38] but sure. we can ask. [15:42:45] he did spend a long time looking through the code yesterday, [15:42:57] but more to verify that the profiling in fact does only surround nfs [15:43:03] and not something else that could have stuck things up. [15:43:42] recreating the problem is unacceptable imho. [15:46:30] could you give an example of such a broken request? [15:46:35] paravoid: do you understand which section of our users were affected by yesterday's event? (no judgement, just checking) [15:46:42] RECOVERY - Host srv193 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [15:46:54] I have one example, but I'm not sure of how many different types of broken requetss there are. [15:49:25] what times were we live? 19:00, right? [15:50:18] PROBLEM - Memcached on srv193 is CRITICAL: Connection refused [15:51:04] ah. 18:58 to 19:26 [15:51:18] does it matter? [15:51:23] the section of our users [15:51:53] well, yes. we should treat service degredations differently when they affect 90% of our users vs. 2%. [15:52:22] (vs. 
0.001%) [15:52:51] it's not a matter of amount of users but of the amount of risk involved [15:53:12] not only that is [15:53:24] if there's a 95% chance that we'll break the site for 2%, we shouldn't do it. [15:53:49] even if that break only actually affects 30 people and gets us additional information? [15:53:59] is it 30? [15:54:01] that seems rather extreme, no? [15:54:30] no, I'm just trying to establish that there is a continuum, and that understanding how many users it affects is part of the equation. [15:54:35] and what makes you think it will manifest just as we deploy and not two hours later? or two days later? [15:54:51] what would you call as a success of the test? [15:55:16] I think apergos' thershold yesterday was pretty good. 30 minutes. [15:55:21] (the problem appeared long after we deployed the squid change) [15:55:30] like 20' after [15:55:55] (and actually, that's an argument in favor of it not actually being related to our change.) [15:55:57] :P [15:56:07] are you suggesting it was a coincidence? [15:56:09] oh I disagree [15:56:16] that's a hell of a coincidence. [15:56:22] how lilely is it that two days in aa row when we deploy we get hit with this and at no other time? [15:56:30] that defies common sense [15:56:38] I agree. [15:57:25] no, I'm not suggesting that it is. just that increasing the time between change and failure to the order of days decreases the culpability of the change and makes it more likely that it's something else. That's why 30 minutes seems reasonable - that's an appropriate amount of time for a change to have an effect (when we're dealing with piling up queries) [15:57:52] the only thing that has changed since yesterday is your 4 line change that affects only very specific requests [15:57:59] let's try those without a full deploy [15:58:15] can we agree on that much? [15:58:22] so for example broken requests, I'm getting them from the swift log, on ms-fe1: grep "^Aug 21 19:10" syslog.1 | grep -v thumb | grep " 404 " [15:58:42] the log lines then need to be transformed from the swift format back to an upload style url [15:59:24] and btw, when are we going to do the 1.5 upgrade? [15:59:39] how long do we expect that to take anyways? [15:59:46] and I didn't get to read the docs yet! [15:59:55] apergos: how long did we expect origs to take that long either [15:59:57] well, I was hoping to do it this week, but I'd really like to see us keep focus. so it's not currently scheduled. [16:00:11] paravoid: before you get all twitchy, I was going to add in buffer time [16:00:17] cause that's how I roll [16:01:07] I didn't have any expectation that we would get originals working the first day though, since you mention it. 
however I am considered to be the most conservative person on the team with respect to deployments [16:01:13] morning robla [16:01:23] howdy [16:01:33] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:01:33] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [16:01:34] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [16:01:34] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:01:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:01:35] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [16:01:36] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [16:01:36] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [16:01:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:01:45] maplebed: I think we got a lot of input from you these past few days on this matter [16:02:10] while we haven't gotten the same for the 1.5 upgrade and I'd like us to do this together. [16:02:36] I think it's much more risky to attempt a 1.5 upgrade without you around [16:02:38] our window (for whatever we do) is starting now? and we have two hours? [16:02:46] we only have 1 hour today. [16:02:50] oh gee [16:02:52] ok well [16:02:58] we don't have to use it. [16:02:59] I think we won't do either of those things today. [16:03:01] it's just here if we want it. [16:03:17] that would be my vote. nothing will get done in an hour [16:03:21] please try to get the Squid switch to happen [16:03:38] that's really important work that needs to happen [16:03:40] what, to use originals? [16:03:43] yes [16:03:57] robla: we've been discussing this for a while now [16:04:03] I don't want to be the bearer of bad news but [16:04:22] even with a two hour window we are as likely as not to bring down the site again [16:04:27] ok....I have meetings all morning. should I cancel to convince you guys to do this? [16:04:42] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [16:04:42] robla: we still don't know the circumstances of yesterday's incident and I don't like us shooting in the dark in production. [16:04:46] it's not a matter of convincing. it's a matter of not having the info we need to fix the issue [16:05:08] I'd rather focus on either a post-mortem for yesterday or a 1.5 upgrade that also needs to happen ASAP [16:05:28] and that only Ben knows the system in and out to be able to do it without much more preparation [16:05:42] the 1.5 upgrade isn't as important [16:05:43] (while I feel I'm equally prepared as Ben for the originals switch at this point) [16:06:10] maplebed: isn't 1.5 a blocker for eqiad? [16:06:25] yes. [16:06:33] is 1.5 the last blocker for eqiad, or one of many? [16:06:40] !log ms-be6 powering down for memory testing [16:06:49] Logged the message, Master [16:06:53] one of many. 
but Ben's here for a limited amount of time. [16:07:01] either way: we have one week (5 days) of Ben left. we need to get as much brain dump out of him as possible. [16:07:02] yep [16:07:10] agreed [16:07:11] (for refenece; ms-be6 is already out of rotation; powering it down won't have any effect on the cluster) [16:07:20] yeah...basically, this switch *must* happen before Wiki Loves Monuments [16:07:25] we got enough of a braindump on the origs switch, we need more data on the 1.5 switch [16:07:28] what's their date? [16:07:44] robla: how come? I was never told of this [16:07:48] we need *any* data on the 1.5 switch. (yes I can and will read the docs, but...) [16:07:50] paravoid: that's my fault [16:07:50] sometime early September [16:07:57] I knew of the correlation but forgot. [16:08:17] the reason is that WLM will upload a huge chunk of stuff and ms7 is getting full. [16:08:22] if you can get us that hard deadline information that would be great [16:08:26] yes it is. [16:08:31] um....could you look it up:? [16:08:39] back to my meeting now [16:08:59] the zero partner testing shouldn't have any interaction with the swift stuff, can you guys do overlapping ? [16:09:10] ok, I suggest pushing the origs switch for next Monday and focus today and tomorrow on 1.5 [16:09:19] LeslieCarr: it's looking likely we're going to skip our window today. so no problem. [16:09:28] ah [16:09:37] thanks for checking. [16:09:39] PROBLEM - NTP on srv193 is CRITICAL: NTP CRITICAL: Offset unknown [16:09:55] and Friday and whatever remains of tomorrow on the post-mortem [16:11:09] * apergos goes looking for jeremyb [16:11:16] yah [16:11:18] this timeline on meta is pretty vague [16:11:30] yes, would you know when the rush to upload for wiki loves monuments is supposed to start? [16:11:41] the exact date? [16:11:42] hah [16:11:48] uhh, #wikilovesmonuments ? [16:11:53] * jeremyb doesn't really have a clue [16:11:55] there's a channel? [16:11:58] yes [16:12:08] * apergos goes to ask [16:12:16] photos are only valid for the contest if uploading during september [16:12:25] but can be taken any time [16:12:28] http://www.wikilovesmonuments.org/contest/ says "upload during september 2012" [16:12:42] "september" is a bit defined differently per country sometimes [16:13:02] so in terms of timing, I'd expect a spike at the beginning of the month, a fall off, then most of them during the latter half peaking at the end. [16:13:10] but that's just a swag. [16:13:16] sorry, brb [16:13:20] New patchset: Andrew Bogott; "Add a configurable timeout to the git-clone exec." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20913 [16:13:45] off to call FPL [16:14:03] maplebed: you've kinda planned for us to do the 1.5 alone and I don't feel comfortable with that [16:14:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20913 [16:19:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20913 [16:22:10] ok, back. [16:22:23] ok [16:22:26] morning AaronSchulz [16:22:43] since we do have a window [16:22:43] RECOVERY - NTP on srv193 is OK: NTP OK: Offset -0.03837430477 secs [16:22:44] hello [16:22:52] and we're sitting doing nothing [16:23:08] I'm going to deploy the minor squid change that removes ms5 [16:23:15] sounds good [16:23:17] +1 [16:23:28] do we have anything else pending? the reqrite.py changes already went, right? last ngiht? [16:23:35] apergos: yeah, they did. 
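Earlier, maplebed notes that the 404'd "broken request" lines pulled from the Swift proxy log still need to be transformed back into upload-style URLs before they can be inspected or replayed. A rough helper for that step, assuming the /v1/AUTH_<account>/<container>/<object> path layout and the <project>-<wiki>-local-public.<shard> container naming of that era (both are assumptions about the setup, so treat this as a sketch):

    def swift_path_to_upload_url(path):
        # e.g. '/v1/AUTH_xxxx/wikipedia-commons-local-public.2f/2/2f/Pelican3.jpg'
        parts = path.split('/', 4)             # ['', 'v1', 'AUTH_xxxx', container, object]
        container, obj = parts[3], parts[4]
        name = container.rsplit('.', 1)[0]     # drop the '.2f' shard suffix
        project, wiki = name.split('-')[:2]    # 'wikipedia', 'commons'
        return 'https://upload.wikimedia.org/%s/%s/%s' % (project, wiki, obj)

    # swift_path_to_upload_url('/v1/AUTH_xxxx/wikipedia-commons-local-public.2f/2/2f/Pelican3.jpg')
    #   -> 'https://upload.wikimedia.org/wikipedia/commons/2/2f/Pelican3.jpg'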
[16:23:43] ok [16:23:48] AaronSchulz: do you have anything you want to push during our window? the auth caching stuff maybe? [16:23:53] (if it's ready) [16:24:13] that was already done yesterday, though it only applies to wmf10 [16:24:16] otherwise as soon as paravoid is done we can hand back the leftover time to the WPZero folks. [16:24:22] that'd be nice, if anything to facilitate debugging [16:24:24] it will become effective as more wikis switch over [16:24:34] is wmf10 already out on some wikis? [16:25:00] a few, commons will have it after 11 today [16:25:06] nice! [16:25:22] great, thanks so much [16:25:56] we also had a brief chat with mark earlier [16:26:37] PROBLEM - Apache HTTP on srv193 is CRITICAL: Connection refused [16:26:44] sec., let me deploy. [16:27:04] multitasking is bad in this case :) [16:27:10] +1 [16:27:52] hey someone from dev team, maybe platform... maybe AaronSchulz... could you take a quick peek at srv193 and tell me if you can use test.w.o [16:28:07] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.067 second response time [16:28:19] I can load test.w.o [16:28:23] so that's good [16:28:29] but another pair of eyes would be nice [16:28:31] <^demon> Same. [16:28:46] ^demon: woo! [16:32:28] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:44] !log deploy new squid config that removes ms5 from a swift fallback for thumbs [16:32:54] bugzilla seems down [16:32:54] Logged the message, Master [16:32:55] !! [16:32:59] !log adding dns for pay-lvs100[12] [16:33:08] notpeter: seems ok [16:33:08] Logged the message, Master [16:33:29] AaronSchulz: cool! thanks. [16:33:37] sent an email as well [16:33:46] garg. ns1faceplantsadcries [16:33:49] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 3.686 seconds [16:33:50] please kick bugzilla [16:34:02] maplebed, apergos: that's full deployment, please confirm [16:34:25] paravoid: you mean you've just done full deploy and want us to test or you want confirmation you should deploy all? [16:34:45] I did full deployment after testing with sq51 first [16:34:50] aude: bz == kaulen [16:34:51] wfm [16:34:52] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:56] ok, will test. [16:35:06] jeremyb: thanks [16:35:09] !log restarted ns1 b/c it fell down after authdns-update [16:35:10] I think it's minor and very very safe [16:35:14] * aude can't fix any more bugs [16:35:20] Logged the message, Master [16:35:21] so I don't think it needs many eyes [16:35:26] but since we're in this together :) [16:35:37] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.009 seconds response time. www.wikipedia.org returns 208.80.154.225 [16:36:01] hmmm? what happened? [16:36:13] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:36:17] seems to have made it around ok [16:36:29] paravoid: it used to be that the cache_peer for thumbs was above the default in the squid config. now it's below. do you know if that order matters? [16:36:38] yay! [16:36:39] given the acls, I'd say no, it doesn't. [16:36:50] maplebed: order matters, but they have mutually exclusive ACLs [16:36:57] so in this case, it doesn't [16:37:33] what happens now if it hits the 200 max-conn limit or the timeout limit? squid returns a 500? [16:37:38] yes [16:37:42] btw, re bugzilla/kaulen flapping above: there's a bug triage going now for i18n. they can't get much done with it down.
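For readers following the cache_peer ordering question: when each peer is gated by its own cache_peer_access rules, routing is decided by the ACLs, not by the order of the cache_peer lines. A rough illustration of the shape being discussed, with peer names, hostnames and the URL regex invented for this sketch (the real config is generated from the production squid templates, and the actual max-conn value is the 200 mentioned above):

    # Illustrative only -- not the production config.
    acl thumbs urlpath_regex ^/[^/]+/[^/]+/thumb/

    # Thumbnails go to the swift frontends; cap concurrent connections.
    cache_peer swift-fe.example.wmnet parent 80 0 no-query originserver max-conn=200 name=swift_thumbs
    cache_peer_access swift_thumbs allow thumbs
    cache_peer_access swift_thumbs deny all

    # Everything else (originals etc.) goes to the default image backend.
    cache_peer ms7.example.wmnet parent 80 0 no-query originserver name=ms7_default
    cache_peer_access ms7_default deny thumbs
    cache_peer_access ms7_default allow all

With mutually exclusive rules like these, a thumb request that hits the max-conn cap or times out has no other peer to fall back to, which is why the question above about squid handing back a 500 matters.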
(but sounds like it's back up for them for now) [16:37:50] <^demon> !log bugzilla went down, but came back up. magic? [16:38:00] Logged the message, Master [16:38:04] 8ball? [16:40:54] paravoid: would you join #swiftstack? [16:40:57] sure. [16:48:58] paravoid: my squid tests passed too. [16:49:04] I think we're good on that change. [16:49:16] I'd like to stat removing my list of dirs from ms5 [16:49:22] 200k of em [16:49:23] +1 apergos [16:49:26] *start [16:49:34] any objections? [16:49:44] nope [16:50:47] !log removing thumbs on ms5 unused on any project, for june-july uploads. running in screen as root [16:50:56] Logged the message, Master [16:51:40] maplebed: so, plan? [16:51:59] that won't buy us much, tomorrow I'll figure out some large balanced subset of the files in the commons shards I can toss [16:52:00] yes. stick to this (noisy) channel or go find a quieter place to talk? [16:52:11] this is public which is why I prefer it [16:52:23] (ie everyone can follow along from the community) [16:52:50] agreed [16:52:56] I agree, but the cost is that we'll have to talk over other conversations and ignore the bots. [16:53:17] I don't mind [16:53:19] I know. it could be worse, it could be wikimedia-tech (which I *still* prefer :-P) [16:54:28] for a few minutes I need to watch ms5 and a random scaler [16:55:06] okay, while apergos does that, maplebed have you seen RT 3446? [16:55:13] sounds swift-related [16:55:19] looking. [16:55:20] even those deletes (in serial) were enough to cause nfs timeouts to appear [16:55:21] dang [16:55:57] this is going to be painful [16:56:10] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:19] nice [16:56:31] I suspended the job immediately [16:57:16] paravoid: I hadn't seen that ticket, but I have seen all those bugzilla bugs. it's a mediawiki thing and a permissions bug on ms7. aaron's on it. [16:57:18] and there is a sleep after every 100 deletes too [16:57:23] * apergos stabs [16:57:35] maplebed: ah, great, thanks. [16:58:00] maplebed: here's your data point! [16:58:02] :-) [16:58:04] \o/ [16:58:15] umm... errr... /o\ [16:58:19] haha [16:58:39] I've never used that one before. I like it. [16:58:41] so we can't even clean up over there. and we can't turn off writes. [16:58:43] wf [16:58:46] er *wtf [16:58:58] AaronSchulz: so, about that hack to stop writing to ms5... [16:59:04] why would that be a hack? [16:59:10] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61575 bytes in 0.190 seconds [16:59:11] isn't that the target? [16:59:22] goal even [16:59:26] paravoid: the hackish part is stopping writes to ms5 while still writing to ms7. [16:59:27] yes but it's supposed to be at the same time as ms7 [16:59:33] right [16:59:41] the multiwrite backend doesn't have thumb/original differentiation at the moment. [16:59:59] !log oh, for people following along, shot the delete job. apparently even deletes is enough to push ms5 over the edge, yay [17:00:00] okay, stupid question: why do we need either? [17:00:13] Logged the message, Master [17:00:21] what's up? [17:00:23] we need writes to ms7 for a while [17:00:28] (well, it does for write location (eg /mnt/thumb vs. /mnt/upload) but not for yes write and no don't write.)
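The throttled cleanup described above, batches of deletes with a pause in between, which still overloaded ms5's NFS, follows a pattern along these lines. This is a sketch, not the actual job that was running; the list file, batch size and sleep interval are placeholders (only the "100 deletes, then sleep" shape comes from the log):

    #!/usr/bin/env python
    # Illustrative sketch of a throttled delete job: remove a list of unused
    # thumbnail directories in small batches, pausing between batches so the
    # NFS server gets some breathing room. (Even this proved too much for ms5.)
    import shutil
    import sys
    import time

    BATCH_SIZE = 100       # deletes per batch, as mentioned above
    PAUSE_SECONDS = 5      # placeholder; the real pause length isn't in the log

    def delete_in_batches(list_file):
        with open(list_file) as f:
            dirs = [line.strip() for line in f if line.strip()]
        for i, path in enumerate(dirs, start=1):
            shutil.rmtree(path, ignore_errors=True)
            if i % BATCH_SIZE == 0:
                time.sleep(PAUSE_SECONDS)

    if __name__ == '__main__':
        delete_in_batches(sys.argv[1])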
it's not "deploy, looks ok, now let's burn all bridges right away" [17:00:52] we could just pull NFS out of multiwrite and I can run the sync script to keep ms7 up to date [17:00:59] mark - apergos was trying to delete content from ms5 to make it less likely to tip over when we try originals again and the simple act of deleting content was enough to tip it over and make the scalers fall over. [17:01:26] AaronSchulz: that's an intriguing idea. [17:01:28] 100 deletes, sleep, 100 deletes, sleep [17:01:34] AaronSchulz: that would mean doing NFS asynchronously, which I like a lot. [17:01:35] but if client reads come from swift, you might get 404s briefly after creating a file... [17:01:51] ok [17:01:52] I mean NFS [17:01:54] AaronSchulz: I don't understand that part [17:01:58] got a pointer to your sync stuff, AaronSchulz? [17:01:58] oh. [17:02:13] maplebed: "eventual consistency" ;) [17:02:43] so the confusion would come from [17:02:44] * upload a file [17:02:44] * look for the full rez version of the uploaded file (which comes from ms7) [17:02:44] * it's missing because it hasn't been synced yet [17:03:21] yeah, that would be a problem [17:03:54] (though I would expect people only look at the thumbnail after uploading, but it still kinda sucks.) [17:04:09] and that confusion point only happens until we point squids to swift. [17:04:44] sync script pointer? pretty please? [17:04:51] AaronSchulz: what would happen if we mount /dev/null on /mnt/thumbs [17:04:56] (or something to that effect) [17:05:05] er what? [17:05:18] or we could change multiwrite to have a list of containers that only write to the main backend [17:05:33] prefer that [17:05:41] how much work would that be? [17:05:46] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::4 [17:05:55] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 [17:06:12] hm?! [17:06:23] no idea [17:06:26] apergos: I can't imagine that hard [17:06:31] RECOVERY - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [17:06:41] * apergos waits for the recovery page [17:06:42] probably a hiccup because of the wave being up again [17:07:03] AaronSchulz: that sounds like a good idea. [17:07:11] even turning off writes for just commons and enwiki will probably be plenty. [17:07:38] though we could use the top 20 list (or however many there are) that we used for sharding [17:07:56] I have a top n list as well, probably identical to yours [17:08:37] ACKNOWLEDGEMENT - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT-3425 - memory errors [17:08:40] so [17:08:43] is any deployment going on? [17:08:56] no. we're out of the window anyways [17:09:01] mark: WP Zero has a window now. [17:09:20] preilly: are you deploying anything at the moment? [17:09:41] so, maybe mark could solve our tie between continuing to debug this or switch to the 1.5 upgrade. [17:09:49] :-) [17:09:56] nah, you've beaten me into agreement. [17:10:00] :-D [17:10:03] yay! [17:10:09] ah! [17:10:19] didn't mean to beat you into anything :) [17:10:26] apergos' suggestions seem like a plan - clean off ms5 then try again. or get aaron's change in then try again. [17:10:29] cause honestly as I look at all this stuff for the upgrade I'm like, er, how does that bit work? at several places [17:10:44] get aaron's change in. I want ms5 completely decoupled.
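The "list of containers that only write to the main backend" idea above amounts to a dispatch rule in the multi-write layer: normal writes fan out to every backend, while writes for the listed containers (or, as Aaron later prefers, "quick" operations such as thumbnails) go only to the primary. A toy model of that logic follows; it is not MediaWiki's actual FileBackendMultiWrite code, and the class, method and container names are invented for illustration:

    # Toy illustration of the proposed write-splitting rule; backend objects,
    # method names and container names are all made up for this sketch.
    class MultiWriteSketch:
        def __init__(self, primary, replicas, primary_only_containers=()):
            self.primary = primary                        # e.g. the swift backend
            self.replicas = list(replicas)                # e.g. the NFS backend(s)
            self.primary_only = set(primary_only_containers)

        def write(self, container, path, data):
            # Writes for the listed containers hit only the primary, so a slow
            # replica like an overloaded NFS server cannot stall them; everything
            # else is still written to every backend.
            self.primary.put(container, path, data)
            if container not in self.primary_only:
                for backend in self.replicas:
                    backend.put(container, path, data)

    # Usage sketch (names hypothetical):
    # store = MultiWriteSketch(swift, [nfs], primary_only_containers={'commons-thumbs'})
    # store.write('commons-thumbs', 'a/ab/Example.jpg', thumb_bytes)

Splitting by thumbs vs. originals rather than by wiki keeps originals flowing to ms7 while taking the thumbnail write load off ms5, which is exactly the decoupling being asked for above.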
(aaron's change will make more of a difference than cleaning off ms5, so I prefer that route) [17:10:55] maplebed: I'll make you a deal you can use my slot if someone flushes the mobile cache for me [17:11:00] that [17:11:08] and I want the other design decisions looked at as well [17:11:26] the things we raised in the emails [17:11:50] /usr/share/pyshared/eventlet/wsgi.py: for hook, args, kwargs in self.environ['eventlet.posthooks']: [17:11:50] /usr/share/pyshared/eventlet/wsgi.py: env['eventlet.posthooks'] = [] [17:11:56] mark: when you have a minute can you help me make sense of a pdns question? [17:12:01] I don't think we need a slot right now [17:12:04] whoops. pastebomb [17:12:12] but I can nevertheless flush the cache for you preilly. [17:12:35] Jeff_Green: yes [17:12:50] paravoid: thanks [17:12:51] preilly: do you want to do that now? [17:12:58] paravoid: yes right now please [17:13:03] okay [17:13:46] ok, so current status - we're gonna talk through the 1.5 stuff and AaronSchulz is going to work on putting in a config to only write selected containers to the main filestore. once he's done we'll find a window to try again. (we have one tomorrow 9am but can schedule one next week too). sound right? apergos paravoid [17:13:49] preilly: done [17:14:01] paravoid: thanks! [17:14:10] um [17:14:11] god damnit [17:14:15] someone just bumped my car [17:14:17] bbl [17:14:21] oops [17:14:39] we will look at the head/get request cycle mark mentioned in his email (which probably means aaron will look at it) and see if that can be optimized away [17:15:15] maplebed: btw, the tests pass on my lenovo laptop, except for 1 bogus item in a listing [17:15:31] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:43] maplebed: agreed. when should we schedule the 1.5 upgrade? or are we going to decide this later? [17:15:44] paravoid: could you do me a favor and check that supervisor and puppet are stopped on silver and zhen [17:15:44] (against copper) [17:16:01] paravoid: let's make that call in a few hours after talking through what it entails. [17:16:05] fair enough [17:16:31] AaronSchulz: which is your lenovo laptop? the one that had problems or the one that didn't? [17:16:32] :P [17:16:34] preilly: they are on silver but not on zhen. stopping. [17:16:41] paravoid: and that no supervisord jobs are running [17:16:46] maplebed: the new one [17:16:52] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:17:01] ok so for now it's go over the 1.5 upgrade procedure, figure out what's needed? [17:17:15] preilly: what do you mean by that? I'm unfamiliar with that setup [17:17:38] apergos: can you set us up an etherpad? [17:17:39] paravoid: oh just that supervisor isn't running anything [17:17:47] apergos: I see that maplebed is talking with the swiftstack folks, so do that in a minute? [17:17:59] almost done. [17:18:08] preilly: it runs some twisted stuff, do I kill them? [17:18:16] sure, "for now" = as soon as we are all freed up again [17:18:44] paravoid: kill anything that's running yes [17:18:55] paravoid: and make sure that puppet is stopped too [17:18:58] paravoid: thanks [17:19:05] !log stopped supervisord & puppet on silver/zhen per preilly's request [17:19:14] Logged the message, Master [17:21:39] alright. on to upgrade discussion. [17:22:11] * maplebed looks for the gerrit change [17:22:17] it's in the pad [17:22:49] oh, so it is.
I'm going to be much more slack about tonight's discussion (wandering off to arrange food and hydration at regular intervals) [17:23:11] so if you don't get an instant response to something, that will be why [17:23:15] ok. [17:23:31] so the upgrade consists of a few different parts [17:23:39] there's a set of new debian packages [17:23:47] a set of config changes [17:23:53] and a few changes to rewrite.py [17:24:22] there's some init scripts to be dealt with too it looks like? [17:24:50] hmm, maybe a container list is trickier than I thought...I'll just go with the doOperations/doQuickOperations distinction [17:25:04] of the new features present in the upgrade, we're leaving one important area unconfigured on initial deploy, with the intent of enabling it in a separate step (the statsd stuff). [17:25:22] AaronSchulz: what does that mean practically speaking? [17:25:49] right [17:25:50] apergos: service starting and stopping is managed by puppet, so I think we don't need to worry about the init script stuff. whatever comes with the packages is fine. [17:25:54] probably nothing, we only use doQuickOperations() for things like thumbnails [17:26:07] so it draws the line we want [17:26:12] apergos: it means splitting NFS on thumbs vs. originals rather than on specific project/wiki combos. [17:26:26] ok. sounds great [17:27:17] there's a note here in your email, maplebed, that says we need to check for swift sysv-style init scripts [17:27:21] that's why I brought it up [17:27:23] the statsd stuff is one of the features I'm really excited about, but it can safely be deployed disabled and enabled later, so I'd like to do it that way. (there were some issues in testing that aren't yet resolved) [17:27:25] RECOVERY - udp2log log age for oxygen on vanadium is OK: OK: all log files active [17:27:31] (ok)
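On the statsd feature being left unconfigured at first: Swift's StatsD support stays off unless the log_statsd_* options are set, so shipping the upgraded packages without them and adding them in a later step is a natural way to stage it. A sketch of the kind of stanza that would eventually go into proxy-server.conf; the host, port, prefix and sample rate here are placeholders, not the values that were (or would be) deployed:

    # Illustrative only; leaving these options out keeps metrics disabled.
    [DEFAULT]
    log_statsd_host = statsd.example.wmnet
    log_statsd_port = 8125
    log_statsd_default_sample_rate = 1
    log_statsd_metric_prefix = swift.proxy

Because the metrics are inert without these settings, enabling them later is purely a config change and matches the deploy-disabled, enable-later plan described above.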