[00:12:36] New patchset: Aaron Schulz; "Unbreak math directory settings." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20844 [00:18:47] New patchset: Ryan Lane; "Change some nova policy rules from netadmin to sysadmin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20847 [00:19:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20847 [00:19:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20847 [00:22:07] New patchset: Dzahn; "move index.html, add resource for wmf planet logo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20848 [00:22:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20848 [00:23:47] !log fenari - install a couple lib* package upgrades [00:23:56] Logged the message, Master [00:30:07] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [00:32:49] !log removed ms-be6 from rotation due to bad memory and repeated kernel panics [00:32:59] Logged the message, Master [00:43:27] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:54] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:57] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.315 second response time [00:45:15] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [00:45:49] New patchset: Dzahn; "fix syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20850 [00:46:30] New review: Dzahn; "This apparently affected hosts not even using the class. Sorry" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20850 [00:46:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20850 [01:15:46] New patchset: Dzahn; "disable austin account and remove from singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20854 [01:16:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20854 [01:17:07] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 184 seconds [01:17:07] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 184 seconds [01:41:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 242 seconds [01:42:19] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 296 seconds [01:47:03] !log adding user.user_email index to enwiki via osc [01:47:14] Logged the message, Master [01:48:19] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 656s [01:52:58] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 186 seconds [01:53:07] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 190 seconds [01:56:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:58:49] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 27 seconds [01:58:49] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [01:59:13] binasher: are you dropping those fr tables soon? :) [01:59:24] I mean truncating [01:59:36] AaronSchulz: i was just going to ask, truncate not drop? 
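The "MySQL Slave Delay" and "Misc_Db_Lag" alerts above come from checks that compare Seconds_Behind_Master against warning/critical thresholds. A minimal sketch of that kind of check follows; the host, credentials and the 180s/300s thresholds are illustrative assumptions, not the production values.

    #!/usr/bin/env python
    # Minimal replication-lag check in the spirit of the "MySQL Slave Delay"
    # alerts above.  Host, credentials and thresholds are assumptions.
    import sys
    import pymysql
    import pymysql.cursors

    WARN, CRIT = 180, 300  # seconds; assumed thresholds

    def slave_lag(host):
        conn = pymysql.connect(host=host, user='nagios', password='secret',
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
                return None if row is None else row['Seconds_Behind_Master']
        finally:
            conn.close()

    if __name__ == '__main__':
        lag = slave_lag(sys.argv[1] if len(sys.argv) > 1 else 'db1035.example')
        if lag is None:
            print("UNKNOWN - not a replica or replication stopped"); sys.exit(3)
        if lag >= CRIT:
            print("CRITICAL - replication delay %d seconds" % lag); sys.exit(2)
        if lag >= WARN:
            print("WARNING - replication delay %d seconds" % lag); sys.exit(1)
        print("OK - replication delay %d seconds" % lag); sys.exit(0)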
[01:59:51] right, just for sanity, and to have the tables for completeness [02:00:07] so truncate (flaggedtemplates,flaggedimages), enwiki only [02:01:39] i'll do it after the enwiki.user migration i'm running now is done.. well, probably tomorrow. definitely tomorrow :) [02:02:17] after wmf10 is finished, a filejournal column can also be dropped (fj_path_sha1) [02:02:28] * AaronSchulz needs to roll those tables sometime [02:03:57] binasher: have you been able to ping the maria folks? [02:04:09] yeah [02:04:30] initially they said they had no plan on making xa replication safe but could if they were sponsored [02:04:58] then they said they thought of a way to do it that might not be so hard, so they might be able to take care of it as a regular bug fix [02:05:57] nothing since then, but there's a bug in pending status [02:06:22] there have also been some recent oracle developments that make me more interested in eventually moving to mariadb [02:06:58] http://techcrunch.com/2012/08/18/oracle-makes-more-moves-to-kill-open-source-mysql/ [02:07:20] http://ronaldbradford.com/blog/when-is-a-crashing-mysql-bug-not-a-bug-2012-08-15/ [02:07:43] and from the mysql dev lead at facebook - http://mysqlha.blogspot.com/2012/08/less-open-source.html [02:10:14] Yay, thanks binasher [02:11:33] hrm [02:20:32] * AaronSchulz lols at the trolling on http://blog.mariadb.org/disappearing-test-cases/ [02:23:07] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [02:23:25] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:27:41] AaronSchulz: just looking at I9865de75 [02:27:54] it looks kind of epic [02:28:22] too late I guess [02:45:10] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:40] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:51:10] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:31] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:00:10] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:19] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:15:19] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:16:49] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 4 seconds [03:21:28] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
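For reference, the "truncate, not drop" plan agreed above (empty flaggedtemplates and flaggedimages on enwiki but keep the tables for schema completeness) comes down to two statements. A hedged sketch, with the master host and credentials as placeholders:

    # Sketch of the enwiki-only cleanup discussed above: empty the two tables
    # but keep them.  Host and credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host='enwiki-master.example', user='wikiadmin',
                           password='secret', db='enwiki')
    try:
        with conn.cursor() as cur:
            # TRUNCATE empties the table but leaves the schema in place,
            # which is the "for completeness" part of the discussion above.
            cur.execute("TRUNCATE TABLE flaggedtemplates")
            cur.execute("TRUNCATE TABLE flaggedimages")
        conn.commit()
    finally:
        conn.close()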
[03:29:34] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [03:29:52] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [03:31:31] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:44:00] !log adding user.user_email index on s2 [03:44:09] Logged the message, Master [04:00:55] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [04:00:55] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [04:00:55] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:00:55] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [04:02:10] !log adding user.user_email index on s3 [04:02:19] Logged the message, Master [04:06:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:29:58] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:29:58] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [05:41:41] !log halting s3 user migration on hiwiktionary, will resume in the morning (pst) [05:41:51] Logged the message, Master [09:11:19] morning [09:11:52] not any more! [09:30:41] true that [09:40:53] New patchset: preilly; "production vumi settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20867 [09:41:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20867 [09:42:46] paravoid: ping [09:43:09] pong [09:43:28] paravoid: can you merge https://gerrit.wikimedia.org/r/#/c/20867/ for me once jerith signs off on it? [09:43:28] that ^^^? [09:43:34] heh [09:45:04] sure, when do you expect that? [09:45:16] paravoid: in like 5 minutes tops [09:45:27] paravoid: actually right now [09:45:28] ah. okay, sure. [09:46:02] btw, what's the canonical email address for the mobile team? [09:47:22] paravoid: mobile-tech@wikimedia.org [09:47:28] paravoid: is that what you mean? [09:47:41] New review: Jerith; "Happiness and kittens." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/20867 [09:47:50] yeah, I guess so [09:48:23] we got a mail to noc@ from a mobile company's engineer with comments about the layout and stuff [09:48:37] paravoid: oh okay cool [09:48:47] paravoid: can you please merge https://gerrit.wikimedia.org/r/#/c/20867/ and push it to sock puppet [09:48:55] already onit [09:49:02] paravoid: sweet thanks! [09:49:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20867 [09:49:30] paravoid: could you also force a puppet run on zhen and silver? [09:50:03] already on that too [09:50:26] paravoid: sweet thanks! [09:52:44] done. 
[09:53:49] paravoid: awesome thanks so much [09:58:26] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [09:58:27] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [09:58:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:58:27] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [09:58:28] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [09:58:28] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:58:29] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [09:58:30] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [10:14:16] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [10:16:31] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [10:20:43] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:25:01] New patchset: Mark Bergsma; "Add labs subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:25:42] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20873 [10:26:10] heya mark [10:27:14] hi [10:27:47] New patchset: Mark Bergsma; "Add labs subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:28:13] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [10:28:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20873 [10:30:00] is there any reason to not run "git init" in /h/w/conf/squid? [10:30:09] or should I attempt to migrate RCS to git instead? [10:30:35] git could work [10:30:37] i wouldn't bother migrating [10:30:58] do we care about RCS history? [10:31:04] do put in a check in the makefile then [10:31:05] no [10:31:08] i never used it, for one [10:31:12] so it's missing a whole lot anyway [10:31:15] hahahaha [10:31:54] sorry for laughing, I was just wondering "who would use RCS?!?" [10:32:01] so annoying [10:32:44] Isn't RCS the pre-cvs thing that literally just stores diff's you then have to patch back in? [10:33:26] isn't that something you and everyone else knows so it's pointless to mention it? [10:34:35] Dunno :P I started out with svn as a vcs [10:35:32] youngster [10:36:02] New patchset: Mark Bergsma; "Add labs subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:36:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20873 [10:37:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:44:16] okay. 
switched to git, wikitech updated (and while at it, removed the outdated cluster information at the top) [10:44:49] how wonderful [10:45:14] yeah, I had to make copies yesterday to facilitate rollback [10:45:18] New patchset: Tpt; "(bug 37483) Add a list of Page and Index namespaces ids in order to use the new namespace configuration system included into Proofread Page (change: https://gerrit.wikimedia.org/r/#/c/17643/ )" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20876 [10:45:26] I'm glad I didn't forget anything but I don't want to take that risk again [10:45:26] I saw [11:11:00] well that was annoying [11:11:05] maybe my modem is starting to go [11:16:04] so you remove a squid from the squid configs on fenari, the ones just moved to git. you push that out to the front ends. now you make changes to the config that are intended to be deployed just to the backend squid. how do you get "generate.php" to regenerate that file, now that the squid isn't in the list? [11:21:19] revert that and regenerate? we do have support for partial deploys [11:21:57] so you mean revert, then do the backend changes, then generate, then push just to the one backend squid. [11:22:03] yes. [11:22:16] that sounds like it would work [11:24:00] you see any easier way to do it? [11:25:28] no, but that's easy enough [11:25:30] I'd be happy to do it [11:25:43] just as long as it's written down I'm fine with it [11:26:41] so, lunch, be back in an hour or so. [11:26:48] enjoy [13:05:28] !log Setup labs networking in eqiad. Reworked VRRP setup using apply groups [13:09:43] mark: do morebots [13:09:46] *no [13:10:29] 2012-08-21 9.06 -!- morebots [~morebots@wikitech.wikimedia.org] has quit [Ping timeout: 265 seconds] [13:19:23] ping mutante [13:22:14] early for that [13:23:00] probably [13:23:26] no problem, there is no hurry [14:02:17] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [14:02:17] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [14:02:17] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [14:02:17] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:08:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:23:56] New patchset: Ottomata; "misc/statistics.pp - setting up rsync job for sampled-1000 logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20897 [14:24:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20897 [14:25:02] New patchset: Ottomata; "misc/statistics.pp - setting up rsync job for sampled-1000 logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20897 [14:25:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20897 [14:26:04] hi guys, anyone around to help me with some reviews? [14:26:07] i have 3 reallllyyyy easy ones [14:26:13] and one that might require more reading [15:22:01] morning paravoid, apergos [15:22:07] good morning [15:22:13] just checking in before I cross the bay to get to the office. [15:22:30] hello [15:23:33] anything of note re: our upcoming window? any comments on the squid configs that collectively aaron and I suggested? [15:23:45] feel comfortable we can make it work this time? 
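The "do put in a check in the makefile then" remark above is about refusing to regenerate or deploy the squid configs unless /h/w/conf/squid is a clean git checkout. A sketch of such a guard; the path is the one mentioned above, and the actual check that was added is not shown in this log.

    # Guard in the spirit of "put in a check in the makefile": refuse to
    # proceed unless the squid config directory is a git checkout with no
    # uncommitted changes.
    import subprocess
    import sys

    CONF_DIR = '/home/wikipedia/conf/squid'  # a.k.a. /h/w/conf/squid

    def ensure_clean_checkout(path):
        try:
            status = subprocess.check_output(
                ['git', 'status', '--porcelain'], cwd=path)
        except (OSError, subprocess.CalledProcessError):
            sys.exit("%s is not a git checkout; init/clone it first" % path)
        if status.strip():
            sys.exit("uncommitted changes in %s; commit before deploying" % path)

    if __name__ == '__main__':
        ensure_clean_checkout(CONF_DIR)
        print("ok: %s is a clean git checkout" % CONF_DIR)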
[15:23:47] :) [15:23:56] have a look at the etherpad [15:24:06] both of us have made some comments/added stuff [15:24:22] I think if we have a specific goal and don't add ot it, we can get it done [15:26:04] do we have a specific goal? [15:27:22] traffic for originals to swift, ignore all exceptions for now, that is what I would say [15:27:34] +1 [15:29:15] neither of you wanted to pull the change aaron suggested into the etherpad? [15:29:42] (enumerating projects to make the originals match more strict) [15:30:46] also, have you drafted a squid rule that will catch the auth headers? or is that something we'll do during the window? [15:31:15] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [15:31:15] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:31:43] enumerating all the projects? [15:31:53] did you read aaron's email? [15:31:59] (did he send it to everybody?) [15:32:02] oh. I was looking at the ehterpad [15:32:18] hey ben [15:32:22] morning. [15:32:28] do you want to do the originals migration today? or the upgrade? [15:32:41] btw, at this point I'm not going to make it to the office until about 9:30... :( [15:32:44] mark: originals. [15:33:03] i'm a bit concerned about it [15:33:07] there are some unknowns from yesterday [15:33:15] like where the extra load came from yesterday, on the image scalers [15:33:17] and they've been discussed, no? [15:33:24] perhaps I haven't seen the result [15:33:27] ah, that one. [15:33:34] I'm fine with listing them, but if we do that we need to make sure that additionof a new prject (*cough*wikitravel*cough*) won't screw us, that it's well documented to add it to the acls [15:33:37] it seemed as if swift didn't have many thumbs [15:33:49] whereas it should have pretty much everything in squid by now [15:34:05] mark: I'm not sure that hypothesis is correct. [15:34:14] there wasn't a large bump in the number of objects in swift [15:34:29] it might be something else [15:34:30] it would have been visible on the graph that tracks the delta on the number of objects. [15:34:32] but we don't currently know what [15:34:41] and since that was just one squid... [15:34:48] it's also the case that the image scalers flipped out both before and long after the squid change. [15:34:56] they were flipping out as late as 6pm yesterday. [15:35:00] yes, it's possible it was unrelated thumb.php traffic not coming from swift [15:35:06] which leads me to believe it was unrelated. [15:35:17] we'll be testing on one squid again first [15:36:06] wouldn't it be better to focus on the 1.5 upgrade first now? [15:36:23] why? [15:36:25] looking at the week graph, traffic on the image scalers has been markedly higher since last thursday. I'm pretty sure it's unrelated. [15:36:50] I'm not feeling particularly well prepared for the originals today either tbh [15:36:50] mark: robla and I agree that we'd like to get this switch done first (mostly because we've already started, but also because I"m not comfortable yet with the amount of testing we've done with mediawiki on the upgrade) [15:37:12] i also see mediawiki is not quite optimized yet [15:37:17] all the separate auth request [15:37:21] if we need to do more prep to move the originals traffic, then let's do the prep [15:37:24] also why the separate HEADs before GETs? 
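On the "separate HEADs before GETs" question above: checking whether a thumb exists (HEAD) and then fetching it (GET) costs two round trips to Swift where one would do. A hedged illustration of the pattern and the obvious collapse into a single request; the URL is a made-up example, not a real object.

    # Illustration of the "HEAD then GET" pattern questioned above, and the
    # single-GET alternative.  The URL is a made-up example.
    import requests

    url = ('http://ms-fe.svc.pmtpa.wmnet/wikipedia/commons/thumb/'
           'b/bd/Example.jpg/120px-Example.jpg')

    # Two round trips: existence check, then fetch.
    def fetch_two_steps(url):
        if requests.head(url).status_code == 200:
            return requests.get(url).content
        return None

    # One round trip: just GET and treat a 404 as "missing".
    def fetch_one_step(url):
        resp = requests.get(url)
        return resp.content if resp.status_code == 200 else None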
[15:38:00] we have a braindump on the pad, but considering how many cases we found in < 24h, I don't feel comfortable that we don't have even more [15:38:16] also, have in mind that removing a backend squid for testing is going to produce the exact same result that we had yesterday [15:38:33] unless it's the same squid, which presumably means same thumbs [15:38:43] paravoid: by switching just originals instead of the default, we skirt all the edge cases we've found so far. [15:38:52] well we don't know that, and if it does then we can investigate the problem [15:39:04] which is I guess one reason mark is uneasy about the move [15:39:08] mark: re: head vs. get - MW has a number of cases where it takes a different action based on whether the thumb already exists. [15:39:34] but it always requests the thumb after it finds it exists, doesn't it? [15:39:36] anyway. I've got to leave to get into the office. [15:39:46] no, we don't. we haven't researched the "block non-HEAD/GET" or "block X-Authenticate" options [15:40:15] if I find the token, I can send a DELETE with X-Authenticate and -afaik- delete an image via the squids [15:40:18] that's bad imho [15:41:22] if we need to do more prep we'll do more prep. it's got to be done sooner or later. if we don't make this window we don't make it, but I'd rather have some progress made [15:41:31] +1 apergos. [15:41:48] please feel free to start without me; I'll be back online as soon as I get there. [15:42:03] heyaaaa RobH [15:42:09] you around? [15:42:11] see you in a bit [15:42:16] I don't like rushing things. yesterday we almost brought down all the squids. [15:42:40] well by start I have in mind writing down things like the delete issue (assuming we do whitelist) [15:42:55] delete has nothing to do with white/blacklist [15:43:24] (and it's already written, hours ago) [15:43:46] nothing to do with our implementation, but i really don't like that swift auth tokens are good for a long period of time for any request type, and passed in plain text.. vs. oauth where a request is signed instead of containing the auth token, and can't be replayed after a second [15:43:48] sorry, but I guess I didn't see it [15:44:18] binasher: auth tokens are valid for a day afaik [15:44:38] you have a username/password, you login and get back a short-lived token (if you consider a day as short) [15:45:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20746 [15:45:32] on the other hand, i liked that i could tcpdump a swift fe and then craft whatever valid request i wanted from seeing one result [15:45:33] do we ever need to delete (originals)? [15:45:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20897 [15:46:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20454 [15:47:27] the auth just seems pointless, if all requests to swift are sanitized [15:47:51] it's intended for a different use case than ours (generally) [15:47:54] or if mw is going to request the token constantly, might as well expire tokens every minute [15:48:41] apergos: yes.. it's intended for an s3 sort of case. in which case it should use signed requests instead of actually passing around a token [15:50:13] is that something we need to fix now? 
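For context on the token discussion above: with Swift's built-in auth you present the account user and key once, get back an X-Auth-Token plus a storage URL, and every later request just carries that bearer token until it expires, which is why a token read off the wire (the tcpdump remark) can be replayed. A hedged sketch of the flow; the auth endpoint, credentials and container name are placeholders.

    # Sketch of the Swift (tempauth-style) token flow discussed above.  Endpoint
    # and credentials are placeholders.  The point: X-Auth-Token is a plain
    # bearer token honoured until it expires (~a day), unlike per-request
    # signed schemes such as S3/OAuth-style signatures.
    import requests

    AUTH_URL = 'http://ms-fe.svc.pmtpa.wmnet/auth/v1.0'   # placeholder

    resp = requests.get(AUTH_URL, headers={
        'X-Auth-User': 'mw:media',         # placeholder account:user
        'X-Auth-Key':  'not-the-real-key',
    })
    resp.raise_for_status()
    token = resp.headers['X-Auth-Token']
    storage_url = resp.headers['X-Storage-Url']

    # Any subsequent request carrying the token is authenticated -- including,
    # if it reached the proxy unfiltered, a DELETE on an object.
    obj = storage_url + '/wikipedia-commons-local-public.a2/a/a2/Example.jpg'
    print(requests.head(obj, headers={'X-Auth-Token': token}).status_code)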
[15:50:46] i rather see this well planned and not done in a rush rather than have someone else clean up the mess later [15:51:12] I'd like us to spend the time in our window on that planning, rather than on something else [15:51:14] apergos: no, as i said, it has nothing to do with our implementation [15:51:18] that's what I'm advocating here [15:51:23] binasher: ok [15:51:27] wow, how did i accidentally underline that [15:52:39] so far we have the delete issue and the possibly missing thumbs in swift to look at (or determine that swift isn't missing thumbs and that something else happened) [15:53:13] binasher: what's scary is that we don't filter headers or methods on the squids (afaik) [15:53:31] you can probably send a DELETE with an X-Authenticate header and delete stuff from outside [15:53:56] we can [15:54:01] don't we? [15:54:12] I remember there such an option, I don't think we do that now. [15:54:13] we do at least in some configs, but not sure about the upload cluster as is [15:54:18] i'm pretty sure we do [15:54:20] I'm not sure. [15:54:44] I'd like to test to be sure [15:55:42] that was fast [15:55:44] * maplebed lurks while in transit [15:55:48] ah [15:55:55] (yay mifi!) [15:59:20] I see some things in here like [15:59:26] http_access deny !gethead [16:00:53] ah, that's on the frontend [16:00:55] indeed. [16:00:58] yes [16:01:10] good to test it too of course [16:04:04] we should block container list requests from the squids [16:04:56] container listing is an authenticated request; blocking all auth stuff (i.e. with the authed header) accomplishes that goal. [16:07:36] unrelated, the reason to take the test squid out of the front end is so its front end doesn't pass 1/nth of the requests to itself on the back end (which will have been corrected on all other squids) [16:08:02] out of the front end pool. sorry [16:08:54] why wouldn't be corrected on that frontend too then? [16:09:00] we'll deploy all frontends anyway [16:09:14] that squid file won't be regenerate, it's out of the conf list [16:09:28] it will be regenerated when we reveert and add thebackend changes [16:09:52] that's what we discussed earlier right? [16:13:39] bbiab [16:15:18] okay, so I'll take sq51 backend out [16:15:30] and then we'll see how that goes [16:15:55] front and back end you mean? [16:16:16] frontend does not need depooling [16:16:49] that's what I've said 10 times already [16:17:06] ok, then I don't get it. please explain to me how its front end will have the correct configuration [16:17:23] if you have lost patience with it you can tell me to shut up, but it jus means I won't understand it [16:18:45] I haven't lost patience, no :) [16:19:20] i think apergos means that if you take the one squid out of the server list in the config, the frontend's squid config won't get regenerated on the next make invocation [16:19:55] right, so the test squid will still have its own back end in its list, but we want no squid to have the test squid back end in its list [16:20:08] because we want no production requests to go there. [16:20:32] if this is wrong, can you please tell me in baby steps how it actually works? [16:21:00] I know that's what you mean, but we can always vi by hand iirc :) [16:21:35] then let's add that in the directions, or it won't happen [16:22:56] added [16:23:48] I hope I got it right [16:25:01] you know, since we're doing the more strict only-pull-originals bit in squid, none of the authed stuff will make it through to swift even without the auth blocking bits. 
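The "http_access deny !gethead" line quoted above is the frontend rule that rejects anything but GET/HEAD; the open question is whether the directly reachable backends behave the same and whether requests carrying auth headers get refused. A small probe in the spirit of "I'd like to test to be sure"; host and ports are the ones mentioned in the log, and it only prints what comes back so behaviour is observed rather than assumed.

    # Quick probe: do the squids reject non-GET/HEAD methods and requests that
    # carry Swift auth headers?  The bogus token means nothing can actually be
    # deleted; the script just reports the status codes.
    import requests

    TARGETS = {
        'frontend': 'http://sq51.wikimedia.org:80',
        'backend':  'http://sq51.wikimedia.org:3128',
    }
    PATH = '/wikipedia/commons/a/a2/Example.jpg'   # illustrative URL
    HOST = {'Host': 'upload.wikimedia.org'}

    for name, base in TARGETS.items():
        for method, headers in [
            ('GET',     HOST),
            ('OPTIONS', HOST),
            ('DELETE',  dict(HOST, **{'X-Auth-Token': 'bogus-token'})),
        ]:
            r = requests.request(method, base + PATH, headers=headers,
                                 allow_redirects=False)
            print('%-8s %-7s -> %s' % (name, method, r.status_code))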
[16:25:54] also (I want to confirm this but) rewrite.py will drop DELETE requests, so the same thing applies. [16:25:56] well, yeah, that's kinda obious :) [16:26:11] frontend squids drop non-GET/HEAD apparently. [16:26:27] but the back end is directly accessible from the public net, isn't it? [16:26:35] ok I have that and the missing thumbs issue in the list of "prep" [16:26:42] is it?? [16:26:44] jesus [16:26:54] yeah, because it's hit directly by the squids in esams [16:27:01] it's not firewalled but blocked via squid acls afaik [16:27:09] oh, that makes sense. [16:27:23] oh right, it would have to be [16:27:39] (my wmf squid experience is two days old, don't assume I'm right in what I say) [16:27:51] yes but you have squid experience form elsewhere [16:28:04] my total squid experience is probably aobut 10 days in my years here. [16:28:13] (and I have none previous) [16:30:16] !log disabling sq51 backend on all frontend squids (incl. sq51) [16:30:43] oh, are we doing this? I thought we were still well in the "prep" stage... [16:30:48] I wonder if there are any varnishes pointed at squid backends. [16:32:16] mark: do we have Varnish stacked with squid backends anywhere? [16:33:04] in eqiad, for upload [16:33:11] not active, but may get a little bit of client load [16:33:15] people not observing dns [16:33:27] paravoid: are you sure it's disabled on sq51? [16:33:30] is that configured via role/cache.pp? [16:33:51] (sorry, I'm looking at /etc/squid/frontend.conf though) [16:34:06] apergos: I haven't checked yet, no. I would have, but good thing that you did too [16:35:18] seems like 208.80.152.61 is still in the list over there [16:35:21] paravoid: yes [16:35:25] I think [16:35:47] mark: I'll figure it out. thanks. [16:36:14] (our flawless plan is having holes again, grmbl grmbl) [16:37:25] apergos: removed and reloaded, thanks for double checking! [16:37:30] sure [16:37:41] yep it's fixed now [16:37:43] in other news, the bot is down again and didn't log my message [16:37:50] ah joy [16:40:11] !log shot morebots so the restarter script would restart it [16:40:21] and now we wait for it to read all its factoids etc [16:40:23] Logged the message, Master [16:40:57] mark: eqiad's role::cache::upload: frontend seem to have themselves as backends; backends seem to have upload.svc.pmtpa.wmnet:80 (i.e. pmtpa frontends) as backend [16:41:40] sounds correct [16:42:06] so, not pointed to squid backends directly [16:42:15] and hence no need to change anything there [16:42:47] there are a few requests sneaking through to the back end yet [16:43:13] I was about to check on emery, is that what you did? [16:43:17] or tcpdump? [16:43:17] like 3 [16:43:22] I'm looking at cachemgr [16:43:29] via noc [16:44:03] http://noc.wikimedia.org/cgi-bin/cachemgr.cgi?host=sq51.wikimedia.org&port=3128 but you will have to stuff in the username and password into the form to see it [16:44:20] that's because varnish's hashing is different [16:44:28] so we send it via the frontends to correct that (for now) [16:45:36] I see requests coming from esams squids. looking [16:46:01] mark: heh, interesting, good to know :) [16:46:49] yes, but esams squids should follow the squid config as long as you deployed to them as well [16:46:53] uh huh, a few from esams [16:47:17] and one from itself, grrrr! [16:47:20] the esams frontends hash directly to the pmtpa back ends, so we need to push the frontend config change to esams to pull out sq51, right? 
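The depooling logic above (pull sq51's backend out of every frontend's peer list, including esams, or a fraction of traffic still reaches it) is easier to see with a toy model: each frontend hashes the URL over its own list of backend peers, so any frontend that still lists sq51 keeps sending it roughly 1/N of its requests. A deliberately simplified sketch; squid actually uses CARP weighting, not plain modulo hashing.

    # Toy model of why sq51 must be removed from *every* frontend's peer list.
    # Squid really uses CARP; modulo hashing here only illustrates the idea.
    import hashlib

    def pick_backend(url, backends):
        h = int(hashlib.md5(url.encode()).hexdigest(), 16)
        return backends[h % len(backends)]

    backends_old = ['sq41', 'sq42', 'sq51', 'sq52']   # sq51 still pooled
    backends_new = ['sq41', 'sq42', 'sq52']           # sq51 depooled

    urls = ['http://upload.wikimedia.org/wikipedia/commons/a/a2/File%d.jpg' % i
            for i in range(10000)]
    hits_old = sum(pick_backend(u, backends_old) == 'sq51' for u in urls)
    hits_new = sum(pick_backend(u, backends_new) == 'sq51' for u in urls)
    print('sq51 pooled: %d/10000 requests; depooled: %d' % (hits_old, hits_new))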
[16:47:29] hrm, the config seems to not be deployed in an esams server [16:47:52] ah, I know [16:48:33] I ran ./deploy frontend all, but it's esams *backend* that pushes to pmtpa [16:48:51] oh right. :P [16:49:44] ottomata: you know who in analytics would make the decision when we can take down db1047 for the disk shelf addon? [16:49:50] I don't =P [16:50:22] drdee: ^? [16:50:24] RobH: i don't think that's an analytics decison, that sounds like an ops decision to me [16:50:31] okay, ./deploy all it is [16:50:43] well, its downtime won't cause any issues for actual production, its a slave in the db cluster [16:50:57] but if you guys are actively pulling data and doing some project right now... [16:51:03] binasher: started discussion in here [16:51:10] is db1047 one of the slaves used by analysts? [16:51:15] i would say maybe 20 minutes [16:51:24] but allow 30 if things arent smooth [16:51:30] in that case, a quick email to wsor list would make sense [16:51:48] apergos: puppet overwrote sq51's frontend.conf; fixed in manually on stafford, fixing it manually again on sq51. [16:51:58] oh fer cripes sake [16:52:03] drdee: my understanding is that it affects you guys, but im not sure. [16:52:14] i asked binasher in other channel, so he knows whats up [16:52:35] RobH: i am pretty sure we are not affected by it. [16:53:21] i'll email wsor@lists.wikimedia.rog [16:53:24] waiting for cachemgr to show no esams or local backend requests except my "active requests" query [16:53:59] esams still making requests [16:54:31] no, this must be dead, it says it started some many seconds ago [16:54:51] yay I think we are finally good [16:55:06] * maplebed waits for the ganglia graph to drop to zero [16:55:55] which logs are on emerey anyways?? [16:56:06] binasher: cool, so we can do today then? [16:56:16] I want to get it knocked out, its been sitting for a bit waiting on me to do this [16:56:46] I'll just tell them it's going down in an hour, at 11am PST [16:57:27] sounds awesome [16:57:48] binasher: do you know how we want the raid on the disk shelf setup? [16:57:51] the squid frontend process on sq51 is still using 75% cpu... [16:58:00] of course, it can always be done via command line from the OS [16:58:06] but its a LOT easier from the raid bios ;] [16:58:09] it's still in the lvs config. [16:58:14] yes it is [16:58:19] it will be talking to other backends [16:58:47] ok. [16:58:53] RobH: remind me about the disk shelf.. was it 12 or 24 drives? [16:59:07] paravoid: what did you look at on emery? (for squid requests) [16:59:44] 24 [17:00:11] 24 600gb 10k [17:00:16] 2.5" [17:00:30] so if its performance requirements, then two raid10s i would assume [17:00:39] but if they dont need that, but space, then two raid6. [17:00:54] can lose 2 disks in a raid6 array without issue. [17:01:05] (but its not very good for writes) [17:02:57] okay, so sq51 is drained of backend traffic [17:03:20] RobH: let's do 2 raid-10's of 12 disks each [17:03:29] maplebed: we haven't depooled frontend [17:03:39] yes, I see. I'm fine with that. [17:03:45] okay [17:03:53] robh: and you can go ahead and take it down in an hour [17:08:20] binasher: confirmed, will take it down at 11am your time 2pm mine. 
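To make the RAID-10 vs RAID-6 trade-off above concrete: with 24 x 600 GB drives split into two 12-disk arrays, RAID-10 yields 6 x 600 = 3600 GB usable per array (which lines up with the ~3597 GB per device reported later in the log once db1047 comes back), while RAID-6 would yield 10 x 600 = 6000 GB per array at the cost of write performance. A quick back-of-the-envelope calculation:

    # Usable capacity for the db1047 disk shelf discussed above:
    # 24 x 600 GB drives split into two 12-disk arrays.
    DRIVES_PER_ARRAY = 12
    DRIVE_GB = 600

    raid10_usable = DRIVES_PER_ARRAY // 2 * DRIVE_GB    # mirrored pairs, striped
    raid6_usable  = (DRIVES_PER_ARRAY - 2) * DRIVE_GB   # two parity drives

    print("RAID-10: %d GB per array (x2 arrays)" % raid10_usable)   # 3600 GB
    print("RAID-6 : %d GB per array (x2 arrays)" % raid6_usable)    # 6000 GB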
[17:08:44] will do 2 raid10 of 12 disks each, 256kb stripe, usual write back and adap read ahead [17:09:25] binasher: then I'll leave how you guys want to partition it up to you guys [17:09:31] since it wont be part of downtime, it has no rush [17:10:14] okay, the imagescalers are not too happy but it's not nearly as bad as yesterday [17:10:42] so can I ask what you looked at on emery to check squid traffic? [17:10:53] oh I didn't, I ended up doing tcpdump [17:10:58] but we have /var/log/squid on emery [17:11:03] with the 1/1000 sampled logs [17:11:12] oh. I guess I only know about the locke copy [17:11:15] is that outdated now? [17:11:22] no idea [17:11:30] ok [17:11:39] thanks [17:11:51] sorry, as I said my experience is ~two days :) [17:12:10] well you knew about logs I didn't, so.. figured you might know the rest :-) [17:13:18] yeah scalers don't seem to be particularly unhappy when I look at the 2 hour graph [17:13:24] right [17:13:47] they're worse than their usual, but not more since we started [17:13:53] yup [17:14:09] so, can we investigate and verify or reject "swift is missing a bunch of thunbs"? [17:14:19] not really. [17:14:25] ugh [17:14:30] we removed the same squid and we have persistent hashing [17:15:11] but we really don't have any evidence that it is missing a bunch of thumbs. [17:15:21] so, the same set of thumbs will be uncached today [17:15:40] maplebed: what do you mean? yesterday's image scaler spike wasn't such evidence? [17:15:46] I thought we agreed on that [17:15:59] I thought we don't have proof one way or another [17:16:42] no, that was a hypothesis, but the other bits of corroborating evidence weren't there, such as the graph indicating the number of objects getting put into swift didn't appreciably rise. [17:17:48] anyway, it's not something we can really verify (without crawling ms5) [17:18:11] we can look at some subdirectory on ms5, 0/00 for commons, let's say [17:18:36] and your logic sounds reasonable, that it's the same shard so won't trigger the same set of thumbs even if they were missing. [17:18:41] that's not very expensive (as long as we don't ask ls to sort its list) [17:18:43] apergos: there are about 250k objects in each shard. [17:19:48] one approach that might yield results is to more closely examine the logs of the images scalers to see what they were doing yesterday and look for patterns. [17:19:50] remind me, we don't store the thumbs directories as bjects, right? [17:19:52] or the swift logs, for that matter. [17:20:00] just the thumbs themselves? [17:20:11] (in swift) [17:20:29] swift is a flat filesystem; it doesn't have directories. (what it does have is paths to objects that contain slashes) [17:20:59] uh huh, which means it will be much more expensive than I want, but we can, again, jsut take some portion of those [17:21:04] *just [17:21:14] so though the path is /a/a2/foo/230px-foo.jpg, that's really just an object. [17:21:19] yep [17:21:34] well /a/a2/foo.png/230px-foo.png [17:21:43] yes. [17:22:47] that's actually a big bummer because I can't easily see if there are *any* thumbs for X file in swift [17:23:07] I have to poll for each possible thumb one at a time [17:23:15] you can; swift container listing allows a prefix. [17:23:37] (so it's sorta the same thing as listing a directory, it just can't do globbing and such) [17:24:06] I see [17:24:17] hmm [17:25:57] honestly though, I'm convinced that trying to see whether the collection in swift and ms5 are different as a precondition for continuing the deploy is a waste of time. 
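On "swift container listing allows a prefix" above: since Swift has no directories, the way to ask "are there any thumbs for file X" is to list the thumb container with prefix=<original name>/, which returns every object whose path starts with that string. A hedged sketch; the storage URL, token and sharded container name are illustrative (the log shows public container names like wikipedia-en-local-public.73, so a similar thumb container name is assumed).

    # Sketch of prefix-based container listing: enumerate every thumb stored
    # for one original.  Storage URL, token and container name are assumptions.
    import requests

    storage_url = 'http://ms-fe.svc.pmtpa.wmnet/v1/AUTH_mw'   # placeholder
    token = 'AUTH_tk...'                                      # from the auth step
    container = 'wikipedia-commons-local-thumb.a2'            # assumed name
    original = 'a/a2/Example.jpg'

    resp = requests.get(storage_url + '/' + container,
                        params={'prefix': original + '/', 'format': 'json'},
                        headers={'X-Auth-Token': token})
    resp.raise_for_status()
    for obj in resp.json():
        print(obj['name'], obj['bytes'])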
[17:26:17] they are different, and we know that. [17:26:41] the interesting question is whether they're different in a way that matters, and we won't be able to determine that by polling a random sample. [17:26:46] well what I really want to understand is whether we are going to seriously bork the scalers and/or ms5/7 if the other squids get switched over [17:27:10] if you have some ideas on that, cool [17:27:20] even if it was directly related, didn't it coincide with clearing the cache, not switching originals? [17:27:37] maplebed: I could argue that disabling sq51 today without having a definite post-mortem for yesterday was wrong, since it risked having the same effect [17:27:46] but we're past that and nothing bad happened fortunately. [17:28:00] so let's just move on (for now) [17:28:26] ok. what's next then? [17:28:56] I'm trying to configure squid for swift originals [17:29:06] i.e. add an ACL for originals and direct squids to swift for those [17:29:21] and leave the gateway of last resort untouched (ms7) [17:29:35] right [17:32:55] acl swift_temps url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/temp/[0-9a-f]/[0-9a-f][0-9a-f]/* [17:32:59] acl swift_thumbs url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/* [17:33:04] that's from the pad [17:33:08] besides the fact that this has "*" at the end which is wrong [17:33:25] sorry, I forgot to strip off the final *. [17:33:30] isn't it also wrong by having /thumb/ (and I guess /temp/) before the shard rather than after? [17:33:30] I copied those in from aaron's email [17:34:12] no, temp and thumb do come before the shard. [17:35:07] acl swift_thumbs url_regex ^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/ [17:35:16] that's what we have now [17:35:58] to upload? really? [17:36:34] notpeter: search32�.WHY!!!! WHY! [17:37:39] http(s)://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/blah...jpg/480px-blah....jpg huh yep, bfore the shard [17:38:39] cmjohnson1: :( [17:38:42] sigh, it matches accidentally [17:38:47] can we actually just throw it out the window? [17:38:51] are there windows? [17:38:51] New patchset: Andrew Bogott; "Add a configurable timeout to the git-clone exec." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20913 [17:39:09] there are 2! [17:39:27] but they are painted over but I think they still open [17:39:32] just putting it out there as an option.... [17:39:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20913 [17:39:51] so, it's [^/][^/]+ e.g. for "wikipedia" and [^/][^/]+ for e.g. "en" [17:39:55] that's not the shard. [17:40:30] [^/][^/]* or [^/][^/]+ ? [17:40:32] fair enough, but I'm going to keep the swift_thumbs ACL unmodified to avoid any regressions [17:40:36] er, *, correct. [17:41:11] [^/][^/]* is a peculiar way of saying [^/]+ though... [17:41:28] ? it's matching project/lang/thumb (a bit weird as to the regex but ok) [17:41:32] I have a change to rewrite.py I'd like to suggest too; only send thumb 404s to the scalers instead of all 404s. It's not really necessary because the scalers will 404 requests for originals, but it will keep traffic off them. Should I do that before or after this test? [17:41:51] (yeah, I don't know why it's not []+.) 
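Since the ACL regexes quoted above are easy to get subtly wrong (the stray trailing "*", thumb/archive/temp coming before the shard, [^/][^/]* versus [^/]+), a small test harness helps before anything touches a squid. The originals pattern below is a composite for illustration only (the per-project list plus the optional archive/ segment that was suggested as a combination); the thumbs pattern is the existing one quoted above, and the sample URLs are made up.

    # Tiny harness for sanity-checking the upload.wikimedia.org ACL regexes
    # being discussed above.  Patterns and URLs are illustrative.
    import re

    ORIGS = re.compile(
        r'^http://upload\.wikimedia\.org/'
        r'(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)'
        r'/[^/]+/(archive/)?[0-9a-f]/[0-9a-f][0-9a-f]/')
    THUMBS = re.compile(
        r'^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/')

    CASES = [
        ('http://upload.wikimedia.org/wikipedia/commons/b/bd/Example.jpg', ORIGS, True),
        ('http://upload.wikimedia.org/wikipedia/commons/archive/b/bd/20120821!Example.jpg', ORIGS, True),
        ('http://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Example.jpg/480px-Example.jpg', THUMBS, True),
        ('http://upload.wikimedia.org/math/1/2/3/abc.png', ORIGS, False),
        ('http://upload.wikimedia.org/wikinews/en/math/1/2/3/abc.png', ORIGS, False),
    ]

    for url, pattern, expected in CASES:
        got = bool(pattern.match(url))
        print('%-4s %-5s %s' % ('OK' if got == expected else 'FAIL',
                                'match' if got else 'miss', url))

The last two cases cover exactly the /math/ traffic mentioned in the log, which should keep falling through to the old backend rather than being caught by the originals ACL.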
[17:41:52] I'd do it afterwards for simplicity [17:41:53] ah sorry, didn't see you guys had typed a bunch, my client crolls hung [17:41:56] *scroll [17:42:19] maplebed: I don't want to muddy the waters with the rewrite.py change now [17:42:25] maplebed: are we keeping the current squid thumb acl for now or making it like the proposed ones for originals [17:42:28] ok. +1 [17:42:47] AaronSchulz: "I'm going to keep the swift_thumbs ACL unmodified to avoid any regressions [17:42:48] I don't have an opinion on the thumb acl; I don't mind keeping it as is to minimize concurrent change. [17:42:56] yep, modify as little as possible [17:42:57] right [17:43:01] paravoid: ok [17:43:04] so that if we break something we have a clue what broke it. [17:43:07] we can't forget "upload.wikimedia.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/archive/[0-9a-f]/[0-9a-f][0-9a-f]/*" too [17:43:08] but we should put it on the list to make them the same later. [17:43:22] AaronSchulz: sorry, that wasn't in the original list. thanks for mentioning it. [17:43:33] * AaronSchulz checks his list [17:43:46] maplebed: were is the working list you guys have? [17:43:48] ohhh archive [17:44:01] AaronSchulz: in the etherpad I linked to in the email I sent. [17:44:11] http://etherpad.wikimedia.org/Swift-Switch-Originals for ease of clicking. [17:44:41] I see [17:45:07] would be nice to just combine "archive" into the existing regex [17:45:37] (temp|archive) you mean? [17:45:57] both of those shard the same (on the mw side)? [17:46:06] and have the same access restrictions? [17:46:17] maybe not temp, though I guess you could do that too...I kind of like the "one per zone" list [17:46:22] AaronSchulz: not really important right now, but could you care to explain what "temp" and "archive" are? [17:47:00] archive/ is a subdir of the "public" zone that stores prior versions of files (when you upload over a file, the old versions go to /archive) [17:47:16] aha, makes sense [17:47:20] in gerrit, the button to copy the full URL for a pathset seems gone for me [17:47:38] New patchset: Bhartshorne; "change to swift to only send thumbs to the image scalers instead of all 404s." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20916 [17:47:39] ^^^ staged but not merging. just so it's available later. [17:47:43] temp/ mostly have upload wizard files and stashed files from uploads in progress (say you upload a file and get and some minor warning, you can finish it without reuploading from the client end) [17:47:59] temp is a zone btw [17:48:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20916 [17:48:30] paravoid: rewrite.py also has some good comments about the types of urls [17:50:42] * AaronSchulz updates the pad [17:51:14] paravoid: Despite it not fixing everything, does https://gerrit.wikimedia.org/r/#/c/20913/ look right to you? (I'm pushing prematurely because I want to see if my current bug has to do with puppetmaster::self) [17:51:36] so where are we at the moment? [17:52:26] andrewbogott: we're in the middle of some changes, could we talk about this later? [17:52:30] maplebed: squid changes. [17:52:36] paravoid: Yep, sorry to interrupt :) [17:54:28] just a check in: this window was from 7 til 9 pm our time? (eet) [17:55:20] paravoid: are you using the latest etherpad regexes? [17:55:27] apergos: yes, but nobody's following us. [17:55:50] the next window doesn't start until 22:00UTC. 
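The staged change mentioned above (send only thumb 404s to the image scalers instead of every 404) comes down to a path check before rewrite.py hands a miss to the scalers. A hedged sketch of that kind of guard, not the actual patch in Gerrit change 20916:

    # Shape of the "only send thumbs to the image scalers" guard described
    # above: a 404 is only worth retrying via the scalers if the path is a
    # thumbnail, since the scalers cannot do anything with a missing original.
    # This is not the real rewrite.py patch, just an illustration.
    import re

    THUMB_PATH = re.compile(r'^/[^/]+/[^/]+/thumb/[0-9a-f]/[0-9a-f]{2}/')

    def should_dispatch_to_scaler(path, status):
        """Return True if a backend miss should be retried via the scalers."""
        return status == 404 and bool(THUMB_PATH.match(path))

    print(should_dispatch_to_scaler('/wikipedia/commons/thumb/b/bd/X.jpg/80px-X.jpg', 404))  # True
    print(should_dispatch_to_scaler('/wikipedia/commons/b/bd/X.jpg', 404))                   # False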
[17:55:54] the next thing on gcalender is 3pm [17:56:36] I admit I am not necessarily game to continue til 1am our time [17:57:05] sure, I didn't mean that we should continue until then, just that there's not going to be anybody chomping on our heels. [17:57:06] paravoid: how long are you good for? you usually are here later than me [17:58:35] !log db1047 mysql shutting down (going to add new disk shelf to server) [17:58:45] Logged the message, RobH [17:58:48] moar disks! [18:00:27] !log db1047 shutting down [18:00:37] Logged the message, RobH [18:02:12] just to make sure, all our servers are on the same timezone, right? [18:02:28] (obv they're not *in* the same timezone, but they all think it's UTC, yes?) [18:02:43] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:55] dschoon: yes. [18:02:59] kk [18:03:01] ty [18:03:54] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20917 [18:04:14] gerrit-wm: aha, now you talk again? [18:04:24] it did not mention new patches before [18:05:13] maplebed, apergos: run "git diff" [18:05:22] where? [18:05:31] fenari: /home/w/conf/squid [18:05:56] (yeah, that's in git instead of RCS since a few hoursa go) [18:06:06] Change abandoned: Dzahn; "this one could not be merged due to path conflict because i changed the configs to be templates rath..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15715 [18:06:27] reading. [18:06:45] binasher: btw ^^^, squid config in git; wikitech's updated but I don't think you'd read that :) [18:07:06] paravoid: in swift_origs you'r emissing a / inside the archive optional segment. [18:07:29] ha, ACLs changed again. copying again from the pad [18:07:48] good catch, thanks. [18:08:03] so this won't go out to sq51. we want it to though, right? [18:08:28] yeah, I'll fix that in a bit [18:08:34] ok [18:08:38] staring at the regexes now [18:09:03] well, fixed now [18:09:17] looks ok [18:09:40] lgtm. [18:10:11] thumb/temp thumb/archive [18:10:38] apergos: what about them? [18:10:40] these requests need to be accounted for [18:10:58] I see them in the diff [18:11:00] apergos: we're not changing the thumbs with this push - that's in there but commented out. [18:11:14] the existing thumbs regex stops at thumb/ [18:11:18] also that part is changing atm anyway, put it has those [18:11:23] + '=swift_thumbs' => 'ms-fe.svc.pmtpa.wmnet', [18:11:24] that isn't [18:11:33] the existing acl is also scalled swift_thumbs. [18:11:37] ok [18:11:41] sorry, it's not in the diff [18:11:49] heh [18:11:54] yeah; it's: [18:11:56] acl swift_thumbs url_regex ^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/ [18:11:58] !log db1047 disk shelf added, two raid10 arrays of 12 disks each, booting now into OS. [18:12:07] Logged the message, RobH [18:12:15] very top [18:12:45] well, depends on your unified= :) [18:12:47] looks ok [18:12:49] okay, deployed on sq51 [18:13:47] 1) random /math/: MISS, 200 OK, Sun-Java-System-Web-Server [18:14:42] 2) random thumb: MISS, 200 OK, swift (X-Object-Meta-Sha1base36) [18:15:01] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:15:16] wikinews/en/math ... 
that's what those should look like, shouldn't be caught by the regexp [18:15:28] 3) random orig: MISS, 200 OK, swift [18:15:48] great [18:15:49] apergos: on a related note, see https://gerrit.wikimedia.org/r/#/c/20844/ [18:16:17] heh awesome [18:16:37] so these will be /math directly, worksforme [18:16:38] !log db1047 back online, presends sdb and sdc at 3597.3 GB each [18:16:47] Logged the message, RobH [18:16:58] binasher: So would you let whoever needs to know about the increase in space about that ^? [18:17:04] or lemme know who i need to ping [18:17:13] 4) random /archive/: MISS, 200 OK, swift [18:17:57] RobH: no one now, i'm going to have to have a lengthier downtime later to copy off what's currently in /a and convert it to lvm [18:18:19] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 775 seconds [18:18:25] n) request using swift syntax to monitoring container/file via sq51 failed (404 from ms7) [18:18:26] i'm going to switch it to mariadb and possibly upgrade to precise at the same time [18:18:28] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 771 seconds [18:18:38] * AaronSchulz wonders what thats for [18:18:52] RobH: i'll let them know its back up tho [18:18:53] AaronSchulz: what? [18:19:05] mariadb :) [18:19:21] oh [18:21:26] 6) OPTIONS on backend: 405 method not allowed (presumably from swift); OPTIONS on frontend 403 forbidden [18:21:37] maplebed: yours was (5); could you clarify what you did? [18:22:01] paravoid: added to the etherpad. [18:22:05] at the bottom. [18:22:06] ah, you put it to the pad [18:22:09] right, yes, sorry [18:22:14] so many windows :) [18:22:19] :P [18:22:50] I wonder if we should block auth/AUTH now [18:23:01] does it hurt to? [18:23:08] I don't see any legimitate reasons for leaving it, sounds safe to me [18:23:13] cause if not, I'd say, do it [18:23:41] I prefer the bad guys to have a few hoops to jump through rather than just one [18:24:04] +1. it won't help until later, but shouldn't hurt either. [18:24:19] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [18:24:28] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [18:29:16] New patchset: Ottomata; "Changing custom filters to use udp-filter instead. There should be no change in content." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20919 [18:30:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20919 [18:31:09] doesn't seem to work [18:31:32] what happened? [18:31:34] i.e I get 404s [18:31:40] when I should be getting 403s [18:32:00] ah, I think I know [18:32:51] okay, fixed [18:33:26] lgtm. [18:33:40] !log authdns-update for mc1009-1016 mgmt [18:33:50] Logged the message, RobH [18:33:52] it's already deployed on sq51 and tested, so :) [18:34:22] paravoid: does it not matter that it's below the nearby content servers section in the published config? [18:35:01] something unrelated: squid configs mention upload.wikimedia.org/centralnotice which is not on our list but doesn't seem to exist either [18:35:06] probably legacy [18:35:18] very legacy [18:35:29] good to have an old-timer around :) [18:36:02] you young whippersnappers, I only got one thing to say to you [18:36:08] "git off my lawn!" [18:36:42] hahaha [18:37:06] paravoid: confirmed where I got a 404 from ms7 previously (with an authed reuqest) I now get a 403 from squid. 
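The numbered spot checks above (/math/ still served by the old backend, thumb/original/archive now answered by Swift, OPTIONS and authed requests refused) fold naturally into a repeatable smoke test for the next window. A sketch; the test URLs are placeholders and the expected codes mirror what was observed in this log rather than any guarantee, so adjust them if the plan changes.

    # Repeatable version of the manual spot checks above.  URLs are placeholders;
    # expected codes reflect what was observed here (Swift/ms7 responses, 403 on
    # OPTIONS and on auth headers at the squid).
    import requests

    SQUID = 'http://sq51.wikimedia.org'
    HOST = {'Host': 'upload.wikimedia.org'}

    CHECKS = [
        ('GET', '/wikipedia/commons/thumb/b/bd/Example.jpg/120px-Example.jpg', {}, 200),
        ('GET', '/wikipedia/commons/b/bd/Example.jpg',                          {}, 200),
        ('GET', '/wikipedia/commons/archive/b/bd/20120821!Example.jpg',         {}, 200),
        ('GET', '/math/1/2/3/abcdef.png',                                       {}, 200),
        ('OPTIONS', '/wikipedia/commons/b/bd/Example.jpg',                      {}, 403),
        ('GET', '/wikipedia/commons/b/bd/Example.jpg', {'X-Auth-Token': 'bogus'}, 403),
    ]

    for method, path, extra, expected in CHECKS:
        r = requests.request(method, SQUID + path, headers=dict(HOST, **extra),
                             allow_redirects=False)
        backend = r.headers.get('Server', '?')
        status = 'OK ' if r.status_code == expected else 'FAIL'
        print('%s %-7s %-55s %s (%s)' % (status, method, path, r.status_code, backend))

Printing the Server header makes it easy to tell whether a hit came back from Swift or from the old Sun-Java-System-Web-Server backend, which is the same thing being checked by hand above.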
[18:37:14] * apergos checks cachemgr on sq51 out of paranooia... looks good [18:37:25] so... [18:37:27] okay [18:37:32] we have another FIXME [18:38:11] send real traffic through these rules? [18:40:26] bueller? [18:40:28] what's the fixme? [18:41:05] ms6_thumbs [18:41:29] I think it's a no-op since it's the last one [18:41:34] but I want to verify that first [18:43:34] the ms6_thumbs cache_peer line didn't show up on sq51. IIRC that's esams equivalent of ms5. and still live (with the same kind of 404 handler that sends traffic back to pmtpa on missing thumbs) [18:43:41] why do we care about ms6 right now? [18:43:52] it can just keep doing its job as esams thumb cache for the moment [18:44:29] fair enough [18:44:31] it's not particularly full or loaded [18:44:44] (but eventually w should decide what happens in esams) [18:44:56] I think mark's idea was a swift cluster there. [18:45:03] and I think we already have hardware for it. [18:45:05] sure, why not [18:45:11] should have one in each dc [18:45:21] (it would only have thumbs though, not originals) [18:45:33] yup [18:45:46] ah legal issues, gotta love 'em [18:46:13] so, are we ready to throw real traffic at this thing? [18:46:23] I think so [18:46:32] well does real traffic mean put sq51 back end back in the pool? [18:46:36] right [18:46:43] sounds ok [18:46:44] I suggest putting sq51 back in the pool as the first step [18:46:48] +1 [18:46:56] \o/ [18:46:59] and if all goes well, then push that config to other backends [18:47:02] now don't jinx us [18:47:17] all goes well = we watch it for a half hour (no joke either) [18:47:28] note how I don't have a good way of putting it completely back in the pool [18:47:34] :-D [18:47:42] I can only partially put it back [18:48:03] hm, or maybe I can [18:50:06] can't wait to see this workaround.... [18:50:30] ./deploy frontend all; ./deploy cache esams :) [18:51:45] * apergos waits for it to go around [18:51:46] (or knams.upload) [18:52:23] thereit is [18:52:23] !log putting sq51 (modified with originals pointing to swift) back into the backend pool [18:52:32] Logged the message, Master [18:52:38] apergos: what are you looking? [18:52:58] I was checking another squid to make sure the conf change made it there (it did) [18:53:13] I'm watching the swift logs [18:54:11] yup it's hopping right along [18:54:27] paravoid: apergos take a look at fenari:/tmp/benswiftlog.txt [18:54:31] those just showed up [18:54:36] some of these seem to be taking a while [18:55:21] eeww [18:55:25] apergos: ? [18:55:42] maplebed: which log are you looking at? ms-fe1 /var/log/syslog? [18:55:55] no, the consolidated long on fenari. [18:55:59] wonder what's causing that (non utf8) [18:56:01] /home/w/log/syslog/swift [18:56:02] oh, right. [18:56:21] I'm running "tail -f /var/log/squid/sampled-1000.log |grep ^sq51 |grep MISS" on emery [18:56:36] TCP_MISS/200s all over [18:57:06] is that catching thumbs? [18:57:53] yes [18:58:03] you can grep them out but I wanted to catch everything [18:58:26] New patchset: Ryan Lane; "Change network settings in eqiad for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20921 [18:59:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20921 [18:59:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20921 [19:00:03] sq51 is up to its normal traffic pattern [19:00:25] swift's load patterns haven't changed but that's not surprising (originals; only one squid) [19:00:25] yup, I've been watching the requests come in [19:01:26] I'm seeing good looking entries in the swift log coming from sq51; varying user-agents, a mix of 200s and 404s. [19:01:46] this is my grep, btw: [19:01:49] tail -f swift | grep proxy-server | grep -v thumb | grep -v 'v1/AUTH_.auth' | grep -v pybaltestfile.txt | cut -b -270 | grep 'HTTP/1.0 ....' [19:02:02] (my terminal is 270c wide, and the last grep is just to highlight the response code) [19:02:59] all these AUTH_* that I see with random UAs are rewrite.py's, right? [19:02:59] btw, seeing the 404s is not unexpected. I've seen enormous numbers of thumbs requested for images that don't exist, so requesting the actual originals that also don't exist doesn't surprise me. [19:03:13] scalers are picking up again [19:03:18] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [19:03:42] oh? [19:03:53] see em all spiking a bit? [19:03:59] I see them and I'm worrying [19:04:05] uh huh [19:04:16] we did nothing that was thumb related and could trigger that [19:04:16] paravoid: yes, that's correct. [19:04:41] making mental note tha tthere's no cache purge or antying, this was just sq51 back in the pool [19:05:54] AaronSchulz: btw, have you had a chance to look at auth caching? [19:06:30] starting to fall off again [19:06:43] apergos: yeah, I noticed, hence my context-switch [19:06:59] it's already in master, though there is no hurry there [19:07:16] AaronSchulz: ah, great! thanks! [19:07:27] AaronSchulz: also, ma rk raised a good point before [19:07:47] AaronSchulz: that we're seeing a lot of "HEAD thumb; GET thumb" [19:07:58] checking if it exists and if so requesting it [19:08:22] if that's the case, this could get a nice optimization [19:10:34] maplebed, apergos: would you agree that everything looks good? [19:10:35] looking at the last 1600 original requests - it appears they're about 30% 404, 60% 200, and 10% other (206, 304, 412, 499) [19:11:03] aha, do you have those 404s? [19:11:20] it loks ok but it's too soon to move more traffic, I want to wait a little while for that [19:11:31] paravoid you want an example original that 404s? [19:11:32] we could check these 480 against ms7 [19:11:56] New patchset: preilly; "The program 'supervisord' is currently not installed. So adding it..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20925 [19:12:06] wikipedia-en-local-public.73/7/73/PeterGotti.jpg [19:12:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20925 [19:13:20] maplebed: checked enwiki, deleted [19:13:49] and found a href in a random website [19:13:55] so, yeah, I guess there are a lot of these cases [19:14:15] image scalers are sad again [19:14:26] yup [19:14:31] same pattern as yesterday [19:14:35] spike, better, spike... [19:14:46] I think it's swift not able to handle the load [19:14:49] ouch that is bad [19:14:56] it's the only thing that makes sense [19:15:23] I don't follow. 
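The "30% 404, 60% 200, 10% other" estimate and the 90th-percentile GET times above are the kind of numbers worth pulling from the consolidated proxy log on demand rather than by eye. A sketch of such a tally; the field positions are assumptions about the swift proxy-server log line format, so verify them against a real line before trusting the output.

    # Quick tally in the spirit of the estimate above: status-code breakdown and
    # a rough 90th-percentile request time for non-thumb (originals) traffic.
    # Field positions are assumptions about the proxy-server log format.
    import sys
    from collections import Counter

    codes = Counter()
    times = []

    for line in sys.stdin:
        if 'proxy-server' not in line or '/thumb/' in line:
            continue                      # originals only, as in the grep above
        fields = line.split()
        try:
            status = int(fields[-6])      # assumed position of the HTTP status
            elapsed = float(fields[-1])   # assumed position of the elapsed time (s)
        except (ValueError, IndexError):
            continue
        codes[status] += 1
        times.append(elapsed)

    total = sum(codes.values()) or 1
    for status, n in codes.most_common():
        print('%3d  %6d  %.1f%%' % (status, n, 100.0 * n / total))
    if times:
        times.sort()
        print('p90 request time: %.2fs over %d requests'
              % (times[int(len(times) * 0.9)], len(times)))

Usage would be something along the lines of feeding it the tail of the consolidated log mentioned above, e.g. tail -n 20000 /home/w/log/syslog/swift | python tally.py.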
[19:15:24] why were they affected yesterday and why they're affected again today [19:16:02] we put more traffic into swift, so requests from image scalers take more time, so they're not able to cope up with the load [19:16:26] and that explains why their load is from apache processes rather than gs/convert [19:17:07] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:15] argh. [19:17:16] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:17] I want to put the rewrite change to only send thumbs to the image scalers in place. [19:17:22] there they go dangit [19:17:33] http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:17:48] maplebed: remind me what we send there now [19:17:54] see the spike in swift_GET_200_90th? [19:17:57] all 404s [19:18:03] DELETE too [19:18:15] no, sorry, all GET and HEAD 404s. [19:18:26] GET_204_90th is weird too [19:18:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20925 [19:18:40] 50 seconds for a 404?!?! [19:18:41] looking at the originals; the 404s are taking a long time. [19:18:45] WTF?? [19:18:55] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:04] 50 seconds for swift to able to say "this object doesn't exist"? [19:19:09] I thought AaronSchulz had said that the scalers would fail fast if the url sent to thumb_handler.php wasn't a thumb, but it appears that isn't the case. [19:19:16] load on ms5 is risingkids [19:19:29] looking at this: http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:19:32] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:42] it's more than apparent that swift is not coping up with the load [19:20:16] maplebed: what do you think? [19:20:36] paravoid: the timing present in the logs includes the call back to the scalers, so a high 404 time doesn't actually mean that's how long it's taking swift (on its own) to respond. [19:20:48] swift can take a lot of GET load given what happened last week [19:20:59] 35s for a GET? [19:21:09] (90th percentile) [19:22:00] really inexplicable [19:22:15] looking at the object server's timing for the 404s shows 0.00x seconds, which implies it's all in the proxy server (and my guess is waiting for scalers) [19:22:22] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:39] yeah, that sounds like waiting on the scalars [19:23:00] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies [19:23:37] so, how about pushing that rewrite change to not send original reuqests to the scalers? [19:23:51] in the middle of an incident? [19:23:51] https://gerrit.wikimedia.org/r/#/c/20916/1/files/swift/SwiftMedia/wmf/rewrite.py [19:23:59] I'm thinking of rolling back sq51. [19:24:10] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:21] yay for paging [19:24:30] okay, I'm rolling back sq51. [19:24:32] either way. [19:24:34] yes, roll back [19:25:07] is anyone working on that? ^^ [19:25:08] the next attempt (not tonight) can be done with the rewrite.py rule in place already [19:25:13] Leslie_sdtpa: yes. [19:25:15] maplebed: hmm, that might make sense to do now then ;) [19:25:17] Leslie_sdtpa: yes. 
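The rough 30% 404 / 60% 200 / 10% other split quoted above comes from eyeballing the consolidated proxy log. A small sketch of how such a tally could be reproduced, assuming each relevant line carries an "HTTP/1.0 <status>" token as in the greps shown earlier; the default log path is the consolidated one mentioned above and may not match other hosts:

    import re
    import sys
    from collections import Counter

    # Assumed format: proxy-server lines containing 'HTTP/1.0 <status>',
    # as matched by the "grep 'HTTP/1.0 ....'" shown earlier.
    STATUS_RE = re.compile(r"HTTP/1\.[01] (\d{3})")

    def tally(lines, last_n=1600):
        codes = Counter()
        for line in list(lines)[-last_n:]:
            m = STATUS_RE.search(line)
            if m:
                codes[m.group(1)] += 1
        total = sum(codes.values()) or 1
        for code, n in codes.most_common():
            print("%s  %5d  %4.1f%%" % (code, n, 100.0 * n / total))

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/home/w/log/syslog/swift"
        with open(path, errors="replace") as f:
            tally(f)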
[19:25:24] cool [19:25:30] hide again [19:25:42] New review: Diederik; "Hey ottomata" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/20919 [19:26:16] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.461 second response time [19:26:32] !log rolling back sq51 switch to originals [19:26:42] Logged the message, Master [19:27:08] New patchset: preilly; "add supervisor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20928 [19:27:46] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.832 second response time [19:27:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20928 [19:28:00] AaronSchulz: do you have any numbers you can pull from the image scalers to confirm that that's where the slowness is coming from? [19:28:17] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20928 [19:28:36] New review: Ottomata; "Does it need a special db file? The default one should suffice, no?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/20919 [19:29:57] ok, looks like the config went around properly .. though i don't know why some files have swift1_thumbs swift2_thumbs etc and some don't [19:30:18] wfStreamThumb() hits didn't got up, though there is no profiling on wfThumbHandle404() [19:30:34] since originals would 404, it wouldn't show up in graphite there :/ [19:30:46] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [19:30:55] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [19:31:04] PROBLEM - Host ps1-a4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.4) [19:31:04] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [19:31:04] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [19:31:04] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [19:31:04] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [19:31:04] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [19:31:05] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [19:31:13] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [19:31:22] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [19:31:22] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [19:31:22] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [19:31:22] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [19:31:22] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [19:31:22] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [19:31:23] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [19:31:23] PROBLEM - Host ps1-b1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:31:23] PROBLEM - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:31:50] whaaa?! [19:31:58] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0BRfxp0: down - BR [19:32:15] uuuhhhh [19:32:17] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [19:32:25] Leslie_sdtpa: ?? 
[19:32:43] I'm looking an imagescaler [19:32:44] > Chris and I are going to cut over the management access in tampa by [19:32:45] > taking the management aggregation switches... this? [19:32:45] there was a reason increase in MW startup calls though [19:32:49] could have been the scalars [19:32:56] reason/recent [19:32:59] a lot of apaches in D [19:33:15] timing is unfortunate [19:33:15] paravoid: i think chris is rewiring something [19:33:19] not me yet [19:33:25] ah. ouch... [19:33:29] i can run upstairs and grab him if this is causing problems [19:33:57] i think he must have accidentally been moving the wrong switch [19:34:02] do i need to go upstairs ? [19:34:04] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:04] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:33] Leslie_sdtpa: I don't know what the effect of these network downtimes is, so I can't tell [19:34:45] this is management network unreachable [19:34:49] in tampa [19:34:56] dunno if you were needing it for your troubleshooting [19:35:03] if it's just mgmt that shouldn;'t hurt us [19:35:35] just mgmt [19:36:01] well, depends [19:36:13] if we lose an imagescaler [19:36:24] swapdeath and such [19:36:47] let's hope not [19:37:02] 20. that's noooo good [19:37:05] doesn't look good [19:37:13] New patchset: preilly; "move supervisord.wikipedia.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:37:43] AaronSchulz: do you have any means to check what all these apache processes are waiting? [19:37:49] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:50] changes went arounf to esams too right? [19:37:54] I can strace, but it's noisy; it'd be better if you have something in the app [19:37:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20930 [19:38:08] apergos: what changes? I rolled back sq51 [19:38:16] New patchset: preilly; "move supervisord.wikipedia.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:38:16] paravoid: I'd want to try bens rewrite patch first [19:38:18] uh huh [19:38:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20930 [19:39:05] 404 response time is going thrgh the roof [19:39:33] at 50 sec and climbing [19:40:05] New patchset: preilly; "move supervisord.wikipedia.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:40:16] AaronSchulz: we've completely rolled back the changes, I don't see why we should do a risky change right now [19:40:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20930 [19:40:52] as I see it, imagescalers are waiting on swift, not the other way around [19:41:16] which might be cascading though... [19:41:31] starting to drop a little [19:41:39] yeah, that makes sense. imagescalers are waiting on swift and swift waits on imagescalers [19:42:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:42:16] New review: Diederik; "Default is fine." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/20919 [19:42:33] that 404 timing is waiting on thumbs to get scaled. 
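"A lot of apaches in D" means processes stuck in uninterruptible sleep, typically on I/O. A quick way to enumerate them on a scaler without strace is to read the state field from /proc; this is a generic Linux sketch, not a tool that was actually used here:

    import os

    def d_state_procs():
        """List PIDs currently in uninterruptible sleep (state 'D'), e.g. blocked on NFS I/O."""
        hung = []
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/stat" % pid) as f:
                    fields = f.read().split()
                # field 1 is the comm, e.g. "(apache2)"; field 2 is the state letter
                if fields[2] == "D":
                    hung.append((int(pid), fields[1].strip("()")))
            except OSError:
                continue  # process exited while we were scanning
        return hung

    if __name__ == "__main__":
        for pid, comm in d_state_procs():
            print(pid, comm)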
[19:42:49] I'm not seeing any increase in backend ops from MW's side [19:43:31] paravoid: ben's change is quite simple [19:44:07] AaronSchulz: at this point no new original 404 requests are coming in, so it wouldn't have any effect. [19:44:24] maplebed: we're way out of squid realm, so you're the expert here [19:44:49] maplebed: right [19:44:56] I'm not sure what we would poke to speed things settling though. [19:44:58] we're already in an outage for almost half an hour [19:45:28] New patchset: preilly; "fix paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20932 [19:45:31] image scalers are in D, waiting for swift as I see it [19:45:46] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [19:45:55] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59677 bytes in 0.166 seconds [19:45:56] paravoid: it's not actually an outage; it's just a degredation of service. (eg http://commons.wikimedia.org/wiki/Special:NewFiles still loads) [19:45:57] sadly I think we have to ride it out [19:46:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20932 [19:46:13] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [19:46:19] huzzah! okay! :) [19:46:22] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [19:46:22] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [19:46:23] maplebed: fine line there [19:46:40] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [19:47:02] okay, this a huge drop [19:47:05] yes [19:47:25] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [19:48:23] maplebed: so, no idea why that happened or why it stopped? [19:48:23] it's interesting that ms5's cpu utliization flatlined during the time load on the scalers spiked: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:48:43] the way they all cleared up simultanously really feels like a freed lock [19:48:48] well it went up a little, normally floats around 4 or so [19:48:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20932 [19:49:42] maplebed: what's the theory? that image scalers were trying to write on ms5 but were blocked? [19:50:06] maybe? [19:50:20] does waiting on NFS look any different in terms of iowait from waiting on tcp? 
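On the iowait question: iowait alone looks the same either way, but /proc/<pid>/wchan names the kernel function a blocked task is sleeping in, which usually separates NFS/RPC waits from socket waits. A small sketch, assuming a Linux kernel that exposes symbol names in wchan:

    import sys

    def wchan(pid):
        """Return the kernel symbol a process is currently sleeping in ('-' if none)."""
        try:
            with open("/proc/%d/wchan" % pid, "rb") as f:
                return f.read().decode(errors="replace") or "-"
        except OSError:
            return "?"

    def classify(sym):
        if sym.startswith(("nfs_", "rpc_")):
            return "blocked in NFS/RPC"
        if "tcp" in sym or "sk_wait" in sym or "sock" in sym:
            return "blocked on a socket"
        return "other"

    if __name__ == "__main__":
        for arg in sys.argv[1:]:
            sym = wchan(int(arg))
            print(arg, sym, "->", classify(sym))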
[19:50:48] not in top, but I lsof'ed a few of the blocked apaches [19:51:10] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [19:51:10] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.26 ms [19:51:10] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.51 ms [19:51:11] no /mnt/upload6 open files at all [19:51:19] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 4.10 ms [19:51:19] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.35 ms [19:51:19] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [19:51:19] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [19:51:28] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [19:51:28] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms [19:51:28] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.58 ms [19:51:31] yeah, I didn't see any in lsof earlier either. [19:51:37] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms [19:51:37] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [19:51:37] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.79 ms [19:51:42] if we were going to have locking issues on ms5 I feel like we would have seen this some time ago [19:51:45] not now suddenly [19:51:46] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms [19:51:50] that too :) [19:51:51] agreed. [19:52:00] the load pattern just looks weird there. [19:52:04] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [19:52:04] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [19:52:13] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.84 ms [19:52:16] New patchset: preilly; "fix Could not find dependency File issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20933 [19:52:22] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.55 ms [19:52:24] hmm. [19:52:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20933 [19:53:25] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms [19:53:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20933 [19:53:45] hmm? [19:54:19] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [19:54:23] just puzzled. [19:54:40] AaronSchulz: have a look at: http://pastebin.com/nCxtDw3m [19:54:45] maplebed, apergos too I guess [19:54:50] that's an lsof from a blocked apache at the time [19:55:18] well that's not an lsof [19:55:23] thanks for it though :-) [19:55:29] so, ms-fe1 is requesting a resize from rendering.svc [19:56:00] srv220 has the memcache sockets open and *two* ms-fe.svc connections [19:56:03] why? [19:57:49] that's an Aaron question I guess [19:58:12] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [19:58:14] apergos: one php process with two connections? 
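The lsof check described above (does the blocked apache hold anything open under /mnt/upload* at all?) can be scripted against /proc/<pid>/fd. A minimal Linux-only sketch; the mount prefixes are the ones mentioned in the log:

    import os

    def open_paths(pid):
        """Resolve a process's open file descriptors to paths, like 'lsof -p <pid>'."""
        fd_dir = "/proc/%d/fd" % pid
        paths = []
        for fd in os.listdir(fd_dir):
            try:
                paths.append(os.readlink(os.path.join(fd_dir, fd)))
            except OSError:
                pass   # fd closed between listdir and readlink
        return paths

    def upload_mount_files(pid, prefixes=("/mnt/upload", "/mnt/thumbs")):
        return [p for p in open_paths(pid) if p.startswith(prefixes)]

    # Example: upload_mount_files(12345) returning [] matches what the lsof runs showed.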
[19:58:21] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [19:58:21] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [19:58:21] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [19:58:21] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [19:58:21] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [19:58:21] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [19:58:22] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [19:58:22] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [19:58:22] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [19:58:24] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [19:58:24] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [19:58:30] PROBLEM - Host ps1-a4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:58:30] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [19:58:30] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [19:58:31] sigh [19:58:38] noisy in here [19:58:59] what would it need to open connections for? [19:59:02] *two [19:59:15] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [19:59:21] also paravoid would you mind pasting the whole lsof? [19:59:24] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [19:59:24] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [19:59:33] PROBLEM - Host ps1-b4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:59:42] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [19:59:44] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [19:59:44] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [19:59:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [19:59:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:00:10] apergos: http://pastebin.com/gGG6AAQE [20:00:18] AaronSchulz: yes [20:00:22] thanks [20:00:27] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [20:00:46] AaronSchulz: one apache/php process, two ms-fe outgoing connections [20:00:56] could it be HEAD/PUT? 
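Counting how many TCP connections a given php/apache process holds to the Swift frontends (the "two ms-fe.svc connections" observation) can be done with psutil, a third-party module assumed to be available for this sketch; the frontend address is a placeholder, not the real service IP:

    import psutil  # third-party; assumed available on the scaler for this sketch

    SWIFT_ADDRS = {"10.2.1.27"}   # placeholder for ms-fe.svc -- not the real address

    def swift_conns_by_pid():
        """Map pid -> number of established TCP connections to the Swift frontends."""
        counts = {}
        for c in psutil.net_connections(kind="tcp"):
            if c.status == psutil.CONN_ESTABLISHED and c.raddr and c.raddr.ip in SWIFT_ADDRS:
                counts[c.pid] = counts.get(c.pid, 0) + 1
        return counts

    if __name__ == "__main__":
        for pid, n in sorted(swift_conns_by_pid().items(), key=lambda kv: -kv[1]):
            print(pid, n)   # a php process showing n == 2 matches the lsof observation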
[20:00:59] New patchset: preilly; "fix for : IOError: [Errno 2] No such file or directory: /wikipedia.yaml" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20936 [20:01:42] paravoid: cloufiles uses multiple curl handles, one for each of a few categories of requests...something about getting around not being able to reset certain curl options [20:01:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20936 [20:02:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20936 [20:02:24] RECOVERY - Host ps1-b1-sdtpa is UP: PING WARNING - Packet loss = 54%, RTA = 2.67 ms [20:02:24] RECOVERY - Host ps1-b3-sdtpa is UP: PING WARNING - Packet loss = 54%, RTA = 2.23 ms [20:02:24] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [20:02:24] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.02 ms [20:02:33] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [20:02:33] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.56 ms [20:02:33] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.38 ms [20:02:33] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [20:02:33] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.15 ms [20:02:33] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [20:02:33] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [20:02:34] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.72 ms [20:02:35] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [20:02:35] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms [20:02:35] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.32 ms [20:02:36] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [20:02:37] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.49 ms [20:02:43] umm, okay [20:02:45] looks like HEAD/GET_OBJ use different connections [20:02:46] hmm not much going on in here (the output) [20:02:51] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.92 ms [20:03:16] but why would it remain open? [20:03:35] stay open when? [20:03:48] you do a HEAD and then a GET, right? [20:04:14] check if it exists and if so, fetch it? [20:04:32] anybody want to review https://gerrit.wikimedia.org/r/#/c/20916/1/files/swift/SwiftMedia/wmf/rewrite.py? (thanks for the +1 AaronSchulz) [20:04:48] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [20:04:57] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.31 ms [20:05:30] paravoid: there is a HEAD of the thumb for a 304 check [20:05:54] New patchset: Ottomata; "Excluding /a/squid/archive from stat1 /a amanda backups." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20946 [20:06:03] also used for "already exists, don't need to regenerate" [20:06:17] aha [20:06:33] should it close that connection as soon as you're done though? [20:06:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20946 [20:06:50] that's probably minor compared to the rest however. 
[20:06:59] paravoid: it should be closed when the script is done [20:07:25] maplebed: could you explain your change a bit [20:07:36] maplebed: I'm still a swift newbie :) [20:07:41] I don't think closing that HEAD conn before streaming would make much difference [20:07:47] sure. [20:07:59] and CF would need some slight changes for that [20:08:17] AaronSchulz: right [20:08:24] the section of code is run after rewrite has already checked swift to see if it has the object. [20:08:46] it's splitting on the return status of the request on the way back. [20:08:48] AaronSchulz: so, you do HEAD for 304 and then either GET for the thumb or GET for the original, correct? [20:08:56] if it was a 200, it just passes it along. [20:09:06] if it's a 404, it triggers the 404 handling (call out to the scalers) [20:09:22] anyone available to help me review my java.pp commit? [20:09:23] https://gerrit.wikimedia.org/r/#/c/20741/ [20:09:40] the addition says to only trigger the 404 handler if it's dealing with a thumb (as opposed to an original or temp) [20:10:02] ottomata: sorry, not at the moment. [20:10:11] okay, great [20:10:30] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [20:10:30] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [20:10:31] (zone is set above and is either public, temp, or thumb) [20:10:39] s'ok, danke [20:11:24] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.69 ms [20:11:44] paravoid: I'm having a hard time finding the case where it GETs the original [20:12:22] the ms5 nginx error log for this time period is filled with suff like [20:12:24] 2470508460 upstream timed out (110: Connection timed out) while reading response header from upstream, [20:12:55] that's weird. who's hitting ms5 via nginx (and not nfs)? [20:13:17] not nw I mean a bit earlier [20:13:38] squids [20:13:41] I dunno why [20:13:45] yeah, I didn't think nginx on ms5 was getting any traffic since months ago. [20:13:48] maybe this was after the rollback [20:14:04] paravoid: well, of course it GETs the original when it has to render it, but that its [20:14:34] if the thumb is sized the same (or bigger than) as the original, you just get an error (as always) [20:15:37] * AaronSchulz wonders whats using nginx [20:15:41] !log added hosts boron/indium to nagios's nsca config [20:15:50] Logged the message, Master [20:16:58] AaronSchulz: what I'm saying is that we know that PHP wasn't blocked by the HEAD; it was blocked by the second HTTP connection, either waiting for a thumb or for the original [20:22:37] it must get requests when swift traffic is switched over and not otherwise [20:22:38] I don't see why the GET would hang, it's authenticated, so it shouldn't trigger any rewrite loops [20:22:42] no objections to my rewrite.py diff? [20:22:48] none [20:23:32] I would really love to see profiling around the calls to NFS and to swift (from the scalers) [20:24:26] maplebed: no objections but seems to be unrelated to the incident, no? [20:24:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:41] clients claim to be squids [20:24:56] paravoid: depends. if swift's sending originals 404s to thumb_handler freaks out the scalers, then yes, very related. [20:25:20] what do you mean by "freak out the scalers"? [20:25:39] and if that's the theory, then let's try to reproduce this. [20:26:06] yeah, I can do that. 
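Reduced to its shape, the change maplebed describes looks roughly like the following. This is a hedged sketch of the control flow only, not the actual rewrite.py middleware; the zone values come from the discussion above and the function names are invented for illustration:

    # Sketch of the control flow described above -- not the real rewrite.py code.
    def handle_swift_response(zone, status, response, handle_404, pass_through):
        """zone is 'public', 'temp' or 'thumb'; status is what swift returned."""
        if status == 200:
            return pass_through(response)          # object exists: stream it back
        if status == 404:
            if zone == "thumb":
                return handle_404(response)        # only thumbs go to the image scalers
            # originals / temp: don't bounce a malformed URL to the scalers,
            # just report the miss upstream
            return ("404 Not Found", "")
        return pass_through(response)              # other statuses pass through unchanged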
[20:28:13] New patchset: Ottomata; "Excluding /a/squid/archive from stat1 /a amanda backups." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20946 [20:28:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20946 [20:28:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.921 seconds [20:29:30] it failed quickly. My two tests were: [20:29:37] curl -v -H "User-Agent: benfoo" -H "Host: commons.wikimedia.org" http://srv222/w/thumb_handler.php/a/a2/Little_kitten_.jpg/1px-Little_kitten_.jpg [20:29:38] and [20:29:48] curl -v -H "User-Agent: benfoo" -H "Host: commons.wikimedia.org" http://srv222/w/thumb_handler.php/a/a2/Little_kitten_.jpg [20:30:53] does this mean anything to anyone: "err: Could not retrieve catalog from remote server: Error 400 on SERVER: Exported resource Nagios_host[db12] cannot override local resource on node spence.wikimedia.org" ? [20:31:12] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [20:31:50] I'm looking at graphite metrics FileBackendStore.* with aaron. [20:31:56] the timings are interesting [20:32:35] (FileBackendStore is in the category pulldown) [20:34:03] !log hand-tweezed spence:/etc/nagios/nsca_payments.cfg b/c puppet is broken for spence, restarted nagios on spence [20:34:12] Logged the message, Master [20:34:22] tehy're pointing a finger at NFS getting slow. [20:34:39] maplebed: what do you mean "failed quickly"? [20:35:11] I mean it gave me back a 404 in 0.025s. [20:35:40] maplebed, AaronSchulz: the lsofs were pointing to being blocked by ms-fe rather than NFS; NFS also wouldn't explain why it failed now of all days and times [20:36:02] whereas handing me back the 1px thumb took 0.07 ad generating a new thumb took 0.2s. [20:36:22] no, it doesn't, but the curve is remarkable. [20:37:20] where specifically? [20:38:07] paravoid: http://screencast.com/t/wWdsXPWWxYiL (because graphite is a pain and I can't link to it.) [20:38:48] but to get the graph for yourself, https://graphite.wikimedia.org/dashboard/, then choose FileBackendStore from the Category menu, then open the * folder then choose the tp90 graph. [20:39:06] isn't that https://gdash.wikimedia.org/dashboards/filebackend/ ? [20:39:51] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20946 [20:40:20] so what gets to ms5 are taking place when traffic for originals goes to swift? or rather, why are they taking place? [20:40:20] why yes, I think it is! [20:44:07] RoanKattouw: did that script finish? [20:48:56] maplebed: ? [20:49:06] yes? [20:49:44] so, what's the theory? [20:51:10] looking at the graph for the FileBackend stuff, [20:51:18] it appears that ms5 is the one that slowed down, not ms7. [20:51:39] the key: FileBackendStore.doQuickOperations == thumbnails (ms5) whereas [20:51:53] FileBackendStore.doOperations == originals (ms7). [20:52:26] but what's increasing the load on ms5 is a mystery [20:52:32] since the number of calls doesn't actually chancge much. [20:54:33] so it's almost midnight [20:54:45] and as usual about this time (since I've been on line since about 8 am) [20:54:49] I'm not good for much [20:55:04] http://www.megapro.net/products/specialty-drivers/electronic/ [20:55:06] cool shit. [20:55:06] apergos: do you know what the upload-settings.php is used for in /home/w/conf/squid/+ [20:55:13] i ordered some for the datacenter(s) [20:55:23] maplebed: I do, what do you need? 
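The same probe as the curl tests above, with timing, so the "404 in 0.025s" observation can be repeated in a loop. Standard library only; the host, paths and headers are the ones used in the tests:

    import http.client
    import time

    def probe(host, path, vhost="commons.wikimedia.org", agent="benfoo"):
        """Equivalent of the curl tests above: request path from host, report status and latency."""
        conn = http.client.HTTPConnection(host, timeout=30)
        start = time.time()
        conn.request("GET", path, headers={"Host": vhost, "User-Agent": agent})
        resp = conn.getresponse()
        resp.read()
        return resp.status, time.time() - start

    if __name__ == "__main__":
        for path in (
            "/w/thumb_handler.php/a/a2/Little_kitten_.jpg/1px-Little_kitten_.jpg",
            "/w/thumb_handler.php/a/a2/Little_kitten_.jpg",
        ):
            status, secs = probe("srv222", path)
            print("%s -> %d in %.3fs" % (path, status, secs))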
[20:55:36] I'm just looking for non-mediawiki references to ms5 [20:55:39] there's one in there. [20:56:21] this is a no-op. [20:56:24] ignore that [20:57:22] are you saying that the issue was ms5 being slow to respond? [20:57:32] ms5's graphs don't show this at all [20:58:00] I know. I don't know what to say. mediawiki's routines that needed to access it took a long time, but it doesn't seem loaded. [20:58:00] no, and I did an iostat over there during the time of the problem [20:58:07] it didn't look too nuts or anything [20:58:12] no load, no traffic spike, a small GET spike to 5 per second [20:58:43] http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=ms5.pmtpa.wmnet&m=load_one&s=by+name&mc=2&g=load_report&c=Miscellaneous+pmtpa [20:58:46] I mean, the symptoms we saw were that the scalers appeared hung. [20:58:59] the mediawiki profiling says it's waiting on nfs, but the nfs server seems fully alive and happy. [20:59:02] so ... [20:59:12] I'm stumped. [20:59:20] hung with no /mnt/upload6 fds with 3-4 lsof I did [20:59:25] or /mnt/upload* [20:59:52] (/mntlupload6 is origs; thumbs are /mnt/thumbs) [21:00:08] lookint at atop -r I don't see anything oout of the ordinary during the time frame either [21:01:13] hmm. Aaron|laptop is there anything else going on in the section of code profiled by the fileBackendSTroe.doQuickOperationsInternal-local-NFS.tp90? [21:01:18] Aaron|laptop: load one went from 6 to 9, that's nothing compared to the effects we saw [21:01:30] esp. for an 8-core machine [21:02:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:36] yeah you can see some higher loads in July [21:03:47] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&v=98.67&m=IO_Utilization&r=year&z=small&jr=&js=&st=1345582917&vl=%25&z=large this seems to argue against ms5 being at fault also [21:04:11] I have got to get food, since it appears we are going to psot mortem this to death (ours, of starvation). back shortly [21:06:39] I haven't gotten lunch yet either. [21:07:01] AaronSchulz: Yes, it finished [21:08:32] Aaron|laptop: http://pastebin.com/Qnxn6djh [21:11:29] \o/ [21:11:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.851 seconds [21:14:41] going to get lunch. back in 20 [21:26:41] !log resumed s3 user.user_email migrations [21:26:50] Logged the message, Master [21:27:52] dessert first [21:30:22] New review: Platonides; "Thanks Daniel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20917 [21:43:04] any ops folks available to help with a ticket? [21:46:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:39] !log running purgeParserCache.php for all dbs [21:59:48] Logged the message, Master [22:00:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [22:02:28] andrew_wmf: sorry, unavailable [22:03:07] andrew_wmf: rt-3459? [22:03:17] I can help you with that. 
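Graphite's render endpoint can return the same tp90 series as JSON, which is easier to share than a dashboard link. The metric paths below are assumptions pieced together from the names mentioned above (FileBackendStore.doQuickOperationsInternal / doOperationsInternal), so they may need adjusting:

    import json
    import urllib.request

    GRAPHITE = "https://graphite.wikimedia.org"
    # Metric paths are assumptions based on the category/metric names discussed above.
    TARGETS = [
        "FileBackendStore.doQuickOperationsInternal-local-NFS.tp90",
        "FileBackendStore.doOperationsInternal-local-NFS.tp90",
    ]

    def fetch_tp90(target, window="-2hours"):
        url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, target, window)
        with urllib.request.urlopen(url, timeout=30) as resp:
            series = json.load(resp)
        # Each element: {"target": ..., "datapoints": [[value, timestamp], ...]}
        for s in series:
            points = [v for v, _ in s["datapoints"] if v is not None]
            if points:
                print(s["target"], "max over window:", max(points))

    if __name__ == "__main__":
        for t in TARGETS:
            fetch_tp90(t)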
[22:03:32] i never get an autoreply with the ticket number, but for the mailman subscribers [22:03:49] a number of these on the image scalers from the switchover period: [22:04:06] Aug 21 18:59:46 srv224 apache2[334]: [error] [client 10.0.6.211] File does not exist: /usr/local/apache/common/docroot/default/wikipedia [22:04:06] Aug 21 18:59:46 srv224 apache2[334]: [error] [client 10.0.6.211] File does not exist: /usr/local/apache/common/docroot/default/w [22:04:07] etc [22:04:17] ah hah and [22:04:19] Aug 21 19:20:14 srv224 kernel: [12476563.950017] nfs: server ms5.pmtpa.wmnet not responding, timed out [22:04:24] a few of these in the end [22:04:27] woo! [22:04:32] is that in dmesg? [22:04:35] or kern.log? [22:04:36] syslog [22:05:36] then there's things like this: [22:05:43] Aug 21 19:32:57 srv224 apache2[19219]: PHP Notice: FSFileBackend::doStoreInternal: copy() failed but returned true. in /usr/local/apache/common-loc [22:05:43] al/php-1.20wmf9/includes/filerepo/backend/FSFileBackend.php on line 208 [22:06:10] the first of the nfs imteouts is the one I pasred, after that I see them regularly [22:06:23] Aaron|laptop got kicked off... getting him to read over my shoulder. [22:06:31] until [22:06:32] Aug 21 19:44:42 srv224 kernel: [12478032.323779] nfs: server ms5.pmtpa.wmnet not responding, timed out [22:06:34] last one [22:07:25] no indication why it's timing out though, is there? [22:07:30] never is [22:08:02] so first I'd want to know what http requests went there and why (I only see the few errors, it's not very helpful) [22:08:36] * apergos goes to scry the ganglia graphs again [22:09:46] andrew_wmf: replied and closed the ticket. please let me know if you don't get the data you need. [22:10:29] apergos: what do you mean which http requests went there? you mean which requests are associated with the MW errors? or what nginx was serving? [22:10:30] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms5.pmtpa.wmnet&m=load_one&r=year&s=by%20name&hc=4&mc=2&st=1345586932&g=network_report&z=large&c=Miscellaneous%20pmtpa can't tell much from looking at these. rats [22:10:38] got it thx [22:11:06] I mean during the tme that it was receiving http requests, where were they coming from, why, how many a second or minute or whatever [22:11:19] apergos: it is interesting that network out dropped so severly yesterday. [22:13:04] and I'd like to know where this path comes from /usr/local/apache/common/docroot/default/wikipedia that we see in the scaler logs [22:13:28] which might be someting for Aaron|laptop again :-P [22:13:41] what? [22:13:49] Aug 21 19:22:02 srv224 apache2[14352]: [error] [client 10.0.6.211] File does not exist: /usr/local/apache/common/docroot/default/wikipedia [22:14:03] things like these on the scalers when we were doing the swift test [22:14:17] that and File does not exist: /usr/local/apache/common/docroot/default/w [22:14:40] anything jump out at you as the cause? [22:15:10] and those were not there before? [22:15:29] apergos: we do have this graph: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&v=%200&m=Gets_per_second&r=day&z=default&jr=&js=&st=1345582194&vl=qps&z=large [22:15:41] that's the output of a tcpdump counting the number of GETs. [22:15:44] (on ms5) [22:15:56] oh yay that's helpful, I thoguht there was but couldn't find it [22:16:16] it's super-ghetto. it's a tcpdump piped to python running in a screen sessoin on ms5. 
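The "super-ghetto" gauge described above is just tcpdump piped into a small per-second counter; a rough reconstruction follows. The tcpdump flags and filter in the comment are assumptions about how it was wired up, not the original command line:

    #!/usr/bin/env python
    # Rough reconstruction of the counter described above; run as e.g.
    #   tcpdump -l -A -s 256 'tcp dst port 80' | python gets_per_second.py
    # (flags and filter are assumptions, not the original invocation)
    import sys
    import time

    def main():
        count = 0
        bucket = int(time.time())
        for line in sys.stdin:
            if "GET /" in line:
                count += 1
            now = int(time.time())
            if now != bucket:
                print("%d GETs/sec" % count)
                sys.stdout.flush()
                bucket, count = now, 0

    if __name__ == "__main__":
        main()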
[22:16:18] Aaron|laptop: earliest one in today's log is from when we started mucking about [22:16:59] 18:27:57 utc [22:17:34] last one is 19:26:20 utc [22:17:40] so yeah it's the swift cutover all right [22:17:45] were they spamming? [22:19:56] 530 of em in that time frame [22:21:12] looking at rewrite.py I was wrong about what it's sending the scalers when an original 404s. [22:21:14] the first 20 min only a few, then really quite a lot [22:21:27] I'm not sure exactly what it does send yet, but it's not a well fromed URL. [22:21:39] i need to switch the s2 master, it will be read-only for a couple minutes [22:24:10] !log s2 master swap - new repl position MASTER_LOG_FILE='db54-bin.000302', MASTER_LOG_POS=870283287 [22:24:19] Logged the message, Master [22:25:14] apergos: which log file had that error? [22:25:24] syslog on srv224 [22:25:29] I expect all the scalers have em [22:25:47] I found the cause! \o/ [22:26:10] yup just checked 219 [22:26:11] oh?? [22:26:32] so when an original hits a 404 in swift, it's passing the upload URL unmodified through to the scaler. [22:26:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 220 seconds [22:26:37] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 220 seconds [22:26:51] basically curl -v -H "User-Agent: benfoo" -H "Host: upload.wikimedia.org" http://srv222/wikipedia/commons/a/a2/Little_kitten_.jp [22:26:55] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 239 seconds [22:27:04] and that curl triggers the docroot file not found thing on srv222's log. [22:27:22] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 265 seconds [22:27:22] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 265 seconds [22:27:22] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CRIT replication delay 265 seconds [22:27:22] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 266 seconds [22:27:44] well that's broken [22:27:57] the fault is in the 404 handler in swift, which expects only thumbnails [22:28:12] (and has an unterminated if: block around creating a proprely formed URL) [22:28:25] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [22:28:27] line 163 if you're interested. [22:28:34] er which file? [22:28:39] rewrite.py [22:28:52] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [22:29:11] sadly though, I don't think it lends any insight into why ms5 ate itself. [22:29:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:29:37] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [22:29:47] well one step at a time [22:30:05] :P [22:30:07] next (but not tonight) for me would be to see where ms5 gets come from when swit is on [22:30:12] *swift [22:30:22] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [22:30:22] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [22:30:22] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 0 seconds [22:30:23] i.e. what generates them [22:30:44] apergos: that's going to be a challenge given they're not logged anywhree. [22:31:01] though we could run a tcpdump on ms5 if we could trigger them to happen again. 
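Counting those scaler syslog entries inside the incident window (the "530 of em" figure above) can be made explicit with a small script. The log path, message patterns and window are taken from the messages quoted earlier; the year (2012) is assumed since syslog timestamps omit it:

    import re
    import sys
    from datetime import datetime

    # Patterns quoted from the scaler syslog earlier in the log.
    PATTERNS = (
        "File does not exist: /usr/local/apache/common/docroot/default",
        "nfs: server ms5.pmtpa.wmnet not responding",
    )
    # Incident window (UTC) mentioned above; year assumed to be 2012.
    START = datetime(2012, 8, 21, 18, 27, 57)
    END = datetime(2012, 8, 21, 19, 26, 20)
    TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d+ \d{2}:\d{2}:\d{2})")

    def in_window(line):
        m = TS_RE.match(line)
        if not m:
            return False
        ts = datetime.strptime("2012 " + m.group(1), "%Y %b %d %H:%M:%S")
        return START <= ts <= END

    def main(path):
        hits = 0
        with open(path, errors="replace") as f:
            for line in f:
                if any(p in line for p in PATTERNS) and in_window(line):
                    hits += 1
        print(hits, "matching entries in the window")

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog")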
[22:31:03] I have a few logged in the error log on ms5 [22:31:53] ah, so there are. [22:32:23] some are "index forbidden" but the connection timeout ones at least show some urls [22:32:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:53] /wikipedia/commons/thumb/2/25/So_tired.jpg/160px-So_tired.jpg (yes, I chose this one on purpose) for example... looks like a well-formed thumb request, right? [22:33:12] yup. [22:33:25] though original couldn't be found. [22:34:03] also, apergos, the error log has 201 requets from sq41, 27 from sq42, and 2 or 2 from a few others. [22:34:20] yes, I noticed thatmost seemed to be sq41 [22:34:25] not sure what that means either [22:34:47] and maybe it will all be a dead end but getting the noise out of the system is always good anyways [22:36:36] see 5 gets a second, it's nothing [22:36:37] I've got an idea. [22:36:44] used to be 40 [22:36:47] ok... listening [22:36:59] the current squid config says [22:37:04] 1) allow thumbs to swift [22:37:11] 2) deny thumbs but allow everythincg else to ms7 [22:37:16] 3) allow thumbs to ms5 [22:37:26] each of those statement shas a connect-timeout and a max-conn. [22:37:41] uh huh [22:37:41] if either the connect timeout expires or the max-conn is exceeded, does it fall through to the next rule? [22:37:49] I have no idea [22:37:51] (which would allow things to fall through and hit ms5) [22:37:58] * apergos goes to look at the squid conf [22:38:03] and then to rtfm [22:38:54] poor apergos [22:44:54] New patchset: Bhartshorne; "change to swift to only send thumbs to the image scalers instead of all 404s." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20916 [22:45:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.420 seconds [22:45:07] well that was scary. [22:45:20] ^^ rewrite.py change to log and return an error when it hits the case we were just discussing. [22:45:26] apergos: do tell. [22:45:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20916 [22:45:37] X locked my cursor to the movement pointer and wouldn't give it back [22:46:03] finally guessed the right key combo in xcfe to tab through the windows, that released it [22:46:53] for whatever reason alt ctl fn wasn't giving me other consoles, probably some fedora 16 weirdness [22:52:17] New patchset: Pyoungmeister; "first bits of new coredb module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21019 [22:52:51] ok (on the rewrite.py change) [22:52:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21019 [22:53:16] are we really doing this again tonight? or are we done for a couple days? [22:55:55] maplebed: [22:56:08] we have a window tomorrow morning. [22:56:14] uuggghh [22:56:15] I'm going to push the rewrite changes later today. [22:56:20] ok [22:56:55] tomorrow I'll try to rack my brain to figure out how we can track the cause of the slowdown [22:57:35] is ms5 unhappy cause it has too many scalers hanging on it and they are waiting on swift? or is it some other conbination? tc) [22:57:36] I would love it if you and paravoid work on stuff in yoru morning tomorrow [22:57:37] but for now, [22:57:44] sleep [22:57:52] cause my morning is in 6 hours [22:57:56] night! [22:59:47] g'night. 
[23:10:57] !log running user.user_email index migration on s4 [23:11:06] Logged the message, Master [23:12:12] does anyone here know pdns well? =) [23:12:42] if so should "allow-recursion=127.0.0.1" be allowed? it looks like it maybe shouldn't [23:14:05] !log truncated enwiki.flagged(images|templates) [23:14:15] Logged the message, Master [23:15:03] New patchset: Ryan Lane; "Change bind address for NAT for eqiad network node" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21021 [23:15:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21021 [23:15:59] !log running user.user_email index migration on s5 [23:16:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21021 [23:16:09] Logged the message, Master [23:17:58] !log running user.user_email index migration on s6-7 [23:18:07] Logged the message, Master [23:19:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:44] That's intersting.. I was going to ask for that user_email index to be added to the centralauth globaluser table [23:21:48] Turns out it already has key (gu_email), [23:30:18] hrm, i just had to kill a "SELECT /* ApiQueryLogEvents::execute */ /*! STRAIGHT_JOIN */ log_type,log_action.. query on db26 that had been running from mw67 for 421132 seconds [23:32:03] ouch [23:32:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.727 seconds [23:33:37] not even on enwiki [23:40:12] bc says: scale=2;421132/86400=4.87 [23:41:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20916 [23:46:12] !log deployed change to swift proxy rewrite to only send thumb 404s to the image scalers gerrit change 20916 [23:46:21] Logged the message, Master
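Runaway queries like the 4.9-day SELECT mentioned above can be spotted from SHOW PROCESSLIST before they get that old. A sketch using PyMySQL (third-party, assumed available); credentials and host are placeholders:

    import pymysql  # third-party driver; assumed available for this sketch

    THRESHOLD = 3600  # seconds; flag anything running longer than an hour

    def long_running_queries(host, user, password):
        conn = pymysql.connect(host=host, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW FULL PROCESSLIST")
                # Columns: Id, User, Host, db, Command, Time, State, Info
                for row in cur.fetchall():
                    qid, quser, qhost, db, cmd, secs, state, info = row[:8]
                    if cmd == "Query" and secs and secs > THRESHOLD:
                        print("id=%s user=%s db=%s time=%ss query=%.80s"
                              % (qid, quser, db, secs, info or ""))
        finally:
            conn.close()

    # Example (placeholder credentials): long_running_queries("db26.example", "watchdog", "secret")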