[00:12:36] New patchset: Aaron Schulz; "Unbreak math directory settings." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20844 [00:18:47] New patchset: Ryan Lane; "Change some nova policy rules from netadmin to sysadmin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20847 [00:19:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20847 [00:19:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20847 [00:22:07] New patchset: Dzahn; "move index.html, add resource for wmf planet logo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20848 [00:22:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20848 [00:23:47] !log fenari - install a couple lib* package upgrades [00:23:56] Logged the message, Master [00:30:07] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [00:32:49] !log removed ms-be6 from rotation due to bad memory and repeated kernel panics [00:32:59] Logged the message, Master [00:43:27] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:54] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:57] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.315 second response time [00:45:15] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [00:45:49] New patchset: Dzahn; "fix syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20850 [00:46:30] New review: Dzahn; "This apparently affected hosts not even using the class. Sorry" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20850 [00:46:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20850 [01:15:46] New patchset: Dzahn; "disable austin account and remove from singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20854 [01:16:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20854 [01:17:07] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 184 seconds [01:17:07] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 184 seconds [01:41:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 242 seconds [01:42:19] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 296 seconds [01:47:03] !log adding user.user_email index to enwiki via osc [01:47:14] Logged the message, Master [01:48:19] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 656s [01:52:58] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 186 seconds [01:53:07] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 190 seconds [01:56:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:58:49] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 27 seconds [01:58:49] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [01:59:13] binasher: are you dropping those fr tables soon? :) [01:59:24] I mean truncating [01:59:36] AaronSchulz: i was just going to ask, truncate not drop? 
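The "MySQL Slave Delay" and "Misc_Db_Lag" alerts above come from checks that compare Seconds_Behind_Master against warning/critical thresholds. A minimal sketch of that kind of check follows; the host, credentials and the 180s/300s thresholds are illustrative assumptions, not the production values.

    #!/usr/bin/env python
    # Minimal replication-lag check in the spirit of the "MySQL Slave Delay"
    # alerts above.  Host, credentials and thresholds are assumptions.
    import sys
    import pymysql
    import pymysql.cursors

    WARN, CRIT = 180, 300  # seconds; assumed thresholds

    def slave_lag(host):
        conn = pymysql.connect(host=host, user='nagios', password='secret',
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
                return None if row is None else row['Seconds_Behind_Master']
        finally:
            conn.close()

    if __name__ == '__main__':
        lag = slave_lag(sys.argv[1] if len(sys.argv) > 1 else 'db1035.example')
        if lag is None:
            print("UNKNOWN - not a replica or replication stopped"); sys.exit(3)
        if lag >= CRIT:
            print("CRITICAL - replication delay %d seconds" % lag); sys.exit(2)
        if lag >= WARN:
            print("WARNING - replication delay %d seconds" % lag); sys.exit(1)
        print("OK - replication delay %d seconds" % lag); sys.exit(0)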
[01:59:51] right, just for sanity, and to have the tables for completeness [02:00:07] so truncate (flaggedtemplates,flaggedimages), enwiki only [02:01:39] i'll do it after the enwiki.user migration i'm running now is done.. well, probably tomorrow. definitely tomorrow :) [02:02:17] after wmf10 is finished, a filejournal column can also be dropped (fj_path_sha1) [02:02:28] * AaronSchulz needs to roll those tables sometime [02:03:57] binasher: have you been able to ping the maria folks? [02:04:09] yeah [02:04:30] initially they said they had no plan on making xa replication safe but could if they were sponsored [02:04:58] then they said they thought of a way to do it that might not be so hard, so they might be able to take care of it as a regular bug fix [02:05:57] nothing since then, but there's a bug in pending status [02:06:22] there have also been some recent oracle developments that make me more interested in eventually moving to mariadb [02:06:58] http://techcrunch.com/2012/08/18/oracle-makes-more-moves-to-kill-open-source-mysql/ [02:07:20] http://ronaldbradford.com/blog/when-is-a-crashing-mysql-bug-not-a-bug-2012-08-15/ [02:07:43] and from the mysql dev lead at facebook - http://mysqlha.blogspot.com/2012/08/less-open-source.html [02:10:14] Yay, thanks binasher [02:11:33] hrm [02:20:32] * AaronSchulz lols at the trolling on http://blog.mariadb.org/disappearing-test-cases/ [02:23:07] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [02:23:25] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:27:41] AaronSchulz: just looking at I9865de75 [02:27:54] it looks kind of epic [02:28:22] too late I guess [02:45:10] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:40] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:51:10] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:31] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:00:10] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:19] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:15:19] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:16:49] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 4 seconds [03:21:28] PROBLEM - MySQL Idle Transactions on db39 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
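For reference, the "truncate, not drop" plan agreed above (empty flaggedtemplates and flaggedimages on enwiki but keep the tables for schema completeness) comes down to two statements. A hedged sketch, with the master host and credentials as placeholders:

    # Sketch of the enwiki-only cleanup discussed above: empty the two tables
    # but keep them.  Host and credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host='enwiki-master.example', user='wikiadmin',
                           password='secret', db='enwiki')
    try:
        with conn.cursor() as cur:
            # TRUNCATE empties the table but leaves the schema in place,
            # which is the "for completeness" part of the discussion above.
            cur.execute("TRUNCATE TABLE flaggedtemplates")
            cur.execute("TRUNCATE TABLE flaggedimages")
        conn.commit()
    finally:
        conn.close()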
[03:29:34] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [03:29:52] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [03:31:31] RECOVERY - MySQL Idle Transactions on db39 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:44:00] !log adding user.user_email index on s2 [03:44:09] Logged the message, Master [04:00:55] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [04:00:55] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [04:00:55] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:00:55] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [04:02:10] !log adding user.user_email index on s3 [04:02:19] Logged the message, Master [04:06:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:29:58] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:29:58] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [05:41:41] !log halting s3 user migration on hiwiktionary, will resume in the morning (pst) [05:41:51] Logged the message, Master [09:11:19] morning [09:11:52] not any more! [09:30:41] true that [09:40:53] New patchset: preilly; "production vumi settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20867 [09:41:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20867 [09:42:46] paravoid: ping [09:43:09] pong [09:43:28] paravoid: can you merge https://gerrit.wikimedia.org/r/#/c/20867/ for me once jerith signs off on it? [09:43:28] that ^^^? [09:43:34] heh [09:45:04] sure, when do you expect that? [09:45:16] paravoid: in like 5 minutes tops [09:45:27] paravoid: actually right now [09:45:28] ah. okay, sure. [09:46:02] btw, what's the canonical email address for the mobile team? [09:47:22] paravoid: mobile-tech@wikimedia.org [09:47:28] paravoid: is that what you mean? [09:47:41] New review: Jerith; "Happiness and kittens." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/20867 [09:47:50] yeah, I guess so [09:48:23] we got a mail to noc@ from a mobile company's engineer with comments about the layout and stuff [09:48:37] paravoid: oh okay cool [09:48:47] paravoid: can you please merge https://gerrit.wikimedia.org/r/#/c/20867/ and push it to sock puppet [09:48:55] already onit [09:49:02] paravoid: sweet thanks! [09:49:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20867 [09:49:30] paravoid: could you also force a puppet run on zhen and silver? [09:50:03] already on that too [09:50:26] paravoid: sweet thanks! [09:52:44] done. 
[09:53:49] paravoid: awesome thanks so much [09:58:26] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [09:58:26] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [09:58:27] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [09:58:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:58:27] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [09:58:28] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [09:58:28] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:58:29] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [09:58:30] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [10:14:16] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [10:16:31] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [10:20:43] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:25:01] New patchset: Mark Bergsma; "Add labs subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:25:42] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20873 [10:26:10] heya mark [10:27:14] hi [10:27:47] New patchset: Mark Bergsma; "Add labs subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:28:13] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [10:28:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20873 [10:30:00] is there any reason to not run "git init" in /h/w/conf/squid? [10:30:09] or should I attempt to migrate RCS to git instead? [10:30:35] git could work [10:30:37] i wouldn't bother migrating [10:30:58] do we care about RCS history? [10:31:04] do put in a check in the makefile then [10:31:05] no [10:31:08] i never used it, for one [10:31:12] so it's missing a whole lot anyway [10:31:15] hahahaha [10:31:54] sorry for laughing, I was just wondering "who would use RCS?!?" [10:32:01] so annoying [10:32:44] Isn't RCS the pre-cvs thing that literally just stores diff's you then have to patch back in? [10:33:26] isn't that something you and everyone else knows so it's pointless to mention it? [10:34:35] Dunno :P I started out with svn as a vcs [10:35:32] youngster [10:36:02] New patchset: Mark Bergsma; "Add labs subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:36:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20873 [10:37:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20873 [10:44:16] okay. 
switched to git, wikitech updated (and while at it, removed the outdated cluster information at the top) [10:44:49] how wonderful [10:45:14] yeah, I had to make copies yesterday to facilitate rollback [10:45:18] New patchset: Tpt; "(bug 37483) Add a list of Page and Index namespaces ids in order to use the new namespace configuration system included into Proofread Page (change: https://gerrit.wikimedia.org/r/#/c/17643/ )" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20876 [10:45:26] I'm glad I didn't forget anything but I don't want to take that risk again [10:45:26] I saw [11:11:00] well that was annoying [11:11:05] maybe my modem is starting to go [11:16:04] so you remove a squid from the squid configs on fenari, the ones just moved to git. you push that out to the front ends. now you make changes to the config that are intended to be deployed just to the backend squid. how do you get "generate.php" to regenerate that file, now that the squid isn't in the list? [11:21:19] revert that and regenerate? we do have support for partial deploys [11:21:57] so you mean revert, then do the backend changes, then generate, then push just to the one backend squid. [11:22:03] yes. [11:22:16] that sounds like it would work [11:24:00] you see any easier way to do it? [11:25:28] no, but that's easy enough [11:25:30] I'd be happy to do it [11:25:43] just as long as it's written down I'm fine with it [11:26:41] so, lunch, be back in an hour or so. [11:26:48] enjoy [13:05:28] !log Setup labs networking in eqiad. Reworked VRRP setup using apply groups [13:09:43] mark: do morebots [13:09:46] *no [13:10:29] 2012-08-21 9.06 -!- morebots [~morebots@wikitech.wikimedia.org] has quit [Ping timeout: 265 seconds] [13:19:23] ping mutante [13:22:14] early for that [13:23:00] probably [13:23:26] no problem, there is no hurry [14:02:17] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [14:02:17] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [14:02:17] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [14:02:17] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:08:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:23:56] New patchset: Ottomata; "misc/statistics.pp - setting up rsync job for sampled-1000 logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20897 [14:24:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20897 [14:25:02] New patchset: Ottomata; "misc/statistics.pp - setting up rsync job for sampled-1000 logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20897 [14:25:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20897 [14:26:04] hi guys, anyone around to help me with some reviews? [14:26:07] i have 3 reallllyyyy easy ones [14:26:13] and one that might require more reading [15:22:01] morning paravoid, apergos [15:22:07] good morning [15:22:13] just checking in before I cross the bay to get to the office. [15:22:30] hello [15:23:33] anything of note re: our upcoming window? any comments on the squid configs that collectively aaron and I suggested? [15:23:45] feel comfortable we can make it work this time? 
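The "do put in a check in the makefile then" remark above is about refusing to regenerate or deploy the squid configs unless /h/w/conf/squid is a clean git checkout. A sketch of such a guard; the path is the one mentioned above, and the actual check that was added is not shown in this log.

    # Guard in the spirit of "put in a check in the makefile": refuse to
    # proceed unless the squid config directory is a git checkout with no
    # uncommitted changes.
    import subprocess
    import sys

    CONF_DIR = '/home/wikipedia/conf/squid'  # a.k.a. /h/w/conf/squid

    def ensure_clean_checkout(path):
        try:
            status = subprocess.check_output(
                ['git', 'status', '--porcelain'], cwd=path)
        except (OSError, subprocess.CalledProcessError):
            sys.exit("%s is not a git checkout; init/clone it first" % path)
        if status.strip():
            sys.exit("uncommitted changes in %s; commit before deploying" % path)

    if __name__ == '__main__':
        ensure_clean_checkout(CONF_DIR)
        print("ok: %s is a clean git checkout" % CONF_DIR)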
[15:23:47] :) [15:23:56] have a look at the etherpad [15:24:06] both of us have made some comments/added stuff [15:24:22] I think if we have a specific goal and don't add ot it, we can get it done [15:26:04] do we have a specific goal? [15:27:22] traffic for originals to swift, ignore all exceptions for now, that is what I would say [15:27:34] +1 [15:29:15] neither of you wanted to pull the change aaron suggested into the etherpad? [15:29:42] (enumerating projects to make the originals match more strict) [15:30:46] also, have you drafted a squid rule that will catch the auth headers? or is that something we'll do during the window? [15:31:15] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [15:31:15] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:31:43] enumerating all the projects? [15:31:53] did you read aaron's email? [15:31:59] (did he send it to everybody?) [15:32:02] oh. I was looking at the ehterpad [15:32:18] hey ben [15:32:22] morning. [15:32:28] do you want to do the originals migration today? or the upgrade? [15:32:41] btw, at this point I'm not going to make it to the office until about 9:30... :( [15:32:44] mark: originals. [15:33:03] i'm a bit concerned about it [15:33:07] there are some unknowns from yesterday [15:33:15] like where the extra load came from yesterday, on the image scalers [15:33:17] and they've been discussed, no? [15:33:24] perhaps I haven't seen the result [15:33:27] ah, that one. [15:33:34] I'm fine with listing them, but if we do that we need to make sure that additionof a new prject (*cough*wikitravel*cough*) won't screw us, that it's well documented to add it to the acls [15:33:37] it seemed as if swift didn't have many thumbs [15:33:49] whereas it should have pretty much everything in squid by now [15:34:05] mark: I'm not sure that hypothesis is correct. [15:34:14] there wasn't a large bump in the number of objects in swift [15:34:29] it might be something else [15:34:30] it would have been visible on the graph that tracks the delta on the number of objects. [15:34:32] but we don't currently know what [15:34:41] and since that was just one squid... [15:34:48] it's also the case that the image scalers flipped out both before and long after the squid change. [15:34:56] they were flipping out as late as 6pm yesterday. [15:35:00] yes, it's possible it was unrelated thumb.php traffic not coming from swift [15:35:06] which leads me to believe it was unrelated. [15:35:17] we'll be testing on one squid again first [15:36:06] wouldn't it be better to focus on the 1.5 upgrade first now? [15:36:23] why? [15:36:25] looking at the week graph, traffic on the image scalers has been markedly higher since last thursday. I'm pretty sure it's unrelated. [15:36:50] I'm not feeling particularly well prepared for the originals today either tbh [15:36:50] mark: robla and I agree that we'd like to get this switch done first (mostly because we've already started, but also because I"m not comfortable yet with the amount of testing we've done with mediawiki on the upgrade) [15:37:12] i also see mediawiki is not quite optimized yet [15:37:17] all the separate auth request [15:37:21] if we need to do more prep to move the originals traffic, then let's do the prep [15:37:24] also why the separate HEADs before GETs? 
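On the "separate HEADs before GETs" question above: checking whether a thumb exists (HEAD) and then fetching it (GET) costs two round trips to Swift where one would do. A hedged illustration of the pattern and the obvious collapse into a single request; the URL is a made-up example, not a real object.

    # Illustration of the "HEAD then GET" pattern questioned above, and the
    # single-GET alternative.  The URL is a made-up example.
    import requests

    url = ('http://ms-fe.svc.pmtpa.wmnet/wikipedia/commons/thumb/'
           'b/bd/Example.jpg/120px-Example.jpg')

    # Two round trips: existence check, then fetch.
    def fetch_two_steps(url):
        if requests.head(url).status_code == 200:
            return requests.get(url).content
        return None

    # One round trip: just GET and treat a 404 as "missing".
    def fetch_one_step(url):
        resp = requests.get(url)
        return resp.content if resp.status_code == 200 else None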
[15:38:00] we have a braindump on the pad, but considering how many cases we found in < 24h, I don't feel comfortable that we don't have even more [15:38:16] also, have in mind that removing a backend squid for testing is going to produce the exact same result that we had yesterday [15:38:33] unless it's the same squid, which presumably means same thumbs [15:38:43] paravoid: by switching just originals instead of the default, we skirt all the edge cases we've found so far. [15:38:52] well we don't know that, and if it does then we can investigate the problem [15:39:04] which is I guess one reason mark is uneasy about the move [15:39:08] mark: re: head vs. get - MW has a number of cases where it takes a different action based on whether the thumb already exists. [15:39:34] but it always requests the thumb after it finds it exists, doesn't it? [15:39:36] anyway. I've got to leave to get into the office. [15:39:46] no, we don't. we haven't researched the "block non-HEAD/GET" or "block X-Authenticate" options [15:40:15] if I find the token, I can send a DELETE with X-Authenticate and -afaik- delete an image via the squids [15:40:18] that's bad imho [15:41:22] if we need to do more prep we'll do more prep. it's got to be done sooner or later. if we don't make this window we don't make it, but I'd rather have some progress made [15:41:31] +1 apergos. [15:41:48] please feel free to start without me; I'll be back online as soon as I get there. [15:42:03] heyaaaa RobH [15:42:09] you around? [15:42:11] see you in a bit [15:42:16] I don't like rushing things. yesterday we almost brought down all the squids. [15:42:40] well by start I have in mind writing down things like the delete issue (assuming we do whitelist) [15:42:55] delete has nothing to do with white/blacklist [15:43:24] (and it's already written, hours ago) [15:43:46] nothing to do with our implementation, but i really don't like that swift auth tokens are good for a long period of time for any request type, and passed in plain text.. vs. oauth where a request is signed instead of containing the auth token, and can't be replayed after a second [15:43:48] sorry, but I guess I didn't see it [15:44:18] binasher: auth tokens are valid for a day afaik [15:44:38] you have a username/password, you login and get back a short-lived token (if you consider a day as short) [15:45:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20746 [15:45:32] on the other hand, i liked that i could tcpdump a swift fe and then craft whatever valid request i wanted from seeing one result [15:45:33] do we ever need to delete (originals)? [15:45:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20897 [15:46:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20454 [15:47:27] the auth just seems pointless, if all requests to swift are sanitized [15:47:51] it's intended for a different use case than ours (generally) [15:47:54] or if mw is going to request the token constantly, might as well expire tokens every minute [15:48:41] apergos: yes.. it's intended for an s3 sort of case. in which case it should use signed requests instead of actually passing around a token [15:50:13] is that something we need to fix now? 
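For context on the token discussion above: with Swift's built-in auth you present the account user and key once, get back an X-Auth-Token plus a storage URL, and every later request just carries that bearer token until it expires, which is why a token read off the wire (the tcpdump remark) can be replayed. A hedged sketch of the flow; the auth endpoint, credentials and container name are placeholders.

    # Sketch of the Swift (tempauth-style) token flow discussed above.  Endpoint
    # and credentials are placeholders.  The point: X-Auth-Token is a plain
    # bearer token honoured until it expires (~a day), unlike per-request
    # signed schemes such as S3/OAuth-style signatures.
    import requests

    AUTH_URL = 'http://ms-fe.svc.pmtpa.wmnet/auth/v1.0'   # placeholder

    resp = requests.get(AUTH_URL, headers={
        'X-Auth-User': 'mw:media',         # placeholder account:user
        'X-Auth-Key':  'not-the-real-key',
    })
    resp.raise_for_status()
    token = resp.headers['X-Auth-Token']
    storage_url = resp.headers['X-Storage-Url']

    # Any subsequent request carrying the token is authenticated -- including,
    # if it reached the proxy unfiltered, a DELETE on an object.
    obj = storage_url + '/wikipedia-commons-local-public.a2/a/a2/Example.jpg'
    print(requests.head(obj, headers={'X-Auth-Token': token}).status_code)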
[15:50:46] i rather see this well planned and not done in a rush rather than have someone else clean up the mess later [15:51:12] I'd like us to spend the time in our window on that planning, rather than on something else [15:51:14] apergos: no, as i said, it has nothing to do with our implementation [15:51:18] that's what I'm advocating here [15:51:23] binasher: ok [15:51:27] wow, how did i accidentally underline that [15:52:39] so far we have the delete issue and the possibly missing thumbs in swift to look at (or determine that swift isn't missing thumbs and that something else happened) [15:53:13] binasher: what's scary is that we don't filter headers or methods on the squids (afaik) [15:53:31] you can probably send a DELETE with an X-Authenticate header and delete stuff from outside [15:53:56] we can [15:54:01] don't we? [15:54:12] I remember there such an option, I don't think we do that now. [15:54:13] we do at least in some configs, but not sure about the upload cluster as is [15:54:18] i'm pretty sure we do [15:54:20] I'm not sure. [15:54:44] I'd like to test to be sure [15:55:42] that was fast [15:55:44] * maplebed lurks while in transit [15:55:48] ah [15:55:55] (yay mifi!) [15:59:20] I see some things in here like [15:59:26] http_access deny !gethead [16:00:53] ah, that's on the frontend [16:00:55] indeed. [16:00:58] yes [16:01:10] good to test it too of course [16:04:04] we should block container list requests from the squids [16:04:56] container listing is an authenticated request; blocking all auth stuff (i.e. with the authed header) accomplishes that goal. [16:07:36] unrelated, the reason to take the test squid out of the front end is so its front end doesn't pass 1/nth of the requests to itself on the back end (which will have been corrected on all other squids) [16:08:02] out of the front end pool. sorry [16:08:54] why wouldn't be corrected on that frontend too then? [16:09:00] we'll deploy all frontends anyway [16:09:14] that squid file won't be regenerate, it's out of the conf list [16:09:28] it will be regenerated when we reveert and add thebackend changes [16:09:52] that's what we discussed earlier right? [16:13:39] bbiab [16:15:18] okay, so I'll take sq51 backend out [16:15:30] and then we'll see how that goes [16:15:55] front and back end you mean? [16:16:16] frontend does not need depooling [16:16:49] that's what I've said 10 times already [16:17:06] ok, then I don't get it. please explain to me how its front end will have the correct configuration [16:17:23] if you have lost patience with it you can tell me to shut up, but it jus means I won't understand it [16:18:45] I haven't lost patience, no :) [16:19:20] i think apergos means that if you take the one squid out of the server list in the config, the frontend's squid config won't get regenerated on the next make invocation [16:19:55] right, so the test squid will still have its own back end in its list, but we want no squid to have the test squid back end in its list [16:20:08] because we want no production requests to go there. [16:20:32] if this is wrong, can you please tell me in baby steps how it actually works? [16:21:00] I know that's what you mean, but we can always vi by hand iirc :) [16:21:35] then let's add that in the directions, or it won't happen [16:22:56] added [16:23:48] I hope I got it right [16:25:01] you know, since we're doing the more strict only-pull-originals bit in squid, none of the authed stuff will make it through to swift even without the auth blocking bits. 
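The "http_access deny !gethead" line quoted above is the frontend rule that rejects anything but GET/HEAD; the open question is whether the directly reachable backends behave the same and whether requests carrying auth headers get refused. A small probe in the spirit of "I'd like to test to be sure"; host and ports are the ones mentioned in the log, and it only prints what comes back so behaviour is observed rather than assumed.

    # Quick probe: do the squids reject non-GET/HEAD methods and requests that
    # carry Swift auth headers?  The bogus token means nothing can actually be
    # deleted; the script just reports the status codes.
    import requests

    TARGETS = {
        'frontend': 'http://sq51.wikimedia.org:80',
        'backend':  'http://sq51.wikimedia.org:3128',
    }
    PATH = '/wikipedia/commons/a/a2/Example.jpg'   # illustrative URL
    HOST = {'Host': 'upload.wikimedia.org'}

    for name, base in TARGETS.items():
        for method, headers in [
            ('GET',     HOST),
            ('OPTIONS', HOST),
            ('DELETE',  dict(HOST, **{'X-Auth-Token': 'bogus-token'})),
        ]:
            r = requests.request(method, base + PATH, headers=headers,
                                 allow_redirects=False)
            print('%-8s %-7s -> %s' % (name, method, r.status_code))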
[16:25:54] also (I want to confirm this but) rewrite.py will drop DELETE requests, so the same thing applies. [16:25:56] well, yeah, that's kinda obious :) [16:26:11] frontend squids drop non-GET/HEAD apparently. [16:26:27] but the back end is directly accessible from the public net, isn't it? [16:26:35] ok I have that and the missing thumbs issue in the list of "prep" [16:26:42] is it?? [16:26:44] jesus [16:26:54] yeah, because it's hit directly by the squids in esams [16:27:01] it's not firewalled but blocked via squid acls afaik [16:27:09] oh, that makes sense. [16:27:23] oh right, it would have to be [16:27:39] (my wmf squid experience is two days old, don't assume I'm right in what I say) [16:27:51] yes but you have squid experience form elsewhere [16:28:04] my total squid experience is probably aobut 10 days in my years here. [16:28:13] (and I have none previous) [16:30:16] !log disabling sq51 backend on all frontend squids (incl. sq51) [16:30:43] oh, are we doing this? I thought we were still well in the "prep" stage... [16:30:48] I wonder if there are any varnishes pointed at squid backends. [16:32:16] mark: do we have Varnish stacked with squid backends anywhere? [16:33:04] in eqiad, for upload [16:33:11] not active, but may get a little bit of client load [16:33:15] people not observing dns [16:33:27] paravoid: are you sure it's disabled on sq51? [16:33:30] is that configured via role/cache.pp? [16:33:51] (sorry, I'm looking at /etc/squid/frontend.conf though) [16:34:06] apergos: I haven't checked yet, no. I would have, but good thing that you did too [16:35:18] seems like 208.80.152.61 is still in the list over there [16:35:21] paravoid: yes [16:35:25] I think [16:35:47] mark: I'll figure it out. thanks. [16:36:14] (our flawless plan is having holes again, grmbl grmbl) [16:37:25] apergos: removed and reloaded, thanks for double checking! [16:37:30] sure [16:37:41] yep it's fixed now [16:37:43] in other news, the bot is down again and didn't log my message [16:37:50] ah joy [16:40:11] !log shot morebots so the restarter script would restart it [16:40:21] and now we wait for it to read all its factoids etc [16:40:23] Logged the message, Master [16:40:57] mark: eqiad's role::cache::upload: frontend seem to have themselves as backends; backends seem to have upload.svc.pmtpa.wmnet:80 (i.e. pmtpa frontends) as backend [16:41:40] sounds correct [16:42:06] so, not pointed to squid backends directly [16:42:15] and hence no need to change anything there [16:42:47] there are a few requests sneaking through to the back end yet [16:43:13] I was about to check on emery, is that what you did? [16:43:17] or tcpdump? [16:43:17] like 3 [16:43:22] I'm looking at cachemgr [16:43:29] via noc [16:44:03] http://noc.wikimedia.org/cgi-bin/cachemgr.cgi?host=sq51.wikimedia.org&port=3128 but you will have to stuff in the username and password into the form to see it [16:44:20] that's because varnish's hashing is different [16:44:28] so we send it via the frontends to correct that (for now) [16:45:36] I see requests coming from esams squids. looking [16:46:01] mark: heh, interesting, good to know :) [16:46:49] yes, but esams squids should follow the squid config as long as you deployed to them as well [16:46:53] uh huh, a few from esams [16:47:17] and one from itself, grrrr! [16:47:20] the esams frontends hash directly to the pmtpa back ends, so we need to push the frontend config change to esams to pull out sq51, right? 
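The depooling logic above (pull sq51's backend out of every frontend's peer list, including esams, or a fraction of traffic still reaches it) is easier to see with a toy model: each frontend hashes the URL over its own list of backend peers, so any frontend that still lists sq51 keeps sending it roughly 1/N of its requests. A deliberately simplified sketch; squid actually uses CARP weighting, not plain modulo hashing.

    # Toy model of why sq51 must be removed from *every* frontend's peer list.
    # Squid really uses CARP; modulo hashing here only illustrates the idea.
    import hashlib

    def pick_backend(url, backends):
        h = int(hashlib.md5(url.encode()).hexdigest(), 16)
        return backends[h % len(backends)]

    backends_old = ['sq41', 'sq42', 'sq51', 'sq52']   # sq51 still pooled
    backends_new = ['sq41', 'sq42', 'sq52']           # sq51 depooled

    urls = ['http://upload.wikimedia.org/wikipedia/commons/a/a2/File%d.jpg' % i
            for i in range(10000)]
    hits_old = sum(pick_backend(u, backends_old) == 'sq51' for u in urls)
    hits_new = sum(pick_backend(u, backends_new) == 'sq51' for u in urls)
    print('sq51 pooled: %d/10000 requests; depooled: %d' % (hits_old, hits_new))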
[16:47:29] hrm, the config seems to not be deployed in an esams server [16:47:52] ah, I know [16:48:33] I ran ./deploy frontend all, but it's esams *backend* that pushes to pmtpa [16:48:51] oh right. :P [16:49:44] ottomata: you know who in analytics would make the decision when we can take down db1047 for the disk shelf addon? [16:49:50] I don't =P [16:50:22] drdee: ^? [16:50:24] RobH: i don't think that's an analytics decison, that sounds like an ops decision to me [16:50:31] okay, ./deploy all it is [16:50:43] well, its downtime won't cause any issues for actual production, its a slave in the db cluster [16:50:57] but if you guys are actively pulling data and doing some project right now... [16:51:03] binasher: started discussion in here [16:51:10] is db1047 one of the slaves used by analysts? [16:51:15] i would say maybe 20 minutes [16:51:24] but allow 30 if things arent smooth [16:51:30] in that case, a quick email to wsor list would make sense [16:51:48] apergos: puppet overwrote sq51's frontend.conf; fixed in manually on stafford, fixing it manually again on sq51. [16:51:58] oh fer cripes sake [16:52:03] drdee: my understanding is that it affects you guys, but im not sure. [16:52:14] i asked binasher in other channel, so he knows whats up [16:52:35] RobH: i am pretty sure we are not affected by it. [16:53:21] i'll email wsor@lists.wikimedia.rog [16:53:24] waiting for cachemgr to show no esams or local backend requests except my "active requests" query [16:53:59] esams still making requests [16:54:31] no, this must be dead, it says it started some many seconds ago [16:54:51] yay I think we are finally good [16:55:06] * maplebed waits for the ganglia graph to drop to zero [16:55:55] which logs are on emerey anyways?? [16:56:06] binasher: cool, so we can do today then? [16:56:16] I want to get it knocked out, its been sitting for a bit waiting on me to do this [16:56:46] I'll just tell them it's going down in an hour, at 11am PST [16:57:27] sounds awesome [16:57:48] binasher: do you know how we want the raid on the disk shelf setup? [16:57:51] the squid frontend process on sq51 is still using 75% cpu... [16:58:00] of course, it can always be done via command line from the OS [16:58:06] but its a LOT easier from the raid bios ;] [16:58:09] it's still in the lvs config. [16:58:14] yes it is [16:58:19] it will be talking to other backends [16:58:47] ok. [16:58:53] RobH: remind me about the disk shelf.. was it 12 or 24 drives? [16:59:07] paravoid: what did you look at on emery? (for squid requests) [16:59:44] 24 [17:00:11] 24 600gb 10k [17:00:16] 2.5" [17:00:30] so if its performance requirements, then two raid10s i would assume [17:00:39] but if they dont need that, but space, then two raid6. [17:00:54] can lose 2 disks in a raid6 array without issue. [17:01:05] (but its not very good for writes) [17:02:57] okay, so sq51 is drained of backend traffic [17:03:20] RobH: let's do 2 raid-10's of 12 disks each [17:03:29] maplebed: we haven't depooled frontend [17:03:39] yes, I see. I'm fine with that. [17:03:45] okay [17:03:53] robh: and you can go ahead and take it down in an hour [17:08:20] binasher: confirmed, will take it down at 11am your time 2pm mine. 
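To make the RAID-10 vs RAID-6 trade-off above concrete: with 24 x 600 GB drives split into two 12-disk arrays, RAID-10 yields 6 x 600 = 3600 GB usable per array (which lines up with the ~3597 GB per device reported later in the log once db1047 comes back), while RAID-6 would yield 10 x 600 = 6000 GB per array at the cost of write performance. A quick back-of-the-envelope calculation:

    # Usable capacity for the db1047 disk shelf discussed above:
    # 24 x 600 GB drives split into two 12-disk arrays.
    DRIVES_PER_ARRAY = 12
    DRIVE_GB = 600

    raid10_usable = DRIVES_PER_ARRAY // 2 * DRIVE_GB    # mirrored pairs, striped
    raid6_usable  = (DRIVES_PER_ARRAY - 2) * DRIVE_GB   # two parity drives

    print("RAID-10: %d GB per array (x2 arrays)" % raid10_usable)   # 3600 GB
    print("RAID-6 : %d GB per array (x2 arrays)" % raid6_usable)    # 6000 GB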
[17:08:44] will do 2 raid10 of 12 disks each, 256kb stripe, usual write back and adap read ahead [17:09:25] binasher: then I'll leave how you guys want to partition it up to you guys [17:09:31] since it wont be part of downtime, it has no rush [17:10:14] okay, the imagescalers are not too happy but it's not nearly as bad as yesterday [17:10:42] so can I ask what you looked at on emery to check squid traffic? [17:10:53] oh I didn't, I ended up doing tcpdump [17:10:58] but we have /var/log/squid on emery [17:11:03] with the 1/1000 sampled logs [17:11:12] oh. I guess I only know about the locke copy [17:11:15] is that outdated now? [17:11:22] no idea [17:11:30] ok [17:11:39] thanks [17:11:51] sorry, as I said my experience is ~two days :) [17:12:10] well you knew about logs I didn't, so.. figured you might know the rest :-) [17:13:18] yeah scalers don't seem to be particularly unhappy when I look at the 2 hour graph [17:13:24] right [17:13:47] they're worse than their usual, but not more since we started [17:13:53] yup [17:14:09] so, can we investigate and verify or reject "swift is missing a bunch of thunbs"? [17:14:19] not really. [17:14:25] ugh [17:14:30] we removed the same squid and we have persistent hashing [17:15:11] but we really don't have any evidence that it is missing a bunch of thumbs. [17:15:21] so, the same set of thumbs will be uncached today [17:15:40] maplebed: what do you mean? yesterday's image scaler spike wasn't such evidence? [17:15:46] I thought we agreed on that [17:15:59] I thought we don't have proof one way or another [17:16:42] no, that was a hypothesis, but the other bits of corroborating evidence weren't there, such as the graph indicating the number of objects getting put into swift didn't appreciably rise. [17:17:48] anyway, it's not something we can really verify (without crawling ms5) [17:18:11] we can look at some subdirectory on ms5, 0/00 for commons, let's say [17:18:36] and your logic sounds reasonable, that it's the same shard so won't trigger the same set of thumbs even if they were missing. [17:18:41] that's not very expensive (as long as we don't ask ls to sort its list) [17:18:43] apergos: there are about 250k objects in each shard. [17:19:48] one approach that might yield results is to more closely examine the logs of the images scalers to see what they were doing yesterday and look for patterns. [17:19:50] remind me, we don't store the thumbs directories as bjects, right? [17:19:52] or the swift logs, for that matter. [17:20:00] just the thumbs themselves? [17:20:11] (in swift) [17:20:29] swift is a flat filesystem; it doesn't have directories. (what it does have is paths to objects that contain slashes) [17:20:59] uh huh, which means it will be much more expensive than I want, but we can, again, jsut take some portion of those [17:21:04] *just [17:21:14] so though the path is /a/a2/foo/230px-foo.jpg, that's really just an object. [17:21:19] yep [17:21:34] well /a/a2/foo.png/230px-foo.png [17:21:43] yes. [17:22:47] that's actually a big bummer because I can't easily see if there are *any* thumbs for X file in swift [17:23:07] I have to poll for each possible thumb one at a time [17:23:15] you can; swift container listing allows a prefix. [17:23:37] (so it's sorta the same thing as listing a directory, it just can't do globbing and such) [17:24:06] I see [17:24:17] hmm [17:25:57] honestly though, I'm convinced that trying to see whether the collection in swift and ms5 are different as a precondition for continuing the deploy is a waste of time. 
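On "swift container listing allows a prefix" above: since Swift has no directories, the way to ask "are there any thumbs for file X" is to list the thumb container with prefix=<original name>/, which returns every object whose path starts with that string. A hedged sketch; the storage URL, token and sharded container name are illustrative (the log shows public container names like wikipedia-en-local-public.73, so a similar thumb container name is assumed).

    # Sketch of prefix-based container listing: enumerate every thumb stored
    # for one original.  Storage URL, token and container name are assumptions.
    import requests

    storage_url = 'http://ms-fe.svc.pmtpa.wmnet/v1/AUTH_mw'   # placeholder
    token = 'AUTH_tk...'                                      # from the auth step
    container = 'wikipedia-commons-local-thumb.a2'            # assumed name
    original = 'a/a2/Example.jpg'

    resp = requests.get(storage_url + '/' + container,
                        params={'prefix': original + '/', 'format': 'json'},
                        headers={'X-Auth-Token': token})
    resp.raise_for_status()
    for obj in resp.json():
        print(obj['name'], obj['bytes'])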
[17:26:17] they are different, and we know that. [17:26:41] the interesting question is whether they're different in a way that matters, and we won't be able to determine that by polling a random sample. [17:26:46] well what I really want to understand is whether we are going to seriously bork the scalers and/or ms5/7 if the other squids get switched over [17:27:10] if you have some ideas on that, cool [17:27:20] even if it was directly related, didn't it coincide with clearing the cache, not switching originals? [17:27:37] maplebed: I could argue that disabling sq51 today without having a definite post-mortem for yesterday was wrong, since it risked having the same effect [17:27:46] but we're past that and nothing bad happened fortunately. [17:28:00] so let's just move on (for now) [17:28:26] ok. what's next then? [17:28:56] I'm trying to configure squid for swift originals [17:29:06] i.e. add an ACL for originals and direct squids to swift for those [17:29:21] and leave the gateway of last resort untouched (ms7) [17:29:35] right [17:32:55] acl swift_temps url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/temp/[0-9a-f]/[0-9a-f][0-9a-f]/* [17:32:59] acl swift_thumbs url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/* [17:33:04] that's from the pad [17:33:08] besides the fact that this has "*" at the end which is wrong [17:33:25] sorry, I forgot to strip off the final *. [17:33:30] isn't it also wrong by having /thumb/ (and I guess /temp/) before the shard rather than after? [17:33:30] I copied those in from aaron's email [17:34:12] no, temp and thumb do come before the shard. [17:35:07] acl swift_thumbs url_regex ^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/ [17:35:16] that's what we have now [17:35:58] to upload? really? [17:36:34] notpeter: search32�.WHY!!!! WHY! [17:37:39] http(s)://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/blah...jpg/480px-blah....jpg huh yep, bfore the shard [17:38:39] cmjohnson1: :( [17:38:42] sigh, it matches accidentally [17:38:47] can we actually just throw it out the window? [17:38:51] are there windows? [17:38:51] New patchset: Andrew Bogott; "Add a configurable timeout to the git-clone exec." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20913 [17:39:09] there are 2! [17:39:27] but they are painted over but I think they still open [17:39:32] just putting it out there as an option.... [17:39:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20913 [17:39:51] so, it's [^/][^/]+ e.g. for "wikipedia" and [^/][^/]+ for e.g. "en" [17:39:55] that's not the shard. [17:40:30] [^/][^/]* or [^/][^/]+ ? [17:40:32] fair enough, but I'm going to keep the swift_thumbs ACL unmodified to avoid any regressions [17:40:36] er, *, correct. [17:41:11] [^/][^/]* is a peculiar way of saying [^/]+ though... [17:41:28] ? it's matching project/lang/thumb (a bit weird as to the regex but ok) [17:41:32] I have a change to rewrite.py I'd like to suggest too; only send thumb 404s to the scalers instead of all 404s. It's not really necessary because the scalers will 404 requests for originals, but it will keep traffic off them. Should I do that before or after this test? [17:41:51] (yeah, I don't know why it's not []+.) 
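Since the ACL regexes quoted above are easy to get subtly wrong (the stray trailing "*", thumb/archive/temp coming before the shard, [^/][^/]* versus [^/]+), a small test harness helps before anything touches a squid. The originals pattern below is a composite for illustration only (the per-project list plus the optional archive/ segment that was suggested as a combination); the thumbs pattern is the existing one quoted above, and the sample URLs are made up.

    # Tiny harness for sanity-checking the upload.wikimedia.org ACL regexes
    # being discussed above.  Patterns and URLs are illustrative.
    import re

    ORIGS = re.compile(
        r'^http://upload\.wikimedia\.org/'
        r'(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)'
        r'/[^/]+/(archive/)?[0-9a-f]/[0-9a-f][0-9a-f]/')
    THUMBS = re.compile(
        r'^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/')

    CASES = [
        ('http://upload.wikimedia.org/wikipedia/commons/b/bd/Example.jpg', ORIGS, True),
        ('http://upload.wikimedia.org/wikipedia/commons/archive/b/bd/20120821!Example.jpg', ORIGS, True),
        ('http://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Example.jpg/480px-Example.jpg', THUMBS, True),
        ('http://upload.wikimedia.org/math/1/2/3/abc.png', ORIGS, False),
        ('http://upload.wikimedia.org/wikinews/en/math/1/2/3/abc.png', ORIGS, False),
    ]

    for url, pattern, expected in CASES:
        got = bool(pattern.match(url))
        print('%-4s %-5s %s' % ('OK' if got == expected else 'FAIL',
                                'match' if got else 'miss', url))

The last two cases cover exactly the /math/ traffic mentioned in the log, which should keep falling through to the old backend rather than being caught by the originals ACL.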
[17:41:52] I'd do it afterwards for simplicity [17:41:53] ah sorry, didn't see you guys had typed a bunch, my client crolls hung [17:41:56] *scroll [17:42:19] maplebed: I don't want to muddy the waters with the rewrite.py change now [17:42:25] maplebed: are we keeping the current squid thumb acl for now or making it like the proposed ones for originals [17:42:28] ok. +1 [17:42:47] AaronSchulz: "I'm going to keep the swift_thumbs ACL unmodified to avoid any regressions [17:42:48] I don't have an opinion on the thumb acl; I don't mind keeping it as is to minimize concurrent change. [17:42:56] yep, modify as little as possible [17:42:57] right [17:43:01] paravoid: ok [17:43:04] so that if we break something we have a clue what broke it. [17:43:07] we can't forget "upload.wikimedia.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/archive/[0-9a-f]/[0-9a-f][0-9a-f]/*" too [17:43:08] but we should put it on the list to make them the same later. [17:43:22] AaronSchulz: sorry, that wasn't in the original list. thanks for mentioning it. [17:43:33] * AaronSchulz checks his list [17:43:46] maplebed: were is the working list you guys have? [17:43:48] ohhh archive [17:44:01] AaronSchulz: in the etherpad I linked to in the email I sent. [17:44:11] http://etherpad.wikimedia.org/Swift-Switch-Originals for ease of clicking. [17:44:41] I see [17:45:07] would be nice to just combine "archive" into the existing regex [17:45:37] (temp|archive) you mean? [17:45:57] both of those shard the same (on the mw side)? [17:46:06] and have the same access restrictions? [17:46:17] maybe not temp, though I guess you could do that too...I kind of like the "one per zone" list [17:46:22] AaronSchulz: not really important right now, but could you care to explain what "temp" and "archive" are? [17:47:00] archive/ is a subdir of the "public" zone that stores prior versions of files (when you upload over a file, the old versions go to /archive) [17:47:16] aha, makes sense [17:47:20] in gerrit, the button to copy the full URL for a pathset seems gone for me [17:47:38] New patchset: Bhartshorne; "change to swift to only send thumbs to the image scalers instead of all 404s." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20916 [17:47:39] ^^^ staged but not merging. just so it's available later. [17:47:43] temp/ mostly have upload wizard files and stashed files from uploads in progress (say you upload a file and get and some minor warning, you can finish it without reuploading from the client end) [17:47:59] temp is a zone btw [17:48:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20916 [17:48:30] paravoid: rewrite.py also has some good comments about the types of urls [17:50:42] * AaronSchulz updates the pad [17:51:14] paravoid: Despite it not fixing everything, does https://gerrit.wikimedia.org/r/#/c/20913/ look right to you? (I'm pushing prematurely because I want to see if my current bug has to do with puppetmaster::self) [17:51:36] so where are we at the moment? [17:52:26] andrewbogott: we're in the middle of some changes, could we talk about this later? [17:52:30] maplebed: squid changes. [17:52:36] paravoid: Yep, sorry to interrupt :) [17:54:28] just a check in: this window was from 7 til 9 pm our time? (eet) [17:55:20] paravoid: are you using the latest etherpad regexes? [17:55:27] apergos: yes, but nobody's following us. [17:55:50] the next window doesn't start until 22:00UTC. 
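The staged change mentioned above (send only thumb 404s to the image scalers instead of every 404) comes down to a path check before rewrite.py hands a miss to the scalers. A hedged sketch of that kind of guard, not the actual patch in Gerrit change 20916:

    # Shape of the "only send thumbs to the image scalers" guard described
    # above: a 404 is only worth retrying via the scalers if the path is a
    # thumbnail, since the scalers cannot do anything with a missing original.
    # This is not the real rewrite.py patch, just an illustration.
    import re

    THUMB_PATH = re.compile(r'^/[^/]+/[^/]+/thumb/[0-9a-f]/[0-9a-f]{2}/')

    def should_dispatch_to_scaler(path, status):
        """Return True if a backend miss should be retried via the scalers."""
        return status == 404 and bool(THUMB_PATH.match(path))

    print(should_dispatch_to_scaler('/wikipedia/commons/thumb/b/bd/X.jpg/80px-X.jpg', 404))  # True
    print(should_dispatch_to_scaler('/wikipedia/commons/b/bd/X.jpg', 404))                   # False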
[17:55:54] the next thing on gcalender is 3pm [17:56:36] I admit I am not necessarily game to continue til 1am our time [17:57:05] sure, I didn't mean that we should continue until then, just that there's not going to be anybody chomping on our heels. [17:57:06] paravoid: how long are you good for? you usually are here later than me [17:58:35] !log db1047 mysql shutting down (going to add new disk shelf to server) [17:58:45] Logged the message, RobH [17:58:48] moar disks! [18:00:27] !log db1047 shutting down [18:00:37] Logged the message, RobH [18:02:12] just to make sure, all our servers are on the same timezone, right? [18:02:28] (obv they're not *in* the same timezone, but they all think it's UTC, yes?) [18:02:43] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:55] dschoon: yes. [18:02:59] kk [18:03:01] ty [18:03:54] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20917 [18:04:14] gerrit-wm: aha, now you talk again? [18:04:24] it did not mention new patches before [18:05:13] maplebed, apergos: run "git diff" [18:05:22] where? [18:05:31] fenari: /home/w/conf/squid [18:05:56] (yeah, that's in git instead of RCS since a few hoursa go) [18:06:06] Change abandoned: Dzahn; "this one could not be merged due to path conflict because i changed the configs to be templates rath..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15715 [18:06:27] reading. [18:06:45] binasher: btw ^^^, squid config in git; wikitech's updated but I don't think you'd read that :) [18:07:06] paravoid: in swift_origs you'r emissing a / inside the archive optional segment. [18:07:29] ha, ACLs changed again. copying again from the pad [18:07:48] good catch, thanks. [18:08:03] so this won't go out to sq51. we want it to though, right? [18:08:28] yeah, I'll fix that in a bit [18:08:34] ok [18:08:38] staring at the regexes now [18:09:03] well, fixed now [18:09:17] looks ok [18:09:40] lgtm. [18:10:11] thumb/temp thumb/archive [18:10:38] apergos: what about them? [18:10:40] these requests need to be accounted for [18:10:58] I see them in the diff [18:11:00] apergos: we're not changing the thumbs with this push - that's in there but commented out. [18:11:14] the existing thumbs regex stops at thumb/ [18:11:18] also that part is changing atm anyway, put it has those [18:11:23] + '=swift_thumbs' => 'ms-fe.svc.pmtpa.wmnet', [18:11:24] that isn't [18:11:33] the existing acl is also scalled swift_thumbs. [18:11:37] ok [18:11:41] sorry, it's not in the diff [18:11:49] heh [18:11:54] yeah; it's: [18:11:56] acl swift_thumbs url_regex ^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/ [18:11:58] !log db1047 disk shelf added, two raid10 arrays of 12 disks each, booting now into OS. [18:12:07] Logged the message, RobH [18:12:15] very top [18:12:45] well, depends on your unified= :) [18:12:47] looks ok [18:12:49] okay, deployed on sq51 [18:13:47] 1) random /math/: MISS, 200 OK, Sun-Java-System-Web-Server [18:14:42] 2) random thumb: MISS, 200 OK, swift (X-Object-Meta-Sha1base36) [18:15:01] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:15:16] wikinews/en/math ... 
that's what those should look like, shouldn't be caught by the regexp [18:15:28] 3) random orig: MISS, 200 OK, swift [18:15:48] great [18:15:49] apergos: on a related note, see https://gerrit.wikimedia.org/r/#/c/20844/ [18:16:17] heh awesome [18:16:37] so these will be /math directly, worksforme [18:16:38] !log db1047 back online, presends sdb and sdc at 3597.3 GB each [18:16:47] Logged the message, RobH [18:16:58] binasher: So would you let whoever needs to know about the increase in space about that ^? [18:17:04] or lemme know who i need to ping [18:17:13] 4) random /archive/: MISS, 200 OK, swift [18:17:57] RobH: no one now, i'm going to have to have a lengthier downtime later to copy off what's currently in /a and convert it to lvm [18:18:19] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 775 seconds [18:18:25] n) request using swift syntax to monitoring container/file via sq51 failed (404 from ms7) [18:18:26] i'm going to switch it to mariadb and possibly upgrade to precise at the same time [18:18:28] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 771 seconds [18:18:38] * AaronSchulz wonders what thats for [18:18:52] RobH: i'll let them know its back up tho [18:18:53] AaronSchulz: what? [18:19:05] mariadb :) [18:19:21] oh [18:21:26] 6) OPTIONS on backend: 405 method not allowed (presumably from swift); OPTIONS on frontend 403 forbidden [18:21:37] maplebed: yours was (5); could you clarify what you did? [18:22:01] paravoid: added to the etherpad. [18:22:05] at the bottom. [18:22:06] ah, you put it to the pad [18:22:09] right, yes, sorry [18:22:14] so many windows :) [18:22:19] :P [18:22:50] I wonder if we should block auth/AUTH now [18:23:01] does it hurt to? [18:23:08] I don't see any legimitate reasons for leaving it, sounds safe to me [18:23:13] cause if not, I'd say, do it [18:23:41] I prefer the bad guys to have a few hoops to jump through rather than just one [18:24:04] +1. it won't help until later, but shouldn't hurt either. [18:24:19] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [18:24:28] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [18:29:16] New patchset: Ottomata; "Changing custom filters to use udp-filter instead. There should be no change in content." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20919 [18:30:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20919 [18:31:09] doesn't seem to work [18:31:32] what happened? [18:31:34] i.e I get 404s [18:31:40] when I should be getting 403s [18:32:00] ah, I think I know [18:32:51] okay, fixed [18:33:26] lgtm. [18:33:40] !log authdns-update for mc1009-1016 mgmt [18:33:50] Logged the message, RobH [18:33:52] it's already deployed on sq51 and tested, so :) [18:34:22] paravoid: does it not matter that it's below the nearby content servers section in the published config? [18:35:01] something unrelated: squid configs mention upload.wikimedia.org/centralnotice which is not on our list but doesn't seem to exist either [18:35:06] probably legacy [18:35:18] very legacy [18:35:29] good to have an old-timer around :) [18:36:02] you young whippersnappers, I only got one thing to say to you [18:36:08] "git off my lawn!" [18:36:42] hahaha [18:37:06] paravoid: confirmed where I got a 404 from ms7 previously (with an authed reuqest) I now get a 403 from squid. 
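The numbered spot checks above (/math/ still served by the old backend, thumb/original/archive now answered by Swift, OPTIONS and authed requests refused) fold naturally into a repeatable smoke test for the next window. A sketch; the test URLs are placeholders and the expected codes mirror what was observed in this log rather than any guarantee, so adjust them if the plan changes.

    # Repeatable version of the manual spot checks above.  URLs are placeholders;
    # expected codes reflect what was observed here (Swift/ms7 responses, 403 on
    # OPTIONS and on auth headers at the squid).
    import requests

    SQUID = 'http://sq51.wikimedia.org'
    HOST = {'Host': 'upload.wikimedia.org'}

    CHECKS = [
        ('GET', '/wikipedia/commons/thumb/b/bd/Example.jpg/120px-Example.jpg', {}, 200),
        ('GET', '/wikipedia/commons/b/bd/Example.jpg',                          {}, 200),
        ('GET', '/wikipedia/commons/archive/b/bd/20120821!Example.jpg',         {}, 200),
        ('GET', '/math/1/2/3/abcdef.png',                                       {}, 200),
        ('OPTIONS', '/wikipedia/commons/b/bd/Example.jpg',                      {}, 403),
        ('GET', '/wikipedia/commons/b/bd/Example.jpg', {'X-Auth-Token': 'bogus'}, 403),
    ]

    for method, path, extra, expected in CHECKS:
        r = requests.request(method, SQUID + path, headers=dict(HOST, **extra),
                             allow_redirects=False)
        backend = r.headers.get('Server', '?')
        status = 'OK ' if r.status_code == expected else 'FAIL'
        print('%s %-7s %-55s %s (%s)' % (status, method, path, r.status_code, backend))

Printing the Server header makes it easy to tell whether a hit came back from Swift or from the old Sun-Java-System-Web-Server backend, which is the same thing being checked by hand above.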
[18:37:14] * apergos checks cachemgr on sq51 out of paranooia... looks good [18:37:25] so... [18:37:27] okay [18:37:32] we have another FIXME [18:38:11] send real traffic through these rules? [18:40:26] bueller? [18:40:28] what's the fixme? [18:41:05] ms6_thumbs [18:41:29] I think it's a no-op since it's the last one [18:41:34] but I want to verify that first [18:43:34] the ms6_thumbs cache_peer line didn't show up on sq51. IIRC that's esams equivalent of ms5. and still live (with the same kind of 404 handler that sends traffic back to pmtpa on missing thumbs) [18:43:41] why do we care about ms6 right now? [18:43:52] it can just keep doing its job as esams thumb cache for the moment [18:44:29] fair enough [18:44:31] it's not particularly full or loaded [18:44:44] (but eventually w should decide what happens in esams) [18:44:56] I think mark's idea was a swift cluster there. [18:45:03] and I think we already have hardware for it. [18:45:05] sure, why not [18:45:11] should have one in each dc [18:45:21] (it would only have thumbs though, not originals) [18:45:33] yup [18:45:46] ah legal issues, gotta love 'em [18:46:13] so, are we ready to throw real traffic at this thing? [18:46:23] I think so [18:46:32] well does real traffic mean put sq51 back end back in the pool? [18:46:36] right [18:46:43] sounds ok [18:46:44] I suggest putting sq51 back in the pool as the first step [18:46:48] +1 [18:46:56] \o/ [18:46:59] and if all goes well, then push that config to other backends [18:47:02] now don't jinx us [18:47:17] all goes well = we watch it for a half hour (no joke either) [18:47:28] note how I don't have a good way of putting it completely back in the pool [18:47:34] :-D [18:47:42] I can only partially put it back [18:48:03] hm, or maybe I can [18:50:06] can't wait to see this workaround.... [18:50:30] ./deploy frontend all; ./deploy cache esams :) [18:51:45] * apergos waits for it to go around [18:51:46] (or knams.upload) [18:52:23] thereit is [18:52:23] !log putting sq51 (modified with originals pointing to swift) back into the backend pool [18:52:32] Logged the message, Master [18:52:38] apergos: what are you looking? [18:52:58] I was checking another squid to make sure the conf change made it there (it did) [18:53:13] I'm watching the swift logs [18:54:11] yup it's hopping right along [18:54:27] paravoid: apergos take a look at fenari:/tmp/benswiftlog.txt [18:54:31] those just showed up [18:54:36] some of these seem to be taking a while [18:55:21] eeww [18:55:25] apergos: ? [18:55:42] maplebed: which log are you looking at? ms-fe1 /var/log/syslog? [18:55:55] no, the consolidated long on fenari. [18:55:59] wonder what's causing that (non utf8) [18:56:01] /home/w/log/syslog/swift [18:56:02] oh, right. [18:56:21] I'm running "tail -f /var/log/squid/sampled-1000.log |grep ^sq51 |grep MISS" on emery [18:56:36] TCP_MISS/200s all over [18:57:06] is that catching thumbs? [18:57:53] yes [18:58:03] you can grep them out but I wanted to catch everything [18:58:26] New patchset: Ryan Lane; "Change network settings in eqiad for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20921 [18:59:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20921 [18:59:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20921 [19:00:03] sq51 is up to its normal traffic pattern [19:00:25] swift's load patterns haven't changed but that's not surprising (originals; only one squid) [19:00:25] yup, I've been watching the requests come in [19:01:26] I'm seeing good looking entries in the swift log coming from sq51; varying user-agents, a mix of 200s and 404s. [19:01:46] this is my grep, btw: [19:01:49] tail -f swift | grep proxy-server | grep -v thumb | grep -v 'v1/AUTH_.auth' | grep -v pybaltestfile.txt | cut -b -270 | grep 'HTTP/1.0 ....' [19:02:02] (my terminal is 270c wide, and the last grep is just to highlight the response code) [19:02:59] all these AUTH_* that I see with random UAs are rewrite.py's, right? [19:02:59] btw, seeing the 404s is not unexpected. I've seen enormous numbers of thumbs requested for images that don't exist, so requesting the actual originals that also don't exist doesn't surprise me. [19:03:13] scalers are picking up again [19:03:18] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [19:03:42] oh? [19:03:53] see em all spiking a bit? [19:03:59] I see them and I'm worrying [19:04:05] uh huh [19:04:16] we did nothing that was thumb related and could trigger that [19:04:16] paravoid: yes, that's correct. [19:04:41] making mental note tha tthere's no cache purge or antying, this was just sq51 back in the pool [19:05:54] AaronSchulz: btw, have you had a chance to look at auth caching? [19:06:30] starting to fall off again [19:06:43] apergos: yeah, I noticed, hence my context-switch [19:06:59] it's already in master, though there is no hurry there [19:07:16] AaronSchulz: ah, great! thanks! [19:07:27] AaronSchulz: also, ma rk raised a good point before [19:07:47] AaronSchulz: that we're seeing a lot of "HEAD thumb; GET thumb" [19:07:58] checking if it exists and if so requesting it [19:08:22] if that's the case, this could get a nice optimization [19:10:34] maplebed, apergos: would you agree that everything looks good? [19:10:35] looking at the last 1600 original requests - it appears they're about 30% 404, 60% 200, and 10% other (206, 304, 412, 499) [19:11:03] aha, do you have those 404s? [19:11:20] it loks ok but it's too soon to move more traffic, I want to wait a little while for that [19:11:31] paravoid you want an example original that 404s? [19:11:32] we could check these 480 against ms7 [19:11:56] New patchset: preilly; "The program 'supervisord' is currently not installed. So adding it..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20925 [19:12:06] wikipedia-en-local-public.73/7/73/PeterGotti.jpg [19:12:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20925 [19:13:20] maplebed: checked enwiki, deleted [19:13:49] and found a href in a random website [19:13:55] so, yeah, I guess there are a lot of these cases [19:14:15] image scalers are sad again [19:14:26] yup [19:14:31] same pattern as yesterday [19:14:35] spike, better, spike... [19:14:46] I think it's swift not able to handle the load [19:14:49] ouch that is bad [19:14:56] it's the only thing that makes sense [19:15:23] I don't follow. 
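The "30% 404, 60% 200, 10% other" estimate and the 90th-percentile GET times above are the kind of numbers worth pulling from the consolidated proxy log on demand rather than by eye. A sketch of such a tally; the field positions are assumptions about the swift proxy-server log line format, so verify them against a real line before trusting the output.

    # Quick tally in the spirit of the estimate above: status-code breakdown and
    # a rough 90th-percentile request time for non-thumb (originals) traffic.
    # Field positions are assumptions about the proxy-server log format.
    import sys
    from collections import Counter

    codes = Counter()
    times = []

    for line in sys.stdin:
        if 'proxy-server' not in line or '/thumb/' in line:
            continue                      # originals only, as in the grep above
        fields = line.split()
        try:
            status = int(fields[-6])      # assumed position of the HTTP status
            elapsed = float(fields[-1])   # assumed position of the elapsed time (s)
        except (ValueError, IndexError):
            continue
        codes[status] += 1
        times.append(elapsed)

    total = sum(codes.values()) or 1
    for status, n in codes.most_common():
        print('%3d  %6d  %.1f%%' % (status, n, 100.0 * n / total))
    if times:
        times.sort()
        print('p90 request time: %.2fs over %d requests'
              % (times[int(len(times) * 0.9)], len(times)))

Usage would be something along the lines of feeding it the tail of the consolidated log mentioned above, e.g. tail -n 20000 /home/w/log/syslog/swift | python tally.py.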
[19:15:24] why were they affected yesterday and why they're affected again today [19:16:02] we put more traffic into swift, so requests from image scalers take more time, so they're not able to cope up with the load [19:16:26] and that explains why their load is from apache processes rather than gs/convert [19:17:07] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:15] argh. [19:17:16] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:17] I want to put the rewrite change to only send thumbs to the image scalers in place. [19:17:22] there they go dangit [19:17:33] http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:17:48] maplebed: remind me what we send there now [19:17:54] see the spike in swift_GET_200_90th? [19:17:57] all 404s [19:18:03] DELETE too [19:18:15] no, sorry, all GET and HEAD 404s. [19:18:26] GET_204_90th is weird too [19:18:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20925 [19:18:40] 50 seconds for a 404?!?! [19:18:41] looking at the originals; the 404s are taking a long time. [19:18:45] WTF?? [19:18:55] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:04] 50 seconds for swift to able to say "this object doesn't exist"? [19:19:09] I thought AaronSchulz had said that the scalers would fail fast if the url sent to thumb_handler.php wasn't a thumb, but it appears that isn't the case. [19:19:16] load on ms5 is risingkids [19:19:29] looking at this: http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:19:32] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:42] it's more than apparent that swift is not coping up with the load [19:20:16] maplebed: what do you think? [19:20:36] paravoid: the timing present in the logs includes the call back to the scalers, so a high 404 time doesn't actually mean that's how long it's taking swift (on its own) to respond. [19:20:48] swift can take a lot of GET load given what happened last week [19:20:59] 35s for a GET? [19:21:09] (90th percentile) [19:22:00] really inexplicable [19:22:15] looking at the object server's timing for the 404s shows 0.00x seconds, which implies it's all in the proxy server (and my guess is waiting for scalers) [19:22:22] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:39] yeah, that sounds like waiting on the scalars [19:23:00] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies [19:23:37] so, how about pushing that rewrite change to not send original reuqests to the scalers? [19:23:51] in the middle of an incident? [19:23:51] https://gerrit.wikimedia.org/r/#/c/20916/1/files/swift/SwiftMedia/wmf/rewrite.py [19:23:59] I'm thinking of rolling back sq51. [19:24:10] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:21] yay for paging [19:24:30] okay, I'm rolling back sq51. [19:24:32] either way. [19:24:34] yes, roll back [19:25:07] is anyone working on that? ^^ [19:25:08] the next attempt (not tonight) can be done with the rewrite.py rule in place already [19:25:13] Leslie_sdtpa: yes. [19:25:15] maplebed: hmm, that might make sense to do now then ;) [19:25:17] Leslie_sdtpa: yes. 
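The rough 30% 404 / 60% 200 / 10% other split quoted above comes from eyeballing the consolidated proxy log. A small sketch of how such a tally could be reproduced, assuming each relevant line carries an "HTTP/1.0 <status>" token as in the greps shown earlier; the default log path is the consolidated one mentioned above and may not match other hosts:

    import re
    import sys
    from collections import Counter

    # Assumed format: proxy-server lines containing 'HTTP/1.0 <status>',
    # as matched by the "grep 'HTTP/1.0 ....'" shown earlier.
    STATUS_RE = re.compile(r"HTTP/1\.[01] (\d{3})")

    def tally(lines, last_n=1600):
        codes = Counter()
        for line in list(lines)[-last_n:]:
            m = STATUS_RE.search(line)
            if m:
                codes[m.group(1)] += 1
        total = sum(codes.values()) or 1
        for code, n in codes.most_common():
            print("%s  %5d  %4.1f%%" % (code, n, 100.0 * n / total))

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/home/w/log/syslog/swift"
        with open(path, errors="replace") as f:
            tally(f)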
[19:25:24] cool [19:25:30] hide again [19:25:42] New review: Diederik; "Hey ottomata" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/20919 [19:26:16] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.461 second response time [19:26:32] !log rolling back sq51 switch to originals [19:26:42] Logged the message, Master [19:27:08] New patchset: preilly; "add supervisor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20928 [19:27:46] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.832 second response time [19:27:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20928 [19:28:00] AaronSchulz: do you have any numbers you can pull from the image scalers to confirm that that's where the slowness is coming from? [19:28:17] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20928 [19:28:36] New review: Ottomata; "Does it need a special db file? The default one should suffice, no?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/20919 [19:29:57] ok, looks like the config went around properly .. though i don't know why some files have swift1_thumbs swift2_thumbs etc and some don't [19:30:18] wfStreamThumb() hits didn't got up, though there is no profiling on wfThumbHandle404() [19:30:34] since originals would 404, it wouldn't show up in graphite there :/ [19:30:46] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [19:30:55] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [19:31:04] PROBLEM - Host ps1-a4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.4) [19:31:04] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [19:31:04] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [19:31:04] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [19:31:04] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [19:31:04] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [19:31:05] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [19:31:13] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [19:31:22] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [19:31:22] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [19:31:22] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [19:31:22] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [19:31:22] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [19:31:22] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [19:31:23] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [19:31:23] PROBLEM - Host ps1-b1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:31:23] PROBLEM - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:31:50] whaaa?! [19:31:58] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0BRfxp0: down - BR [19:32:15] uuuhhhh [19:32:17] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [19:32:25] Leslie_sdtpa: ?? 
[19:32:43] I'm looking an imagescaler [19:32:44] > Chris and I are going to cut over the management access in tampa by [19:32:45] > taking the management aggregation switches... this? [19:32:45] there was a reason increase in MW startup calls though [19:32:49] could have been the scalars [19:32:56] reason/recent [19:32:59] a lot of apaches in D [19:33:15] timing is unfortunate [19:33:15] paravoid: i think chris is rewiring something [19:33:19] not me yet [19:33:25] ah. ouch... [19:33:29] i can run upstairs and grab him if this is causing problems [19:33:57] i think he must have accidentally been moving the wrong switch [19:34:02] do i need to go upstairs ? [19:34:04] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:04] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:33] Leslie_sdtpa: I don't know what the effect of these network downtimes is, so I can't tell [19:34:45] this is management network unreachable [19:34:49] in tampa [19:34:56] dunno if you were needing it for your troubleshooting [19:35:03] if it's just mgmt that shouldn;'t hurt us [19:35:35] just mgmt [19:36:01] well, depends [19:36:13] if we lose an imagescaler [19:36:24] swapdeath and such [19:36:47] let's hope not [19:37:02] 20. that's noooo good [19:37:05] doesn't look good [19:37:13] New patchset: preilly; "move supervisord.wikipedia.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:37:43] AaronSchulz: do you have any means to check what all these apache processes are waiting? [19:37:49] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:50] changes went arounf to esams too right? [19:37:54] I can strace, but it's noisy; it'd be better if you have something in the app [19:37:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20930 [19:38:08] apergos: what changes? I rolled back sq51 [19:38:16] New patchset: preilly; "move supervisord.wikipedia.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:38:16] paravoid: I'd want to try bens rewrite patch first [19:38:18] uh huh [19:38:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20930 [19:39:05] 404 response time is going thrgh the roof [19:39:33] at 50 sec and climbing [19:40:05] New patchset: preilly; "move supervisord.wikipedia.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:40:16] AaronSchulz: we've completely rolled back the changes, I don't see why we should do a risky change right now [19:40:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20930 [19:40:52] as I see it, imagescalers are waiting on swift, not the other way around [19:41:16] which might be cascading though... [19:41:31] starting to drop a little [19:41:39] yeah, that makes sense. imagescalers are waiting on swift and swift waits on imagescalers [19:42:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20930 [19:42:16] New review: Diederik; "Default is fine." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/20919 [19:42:33] that 404 timing is waiting on thumbs to get scaled. 
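"A lot of apaches in D" means processes stuck in uninterruptible sleep, typically on I/O. A quick way to enumerate them on a scaler without strace is to read the state field from /proc; this is a generic Linux sketch, not a tool that was actually used here:

    import os

    def d_state_procs():
        """List PIDs currently in uninterruptible sleep (state 'D'), e.g. blocked on NFS I/O."""
        hung = []
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/stat" % pid) as f:
                    fields = f.read().split()
                # field 1 is the comm, e.g. "(apache2)"; field 2 is the state letter
                if fields[2] == "D":
                    hung.append((int(pid), fields[1].strip("()")))
            except OSError:
                continue  # process exited while we were scanning
        return hung

    if __name__ == "__main__":
        for pid, comm in d_state_procs():
            print(pid, comm)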
[19:42:49] I'm not seeing any increase in backend ops from MW's side [19:43:31] paravoid: ben's change is quite simple [19:44:07] AaronSchulz: at this point no new original 404 requests are coming in, so it wouldn't have any effect. [19:44:24] maplebed: we're way out of squid realm, so you're the expert here [19:44:49] maplebed: right [19:44:56] I'm not sure what we would poke to speed things settling though. [19:44:58] we're already in an outage for almost half an hour [19:45:28] New patchset: preilly; "fix paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20932 [19:45:31] image scalers are in D, waiting for swift as I see it [19:45:46] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [19:45:55] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59677 bytes in 0.166 seconds [19:45:56] paravoid: it's not actually an outage; it's just a degredation of service. (eg http://commons.wikimedia.org/wiki/Special:NewFiles still loads) [19:45:57] sadly I think we have to ride it out [19:46:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20932 [19:46:13] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [19:46:19] huzzah! okay! :) [19:46:22] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [19:46:22] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [19:46:23] maplebed: fine line there [19:46:40] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [19:47:02] okay, this a huge drop [19:47:05] yes [19:47:25] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [19:48:23] maplebed: so, no idea why that happened or why it stopped? [19:48:23] it's interesting that ms5's cpu utliization flatlined during the time load on the scalers spiked: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:48:43] the way they all cleared up simultanously really feels like a freed lock [19:48:48] well it went up a little, normally floats around 4 or so [19:48:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20932 [19:49:42] maplebed: what's the theory? that image scalers were trying to write on ms5 but were blocked? [19:50:06] maybe? [19:50:20] does waiting on NFS look any different in terms of iowait from waiting on tcp? 
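On the iowait question: iowait alone looks the same either way, but /proc/<pid>/wchan names the kernel function a blocked task is sleeping in, which usually separates NFS/RPC waits from socket waits. A small sketch, assuming a Linux kernel that exposes symbol names in wchan:

    import sys

    def wchan(pid):
        """Return the kernel symbol a process is currently sleeping in ('-' if none)."""
        try:
            with open("/proc/%d/wchan" % pid, "rb") as f:
                return f.read().decode(errors="replace") or "-"
        except OSError:
            return "?"

    def classify(sym):
        if sym.startswith(("nfs_", "rpc_")):
            return "blocked in NFS/RPC"
        if "tcp" in sym or "sk_wait" in sym or "sock" in sym:
            return "blocked on a socket"
        return "other"

    if __name__ == "__main__":
        for arg in sys.argv[1:]:
            sym = wchan(int(arg))
            print(arg, sym, "->", classify(sym))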
[19:50:48] not in top, but I lsof'ed a few of the blocked apaches [19:51:10] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [19:51:10] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.26 ms [19:51:10] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.51 ms [19:51:11] no /mnt/upload6 open files at all [19:51:19] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 4.10 ms [19:51:19] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.35 ms [19:51:19] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [19:51:19] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [19:51:28] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [19:51:28] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms [19:51:28] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.58 ms [19:51:31] yeah, I didn't see any in lsof earlier either. [19:51:37] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms [19:51:37] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [19:51:37] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.79 ms [19:51:42] if we were going to have locking issues on ms5 I feel like we would have seen this some time ago [19:51:45] not now suddenly [19:51:46] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms [19:51:50] that too :) [19:51:51] agreed. [19:52:00] the load pattern just looks weird there. [19:52:04] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [19:52:04] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [19:52:13] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.84 ms [19:52:16] New patchset: preilly; "fix Could not find dependency File issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20933 [19:52:22] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.55 ms [19:52:24] hmm. [19:52:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20933 [19:53:25] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms [19:53:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20933 [19:53:45] hmm? [19:54:19] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [19:54:23] just puzzled. [19:54:40] AaronSchulz: have a look at: http://pastebin.com/nCxtDw3m [19:54:45] maplebed, apergos too I guess [19:54:50] that's an lsof from a blocked apache at the time [19:55:18] well that's not an lsof [19:55:23] thanks for it though :-) [19:55:29] so, ms-fe1 is requesting a resize from rendering.svc [19:56:00] srv220 has the memcache sockets open and *two* ms-fe.svc connections [19:56:03] why? [19:57:49] that's an Aaron question I guess [19:58:12] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [19:58:14] apergos: one php process with two connections? 
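The lsof check described above (does the blocked apache hold anything open under /mnt/upload* at all?) can be scripted against /proc/<pid>/fd. A minimal Linux-only sketch; the mount prefixes are the ones mentioned in the log:

    import os

    def open_paths(pid):
        """Resolve a process's open file descriptors to paths, like 'lsof -p <pid>'."""
        fd_dir = "/proc/%d/fd" % pid
        paths = []
        for fd in os.listdir(fd_dir):
            try:
                paths.append(os.readlink(os.path.join(fd_dir, fd)))
            except OSError:
                pass   # fd closed between listdir and readlink
        return paths

    def upload_mount_files(pid, prefixes=("/mnt/upload", "/mnt/thumbs")):
        return [p for p in open_paths(pid) if p.startswith(prefixes)]

    # Example: upload_mount_files(12345) returning [] matches what the lsof runs showed.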
[19:58:21] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [19:58:21] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [19:58:21] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [19:58:21] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [19:58:21] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [19:58:21] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [19:58:22] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [19:58:22] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [19:58:22] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [19:58:24] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [19:58:24] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [19:58:30] PROBLEM - Host ps1-a4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:58:30] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [19:58:30] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [19:58:31] sigh [19:58:38] noisy in here [19:58:59] what would it need to open connections for? [19:59:02] *two [19:59:15] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [19:59:21] also paravoid would you mind pasting the whole lsof? [19:59:24] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [19:59:24] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [19:59:33] PROBLEM - Host ps1-b4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:59:42] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [19:59:42] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [19:59:44] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [19:59:44] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [19:59:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [19:59:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:00:10] apergos: http://pastebin.com/gGG6AAQE [20:00:18] AaronSchulz: yes [20:00:22] thanks [20:00:27] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [20:00:46] AaronSchulz: one apache/php process, two ms-fe outgoing connections [20:00:56] could it be HEAD/PUT? 
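Counting how many TCP connections a given php/apache process holds to the Swift frontends (the "two ms-fe.svc connections" observation) can be done with psutil, a third-party module assumed to be available for this sketch; the frontend address is a placeholder, not the real service IP:

    import psutil  # third-party; assumed available on the scaler for this sketch

    SWIFT_ADDRS = {"10.2.1.27"}   # placeholder for ms-fe.svc -- not the real address

    def swift_conns_by_pid():
        """Map pid -> number of established TCP connections to the Swift frontends."""
        counts = {}
        for c in psutil.net_connections(kind="tcp"):
            if c.status == psutil.CONN_ESTABLISHED and c.raddr and c.raddr.ip in SWIFT_ADDRS:
                counts[c.pid] = counts.get(c.pid, 0) + 1
        return counts

    if __name__ == "__main__":
        for pid, n in sorted(swift_conns_by_pid().items(), key=lambda kv: -kv[1]):
            print(pid, n)   # a php process showing n == 2 matches the lsof observation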
[20:00:59] New patchset: preilly; "fix for : IOError: [Errno 2] No such file or directory: /wikipedia.yaml" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20936 [20:01:42] paravoid: cloufiles uses multiple curl handles, one for each of a few categories of requests...something about getting around not being able to reset certain curl options [20:01:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20936 [20:02:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20936 [20:02:24] RECOVERY - Host ps1-b1-sdtpa is UP: PING WARNING - Packet loss = 54%, RTA = 2.67 ms [20:02:24] RECOVERY - Host ps1-b3-sdtpa is UP: PING WARNING - Packet loss = 54%, RTA = 2.23 ms [20:02:24] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [20:02:24] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.02 ms [20:02:33] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [20:02:33] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.56 ms [20:02:33] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.38 ms [20:02:33] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [20:02:33] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.15 ms [20:02:33] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [20:02:33] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [20:02:34] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.72 ms [20:02:35] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [20:02:35] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms [20:02:35] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.32 ms [20:02:36] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [20:02:37] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.49 ms [20:02:43] umm, okay [20:02:45] looks like HEAD/GET_OBJ use different connections [20:02:46] hmm not much going on in here (the output) [20:02:51] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.92 ms [20:03:16] but why would it remain open? [20:03:35] stay open when? [20:03:48] you do a HEAD and then a GET, right? [20:04:14] check if it exists and if so, fetch it? [20:04:32] anybody want to review https://gerrit.wikimedia.org/r/#/c/20916/1/files/swift/SwiftMedia/wmf/rewrite.py? (thanks for the +1 AaronSchulz) [20:04:48] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [20:04:57] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.31 ms [20:05:30] paravoid: there is a HEAD of the thumb for a 304 check [20:05:54] New patchset: Ottomata; "Excluding /a/squid/archive from stat1 /a amanda backups." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20946 [20:06:03] also used for "already exists, don't need to regenerate" [20:06:17] aha [20:06:33] should it close that connection as soon as you're done though? [20:06:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20946 [20:06:50] that's probably minor compared to the rest however. 
[20:06:59] paravoid: it should be closed when the script is done [20:07:25] maplebed: could you explain your change a bit [20:07:36] maplebed: I'm still a swift newbie :) [20:07:41] I don't think closing that HEAD conn before streaming would make much difference [20:07:47] sure. [20:07:59] and CF would need some slight changes for that [20:08:17] AaronSchulz: right [20:08:24] the section of code is run after rewrite has already checked swift to see if it has the object. [20:08:46] it's splitting on the return status of the request on the way back. [20:08:48] AaronSchulz: so, you do HEAD for 304 and then either GET for the thumb or GET for the original, correct? [20:08:56] if it was a 200, it just passes it along. [20:09:06] if it's a 404, it triggers the 404 handling (call out to the scalers) [20:09:22] anyone available to help me review my java.pp commit? [20:09:23] https://gerrit.wikimedia.org/r/#/c/20741/ [20:09:40] the addition says to only trigger the 404 handler if it's dealing with a thumb (as opposed to an original or temp) [20:10:02] ottomata: sorry, not at the moment. [20:10:11] okay, great [20:10:30] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [20:10:30] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [20:10:31] (zone is set above and is either public, temp, or thumb) [20:10:39] s'ok, danke [20:11:24] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.69 ms [20:11:44] paravoid: I'm having a hard time finding the case where it GETs the original [20:12:22] the ms5 nginx error log for this time period is filled with suff like [20:12:24] 2470508460 upstream timed out (110: Connection timed out) while reading response header from upstream, [20:12:55] that's weird. who's hitting ms5 via nginx (and not nfs)? [20:13:17] not nw I mean a bit earlier [20:13:38] squids [20:13:41] I dunno why [20:13:45] yeah, I didn't think nginx on ms5 was getting any traffic since months ago. [20:13:48] maybe this was after the rollback [20:14:04] paravoid: well, of course it GETs the original when it has to render it, but that its [20:14:34] if the thumb is sized the same (or bigger than) as the original, you just get an error (as always) [20:15:37] * AaronSchulz wonders whats using nginx [20:15:41] !log added hosts boron/indium to nagios's nsca config [20:15:50] Logged the message, Master [20:16:58] AaronSchulz: what I'm saying is that we know that PHP wasn't blocked by the HEAD; it was blocked by the second HTTP connection, either waiting for a thumb or for the original [20:22:37] it must get requests when swift traffic is switched over and not otherwise [20:22:38] I don't see why the GET would hang, it's authenticated, so it shouldn't trigger any rewrite loops [20:22:42] no objections to my rewrite.py diff? [20:22:48] none [20:23:32] I would really love to see profiling around the calls to NFS and to swift (from the scalers) [20:24:26] maplebed: no objections but seems to be unrelated to the incident, no? [20:24:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:41] clients claim to be squids [20:24:56] paravoid: depends. if swift's sending originals 404s to thumb_handler freaks out the scalers, then yes, very related. [20:25:20] what do you mean by "freak out the scalers"? [20:25:39] and if that's the theory, then let's try to reproduce this. [20:26:06] yeah, I can do that. 
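Reduced to its shape, the change maplebed describes looks roughly like the following. This is a hedged sketch of the control flow only, not the actual rewrite.py middleware; the zone values come from the discussion above and the function names are invented for illustration:

    # Sketch of the control flow described above -- not the real rewrite.py code.
    def handle_swift_response(zone, status, response, handle_404, pass_through):
        """zone is 'public', 'temp' or 'thumb'; status is what swift returned."""
        if status == 200:
            return pass_through(response)          # object exists: stream it back
        if status == 404:
            if zone == "thumb":
                return handle_404(response)        # only thumbs go to the image scalers
            # originals / temp: don't bounce a malformed URL to the scalers,
            # just report the miss upstream
            return ("404 Not Found", "")
        return pass_through(response)              # other statuses pass through unchanged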
[20:28:13] New patchset: Ottomata; "Excluding /a/squid/archive from stat1 /a amanda backups." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20946 [20:28:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20946 [20:28:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.921 seconds [20:29:30] it failed quickly. My two tests were: [20:29:37] curl -v -H "User-Agent: benfoo" -H "Host: commons.wikimedia.org" http://srv222/w/thumb_handler.php/a/a2/Little_kitten_.jpg/1px-Little_kitten_.jpg [20:29:38] and [20:29:48] curl -v -H "User-Agent: benfoo" -H "Host: commons.wikimedia.org" http://srv222/w/thumb_handler.php/a/a2/Little_kitten_.jpg [20:30:53] does this mean anything to anyone: "err: Could not retrieve catalog from remote server: Error 400 on SERVER: Exported resource Nagios_host[db12] cannot override local resource on node spence.wikimedia.org" ? [20:31:12] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [20:31:50] I'm looking at graphite metrics FileBackendStore.* with aaron. [20:31:56] the timings are interesting [20:32:35] (FileBackendStore is in the category pulldown) [20:34:03] !log hand-tweezed spence:/etc/nagios/nsca_payments.cfg b/c puppet is broken for spence, restarted nagios on spence [20:34:12] Logged the message, Master [20:34:22] tehy're pointing a finger at NFS getting slow. [20:34:39] maplebed: what do you mean "failed quickly"? [20:35:11] I mean it gave me back a 404 in 0.025s. [20:35:40] maplebed, AaronSchulz: the lsofs were pointing to being blocked by ms-fe rather than NFS; NFS also wouldn't explain why it failed now of all days and times [20:36:02] whereas handing me back the 1px thumb took 0.07 ad generating a new thumb took 0.2s. [20:36:22] no, it doesn't, but the curve is remarkable. [20:37:20] where specifically? [20:38:07] paravoid: http://screencast.com/t/wWdsXPWWxYiL (because graphite is a pain and I can't link to it.) [20:38:48] but to get the graph for yourself, https://graphite.wikimedia.org/dashboard/, then choose FileBackendStore from the Category menu, then open the * folder then choose the tp90 graph. [20:39:06] isn't that https://gdash.wikimedia.org/dashboards/filebackend/ ? [20:39:51] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20946 [20:40:20] so what gets to ms5 are taking place when traffic for originals goes to swift? or rather, why are they taking place? [20:40:20] why yes, I think it is! [20:44:07] RoanKattouw: did that script finish? [20:48:56] maplebed: ? [20:49:06] yes? [20:49:44] so, what's the theory? [20:51:10] looking at the graph for the FileBackend stuff, [20:51:18] it appears that ms5 is the one that slowed down, not ms7. [20:51:39] the key: FileBackendStore.doQuickOperations == thumbnails (ms5) whereas [20:51:53] FileBackendStore.doOperations == originals (ms7). [20:52:26] but what's increasing the load on ms5 is a mystery [20:52:32] since the number of calls doesn't actually chancge much. [20:54:33] so it's almost midnight [20:54:45] and as usual about this time (since I've been on line since about 8 am) [20:54:49] I'm not good for much [20:55:04] http://www.megapro.net/products/specialty-drivers/electronic/ [20:55:06] cool shit. [20:55:06] apergos: do you know what the upload-settings.php is used for in /home/w/conf/squid/+ [20:55:13] i ordered some for the datacenter(s) [20:55:23] maplebed: I do, what do you need? 
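The same probe as the curl tests above, with timing, so the "404 in 0.025s" observation can be repeated in a loop. Standard library only; the host, paths and headers are the ones used in the tests:

    import http.client
    import time

    def probe(host, path, vhost="commons.wikimedia.org", agent="benfoo"):
        """Equivalent of the curl tests above: request path from host, report status and latency."""
        conn = http.client.HTTPConnection(host, timeout=30)
        start = time.time()
        conn.request("GET", path, headers={"Host": vhost, "User-Agent": agent})
        resp = conn.getresponse()
        resp.read()
        return resp.status, time.time() - start

    if __name__ == "__main__":
        for path in (
            "/w/thumb_handler.php/a/a2/Little_kitten_.jpg/1px-Little_kitten_.jpg",
            "/w/thumb_handler.php/a/a2/Little_kitten_.jpg",
        ):
            status, secs = probe("srv222", path)
            print("%s -> %d in %.3fs" % (path, status, secs))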
[20:55:36] I'm just looking for non-mediawiki references to ms5 [20:55:39] there's one in there. [20:56:21] this is a no-op. [20:56:24] ignore that [20:57:22] are you saying that the issue was ms5 being slow to respond? [20:57:32] ms5's graphs don't show this at all [20:58:00] I know. I don't know what to say. mediawiki's routines that needed to access it took a long time, but it doesn't seem loaded. [20:58:00] no, and I did an iostat over there during the time of the problem [20:58:07] it didn't look too nuts or anything [20:58:12] no load, no traffic spike, a small GET spike to 5 per second [20:58:43] http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=ms5.pmtpa.wmnet&m=load_one&s=by+name&mc=2&g=load_report&c=Miscellaneous+pmtpa [20:58:46] I mean, the symptoms we saw were that the scalers appeared hung. [20:58:59] the mediawiki profiling says it's waiting on nfs, but the nfs server seems fully alive and happy. [20:59:02] so ... [20:59:12] I'm stumped. [20:59:20] hung with no /mnt/upload6 fds with 3-4 lsof I did [20:59:25] or /mnt/upload* [20:59:52] (/mntlupload6 is origs; thumbs are /mnt/thumbs) [21:00:08] lookint at atop -r I don't see anything oout of the ordinary during the time frame either [21:01:13] hmm. Aaron|laptop is there anything else going on in the section of code profiled by the fileBackendSTroe.doQuickOperationsInternal-local-NFS.tp90? [21:01:18] Aaron|laptop: load one went from 6 to 9, that's nothing compared to the effects we saw [21:01:30] esp. for an 8-core machine [21:02:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:36] yeah you can see some higher loads in July [21:03:47] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&v=98.67&m=IO_Utilization&r=year&z=small&jr=&js=&st=1345582917&vl=%25&z=large this seems to argue against ms5 being at fault also [21:04:11] I have got to get food, since it appears we are going to psot mortem this to death (ours, of starvation). back shortly [21:06:39] I haven't gotten lunch yet either. [21:07:01] AaronSchulz: Yes, it finished [21:08:32] Aaron|laptop: http://pastebin.com/Qnxn6djh [21:11:29] \o/ [21:11:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.851 seconds [21:14:41] going to get lunch. back in 20 [21:26:41] !log resumed s3 user.user_email migrations [21:26:50] Logged the message, Master [21:27:52] dessert first [21:30:22] New review: Platonides; "Thanks Daniel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20917 [21:43:04] any ops folks available to help with a ticket? [21:46:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:39] !log running purgeParserCache.php for all dbs [21:59:48] Logged the message, Master [22:00:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [22:02:28] andrew_wmf: sorry, unavailable [22:03:07] andrew_wmf: rt-3459? [22:03:17] I can help you with that. 
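Graphite's render endpoint can return the same tp90 series as JSON, which is easier to share than a dashboard link. The metric paths below are assumptions pieced together from the names mentioned above (FileBackendStore.doQuickOperationsInternal / doOperationsInternal), so they may need adjusting:

    import json
    import urllib.request

    GRAPHITE = "https://graphite.wikimedia.org"
    # Metric paths are assumptions based on the category/metric names discussed above.
    TARGETS = [
        "FileBackendStore.doQuickOperationsInternal-local-NFS.tp90",
        "FileBackendStore.doOperationsInternal-local-NFS.tp90",
    ]

    def fetch_tp90(target, window="-2hours"):
        url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, target, window)
        with urllib.request.urlopen(url, timeout=30) as resp:
            series = json.load(resp)
        # Each element: {"target": ..., "datapoints": [[value, timestamp], ...]}
        for s in series:
            points = [v for v, _ in s["datapoints"] if v is not None]
            if points:
                print(s["target"], "max over window:", max(points))

    if __name__ == "__main__":
        for t in TARGETS:
            fetch_tp90(t)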
[22:03:32] i never get an autoreply with the ticket number, but for the mailman subscribers [22:03:49] a number of these on the image scalers from the switchover period: [22:04:06] Aug 21 18:59:46 srv224 apache2[334]: [error] [client 10.0.6.211] File does not exist: /usr/local/apache/common/docroot/default/wikipedia [22:04:06] Aug 21 18:59:46 srv224 apache2[334]: [error] [client 10.0.6.211] File does not exist: /usr/local/apache/common/docroot/default/w [22:04:07] etc [22:04:17] ah hah and [22:04:19] Aug 21 19:20:14 srv224 kernel: [12476563.950017] nfs: server ms5.pmtpa.wmnet not responding, timed out [22:04:24] a few of these in the end [22:04:27] woo! [22:04:32] is that in dmesg? [22:04:35] or kern.log? [22:04:36] syslog [22:05:36] then there's things like this: [22:05:43] Aug 21 19:32:57 srv224 apache2[19219]: PHP Notice: FSFileBackend::doStoreInternal: copy() failed but returned true. in /usr/local/apache/common-loc [22:05:43] al/php-1.20wmf9/includes/filerepo/backend/FSFileBackend.php on line 208 [22:06:10] the first of the nfs imteouts is the one I pasred, after that I see them regularly [22:06:23] Aaron|laptop got kicked off... getting him to read over my shoulder. [22:06:31] until [22:06:32] Aug 21 19:44:42 srv224 kernel: [12478032.323779] nfs: server ms5.pmtpa.wmnet not responding, timed out [22:06:34] last one [22:07:25] no indication why it's timing out though, is there? [22:07:30] never is [22:08:02] so first I'd want to know what http requests went there and why (I only see the few errors, it's not very helpful) [22:08:36] * apergos goes to scry the ganglia graphs again [22:09:46] andrew_wmf: replied and closed the ticket. please let me know if you don't get the data you need. [22:10:29] apergos: what do you mean which http requests went there? you mean which requests are associated with the MW errors? or what nginx was serving? [22:10:30] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms5.pmtpa.wmnet&m=load_one&r=year&s=by%20name&hc=4&mc=2&st=1345586932&g=network_report&z=large&c=Miscellaneous%20pmtpa can't tell much from looking at these. rats [22:10:38] got it thx [22:11:06] I mean during the tme that it was receiving http requests, where were they coming from, why, how many a second or minute or whatever [22:11:19] apergos: it is interesting that network out dropped so severly yesterday. [22:13:04] and I'd like to know where this path comes from /usr/local/apache/common/docroot/default/wikipedia that we see in the scaler logs [22:13:28] which might be someting for Aaron|laptop again :-P [22:13:41] what? [22:13:49] Aug 21 19:22:02 srv224 apache2[14352]: [error] [client 10.0.6.211] File does not exist: /usr/local/apache/common/docroot/default/wikipedia [22:14:03] things like these on the scalers when we were doing the swift test [22:14:17] that and File does not exist: /usr/local/apache/common/docroot/default/w [22:14:40] anything jump out at you as the cause? [22:15:10] and those were not there before? [22:15:29] apergos: we do have this graph: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms5.pmtpa.wmnet&v=%200&m=Gets_per_second&r=day&z=default&jr=&js=&st=1345582194&vl=qps&z=large [22:15:41] that's the output of a tcpdump counting the number of GETs. [22:15:44] (on ms5) [22:15:56] oh yay that's helpful, I thoguht there was but couldn't find it [22:16:16] it's super-ghetto. it's a tcpdump piped to python running in a screen sessoin on ms5. 
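The "super-ghetto" gauge described above is just tcpdump piped into a small per-second counter; a rough reconstruction follows. The tcpdump flags and filter in the comment are assumptions about how it was wired up, not the original command line:

    #!/usr/bin/env python
    # Rough reconstruction of the counter described above; run as e.g.
    #   tcpdump -l -A -s 256 'tcp dst port 80' | python gets_per_second.py
    # (flags and filter are assumptions, not the original invocation)
    import sys
    import time

    def main():
        count = 0
        bucket = int(time.time())
        for line in sys.stdin:
            if "GET /" in line:
                count += 1
            now = int(time.time())
            if now != bucket:
                print("%d GETs/sec" % count)
                sys.stdout.flush()
                bucket, count = now, 0

    if __name__ == "__main__":
        main()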
[22:16:18] Aaron|laptop: earliest one in today's log is from when we started mucking about [22:16:59] 18:27:57 utc [22:17:34] last one is 19:26:20 utc [22:17:40] so yeah it's the swift cutover all right [22:17:45] were they spamming? [22:19:56] 530 of em in that time frame [22:21:12] looking at rewrite.py I was wrong about what it's sending the scalers when an original 404s. [22:21:14] the first 20 min only a few, then really quite a lot [22:21:27] I'm not sure exactly what it does send yet, but it's not a well fromed URL. [22:21:39] i need to switch the s2 master, it will be read-only for a couple minutes [22:24:10] !log s2 master swap - new repl position MASTER_LOG_FILE='db54-bin.000302', MASTER_LOG_POS=870283287 [22:24:19] Logged the message, Master [22:25:14] apergos: which log file had that error? [22:25:24] syslog on srv224 [22:25:29] I expect all the scalers have em [22:25:47] I found the cause! \o/ [22:26:10] yup just checked 219 [22:26:11] oh?? [22:26:32] so when an original hits a 404 in swift, it's passing the upload URL unmodified through to the scaler. [22:26:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 220 seconds [22:26:37] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 220 seconds [22:26:51] basically curl -v -H "User-Agent: benfoo" -H "Host: upload.wikimedia.org" http://srv222/wikipedia/commons/a/a2/Little_kitten_.jp [22:26:55] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 239 seconds [22:27:04] and that curl triggers the docroot file not found thing on srv222's log. [22:27:22] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 265 seconds [22:27:22] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 265 seconds [22:27:22] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CRIT replication delay 265 seconds [22:27:22] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 266 seconds [22:27:44] well that's broken [22:27:57] the fault is in the 404 handler in swift, which expects only thumbnails [22:28:12] (and has an unterminated if: block around creating a proprely formed URL) [22:28:25] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [22:28:27] line 163 if you're interested. [22:28:34] er which file? [22:28:39] rewrite.py [22:28:52] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [22:29:11] sadly though, I don't think it lends any insight into why ms5 ate itself. [22:29:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:29:37] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [22:29:47] well one step at a time [22:30:05] :P [22:30:07] next (but not tonight) for me would be to see where ms5 gets come from when swit is on [22:30:12] *swift [22:30:22] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [22:30:22] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [22:30:22] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 0 seconds [22:30:23] i.e. what generates them [22:30:44] apergos: that's going to be a challenge given they're not logged anywhree. [22:31:01] though we could run a tcpdump on ms5 if we could trigger them to happen again. 
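Counting those scaler syslog entries inside the incident window (the "530 of em" figure above) can be made explicit with a small script. The log path, message patterns and window are taken from the messages quoted earlier; the year (2012) is assumed since syslog timestamps omit it:

    import re
    import sys
    from datetime import datetime

    # Patterns quoted from the scaler syslog earlier in the log.
    PATTERNS = (
        "File does not exist: /usr/local/apache/common/docroot/default",
        "nfs: server ms5.pmtpa.wmnet not responding",
    )
    # Incident window (UTC) mentioned above; year assumed to be 2012.
    START = datetime(2012, 8, 21, 18, 27, 57)
    END = datetime(2012, 8, 21, 19, 26, 20)
    TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d+ \d{2}:\d{2}:\d{2})")

    def in_window(line):
        m = TS_RE.match(line)
        if not m:
            return False
        ts = datetime.strptime("2012 " + m.group(1), "%Y %b %d %H:%M:%S")
        return START <= ts <= END

    def main(path):
        hits = 0
        with open(path, errors="replace") as f:
            for line in f:
                if any(p in line for p in PATTERNS) and in_window(line):
                    hits += 1
        print(hits, "matching entries in the window")

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog")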
[22:31:03] I have a few logged in the error log on ms5 [22:31:53] ah, so there are. [22:32:23] some are "index forbidden" but the connection timeout ones at least show some urls [22:32:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:53] /wikipedia/commons/thumb/2/25/So_tired.jpg/160px-So_tired.jpg (yes, I chose this one on purpose) for example... looks like a well-formed thumb request, right? [22:33:12] yup. [22:33:25] though original couldn't be found. [22:34:03] also, apergos, the error log has 201 requets from sq41, 27 from sq42, and 2 or 2 from a few others. [22:34:20] yes, I noticed thatmost seemed to be sq41 [22:34:25] not sure what that means either [22:34:47] and maybe it will all be a dead end but getting the noise out of the system is always good anyways [22:36:36] see 5 gets a second, it's nothing [22:36:37] I've got an idea. [22:36:44] used to be 40 [22:36:47] ok... listening [22:36:59] the current squid config says [22:37:04] 1) allow thumbs to swift [22:37:11] 2) deny thumbs but allow everythincg else to ms7 [22:37:16] 3) allow thumbs to ms5 [22:37:26] each of those statement shas a connect-timeout and a max-conn. [22:37:41] uh huh [22:37:41] if either the connect timeout expires or the max-conn is exceeded, does it fall through to the next rule? [22:37:49] I have no idea [22:37:51] (which would allow things to fall through and hit ms5) [22:37:58] * apergos goes to look at the squid conf [22:38:03] and then to rtfm [22:38:54] poor apergos [22:44:54] New patchset: Bhartshorne; "change to swift to only send thumbs to the image scalers instead of all 404s." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20916 [22:45:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.420 seconds [22:45:07] well that was scary. [22:45:20] ^^ rewrite.py change to log and return an error when it hits the case we were just discussing. [22:45:26] apergos: do tell. [22:45:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20916 [22:45:37] X locked my cursor to the movement pointer and wouldn't give it back [22:46:03] finally guessed the right key combo in xcfe to tab through the windows, that released it [22:46:53] for whatever reason alt ctl fn wasn't giving me other consoles, probably some fedora 16 weirdness [22:52:17] New patchset: Pyoungmeister; "first bits of new coredb module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21019 [22:52:51] ok (on the rewrite.py change) [22:52:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21019 [22:53:16] are we really doing this again tonight? or are we done for a couple days? [22:55:55] maplebed: [22:56:08] we have a window tomorrow morning. [22:56:14] uuggghh [22:56:15] I'm going to push the rewrite changes later today. [22:56:20] ok [22:56:55] tomorrow I'll try to rack my brain to figure out how we can track the cause of the slowdown [22:57:35] is ms5 unhappy cause it has too many scalers hanging on it and they are waiting on swift? or is it some other conbination? tc) [22:57:36] I would love it if you and paravoid work on stuff in yoru morning tomorrow [22:57:37] but for now, [22:57:44] sleep [22:57:52] cause my morning is in 6 hours [22:57:56] night! [22:59:47] g'night. 
[23:10:57] !log running user.user_email index migration on s4 [23:11:06] Logged the message, Master [23:12:12] does anyone here know pdns well? =) [23:12:42] if so should "allow-recursion=127.0.0.1" be allowed? it looks like it maybe shouldn't [23:14:05] !log truncated enwiki.flagged(images|templates) [23:14:15] Logged the message, Master [23:15:03] New patchset: Ryan Lane; "Change bind address for NAT for eqiad network node" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21021 [23:15:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21021 [23:15:59] !log running user.user_email index migration on s5 [23:16:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21021 [23:16:09] Logged the message, Master [23:17:58] !log running user.user_email index migration on s6-7 [23:18:07] Logged the message, Master [23:19:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:44] That's intersting.. I was going to ask for that user_email index to be added to the centralauth globaluser table [23:21:48] Turns out it already has key (gu_email), [23:30:18] hrm, i just had to kill a "SELECT /* ApiQueryLogEvents::execute */ /*! STRAIGHT_JOIN */ log_type,log_action.. query on db26 that had been running from mw67 for 421132 seconds [23:32:03] ouch [23:32:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.727 seconds [23:33:37] not even on enwiki [23:40:12] bc says: scale=2;421132/86400=4.87 [23:41:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20916 [23:46:12] !log deployed change to swift proxy rewrite to only send thumb 404s to the image scalers gerrit change 20916 [23:46:21] Logged the message, Master
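Runaway queries like the 4.9-day SELECT mentioned above can be spotted from SHOW PROCESSLIST before they get that old. A sketch using PyMySQL (third-party, assumed available); credentials and host are placeholders:

    import pymysql  # third-party driver; assumed available for this sketch

    THRESHOLD = 3600  # seconds; flag anything running longer than an hour

    def long_running_queries(host, user, password):
        conn = pymysql.connect(host=host, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW FULL PROCESSLIST")
                # Columns: Id, User, Host, db, Command, Time, State, Info
                for row in cur.fetchall():
                    qid, quser, qhost, db, cmd, secs, state, info = row[:8]
                    if cmd == "Query" and secs and secs > THRESHOLD:
                        print("id=%s user=%s db=%s time=%ss query=%.80s"
                              % (qid, quser, db, secs, info or ""))
        finally:
            conn.close()

    # Example (placeholder credentials): long_running_queries("db26.example", "watchdog", "secret")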