[02:22:19] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1767s [02:25:29] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1957s [02:40:55] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 4s [02:46:35] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [06:24:01] New patchset: Asher; "adding db1017 as an eqiad db collector" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2292 [06:24:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2292 [06:24:45] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2292 [06:24:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2292 [06:27:19] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 36092 seconds [06:29:55] !log rotated all eqiad + anayltic enwiki slaves to replicate from db1017 after db1001 hardware failure [06:29:59] Logged the message, Master [06:31:07] oh noes [06:31:10] hardware failing [06:31:14] before the cluster is live [06:31:17] dead on arrival! [06:34:49] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 25891 seconds [06:38:09] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 24180 seconds [06:39:35] i think that makes 4 or 5 of the eqiad dbs [06:39:52] thanks dell! [06:40:05] sun had better dead-on-arrival score [06:40:12] huh [06:40:28] not surprised [06:40:53] good night! [06:55:29] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 32488 seconds [07:24:19] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [07:26:59] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay 0 seconds [07:32:46] PROBLEM - MySQL Slave Running on db1005 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown table __old_revision on query. Default database: d [07:35:06] uh oh OSC [07:41:06] PROBLEM - MySQL Slave Running on db1021 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown table __old_revision on query. Default database: d [08:17:16] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [08:26:56] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [08:36:46] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [08:36:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [08:37:06] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [09:00:08] New review: Dzahn; "(no comment)" [analytics/udp-filters] (master) C: 1; - https://gerrit.wikimedia.org/r/2235 [09:00:45] New review: Dzahn; "(no comment)" [analytics/udp-filters] (master) C: 1; - https://gerrit.wikimedia.org/r/2234 [09:11:51] hello [09:18:52] yo [09:48:58] New patchset: Dzahn; "add class for stat server per ezachte (RT 2162)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2293 [09:49:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2293 [09:51:25] New patchset: Dzahn; "add class for stat server per ezachte (RT 2162), apply on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2293 [09:51:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2293 [09:57:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 154 MB (2% inode=64%): /var/lib/ureadahead/debugfs 154 MB (2% inode=64%): [10:08:49] RECOVERY - Disk space on srv223 is OK: DISK OK [14:58:23] New patchset: ArielGlenn; "configuration for rolling rsyncs from dataset1001 to dataset2 and v.v." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2295 [14:59:11] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2295 [14:59:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2295 [15:02:22] New patchset: Mark Bergsma; "updated list of server for nrpe to include nagios server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2296 [15:03:28] New patchset: Mark Bergsma; "updated list of server for nrpe to include nagios server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2296 [15:04:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2296 [15:04:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2296 [15:11:56] New patchset: ArielGlenn; "and actually add the rsync module for rolling dump host rsyncs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2297 [15:13:02] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2297 [15:13:03] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2297 [15:56:05] New review: Mark Bergsma; "If we're redoing it anyway, why wouldn't we rewrite everything to use the "sudo_user" definitions, i..." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/2283 [16:01:02] New patchset: Mark Bergsma; "Move misc::torrus to a separate file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2315 [16:01:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2315 [16:02:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2315 [16:02:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2315 [17:14:21] New patchset: Mark Bergsma; "Experimental attempt of restyling misc web server setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2316 [17:23:09] New patchset: Mark Bergsma; "Experimental attempt of restyling misc web server setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2316 [17:24:47] New patchset: Mark Bergsma; "Experimental attempt of restyling misc web server setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2316 [17:26:50] robh: I dont think b2-sdtpa has the power to handle the new gluster. 
plz verify [17:27:27] cmjohnson1: i dont have b2 for that, i have it for one of the dbs [17:27:34] check your tickets, you have them reversed [17:27:45] New patchset: Mark Bergsma; "Experimental attempt of restyling misc web server setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2316 [17:29:56] cmjohnson1: c3-sdtpa and d1-pmtpa for rt 2393 (labstores/gluster), b2-sdtpa and d1-pmtpa for db59/60 new enwiki slaves [17:29:58] robh: yep..reversed...fyi i received 8 new storage servers [17:30:08] its 4 plus 4 disk shelves [17:30:10] or it shold be [17:30:19] if its 8 servers then there is a problem. [17:30:35] please confirm scan attach to procurement nad such before racking [17:31:12] https://rt.wikimedia.org/Ticket/Display.html?id=2114 [17:31:15] assigning that to you now [17:31:20] (its the procurement forit) [17:33:30] New patchset: Mark Bergsma; "Experimental attempt of restyling misc web server setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2316 [17:35:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2316 [17:35:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2316 [17:38:04] robh: new db59 in b2 power issue. [17:38:42] i see tower b is overloaded [17:39:58] cmjohnson1: well that sucks, it wont work there then. [17:40:10] lessee where the hell else it can go... [17:41:16] robh: moved to a different phase....not overloaded but close [17:41:37] cmjohnson1: one is still overloaded [17:41:48] tower b X phase keeps tottering over [17:41:54] so it has to come out of that rack, sorry dude [17:42:12] yep...see that now [17:42:25] http://torrus.wikimedia.org/torrus/Facilities?path=%2FPower_strips%2Fsdtpa%2Fps1-c1-sdtpa%2FSystem%2F [17:42:25] np...let me know where you want it [17:42:41] now, that is a redundant rack, so it needs to be at half the readout [17:42:58] hrmm, 4.3kw is that though [17:43:04] so thats full, cannot put it there [17:43:19] wait, i read that wrong, if orgot they combine towers on these [17:43:23] so we have plenty of overhead [17:43:23] ok, time to start breaking production. [17:43:33] maplebed: i almost named a server ls1 today [17:43:35] * maplebed begins the swift deploy [17:43:36] you would have loved it [17:43:47] where's a trout when I need one?!?! [17:43:52] i ensured it was called labstore just for you. [17:43:58] \o/ [17:44:01] i think my exact wording was [17:44:08] 'if we call it ls# ben will have a shitfit' [17:44:11] this is when I miss the Linden Love Machine the most. [17:44:17] 'he gets hulk mad about this stuff' [17:44:22] ;] [17:44:37] you could've just configured dns aliases, and then locally check who the user was, and rename the server accordingly for display ;) [17:44:48] then since i am not there, he would have had to hulk smash asher [17:45:02] cmjohnson1: Ok, so take a look at power in C1 [17:45:07] c1-sdtpa [17:45:38] the power phases are off, but meh, its prolly due to dataset1 being shut off [17:46:03] cmjohnson1: dataset1 can be spun up and have the disks wiped. you do not have to do it in rack or right now. we will use the disk shelf for it in another server in its place though [17:46:08] so dont bother unracking the disk shelf [17:46:13] i would leave both for now though [17:46:27] i would place db59 in U21/22 [17:46:44] but make sure you understand why =] [17:47:12] also, this rack just happens to have per port switching. 
[17:47:23] So while it is not vital for this server, we should label the power ports on the power strip [17:47:26] in the software for it [17:48:54] robh: how do you know the max input watts allowed? where/how was that determined? [17:49:22] So that is a good question, I am going to ping in some ops folks who may not know this [17:49:29] or atleast that I dunno if they know [17:49:34] New patchset: Bhartshorne; "changing swift's backend to ms5 in prep for deploying it to production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2317 [17:49:49] notpeter maplebed woosters Jeff_Green [17:49:59] ya [17:50:01] (sorry for poinging, but going to explain power in our datacenters so you may wanna know this) [17:50:02] pong [17:50:11] If you already know, sorry ;] [17:50:29] New patchset: Mark Bergsma; "Disable misc::torrus on streber for now, add on manutius" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2318 [17:50:41] RobH: in training. will read in scrollback. thanks for the ehads up :) [17:50:56] RobH: ready [17:51:03] So our preferred method of power now in the US is 3 Phase 208V 30 Amp power. We have this delivered to us in EQIAD and SDTPA. We have 3 Phase 208V 20 Amp power in PMTPA. [17:51:26] In the old racks in pmtpa we had traditional 15amp 120V power but we are phasing those out (row B and C) [17:51:36] 20A 120V [17:51:42] sorry, yep [17:51:46] and mark what is ESAMS again? [17:51:59] esams is 25A 230V single phase [17:52:18] So the basics for determining how much power you can use on a given circuit is the same no matter where it is or what its delivering [17:52:19] although I think it's actually a 32A breaker [17:52:33] volts * ampts * .8 = usable watts [17:52:42] per phase [17:52:52] indeed, we can load each phase to 80% [17:53:26] so the easy example is old pmtpa, 120V * 20A * 80% = 1920 Watts [17:53:34] thats also what a lot of offices have and such [17:53:40] now, 3phase is wonky, its a bit off [17:53:56] 208V 3phaser is 3 120V phases [17:54:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2318 [17:54:06] 3*120*30*0.8 = 8640 WATTS [17:54:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2318 [17:54:19] pmtpa [17:54:20] so 208V 3Phase 20A is [17:54:20] 3 * 120 * 20 * 0.8 = 5760 [17:54:28] how does the math differ when servers have 2 power supplies and there are two circuits going to a rack? [17:54:37] goign to get to that =] [17:54:48] all of this is assuming indeed, a single non redundant feed for power [17:54:53] in eqiad we are all redundant [17:55:01] and the same will slowly become so for tampa [17:55:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2317 [17:55:20] but does everyone (who is actively following) have any quesitons about the math for the single feed? [17:55:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2317 [17:55:30] it's actually sqrt(3) * sqrt(3) * 120 * 20 * 0.8, which matters if you don't have 3 phases ;) [17:55:32] RobH: aside--let's throw this backscroll into a wikitech page when said and done [17:55:54] yes, i may be dumbing down the math a great deal for our specific use case only [17:55:57] =] [17:56:10] So redundant feeds in a rack is simple really, cut your final total in half. 
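A quick sketch of the power-budget arithmetic walked through above, using only the figures quoted in channel: the 80% derating rule, the "3 x 120 V" simplification for 208 V 3-phase, and the halve-it rule for redundant feeds that is elaborated just below. It is an illustration of the math, not a capacity-planning tool, and as mark notes the exact 3-phase formula involves sqrt(3).

    # Sketch of the usable-watts math above (volts * amps * 0.8 per phase).
    def usable_watts(volts, amps, phases=1, derate=0.8, redundant=False):
        budget = phases * volts * amps * derate
        # With redundant A/B feeds, keep the normal draw to half the
        # single-feed budget so either feed can carry everything alone.
        return budget / 2 if redundant else budget

    print(usable_watts(120, 20))                  # old pmtpa rows B/C: 1920 W
    print(usable_watts(120, 20, phases=3))        # pmtpa 208V 3-phase 20A: 5760 W
    print(usable_watts(120, 30, phases=3))        # sdtpa/eqiad 208V 3-phase 30A: 8640 W
    print(usable_watts(120, 30, phases=3, redundant=True))  # per redundant feed: 4320 W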
[17:56:22] If everything in the rack is going to both feeds, it will draw roughly 50% per feed [17:56:34] that is up to the final bit of hardware mind you, server, switch, etc to handle [17:56:45] so if 8.6kw is how much we load on normal feed [17:56:56] we only load 4.3kw per feed on redundant, in case one feed is lost [17:57:00] the other wont instantly overload [17:57:23] it's better if you just make sure that the sum of both feeds never exceeds 8.6 kW [17:57:32] It is also technically part of every contract for most datacenters that provide redundant feeds [17:57:33] normally it doesn't matter if one feed does 6 kW and the other 2 [17:57:36] well, part of most [17:57:49] which may happen e.g. in pmtpa now, where not every server is redundantly connected [17:58:10] but should not happen in ashburn, except that one feed may be a few watts over for the mgmt switch in each rack [17:58:52] cmjohnson1: so you can imagine where having the old power in pmtpa sucked, you had to balance manually between 3 or 4 tiny circuits in each rack [17:59:01] so you would lose a bit of overhead in each circuit to waste [17:59:03] it sucked. [18:00:04] So that is power, its not complex once you have the math. [18:00:21] usually mark or myself determine where things are going to be racked, but this is a major consideration in it. [18:00:26] usually its the deciding factor. [18:00:40] we almost always have more U space and network ports in a rack versus power. [18:01:06] not 100% now a days in 42U single feed racks, but meh. [18:01:28] oh, and for power, each power strip has its own mgmt interface and reports to torrus [18:01:44] so torrus shows the power feeds with a very slight delay, or you can login to the power strip manually, racktables has the name of each [18:02:00] but its standardized to ps1-rack-datacenter.mgmt.blahblahblah [18:02:26] Chris tends to use that more than the planning folks, since he needs the feedback right away when racking. [18:02:34] torrus tends to be better for planning. [18:03:47] Nikerabbit: it looks like maplebed might be using the i18n bugfix deployment slot for swift. do you have anything you're rolling out this week? [18:04:04] also esams doesnt do power reporting, since we dont have the spiffy servertech pdus there. [18:04:17] robh: thx for the class on power...so B and C are different then A and D as for as power is concerned [18:04:22] we ahve a readout from them what we use and such on the displays above the rack but thats it [18:04:45] Yea, pmtpa row B and C are the shitty old power [18:04:59] 20A 120V [18:05:14] its why we will be ditching row B and C now [18:05:14] versus 208a 3phase [18:05:27] 208V 3Phase 20Amp [18:05:37] powermedium cannot deliver us 30 amp [18:05:47] its a drawback. [18:05:54] oh they can [18:05:58] they can't cool it though ;) [18:06:23] sadly thats a correct clarifying statement due to past provisions =P [18:06:41] heh, see rob setting up fans to point at the servers [18:06:49] the good old days. [18:08:27] New patchset: Mark Bergsma; "Remove inclusion of nonexistant class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2319 [18:14:11] RobH, what's the average efficiency of the last PSU you bought at the load you keep them? [18:14:35] (according to the producer, if you didn't test) [18:14:35] heh, that made me laugh. 
[18:14:52] we dont have detailed numbers, just rough estimates on how many we can pack into a rack [18:14:58] hm [18:15:03] i tend to pull up the mgmt and poll it for them when we reorder [18:15:46] RobH, don't the specifics say anything about efficiency? [18:16:02] Obviously I don't know anything about that sort of PSU. :) [18:16:26] you mean the efficiency on how much is delivered to it versus how much it delivers down? [18:16:32] yes [18:16:38] no clue =P [18:16:43] oic [18:16:53] the servertechs we use in rack, or each server power supply? [18:17:12] How many times do you convert voltage? [18:17:24] I don't know, just the most common one, out of curiosity [18:18:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2319 [18:18:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2319 [18:18:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2319 [18:19:04] well we take the 3phase 208V in and combine them into the balanced three circuits in the rack, still on AC [18:19:32] then to the power supplies at server level to DC, so only one type conversion, but the voltage is sure to step in there [18:19:40] just not aware of the hard and fast numbers [18:19:49] New patchset: Mark Bergsma; "Fix syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2320 [18:19:53] if we actually ran the datacenter, this would be a major problem, heh [18:20:59] ok [18:21:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2320 [18:21:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2320 [18:21:38] actually we don't convert voltage [18:21:41] it's 3x 120V [18:22:20] but each server gets two phases, and the voltage across them is 208V [18:22:22] i didnt think so but i wasnt sure enough to say ;] [18:24:28] so I guess the efficiency loss is at single-server PSU level? [18:24:34] (mainly) [18:25:01] i would assume so, its where the most conversion takes place [18:25:14] its why facebook is all DC in the racks arent they? [18:25:17] in most datacenters ther eis a big loss at the psu level [18:25:26] its been awhile since i read that article about the fb datacenters [18:25:35] a more efficient design is to have the PSU convert everything to 12V dc [18:25:38] and have capacitors on the MB [18:25:44] that requires more custom machine building [18:25:51] * LeslieCarr searches for the google whitepaper [18:26:11] does google custom build for that? [18:26:12] http://news.cnet.com/8301-1001_3-10209580-92.html [18:26:21] or they just say its awesome and wish they did? [18:26:39] and i know my answer, nm =P [18:26:53] :) [18:28:55] hrmm, so facebook adopted the open compute model... well, helped form, whatever [18:29:09] and glancing at that, its using a AC to DC power supply per server, thats disappointing. [18:29:18] i somehow recalled that they were using DC in rack [18:29:25] so much less nifty now. [18:30:04] though from the model of having a server design that most folks can easily adopt the ac to dc model works better. [18:30:14] thats what folks are used to delivery of in most racks at most places. 
[18:30:35] (for small to mid range companies that is, non tech related industries) [18:38:19] I thought they were using DC in the rack too [18:42:00] !log deploying squid config change to put swift in service for all thumbnails with /a/a2 in the URL [18:42:02] Logged the message, Master [18:42:03] and moving dc power is often ineffecient [18:43:15] i recalled some crazy thing about it being DC in power in rack, via a ups pushing DC for every other rack [18:43:20] but i dunno why. [18:45:46] You'd think only transferring it to dc once would've been more efficient [18:46:00] ie than dc for the ups, to then push it back to ac to become dc soon again after [18:46:14] yea [18:48:09] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [18:48:09] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [18:59:04] New patchset: Asher; "adding db1043 to s1 eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2321 [19:00:23] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2321 [19:00:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2321 [19:14:04] RECOVERY - Host db1018 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms [19:14:24] PROBLEM - Backend Squid HTTP on sq52 is CRITICAL: Connection refused [19:14:24] PROBLEM - Backend Squid HTTP on sq80 is CRITICAL: Connection refused [19:14:24] PROBLEM - Backend Squid HTTP on sq86 is CRITICAL: Connection refused [19:14:35] PROBLEM - Backend Squid HTTP on sq42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:04] PROBLEM - NTP on db1018 is CRITICAL: NTP CRITICAL: Offset unknown [19:18:03] error on save on mediawiki.org: "A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:(SQL query hidden)from within function "SqlBagOStuff::set". Database returned error "1637: Too many active concurrent transactions (10.0.6.50)"." 
[19:18:04] PROBLEM - Backend Squid HTTP on sq57 is CRITICAL: Connection refused [19:18:44] PROBLEM - mysqld processes on db1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:18:46] yeah, at this point [19:25:54] RECOVERY - Backend Squid HTTP on sq52 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.023 seconds [19:25:54] RECOVERY - Backend Squid HTTP on sq80 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.002 seconds [19:26:24] RECOVERY - Backend Squid HTTP on sq42 is OK: HTTP OK HTTP/1.0 200 OK - 460 bytes in 0.017 seconds [19:27:04] RECOVERY - NTP on db1018 is OK: NTP OK: Offset 0.01375472546 secs [19:27:59] New patchset: Andre Engels; "* Creating a new separate variable.py, which amongst other things is a repository for variables, so we don't have to re-define them each time * Moved the actual determination of what a variable does with a log line to this new variable.py, so that the" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2322 [19:28:54] !log swift deploy aborted due to squid config issues [19:28:56] Logged the message, Master [19:29:34] RECOVERY - Backend Squid HTTP on sq57 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.042 seconds [19:32:20] Jeff_Green: OTRS API: https://bugzilla.wikimedia.org/show_bug.cgi?id=28976 [19:34:41] i would mark it stalled behind the upgrade question [19:37:44] RECOVERY - Backend Squid HTTP on sq86 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.019 seconds [19:39:55] robla: that's why we can't have one parsercache db [19:40:01] robla: my i18n was rescheduled to happen earlier today [19:40:26] looks like not all places where updated to say that [19:40:38] Nikerabbit: k...thanks [19:41:19] binasher: I've not disputed the need for it [19:42:20] New patchset: Catrope; "* Creating a new separate variable.py, which amongst other things is a repository for variables, so we don't have to re-define them each time * Moved the actual determination of what a variable does with a log line to this new variable.py, so that the" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2322 [19:42:50] robla: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Text%20squids%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1328557276&g=network_report&z=large&c=Text%20squids%20pmtpa [19:43:16] that dip matches up chronologically [19:53:24] RECOVERY - mysqld processes on db1018 is OK: PROCS OK: 1 process with command name mysqld [20:01:13] cmjohnson1: we are going to have to migrate db9 [20:01:19] so that is ideally a weekend move. [20:01:27] okay [20:01:35] it touches so many services that its impossible to really find a good time during the week [20:02:08] so I will start the email chain to let folks know about it, are you available perhaps this or next weekend? [20:04:08] how much is on db9 now? [20:04:17] blog, bz.. [20:04:23] !log running mk-slave-prefetch on db1018 which was down for 5 days to see if it can catch up [20:04:25] Logged the message, Master [20:04:59] blogs, bugzilla, civicrm, contacts, dashboard, etherpad, survey, observium, racktables, rt [20:05:01] robh: Saturday the 11th [20:05:09] cmjohnson1: yep, you available? [20:05:43] if i can easily move db9's ip address (or get help with that part), we can migrate off of it mid-week [20:06:18] if we do that, then youy wont have a backup db though right? 
[20:06:34] robh -- that copy of civicrm is dead, it was migrated to db1008 [20:06:40] yay [20:06:48] long live shit queries [20:06:54] heh, db9 will be cleaned of that crap when binasher moves it [20:07:04] migrates it i mean [20:07:27] Reedy: :-P i'm sure there are plenty other shit queries in some of those other services [20:07:31] :D [20:07:36] Jeff_Green: you told me not to drop civicrm there because you thought it was used other than for fundraising [20:07:48] not *that* civicrm [20:08:10] Jeff_Green: can you drop it, pretty please? :) [20:08:12] there were several civicrm/drupal databases on that box [20:08:18] i'm not as sure what's what [20:08:24] sure [20:09:25] RobH: ip move would be ok.. especially if the db9 ip is brought up as a secondary ip on the new box. the second new box can be replicating from it via its main address [20:09:48] binasher: database "civicrm" is dropped, that's the only remaining one that was migrated to the fundraising db cluster [20:16:55] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 441481 seconds [20:25:17] New review: Diederik; "Hi Andre," [analytics/reportcard] (master); V: -1 C: -2; - https://gerrit.wikimedia.org/r/2322 [20:50:14] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 20413 seconds [21:13:02] so [21:13:04] let's move the block of [21:13:18] line 334 in squid.conf.php [21:13:20] to line [21:13:29] 295 [21:13:41] so they end up before the auto generated lines [21:13:53] these cache_peer lines don't have a round-robin statement, nor carp [21:13:57] so these are in orde [21:13:58] r [21:14:01] +1 [21:14:08] ok, moving. [21:14:18] that means swift will normally get 100% of the traffic in a/a2 [21:14:25] however, if it goes down, ms5 will still handle it [21:14:28] and that's what we want right [21:14:37] yeah. [21:14:43] perfect [21:14:43] reload the file - I've saved the change. [21:15:03] looks good to me [21:15:04] let's make [21:15:33] I added in one more comment line saying why it was at the top. [21:15:37] ok [21:15:42] we can diff the sq86 config only [21:15:43] ok, made, looking at the diff now. [21:15:44] makes it a bit easier ;) [21:15:53] oh! yes. [21:16:19] it's looking good to me [21:16:24] you can ./deploy sq86 when you're ready [21:16:35] deploying [21:16:44] reload done. [21:16:51] swift is serving traffic :) [21:16:53] according to cache_mgr [21:17:11] er no [21:17:12] sorry [21:17:21] I didn't look right [21:17:28] 0 fetches... [21:17:31] I'm not sure what I'm looking at here. [21:17:55] in cache_mgr? [21:18:14] 3 fetches [21:18:15] yay [21:18:17] so it is working [21:18:21] http://pastebin.com/AQBTeTeX ? [21:18:22] it's just getting very little traffic from one squid ;) [21:18:32] oh, no. [21:18:35] I'm looking at the back ond [21:18:39] front end squid [21:18:41] not the back end. [21:18:42] you're looking at the frontend [21:18:43] yep [21:18:49] pick sq86 port 3128 [21:18:58] ok, this makes much more sense. [21:20:19] and there's traffic in the swift logs! [21:20:20] \o/ [21:20:21] * mark installs tshark on sq86 to monitor the traffic [21:21:08] there is traffic [21:21:11] very little, but there is ;) [21:22:20] I just asked for 20 kittens and saw half go to each ms-fe host in their access logs. [21:22:35] which confirms (I think) that all /a/a2 traffic is going via swift. 
[21:22:42] yes it should [21:23:03] * mark checks the response headers [21:23:21] and I can see all the thumbs in the swift container listing [21:23:36] purging my test kittens [21:23:54] Hypertext Transfer Protocol [21:23:55] HTTP/1.1 200 OK\r\n [21:23:55] [Expert Info (Chat/Sequence): HTTP/1.1 200 OK\r\n] [21:23:55] [Message: HTTP/1.1 200 OK\r\n] [21:23:55] [Severity level: Chat] [21:23:55] [Group: Sequence] [21:23:55] Request Version: HTTP/1.1 [21:23:56] Response Code: 200 [21:23:56] Content-Type: image/jpeg\r\n [21:23:57] Last-Modified: Tue, 17 May 2011 22:30:32 GMT\r\n [21:23:57] Date: Mon, 06 Feb 2012 21:21:59 GMT\r\n [21:23:58] <^demon> Poor kittens :\ [21:23:58] Connection: close\r\n [21:23:58] \r\n [21:24:38] kittens purged and the swift container no longer shows them. [21:24:39] \o/ [21:24:43] no e-tag or other caching headers that could cause problems [21:25:00] so how does purging work now? [21:25:02] from mediawiki? [21:25:04] yeah. [21:25:07] awesome [21:25:12] AaronSchulz deployed taht last week. [21:25:33] ^demon: don't worry, there will always be kittens in swift. [21:25:34] I just added a kitten back [21:25:38] 193px [21:25:39] LVS queries for an 80px kitten once/sec. [21:26:03] ok [21:26:06] let's do some more squids :) [21:26:09] apergos: are you using the same kitten I am? I don't see it yet. [21:26:10] you can just ./deploy to individual hosts [21:26:13] apergos: see bug 34231, you make your kittens last :) [21:26:15] or ./deploy pmtpa.upload [21:26:19] for all upload squids in pmtpa [21:26:23] a/a2 [21:26:26] I'll do the latter. [21:26:36] oh, apergos you probably didn't hit squid86 [21:26:40] wget -S 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/193px-Little_kitten_.jpg' [21:26:46] ah [21:26:46] 84 [21:26:57] yeah I saw 80something and thought, well, there ya go [21:27:00] ok, ran ./deploy pmtpa.upload [21:27:05] do !log it :) [21:27:25] !log deployed new squid.conf to enable swift for all thumbs with /a/a2 in the URL [21:27:27] Logged the message, Master [21:27:43] done (with a bunch of timeouts) [21:27:54] I think we have a few dead squids in there [21:28:00] i'll clean them up tomorrow [21:28:47] how does ganglia get the request stats data? [21:28:57] for which, swift? [21:28:59] or ms5? [21:29:00] yeah [21:29:02] swift [21:29:09] tailing /var/log/syslog [21:29:11] and parsing the log lines [21:29:17] ah [21:29:21] (ganglia-logtailer) [21:29:28] swift doesn't provide that in a better way? [21:29:35] ohhhh this doesn'thave the Server: nginx/0.7.65 header in the response [21:29:43] it does have a statistics module that I need to spend more time looking at [21:30:05] apergos: no server header in fact? ;) [21:30:06] I wonder if it should have some sort of server line [21:30:16] right. none [21:30:25] yeah it has minimal headers right now [21:30:41] there should be a 198px kitten someplace [21:30:43] .ok, rerunning my curl 20 kittens test, using upload instead of sq86 [21:30:45] it would be nice if the frontends could at least add their hostnames [21:31:35] apergos: I see both your 198px kitten and my 60-80 kittens. [21:31:42] nice [21:31:47] so [21:31:49] instead of using lvs [21:31:52] we can also have squid load balance [21:31:54] so that checks the ams->tampa->swift chain [21:32:09] mark: what'd be the benefit? 
[21:32:15] i'm not really sure [21:32:19] some statistics from squid [21:32:22] about cache peer health [21:32:29] some better failover actually [21:32:29] I like the idea of swift being a big black box [21:32:41] because squid monitors individual proxies [21:32:48] it temporarily disables one if one is misbehaving [21:32:50] New patchset: Asher; "making db1021 a ganglia aggregator in place of 1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2323 [21:32:57] hmmm... [21:33:03] maplebed: reminder...include POSTs in stats :) [21:33:14] but we need lvs for mediawiki anyway [21:33:20] at least, I assume mediawiki is talking to lvs [21:33:27] yes it is. [21:33:42] everything views swift as "http://ms-fe.pmtpa.wmnet/" [21:33:49] yeah let's keep it like that for now [21:34:40] when we install varnish in eqiad [21:34:47] we can have varnish talk directly to swift as well [21:34:51] and to ms5 via the upload squids [21:35:56] New patchset: Bhartshorne; "adding POST-specific stats to swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2324 [21:35:58] AaronSchulz: ^^^ [21:37:35] New patchset: Asher; "making db1021 a ganglia aggregator in place of 1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2323 [21:37:49] arrrg...where is a link to just view the whole new file? [21:37:59] "(Download)" wants to save it to some local file [21:38:10] !log initial deploy of swift to serve thumbnails is complete [21:38:11] Logged the message, Master [21:38:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2323 [21:38:22] \o/ [21:38:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2323 [21:38:27] AaronSchulz: there's a little tiny text 'gitweb' [21:38:27] yay [21:38:59] * AaronSchulz waits for it too load [21:39:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2324 [21:39:04] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2324 [21:39:41] ruh roh. [21:39:55] heh [21:39:57] binasher: you or I may have temporarily broken puppet/git [21:40:10] nevermind. it's better now. [21:40:15] binasher: can I merge your change? [21:40:15] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=files/swift/SwiftProxyLogtailer.py;h=d89780b43be9f5a2e1fa844577c76f01a518c1c1;hb=d89780b43be9f5a2e1fa844577c76f01a518c1c1 [21:40:17] formey is slow [21:40:22] no syntax highlighting :/ [21:40:24] maplebed: sure [21:40:35] merged [21:40:46] ok, congrats ben [21:40:49] i'm going off now [21:40:54] night [21:40:56] thanks a ton for your help mark [21:41:01] you're welcome [21:41:04] sorry I missed it earlier [21:41:28] * maplebed goes to write email and post on the village pump [21:41:31] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds [21:41:36] have fun ;) [21:41:38] see you tomorrow [21:41:54] AaronSchulz: two minutes and you should see POST added to the ganglia swift view. [21:42:17] * AaronSchulz won't expect much for now [21:42:35] maplebed: oh, did you have time to set up a second swift user for MW? [21:42:47] no, I didn't do that yet. [21:42:51] sorry. 
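As described above, the ganglia request stats come from ganglia-logtailer following the swift proxy lines in /var/log/syslog and parsing them, and r/2324 adds POST-specific counters the same way. A minimal sketch of that approach; the regex and field layout here are assumptions for illustration only, the real parsing lives in files/swift/SwiftProxyLogtailer.py in operations/puppet.

    import re
    from collections import Counter

    # Count swift proxy requests per HTTP method from syslog lines, in the
    # spirit of ganglia-logtailer. The pattern below is a guess at the log
    # shape, not the actual SwiftProxyLogtailer.py rules.
    LINE_RE = re.compile(r"proxy-server.*?\b(GET|HEAD|PUT|POST|DELETE)\b")

    def count_methods(lines):
        counts = Counter()
        for line in lines:
            m = LINE_RE.search(line)
            if m:
                counts[m.group(1)] += 1
        return counts

    with open("/var/log/syslog") as logfile:
        print(count_methods(logfile))
    # A method with no matching lines yields no sample, so e.g. a POST graph
    # simply won't appear until POSTs actually hit swift.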
[21:43:00] meh :) [21:52:59] sleep for me [21:53:30] but first I wanna see the swift stats in ganglia :-) [21:54:25] that's some impressive cpu usage [21:54:25] Well, they were supposd to be ready 10 minutes ago ;) [21:54:36] if ms1 is really doing swift [21:54:40] http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&h=ms1.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [21:54:41] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=v&vn=swift [21:54:50] and http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=descending&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:55:24] what wil happen when more traffic is moved over, I wonder [21:56:12] I also really wonder what the spikes are [21:56:27] not a lot - what you're seeing here is a lot of background noise. You can look at the day view for what more traffic looks like [21:57:45] that's even odder [21:58:35] the query response time graphs are very cool :-) [21:59:01] New patchset: Asher; "removing group for old mobile ruby gateway" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2325 [21:59:02] ok, have a great day! [21:59:25] \o/ Glad you like them. [22:03:10] AaronSchulz: IIRC if the parser doesn't find any log lines that match (eg POST), it simply doesn't report the statistic. so the fact that there's no POST graph is, I think, because there are no POSTs hitting swift (yet?). [22:03:21] New review: Demon; "Less ruby \o/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2325 [22:10:41] !log pulled db38 from enwiki, running normal "alter table revision add rev_sha1" and on db1043, the pt-online-schema-change equiv (with --chunk-size=1000, --sleep=0.1) to compare timing [22:10:42] Logged the message, Master [22:13:50] New patchset: Declerambaul; "added .DS_STORE to .gitignore. Added meaningful words to the readme." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2326 [22:23:52] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2326 [22:23:52] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2326 [22:28:42] New patchset: Asher; "fix reporting query rate and slave status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2327 [22:31:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2327 [22:32:16] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2327 [22:33:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2325 [22:33:44] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2327 [22:33:44] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2325 [22:33:51] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 1562 seconds [23:01:45] New patchset: Asher; "backticks needed, $() fails in nrpe subshell" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2328 [23:02:16] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2328 [23:02:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2328 [23:04:31] binasher: ^^^ totally lame. [23:04:41] makes me grumpy. [23:05:28] $() fails??!? [23:06:04] <^demon> maplebed's got a case of the mondays. 
[23:06:13] * maplebed spanks ^demon with an impenetrable Nintendo Wii [23:06:28] wow, /slap has gotten inventive. [23:06:29] wow it looks like i missed out on some fun chat :) [23:07:03] * maplebed was just responding to [3:06 PM] <^demon> maplebed's got a case of the mondays. [23:07:07] it might have worked if i escaped the $ but ghetto backticks work there for sure [23:07:16] <^demon> Look, maplebed found a use for my Wii that's been collecting dust! [23:07:33] $ has special meaning there for passing nagios macros into checks [23:07:56] my irc client has no /spank! [23:08:10] "ghetto backticks"? lol [23:09:24] <^demon> AaronSchulz: As opposed to backtticks that reside in affluent suburbs and spend their saturdays at the country club. [23:09:25] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:25] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 3696 seconds [23:10:11] thats more like it! [23:13:25] binasher: you should have said "ghetto backticks work there for shizzle" [23:15:50] * ^demon whacks AaronSchulz [23:15:52] <^demon> No. [23:16:20] ^demon: reminds me of one of those HipHop PHP articles [23:16:33] you meant "fo shizzle" [23:16:40] yes, I do, yes I do [23:18:46] <^demon> Nobody ever *means* that. [23:18:48] <^demon> You can't. [23:22:25] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 58445 seconds [23:30:05] PROBLEM - Full LVS Snapshot on db1033 is CRITICAL: Connection refused by host [23:30:05] PROBLEM - MySQL disk space on db1033 is CRITICAL: Connection refused by host [23:32:05] PROBLEM - DPKG on storage3 is CRITICAL: Connection refused by host [23:33:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.020 seconds [23:37:25] PROBLEM - MySQL disk space on storage3 is CRITICAL: Connection refused by host [23:38:15] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 59396 seconds [23:40:34] New patchset: Asher; "lower mysql stat collect threshold" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2329 [23:41:35] RECOVERY - MySQL disk space on db1033 is OK: DISK OK [23:41:35] RECOVERY - Full LVS Snapshot on db1033 is OK: OK no full LVM snapshot volumes [23:41:45] PROBLEM - Disk space on es2 is CRITICAL: Connection refused by host [23:41:45] RECOVERY - MySQL Slave Running on db1021 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:41:55] PROBLEM - MySQL disk space on es2 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - DPKG on es1003 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - RAID on es1003 is CRITICAL: Connection refused by host [23:42:47] es pages due to nrpe failing to restart after puppet update [23:43:09] rgr [23:43:35] RECOVERY - DPKG on storage3 is OK: All packages OK [23:43:48] dang /etc/dsh/group/ext-stores [23:45:37] puppet restarted it correctly on 6 / 8 of them.. 
ok [23:45:45] PROBLEM - Disk space on db1027 is CRITICAL: Connection refused by host [23:46:35] PROBLEM - RAID on db1027 is CRITICAL: Connection refused by host [23:48:13] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2329 [23:48:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2329 [23:49:05] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [23:51:15] RECOVERY - MySQL Slave Running on db1005 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:53:26] RECOVERY - Disk space on es2 is OK: DISK OK [23:53:35] RECOVERY - MySQL disk space on es2 is OK: DISK OK [23:53:35] RECOVERY - DPKG on es1003 is OK: All packages OK [23:54:05] RECOVERY - RAID on es1003 is OK: OK: State is Optimal, checked 2 logical device(s) [23:57:15] RECOVERY - Disk space on db1027 is OK: DISK OK [23:58:05] RECOVERY - RAID on db1027 is OK: OK: State is Optimal, checked 2 logical device(s)