[00:06:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:07:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:12:53] !log Running namespaceDupes.php --fix via foreachwiki in screen session on terbium [00:13:01] Logged the message, Master [00:23:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [00:25:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [00:31:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:32:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [00:33:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [00:35:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [00:38:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [00:40:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [00:43:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [00:45:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [00:48:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [00:50:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [00:53:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [00:55:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [00:57:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:18] New review: Tim Starling; "Fenari doesn't appear to have a separate partition for /tmp, so this wouldn't help there. 
" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57774 [00:59:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [01:08:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [01:09:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [01:23:23] New review: Tim Starling; "Looks good, thanks for that. Sorry about the delay in reviewing this." [operations/debs/lucene-search-2] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/56354 [01:23:28] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/56354 [01:39:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [01:40:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [01:53:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 181 seconds [01:55:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [02:05:30] !log LocalisationUpdate completed (1.22wmf2) at Mon Apr 22 02:05:30 UTC 2013 [02:05:38] Logged the message, Master [02:08:56] !log LocalisationUpdate completed (1.22wmf1) at Mon Apr 22 02:08:56 UTC 2013 [02:09:03] Logged the message, Master [02:09:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 217 seconds [02:14:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 22 02:14:58 UTC 2013 [02:15:05] Logged the message, Master [02:15:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 30 seconds [02:23:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [02:26:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [02:29:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [02:30:42] RECOVERY - MySQL Slave Delay 
on db1025 is OK: OK replication delay 30 seconds [02:33:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 210 seconds [02:36:22] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [02:36:23] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [02:38:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [02:41:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds [02:48:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [02:50:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds [02:53:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 192 seconds [02:58:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [02:59:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds [03:13:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [03:15:39] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [03:27:09] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:38:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [03:43:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [03:44:40] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [03:53:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [03:58:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [04:01:39] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [04:14:37] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [04:16:07] PROBLEM - 
Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [04:18:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [04:23:37] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [04:24:37] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [04:28:37] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [04:30:37] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds [04:33:37] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [04:35:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 30 seconds [04:38:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [04:40:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [04:49:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 208 seconds [04:50:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [04:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [05:00:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:08:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [05:10:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [05:13:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [05:15:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [05:28:15] 
PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [05:29:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [05:33:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [05:35:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [05:49:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [05:50:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [05:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [05:59:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds [05:59:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [06:08:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [06:08:54] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [06:08:54] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:08:54] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [06:10:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [06:18:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [06:20:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [06:28:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [06:30:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [06:33:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [06:35:09] RECOVERY - MySQL Slave Delay on 
db1025 is OK: OK replication delay 4 seconds [07:10:31] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:55:23] lo [07:57:04] New patchset: Legoktm; "Convert logbot to use ircbot.SingleServerIRCBot for auto-reconnection." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60240 [08:01:18] New patchset: Legoktm; "Convert logbot to use ircbot.SingleServerIRCBot for auto-reconnection." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60240 [08:10:09] New review: MZMcBride; "Looks good to me." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59371 [08:16:44] :D [08:17:18] Susan: If it looks good, why did't you +1? [08:17:20] yet another repo I wasn't aware of [08:44:36] New review: Hashar; "What about the production entries?" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/60231 [10:13:22] New review: Faidon; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/60187 [11:14:24] New patchset: Mark Bergsma; "Rename dysprosium's backend to -sda and -sdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60253 [11:15:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60253 [12:15:08] New patchset: Mark Bergsma; "Double the backend weight on dysprosium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60257 [12:16:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60257 [12:21:03] New patchset: Mark Bergsma; "Make weights for esams fe -> eqiad be and eqiad fr -> eqiad be equal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60258 [12:21:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60258 [12:36:49] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [12:36:49] 
PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [12:57:22] New patchset: Mark Bergsma; "Set dysprosium backend load at 4x" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60260 [12:57:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60260 [13:03:39] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:03:59] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [13:04:09] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [13:04:29] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:04:30] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [13:04:30] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [13:04:30] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [13:04:30] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [13:04:39] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [13:05:20] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [13:05:30] Hi [13:05:38] Is there a way of doing a status check [13:05:40] ? [13:05:46] I'm getting some unexpected 503 erorrs [13:05:50] mark: I think that's you [13:06:05] http://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Goody_Two-Shoes_%281881%29.djvu/page81-1410px-Goody_Two-Shoes_%281881%29.djvu.jpg [13:06:07] 503 [13:06:09] swift req/s have dived [13:06:15] dived? 
[13:06:20] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies [13:06:55] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+backend+storage [13:07:08] backends are melting [13:07:26] io wait through the roof [13:07:29] New patchset: Mark Bergsma; "Revert "Double the backend weight on dysprosium"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60261 [13:07:40] Good afternoon? [13:07:49] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60261 [13:07:52] I'm based in the UK [13:07:58] New patchset: Mark Bergsma; "Revert "Set dysprosium backend load at 4x"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60262 [13:08:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60262 [13:09:20] apergos: do you recall off the top of your head which ones are c2100s/r720xd + h310/r720xd + h710? [13:09:35] Qcoder00: it's better to ask in #wikimedia-tech where there isn't operational stuff going on [13:09:44] I don't know which ones now have the new ssds and controllers, no [13:10:00] ms be 2, 4 and 1? maybe are the three left that are c2100s [13:10:04] one of them shoul dcome out today [13:10:19] ok, I'll have a look to know for sure [13:10:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:39] that's probaby best [13:10:43] *probably [13:11:11] there's a big disparity between the load of some of them vs. 
the others, I'd like to make sure that's the h310 and not some other thing [13:11:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [13:12:10] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.297 second response time [13:12:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [13:12:21] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [13:12:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [13:12:26] ms-be2 & ms-be9 are depooled, ms-be12 is 66%, right? [13:12:30] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [13:12:31] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67996 bytes in 0.152 second response time [13:12:40] New patchset: Odder; "(bug 44308) Add new namespaces and aliases on zhwikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60263 [13:12:50] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.043 second response time [13:13:30] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [13:14:43] apergos: where's ms-be1? [13:15:20] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [13:15:32] what do you mean, where is it? [13:15:49] down for 11 days, still pooled... [13:15:58] wtf [13:16:18] well it's not scheduled to be down. how did I not see a notice about it? 
[13:16:27] ms-be9 is down for 10 days, depooled [13:16:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:35] yes, ms-be9 is expected. [13:16:42] ms-be2 will be in that state later today. [13:16:52] 10 days? [13:16:52] ms-be1 is not due for that [13:16:52] yes. more than 10 [13:17:16] why that long? [13:17:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.167 second response time [13:17:21] why is this? because dell wants there servers. yesterday. and the last four can come out without them waiting for us to put the new ones back in [13:17:31] because if they wait it will take twice as long. [13:17:36] *their servers [13:17:42] gah no good typing today [13:18:13] but I would like to know what happened to ms-be1, I will ask chris if he knows anything [13:18:20] we have 25% of the cluster depooled or down, plus another server at 66% [13:18:32] not very reasonable load either [13:19:13] the one at 66 can and should go to 100 before anything else happens [13:19:37] chris? [13:19:41] but we only have one more server left if that's the case (if ms-be1 has been gone that long, its partitions are already replicated elsehwere now) [13:19:46] did you attempt to powercycle and failed? [13:19:47] or should I? [13:19:52] no, I haven't done anything yet [13:20:03] I don't want to bring it back up yet til I find out what's going on [13:20:11] since it's scheduled to be replaced [13:20:35] I would just as soon let it stay down and put the new one in, and get th old one out of here [13:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [13:24:01] can we restore the cluster to a healthy state asap? 
:) [13:25:26] unfortunately nothing can be done on this cluster asap, it's all tediously slow. I can ask for a replacement to be racked for ms-be2 or 1 (since 1 is down) but it will take several days regardless for traffic to migrate to it [13:26:22] I bet ms-be11 and 12 have the new controller and ssds [13:26:40] why would you not have a replacement box racked when it's down anyway? [13:26:59] because steve might be doing other things [13:27:10] oh for christ sake [13:27:28] what? [13:27:50] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:27:56] I don't know if there is something ready to go for ms-be2 right now or not, I haven't been coordinating it [13:28:03] that's whay I would have asked chris [13:28:32] you're the one doing this replacement isn't it? [13:28:51] well tbh chris is the one getting the heat from dell [13:29:05] not steve, not me [13:30:05] so, we currently have [13:31:08] 3 boxes depooled or down, 6 boxes with a H310 which can't take much load, 1 being a C2100 and 2 with H710 which are the only ones sane, but one of them is at 66% [13:32:20] I suggest: a) increase 66% to 100% now, b) replace ms-be1 now and put ms-be1/2/9 with H710s back in the cluster [13:33:24] are you doing (a) or should I? [13:33:42] and let's ping steve when he joins irc [13:34:06] we should ping chris when he joins, he is the one coordinating the racking and replacement [13:35:25] I'm not sure if you realize, we just had swift melt a few moments ago for a slight increase of traffic [13:36:00] if you increase from 66 to 100 now instead of waiting for the new box to be racked later today, we are not going to gain much. 
it takes several days for that data to move around [13:37:11] no, I was not watching here, and irritatingly I did not hear my phone [13:37:43] New patchset: Krinkle; "Pester IRC as well when a draft is published" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50044 [13:37:53] I'm not sure what will happen if a box fails for a random reason now [13:38:15] let's make something to fix this asap [13:38:27] s/make/do/ even [13:39:31] and boy, it's just amazing how crappy the h310s are [13:39:59] well, the quickest (which means however that dell will wait longer for their servers) is to bring ms-be1 back on line as it is. it will be somewhat out of sync, but it will have most of the data [13:40:16] even c2100s are so much better [13:40:17] there's literally nothing else "quick" that can be done [13:40:42] yeah, I am seeing the graphs for the boxes with the h310s [13:40:44] astoundingly bad [13:44:32] I don't understand why you decided to remove c2100s without replacing them [13:44:50] i.e. why we have ms-be9 down for 10 days [13:44:58] (or more as you said) [13:46:00] because in the past this cluster had a lot of head room, and dell has been getting very pushy about getting their gear back, and putting new servers back in as the old one comes out makes everything take twice as long [13:47:23] it's definitely my bad that I didn't see ms-be1 out, though. still don't know how that happened. [13:47:36] hmm [13:47:37] ipmitool> chassis power status [13:47:37] Error: Unable to establish LAN session [13:47:37] Unable to get Chassis Power Status [13:47:47] that's after [13:47:49] root@sockpuppet:~# ipmitool -U root -H ms-be1 shell [13:47:55] ganglia is all red with the h310s having an increased load, not sure where you see the headroom [13:48:01] ms-be1.mgmt, not ms-be1 [13:48:06] grrr [13:48:28] apparently there isn't any now [13:48:44] ? 
[13:48:45] time passes, spare capacity gets used [13:55:22] RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [14:01:12] apergos: the new 720 (ms-be2 replacement) should already be on the rack. Once Steve gets in to the DC he will set it up [14:01:19] ok [14:01:24] I see you are doing the backread [14:01:27] yes [14:01:50] so it looks like the hope that we could pull out the remaining 2 boxes without replacing as we go, is dead in the water [14:02:08] dell will not like it but in fact their h310s and their c2100s are the reason we are in this fix [14:02:34] you might remind them of the h310s :-P [14:03:33] okay...not worried about Dell [14:03:48] in the meantime however, ms-be2 is ready to be powered off [14:03:56] the only bright spot of new in the whole thing [14:04:00] *news [14:04:46] ok...good [14:09:25] New patchset: Mark Bergsma; "Set dysprosium backend weight to 6x" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60273 [14:10:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60273 [14:10:29] i'm gonna push the limit again [14:10:48] You, could someone check https://gerrit.wikimedia.org/r/#/c/59969/2 out pretty please? [14:10:57] s/You/Yo/ [14:12:24] Coren: I saw it, I don't have any objections -- whitespace is weird with 4 spaces AND tabs though [14:12:48] paravoid: Ah, right, my default vim settings for C. [14:13:05] Coren: but you should probably get a review from someone who knows a bit more about tool labs [14:13:07] paravoid: That shouldn't be hard to fix. 
[14:13:09] ryan for example :) [14:16:13] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [14:18:34] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:34] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:39] argh [14:19:23] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:47] New patchset: Mark Bergsma; "Set dysprosium backend load at 3x" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60275 [14:19:58] perhaps today is a bad day to work :P [14:20:06] heh [14:20:13] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [14:20:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60275 [14:20:23] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [14:20:24] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.049 second response time [14:20:48] this isn't the first time you've done this test though, is it? 
[14:21:09] I think ms-be1/2 11/10 days ago must have been the tipping point [14:22:32] it is [14:22:41] before was frontend [14:22:42] now it's backend [14:22:48] so any misses hit swift directly [14:23:02] oh [14:26:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [14:43:07] New patchset: Reedy; "Debugging for EducationProgram" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60279 [14:43:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:44:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60279 [14:44:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [14:44:38] !log reedy synchronized wmf-config/InitialiseSettings.php [14:44:46] Logged the message, Master [14:51:53] anyone could look into change 50044? [14:52:02] been lurking there for a time now [14:53:05] ori-l: ping [14:53:51] it looks fine, but I can't actually merge it, since I'm not ops [14:53:55] ah [14:54:25] I'll poke someone later on today (PST) if nobody else picks it up [14:54:25] always difficult to know whom exactly are ops [14:55:05] ... [14:55:17] New patchset: Cmjohnson; "Updating dhcpd files for ms-be9" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60280 [14:55:44] https://meta.wikimedia.org/wiki/Sysadmins [14:55:48] Slightly out of date though [14:56:02] ah [14:56:05] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60280 [14:56:06] New review: MZMcBride; "What happened here? This changeset seems to have been approved, but never merged." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8438 [14:57:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:49] where does it show that it was approved? [14:58:04] apergos: "Patch Set 1: Looks good to me, approved" [14:58:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [14:58:36] unless tim typed that by hand just to fnuck with everyone after a year [14:58:58] dunno, but I do see that it needs [14:59:00] well everything [14:59:11] verified, code review *and* rebased [14:59:42] I don't know what it looked like a year ago, but surely if it had been +2 then, it would show up in the chart [15:00:15] although I guess it was verified and that doesn't show up either [15:00:30] apergos: per comment, tim approved it back then [15:00:46] New review: Reedy; "Certainly won't merge cleanly now." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8438 [15:00:46] yeah I see his comment [15:01:59] I wonder if demon oughta look at that (but what he's going to be able to say after a year of gerrit hiccups, I dunno) [15:01:59] New review: Jeremyb; "Not sure why it's not merged yet (and I barely even remember the history...)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/8438 [15:02:56] apergos: or we (you) could just rebase/review/verify it and let it go to the mist of history [15:02:58] ツ [15:03:41] I could rebase it and let it be verified but I would not review it right this sec [15:04:17] apergos: so, what's the update for swift now? [15:04:28] ms-be1 is up [15:04:35] apergos: if you have time, could you review https://gerrit.wikimedia.org/r/#/c/50044/ ? [15:04:40] I saw that, you changed your mind it seems :) [15:04:47] Reedy: AzaToth: manifests/admins.pp is better usually but also not necessarily up to date [15:04:52] huh? 
[15:05:10] no, you wanted (justifiably so) something quick and that's the only thing that could be done quickly [15:05:40] ms-be9 and ms-be2 will be ready to start going back in tonight or early tomorrow but that is a process which will take 2 weeks [15:05:54] anyways that is the status, [15:06:05] jeremyb_: I would assume the admins themselves would have a correct list of all admins ツ [15:06:43] once those are back in and of course ms-be12 to 100 then it should be stable for me to pull out another c2100 [15:06:50] that's the plan [15:06:57] AzaToth: well there's the private puppet repo's root authorized_keys [15:07:23] true [15:07:28] AzaToth: and manifests/admins.pp and the list of people that have the root password [15:07:38] there's not really any other lists i think [15:07:43] you have root passwords? [15:07:52] it hasn't been too long since the password was changed [15:08:07] AzaToth: i assume it's just for use on serial console... [15:08:15] oh [15:09:15] and I assume it's a 128+ character random string? ツ [15:09:41] taking a week to input manually [15:10:31] i think not. i think it's something that could be typed in less than a minute. (otherwise what's the point?) [15:10:50] (really just speculation...) [15:11:42] Anybody aware of issues with loading stuff from bits.wikimedia.org on specific user pages on Commons in Firefox? I can reproduce, and it's weird. [15:12:13] * andre__ probably better off writing an email to ops@ [15:12:58] * apergos wonders who is on rt now anyways [15:15:21] anyway, a review of 50044 would be perfect [15:27:42] AzaToth: errrr, you're not looking for a wikimedia root. and not asking in the right place [15:27:49] AzaToth: try #mediawiki-i18n [15:36:13] mark, yt? [15:39:04] yes [15:40:18] <^demon|busy> jeremyb_: I want https://gerrit.wikimedia.org/r/#/c/8120/ off my review list. Can I abandon? [15:41:22] ugh... i need to not do much wikimedia stuff for a couple days. need to get other stuff done before puppet camp! 
[15:41:40] oh, that one [15:42:35] ^demon|busy: i'll look at it tomorrow night [15:42:40] <^demon|busy> Okie dokie. [15:44:00] mark, I'd like to continue with new caching rollout - when you'll be available? [15:44:16] isn't that scheduled for tomorrow? [15:45:53] so I'm asking if you'll be there, or we need to find another time:) [15:46:04] I can be there, although the window is a bit late for me ;) [15:47:11] mark, we can do it outside of PST business houurs [15:47:22] that works too [15:47:24] e.g. now:P [15:47:28] fine with me [15:48:30] rolling out everywhere? [15:48:34] yes [15:48:47] yay [15:48:49] now would be excellent [15:48:53] then maybe I can go to the datacenter tomorrow :P [15:49:01] only this rollout was in the way [15:49:26] New patchset: Jgreen; "drush consistent-user and lockdown scheme" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60285 [15:50:07] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60285 [15:50:15] MaxSem: ready in 5 mins [15:50:42] New patchset: MaxSem; "Enable $wgMFVaryResources on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60286 [15:50:53] let's start with this ^^^ it's about 50% of mobile traffic [15:51:50] hehe [15:51:50] ok [15:52:30] if it works, I'll flip the rest shortly [15:52:44] i'm ready [15:54:28] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60286 [15:54:33] New patchset: Ori.livneh; "Create self-standing IPython Notebook Puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [15:54:33] New patchset: Ori.livneh; "Use Upstart rather than supervisor to manage IPython" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60094 [15:54:44] paravoid: you put me up to it, it's all your fault :P [15:54:52] :) [15:55:02] I wasn't sure if it's a good idea [15:55:07] I'm happy to make the call either way tbh [15:55:17] for you to make 
the call I mean [15:55:59]