[00:08:38] New patchset: Helder.wiki; "(bug 39652) Add "autoreviewer" to $wgRestrictionLevels on ptwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [00:10:48] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:17:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:48] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:29:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.685 seconds [00:47:51] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [00:57:54] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [01:03:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:17:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.151 seconds [01:40:39] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 244 seconds [01:41:33] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 296 seconds [01:45:36] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:45:36] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:48:00] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 683s [01:50:21] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [01:51:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [01:51:47] New review: Andrew Bogott; "I've removed the gerrit::common class and rearranged gerrit.pp generally. Most of this had the aim ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13484 [01:51:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [01:59:06] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 16 seconds [02:00:00] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 2s [02:04:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds [03:11:51] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Aug 27 03:11:37 UTC 2012 [03:28:07] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [05:48:32] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [05:58:35] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:13:36] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:13:36] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [06:13:37] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:13:37] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:38] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:13:38] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:39] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:13:39] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [07:22:30] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:00:44] New patchset: Hashar; "beta: wmgArticleFeedbackLotteryOdds => 0" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16036 [08:10:55] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13427 [09:00:06] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [09:00:06] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [09:23:12] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [09:33:00] New patchset: Mark Bergsma; "Decommission bayes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21556 [09:33:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21556 [09:33:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21556 [09:35:19] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [09:36:55] New patchset: Matthias Mullie; "(bug 36722) Article Feedback - Supporting feedback on help pages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17503 [09:38:12] !log Decommissioning bayes, shut it down [09:38:23] Logged the message, Master [09:39:47] New patchset: Matthias Mullie; "Add new AFT permission levels" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21141 [09:40:16] PROBLEM - Host bayes is DOWN: CRITICAL - Host Unreachable (208.80.152.168) [09:50:07] New patchset: Mark Bergsma; "Remove admins::dctech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21558 [09:50:53] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/21558 [09:52:33] New patchset: Mark Bergsma; "Remove admins::dctech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21558 [09:53:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21558 [09:57:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21558 [10:05:39] New patchset: Mark Bergsma; "Add NetApp monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21560 [10:06:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21560 [10:06:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21560 [10:12:13] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:16:41] mark: hi! if you're on a gerrit tear, could you look at https://gerrit.wikimedia.org/r/#/c/21483/ ? (not urgent) [10:20:38] added one comment [10:21:13] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:34:40] mark: huh, i could have sworn it wouldn't work otherwise [10:34:58] something about upstart closing stdin, perhaps? or not managing a pipeline correctly [10:35:36] but a quick test on a dev machine seems to work fine without the subshell. hrm. [10:36:36] well that's 2 subshells ;) [10:37:44] have you seen inception? :P [10:37:59] i have [10:38:11] that's why this is freaking me out ;-p [10:41:01] well, i've gone through the whole initctl stop / (re-)start cycle a few times and everything seems to behave as it should without the extra shell. so good catch -- i don't know what i was thinking. i'll update the patch. [10:42:47] also does it need to run as root? [10:43:03] oh nm [10:43:05] www-data [10:43:11] New patchset: Dereckson; "(bug 39671) Every logged-in user can now edit se.wikimedia.org." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21561 [10:47:33] New patchset: Ori.livneh; "Fix VCL bug; use varnishncsa instead of varnishlog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21483 [10:47:39] ^^ mark [10:48:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21483 [10:48:43] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [10:49:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21483 [10:50:05] w00t! thanks :) [10:54:29] New patchset: Dereckson; "(bug 39671) Every logged-in user can now edit se.wikimedia.org." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21561 [10:58:46] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [11:36:09] New patchset: Matthias Mullie; "remove config var that's no longer being used" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21570 [11:43:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:46] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:46:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:47:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.054 seconds [12:21:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [13:05:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:31] New review: Demon; "We already have gerrit in the apt repo (it's installed that way on manganese). I'll test our your fi..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13484 [13:20:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.048 seconds [13:45:59] New patchset: Hashar; "(bug 38946) hebrew fonts for SVG rendering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21588 [13:46:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21588 [13:49:15] paravoid, apergos: any chance you could do the rest of the back ends before we get to 9 this morning? [13:49:22] (9am pacific) [13:49:30] I didn't realize there were more to be done [13:49:42] I'm actually looking at the multiple heads issue though [13:49:45] I didn't forget to mail you on friday, did I? [13:49:53] well at the any heads issue tbh [13:50:45] if all the backends are upgraded before 9, we have the opportunity to try the originals switch then. [13:50:46] ah, I see it, no but [13:51:01] maplebed: a bit early for you isn't it? :) [13:51:03] I think you an aaron can fix the multiple heads issue after I'm gone. [13:51:04] so we shouldn't do the originals switch til we have the head request issue sorted out [13:51:16] but we can upgrade the backends nevertheless [13:51:21] apergos: the multiple heads issue will mostly go away after we stop writing NFS [13:51:21] yes we can [13:51:30] how do we know that? [13:51:31] at least half of them come from the multiple backend consistency check. [13:51:42] maplebed: btw, python-swauth is needed only on frontends, right? [13:51:49] (it's missing from puppet too, I'll add it) [13:51:52] aaron spent a lot of time looking at it on friday, and is continuing to do so. [13:51:54] paravoid: yes. [13:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:35] so that means there are still half that are unaccounted for? [13:52:41] I just wanted to check in early enough that I can go get my dog exercise and get into the office before 9. [13:53:00] apergos: walking through the code we were able to count up to 6; I don't remember whether aaron found the 7th. [13:53:13] ok. I read the scrollabck but didn't see that discussion [13:53:22] sorry, it wasnt' in IRC. 
[13:53:25] ah [13:53:35] oh irl then? [13:53:42] yes. [13:53:48] ok guess you can't send the logs [13:53:53] but really, the heads thing is not gating either backends or originals; [13:53:57] it's been like that since forever. [13:54:15] I think that's why I didn't mail about that conversation specifically [13:54:27] aaron's going to continue looking at that code this week too. [13:54:41] it's orthogonal to the originals switch [13:54:44] yes it is gating originals. mark asked us to look at this in his email. [13:55:09] the only real problem that it's creating right now is hurting debugging [13:55:34] too much cruft in logs/tcpdumps [13:55:39] well, besides a real performance issue [13:55:45] and it was a good question, which we tracked down sufficiently to understand that it's not an effect of our current switch [13:55:52] right. [13:56:11] the only problem is that if the switch goes south again we won't be able to debug it easily [13:56:12] anyway, if I'm going to get into the office by 9 I need to stop chatting about it. [13:56:28] the point wasn't that it's somehow related to swift code, it's that mw does this, it's inefficient, it should be fixed because it's a real performance hit [13:56:29] in any case, backends upgrades are also orthogonal to this conversation [13:56:33] so let's just do that now [13:56:56] that's fine [13:56:56] +1 paravoid. I'll be back by 9 [13:56:58] Change abandoned: Hashar; "Moved to puppet" [operations/debs/wikimedia-job-runner] (master) - https://gerrit.wikimedia.org/r/11610 [13:57:01] we have a 2hr window then. [13:58:33] maplebed: I'll also do upgrades & reboots to ms-be* if you don't disagree. [13:59:14] paravoid: I'd hold off on reboots but +1 to upgrades [13:59:19] why? [14:01:06] I'm a little nervous about rebooting the backends because I fear they will flip out on boot due to their disks. It's mostly fear, not backed by a know bad state in the systems, but if one doesn't come back then we need to stop upgrading and deal with it before continuing. [14:01:23] sigh [14:01:29] okay [14:01:29] They certainly do need to be rebootable, [14:01:38] and +1 to reboots at some point, [14:01:48] I just wanted to separate that problem (if it even comes up) from the process of upgrading. [14:02:11] they're not related and I didn't want to throw the upgrade process off on an unecessary tangent. [14:02:22] yeah, fair enouigh [14:02:32] ok, really afk now. [14:03:27] apergos: I'm going to start with the rest of the upgrades, ok with that? [14:03:32] sure [14:03:55] are you applying changes to confs and upgrading packages manually? [14:04:06] yes [14:04:42] ok [14:04:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.031 seconds [14:08:08] New review: Demon; "manifests/role/gerrit.pp seems to be missing from PS25?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/13484 [14:09:20] New review: Andrew Bogott; "That is a reasonable concern." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13484 [14:14:37] wow, the ms-be* upgrades are talking a looooong time [14:14:45] like normal package upgrades [14:15:02] wonder why [14:20:07] i/o load probably [14:22:06] apergos: do you know how ms-be10 is out of rotation? [14:22:14] with what mechanism it was disabled? [14:22:27] no. I just have his email, same as you [14:22:34] lemme dig around [14:23:28] hm, ring builder probably [14:25:17] yeah [14:27:55] ms-be10 is precise... [14:28:04] we have a mixture? 
[14:28:28] hence the 1.4.8 [14:28:29] yes. [14:28:32] which is fine [14:29:01] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:29:01] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:29:01] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:29:05] what do you mean, hence 1.4.8? [14:29:10] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:29:10] PROBLEM - swift-object-server on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:29:10] PROBLEM - swift-container-server on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:29:55] New patchset: Andrew Bogott; " Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21592 [14:30:20] apergos: all the others had swift 1.4.3; ms-be10 had 1.4.8. [14:30:29] oh, 1.4.3 [14:30:29] ben was wondering in his mail how that happened [14:30:29] ok [14:30:32] RECOVERY - swift-object-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:30:32] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:30:32] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:30:40] RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:30:40] RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:30:40] RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:30:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21592 [14:30:47] !log upgraded swift on ms-be10; (still depooled, no effect) [14:30:57] Logged the message, Master [14:36:26] apergos: could you have a look at the graphs for possible anomalies? [14:36:37] Change abandoned: Andrew Bogott; "oops, wrong ID" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21592 [14:37:08] I amlooking [14:37:19] New patchset: Andrew Bogott; " Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21593 [14:37:21] had them open since you started [14:38:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21593 [14:38:24] several of them are a little weird but I'm having trouble figuring out why that is [14:39:31] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+backend+storage look at the first few of these if you like to see what I mean [14:39:41] Change abandoned: Andrew Bogott; "Apparently I'm going to keep making this same mistake over and over all day." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/21593 [14:39:49] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [14:40:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:27] !log upgraded swift on ms-be4 [14:40:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [14:40:37] Logged the message, Master [14:41:21] sinc the only one you had done was ms-be10 which wasn't serving anything... [14:42:18] I've done ms-be4 too [14:42:24] gah those graphs are completely unreadable [14:42:30] too many colors [14:43:21] hello ops :) [14:43:23] I am back around [14:43:24] ! [14:43:27] yeah the dip is from before that though [14:43:42] anyways I would say to carry on fornow [14:45:17] hi hashar welcome back [14:45:29] hashar: heya, had a good time? [14:47:51] rain / sun / rain / sun [14:48:00] and close to no internet connection :-) [14:48:05] so I feel relaxed [14:49:48] very nice [14:49:53] hashar: I'll be spending all next week away from the internet, I am looking forward to it [14:50:11] hashar: I have a question for you about beta labs, can I send you an email? [14:50:49] chrismcmahon: go ahead go ahead :) [14:51:02] New patchset: Alex Monk; "(bug 39306) Add a flood group to itwiktionary." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19230 [14:51:03] ok, be a minute or two... [14:51:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.324 seconds [14:51:26] !log upgraded swift on ms-be3 [14:51:38] Logged the message, Master [14:55:11] apt upgrade on 5, 11, 12 [14:55:58] hiyaaa, could someone approve this one real quick? i had waited until this morning to ask so that I could babysit it and make sure its cool [14:55:59] https://gerrit.wikimedia.org/r/#/c/21391/ [14:58:33] hashar: sent, thanks [15:08:27] still looks ok on the graphs [15:20:30] this does not seem to be related to the upgrades but I do notice a lot of [15:20:35] "object-replicator @ERROR: max connections (2) reached -- try again later" in the logs [15:21:07] several per second, do you know anything about those? [15:23:51] no [15:26:20] ok, we'll see what ben has to say [15:26:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:57] gmail imap is unbelievably slow this past few days [15:27:11] odd [15:27:28] and is returning "System Error" every now and then [15:27:32] you'd think they pf all folks would nt have network/server issues [15:27:45] it's a second class citizen unfortunately [15:28:03] imap, ah right [15:28:30] on a related note, ms-be* is also very slow I/O-wise, installing updates for the past 30' or so [15:28:42] that's a looong time [15:29:24] these are on hosts still in the pool right? 
[15:29:42] yes [15:30:16] hope they'lll keep up with the originals traffic [15:30:33] that should be lot less than the thumbs though [15:31:25] by number of requests, notsure about actual # of packets [15:38:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.472 seconds [15:47:37] !log upgraded swift on ms-be11 [15:47:47] Logged the message, Master [15:49:05] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [15:49:21] seems to work so far [15:49:36] apergos: still looking at graphs? [15:49:40] oh yes [15:49:41] and logs [15:49:54] if something goes wacky you'll be the second one to know [15:51:16] :-) [15:54:23] !log upgraded swift on ms-be5 [15:54:32] Logged the message, Master [15:55:37] !log upgraded swift on ms-be12 [15:55:40] all done :) [15:55:43] right on time [15:55:46] sweet [15:55:47] Logged the message, Master [15:56:30] well, except ms-be6 which is still down [15:57:30] them's the breaks [15:59:08] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [16:04:24] New patchset: Hashar; "disallow robots on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602 [16:05:26] New review: Krinkle; "bump." [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/16241 [16:10:43] hi paravoid, apergos [16:10:49] New review: Hashar; "Leslie, Daniel, can you please look at this? That would let us run ruby based test suites on galliu..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16957 [16:10:51] yo [16:11:28] so sadly I don't have backscroll for my time commuting; anything interesting? [16:11:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:13] para void upgraded he backends, they look fine [16:12:17] I noticed a bunch of [16:12:19] ossm. [16:12:34] Aug 27 06:29:58 10.0.6.201 object-replicator @ERROR: max connections (2) reached -- try again later [16:12:34] in the logs [16:12:42] not just now, but consistently [16:12:49] wondering if you know what that's about [16:13:12] .201, eh? [16:13:18] also finally understood why for every thumb request we send a head, sent mail and mark replied, aaron will want to look at that [16:13:40] not just 201 [16:13:42] it's allof them [16:13:50] several a second across all the backends [16:13:53] oh, object replicator! [16:13:59] sorry, missed that part. [16:14:00] yes [16:14:08] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [16:14:09] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [16:14:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:14:09] yes, it's limited to 2 connections at a time. 
[16:14:10] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:10] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:14:11] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [16:14:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [16:14:12] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:14:12] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:15] I asked in swiftstack but it's a bit soon for anyone to be awake [16:14:27] to know if we should be concerned (if that's slowing things down) [16:14:33] brb bacthroom [16:14:43] object replicator is one of the backend processes. [16:14:53] it doesn't directly affect client requests [16:15:38] the more instances of the object replicator you allow the quicker the cluster heals in the event of a failure, and the more IO load it incurs. If the load from the replicator is too high, it will start to affect performance of the object-server (and client requests) [16:16:18] we'll probably want to play with the values for the number of instances of all the backend processes, but for now they're ok. [16:16:40] I think I may have been too conservative in restricting them to only 2 copies of the replicator, but I'm not sure. [16:16:50] I also see a few "bad rsync return code" [16:16:59] here and there [16:17:11] btw, ms-be10 had 1.4.8 because it runs precise [16:17:53] ah. and it has precise because when the disks all flipped out I rebuilt it and that was after we'd made the switch to default-precise. [16:17:58] ::sigh:: [16:18:38] (the eqiad cluster is also precise, btw.) [16:18:41] yeahI noticed the rsync whines, figured those would be next n the list after the object replicator thing [16:18:46] clean up the logs bit by bit [16:19:29] so. ready to try the switch? [16:19:37] (we're now in our window) [16:20:15] * apergos grits teeth unhapily [16:20:21] with two p's even [16:21:14] yay switch! [16:21:26] not that kind LeslieCarr :-P [16:22:06] hehe [16:24:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.669 seconds [16:24:15] paravoid: you have the helm [16:24:22] oh do I? :) [16:24:56] you did so well with squid last time, why not? [16:24:57] :D [16:25:05] besides, I like this role of just watching. [16:25:29] hehe [16:25:52] you know if stuff falls over we will have no solution except revert and wait again, right? [16:26:17] let's hope that if stuff falls over again we're able to gather some new piece of information to tell us why. [16:28:50] okay, I'm fixing the squid conf again [16:29:02] three-way merge (we did the ms5 change in the meantime) [16:29:18] maplebed: btw, what's wrong with ms-be10? lots of XFS errors in dmesg [16:29:39] hardware. it doesn't see any of the spinning disks. maybe the controller? [16:30:05] yeah, most likely [16:30:05] I think you'll see the disks are mounted, but if you ls -l /srv/swift-storage, you just see ?s for things that stat should fill in. [16:30:12] ugh [16:32:06] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:32:22] so... [16:32:27] are we going on partial deployment again? [16:32:33] depooling sq51 first? [16:32:42] I think we should just deploy on sq51 [16:32:43] only one squid? 
I liked that as a plan last time. I say +1. [16:32:50] Change abandoned: Andrew Bogott; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:33:03] well, there's a) depool sq51 and deploy there b) deploy on sq51 directly c) deploy on all backend squids [16:33:10] I definitely don't want to go for (c) [16:33:13] depool, test there. [16:33:18] then put it back in [16:33:19] (a) [16:33:21] I don't see why we should depool and test [16:33:32] people make typos [16:33:41] I didn't write any new config [16:33:44] just applied the old one. [16:34:00] Change restored: Andrew Bogott; "..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:34:16] to verify nothing unexpected has changed since the last time? I'm ok with either (a) or (b) though I would probably go (a) just for safety's sake. [16:34:38] (a) [16:34:40] haha [16:34:48] how much longer can it take? (a) just for doublechecking [16:34:54] ottamata: you're just saying that cuz you have two in your name. [16:35:14] (a) is a eyepatche emoticon for me [16:35:18] eyepatched* [16:35:38] we need to go through all the 'disable sq51 on all frontends and esams backends' hoops again [16:35:48] the only real danger is cache pollution [16:35:52] quick q for a very busy room: [16:36:02] htpasswd.stats is an htpasswd file on spence [16:36:06] it is not puppetized [16:36:13] i need to move the site that uses it elsewhere [16:36:18] can I check it into puppet? [16:36:21] pws are scrambled [16:36:21] no. [16:36:23] paravoid: are you guys doing swift deployments of some variety today? [16:36:30] yes it appears we are [16:36:32] ottomata: maybe the private git repo. [16:36:34] * Ryan_Lane sighs [16:36:35] ok [16:36:51] for production. not labs [16:36:59] I'm doing labs upgrade today [16:36:59] paravoid: I leave it up to you. I'm oke either way. [16:37:03] this means I'm on my own [16:37:04] apergos: I think the problem is that I'm triple booked [16:37:19] how did you manage that? nm I'll ask you later [16:37:21] and I'm also double-booked tomorrow [16:37:33] apergos: ops meeting, swift, labs [16:37:34] well I can be here tomorrow [16:37:48] and we're all double booked for the ops meeting [16:38:00] our window ends when the ops meeting begins. [16:38:07] and the clock's a tickin... [16:38:10] yep [16:38:12] ok, re private repo, since there isn't review for that [16:38:18] misc/htpasswd.stats [16:38:20] s'ok? [16:38:21] maplebed: our window is *scheduled* to end when the ops meeting begins [16:38:29] ottomata: dunno. could you check later? [16:38:49] sigh, i guess so, i need to figure out how to get work done with you guys so busy :/ [16:38:55] s'ok, thank youuuu [16:39:01] good luck with swifty [16:39:08] ottomata: didn't you switch to our team? [16:39:25] decide somethiing, I just know that I'm not good at digging out of a hole under pressure [16:39:48] yup! but i need to ask Qs before I do things so I don't cause trouble, ya know? I think Leslie is gonna help me [16:39:50] thanks guyyys! [16:40:02] if the deployment goes past the meeting [16:40:08] then skip the meeting [16:40:15] good point [16:41:59] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:42:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [16:43:35] * maplebed pulls up http://noc.wikimedia.org/cgi-bin/cachemgr.cgi [16:44:37] * apergos is camped on a few strategic hosts [16:45:31] !log depooling sq51 backend [16:45:41] Logged the message, Master [16:51:11] draining... [16:51:40] nice and clrear from ganglia... [16:51:47] not yet. [16:51:50] sp why..... are swift gets already increasing? [16:52:09] * apergos stabs computers [16:53:24] those are the cache misses from sq51 not being available, aren't they? [16:53:34] it would appear so [16:54:35] that's quite a lot though [16:54:43] for one squid, I'll say it is [16:54:50] New patchset: Ottomata; "Hosting stats.wikimedia.org from stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21604 [16:54:56] it triped for a while and now it's doubled [16:55:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21604 [16:55:41] image scalers are happy so far [16:55:48] mm hmm [16:55:56] I don't like the big increase though [16:56:39] maplebed: is the GETs per second just GETs? or is it HEADs too? [16:56:45] ms7 (media riginals) has a little more outgoing traffic [16:56:47] if it's just GETs then it doesn't make any sense [16:57:11] you can see that the heads increased as well. [16:57:44] see that where? [16:57:51] oh, err.. nevermind. [16:57:54] I was misreading it. [16:58:20] increased GETs without increased HEADs does make sense - they're coming from the squids. [16:58:27] the squids don't head before get, only the scalers do that. [16:58:47] so it just means that there's thumbnails in swift that the squids are requesting to make up for their fallen comrade. [16:58:50] hm, you're right [16:59:03] that many though? [16:59:03] which we expect. [16:59:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:21] the squids have a really high cache hit rate. the volume doesn't surprise me. [16:59:51] we normally have 24 backend upload squids [17:00:14] that's just 4% of our cache [17:00:15] scary [17:00:18] uh huh [17:00:23] I am happy to see that there isn't a corresponding spike in image scaler traffic [17:00:28] I wonder if we can survive two or three of them down at once [17:00:42] well the others are catching up [17:00:43] which helps confirm our diagnosis. [17:01:09] maplebed: remember, it took more than half an hour for the spike to happen last time around [17:01:41] the load spike, yes. the network spike though - was it also sparated by time? I don't remember. [17:03:19] apergos: about how many we can lose... during our first attempt to put MW reads to swift we hit about 3k qps. one squid going down added 200qps to the query load. our normal is about 300. that logic says we should be able to lose 13 squids. Of course it's likely not linear, so I wouldn't go that far. but I think we're ok for 3. [17:03:31] ok, tests against sq51 [17:04:48] oh, it's still got the standard config. [17:05:20] yes [17:05:24] ms be3 and 5 seem more loaded than usual [17:05:49] all the open connections (or rather connectinos in time_wait) are from nagios and/or me. [17:05:56] (spence, fenari, neon) [17:06:11] shall we switch the config and test? 
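(The back-of-the-envelope estimate from 17:03 written out; the inputs are only the figures quoted in the conversation, not fresh measurements, and the caveat about linearity is the speakers' own.)

    ceiling_qps   = 3000.0   # load swift handled during the first MW-reads attempt
    baseline_qps  = 300.0    # normal thumbnail query load on swift
    per_squid_qps = 200.0    # extra load seen when one backend squid dropped out

    spare_squids = (ceiling_qps - baseline_qps) / per_squid_qps
    print(spare_squids)      # 13.5, hence "we should be able to lose 13 squids",
                             # assuming (optimistically) the growth stays linear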
[17:06:20] guess we'd better [17:07:31] no logging :-P [17:07:48] okay, I switched sq51's config [17:07:52] still depooled though [17:09:52] confirmed original content came from swift (via x-object-meta-sha1base36 tag and the absense of the server: sun-java header) [17:09:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.299 seconds [17:10:01] wfm for all of the tests I did [17:10:06] same tests as last time [17:10:19] great [17:10:22] as expected [17:10:25] onwards. [17:10:44] math, orig commons, orig enwiki, thumb enwiki, /archive/, !GET method, /v1/AUTH, /auth [17:13:44] so, pooling again. [17:13:48] +1 [17:13:55] uh huh [17:15:04] !log modified sq51 to serve origs from swift; added sq51 back to the pool [17:15:57] I restarted morebots, give it a sec [17:18:54] apergos: ? [17:18:57] ah [17:18:58] !log modified sq51 to serve origs from swift; added sq51 back to the pool [17:19:08] Logged the message, Master [17:19:10] it doesn't do well with netsplits [17:19:40] so now we wait another 25 mins *sigh* [17:20:58] hmm. [17:21:03] http://upload.wikimedia.org/wikipedia/fr/c/c3/Christ_en_croix_de_gen_paul.jpg [17:21:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21604 [17:21:17] ugh oh [17:21:19] that's *bad* [17:21:24] rats rats rats [17:21:25] that's cache pollution [17:21:49] I think that's an original that actually doesn't exist [17:21:54] but it's getting a 500 intsead of a 404. [17:21:55] I can find out [17:21:57] gimme a sec [17:22:08] rolling back sq51 in the meantime. [17:22:13] +1 paravoid [17:22:34] no such file or dir [17:22:35] done [17:22:45] you hit the nail right on the head [17:23:00] *sigh* [17:23:05] does that mean we don't need to worry about the cache pollution? [17:23:10] no [17:23:14] since the problem files aren't legit either way? [17:23:20] nice try [17:23:24] damn. [17:23:28] :-P [17:23:30] we can't do anything about it. [17:23:39] does it still fall into the 5m timeout? [17:23:43] hop so [17:23:46] +e [17:23:48] nope [17:23:52] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:24:15] ughhhhh [17:24:18] let me depool sq51 again. [17:24:55] might learn how well swift does after a purged squid rejoins the pool :-P [17:25:05] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [17:25:28] apergos: I don't think you understand [17:25:33] ok [17:25:35] please explain [17:25:36] it's not just sq51's backend cache, it's all frontend caches. [17:25:42] I see the error . [17:25:44] ugh because it was in the poooool [17:25:48] * apergos headdesks [17:25:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [17:26:00] maplebed: do tell. [17:26:36] in rewrite.py, line 341 (not in puppet - on an fe host) [17:26:48] I need to do the resp = thing and I didn't. [17:26:57] (the pattern you see above and below should have been followed there too. [17:27:04] ) [17:27:15] btw, how did you catch that? [17:27:48] I was watching the logs on ms-fe1, just taking a look at repsonse codes. [17:27:57] I saw a 499 on an original and tried it. [17:28:07] then looked in the code where 404 handling's done (for originals) [17:28:18] okay [17:28:26] I think it's time to merge that swift 1.5 commit [17:28:32] re-enable puppet everywhere [17:28:35] and do changes there. 
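(A minimal, hypothetical sketch of the "resp = ..." pattern referred to at 17:26 above. This is not the actual rewrite.py code at line 341; the class and handler names are invented, and it assumes the webob library that generation of swift middleware was built on. The point is that on a miss you build a response object and return the result of calling it as a WSGI application, rather than handing back a bare object, which is the kind of mistake the later change 21612 describes as "incorrect use of wsgi return objects".)

    # Illustrative only; assumes webob is installed.
    from webob import Request
    from webob.exc import HTTPNotFound

    class NotFoundExample(object):
        def __init__(self, app):
            self.app = app

        def __call__(self, env, start_response):
            req = Request(env)
            resp = req.get_response(self.app)   # hand the request to the proxy app
            if resp.status_int == 404:
                # the "resp =" step: build a 404 response object, then call it
                # as a WSGI application and return what it produces
                resp = HTTPNotFound()
                return resp(env, start_response)
            return resp(env, start_response)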
[17:28:35] +1 [17:28:46] we should had done that before we started this, but we're rushing everything [17:29:37] I was hoping you might have done that already this morning... sorry I forgot to ask. you're right that we should have earlier. [17:29:39] ah well. [17:32:15] New patchset: Faidon; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [17:32:20] (rebased) [17:32:37] sigh... [17:32:55] after you merge but before starting puppet on teh frontends, I'd like to do some comparisons on sockpuppet. [17:32:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264 [17:33:06] oh yeah, [17:33:26] * apergos is amused by the vim changes at the bottom of the file [17:33:30] !log rolled back sq51's config; depooled sq51 (10' ago) [17:33:33] I saw that too. [17:33:34] :P [17:33:40] Logged the message, Master [17:33:56] modelines are to enforce indentation, not display. [17:34:01] if you want nu or whatever, put it in your .vimrc [17:34:10] * apergos wants emacs. just sayin'  [17:34:12] paravoid: before doing more with sq51, we may have enough time to try again. [17:34:24] paravoid: sadly, since we all use root, I can't do that. [17:34:44] anyway, now that I'm not the only person interacting with those files, I don't mind. [17:34:45] :P [17:34:58] * maplebed stops chatting and examines the diff. [17:35:08] New review: Faidon; "Already in production." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/18264 [17:35:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [17:35:28] maplebed: I lost you [17:35:36] don't worry about it. [17:35:55] maplebed: I haven't merged on sockpuppet. feel free to login, compare and merge. [17:36:13] the diffs in gerrit looked good to me [17:36:31] maplebed: and again, "trying until we succeed" is not a good strategy [17:36:43] frontend cache pollution is basically irreversible [17:36:43] they look good to me. merging on sockpuppet. [17:37:03] asking a q here about rewrite.py [17:37:25] merged on sockpuppet. [17:37:29] line 354 you don't need to do the resp = thing? [17:37:32] comparing rewrite to production [17:37:45] !log disabling puppet on all virt nodes [17:37:48] I guess there is no response to be started but just checking [17:37:54] Logged the message, Master [17:38:31] diff confirmed. [17:38:41] apergos: looking. [17:38:44] ok [17:39:10] it seems like you would want it there too so, if not, I'll ask you to explainit later [17:39:17] apergos: we're looking at different versions, I think. [17:39:32] hmok [17:39:41] my line 354 is conf = global_conf.copy() [17:39:44] oh [17:40:16] lemme get the most recent version [17:41:11] ok I guess we're back to line 341 again [17:42:04] maplebed: did you reenable puppet anywhere? [17:42:09] paravoid: no, not yet. [17:42:12] but i think we're ready. [17:42:14] !log disabling OpenStackManager and LdapAuthentication on labsconsole [17:42:20] wait [17:42:22] apergos: yes, lien 341 is exactly where I want to make the change. [17:42:24] Logged the message, Master [17:42:27] waiting. [17:42:54] this change should go in before reqrite.py deploys, yes? [17:42:56] *rewrite [17:43:07] !log disabling all openstack services [17:43:07] hm? what rewrite.py deploys? [17:43:16] Logged the message, Master [17:43:22] I would like to see us deploy a noop via puppet, then make the rewrite change (after testing it) via puppet. 
[17:43:37] re-enabling puppet now should be a noop on both front and back end hosts. [17:43:40] what ben said [17:43:47] !log disabling openstack services on the virt infrastructure, that is [17:43:52] I'm running a noop on ms-fe1 & ms-be1 [17:43:56] Logged the message, Master [17:43:56] ok [17:44:02] obviously I'm not turning off swift :D [17:44:12] hahahahaha [17:44:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:17] New patchset: Bhartshorne; "fixing incorrect use of wsgi return objects in swift rewrite" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21612 [17:44:21] ^^^ the change I'd like to test. [17:44:26] I think I can test it on an eqiad host. [17:44:32] maplebed: can we test it out of production please? [17:44:38] yup [17:45:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21612 [17:45:20] puppet has been very slow lately [17:45:36] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21141 [17:46:04] !log stopping rabbitmq on virt0 [17:46:14] Logged the message, Master [17:46:17] New review: Alex Monk; "Shouldn't you change wgMetaNamespaceTalk to be "Wikimedia_Discusi?n" as well here?" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/21586 [17:46:24] I wonder why [17:46:32] hmm [17:46:48] Gerrit-wm broke that � [17:46:56] ah. [17:46:57] stdlib :) [17:47:24] Ryan_Lane: wild guess: stdlib made everything slower [17:47:42] really? [17:47:51] * Ryan_Lane groans [17:47:59] nothing but hate for puppet [17:48:00] I said wild guess :) [17:48:04] it's only used in one place [17:48:20] something hurt performance a lot [17:49:20] yeah. I've noticed that lately [17:49:42] 0.001135 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:46] 0.001162 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:49] 0.001165 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:53] 0.001110 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:56] thousands of these [17:49:56] the first field is the relative timestamp [17:50:05] yay [17:52:14] so in ten minutes either we close up shop for the meeting or we let ct know we're going to be delayed [17:52:27] I don't like rushing things [17:52:31] robla swung by earlier and suggested his preference is that we keep going. [17:52:37] nor playing with production [17:53:00] esp. when effects can be irreversible [17:53:20] change tested successfully on ms-fe1001 [17:53:32] (tested both the failure condition and the resolved condition) [17:53:58] \o/ [17:54:20] AaronSchulz: hm? [17:54:27] New review: Krinkle; "The correct encoding is UTF-8 indeed." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/21393 [17:54:33] anybody want to say no before I merge? [17:54:36] !log backing up all databases on virt0 [17:54:45] !log running a fresh backup of opendj on virt0 [17:54:46] Logged the message, Master [17:54:57] Logged the message, Master [17:55:05] maplebed: go ahead [17:55:05] ok, merging https://gerrit.wikimedia.org/r/#/c/21612/ [17:55:14] Krinkle: gerrit is great at formatting links isn't it? 
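(A small throwaway helper, illustrative rather than an existing script, for the strace output pasted at 17:49. The first field there is a relative timestamp, as strace -r prints: the delta since the previous call. Summing it across the repeated stat() lines gives a rough figure for how much of the puppetmaster run is spent around those stdlib stat() calls.)

    import re
    import sys

    STAT_RE = re.compile(r'^\s*(\d+\.\d+)\s+stat\("[^"]*modules/stdlib')

    def stdlib_stat_time(lines):
        """Count stdlib stat() lines and sum their relative timestamps."""
        count, total = 0, 0.0
        for line in lines:
            m = STAT_RE.match(line)
            if m:
                count += 1
                total += float(m.group(1))
        return count, total

    if __name__ == '__main__':
        n, secs = stdlib_stat_time(sys.stdin)
        print('%d stat() calls touching stdlib, ~%.2fs of trace time' % (n, secs))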
[17:55:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21612 [17:55:29] puppet is still trying to run in ms-fe/be [17:55:37] so am I telling ct we'll be there or we won't be there? [17:56:02] actually I'll tell him that we're dellayed anyways and we'll see [17:56:05] I think we should continue [17:56:20] sigh [17:56:33] since otherwise we drop verything in 2 minutes [17:56:41] how many other such cases are we going to find before a rollback is not going to do us any good? [17:57:23] one way to find out. [17:58:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [17:59:08] paravoid: what would you propose instead of another attempt? [18:00:03] fwiw, the other things I checked that came in to swift before finding teh 500 were looking good. [18:00:28] ah. ct's meeting is done. [18:00:33] ugh. my ldap backup script backs up logs [18:00:46] I left him a pm saying we were running over [18:01:00] gonna go chat for a sec. puppet's still running? [18:01:42] AaronSchulz: What do you mean? [18:01:55] CT says he things we should keep going. [18:02:30] ok so I agree about keeping working on things, the q is whether we make another attempt and hope we don't find anotehr broken thiing or whether there is some better approach we can take [18:02:42] I don't know of a better approach. [18:02:47] here's where I hope paravoid weighs in with some idea [18:02:48] our tests pass. [18:03:12] if we find a bug, we fix it and move on. [18:03:30] Krinkle: https://gerrit.wikimedia.org/r/21393 [18:04:17] AaronSchulz: oh.. I didn't see that. that only happens on refresh. The ajax insertion is fine. [18:04:26] maplebed: have you deployed your change to ms-fe*? [18:04:34] puppet is still running [18:04:39] on ms-fe1/be1 [18:04:48] I didn't see puppet in ps on ms-fe1. [18:05:08] yeah, it's not there. [18:05:14] I'm going to deploy the rewrite change by hand [18:06:46] okay, so puppet's diff [18:07:11] ms-be: lots of changes on /usr/bin/swift-drive-audit [18:07:29] ms-fe: a single line change in /usr/share/ganglia-logtailer/SwiftHTTPLogtailer.py [18:07:38] plus the expected changes. [18:08:42] the changes to swift-drive-audit are something to deal with later. I patched our version and we should port those patches to the new version. [18:08:51] they're only output changes though, so it can wait. [18:09:06] I tried to upstream the changes and am fighting their process. [18:09:26] fixed rewrite deployed to all ms-fe hosts and restarted teh proxy. [18:09:32] on all of them? [18:09:33] !log upgrading virt0 to precise [18:09:36] I think we're set. [18:09:43] Logged the message, Master [18:09:56] hm. I guess I should make the ldap change first [18:10:05] paravoid: have a sec for that ldap change? :) [18:10:11] paravoid: yup. [18:10:30] I've already backup up the ldap server [18:10:36] *backed [18:10:50] (in ldap and real backup formats) [18:10:53] *ldif [18:10:53] we're basically ignoring the fact that our window ended, so I guess I do [18:11:04] heh [18:11:32] paravoid: I think we're also ready to put sq51 back in for a second try. whenever you're ready. [18:12:16] sec [18:15:14] Ryan_Lane: so, just apply changed to ldap now, is that correct? [18:15:20] let me check [18:15:42] changes [18:15:50] I haven't done it yet [18:15:57] oh [18:16:01] I'm asking you for confirmation :) [18:16:03] paravoid: yes, just ldap [18:16:04] yep [18:18:30] Ryan_Lane: done [18:18:44] paravoid (and others)... 
don't worry about going over on the window, just let us know when you're done [18:18:50] member -> roleOccupant, groupofnames -> organizationalRole for sys/netadmin [18:18:51] robla: thanks. [18:18:58] and owner removed from projects [18:19:11] paravoid: thanks [18:19:39] looks good [18:19:40] great [18:19:45] now to start virt0's upgrade [18:19:46] yw [18:20:09] maplebed: so, should I proceed? [18:20:13] yes. [18:21:10] I see no point staging in sq51 again, do you? [18:22:11] maplebed: ^ [18:22:31] nope, go ahead and pool it in. [18:23:28] New review: Jeremyb; "> The problem is that the other "wrong" one is still in there. I can't delete" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/21393 [18:25:07] okay, deployed sq51 and pooled [18:25:14] well, not entirely pooled yet [18:25:22] partially, let's check that this works first. [18:25:56] k. [18:26:04] what do you mean partially? [18:26:34] sq51 frontend only. [18:26:46] I don't see any traffic to swift from sq51 yet. [18:27:33] you don't? that's weird. [18:28:02] I don't see the expected traffic spike on sq51 either [18:28:25] ah, there's one. [18:28:31] and three! [18:28:33] I fully deployed now [18:29:14] sq51 that is [18:30:18] ok, verified that 499s I see in the logs are normal 404s. [18:30:45] give me a url [18:30:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:13] wikipedia-commons-local-public.70/7/70/People_swimming_in_the_Rh%2525C3%2525B4ne_river.JPG [18:31:28] yeah, found another one too [18:31:29] http://upload.wikimedia.org/wikipedia/fr/c/c3/Christ_en_croix_de_gen_paul.jpg2 [18:31:46] leaking the account btw [18:31:50] not the token fortunately [18:32:07] nicer 404 pages would be nice too [18:33:26] the account id is public (at least in puppet) [18:33:51] but yeah, we could change the 404 message. [18:35:34] File not found: /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-fr-local-public.c3/c/c3/Christ_en_croix_de_gen_paul.jpg2 [18:35:50] What is that AUTH_*** portion? [18:35:59] I'm surprised by the number of originals that don't exist... [18:36:16] !log restarting virt0 [18:36:16] preilly: it's the account ID. it's public (at least in puppet) but we should probably pull it from the 404, just cuz it's ugly. [18:36:26] Logged the message, Master [18:36:37] what are 507s? [18:36:58] not sure. looking. [18:36:59] preilly: see above. account id [18:38:25] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [18:39:09] I see a few for gets and for heads both [18:39:10] http://upload.wikimedia.org/wikipedia/commons/7/7a/Smorod.gif heh, nice picture :) [18:39:26] apergos: anywhere other than on the object server? [18:39:44] nope [18:40:03] google says 'insufficient space'; I'd bet that it correlates to broken disks. [18:40:16] * maplebed checks to see if it's new with this depoly [18:40:35] nope, not new to this depoly. [18:40:40] no, there are some from early in the day [18:41:16] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [18:41:22] !log force running puppet on virt0 [18:41:32] Logged the message, Master [18:42:27] good luck wit that [18:42:32] heh [18:42:35] it's going to take a looooong time. [18:42:42] I need to push something in first too [18:42:56] it's been live 10 minutes; I don't see any evil-looking variations in the graphs yet. 
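(One possible way, sketched for illustration only, to strip the internal "/v1/AUTH_<account>" prefix out of swift's default 404 body before it reaches the client, as suggested at 18:36. The function name and regex are assumptions, not deployed code.)

    import re

    ACCOUNT_PREFIX = re.compile(r'/v1/AUTH_[0-9a-f-]+')

    def scrub_404_body(body):
        """Drop the account portion, leaving only the container/object path."""
        return ACCOUNT_PREFIX.sub('', body)

    # 'File not found: /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/foo/bar'
    # becomes 'File not found: /foo/bar'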
[18:43:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.184 seconds [18:43:37] New patchset: Ryan Lane; "Changing virt0's nova version to essex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21615 [18:44:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21615 [18:44:54] New review: Platonides; "It should automatically become Wikimedia_discusi?n when this change to $wgMetaNamespace gets in." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/21586 [18:47:57] so far so good. [18:48:08] checked on a 200 and the cache hit looked reasonable [18:54:05] ugh. mysql won't start on virt0 [18:54:13] Ryan_Lane: why not? [18:54:29] it thinks something it bound on its port [18:54:42] and tries to restart over and over [18:55:08] apparmor perhaps? [18:55:26] yep [18:55:29] that's it [18:55:29] maplebed: apergos: paravoid: redirect deployment done? [18:55:32] fucking apparmor [18:55:48] robla: one squid is pointing at swift. [18:55:50] we're letting it sit for a bit [18:55:54] no, not done, first phase (single squid) is done and we're still watching it for a bit [18:56:12] we'll move some more in what, 20 minutes? [18:56:21] when did this one go around? [18:56:25] * apergos looks at the scrollback [18:56:31] robla: do you have any swift-related deployment waiting? [18:56:39] just before 18:30UTC. [18:56:47] about 10 mins then [18:56:51] the only deployment waiting is 1.20wmf10 to enwiki, which should be pretty routine [18:56:57] apergos: 25 minutes. [18:57:07] yeah, fix your clock apergos :P [18:57:08] about 10 more mins [18:57:15] heh... [18:57:18] I'm thinking maybe we get it out of the way now [18:57:25] AaronSchulz: your thoughts? [18:57:46] so I would ask you to wait the ten mins [18:57:47] here's why [18:57:53] * AaronSchulz doesn't have any strong opinion, though waiting 10 is ok [18:57:56] if stuff goes to hell we'd like toknow if it was us or you [18:58:05] makes sense. ok [18:58:07] does 1.20wmf10 contain any cloudfiles-related changes? [18:58:10] then if you want to cut in first I have no problem with it [18:58:30] AaronSchulz: ^ [18:58:36] paravoid: I'm not sure, but it's already running on commons (since Wednesday of last week) [18:58:46] aha. [18:58:49] just token caching afaik [18:58:53] so I don't see an overlap then [18:59:29] well at this point it's 5 more mins so... [18:59:31] :-P [18:59:52] yeah, I don't mind, just so long as we don't completely miss our window [19:00:10] maplebed: any insight on why this happens: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_.*_404_90th>ype=line&title=Swift+90th+percentile+query+response+time+-+404s&aggregate=1 [19:00:34] paravoid: yeah, it's a flaw in the ganglia logtailer module I wrote. [19:00:42] hm? [19:00:59] when a metric doesn't appear during a time window, it doesn't report that metric, whic means ganglia hangs on to the previously reported statistic. [19:01:09] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [19:01:09] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [19:01:18] the problem is that it just doesn't report it instead of reporting 0. [19:02:47] in general, when you see a straight line like that in ganglia it's false data. 
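(A hypothetical illustration of the logtailer behaviour described at 19:00; the function and metric names are invented and this is not the real SwiftHTTPLogtailer.py. When a reporting window contains no matching requests, the module simply omits the metric, so ganglia keeps drawing the last value it received, which produces the flat "false data" lines in the linked graph. Always emitting every known metric, with 0 as the default, avoids that.)

    # Illustrative sketch: compute the 90th-percentile 404 latency per method
    # and always report a value, even for methods with no samples this window.
    def report_window(latencies_404_by_method, methods=('GET', 'HEAD', 'PUT')):
        metrics = {}
        for method in methods:
            samples = sorted(latencies_404_by_method.get(method, []))
            if samples:
                value = samples[min(int(len(samples) * 0.9), len(samples) - 1)]
            else:
                value = 0.0   # report zero instead of skipping the metric
            metrics['swift_%s_404_90th' % method] = value
        return metrics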
[19:04:23] robla: AaronSchulz paravoid just to be clear on the schedule, are we putting another squid in or are we doing the enwiki deploy?
[19:04:45] I say do enwiki deploy first, then we do all the squids.
[19:04:47] enwiki I guess
[19:04:50] unless there's another window after that
[19:05:11] paravoid: all of them? you're brave. you don't want to do, say, 2 more first?
[19:05:17] also +1 on the ordering.
[19:05:31] well 5 mins is up
[19:05:43] so robla, it's your turn if maplebed and paravoid agree
[19:05:45] yeah, let's get enwiki out of the way. should only need 5 min or so for that
[19:06:12] AaronSchulz: is deploying now
[19:06:22] ok
[19:07:03] deployment done: http://en.wikipedia.org/wiki/Main_Page
[19:07:18] oops...meant http://en.wikipedia.org/wiki/Special:Version
[19:07:23] that was fast
[19:07:33] do you want any time to watch for fallout independent of us?
[19:07:52] just a couple minutes to see if anything weird shows up in the log
[19:08:18] okey dokey
[19:15:49] * robla pings AaronSchulz in real life to give the all clear :)
[19:16:18] some exception flooding
[19:16:40] #0 /usr/local/apache/common-local/php-1.20wmf10/includes/Uri.php(101): Uri->setUri(NULL)
[19:16:50] from mobile frontend
[19:18:06] !log temporarily stopping puppet on emery to test log2udp AFT clicktracking relay to vanadium
[19:18:13] 10/sec
[19:18:16] Logged the message, Master
[19:18:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:18:49] ouch
[19:18:59] why is it calling wfParseUrl(NULL)?
[19:19:07] preilly: ^
[19:19:29] is anyone else seeing network pain in eqiad at the moment?
[19:19:35] * AaronSchulz is looking at exception.log
[19:20:13] mark, LeslieCarr: ^^
[19:20:20] Jeff_Green: best to ping people :)
[19:20:20] AaronSchulz: why is what?
[19:20:29] looking
[19:20:31] Ryan_Lane: ya
[19:20:45] LeslieCarr: i'm not 100% sure it's network but something is up with boron suddenly
[19:20:56] what kind of issue ?
[19:21:00] load is low, and it feels like horrible packet loss
[19:21:17] but pings don't seem lossy
[19:21:21] hrm
[19:21:38] you didn't just do anything did you?
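
The "10/sec" figure above is the sort of number you get by bucketing exception.log entries per second. A rough sketch of that count follows, assuming only that each entry begins with a date and a time field; the log path is passed on the command line and is not the actual production location.

    from collections import Counter
    import sys

    def exception_rate(path, top=10):
        """Bucket log lines by second and print the busiest buckets."""
        per_second = Counter()
        with open(path) as fh:
            for line in fh:
                parts = line.split(None, 2)
                # Assumed: the first two whitespace-separated fields are the
                # date and time of the entry.
                if len(parts) >= 2:
                    per_second[parts[0] + " " + parts[1]] += 1
        for stamp, n in per_second.most_common(top):
            print("%s  %d/sec" % (stamp, n))

    if __name__ == "__main__":
        exception_rate(sys.argv[1] if len(sys.argv) > 1 else "exception.log")
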
[19:21:45] preilly: well, $this->wmlContext->getCurrentUrl() must be returning null
[19:22:12] nope
[19:22:29] odd
[19:22:54] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:03] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:20] checking out the srx's
[19:23:30] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:30] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:30] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:30] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:39] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:40] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:40] LeslieCarr: happening again
[19:23:40] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:48] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:48] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:57] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:57] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:02] AaronSchulz: hmm
[19:24:10] from your ip ssh'ed to boron, right ?
[19:24:15] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours
[19:24:24] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:24] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:24] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:24] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:33] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:33] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:42] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:25:51] because i see that session and it looks happy ...
[19:27:37] preilly: seems like a setCurrentUrl() call is missing somewhere
[19:27:37] memory utilization is normal
[19:27:38] hrm
[19:28:46] and our route is getting handed off right in ashburn to comcast so the path looks pretty clear
[19:28:47] maybe DOMParse
[19:29:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.561 seconds
[19:30:24] maplebed, apergos: so?
[19:30:30] AaronSchulz: are you seeing this a lot in the logs?
[19:30:39] AaronSchulz: should I wait for you?
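
For context on the PROCS alerts flooding the channel here: the check expects three varnishncsa processes per cache host and goes critical when it finds fewer. The sketch below imitates that behaviour; the expected count and the use of pgrep are assumptions for illustration, not the actual Nagios plugin in use.

    import subprocess
    import sys

    def check_procs(name="varnishncsa", expected=3):
        """Count processes with the exact command name and compare to a floor."""
        result = subprocess.run(["pgrep", "-x", "-c", name],
                                capture_output=True, text=True)
        found = int(result.stdout.strip() or 0)
        if found < expected:
            print("PROCS CRITICAL: %d processes with command name %s" % (found, name))
            return 2  # Nagios CRITICAL exit code
        print("PROCS OK: %d processes with command name %s" % (found, name))
        return 0

    if __name__ == "__main__":
        sys.exit(check_procs())
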
[19:30:42] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:31:18] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:31:36] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:32:01] !log upgrading nova, glance and keystone database schemas
[19:32:11] Logged the message, Master
[19:33:26] paravoid: Aaron's chasing down an exception. They're pretty clearly unrelated, so I'd be ok either way.
[19:33:27] LeslieCarr: sorry--yes. ssh from my home to boron
[19:33:33] I know it's getting late for you...
[19:33:43] maplebed: that's fine, I planned for it today :)
[19:33:45] LeslieCarr: other connections do not seem to be suffering the same way
[19:33:49] !log adding keystone services and endpoints
[19:33:59] Logged the message, Master
[19:34:14] I was about to say, if you need to get going (to sleep or whatever) I can stick around for a while yet
[19:34:50] I need food, and will be back in 15.
[19:35:01] ok
[19:35:12] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:35:57] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:36:02] hrm, pings are happy as well, Jeff_Green
[19:36:06] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:36:15] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours
[19:36:23] Jeff_Green: i meant pings in between the routers, showing no packet loss or errors...
[19:36:47] yeah
[19:37:04] ok. thanks for checking.
[19:38:08] !log upgrading OpenStackManager on labsconsole
[19:38:17] Logged the message, Master
[19:39:09] shit. I had more OSM fixes I needed to push in
[19:39:20] I totally forgot to get them reviewed
[19:39:42] oh well, they are tiny fixes
[19:40:15] <^demon> You didn't push them to gerrit :p
[19:40:19] I know
[19:40:23] they are on virt1000
[19:40:46] I'm going to push them in really quck
[19:40:49] *quick
[19:41:12] ok. less tiny than originally imagined
[19:43:48] I can't believe I forgot this
[19:46:14] ok. OSM upgraded...
[19:46:42] ah. I need to fix its config settings
[19:47:01] hashar: welcome back
[19:47:12] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours
[19:47:26] mutante: howdy :))
[19:51:15] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours
[19:51:34] back. paravoid, did you run another?
[19:51:53] no
[19:52:03] we were waiting for AaronSchulz to give us the all clear
[19:52:10] ok
[19:52:20] but he's working on a mobile issue that cropped up during their deployment
[19:52:36] I actually took the chance to have dinner too :)
[19:54:07] rumor has it MaxSem is currently working on a fix. AaronSchulz do you want us to wait for it or should we continue and switch another squid in the interim?
[19:54:08] !log upgrading virt2 to precise
[19:54:18] Logged the message, Master
[19:54:21] !log deploying swift origs on sq51 (1h ago), sq52, sq53
[19:54:31] Logged the message, Master
[19:54:37] let's see
[19:54:40] dinner, for shame :-P
[19:54:43] +1
[19:54:48] maplebed: I'd go ahead
[19:54:57] tnx
[19:55:03] good you said so cause we just did :-D
[19:56:48] while we wait for traffic to pile up
[19:56:53] should we talk about remaining issues?
[19:58:37] so, I'm tailing the sampled-1000.log for non-thumb requests for sq5[123] and they're very, very few
[19:59:20] there shouldn't be many
[19:59:32] for media, that is
[19:59:34] New patchset: Hashar; "(bug 38217) disallow robots on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602
[19:59:45] the vast majority of what people get are thumbs, not originals
[19:59:52] so, issues
[19:59:58] what to do with math and other stuff on ms7?
[20:00:00] New review: Hashar; "PS2: add bug number in commit message" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/21602
[20:00:04] New patchset: Ryan Lane; "Changing openstack version to essex for all virt hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21667
[20:00:32] those have to be moved one at a time, to swift or elsewhere. much of that means code from AaronSchulz
[20:00:36] or someone
[20:00:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21667
[20:01:08] +1 apergos
[20:01:26] apergos: we now only use the one math directory right under upload6/
[20:01:30] and how are we going to copy files?
[20:01:34] it's unclear that for example the extension distributor stuff has the same needs as math; we might have different decisions for those
[20:01:37] tim and I changed that...yes, it will still need to be migrated though
[20:03:07] there is some stuff like the jar files that I think I can just move out of the way and make sure nothing breaks
[20:03:17] but I'll want to poke around first before doing so
[20:03:48] we're still writing to ms7 right?
[20:04:37] !log rebooting virt2
[20:04:46] Logged the message, Master
[20:05:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
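
The sampled-1000.log check at the top of this stretch (looking for non-thumb upload requests served by sq51-sq53) can be expressed as a small filter. The column positions below are an assumption about the sampled squid log format rather than a confirmed layout, so verify them against a real line before trusting the output.

    import re
    import sys

    SQUIDS = re.compile(r"^sq5[123]\.")

    def non_thumb_requests(lines):
        """Yield upload.wikimedia.org requests that are not thumbnails."""
        for line in lines:
            fields = line.split()
            # Assumed layout: squid hostname in column 1, request URL in column 9.
            if len(fields) < 9 or not SQUIDS.match(fields[0]):
                continue
            url = fields[8]
            if "upload.wikimedia.org" in url and "/thumb/" not in url:
                yield line.rstrip("\n")

    if __name__ == "__main__":
        for hit in non_thumb_requests(sys.stdin):
            print(hit)
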