[00:08:38] New patchset: Helder.wiki; "(bug 39652) Add "autoreviewer" to $wgRestrictionLevels on ptwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [00:10:48] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:13:57] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:17:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:48] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:29:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.685 seconds [00:47:51] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [00:57:54] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [01:03:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:17:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.151 seconds [01:40:39] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 244 seconds [01:41:33] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 296 seconds [01:45:36] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:45:36] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:48:00] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 683s [01:50:21] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [01:51:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [01:51:47] New review: Andrew Bogott; "I've removed the gerrit::common class and rearranged gerrit.pp generally. Most of this had the aim ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13484 [01:51:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [01:59:06] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 16 seconds [02:00:00] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 2s [02:04:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds [03:11:51] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Aug 27 03:11:37 UTC 2012 [03:28:07] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [05:48:32] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [05:58:35] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:13:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:13:36] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:13:36] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [06:13:37] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:13:37] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:38] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:13:38] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:39] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:13:39] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [07:22:30] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:00:44] New patchset: Hashar; "beta: wmgArticleFeedbackLotteryOdds => 0" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16036 [08:10:55] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13427 [09:00:06] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [09:00:06] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [09:23:12] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [09:33:00] New patchset: Mark Bergsma; "Decommission bayes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21556 [09:33:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21556 [09:33:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21556 [09:35:19] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [09:36:55] New patchset: Matthias Mullie; "(bug 36722) Article Feedback - Supporting feedback on help pages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17503 [09:38:12] !log Decommissioning bayes, shut it down [09:38:23] Logged the message, Master [09:39:47] New patchset: Matthias Mullie; "Add new AFT permission levels" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21141 [09:40:16] PROBLEM - Host bayes is DOWN: CRITICAL - Host Unreachable (208.80.152.168) [09:50:07] New patchset: Mark Bergsma; "Remove admins::dctech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21558 [09:50:53] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/21558 [09:52:33] New patchset: Mark Bergsma; "Remove admins::dctech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21558 [09:53:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21558 [09:57:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21558 [10:05:39] New patchset: Mark Bergsma; "Add NetApp monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21560 [10:06:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21560 [10:06:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21560 [10:12:13] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:16:41] mark: hi! if you're on a gerrit tear, could you look at https://gerrit.wikimedia.org/r/#/c/21483/ ? (not urgent) [10:20:38] added one comment [10:21:13] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:34:40] mark: huh, i could have sworn it wouldn't work otherwise [10:34:58] something about upstart closing stdin, perhaps? or not managing a pipeline correctly [10:35:36] but a quick test on a dev machine seems to work fine without the subshell. hrm. [10:36:36] well that's 2 subshells ;) [10:37:44] have you seen inception? :P [10:37:59] i have [10:38:11] that's why this is freaking me out ;-p [10:41:01] well, i've gone through the whole initctl stop / (re-)start cycle a few times and everything seems to behave as it should without the extra shell. so good catch -- i don't know what i was thinking. i'll update the patch. [10:42:47] also does it need to run as root? [10:43:03] oh nm [10:43:05] www-data [10:43:11] New patchset: Dereckson; "(bug 39671) Every logged-in user can now edit se.wikimedia.org." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21561 [10:47:33] New patchset: Ori.livneh; "Fix VCL bug; use varnishncsa instead of varnishlog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21483 [10:47:39] ^^ mark [10:48:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21483 [10:48:43] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [10:49:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21483 [10:50:05] w00t! thanks :) [10:54:29] New patchset: Dereckson; "(bug 39671) Every logged-in user can now edit se.wikimedia.org." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21561 [10:58:46] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [11:36:09] New patchset: Matthias Mullie; "remove config var that's no longer being used" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21570 [11:43:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:46] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:46:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:47:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.054 seconds [12:21:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [13:05:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:31] New review: Demon; "We already have gerrit in the apt repo (it's installed that way on manganese). I'll test our your fi..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13484 [13:20:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.048 seconds [13:45:59] New patchset: Hashar; "(bug 38946) hebrew fonts for SVG rendering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21588 [13:46:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21588 [13:49:15] paravoid, apergos: any chance you could do the rest of the back ends before we get to 9 this morning? [13:49:22] (9am pacific) [13:49:30] I didn't realize there were more to be done [13:49:42] I'm actually looking at the multiple heads issue though [13:49:45] I didn't forget to mail you on friday, did I? [13:49:53] well at the any heads issue tbh [13:50:45] if all the backends are upgraded before 9, we have the opportunity to try the originals switch then. [13:50:46] ah, I see it, no but [13:51:01] maplebed: a bit early for you isn't it? :) [13:51:03] I think you an aaron can fix the multiple heads issue after I'm gone. [13:51:04] so we shouldn't do the originals switch til we have the head request issue sorted out [13:51:16] but we can upgrade the backends nevertheless [13:51:21] apergos: the multiple heads issue will mostly go away after we stop writing NFS [13:51:21] yes we can [13:51:30] how do we know that? [13:51:31] at least half of them come from the multiple backend consistency check. [13:51:42] maplebed: btw, python-swauth is needed only on frontends, right? [13:51:49] (it's missing from puppet too, I'll add it) [13:51:52] aaron spent a lot of time looking at it on friday, and is continuing to do so. [13:51:54] paravoid: yes. [13:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:35] so that means there are still half that are unaccounted for? [13:52:41] I just wanted to check in early enough that I can go get my dog exercise and get into the office before 9. [13:53:00] apergos: walking through the code we were able to count up to 6; I don't remember whether aaron found the 7th. [13:53:13] ok. I read the scrollabck but didn't see that discussion [13:53:22] sorry, it wasnt' in IRC. 
[13:53:25] ah [13:53:35] oh irl then? [13:53:42] yes. [13:53:48] ok guess you can't send the logs [13:53:53] but really, the heads thing is not gating either backends or originals; [13:53:57] it's been like that since forever. [13:54:15] I think that's why I didn't mail about that conversation specifically [13:54:27] aaron's going to continue looking at that code this week too. [13:54:41] it's orthogonal to the originals switch [13:54:44] yes it is gating originals. mark asked us to look at this in his email. [13:55:09] the only real problem that it's creating right now is hurting debugging [13:55:34] too much cruft in logs/tcpdumps [13:55:39] well, besides a real performance issue [13:55:45] and it was a good question, which we tracked down sufficiently to understand that it's not an effect of our current switch [13:55:52] right. [13:56:11] the only problem is that if the switch goes south again we won't be able to debug it easily [13:56:12] anyway, if I'm going to get into the office by 9 I need to stop chatting about it. [13:56:28] the point wasn't that it's somehow related to swift code, it's that mw does this, it's inefficient, it should be fixed because it's a real performance hit [13:56:29] in any case, backends upgrades are also orthogonal to this conversation [13:56:33] so let's just do that now [13:56:56] that's fine [13:56:56] +1 paravoid. I'll be back by 9 [13:56:58] Change abandoned: Hashar; "Moved to puppet" [operations/debs/wikimedia-job-runner] (master) - https://gerrit.wikimedia.org/r/11610 [13:57:01] we have a 2hr window then. [13:58:33] maplebed: I'll also do upgrades & reboots to ms-be* if you don't disagree. [13:59:14] paravoid: I'd hold off on reboots but +1 to upgrades [13:59:19] why? [14:01:06] I'm a little nervous about rebooting the backends because I fear they will flip out on boot due to their disks. It's mostly fear, not backed by a know bad state in the systems, but if one doesn't come back then we need to stop upgrading and deal with it before continuing. [14:01:23] sigh [14:01:29] okay [14:01:29] They certainly do need to be rebootable, [14:01:38] and +1 to reboots at some point, [14:01:48] I just wanted to separate that problem (if it even comes up) from the process of upgrading. [14:02:11] they're not related and I didn't want to throw the upgrade process off on an unecessary tangent. [14:02:22] yeah, fair enouigh [14:02:32] ok, really afk now. [14:03:27] apergos: I'm going to start with the rest of the upgrades, ok with that? [14:03:32] sure [14:03:55] are you applying changes to confs and upgrading packages manually? [14:04:06] yes [14:04:42] ok [14:04:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.031 seconds [14:08:08] New review: Demon; "manifests/role/gerrit.pp seems to be missing from PS25?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/13484 [14:09:20] New review: Andrew Bogott; "That is a reasonable concern." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13484 [14:14:37] wow, the ms-be* upgrades are talking a looooong time [14:14:45] like normal package upgrades [14:15:02] wonder why [14:20:07] i/o load probably [14:22:06] apergos: do you know how ms-be10 is out of rotation? [14:22:14] with what mechanism it was disabled? [14:22:27] no. I just have his email, same as you [14:22:34] lemme dig around [14:23:28] hm, ring builder probably [14:25:17] yeah [14:27:55] ms-be10 is precise... [14:28:04] we have a mixture? 
[14:28:28] hence the 1.4.8 [14:28:29] yes. [14:28:32] which is fine [14:29:01] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:29:01] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:29:01] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:29:05] what do you mean, hence 1.4.8? [14:29:10] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:29:10] PROBLEM - swift-object-server on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:29:10] PROBLEM - swift-container-server on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:29:55] New patchset: Andrew Bogott; " Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21592 [14:30:20] apergos: all the others had swift 1.4.3; ms-be10 had 1.4.8. [14:30:29] oh, 1.4.3 [14:30:29] ben was wondering in his mail how that happened [14:30:29] ok [14:30:32] RECOVERY - swift-object-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:30:32] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:30:32] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:30:40] RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:30:40] RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:30:40] RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:30:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21592 [14:30:47] !log upgraded swift on ms-be10; (still depooled, no effect) [14:30:57] Logged the message, Master [14:36:26] apergos: could you have a look at the graphs for possible anomalies? [14:36:37] Change abandoned: Andrew Bogott; "oops, wrong ID" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21592 [14:37:08] I amlooking [14:37:19] New patchset: Andrew Bogott; " Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21593 [14:37:21] had them open since you started [14:38:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21593 [14:38:24] several of them are a little weird but I'm having trouble figuring out why that is [14:39:31] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+backend+storage look at the first few of these if you like to see what I mean [14:39:41] Change abandoned: Andrew Bogott; "Apparently I'm going to keep making this same mistake over and over all day." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/21593 [14:39:49] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [14:40:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:27] !log upgraded swift on ms-be4 [14:40:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [14:40:37] Logged the message, Master [14:41:21] sinc the only one you had done was ms-be10 which wasn't serving anything... [14:42:18] I've done ms-be4 too [14:42:24] gah those graphs are completely unreadable [14:42:30] too many colors [14:43:21] hello ops :) [14:43:23] I am back around [14:43:24] ! [14:43:27] yeah the dip is from before that though [14:43:42] anyways I would say to carry on fornow [14:45:17] hi hashar welcome back [14:45:29] hashar: heya, had a good time? [14:47:51] rain / sun / rain / sun [14:48:00] and close to no internet connection :-) [14:48:05] so I feel relaxed [14:49:48] very nice [14:49:53] hashar: I'll be spending all next week away from the internet, I am looking forward to it [14:50:11] hashar: I have a question for you about beta labs, can I send you an email? [14:50:49] chrismcmahon: go ahead go ahead :) [14:51:02] New patchset: Alex Monk; "(bug 39306) Add a flood group to itwiktionary." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19230 [14:51:03] ok, be a minute or two... [14:51:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.324 seconds [14:51:26] !log upgraded swift on ms-be3 [14:51:38] Logged the message, Master [14:55:11] apt upgrade on 5, 11, 12 [14:55:58] hiyaaa, could someone approve this one real quick? i had waited until this morning to ask so that I could babysit it and make sure its cool [14:55:59] https://gerrit.wikimedia.org/r/#/c/21391/ [14:58:33] hashar: sent, thanks [15:08:27] still looks ok on the graphs [15:20:30] this does not seem to be related to the upgrades but I do notice a lot of [15:20:35] "object-replicator @ERROR: max connections (2) reached -- try again later" in the logs [15:21:07] several per second, do you know anything about those? [15:23:51] no [15:26:20] ok, we'll see what ben has to say [15:26:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:57] gmail imap is unbelievably slow this past few days [15:27:11] odd [15:27:28] and is returning "System Error" every now and then [15:27:32] you'd think they pf all folks would nt have network/server issues [15:27:45] it's a second class citizen unfortunately [15:28:03] imap, ah right [15:28:30] on a related note, ms-be* is also very slow I/O-wise, installing updates for the past 30' or so [15:28:42] that's a looong time [15:29:24] these are on hosts still in the pool right? 
[15:29:42] yes [15:30:16] hope they'lll keep up with the originals traffic [15:30:33] that should be lot less than the thumbs though [15:31:25] by number of requests, notsure about actual # of packets [15:38:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.472 seconds [15:47:37] !log upgraded swift on ms-be11 [15:47:47] Logged the message, Master [15:49:05] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [15:49:21] seems to work so far [15:49:36] apergos: still looking at graphs? [15:49:40] oh yes [15:49:41] and logs [15:49:54] if something goes wacky you'll be the second one to know [15:51:16] :-) [15:54:23] !log upgraded swift on ms-be5 [15:54:32] Logged the message, Master [15:55:37] !log upgraded swift on ms-be12 [15:55:40] all done :) [15:55:43] right on time [15:55:46] sweet [15:55:47] Logged the message, Master [15:56:30] well, except ms-be6 which is still down [15:57:30] them's the breaks [15:59:08] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [16:04:24] New patchset: Hashar; "disallow robots on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602 [16:05:26] New review: Krinkle; "bump." [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/16241 [16:10:43] hi paravoid, apergos [16:10:49] New review: Hashar; "Leslie, Daniel, can you please look at this? That would let us run ruby based test suites on galliu..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16957 [16:10:51] yo [16:11:28] so sadly I don't have backscroll for my time commuting; anything interesting? [16:11:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:13] para void upgraded he backends, they look fine [16:12:17] I noticed a bunch of [16:12:19] ossm. [16:12:34] Aug 27 06:29:58 10.0.6.201 object-replicator @ERROR: max connections (2) reached -- try again later [16:12:34] in the logs [16:12:42] not just now, but consistently [16:12:49] wondering if you know what that's about [16:13:12] .201, eh? [16:13:18] also finally understood why for every thumb request we send a head, sent mail and mark replied, aaron will want to look at that [16:13:40] not just 201 [16:13:42] it's allof them [16:13:50] several a second across all the backends [16:13:53] oh, object replicator! [16:13:59] sorry, missed that part. [16:14:00] yes [16:14:08] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:14:08] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [16:14:09] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [16:14:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:14:09] yes, it's limited to 2 connections at a time. 
[16:14:10] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:10] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:14:11] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [16:14:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [16:14:12] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:14:12] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:15] I asked in swiftstack but it's a bit soon for anyone to be awake [16:14:27] to know if we should be concerned (if that's slowing things down) [16:14:33] brb bacthroom [16:14:43] object replicator is one of the backend processes. [16:14:53] it doesn't directly affect client requests [16:15:38] the more instances of the object replicator you allow the quicker the cluster heals in the event of a failure, and the more IO load it incurs. If the load from the replicator is too high, it will start to affect performance of the object-server (and client requests) [16:16:18] we'll probably want to play with the values for the number of instances of all the backend processes, but for now they're ok. [16:16:40] I think I may have been too conservative in restricting them to only 2 copies of the replicator, but I'm not sure. [16:16:50] I also see a few "bad rsync return code" [16:16:59] here and there [16:17:11] btw, ms-be10 had 1.4.8 because it runs precise [16:17:53] ah. and it has precise because when the disks all flipped out I rebuilt it and that was after we'd made the switch to default-precise. [16:17:58] ::sigh:: [16:18:38] (the eqiad cluster is also precise, btw.) [16:18:41] yeahI noticed the rsync whines, figured those would be next n the list after the object replicator thing [16:18:46] clean up the logs bit by bit [16:19:29] so. ready to try the switch? [16:19:37] (we're now in our window) [16:20:15] * apergos grits teeth unhapily [16:20:21] with two p's even [16:21:14] yay switch! [16:21:26] not that kind LeslieCarr :-P [16:22:06] hehe [16:24:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.669 seconds [16:24:15] paravoid: you have the helm [16:24:22] oh do I? :) [16:24:56] you did so well with squid last time, why not? [16:24:57] :D [16:25:05] besides, I like this role of just watching. [16:25:29] hehe [16:25:52] you know if stuff falls over we will have no solution except revert and wait again, right? [16:26:17] let's hope that if stuff falls over again we're able to gather some new piece of information to tell us why. [16:28:50] okay, I'm fixing the squid conf again [16:29:02] three-way merge (we did the ms5 change in the meantime) [16:29:18] maplebed: btw, what's wrong with ms-be10? lots of XFS errors in dmesg [16:29:39] hardware. it doesn't see any of the spinning disks. maybe the controller? [16:30:05] yeah, most likely [16:30:05] I think you'll see the disks are mounted, but if you ls -l /srv/swift-storage, you just see ?s for things that stat should fill in. [16:30:12] ugh [16:32:06] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:32:22] so... [16:32:27] are we going on partial deployment again? [16:32:33] depooling sq51 first? [16:32:42] I think we should just deploy on sq51 [16:32:43] only one squid? 
I liked that as a plan last time. I say +1. [16:32:50] Change abandoned: Andrew Bogott; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:33:03] well, there's a) depool sq51 and deploy there b) deploy on sq51 directly c) deploy on all backend squids [16:33:10] I definitely don't want to go for (c) [16:33:13] depool, test there. [16:33:18] then put it back in [16:33:19] (a) [16:33:21] I don't see why we should depool and test [16:33:32] people make typos [16:33:41] I didn't write any new config [16:33:44] just applied the old one. [16:34:00] Change restored: Andrew Bogott; "..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:34:16] to verify nothing unexpected has changed since the last time? I'm ok with either (a) or (b) though I would probably go (a) just for safety's sake. [16:34:38] (a) [16:34:40] haha [16:34:48] how much longer can it take? (a) just for doublechecking [16:34:54] ottamata: you're just saying that cuz you have two in your name. [16:35:14] (a) is a eyepatche emoticon for me [16:35:18] eyepatched* [16:35:38] we need to go through all the 'disable sq51 on all frontends and esams backends' hoops again [16:35:48] the only real danger is cache pollution [16:35:52] quick q for a very busy room: [16:36:02] htpasswd.stats is an htpasswd file on spence [16:36:06] it is not puppetized [16:36:13] i need to move the site that uses it elsewhere [16:36:18] can I check it into puppet? [16:36:21] pws are scrambled [16:36:21] no. [16:36:23] paravoid: are you guys doing swift deployments of some variety today? [16:36:30] yes it appears we are [16:36:32] ottomata: maybe the private git repo. [16:36:34] * Ryan_Lane sighs [16:36:35] ok [16:36:51] for production. not labs [16:36:59] I'm doing labs upgrade today [16:36:59] paravoid: I leave it up to you. I'm oke either way. [16:37:03] this means I'm on my own [16:37:04] apergos: I think the problem is that I'm triple booked [16:37:19] how did you manage that? nm I'll ask you later [16:37:21] and I'm also double-booked tomorrow [16:37:33] apergos: ops meeting, swift, labs [16:37:34] well I can be here tomorrow [16:37:48] and we're all double booked for the ops meeting [16:38:00] our window ends when the ops meeting begins. [16:38:07] and the clock's a tickin... [16:38:10] yep [16:38:12] ok, re private repo, since there isn't review for that [16:38:18] misc/htpasswd.stats [16:38:20] s'ok? [16:38:21] maplebed: our window is *scheduled* to end when the ops meeting begins [16:38:29] ottomata: dunno. could you check later? [16:38:49] sigh, i guess so, i need to figure out how to get work done with you guys so busy :/ [16:38:55] s'ok, thank youuuu [16:39:01] good luck with swifty [16:39:08] ottomata: didn't you switch to our team? [16:39:25] decide somethiing, I just know that I'm not good at digging out of a hole under pressure [16:39:48] yup! but i need to ask Qs before I do things so I don't cause trouble, ya know? I think Leslie is gonna help me [16:39:50] thanks guyyys! [16:40:02] if the deployment goes past the meeting [16:40:08] then skip the meeting [16:40:15] good point [16:41:59] New patchset: Andrew Bogott; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [16:42:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [16:43:35] * maplebed pulls up http://noc.wikimedia.org/cgi-bin/cachemgr.cgi [16:44:37] * apergos is camped on a few strategic hosts [16:45:31] !log depooling sq51 backend [16:45:41] Logged the message, Master [16:51:11] draining... [16:51:40] nice and clrear from ganglia... [16:51:47] not yet. [16:51:50] sp why..... are swift gets already increasing? [16:52:09] * apergos stabs computers [16:53:24] those are the cache misses from sq51 not being available, aren't they? [16:53:34] it would appear so [16:54:35] that's quite a lot though [16:54:43] for one squid, I'll say it is [16:54:50] New patchset: Ottomata; "Hosting stats.wikimedia.org from stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21604 [16:54:56] it triped for a while and now it's doubled [16:55:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21604 [16:55:41] image scalers are happy so far [16:55:48] mm hmm [16:55:56] I don't like the big increase though [16:56:39] maplebed: is the GETs per second just GETs? or is it HEADs too? [16:56:45] ms7 (media riginals) has a little more outgoing traffic [16:56:47] if it's just GETs then it doesn't make any sense [16:57:11] you can see that the heads increased as well. [16:57:44] see that where? [16:57:51] oh, err.. nevermind. [16:57:54] I was misreading it. [16:58:20] increased GETs without increased HEADs does make sense - they're coming from the squids. [16:58:27] the squids don't head before get, only the scalers do that. [16:58:47] so it just means that there's thumbnails in swift that the squids are requesting to make up for their fallen comrade. [16:58:50] hm, you're right [16:59:03] that many though? [16:59:03] which we expect. [16:59:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:21] the squids have a really high cache hit rate. the volume doesn't surprise me. [16:59:51] we normally have 24 backend upload squids [17:00:14] that's just 4% of our cache [17:00:15] scary [17:00:18] uh huh [17:00:23] I am happy to see that there isn't a corresponding spike in image scaler traffic [17:00:28] I wonder if we can survive two or three of them down at once [17:00:42] well the others are catching up [17:00:43] which helps confirm our diagnosis. [17:01:09] maplebed: remember, it took more than half an hour for the spike to happen last time around [17:01:41] the load spike, yes. the network spike though - was it also sparated by time? I don't remember. [17:03:19] apergos: about how many we can lose... during our first attempt to put MW reads to swift we hit about 3k qps. one squid going down added 200qps to the query load. our normal is about 300. that logic says we should be able to lose 13 squids. Of course it's likely not linear, so I wouldn't go that far. but I think we're ok for 3. [17:03:31] ok, tests against sq51 [17:04:48] oh, it's still got the standard config. [17:05:20] yes [17:05:24] ms be3 and 5 seem more loaded than usual [17:05:49] all the open connections (or rather connectinos in time_wait) are from nagios and/or me. [17:05:56] (spence, fenari, neon) [17:06:11] shall we switch the config and test? 
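(The back-of-the-envelope estimate from 17:03 written out; the inputs are only the figures quoted in the conversation, not fresh measurements, and the caveat about linearity is the speakers' own.)

    ceiling_qps   = 3000.0   # load swift handled during the first MW-reads attempt
    baseline_qps  = 300.0    # normal thumbnail query load on swift
    per_squid_qps = 200.0    # extra load seen when one backend squid dropped out

    spare_squids = (ceiling_qps - baseline_qps) / per_squid_qps
    print(spare_squids)      # 13.5, hence "we should be able to lose 13 squids",
                             # assuming (optimistically) the growth stays linear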
[17:06:20] guess we'd better [17:07:31] no logging :-P [17:07:48] okay, I switched sq51's config [17:07:52] still depooled though [17:09:52] confirmed original content came from swift (via x-object-meta-sha1base36 tag and the absense of the server: sun-java header) [17:09:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.299 seconds [17:10:01] wfm for all of the tests I did [17:10:06] same tests as last time [17:10:19] great [17:10:22] as expected [17:10:25] onwards. [17:10:44] math, orig commons, orig enwiki, thumb enwiki, /archive/, !GET method, /v1/AUTH, /auth [17:13:44] so, pooling again. [17:13:48] +1 [17:13:55] uh huh [17:15:04] !log modified sq51 to serve origs from swift; added sq51 back to the pool [17:15:57] I restarted morebots, give it a sec [17:18:54] apergos: ? [17:18:57] ah [17:18:58] !log modified sq51 to serve origs from swift; added sq51 back to the pool [17:19:08] Logged the message, Master [17:19:10] it doesn't do well with netsplits [17:19:40] so now we wait another 25 mins *sigh* [17:20:58] hmm. [17:21:03] http://upload.wikimedia.org/wikipedia/fr/c/c3/Christ_en_croix_de_gen_paul.jpg [17:21:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21604 [17:21:17] ugh oh [17:21:19] that's *bad* [17:21:24] rats rats rats [17:21:25] that's cache pollution [17:21:49] I think that's an original that actually doesn't exist [17:21:54] but it's getting a 500 intsead of a 404. [17:21:55] I can find out [17:21:57] gimme a sec [17:22:08] rolling back sq51 in the meantime. [17:22:13] +1 paravoid [17:22:34] no such file or dir [17:22:35] done [17:22:45] you hit the nail right on the head [17:23:00] *sigh* [17:23:05] does that mean we don't need to worry about the cache pollution? [17:23:10] no [17:23:14] since the problem files aren't legit either way? [17:23:20] nice try [17:23:24] damn. [17:23:28] :-P [17:23:30] we can't do anything about it. [17:23:39] does it still fall into the 5m timeout? [17:23:43] hop so [17:23:46] +e [17:23:48] nope [17:23:52] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:24:15] ughhhhh [17:24:18] let me depool sq51 again. [17:24:55] might learn how well swift does after a purged squid rejoins the pool :-P [17:25:05] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [17:25:28] apergos: I don't think you understand [17:25:33] ok [17:25:35] please explain [17:25:36] it's not just sq51's backend cache, it's all frontend caches. [17:25:42] I see the error . [17:25:44] ugh because it was in the poooool [17:25:48] * apergos headdesks [17:25:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [17:26:00] maplebed: do tell. [17:26:36] in rewrite.py, line 341 (not in puppet - on an fe host) [17:26:48] I need to do the resp = thing and I didn't. [17:26:57] (the pattern you see above and below should have been followed there too. [17:27:04] ) [17:27:15] btw, how did you catch that? [17:27:48] I was watching the logs on ms-fe1, just taking a look at repsonse codes. [17:27:57] I saw a 499 on an original and tried it. [17:28:07] then looked in the code where 404 handling's done (for originals) [17:28:18] okay [17:28:26] I think it's time to merge that swift 1.5 commit [17:28:32] re-enable puppet everywhere [17:28:35] and do changes there. 
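(A minimal, hypothetical sketch of the "resp = ..." pattern referred to at 17:26 above. This is not the actual rewrite.py code at line 341; the class and handler names are invented, and it assumes the webob library that generation of swift middleware was built on. The point is that on a miss you build a response object and return the result of calling it as a WSGI application, rather than handing back a bare object, which is the kind of mistake the later change 21612 describes as "incorrect use of wsgi return objects".)

    # Illustrative only; assumes webob is installed.
    from webob import Request
    from webob.exc import HTTPNotFound

    class NotFoundExample(object):
        def __init__(self, app):
            self.app = app

        def __call__(self, env, start_response):
            req = Request(env)
            resp = req.get_response(self.app)   # hand the request to the proxy app
            if resp.status_int == 404:
                # the "resp =" step: build a 404 response object, then call it
                # as a WSGI application and return what it produces
                resp = HTTPNotFound()
                return resp(env, start_response)
            return resp(env, start_response)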
[17:28:35] +1 [17:28:46] we should had done that before we started this, but we're rushing everything [17:29:37] I was hoping you might have done that already this morning... sorry I forgot to ask. you're right that we should have earlier. [17:29:39] ah well. [17:32:15] New patchset: Faidon; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [17:32:20] (rebased) [17:32:37] sigh... [17:32:55] after you merge but before starting puppet on teh frontends, I'd like to do some comparisons on sockpuppet. [17:32:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264 [17:33:06] oh yeah, [17:33:26] * apergos is amused by the vim changes at the bottom of the file [17:33:30] !log rolled back sq51's config; depooled sq51 (10' ago) [17:33:33] I saw that too. [17:33:34] :P [17:33:40] Logged the message, Master [17:33:56] modelines are to enforce indentation, not display. [17:34:01] if you want nu or whatever, put it in your .vimrc [17:34:10] * apergos wants emacs. just sayin'  [17:34:12] paravoid: before doing more with sq51, we may have enough time to try again. [17:34:24] paravoid: sadly, since we all use root, I can't do that. [17:34:44] anyway, now that I'm not the only person interacting with those files, I don't mind. [17:34:45] :P [17:34:58] * maplebed stops chatting and examines the diff. [17:35:08] New review: Faidon; "Already in production." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/18264 [17:35:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [17:35:28] maplebed: I lost you [17:35:36] don't worry about it. [17:35:55] maplebed: I haven't merged on sockpuppet. feel free to login, compare and merge. [17:36:13] the diffs in gerrit looked good to me [17:36:31] maplebed: and again, "trying until we succeed" is not a good strategy [17:36:43] frontend cache pollution is basically irreversible [17:36:43] they look good to me. merging on sockpuppet. [17:37:03] asking a q here about rewrite.py [17:37:25] merged on sockpuppet. [17:37:29] line 354 you don't need to do the resp = thing? [17:37:32] comparing rewrite to production [17:37:45] !log disabling puppet on all virt nodes [17:37:48] I guess there is no response to be started but just checking [17:37:54] Logged the message, Master [17:38:31] diff confirmed. [17:38:41] apergos: looking. [17:38:44] ok [17:39:10] it seems like you would want it there too so, if not, I'll ask you to explainit later [17:39:17] apergos: we're looking at different versions, I think. [17:39:32] hmok [17:39:41] my line 354 is conf = global_conf.copy() [17:39:44] oh [17:40:16] lemme get the most recent version [17:41:11] ok I guess we're back to line 341 again [17:42:04] maplebed: did you reenable puppet anywhere? [17:42:09] paravoid: no, not yet. [17:42:12] but i think we're ready. [17:42:14] !log disabling OpenStackManager and LdapAuthentication on labsconsole [17:42:20] wait [17:42:22] apergos: yes, lien 341 is exactly where I want to make the change. [17:42:24] Logged the message, Master [17:42:27] waiting. [17:42:54] this change should go in before reqrite.py deploys, yes? [17:42:56] *rewrite [17:43:07] !log disabling all openstack services [17:43:07] hm? what rewrite.py deploys? [17:43:16] Logged the message, Master [17:43:22] I would like to see us deploy a noop via puppet, then make the rewrite change (after testing it) via puppet. 
[17:43:37] re-enabling puppet now should be a noop on both front and back end hosts. [17:43:40] what ben said [17:43:47] !log disabling openstack services on the virt infrastructure, that is [17:43:52] I'm running a noop on ms-fe1 & ms-be1 [17:43:56] Logged the message, Master [17:43:56] ok [17:44:02] obviously I'm not turning off swift :D [17:44:12] hahahahaha [17:44:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:17] New patchset: Bhartshorne; "fixing incorrect use of wsgi return objects in swift rewrite" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21612 [17:44:21] ^^^ the change I'd like to test. [17:44:26] I think I can test it on an eqiad host. [17:44:32] maplebed: can we test it out of production please? [17:44:38] yup [17:45:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21612 [17:45:20] puppet has been very slow lately [17:45:36] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21141 [17:46:04] !log stopping rabbitmq on virt0 [17:46:14] Logged the message, Master [17:46:17] New review: Alex Monk; "Shouldn't you change wgMetaNamespaceTalk to be "Wikimedia_Discusi?n" as well here?" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/21586 [17:46:24] I wonder why [17:46:32] hmm [17:46:48] Gerrit-wm broke that � [17:46:56] ah. [17:46:57] stdlib :) [17:47:24] Ryan_Lane: wild guess: stdlib made everything slower [17:47:42] really? [17:47:51] * Ryan_Lane groans [17:47:59] nothing but hate for puppet [17:48:00] I said wild guess :) [17:48:04] it's only used in one place [17:48:20] something hurt performance a lot [17:49:20] yeah. I've noticed that lately [17:49:42] 0.001135 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:46] 0.001162 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:49] 0.001165 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:53] 0.001110 stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [17:49:56] thousands of these [17:49:56] the first field is the relative timestamp [17:50:05] yay [17:52:14] so in ten minutes either we close up shop for the meeting or we let ct know we're going to be delayed [17:52:27] I don't like rushing things [17:52:31] robla swung by earlier and suggested his preference is that we keep going. [17:52:37] nor playing with production [17:53:00] esp. when effects can be irreversible [17:53:20] change tested successfully on ms-fe1001 [17:53:32] (tested both the failure condition and the resolved condition) [17:53:58] \o/ [17:54:20] AaronSchulz: hm? [17:54:27] New review: Krinkle; "The correct encoding is UTF-8 indeed." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/21393 [17:54:33] anybody want to say no before I merge? [17:54:36] !log backing up all databases on virt0 [17:54:45] !log running a fresh backup of opendj on virt0 [17:54:46] Logged the message, Master [17:54:57] Logged the message, Master [17:55:05] maplebed: go ahead [17:55:05] ok, merging https://gerrit.wikimedia.org/r/#/c/21612/ [17:55:14] Krinkle: gerrit is great at formatting links isn't it? 
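(A small throwaway helper, illustrative rather than an existing script, for the strace output pasted at 17:49. The first field there is a relative timestamp, as strace -r prints: the delta since the previous call. Summing it across the repeated stat() lines gives a rough figure for how much of the puppetmaster run is spent around those stdlib stat() calls.)

    import re
    import sys

    STAT_RE = re.compile(r'^\s*(\d+\.\d+)\s+stat\("[^"]*modules/stdlib')

    def stdlib_stat_time(lines):
        """Count stdlib stat() lines and sum their relative timestamps."""
        count, total = 0, 0.0
        for line in lines:
            m = STAT_RE.match(line)
            if m:
                count += 1
                total += float(m.group(1))
        return count, total

    if __name__ == '__main__':
        n, secs = stdlib_stat_time(sys.stdin)
        print('%d stat() calls touching stdlib, ~%.2fs of trace time' % (n, secs))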
[17:55:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21612 [17:55:29] puppet is still trying to run in ms-fe/be [17:55:37] so am I telling ct we'll be there or we won't be there? [17:56:02] actually I'll tell him that we're dellayed anyways and we'll see [17:56:05] I think we should continue [17:56:20] sigh [17:56:33] since otherwise we drop verything in 2 minutes [17:56:41] how many other such cases are we going to find before a rollback is not going to do us any good? [17:57:23] one way to find out. [17:58:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [17:59:08] paravoid: what would you propose instead of another attempt? [18:00:03] fwiw, the other things I checked that came in to swift before finding teh 500 were looking good. [18:00:28] ah. ct's meeting is done. [18:00:33] ugh. my ldap backup script backs up logs [18:00:46] I left him a pm saying we were running over [18:01:00] gonna go chat for a sec. puppet's still running? [18:01:42] AaronSchulz: What do you mean? [18:01:55] CT says he things we should keep going. [18:02:30] ok so I agree about keeping working on things, the q is whether we make another attempt and hope we don't find anotehr broken thiing or whether there is some better approach we can take [18:02:42] I don't know of a better approach. [18:02:47] here's where I hope paravoid weighs in with some idea [18:02:48] our tests pass. [18:03:12] if we find a bug, we fix it and move on. [18:03:30] Krinkle: https://gerrit.wikimedia.org/r/21393 [18:04:17] AaronSchulz: oh.. I didn't see that. that only happens on refresh. The ajax insertion is fine. [18:04:26] maplebed: have you deployed your change to ms-fe*? [18:04:34] puppet is still running [18:04:39] on ms-fe1/be1 [18:04:48] I didn't see puppet in ps on ms-fe1. [18:05:08] yeah, it's not there. [18:05:14] I'm going to deploy the rewrite change by hand [18:06:46] okay, so puppet's diff [18:07:11] ms-be: lots of changes on /usr/bin/swift-drive-audit [18:07:29] ms-fe: a single line change in /usr/share/ganglia-logtailer/SwiftHTTPLogtailer.py [18:07:38] plus the expected changes. [18:08:42] the changes to swift-drive-audit are something to deal with later. I patched our version and we should port those patches to the new version. [18:08:51] they're only output changes though, so it can wait. [18:09:06] I tried to upstream the changes and am fighting their process. [18:09:26] fixed rewrite deployed to all ms-fe hosts and restarted teh proxy. [18:09:32] on all of them? [18:09:33] !log upgrading virt0 to precise [18:09:36] I think we're set. [18:09:43] Logged the message, Master [18:09:56] hm. I guess I should make the ldap change first [18:10:05] paravoid: have a sec for that ldap change? :) [18:10:11] paravoid: yup. [18:10:30] I've already backup up the ldap server [18:10:36] *backed [18:10:50] (in ldap and real backup formats) [18:10:53] *ldif [18:10:53] we're basically ignoring the fact that our window ended, so I guess I do [18:11:04] heh [18:11:32] paravoid: I think we're also ready to put sq51 back in for a second try. whenever you're ready. [18:12:16] sec [18:15:14] Ryan_Lane: so, just apply changed to ldap now, is that correct? [18:15:20] let me check [18:15:42] changes [18:15:50] I haven't done it yet [18:15:57] oh [18:16:01] I'm asking you for confirmation :) [18:16:03] paravoid: yes, just ldap [18:16:04] yep [18:18:30] Ryan_Lane: done [18:18:44] paravoid (and others)... 
don't worry about going over on the window, just let us know when you're done [18:18:50] member -> roleOccupant, groupofnames -> organizationalRole for sys/netadmin [18:18:51] robla: thanks. [18:18:58] and owner removed from projects [18:19:11] paravoid: thanks [18:19:39] looks good [18:19:40] great [18:19:45] now to start virt0's upgrade [18:19:46] yw [18:20:09] maplebed: so, should I proceed? [18:20:13] yes. [18:21:10] I see no point staging in sq51 again, do you? [18:22:11] maplebed: ^ [18:22:31] nope, go ahead and pool it in. [18:23:28] New review: Jeremyb; "> The problem is that the other "wrong" one is still in there. I can't delete" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/21393 [18:25:07] okay, deployed sq51 and pooled [18:25:14] well, not entirely pooled yet [18:25:22] partially, let's check that this works first. [18:25:56] k. [18:26:04] what do you mean partially? [18:26:34] sq51 frontend only. [18:26:46] I don't see any traffic to swift from sq51 yet. [18:27:33] you don't? that's weird. [18:28:02] I don't see the expected traffic spike on sq51 either [18:28:25] ah, there's one. [18:28:31] and three! [18:28:33] I fully deployed now [18:29:14] sq51 that is [18:30:18] ok, verified that 499s I see in the logs are normal 404s. [18:30:45] give me a url [18:30:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:13] wikipedia-commons-local-public.70/7/70/People_swimming_in_the_Rh%2525C3%2525B4ne_river.JPG [18:31:28] yeah, found another one too [18:31:29] http://upload.wikimedia.org/wikipedia/fr/c/c3/Christ_en_croix_de_gen_paul.jpg2 [18:31:46] leaking the account btw [18:31:50] not the token fortunately [18:32:07] nicer 404 pages would be nice too [18:33:26] the account id is public (at least in puppet) [18:33:51] but yeah, we could change the 404 message. [18:35:34] File not found: /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-fr-local-public.c3/c/c3/Christ_en_croix_de_gen_paul.jpg2 [18:35:50] What is that AUTH_*** portion? [18:35:59] I'm surprised by the number of originals that don't exist... [18:36:16] !log restarting virt0 [18:36:16] preilly: it's the account ID. it's public (at least in puppet) but we should probably pull it from the 404, just cuz it's ugly. [18:36:26] Logged the message, Master [18:36:37] what are 507s? [18:36:58] not sure. looking. [18:36:59] preilly: see above. account id [18:38:25] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [18:39:09] I see a few for gets and for heads both [18:39:10] http://upload.wikimedia.org/wikipedia/commons/7/7a/Smorod.gif heh, nice picture :) [18:39:26] apergos: anywhere other than on the object server? [18:39:44] nope [18:40:03] google says 'insufficient space'; I'd bet that it correlates to broken disks. [18:40:16] * maplebed checks to see if it's new with this depoly [18:40:35] nope, not new to this depoly. [18:40:40] no, there are some from early in the day [18:41:16] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [18:41:22] !log force running puppet on virt0 [18:41:32] Logged the message, Master [18:42:27] good luck wit that [18:42:32] heh [18:42:35] it's going to take a looooong time. [18:42:42] I need to push something in first too [18:42:56] it's been live 10 minutes; I don't see any evil-looking variations in the graphs yet. 
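(One possible way, sketched for illustration only, to strip the internal "/v1/AUTH_<account>" prefix out of swift's default 404 body before it reaches the client, as suggested at 18:36. The function name and regex are assumptions, not deployed code.)

    import re

    ACCOUNT_PREFIX = re.compile(r'/v1/AUTH_[0-9a-f-]+')

    def scrub_404_body(body):
        """Drop the account portion, leaving only the container/object path."""
        return ACCOUNT_PREFIX.sub('', body)

    # 'File not found: /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/foo/bar'
    # becomes 'File not found: /foo/bar'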
[18:43:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.184 seconds [18:43:37] New patchset: Ryan Lane; "Changing virt0's nova version to essex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21615 [18:44:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21615 [18:44:54] New review: Platonides; "It should automatically become Wikimedia_discusi?n when this change to $wgMetaNamespace gets in." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/21586 [18:47:57] so far so good. [18:48:08] checked on a 200 and the cache hit looked reasonable [18:54:05] ugh. mysql won't start on virt0 [18:54:13] Ryan_Lane: why not? [18:54:29] it thinks something it bound on its port [18:54:42] and tries to restart over and over [18:55:08] apparmor perhaps? [18:55:26] yep [18:55:29] that's it [18:55:29] maplebed: apergos: paravoid: redirect deployment done? [18:55:32] fucking apparmor [18:55:48] robla: one squid is pointing at swift. [18:55:50] we're letting it sit for a bit [18:55:54] no, not done, first phase (single squid) is done and we're still watching it for a bit [18:56:12] we'll move some more in what, 20 minutes? [18:56:21] when did this one go around? [18:56:25] * apergos looks at the scrollback [18:56:31] robla: do you have any swift-related deployment waiting? [18:56:39] just before 18:30UTC. [18:56:47] about 10 mins then [18:56:51] the only deployment waiting is 1.20wmf10 to enwiki, which should be pretty routine [18:56:57] apergos: 25 minutes. [18:57:07] yeah, fix your clock apergos :P [18:57:08] about 10 more mins [18:57:15] heh... [18:57:18] I'm thinking maybe we get it out of the way now [18:57:25] AaronSchulz: your thoughts? [18:57:46] so I would ask you to wait the ten mins [18:57:47] here's why [18:57:53] * AaronSchulz doesn't have any strong opinion, though waiting 10 is ok [18:57:56] if stuff goes to hell we'd like toknow if it was us or you [18:58:05] makes sense. ok [18:58:07] does 1.20wmf10 contain any cloudfiles-related changes? [18:58:10] then if you want to cut in first I have no problem with it [18:58:30] AaronSchulz: ^ [18:58:36] paravoid: I'm not sure, but it's already running on commons (since Wednesday of last week) [18:58:46] aha. [18:58:49] just token caching afaik [18:58:53] so I don't see an overlap then [18:59:29] well at this point it's 5 more mins so... [18:59:31] :-P [18:59:52] yeah, I don't mind, just so long as we don't completely miss our window [19:00:10] maplebed: any insight on why this happens: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_.*_404_90th>ype=line&title=Swift+90th+percentile+query+response+time+-+404s&aggregate=1 [19:00:34] paravoid: yeah, it's a flaw in the ganglia logtailer module I wrote. [19:00:42] hm? [19:00:59] when a metric doesn't appear during a time window, it doesn't report that metric, whic means ganglia hangs on to the previously reported statistic. [19:01:09] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [19:01:09] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [19:01:18] the problem is that it just doesn't report it instead of reporting 0. [19:02:47] in general, when you see a straight line like that in ganglia it's false data. 
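(A hypothetical illustration of the logtailer behaviour described at 19:00; the function and metric names are invented and this is not the real SwiftHTTPLogtailer.py. When a reporting window contains no matching requests, the module simply omits the metric, so ganglia keeps drawing the last value it received, which produces the flat "false data" lines in the linked graph. Always emitting every known metric, with 0 as the default, avoids that.)

    # Illustrative sketch: compute the 90th-percentile 404 latency per method
    # and always report a value, even for methods with no samples this window.
    def report_window(latencies_404_by_method, methods=('GET', 'HEAD', 'PUT')):
        metrics = {}
        for method in methods:
            samples = sorted(latencies_404_by_method.get(method, []))
            if samples:
                value = samples[min(int(len(samples) * 0.9), len(samples) - 1)]
            else:
                value = 0.0   # report zero instead of skipping the metric
            metrics['swift_%s_404_90th' % method] = value
        return metrics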
[19:04:23] robla: AaronSchulz paravoid just to be clear on the schedule, are we putting another squid in or are we doing the enwiki deploy?
[19:04:45] I say do enwiki deploy first, then we do all the squids.
[19:04:47] enwiki I guess
[19:04:50] unless there's another window after that
[19:05:11] paravoid: all of them? you're brave. you don't want to do, say, 2 more first?
[19:05:17] also +1 on the ordering.
[19:05:31] well 5 mins is up
[19:05:43] so robla, it's your turn if maplebed and paravoid agree
[19:05:45] yeah, let's get enwiki out of the way. should only need 5 min or so for that
[19:06:12] AaronSchulz: is deploying now
[19:06:22] ok
[19:07:03] deployment done: http://en.wikipedia.org/wiki/Main_Page
[19:07:18] oops...meant http://en.wikipedia.org/wiki/Special:Version
[19:07:23] that was fast
[19:07:33] do you want any time to watch for fallout independent of us?
[19:07:52] just a couple minutes to see if anything weird shows up in the log
[19:08:18] okey dokey
[19:15:49] * robla pings AaronSchulz in real life to give the all clear :)
[19:16:18] some exception flooding
[19:16:40] #0 /usr/local/apache/common-local/php-1.20wmf10/includes/Uri.php(101): Uri->setUri(NULL)
[19:16:50] from mobile frontend
[19:18:06] !log temporarily stopping puppet on emery to test log2udp AFT clicktracking relay to vanadium
[19:18:13] 10/sec
[19:18:16] Logged the message, Master
[19:18:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:18:49] ouch
[19:18:59] why is it calling wfParseUrl(NULL)?
[19:19:07] preilly: ^
[19:19:29] is anyone else seeing network pain in eqiad at the moment?
[19:19:35] * AaronSchulz is looking at exception.log
[19:20:13] mark, LeslieCarr: ^^
[19:20:20] Jeff_Green: best to ping people :)
[19:20:20] AaronSchulz: why is what?
[19:20:29] looking
[19:20:31] Ryan_Lane: ya
[19:20:45] LeslieCarr: i'm not 100% sure it's network but something is up with boron suddenly
[19:20:56] what kind of issue ?
[19:21:00] load is low, and it feels like horrible packet loss
[19:21:17] but pings don't seem lossy
[19:21:21] hrm
[19:21:38] you didn't just do anything did you?
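
The "10/sec" figure above is the sort of number you get by bucketing exception.log entries per second. A rough sketch of that count follows, assuming only that each entry begins with a date and a time field; the log path is passed on the command line and is not the actual production location.

    from collections import Counter
    import sys

    def exception_rate(path, top=10):
        """Bucket log lines by second and print the busiest buckets."""
        per_second = Counter()
        with open(path) as fh:
            for line in fh:
                parts = line.split(None, 2)
                # Assumed: the first two whitespace-separated fields are the
                # date and time of the entry.
                if len(parts) >= 2:
                    per_second[parts[0] + " " + parts[1]] += 1
        for stamp, n in per_second.most_common(top):
            print("%s  %d/sec" % (stamp, n))

    if __name__ == "__main__":
        exception_rate(sys.argv[1] if len(sys.argv) > 1 else "exception.log")
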
[19:21:45] preilly: well, $this->wmlContext->getCurrentUrl() must be returning null
[19:22:12] nope
[19:22:29] odd
[19:22:54] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:03] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:20] checking out the srx's
[19:23:30] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:30] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:30] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:30] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:39] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:40] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:40] LeslieCarr: happening again
[19:23:40] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:48] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:48] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:57] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:23:57] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:02] AaronSchulz: hmm
[19:24:10] from your ip ssh'ed to boron, right ?
[19:24:15] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours
[19:24:24] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:24] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:24] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:24] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:33] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:33] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:24:42] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:25:51] because i see that session and it looks happy ...
[19:27:37] preilly: seems like a setCurrentUrl() call is missing somewhere
[19:27:37] memory utilization is normal
[19:27:38] hrm
[19:28:46] and our route is getting handed off right in ashburn to comcast so the path looks pretty clear
[19:28:47] maybe DOMParse
[19:29:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.561 seconds
[19:30:24] maplebed, apergos: so?
[19:30:30] AaronSchulz: are you seeing this a lot in the logs?
[19:30:39] AaronSchulz: should I wait for you?
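
For context on the PROCS alerts flooding the channel here: the check expects three varnishncsa processes per cache host and goes critical when it finds fewer. The sketch below imitates that behaviour; the expected count and the use of pgrep are assumptions for illustration, not the actual Nagios plugin in use.

    import subprocess
    import sys

    def check_procs(name="varnishncsa", expected=3):
        """Count processes with the exact command name and compare to a floor."""
        result = subprocess.run(["pgrep", "-x", "-c", name],
                                capture_output=True, text=True)
        found = int(result.stdout.strip() or 0)
        if found < expected:
            print("PROCS CRITICAL: %d processes with command name %s" % (found, name))
            return 2  # Nagios CRITICAL exit code
        print("PROCS OK: %d processes with command name %s" % (found, name))
        return 0

    if __name__ == "__main__":
        sys.exit(check_procs())
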
[19:30:42] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:31:18] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:31:36] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:32:01] !log upgrading nova, glance and keystone database schemas
[19:32:11] Logged the message, Master
[19:33:26] paravoid: Aaron's chasing down an exception. They're pretty clearly unrelated, so I'd be ok either way.
[19:33:27] LeslieCarr: sorry--yes. ssh from my home to boron
[19:33:33] I know it's getting late for you...
[19:33:43] maplebed: that's fine, I planned for it today :)
[19:33:45] LeslieCarr: other connections do not seem to be suffering the same way
[19:33:49] !log adding keystone services and endpoints
[19:33:59] Logged the message, Master
[19:34:14] I was about to say, if you need to get going (to sleep or whatever) I can stick around for a while yet
[19:34:50] I need food, and will be back in 15.
[19:35:01] ok
[19:35:12] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:35:57] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:36:02] hrm, pings are happy as well, Jeff_Green
[19:36:06] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[19:36:15] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours
[19:36:23] Jeff_Green: i meant pings in between the routers, showing no packet loss or errors...
[19:36:47] yeah
[19:37:04] ok. thanks for checking.
[19:38:08] !log upgrading OpenStackManager on labsconsole
[19:38:17] Logged the message, Master
[19:39:09] shit. I had more OSM fixes I needed to push in
[19:39:20] I totally forgot to get them reviewed
[19:39:42] oh well, they are tiny fixes
[19:40:15] <^demon> You didn't push them to gerrit :p
[19:40:19] I know
[19:40:23] they are on virt1000
[19:40:46] I'm going to push them in really quck
[19:40:49] *quick
[19:41:12] ok. less tiny than originally imagined
[19:43:48] I can't believe I forgot this
[19:46:14] ok. OSM upgraded...
[19:46:42] ah. I need to fix its config settings
[19:47:01] hashar: welcome back
[19:47:12] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours
[19:47:26] mutante: howdy :))
[19:51:15] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours
[19:51:34] back. paravoid, did you run another?
[19:51:53] no
[19:52:03] we were waiting for AaronSchulz to give us the all clear
[19:52:10] ok
[19:52:20] but he's working on a mobile issue that cropped up during their deployment
[19:52:36] I actually took the chance to have dinner too :)
[19:54:07] rumor has it MaxSem is currently working on a fix. AaronSchulz do you want us to wait for it or should we continue and switch another squid in the interim?
[19:54:08] !log upgrading virt2 to precise
[19:54:18] Logged the message, Master
[19:54:21] !log deploying swift origs on sq51 (1h ago), sq52, sq53
[19:54:31] Logged the message, Master
[19:54:37] let's see
[19:54:40] dinner, for shame :-P
[19:54:43] +1
[19:54:48] maplebed: I'd go ahead
[19:54:57] tnx
[19:55:03] good you said so cause we just did :-D
[19:56:48] while we wait for traffic to pile up
[19:56:53] should we talk about remaining issues?
[19:58:37] so, I'm tailing the sampled-1000.log for non-thumb requests for sq5[123] and they're very, very few
[19:59:20] there shouldn't be many
[19:59:32] for media, that is
[19:59:34] New patchset: Hashar; "(bug 38217) disallow robots on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602
[19:59:45] the vast majority of what people get are thumbs, not originals
[19:59:52] so, issues
[19:59:58] what to do with math and other stuff on ms7?
[20:00:00] New review: Hashar; "PS2: add bug number in commit message" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/21602
[20:00:04] New patchset: Ryan Lane; "Changing openstack version to essex for all virt hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21667
[20:00:32] those have to be moved one at a time, to swift or elsewhere. much of that means code from AaronSchulz
[20:00:36] or someone
[20:00:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21667
[20:01:08] +1 apergos
[20:01:26] apergos: we now only use the one math directory right under upload6/
[20:01:30] and how are we going to copy files?
[20:01:34] it's unclear that for example the extension distributor stuff has the same needs as math; we might have different decisions for those
[20:01:37] tim and I changed that...yes, it will still need to be migrated though
[20:03:07] there is some stuff like the jar files that I think I can just move out of the way and make sure nothing breaks
[20:03:17] but I'll want to poke around first before doing so
[20:03:48] we're still writing to ms7 right?
[20:04:37] !log rebooting virt2
[20:04:46] Logged the message, Master
[20:05:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
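
The sampled-1000.log check at the top of this stretch (looking for non-thumb upload requests served by sq51-sq53) can be expressed as a small filter. The column positions below are an assumption about the sampled squid log format rather than a confirmed layout, so verify them against a real line before trusting the output.

    import re
    import sys

    SQUIDS = re.compile(r"^sq5[123]\.")

    def non_thumb_requests(lines):
        """Yield upload.wikimedia.org requests that are not thumbnails."""
        for line in lines:
            fields = line.split()
            # Assumed layout: squid hostname in column 1, request URL in column 9.
            if len(fields) < 9 or not SQUIDS.match(fields[0]):
                continue
            url = fields[8]
            if "upload.wikimedia.org" in url and "/thumb/" not in url:
                yield line.rstrip("\n")

    if __name__ == "__main__":
        for hit in non_thumb_requests(sys.stdin):
            print(hit)
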