[00:00:56] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2351728 (10ori) >>! In T133821, @BBlack wrote: > Therefore, it's easy for a race condition to occur where an upper-layer cache gets purged of the item, then immediately gets a new request for the item, and then r... [00:01:46] subbu: sigh, ok. these should also be gone now.. (the problem is that we have different permissions for all this on the regular apt.wm repo but the same puppet class is trying to set it up) [00:02:08] ok ... should i retry now? [00:02:09] manually changed and stopped puppet, need to followup with some code change [00:02:12] yes [00:02:28] i got: Package has already been uploaded to bromine.eqiad.wmnet on bromine.eqiad.wmnet [00:02:28] Nothing more to do for /tmp/parsoid_0.5.0allubuntu1_amd64.changes [00:02:58] not sure if the "post-upload command" needed to be run or if it got run. [00:03:08] uhm.. i was about to delete the /tmp file but it's gone [00:03:54] https://releases.wikimedia.org/debian/pool/main/p/parsoid/ [00:04:09] not there .. [00:04:14] subbu: is that /tmp/ file on your side on tin? [00:04:22] lets delete that [00:04:32] ssastry@tin:/srv/deployment/parsoid/deploy$ ls /tmp/pars* [00:04:43] ssastry@tin:/srv/deployment/parsoid/deploy$ ls /tmp/pars* [00:04:48] awight: works, thanks! [00:04:53] /tmp/parsoid_0.5.0allubuntu1_all.deb /tmp/parsoid_0.5.0allubuntu1_amd64.bromine.eqiad.wmnet.upload /tmp/parsoid_0.5.0allubuntu1_amd64.changes [00:05:02] subbu: yea, those [00:05:07] mutante, should i delete the *bromine* one? [00:05:24] subbu: yes [00:05:39] no perms for me to delete it. [00:05:45] you will have to do it. [00:05:58] ok, done [00:06:13] retrying now. [00:06:40] same as last time .. "Error: post upload command failed." because of earlier perm denied errors. [00:06:41] tgr: do you see the config change now? [00:07:13] subbu: what do you run to upload it? i can try [00:07:23] awight: can we check whether the config change has been enabled? [00:07:24] bmansurov: sorry, I was talking about my other patch [00:07:32] mutante, deb-upload /tmp/parsoid_0.5.0allubuntu1_amd64.changes [00:07:41] scap is still running, for another 20 minutes I'd guess [00:07:46] bmansurov: I started a full scap--I think that part should have gone out though [00:07:51] tgr: ok [00:07:53] ah nope [00:07:55] what tgr said [00:08:00] awight: ok [00:10:31] subbu: Successfully uploaded packages. [00:10:37] but: [00:10:42] No distribution named 'precise-wikimedia' found! [00:10:50] is it really supposed to be preicse? [00:11:40] hmm ... i don't know .. i just followed directions @ https://www.mediawiki.org/wiki/Parsoid/Debian /cc gwicke [00:11:59] checks the distributions file [00:12:12] bmansurov: if you wanna cheat a bit, I think the config is synced now... [00:12:56] subbu: you probably want jessie-wikimedia [00:13:09] awight: how do I cheat? I still don't see it. [00:13:12] we used to publish 'wmf-production' [00:13:25] but I don't think that is a 'release' in the current repo [00:13:45] subbu: check debian/changelog [00:13:53] gwicke, i didn't set precise anywhere during by build .. so, i don't know where that came from. i blindly followed instructions on that wiki page, but, i should probably read up a bit on this. :) [00:13:57] ok, let me. [00:14:21] parsoid (0.5.0allubuntu1) wmf-production; urgency=medium [00:14:21] i also touched that distributions file but it _should_ be like before.. 
checking that [00:14:25] when you update the changelog with dch, you'll need to set a release [00:14:43] bmansurov: Apparently I don't know what I'm talking about ;) I thought one of these sync-* stages would propagate your config changes, even if the other pieces of the scap hadn't been synced yet. [00:15:03] awight: oh ok, i'll check back later [00:15:05] thanks [00:15:31] subbu: these days, jessie-wikimedia might be a better value for the release [00:15:48] instead of wmf-production? [00:15:56] yes [00:16:01] gwicke: subbu: there is precise-mediawiki but not precise-wikimedia [00:16:01] ok. let me rebuild. [00:16:09] ok [00:16:22] can i manually edit the changelog file and rebuild or do i need to go through dch? [00:16:33] you can manually edit it [00:16:35] k [00:16:56] dch is just a helper, similar to crontab [00:17:33] k [00:18:24] reuploading to tin [00:18:27] https://www.mediawiki.org/wiki/Parsoid/Setup#Ubuntu_.2F_Debian lists both trusty-mediawiki and jessie-mediawiki [00:18:38] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2351755 (10StevenJ81) See my comment just above (3rd comment above this). One thing I that needs to be done, apparently, on jamwiki, is to cr... [00:18:51] it's too bad that we don't have sid [00:19:18] this way, we'll break user's apt setup with each switch [00:19:49] gwicke: subbu: everything is -mediawiki there apparently [00:19:54] Codename: jessie-mediawiki [00:19:54] Log: jessie-mediawiki [00:20:11] yea https://releases.wikimedia.org/debian/dists/ [00:20:15] ok .. so, let me rebuild with jessie-mediawiki then :) [00:22:01] der weg ist das ziel [00:22:24] mutante-gwicke code language. [00:22:41] awight: yay, working! [00:22:43] thanks again [00:22:55] awesome. [00:24:27] mutante, gwicke the new build with jessie-mediawiki is on tin in /tmp again [00:24:45] !log awight@tin Finished scap: Deploying labtestwiki AuthManager config; Enabling Popups experiment; CentralNotice fixes for T136408, T136387; Special:Notifications fixes (duration: 25m 08s) [00:24:46] T136408: Update CentralNotice JSHint config to restrict syntax to ES3 (disallow ES5 or ES6) - https://phabricator.wikimedia.org/T136408 [00:24:46] T136387: CentralNotice failing in older browsers due use of ECMAScript 6 syntax - https://phabricator.wikimedia.org/T136387 [00:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:11] subbu: checked wikitonary, is missing http://www.linguee.com/german-english/translation/der+weg+ist+das+ziel.html [00:25:33] subbu: should i upload ? [00:25:46] RoanKattouw: your scap sync is complete! [00:25:51] mutante, yup .. ah. "the journey is the reward" :) [00:26:06] tgr: AndyRussG: ^ fyi [00:26:32] No distribution named 'precise-wikimedia' found! [00:26:36] hmm [00:26:38] whaat ... [00:26:40] awight: merci! [00:26:42] awight: Thanks! Testing [00:27:11] subbu: i see the .changes file has jessie.. uhm,,, [00:27:13] Does anyone know what bot is doing the mw-core commits when extensions deployment branches are merged? [00:27:30] lol [00:27:33] Oh [00:27:37] mutante, subbu: my guess would be that the precise is injected by the upload toolchain [00:27:47] awight: thanks, tested [00:27:52] great, thank you [00:28:08] mutante, gwicke i have to go soon to please the badminton gods .. i know how to do that .. debian gods have different rituals .. they can wait till tomorrow. 
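For reference, a minimal sketch of the changelog fix discussed above (assuming the parsoid packaging tree; the path, version and rebuild command are illustrative, not a record of what was actually run):
    # The release field is the word after the version in the first debian/changelog line:
    #   parsoid (0.5.0allubuntu1) wmf-production; urgency=medium
    # Point it at a distribution that actually exists on releases.wikimedia.org:
    sed -i '1s/wmf-production/jessie-mediawiki/' debian/changelog
    # (devscripts' dch offers -D/--distribution and --force-distribution for the same edit.)
    dpkg-buildpackage -us -uc    # rebuild, then re-run deb-upload on the new .changes file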
:) [00:28:30] * gwicke /knew/ it [00:28:54] subbu: ok, i will keep looking [00:29:40] thanks ... if you can update the wiki page https://www.mediawiki.org/wiki/Parsoid/Debian with any updated instructions, that will be helpful for the next release (in a week's time probably). [00:29:47] i'll check in tomorrow. [00:30:05] ok [00:30:10] bd808: Do you know anything about the bot to create mw-core submodule commits when extension#deployment are merged? [00:32:00] awight: I don't think that's a bot, I think it's Gerrit functionality [00:32:17] There's a bug that causes it to not work for VisualEditor because there are two repos named VisualEditor, I think [00:32:28] Hmmm [00:32:29] oh wow, that's scary! [00:32:33] For CentralNotice I wonder if it would work automagically if you used wmfN branches like the other extensions do [00:32:38] yah exactly [00:32:39] Yeah it really is [00:32:47] And annoying for the VE team [00:33:10] I tried to make a patch to Gerrit at one point... I can't even describe the pain--and gouging holes in legacy code is sort of my jam. [00:33:20] awight: My changes in prod look good BTW, thanks for deploying [00:33:36] Oh yes I tried that a few years ago and gave up in a mix of pain and disgust [00:33:37] The simplest change required completely different formats of the same change, in three different parts of the source. [00:33:52] It sounds like you may have gotten farther than me [00:33:55] hehe okay glad your bugfixes worked out. [00:34:01] I think I got one change uploaded successfully, but it was just some CSS [00:34:10] nah. The last straw was trying to run their custom build tool. [00:34:26] Huh [00:34:33] Some Maven thing? [00:34:40] "Buck" [00:34:46] I assume we have our own Gerrit fork? [00:34:56] built to run on some insane Google cluster, I suppose [00:35:15] No I think we run stock Gerrit [00:35:21] https://buckbuild.com/ [00:35:28] Maybe with a theme or something but I think our customization is relatively minima; [00:35:33] a lot of the google tooling is different because they started quite early [00:35:35] buck was built by facebook to make their android builds faster [00:35:37] I think [00:35:38] Well... AndyRussG I guess RoanKattouw gave us the answer, then--we start using the standard deployment branch? [00:35:40] Aside from config and tooling like Zuul [00:35:59] because they were doing crazyasfuck things where the NUMBER OF METHODS in the facebook application was overruning an integer field somewhere in dalvik [00:36:02] The only issue would be if we still want to prevent master from going out onto deployment with every train. [00:36:22] YuviPanda: hahahaha that's amazing [00:36:23] ah, right- this is the FB equivalent [00:36:38] awight: yes exactly [00:36:45] google has http://www.bazel.io/ [00:36:45] That's my main concern there [00:36:46] awight: but what was it that pushed the tip of deploy out with the train yesterday? [00:37:00] I think master went out, actually. [00:37:05] hmm. [00:37:09] good question, I don't know [00:37:16] awight: what's the blocker for this train cycle? [00:37:23] awight: no, I'm 99% sure it was the tip of deploy... Lemme check [00:37:25] You guys often have prod more than a week behind master don't you? [00:37:30] yah [00:38:11] Though that can be customized in the script that makes the branches [00:38:35] It could make the branch based on last week's branch instead of master [00:38:37] For CentralNotice, yeah quite a lot, often. 
We could surely be more disciplined, but at least sometimes (like now in fact) it's nice to have stuff on the beta cluster and not be forced to QA there before some arbitrary train departure time [00:38:57] Our idea (and we're discussing it in order to change, so feedback is welcome!) is that we keep a slightly control freaky grip on what's being deployed, which allows us to merge bits of long-running feature projects to #master, and carefully deploy only the pieces that are ready. [00:39:00] Yeah I see the value in that [00:39:18] And I would probably be doing that too if the stakes were that high [00:39:29] Same deal with DonationInterface. [00:39:58] And fwiw, we're currently talking about putting paymentswiki on the train to reduce our maintenance + incompatibility burden, but the same considerations apply there. [00:39:59] Yea FR + runs on every pageview [00:40:39] Actually, plus another issue, that we occasionally need to shove a DonationInterface deployment through like ASAP because money. [00:40:57] Yeah [00:41:06] Anyway, for CentralNotice perhaps it'd be enough to stop cutting wmf* branches from master, but create those manually? [00:41:11] I think the most important requirement for CN branchcutting is that it be easy to use for deployers, predictable, and flexible [00:41:23] "manually" is not the right term, but yes [00:41:31] awight: I don't think we're currently cutting wmf* branches for CN [00:41:36] hmm? I think I meant that--is there something better? [00:41:41] It's all just wmf_deploy [00:41:43] Get the script to cut them based on the previous branch instead of the current state of master [00:42:09] yes okay that sounds perfect. So it's in stasis until we intentionally prepare a deployment. [00:42:28] I say "instead of" not because that's what it currently does, but because that's the default [00:42:36] Yeah exactly [00:43:24] Also it'll be easier to deploy a change only to wmfN but not wmfN-1 [00:43:28] AndyRussG: +1 your point that the only problem with what we have currently is that deployers are not prepared to deal with CentralNotice deployment branches [00:43:36] Like if you need new things from core for example [00:44:17] Hmmm interesting [00:44:40] I really like that. [00:44:53] You *can* do that now too, but the need to do manual submodule stuff makes everything less intuitive [00:44:59] Plus, I'm desperate to align with the rest of WMF. Our island is starting to stink. [00:45:06] And generally the less different this is from normal extensions the better [00:45:14] that. [00:45:31] The stasis thing needs to remain, and that'll be a difference, but everything else we can reconcile [00:46:42] Could master be the ready to deploy version, and another branch sent to beta cluster? [00:47:11] I don't think that's a good idea because master is the branch people will develop from [00:47:13] I see there are a few other extensions with special deployment branches, but none of those look like role models... Wikidata, SemanticMediaWiki, SemanticResultFormats, Validator [00:47:41] Dereckson: It's a good thought, though!
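For reference, a minimal sketch in plain git of what "cut the branch from the previous branch instead of master" could look like (hypothetical branch names; this is not what make-wmf-branch currently does):
    # Hypothetical sketch: keep master in stasis and base each new deployment
    # branch on the previous one, so nothing rides the train by accident.
    cd extensions/CentralNotice
    git fetch origin
    # current default behaviour would be: git branch wmf/1.28.0-wmf.5 origin/master
    git branch wmf/1.28.0-wmf.5 origin/wmf/1.28.0-wmf.4   # cut from last week's branch instead
    git push origin wmf/1.28.0-wmf.5
    # anything meant for deployment is then merged or cherry-picked into the new
    # branch deliberately, rather than inherited from whatever is in master.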
[00:48:32] awight: Semantic* and Validator are not deployed at WMF AFAIK [00:48:44] Or maybe only on wikitech [00:48:49] Generally the beta cluster is the right place to stop when something is near deployment [00:49:01] Because of the similarity of the setup [00:49:15] (which is its point) [00:49:24] We run our browser tests off it, too [00:50:37] RoanKattouw: we're also trying to get rid of them from wikitech, etc [00:50:40] Dereckson: great job with the postmortem -- thanks again [00:51:23] Lemme think about how the wmf-* plus stasis thing would work... If we do nothing, then release branches continue to be cut from the last deployed thing. If we want to deploy, we figure out the current WMF deployed versions and merge to the branches we want, according to which wikis we want to deploy to. [00:51:48] Exactly [00:52:04] I think that gives FR-tech the stuff we need, including master-beta sync. And I *think* it's transparent to other deployers. [00:52:12] Or, if it's a change you can afford to be lazy about, just put it into the latest wmf branch [00:52:18] You're welcome ori. [00:52:21] They make a revert to the deployment branch just like with other extensions, and don't have to know that it's decoupled from master. [00:52:55] Exactly, except when it comes to expecting patches in master to be deployed quickly [00:53:01] For example [00:53:23] hehe, that will be an enjoyable surprise [00:53:31] Say CentralNotice uses Foo from core, and I am killing Foo [00:53:47] Then I would submit patches to all extensions that use Foo, and once they are all merged, I would feel safe merging my core patch that kills Foo [00:53:59] FR-tech is pretty aware of CentralNotice patches, so I think that'll be okay [00:54:08] Then upon deployment, this will likely break, because CN's patch won't be deployed [00:54:11] https://gerrit.wikimedia.org/r/#/admin/projects/gerrit [00:54:23] * awight wags eyebrows. And we have special credit cards for anyone who submits patches. [00:54:25] OK, yeah if you keep a close eye you can mitigate that [00:54:58] That's a situation that's been happening to date, and I don't remember it biting us yet. [00:55:04] Another idea I could suggest is kind of the reverse of Dereckson's suggestion: put stuff you're not sure about yet into a branch that's not master [00:55:18] Yeah, you're right, that's not a new problem [00:55:29] like a develop branch? [00:55:32] yah there's that. We do short-term, but as AndyRussG said we'd like beta to test [00:55:34] Yeah [00:55:50] hmm [00:55:53] Then you could merge scary stuff into that branch first, and stuff that can / needs to go in right away into master directly [00:55:59] But that has issues too, like the two getting out of sync [00:56:03] So, I don't know [00:56:21] AndyRussG: Can you see long-term feature branches working for us? [00:56:25] In our teams we mostly work around this issue by deliberately merging scary things right after the branch is cut [00:56:39] So that they spend a full week on beta labs prior to deployment [00:56:41] yeesh [00:56:49] Well we did in fact have some out-of-sync headaches today with wmf_deploy. Which were probably exaggerated by my paperwork fetish [00:57:13] Also we sometimes have QA to local testing pre-merge for scary patches [00:57:44] But mostly we just live dangerously and if something breaks we count on the stagedness of the deployment to catch it and on our ability to revert/patch quickly enough and cherry-pick [00:58:07] More use of feature branches might be useful. 
For example, I'd like all the de-cookieing code to go out at once [00:58:30] For example, we had a DB error on Special:Notifications that for Reasons didn't happen on mw.org but happened everywhere else. We found it on Wednesday after the train went to group1, and SWATted a fix the same day so group2 never got that breakage [00:58:48] And keeping it all together (several Gerrit changes) might make it easier to control that, though more branches means also more opportunities for out-of-sync-ness [00:59:16] AndyRussG: That does sound nice for the things we need to deploy in phases [00:59:34] However you may have a reduced tolerance for living dangerously because money, as awight so eloquently put it ;) [00:59:46] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2351822 (10BBlack) @ori - thanks for the links! It's good to know it's only one extra purge, I wasn't even sure of that. Technically, even if the delay is high enough, 2 purges can be insufficient to stop all r... [01:00:03] (03PS1) 10EBernhardson: logging: Require acknowledgment of kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) [01:02:24] AndyRussG: Want me to make a task for the make-wmf-branch change to do RoanKattouw's suggestion? Or should we talk more later? [01:02:55] You should also discuss this with the releng people (I think twentyafterfour is the steward of make-wmf-branch but not sure) [01:03:10] awight: let's make a task to study the issue, and also what cwd was talking about for better process around design and WIP patches [01:03:20] Because they run that show, but also because they will have better-informed opinions than me [01:03:22] AndyRussG: that works for me, will do [01:03:24] * AndyRussG represses his inner bureaucrat [01:04:01] RoanKattouw: cool! yeah we should definitely invite releng commentary :) [01:05:21] RoanKattouw: thx also for the really useful insights!! [01:06:11] https://phabricator.wikimedia.org/diffusion/GGER/history/cur-deployed/ [01:06:22] Last patch, July 2014? [01:07:06] Hmm that looks to be out of date then [01:07:15] Hope so! [01:07:17] Maybe that cur-deployed thing is from back when qchris managed it. twentyafterfour manages it now [01:07:32] Either him or ostriches [01:07:40] gwicke: subbu: the parsoid package is now uploaded [01:07:46] Hmm there's a 2.12 branch with a commit from January [01:08:25] !log uploaded parsoid 0.5.0 deb to releases.wm.org [01:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:08:32] https://phabricator.wikimedia.org/diffusion/GGER/history/2.12/ [01:09:41] !log bromine - puppet currently stopped needs some permission fixes for release upload [01:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:12:48] mutante: thank you!
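For reference, a rough sketch of the "manual submodule stuff" mentioned above, i.e. landing a fix on one wmf branch only and bumping core's submodule pointer by hand (hypothetical branch name and SHA; in practice Gerrit's automatic submodule update covers the last step for most extensions):
    # Rough, hypothetical sketch; not a record of an actual deployment.
    cd mediawiki-core                        # a checkout of the core deployment branch
    git checkout wmf/1.28.0-wmf.5
    git submodule update --init extensions/CentralNotice
    cd extensions/CentralNotice
    git fetch origin
    git checkout wmf/1.28.0-wmf.5
    git cherry-pick abc1234                  # the fix as merged on master (placeholder SHA)
    git push origin HEAD:refs/for/wmf/1.28.0-wmf.5   # send it to Gerrit for that branch
    # once the cherry-pick has merged, point core's submodule at the new tip
    git fetch origin
    git checkout origin/wmf/1.28.0-wmf.5
    cd ../..
    git add extensions/CentralNotice
    git commit -m "Bump CentralNotice submodule for 1.28.0-wmf.5"
    git push origin HEAD:refs/for/wmf/1.28.0-wmf.5
    # the previous wmf branch never sees the change, so it ships with wmfN only.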
[01:14:01] gwicke: np, looks like part was my changes but another part was that stuff was owned by "5038" for some reason [01:14:20] definitely need to follow-up with something in puppet [01:28:27] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.881 second response time [01:30:26] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.416 second response time [01:39:07] (03PS3) 10GWicke: WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [01:40:14] (03CR) 10jenkins-bot: [V: 04-1] WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [01:43:26] (03PS4) 10GWicke: WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [01:44:36] (03CR) 10jenkins-bot: [V: 04-1] WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [01:44:58] (03PS2) 10Dzahn: endowment: comment out git:clone until repo exists [puppet] - 10https://gerrit.wikimedia.org/r/292504 (https://phabricator.wikimedia.org/T136793) [01:46:31] (03CR) 10Dzahn: [C: 032] endowment: comment out git:clone until repo exists [puppet] - 10https://gerrit.wikimedia.org/r/292504 (https://phabricator.wikimedia.org/T136793) (owner: 10Dzahn) [01:48:19] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:48:47] 06Operations, 13Patch-For-Review: fix puppet run on bromine - https://phabricator.wikimedia.org/T136793#2351876 (10Dzahn) puppet run is fixed but the following things need to be changed, because they need to be different on bromine vs. carbon 25 Notice: /Stage[main]/Aptrepo/File[/srv/org/wikimedia/reprepro/... [01:49:15] 06Operations, 06Mobile-Apps, 10Traffic, 06Wikipedia-Android-App-Backlog: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2351878 (10Mholloway) [01:49:34] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2351881 (10Dzahn) [01:49:36] 06Operations, 13Patch-For-Review: fix puppet run on bromine - https://phabricator.wikimedia.org/T136793#2351879 (10Dzahn) 05Open>03Resolved 18:53 < icinga-wm> RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:50:42] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2209585 (10Dzahn) T136793#2351876 is needed ! 
carbon is fine but bromine has been affected and the permissions for uploaders are different [01:53:15] (03PS5) 10GWicke: WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [01:54:21] (03CR) 10jenkins-bot: [V: 04-1] WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [01:57:49] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2351898 (10Krenair) Still need the replica database and views created. I'm waiting for ops to run their script for this. [02:00:02] (03PS6) 10GWicke: WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [02:01:25] (03CR) 10jenkins-bot: [V: 04-1] WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [02:02:49] (03PS1) 10: ores: move config file to /etc/ores [puppet] - 10https://gerrit.wikimedia.org/r/292516 [02:05:19] (03CR) 10: "Also it seems that worker nodes in labs doesn't have the config section because they are moved to https://github.com/wikimedia/operations-" [puppet] - 10https://gerrit.wikimedia.org/r/292516 [02:10:41] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2351919 (10Dzahn) You are right, it's fine with www, there it gets an A rating and looks fine to me https://www.ssllabs.com/ssltest/analyze.html?d=www.wikil... [02:14:17] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2351922 (10Dzahn) >>! In T118388#2351919, @Dzahn wrote: > Ideally with and without www would both be on the cert as "SANs" and after looking closer i see t... [02:17:50] (03PS1) 10: ores: fix staging [puppet] - 10https://gerrit.wikimedia.org/r/292517 [02:18:10] (03PS7) 10GWicke: WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [02:19:26] (03CR) 10jenkins-bot: [V: 04-1] WIP: logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [02:20:23] (03PS1) 10Dereckson: User rights configuration for meta. wmf-supportsafety group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) [02:30:41] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 08m 35s) [02:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:36:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jun 3 02:36:39 UTC 2016 (duration 5m 58s) [02:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:47:26] (03CR) 10Alex Monk: User rights configuration for meta. wmf-supportsafety group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: 10Dereckson) [02:48:28] (03CR) 10Alex Monk: "If we're going to create a group called grafana, sure. I'd probably call it grafana-write or grafana-admin or something." 
[puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo) [02:48:36] (03CR) 10Alex Monk: [C: 031] Allow the group of users grafana to connect to the admin interface [puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo) [03:45:04] mutante, thanks ... [04:25:50] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#1732058 (10ssastry) This also affects Parsoid .deb package distribution. When we uploaded version 0.5.0 of the parsoid deb today, I noticed that the 0.4.1 version was removed. For now, this is not a p... [04:34:26] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.028 second response time [04:36:25] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/nginx 0 MB (0% inode=99%) [04:49:45] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.038 second response time [04:49:46] RECOVERY - Disk space on dataset1001 is OK: DISK OK [04:55:02] (03PS1) 10Yuvipanda: toollabs: Open up port 80 for kubebuilder [puppet] - 10https://gerrit.wikimedia.org/r/292521 [04:56:33] (03CR) 10Yuvipanda: [C: 032] toollabs: Open up port 80 for kubebuilder [puppet] - 10https://gerrit.wikimedia.org/r/292521 (owner: 10Yuvipanda) [05:02:55] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2093 MB (3% inode=96%) [05:11:57] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures [05:25:13] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2352082 (10Whatamidoing-WMF) My team [[https://de.wikipedia.org/wiki/Benutzer_Diskussion:Merlissimo#Heads-up:_your_bot_may_break_soon.21 |left a message on wiki... [05:37:43] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [05:44:52] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2352086 (10ori) >>! In T133821#2245711, @BBlack wrote: > However, we reverted this because it seemed to make the race issues worse at the time. How did you know? Do we have a way of tracking how often we hit the... [06:20:53] (03CR) 10Ori.livneh: "nice!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [06:29:03] !log installing nginx security updates on Ubuntu systems (Debian installs updated some days ago) [06:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:29:43] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [06:30:24] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail [06:32:26] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:38] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:58] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:07] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:36:15] (03CR) 10Nikerabbit: ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [06:36:38] (03CR) 10Nikerabbit: "Now there is only hash as expected." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [06:37:07] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:41:47] PROBLEM - DPKG on cp3021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:41:47] PROBLEM - DPKG on cp3013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:41:47] PROBLEM - DPKG on cp3018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:41:47] PROBLEM - DPKG on cp3019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:41:47] PROBLEM - DPKG on cp3017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:41:56] PROBLEM - DPKG on cp3014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:41:57] PROBLEM - DPKG on cp3020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:42:37] PROBLEM - DPKG on cp3015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:42:38] PROBLEM - DPKG on cp3022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:42:47] PROBLEM - DPKG on cp3016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:42:57] PROBLEM - DPKG on cp1043 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:43:06] PROBLEM - DPKG on cp1044 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:43:16] PROBLEM - DPKG on cp3012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:53:17] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Puppet has 2 failures [06:56:26] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:56:47] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:56:47] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:26] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:47] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:58:07] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:25] (03PS3) 
10KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) [07:04:34] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 1 failures [07:06:24] RECOVERY - DPKG on cp3016 is OK: All packages OK [07:14:23] RECOVERY - DPKG on cp3017 is OK: All packages OK [07:19:23] !log Update cxserver to 19a71f1 [07:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:20:34] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:25:34] RECOVERY - DPKG on cp3018 is OK: All packages OK [07:29:53] (03PS6) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [07:31:53] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:38:47] (03PS7) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [07:44:55] (03PS8) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [07:51:13] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2352261 (10Qgil) @Steinsplitter is active in this task and he might have suggestions about next steps. CCing @Bmueller just in case she has other ideas. [07:57:21] (03CR) 10Elukey: "This last version should be good to review. I managed to solve some performance issue:" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:00:47] (03PS9) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [08:02:16] (03PS2) 10Jcrespo: Allow the group of users grafana-admin to edit [puppet] - 10https://gerrit.wikimedia.org/r/292405 [08:08:35] !log installing libxml2 security updates on jessie systems [08:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:39] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: puppet fail [08:34:25] !log rebooting kafka1012 for kernel upgrades. [08:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:35:23] !log created new LDAP group grafana-admin, gid=1007 [08:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:35:51] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2352372 (10Bmueller) @Qgil, thanks for letting me know. @Andrew already emailed me last night. I'm going to talk to some people and we try to figure something out. 
[08:38:54] !log archiving again syslog.1 from ms-be2012 on /srv/swift-storage/sdl1/tmp [08:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:40:10] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: forwarder/legacy-zmq [08:40:37] mmmm just restarted eventlogging for the kafka restart, checking [08:41:47] (03CR) 10Jcrespo: "I've created grafana-admin. It still only has me as a group member." [puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo) [08:42:24] !log rolling restart of scb cluster (mathoid, ores-uwsgi) in eqiad to pick up libxml2 security updates [08:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:44:09] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [08:44:12] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [08:49:45] elukey: it looks like kafka1012 didn't come back after the reboot [08:50:00] ema: yeah I am working on it, fs issues :( [08:50:30] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:50:30] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:50:30] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:50:31] alright, there are a bunch of strongswan alerts on cp* nodes because of that, we'll probably start getting the notifications here soon [08:50:31] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:50:31] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:50:31] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:50:31] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:50:31] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:50:39] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:50:39] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:50:39] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:50:40] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:50:41] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:50:42] ah snap.. 
[08:50:42] yeah, those [08:50:50] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:00] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:01] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:01] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:02] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:02] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:10] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:51:10] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:10] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:10] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:51:11] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:51:11] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:12] need help? 
[08:51:19] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:51:19] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:19] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:19] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:20] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:51:20] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:20] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:21] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:51:21] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:51:30] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:30] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:30] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:31] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:31] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1012_v4,kafka1012_v6 [08:51:31] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:31] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:39] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:40] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:40] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:40] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:40] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:51:40] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:51:41] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:51:50] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1012_v4,kafka1012_v6 [08:51:50] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:50] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:50] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:51:58] jynus: kafka1012 with some boot problems, super mess [08:51:59] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:52:00] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:52:00] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:52:00] PROBLEM - IPsec on cp3046 is 
CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:52:00] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:52:07] I mean, alert mess [08:52:09] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1012_v4,kafka1012_v6 [08:52:10] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:52:10] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:52:10] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [08:52:10] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:52:10] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:52:10] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:52:11] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:52:13] I silenced kafka1012 [08:52:19] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:52:21] it is ok, I mean if you need help with the boot [08:52:29] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1012_v4,kafka1012_v6 [08:52:29] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1012_v4,kafka1012_v6 [08:52:32] ah no no I tried with fsck, disk issue [08:52:37] let's see how it goes [08:52:41] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK [08:52:42] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [08:52:50] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [08:53:00] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [08:53:00] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [08:53:00] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [08:53:00] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [08:53:00] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [08:53:01] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 28 ESP OK [08:53:01] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [08:53:02] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [08:53:02] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [08:53:09] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [08:53:09] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [08:53:10] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [08:53:10] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [08:53:10] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [08:53:10] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [08:53:11] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 36 ESP OK [08:53:19] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [08:53:19] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [08:53:19] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [08:53:19] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [08:53:19] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [08:53:20] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [08:53:20] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [08:53:21] RECOVERY - IPsec on cp4017 is OK: 
Strongswan OK - 44 ESP OK [08:53:21] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [08:53:22] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [08:53:30] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [08:53:30] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [08:53:30] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [08:53:31] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [08:53:31] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [08:53:31] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [08:53:39] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [08:53:39] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [08:53:40] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [08:53:40] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [08:53:40] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [08:53:40] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [08:53:40] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [08:53:41] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [08:53:50] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [08:53:50] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [08:53:50] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [08:53:51] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [08:53:53] elukey: looks like it's back! :) [08:53:59] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [08:53:59] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [08:53:59] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [08:54:00] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [08:54:00] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [08:54:03] (03CR) 10Jcrespo: "Documented on https://wikitech.wikimedia.org/wiki/LDAP_Groups" [puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo) [08:54:09] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [08:54:09] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [08:54:10] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [08:54:10] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [08:54:10] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [08:54:10] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [08:54:15] PROBLEM - Kafka Broker Server on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [08:54:15] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [08:54:15] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK [08:54:19] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [08:54:30] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [08:54:30] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [08:54:30] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [08:54:31] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [08:54:39] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [08:54:39] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [08:54:39] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [08:55:07] I suppose it booted but the process did not start yet? 
[08:55:42] jynus: that's kafka1022, the one with boot issues was kafka 1012 [08:55:48] oh [08:55:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [08:56:30] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [08:56:30] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [08:56:31] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [08:56:39] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [08:56:39] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [08:56:39] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [08:56:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [08:56:51] 06Operations, 06Research-and-Data-Backlog, 10Research-management, 06Revision-Scoring-As-A-Service, and 3 others: [Epic] Deploy Revscoring/ORES service in Prod - https://phabricator.wikimedia.org/T106867#2352395 (10akosiaris) [08:57:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] [08:57:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [08:57:50] yeah kafka1022 shutdown, checking why [08:58:18] "Shut down completely" [08:58:45] code=killed, status=9/KILL [08:59:10] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:59:49] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 4 failures [09:00:14] RECOVERY - Kafka Broker Server on kafka1022 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [09:00:45] I have no idea what happened [09:00:52] hm... 
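(The "code=killed, status=9/KILL" above is systemd's summary of how the broker's main process died; a hedged sketch of how one might dig further on the host, assuming the unit is simply called "kafka", and not the commands actually run here:)
    # Assumes the broker runs under a systemd unit named "kafka"; illustrative only.
    systemctl status kafka                    # shows main PID exit info, e.g. code=killed, status=9/KILL
    journalctl -u kafka --since "1 hour ago"  # unit and broker logs around the kill
    dmesg -T | grep -iE 'kill|oom'            # a bare SIGKILL with no shutdown log often points at the OOM killer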
[09:01:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:01:46] the first thought was "oh no I rebooted also 1022 istead of 1012" [09:02:09] ok all the brokers are up [09:02:15] going to verify in sync replicas [09:02:30] Closing connection due to error during produce request with correlation ion due to error during produce request with correlation id 0 from client id kafka-php with ack=0 c and partition to exceptions: [mediawiki_CirrusSearchRequestSet,6] -> kafka.common.NotLeaderForPartitionExcepti [09:02:59] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [09:03:00] RECOVERY - DPKG on cp1043 is OK: All packages OK [09:03:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:03:34] without knowing anything about kafca, cluster partition leader down, it kills itself [09:03:46] I will be checking the errors [09:03:51] (5xx) [09:04:20] RECOVERY - DPKG on cp1044 is OK: All packages OK [09:05:02] jynus: it shouldn't do anything like that, I am going to double check what I did to verify that wasn't me [09:05:28] but partition leaders are supposed to go down, in sync replicas are there to pick up broken stuff [09:05:49] but this is the first complete cluster restart with kafka 0.9 [09:05:55] so new joy [09:06:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:07:39] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:08:41] one thing that I did on kafka1022 was running kafka-preferred-replica election [09:09:06] to balance the fact that 1012 was down and there were replicas not up to speed [09:09:28] (so no 3 replication factor) [09:09:38] and I think that triggered the shutdown [09:09:45] going to double check [09:10:00] EventLogging is ok and Kafka is not running fine [09:10:12] sorry for the noise [09:12:18] elukey: now or not? :-) [09:12:55] I see some alerts regarding partitions, but not sure if we just have to wait? [09:13:00] moritzm: sometimes my brain doesn't work properly, now :) [09:13:21] seems so [09:13:28] jynus: partitions looks good https://grafana.wikimedia.org/dashboard/db/kafka [09:14:00] I also checked with kafka topics --describe on the brokers [09:14:06] all synced [09:14:07] yes, the alerts have some lag [09:14:41] I see nf_conntrack not happy during the events [09:15:22] jynus: was there any alert on that regard or did you just check yourself? [09:15:30] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [09:15:48] ema, regarding network or what? [09:15:50] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [09:15:55] jynus: nf_conntrack [09:16:01] jynus: we have 512 max tracking so all good, I believe that was traffic flowing to other brokers to balance the cluster [09:16:04] I am just looking at grafana [09:16:38] oh I see :) [09:16:49] not concerned, just looking at things I would not directly expect [09:16:50] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [09:17:08] (unlike lag, under-replicated partitions, etc.) 
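For reference, a hedged sketch of the post-restart checks and the rebalance step mentioned here ("kafka topics --describe" is the wrapper as quoted in the log; the stock tool names are from Kafka 0.9 and the zookeeper connection string is a placeholder):
    # Illustrative only; the zookeeper string below is a placeholder, not the real one.
    ZK='zookeeper.example.org:2181/kafka/eqiad'
    kafka topics --describe | grep -E 'Leader|Isr'        # wrapper on the brokers: per-partition leader and ISR
    kafka-topics.sh --describe --zookeeper "$ZK"          # stock equivalent
    # once every ISR has caught up, hand leadership back to the preferred replicas:
    kafka-preferred-replica-election.sh --zookeeper "$ZK"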
[09:17:51] so the other brokers need to absorb the traffic from the one that is down, and rebalance the cluster [09:18:00] you have some nice graphing there [09:18:10] all credits to ottomata :) [09:18:20] look at the partition leaders [09:18:47] (I know it is them going down) [09:19:03] yeah kafka1022 is not a leader yet, I need to issue a command [09:19:12] but I want to verify that all is ok [09:19:40] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [09:19:45] also really sorry for the page, just seen it :( [09:20:40] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [09:22:05] the 5xx was esams-only, so even if it happened just after that, I would say not related [09:23:22] n00b question: how do I run a salt command on all nodes with a certain role? https://wikitech.wikimedia.org/wiki/Salt#Run_command_on_all_nodes_in_a_puppet_role doesn't seem to work as advertised [09:23:40] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:23:52] ema, you cannot [09:24:02] that doesn't work, was disabled for new installs [09:24:10] oh, OK [09:24:11] do as we did [09:24:35] create a salt grain within the class [09:25:20] https://phabricator.wikimedia.org/rOPUPf3a7799818fea09964bab08cf017b0d6d4792187 [09:26:29] well, you can use the grains defined in debdeploy [09:26:30] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:26:39] BTW, I literally edited that and added: "Note: the following 3 examples using the custom grain "rolename" have been disabled due to install issues: https://gerrit.wikimedia.org/r/123834 " [09:26:42] they are based on roles [09:27:01] e.g. if you want to address all MXes: [09:27:15] salt -G debdeploy-mx:standard test.ping [09:27:46] all entries are in /etc/debdeploy.conf on neodymium [09:27:53] jynus: oh yes!
I clicked on the ToC link though, so your note was not visible [09:29:34] alternativelly, you can fix that and make it work somehow ;-) [09:30:38] we did it like this becase we still needed more fine-grained (pun intended) classification [09:30:44] :) [09:31:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] lttoolbox: New upstream version (032 comments) [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [09:32:12] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA request for @WMDE-leszek - https://phabricator.wikimedia.org/T133145#2352464 (10jcrespo) p:05Triage>03Normal [09:36:03] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2352477 (10jcrespo) p:05Triage>03Normal a:03jcrespo [09:37:54] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA request for @thiemowmde - https://phabricator.wikimedia.org/T135994#2352481 (10jcrespo) p:05Triage>03Normal a:03jcrespo [09:41:25] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Halfak - https://phabricator.wikimedia.org/T136612#2352488 (10jcrespo) p:05Triage>03Normal [09:43:12] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2352491 (10jcrespo) p:05Triage>03Normal a:03jcrespo [09:46:39] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:33] (03PS1) 10Jcrespo: Add Pcoombe to analytics-privatedata-users posix group [puppet] - 10https://gerrit.wikimedia.org/r/292534 (https://phabricator.wikimedia.org/T136343) [09:50:10] (03CR) 10Jcrespo: [C: 032] Add Pcoombe to analytics-privatedata-users posix group [puppet] - 10https://gerrit.wikimedia.org/r/292534 (https://phabricator.wikimedia.org/T136343) (owner: 10Jcrespo) [09:52:40] (03PS1) 10Reedy: Disable updatequerypage for wikitech running on non silver host [puppet] - 10https://gerrit.wikimedia.org/r/292537 (https://phabricator.wikimedia.org/T136926) [09:54:17] (03CR) 10jenkins-bot: [V: 04-1] Disable updatequerypage for wikitech running on non silver host [puppet] - 10https://gerrit.wikimedia.org/r/292537 (https://phabricator.wikimedia.org/T136926) (owner: 10Reedy) [09:54:59] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2352516 (10jcrespo) Everything seems ok from my side, please try access, resolve if it works for you: ``` Notice: /Stage[main]/Admin/Admin::Hashuser[pcoombe]/Admin::U... 
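To make the grain-based targeting from the salt discussion above concrete: since the rolename examples on the wiki are disabled, the debdeploy grains are what you match on instead. A minimal sketch, reusing the MX example quoted above; the real grain names and values live in /etc/debdeploy.conf on the salt master.
  # ping every host carrying the debdeploy-mx grain with value "standard"
  salt -G 'debdeploy-mx:standard' test.ping
  # run an arbitrary command on the same set of hosts
  salt -G 'debdeploy-mx:standard' cmd.run 'uptime'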
[09:56:05] (03CR) 10Alexandros Kosiaris: [C: 032] ores: fix staging [puppet] - 10https://gerrit.wikimedia.org/r/292517 [09:58:05] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2352520 (10jcrespo) p:05Triage>03Normal [10:01:52] (03CR) 10Jcrespo: [C: 031] pybal: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291183 (owner: 10BryanDavis) [10:06:42] (03PS2) 10Alexandros Kosiaris: ores: fix staging [puppet] - 10https://gerrit.wikimedia.org/r/292517 [10:07:16] (03CR) 10Alexandros Kosiaris: [V: 032] ores: fix staging [puppet] - 10https://gerrit.wikimedia.org/r/292517 [10:19:49] (03PS1) 10Gehel: Update configuration for elasticsearch 2.3 upgrade on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/292541 [10:22:04] (03CR) 10DCausse: [C: 031] Update configuration for elasticsearch 2.3 upgrade on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/292541 (owner: 10Gehel) [10:23:11] !log restarting apache on planet1001 (serving planet.wikimedia.org) for libxml2 security update [10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:29:57] (03PS2) 10Gehel: Update configuration for elasticsearch 2.3 upgrade on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/292541 [10:30:38] (03PS1) 10Alexandros Kosiaris: Introduce ores.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/292542 (https://phabricator.wikimedia.org/T124203) [10:30:43] (03PS3) 10Gehel: Update configuration for elasticsearch 2.3 upgrade on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/292541 (https://phabricator.wikimedia.org/T133126) [10:30:53] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2352598 (10MartinRulsch) >>! In T136557#2342785, @BBlack wrote: > Unclear from the description: Is... 
[10:35:34] !log restarting apache on bohrium (serving piwik.wikimedia.org) for libxml2 security update [10:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:44] (03CR) 10Gehel: "puppet compiler: https://puppet-compiler.wmflabs.org/3045/" [puppet] - 10https://gerrit.wikimedia.org/r/292541 (https://phabricator.wikimedia.org/T133126) (owner: 10Gehel) [10:39:12] (03CR) 10DCausse: [C: 031] Update configuration for elasticsearch 2.3 upgrade on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/292541 (https://phabricator.wikimedia.org/T133126) (owner: 10Gehel) [10:39:30] !log Starting upgrade of elasticsearch eqiad cluster to 2.3 (T133126) [10:39:31] T133126: Upgrade eqiad data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133126 [10:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:39:44] (03CR) 10Gehel: [C: 032] Update configuration for elasticsearch 2.3 upgrade on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/292541 (https://phabricator.wikimedia.org/T133126) (owner: 10Gehel) [10:41:58] PROBLEM - zuul_merger_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [10:42:09] ^^ that is me [10:42:26] restarted [10:44:07] RECOVERY - zuul_merger_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [10:45:05] (03PS2) 10Alexandros Kosiaris: Introduce ores.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/292542 (https://phabricator.wikimedia.org/T124203) [10:48:15] !log taking elasticsearch eqiad cluster down for upgrade to 2.3 (T133126) [10:48:16] T133126: Upgrade eqiad data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133126 [10:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:39] 06Operations, 13Patch-For-Review: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2352638 (10akosiaris) 05Open>03Resolved Resolved for a long time now. [10:53:27] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1024.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1009.eqiad.wmnet because of too many down! [10:53:47] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1030.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1021.eqiad.wmnet because of too many down! [10:53:47] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1016.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1021.eqiad.wmnet because of too many down! [10:53:55] ^ it's ok, eqiad being upgraded [10:53:56] ^ me again, sorry again... [10:54:02] I was about to ask [10:54:06] ok thanks [10:54:13] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 341 bytes in 0.075 second response time [10:54:21] that's gonna page [10:54:28] hey [10:54:59] a bit more care guys with those things please... [10:55:15] is search traffic drained off eqiad? 
[10:55:31] yes, all searches are sent to codfw [10:55:48] ACKNOWLEDGEMENT - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1030.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1021.eqiad.wmnet because of too many down! Gehel elasticsearch upgrade in progress [10:55:49] ACKNOWLEDGEMENT - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1024.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1009.eqiad.wmnet because of too many down! Gehel elasticsearch upgrade in progress [10:55:49] ACKNOWLEDGEMENT - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1016.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1021.eqiad.wmnet because of too many down! Gehel elasticsearch upgrade in progress [10:55:54] ACKNOWLEDGEMENT - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 341 bytes in 0.075 second response time Gehel elasticsearch upgrade in progress [10:56:30] (03PS1) 10Alexandros Kosiaris: ores: Add varnish backend in the misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/292543 (https://phabricator.wikimedia.org/T124203) [10:58:32] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.065 second response time [10:59:48] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [11:00:08] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [11:00:08] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [11:05:10] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2352653 (10Pcoombe) 05Open>03Resolved Working for me. Thanks! [11:08:04] 06Operations, 10DBA, 13Patch-For-Review: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2352666 (10jcrespo) [11:08:06] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2352664 (10jcrespo) 05Open>03Resolved All installed, work now and boot by default to disk. 
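During a rolling upgrade like the Elasticsearch 2.3 one above, the usual way to watch the cluster come back is its health API. A sketch, assuming the stock HTTP API on port 9200 queried from one of the elastic hosts; the exact checks used during this upgrade are not shown in the log.
  # overall status (red/yellow/green), node count and unassigned shards
  curl -s 'http://localhost:9200/_cluster/health?pretty'
  # per-node view, useful for confirming each restarted node rejoined
  curl -s 'http://localhost:9200/_cat/nodes?v'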
[11:18:30] !log freezing writes from CirrusSearch to eqiad clsuter during upgrade (T133126) [11:18:31] T133126: Upgrade eqiad data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133126 [11:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:36] 06Operations, 10ops-eqiad, 10Analytics: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2352719 (10elukey) [11:21:49] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: puppet fail [11:26:39] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [11:27:17] (03PS1) 10Jcrespo: Add extra new core database server to s2 shard [puppet] - 10https://gerrit.wikimedia.org/r/292545 (https://phabricator.wikimedia.org/T133398) [11:28:32] (03CR) 10Jcrespo: [C: 032] Add extra new core database server to s2 shard [puppet] - 10https://gerrit.wikimedia.org/r/292545 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [11:30:50] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] Gehel upgrade in progress, so cluster not yet fully up to speed yet. Ill keep an eye on it. [11:36:08] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [11:46:36] (03PS1) 10Jcrespo: Add 3 new s4 shard core db servers [puppet] - 10https://gerrit.wikimedia.org/r/292546 (https://phabricator.wikimedia.org/T133398) [11:47:53] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:12:34] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.791 second response time [12:15:55] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 681 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5747229 keys - replication_delay is 681 [12:20:43] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.149 second response time [12:21:28] 06Operations, 10ops-codfw: lvs2006 degraded RAID - https://phabricator.wikimedia.org/T136584#2352827 (10Volans) 05Open>03Resolved [12:32:33] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 690 600 - REDIS on 10.192.48.44:6480 has 1 databases (db0) with 4105453 keys - replication_delay is 690 [12:34:55] 06Operations, 06Repository-Admins, 13Patch-For-Review: request Gerrit project/repo wikimedia/endowment - https://phabricator.wikimedia.org/T136736#2352851 (10Aklapper) [12:36:28] (03PS2) 10Dereckson: User rights configuration for meta. wmf-supportsafety group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) [12:38:11] (03CR) 10Dereckson: User rights configuration for meta. 
wmf-supportsafety group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: 10Dereckson) [12:40:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] Adding Icinga checks for Maps (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: 10Gehel) [12:44:12] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2352863 (10BBlack) Note T121279 is more-specific to the Merlbot/Java issues and has some recent traffic at the bottom too. [12:54:44] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS on 10.192.48.44:6481 has 1 databases (db0) with 3528957 keys - replication_delay is 626 [12:59:24] PROBLEM - Redis status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS on 10.192.0.120:6379 has 1 databases (db0) with 4704872 keys - replication_delay is 650 [13:11:01] PROBLEM - Redis status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 632 600 - REDIS on 10.192.0.120:6380 has 1 databases (db0) with 1775 keys - replication_delay is 632 [13:11:33] !log rebooting kafka200[12] (codfw EventBus) for kernel upgrades [13:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:00] !log elukey@palladium conftool action : set/pooled=no; selector: kafka2002.codfw.wmnet [13:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:11] !log Reenabling puppet on gallium. Forgot to put it back yesterday [13:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:50] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet last ran 1 day ago [13:18:41] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:19:01] PROBLEM - Redis status tcp_6379 on rdb1002 is CRITICAL: CRITICAL: replication_delay is 655 600 - REDIS on 10.64.32.77:6379 has 1 databases (db0) with 10487920 keys - replication_delay is 655 [13:19:11] PROBLEM - Redis status tcp_6381 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 713 600 - REDIS on 10.192.0.120:6381 has 1 databases (db0) with 1214473 keys - replication_delay is 713 [13:22:21] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka2002.codfw.wmnet [13:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:40] !log elukey@palladium conftool action : set/pooled=no; selector: kafka2001.codfw.wmnet [13:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:10] PROBLEM - Redis status tcp_6381 on rdb1001 is CRITICAL: CRITICAL ERROR - Can not connect to 10.64.32.76 on port 6381 [13:25:00] RECOVERY - Redis status tcp_6381 on rdb1001 is OK: OK: REDIS on 10.64.32.76:6381 has 1 databases (db0) with 5784819 keys [13:27:22] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka2001.codfw.wmnet [13:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:52] PROBLEM - Redis status tcp_6381 on rdb1002 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS on 10.64.32.77:6381 has 1 databases (db0) with 5784936 keys - replication_delay is 616 [13:32:11] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2001 is CRITICAL: 
/v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201) [13:32:40] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1666 bytes in 0.626 second response time [13:33:57] today is not a good day [13:34:44] I tried the usual /v1/topics call from palladium and it was working [13:34:51] now a 500 mmmm [13:35:28] 06Operations: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#2333471 (10jcrespo) cp3030-3039 are, according to racktables on OE10 (not 14). On OE13 there is ssl3001-ssl3004 and cp3001-cp3002 on the same hight. are those not there, or do you mean others? [13:35:39] I am not sure about the wikidata.org lag - afaik it should be completely separated from eventbus codfw [13:36:11] PROBLEM - Redis status tcp_6380 on rdb1002 is CRITICAL: CRITICAL: replication_delay is 661 600 - REDIS on 10.64.32.77:6380 has 1 databases (db0) with 5781095 keys - replication_delay is 661 [13:36:20] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2001 is OK: All endpoints are healthy [13:36:41] yeah yeah thanks, but why you died a minute ago :/ [13:37:10] (03PS2) 10Hashar: zuul: enhance logging [puppet] - 10https://gerrit.wikimedia.org/r/291913 [13:37:54] (03CR) 10Paladox: [C: 031] zuul: enhance logging [puppet] - 10https://gerrit.wikimedia.org/r/291913 (owner: 10Hashar) [13:38:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1678 bytes in 0.425 second response time [13:39:12] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201) [13:40:30] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:40:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:41:00] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures [13:41:21] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy [13:42:22] (03PS3) 10Hashar: zuul: enhance logging [puppet] - 10https://gerrit.wikimedia.org/r/291913 [13:43:59] (03CR) 10Jcrespo: [C: 032] Add 3 new s4 shard core db servers [puppet] - 10https://gerrit.wikimedia.org/r/292546 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [13:44:22] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:44:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:46:42] so the issue with eventbus seems to be: [KafkaApi-2002] Produce request with correlation id 2 from client kafka-python on partition [codfw.test.eve [13:46:45] nt,0] failed due to Leader not local for partition [codfw.test.event,0] on broker 2002 [13:48:32] PROBLEM - Redis status tcp_6379 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 694 600 - REDIS on 10.192.48.44:6379 has 1 databases (db0) with 2396722 keys - replication_delay is 694 [13:49:51] PROBLEM - Redis status tcp_6381 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS on 10.192.48.44:6381 has 1 databases (db0) with 5777185 keys - replication_delay is 631 [13:52:35] 
(03PS4) 10Hashar: zuul: enhance logging [puppet] - 10https://gerrit.wikimedia.org/r/291913 [13:52:41] (03PS1) 10Rush: assign labs::nfs::secondary to labstore100[45] [puppet] - 10https://gerrit.wikimedia.org/r/292558 (https://phabricator.wikimedia.org/T126083) [13:54:53] (03CR) 10Rush: [C: 032] assign labs::nfs::secondary to labstore100[45] [puppet] - 10https://gerrit.wikimedia.org/r/292558 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [13:55:20] jynus: caught this change for you on palladium "Add 3 new s4 shard core db servers" [13:55:34] seems pretty minimal [13:55:42] goign to merge it w/ mine [13:57:25] please do [13:57:29] got distracted, as usual [13:58:14] it has only one gotcha, which is it can page if I am not quick [13:58:27] (03PS1) 10Rush: Match class assignment to file for labstore::fileserver::secondary [puppet] - 10https://gerrit.wikimedia.org/r/292559 (https://phabricator.wikimedia.org/T126083) [13:58:32] jynus: done [14:00:51] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1667 bytes in 0.591 second response time [14:04:40] (03PS1) 10Muehlenhoff: Add debdeploy salt grains for druid and wire them up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/292560 [14:06:36] (03CR) 10Rush: [C: 032] Match class assignment to file for labstore::fileserver::secondary [puppet] - 10https://gerrit.wikimedia.org/r/292559 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [14:06:41] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: puppet fail [14:07:21] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:08:31] PROBLEM - Redis status tcp_6379 on rdb1008 is CRITICAL: CRITICAL: replication_delay is 822 600 - REDIS on 10.64.32.19:6379 has 1 databases (db0) with 5776981 keys - replication_delay is 822 [14:09:55] (03PS2) 10Muehlenhoff: Add debdeploy salt grains for druid and wire them up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/292560 [14:11:16] 06Operations, 06Editing-Department, 06Parsing-Team, 06Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2353018 (10mobrovac) [14:11:46] (03PS1) 10Rush: labstore secondary role cleanup and remove 00[12] specifics [puppet] - 10https://gerrit.wikimedia.org/r/292561 (https://phabricator.wikimedia.org/T126083) [14:13:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add debdeploy salt grains for druid and wire them up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/292560 (owner: 10Muehlenhoff) [14:13:50] (03PS2) 10Rush: labstore secondary role cleanup and remove 00[12] specifics [puppet] - 10https://gerrit.wikimedia.org/r/292561 (https://phabricator.wikimedia.org/T126083) [14:15:11] PROBLEM - Redis status tcp_6381 on rdb1008 is CRITICAL: CRITICAL: replication_delay is 723 600 - REDIS on 10.64.32.19:6381 has 1 databases (db0) with 5776247 keys - replication_delay is 723 [14:16:59] (03PS1) 10Jcrespo: Add new s5 database core servers [puppet] - 10https://gerrit.wikimedia.org/r/292562 (https://phabricator.wikimedia.org/T133398) [14:17:41] (03CR) 10Rush: [C: 032] labstore secondary role cleanup and remove 00[12] specifics [puppet] - 10https://gerrit.wikimedia.org/r/292561 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [14:17:54] (03PS2) 10Jcrespo: Add new s5 database core servers [puppet] - 10https://gerrit.wikimedia.org/r/292562 (https://phabricator.wikimedia.org/T133398) [14:20:00] 
PROBLEM - Redis status tcp_6380 on rdb2004 is CRITICAL: CRITICAL: replication_delay is 701 600 - REDIS on 10.192.16.123:6380 has 1 databases (db0) with 10482819 keys - replication_delay is 701 [14:20:12] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:21:50] (03CR) 10Jcrespo: [C: 032] Add new s5 database core servers [puppet] - 10https://gerrit.wikimedia.org/r/292562 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [14:22:09] (03CR) 10Jcrespo: [V: 032] Add new s5 database core servers [puppet] - 10https://gerrit.wikimedia.org/r/292562 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [14:22:51] PROBLEM - Redis status tcp_6379 on rdb2004 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS on 10.192.16.123:6379 has 1 databases (db0) with 10479389 keys - replication_delay is 623 [14:25:53] (03PS5) 10Hashar: zuul: enhance logging [puppet] - 10https://gerrit.wikimedia.org/r/291913 [14:29:16] 06Operations, 10Traffic: Set up LVS connection sync - https://phabricator.wikimedia.org/T136944#2353061 (10BBlack) [14:29:35] 06Operations, 10Traffic: Set up LVS connection sync - https://phabricator.wikimedia.org/T136944#2353074 (10BBlack) [14:32:28] PROBLEM - MD RAID on rdb1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:08] PROBLEM - Redis status tcp_6380 on rdb1008 is CRITICAL: CRITICAL: replication_delay is 722 600 - REDIS on 10.64.32.19:6380 has 1 databases (db0) with 5782196 keys - replication_delay is 722 [14:34:28] RECOVERY - MD RAID on rdb1007 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:36:31] (03PS1) 10Muehlenhoff: Add firejail wrapper for ffmpeg in video scalers [puppet] - 10https://gerrit.wikimedia.org/r/292563 [14:36:43] (03PS1) 10Jcrespo: Add new s6 core database servers [puppet] - 10https://gerrit.wikimedia.org/r/292564 (https://phabricator.wikimedia.org/T133398) [14:36:49] PROBLEM - Redis status tcp_6380 on rdb1001 is CRITICAL: CRITICAL ERROR - Can not connect to 10.64.32.76 on port 6380 [14:38:05] (03CR) 10Jcrespo: [C: 032 V: 032] Add new s6 core database servers [puppet] - 10https://gerrit.wikimedia.org/r/292564 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [14:38:22] !log un-freezing writes from CirrusSearch to eqiad cluster during upgrade (T133126) [14:38:24] T133126: Upgrade eqiad data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133126 [14:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:49] RECOVERY - Redis status tcp_6380 on rdb1001 is OK: OK: REDIS on 10.64.32.76:6380 has 1 databases (db0) with 5781760 keys [14:47:19] PROBLEM - Redis status tcp_6379 on rdb1006 is CRITICAL: CRITICAL: replication_delay is 691 600 - REDIS on 10.64.48.55:6379 has 1 databases (db0) with 5789951 keys - replication_delay is 691 [14:47:40] (03PS2) 10Muehlenhoff: Add firejail wrapper for ffmpeg in video scalers [puppet] - 10https://gerrit.wikimedia.org/r/292563 [14:47:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add firejail wrapper for ffmpeg in video scalers [puppet] - 10https://gerrit.wikimedia.org/r/292563 (owner: 10Muehlenhoff) [14:49:20] RECOVERY - Redis status tcp_6379 on rdb1006 is OK: OK: REDIS on 10.64.48.55:6379 has 1 databases (db0) with 5676512 keys - replication_delay is 0 [14:50:32] Hi, quick question (hopefully -operations is not too bad) how does Wikimedia's Varnish installation handle changes to MediaWiki:Common.css (and similar pages)? 
You cache load.php stuff, but because of the URL pattern of load.php there would be no way to really purge such resources. is the current way just to edit MediaWiki:common.css, and then wait until [14:50:32] the load.php cache expires? [14:51:20] RECOVERY - Redis status tcp_6380 on rdb1002 is OK: OK: REDIS on 10.64.32.77:6380 has 1 databases (db0) with 5667821 keys - replication_delay is 0 [14:51:49] RECOVERY - Redis status tcp_6381 on rdb1008 is OK: OK: REDIS on 10.64.32.19:6381 has 1 databases (db0) with 5663521 keys - replication_delay is 0 [14:51:55] SPF|Cloud: the load.php requires has &version= appended, with the value being some kind of hash of the contents of MediaWiki:Common.css (and other such pages) [14:52:33] so it doesn't need purging. clients will just request a new URL [14:52:49] RECOVERY - Redis status tcp_6379 on rdb1008 is OK: OK: REDIS on 10.64.32.19:6379 has 1 databases (db0) with 5663870 keys - replication_delay is 0 [14:52:50] RECOVERY - Redis status tcp_6381 on rdb1002 is OK: OK: REDIS on 10.64.32.77:6381 has 1 databases (db0) with 5668235 keys - replication_delay is 0 [14:53:31] RECOVERY - Redis status tcp_6380 on rdb1008 is OK: OK: REDIS on 10.64.32.19:6380 has 1 databases (db0) with 5669565 keys - replication_delay is 0 [14:53:53] (03PS2) 10Jcrespo: MariaDB: Add new coredb servers [puppet] - 10https://gerrit.wikimedia.org/r/292197 (https://phabricator.wikimedia.org/T133398) (owner: 10Volans) [14:54:20] RECOVERY - Redis status tcp_6379 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6379 has 1 databases (db0) with 5676015 keys - replication_delay is 0 [14:54:30] RECOVERY - Redis status tcp_6379 on rdb1002 is OK: OK: REDIS on 10.64.32.77:6379 has 1 databases (db0) with 10373325 keys - replication_delay is 0 [14:54:59] RECOVERY - Redis status tcp_6381 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6381 has 1 databases (db0) with 5663448 keys - replication_delay is 0 [14:55:18] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2353107 (10demon) [14:55:20] 06Operations, 06Repository-Admins, 13Patch-For-Review: request Gerrit project/repo wikimedia/endowment - https://phabricator.wikimedia.org/T136736#2353105 (10demon) 05Open>03Resolved Has already been created: https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/endowment [14:56:10] RECOVERY - Redis status tcp_6379 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6379 has 1 databases (db0) with 10371529 keys - replication_delay is 0 [14:56:12] (03CR) 10Jcrespo: [C: 032] MariaDB: Add new coredb servers [puppet] - 10https://gerrit.wikimedia.org/r/292197 (https://phabricator.wikimedia.org/T133398) (owner: 10Volans) [14:56:20] RECOVERY - Redis status tcp_6380 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6380 has 1 databases (db0) with 10375220 keys - replication_delay is 0 [14:57:56] MatmaRex: For testing I created MediaWiki:Common.css with #p-logo { display: none; }. MediaWiki noticed that and added the second stylesheet /w/load.php?debug=false&lang=nl&modules=site&only=styles&skin=vector, so yeah the logo was gone now. But when changing the page again I'm now stuck at the Varnish cache level [14:58:44] I only see some &version= params for load.php requests for javascript, not for css. is it perhaps something that changed in 1.27 or 1.28 (I'm running latest 1.26 version) [14:59:22] oh, hmm. maybe we don't do that for styles. 
i'm not sure then [14:59:49] RECOVERY - Redis status tcp_6381 on rdb2002 is OK: OK: REDIS on 10.192.0.120:6381 has 1 databases (db0) with 5667208 keys - replication_delay is 0 [15:00:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1667 bytes in 0.192 second response time [15:01:21] I think I just need to wait for https://phabricator.wikimedia.org/T96797 ? [15:01:56] RECOVERY - Redis status tcp_6380 on rdb2002 is OK: OK: REDIS on 10.192.0.120:6380 has 1 databases (db0) with 5667066 keys - replication_delay is 0 [15:04:45] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6481 has 1 databases (db0) with 5661733 keys - replication_delay is 0 [15:04:46] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail [15:07:16] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: puppet fail [15:07:45] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6480 has 1 databases (db0) with 5667824 keys - replication_delay is 0 [15:08:15] RECOVERY - Redis status tcp_6379 on rdb2002 is OK: OK: REDIS on 10.192.0.120:6379 has 1 databases (db0) with 10371431 keys - replication_delay is 0 [15:10:47] (03PS1) 10Elukey: Remove old and redundant AQS specific alarms. [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) [15:12:35] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5661190 keys - replication_delay is 0 [15:17:26] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:17:56] (03PS1) 10Gergő Tisza: Revert "Workaround for T136644" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292572 (https://phabricator.wikimedia.org/T136929) [15:18:40] (03PS1) 10Mobrovac: Change Prop: Use the URIs for MW and RB from service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/292573 [15:18:46] (03CR) 10Gergő Tisza: [C: 032] Revert "Workaround for T136644" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292572 (https://phabricator.wikimedia.org/T136929) (owner: 10Gergő Tisza) [15:19:35] thanks tgr [15:19:46] (03Merged) 10jenkins-bot: Revert "Workaround for T136644" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292572 (https://phabricator.wikimedia.org/T136929) (owner: 10Gergő Tisza) [15:20:42] tgr: we were just discussing this, want me around for deploy, or are you doing already? [15:20:58] thcipriani: doing it [15:21:08] thank you! 
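To make the &version= behaviour discussed above concrete: modules served through load.php carry a version hash derived from their content, so an edit to MediaWiki:Common.css yields a new URL and nothing needs purging from Varnish. A sketch only; the hostname and hash value are placeholders, and per the discussion MediaWiki 1.26 did not yet append the hash for site styles.
  # URL shape from the discussion, without a version hash (served from cache until it expires)
  curl -sI 'https://example.wikipedia.org/w/load.php?debug=false&lang=nl&modules=site&only=styles&skin=vector'
  # the same module with a content-derived version appended; a CSS edit changes the hash, so clients fetch a fresh URL
  curl -sI 'https://example.wikipedia.org/w/load.php?debug=false&lang=nl&modules=site&only=styles&skin=vector&version=0123abc'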
[15:22:49] (03CR) 10Mobrovac: "PCC confirms it's a no-op in prod: https://puppet-compiler.wmflabs.org/3047/" [puppet] - 10https://gerrit.wikimedia.org/r/292573 (owner: 10Mobrovac) [15:23:05] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:24:07] !log tgr@tin Started scap: revert AbuseFilter + config to pre-extension-registration state T136929 [15:24:08] T136929: DB error trying to set block option in abusefilter - https://phabricator.wikimedia.org/T136929 [15:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:20] !log tgr@tin Finished scap: revert AbuseFilter + config to pre-extension-registration state T136929 (duration: 06m 13s) [15:30:22] T136929: DB error trying to set block option in abusefilter - https://phabricator.wikimedia.org/T136929 [15:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:12] mutante, i need to fix the deb pkg since I created a broken version .. I forgot about some version string changes I had to make in a few places. can i run the upload script myself once I do that, or do I need you to do that? [15:36:12] (03PS2) 10Gehel: Revert "Increase the number of workers for osm2pgsql." [puppet] - 10https://gerrit.wikimedia.org/r/292394 (owner: 10MaxSem) [15:37:38] (03CR) 10Gehel: [C: 032] Revert "Increase the number of workers for osm2pgsql." [puppet] - 10https://gerrit.wikimedia.org/r/292394 (owner: 10MaxSem) [15:48:38] could a root remove "/tmp/parsoid_0.5.0allubuntu1_amd64.bromine.eqiad.wmnet.upload" on tin? i want to try to upload an updated package. [16:08:52] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [106250000.0] [16:28:25] (03PS1) 10Alexandros Kosiaris: Update openldap module's README [puppet] - 10https://gerrit.wikimedia.org/r/292585 [16:36:12] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.79, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:36:32] PROBLEM - Restbase root url on restbase1012 is CRITICAL: Connection refused [16:39:47] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2353456 (10Papaul) @Rob adding 4 more servers in rack B4 mw2239 row B rack B4 ge-4/0/0 mw2240 row B rack B4 ge-4/0/1 mw2241 row B rack B4 ge-4/0/2 mw2242 row B rack B4 g... [16:41:33] looking ^ [16:41:44] mobrovac: beat me to it [16:42:03] mobrovac: another logged shutdown? [16:42:23] (code=exited, status=0/SUCCESS) [16:42:24] wth? [16:42:55] mobrovac: ¯\_(ツ)_/¯ [16:43:03] heh [16:44:04] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2353461 (10BBlack) I'm not so much asking about the total timeline, but about whether the intent i... 
[16:46:02] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [16:46:22] RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.011 second response time [16:47:03] (03PS2) 10Dzahn: add "lint:ignore"s for several "puppet URL without modules" [puppet] - 10https://gerrit.wikimedia.org/r/291122 [16:47:51] (03PS3) 10Dzahn: add "lint:ignore"s for several "puppet URL without modules" [puppet] - 10https://gerrit.wikimedia.org/r/291122 [16:50:52] (03PS4) 10Dzahn: add "lint:ignore"s for several "puppet URL without modules" [puppet] - 10https://gerrit.wikimedia.org/r/291122 (https://phabricator.wikimedia.org/T93645) [16:51:37] urandom: hi! Tried to look into the Phab task but didn't come up with anything useful :)0 [16:51:49] (03CR) 10Dzahn: [C: 032] "(special) comments only" [puppet] - 10https://gerrit.wikimedia.org/r/291122 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [16:51:53] (the 0 is a typo) [16:52:03] 06Operations, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2353511 (10mobrovac) [16:52:36] (03CR) 10Ottomata: Extend the %{format}t timestamp formatter with (begin|end): prefixes (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [16:52:37] elukey: no worries, i've narrowed it down to the new mmap decompression reads [16:53:08] I saw that, really interesting [16:53:20] elukey: still need to figure out *what* exactly is so wrong there, but it helps to know where to look (and to have a work-around) [16:58:02] !log magnesium - shutdown -h now, bye [16:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:51] (03PS2) 10Dzahn: remove magnesium's public IP [dns] - 10https://gerrit.wikimedia.org/r/292474 (https://phabricator.wikimedia.org/T123713) [17:01:22] (03PS1) 10Dzahn: ssl: delete rt.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/292589 (https://phabricator.wikimedia.org/T119112) [17:02:03] (03PS2) 10Dzahn: ssl: delete rt.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/292589 (https://phabricator.wikimedia.org/T119112) [17:05:41] (03CR) 10Dzahn: [C: 032] remove magnesium's public IP [dns] - 10https://gerrit.wikimedia.org/r/292474 (https://phabricator.wikimedia.org/T123713) (owner: 10Dzahn) [17:08:40] (03PS4) 10Dzahn: varnish: mv wikimedia_vcl, netmapper_upd to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290875 [17:09:51] mutante, i need to fix the deb pkg since I created a broken version .. I have a new deb pkg uploaded to /tmp on tin. can i run the upload script myself, or do you have to do that? [17:11:37] subbu: easiest right now is if i do it [17:12:10] ok .. thanks. the new deb files are in /tmp/ .. whenever you get a chance. [17:12:10] subbu: it's on my todo for today to fix it [17:12:21] k [17:12:40] subbu: ok, just a few minutes, will let you know [17:14:54] !log bast4001 coming down for second hdd installation. 
(there are currently no active users on system) [17:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:23] robh: you'll have to reinstall too, btw :( [17:20:18] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: 10Dereckson) [17:22:24] subbu: so i worked around the problems we know but there is a new issue, your new package has different content now but the same version so it doesn't like that [17:22:45] deb" is already registered with different checksums [17:23:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 688 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5674094 keys - replication_delay is 688 [17:23:42] could you change the minor version? [17:23:47] ok. [17:24:20] separately, i have a qn. about removing a deb pkg .. later on, not right now ... i might want to get rid of the 0.4.0 that is there right now. [17:27:13] paravoid: yep, and it'll be over the link [17:27:28] so slow tftp [17:27:35] not that slow [17:27:37] but ulsfo to dallas will be fast enough [17:27:38] but it won't work that easily [17:27:38] true [17:27:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5655214 keys - replication_delay is 0 [17:27:48] we filter on the routers now? [17:28:02] in the past it was just change the install_server dhcp config [17:28:10] no [17:28:18] and you don't need to change it, I've fixed that permanently now [17:28:26] mutante i am also going to name it parsoid_0.5.1all_all.deb (removed the ubuntu string from it). [17:28:27] (you can define next-servers in host { } stanzas too) [17:28:28] but [17:28:29] https://phabricator.wikimedia.org/T123674#2237815 [17:28:40] subbu: i am not sure myself yet what is the proper way. paravoid, would we delete old packages from the releases repo? and if there is a broken release like that, always up the version to replace it? [17:28:48] subbu: ok [17:28:54] paravoid: ouch! [17:29:12] so once it comes back up and i confirm the second hdd is there, wanna go ahead and do that again on the routers and i'll reinstall it right away? [17:29:15] robh: cool @ bast4001 though! [17:29:24] robh: asw, but yeah, can do [17:29:37] i mean i can do it since you put in the commands, is it just do that via one commit [17:29:40] and then roll it back/ [17:29:49] should be, yes :) [17:29:56] but since its just me pasting things i rather you be around! cool [17:29:59] mutante: what do you mean? a newer version replaces the older one [17:30:04] i'll check it out right now [17:30:15] robh: don't forget to change the partman profile to raid, obviously! [17:30:25] yep =] [17:30:32] so no need to change the dhcp it just falls through to the next/ [17:30:33] ? [17:30:43] sorry, sitting on floor, my typing is suffering [17:31:06] or have to do that via the host stanza? 
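For the version bump subbu agrees to in the packaging thread above, the usual sequence is to add a debian/changelog entry with the new version and target suite and rebuild. A sketch under stated assumptions: the directory name is a placeholder, and the log does not show how the package was originally built, so the build command here is the generic one.
  cd parsoid-packaging   # placeholder path for the checked-out packaging tree
  dch -v 0.5.1all -D jessie-mediawiki --force-distribution 'Rebuild with corrected version strings'
  dpkg-buildpackage -us -uc -b   # unsigned binary-only build; upload afterwards as before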
[17:31:22] robh: https://gerrit.wikimedia.org/r/#/c/283627/ [17:31:30] moritzm: one question was if there is ever a reason to replace a package with the same nam , like when it was broken by mistake but same software version [17:31:40] ahhh [17:31:47] robh: basically, https://gerrit.wikimedia.org/r/#/c/283627/1/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 [17:31:51] thts sensible [17:31:57] moritzm: and the other one was that subbu said he would like to get rid of the 0.4.0 here https://releases.wikimedia.org/debian/pool/main/p/parsoid/ [17:32:05] and makes me wonder why none of us thought to do it sooner, heh [17:32:06] you can say "host bast4001 { hardware ethernet ...; next-server install2001; }" [17:32:23] let us wait on removing 0.4.0 till we get the 0.5* version in place and know to be good for a couple days. [17:33:19] ok, yes [17:34:12] mutante, moritzm i'll follow your lead as to whether i should upoad the 0.5.1 deb or if we should remove the broken 0.5.0 deb and re-upload a fixed 0.50 deb. [17:34:27] can you explicitly yank old versions, yes. I used that before when we moved from a locally built version to an official backport [17:34:56] subbu: if you install 0.5.1 to the same suite, the previous 0.5 should be removed automatically [17:35:07] suite as in "trusty-wikimedia" [17:35:10] the part that is broken is not anything in parsoid itself, but only in the package ? [17:35:12] or "jessie-wikimedia" [17:35:18] should he change the version? [17:35:28] yes, the package i uploaded had broken code. [17:35:37] ppl cannot use it. [17:36:06] ok, then it seems like it's 0.5.1 [17:37:07] ok. [17:37:40] and yea, the 0.4 is still there because the suite change to mediawiki-jessie [17:38:18] (03Abandoned) 10Mobrovac: Change Prop: Don't decode the responses' bodies [puppet] - 10https://gerrit.wikimedia.org/r/292366 (owner: 10Mobrovac) [17:38:35] got it. [17:39:07] separately .. reg. multiple versions .. https://phabricator.wikimedia.org/T115758#2352038 but nothing to resolve today. [17:39:16] (03PS1) 10RobH: setting bast4001 to use raided disks [puppet] - 10https://gerrit.wikimedia.org/r/292592 [17:39:16] ok, disks detected, i'm going to boot it up to os again and try the reimage script [17:39:17] * subbu is uploaded 0.5.1 deb to tin now [17:39:22] ive always done manually [17:39:44] (03CR) 10RobH: [C: 032] setting bast4001 to use raided disks [puppet] - 10https://gerrit.wikimedia.org/r/292592 (owner: 10RobH) [17:39:54] subbu: got the .changes file too? [17:40:08] i see it, ok [17:40:27] yes. [17:40:55] that should teach me to install the dbg pkg locally and test it before uploading. [17:41:22] !log uploaded parsoid 0.5.1 to releases [17:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:28] https://releases.wikimedia.org/debian/pool/main/p/parsoid/ [17:41:58] moritzm: how do i do the export here? [17:42:05] Warning: database 'jessie-mediawiki|main|amd64' was modified but no index file was exported. [17:42:08] Changes will only be visible after the next 'export'! [17:43:34] paravoid: ok, im going to apply those firewall rules and reimage it [17:43:45] robh: asw, not cr1 [17:43:45] just letting you know in case i fubar it ;] [17:43:48] yep! [17:43:55] robh@asw-ulsfo> [17:44:14] mutante, can you remove the 0.5.0 version from the releases? don't want anyone to get that one. 
[17:44:28] it would have replace it but there were errors [17:44:30] parsoid_0.5.1all_amd64.changes 100% 766 0.8KB/s 00:00 [17:44:33] Successfully uploaded packages. [17:44:36] File "pool/main/p/parsoid/parsoid_0.5.0allubuntu1_all.deb" is already registered with different checksums! [17:44:46] ah, ok. [17:44:47] paravoid: this is done in the edit interfaces scope right? [17:45:05] juniper doesnt warn you if you do it wrong it just misconfigures it sometimes for this stuff [17:45:06] so i ask [17:45:14] no [17:45:16] top scope [17:45:29] i get syntax error [17:45:41] set firewall, it doesnt accept firewall in the top edit scope [17:45:59] {master:2}[edit] [17:45:59] robh@asw-ulsfo# set firewall [17:45:59] ^ [17:46:01] syntax error. [17:46:05] hmm [17:46:09] maybe access rights issue? [17:46:11] maybe i dont have rights [17:46:20] can you just apply them all for it for now =] [17:46:34] done :) [17:46:34] i couldnt run some other commands in switches a month ago for the pwer supply investigation iirc [17:46:37] thanks! [17:46:42] ok, attempting to reimage [17:46:52] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0] [17:48:36] mutante, anything i should do on my end or is it something on the deb-upload side? i am not following what happened there. [17:50:52] subbu: i retried it, it says clearly that i upload 0.5.1, then it says still that 0.5.0 is already registered with different checksums, i'm not sure yet why 0.5.0 is still a matter [17:51:35] i didn't change anything in the code .. just bumped the version # .. so, if changelog is not part of the checksum computation 0.5.0 and 0.5.1 will have the same checksum .. maybe that is the issue? if so, i can fix that. [17:51:44] meh i forgot to enable redirection after post to monitor the status, fixing. [17:51:53] mutante, i can make some changes to the parsoid code so it gets a different checksum. [17:52:04] subbu: at the same time, it tells us that: Will not put 'parsoid' in 'jessie-mediawiki|main|amd64', as already there with same version '0.5.1all'. [17:52:10] so that is also there [17:52:23] i see .. [17:53:43] subbu: on a system that has this repo in the sources.list, apt-get install parsoid .. where can we test that [17:54:03] i tested it on my laptop. [17:54:09] did you get 0.5.1 ? [17:54:20] it pulls in parsoid_0.5.0allubuntu1_all.deb [17:54:46] my laptop is ubuntu .. so i assumed that is why it pulls that .. not sure. [17:56:20] ok, its loading the tftp image [17:56:34] moritzm: how to yank the old version please [17:56:39] paravoid: so this looks like its still goign to be a 30 minute copy [17:57:00] im going to go get lunch and not stand in datacenter floor =] (unfortunately the break area has no coverage for my phone or mifi ;) [17:57:10] no worries :) [17:57:32] (03PS6) 10Hashar: zuul: enhance logging [puppet] - 10https://gerrit.wikimedia.org/r/291913 [17:57:44] paravoid: so if you wanna give me the rights on the switch to undo your commit then you dont need to hang around for install completion (your call) [17:57:53] i'll head back home, back online in about 35 minutes or so [17:58:12] I'll stay around [17:58:31] I can give you the rights too [17:58:36] I'm just bored right now [17:58:41] ask me again next week, happy to do that [17:58:57] cool, back shortly [17:59:27] let us give paravoid some work .. 
[17:59:45] :) [18:00:21] subbu: ok, so im removing it [18:00:32] then we have unreferenced packages [18:00:51] then we can remove that and upload it again [18:01:50] ok. [18:02:20] did that, and still [18:02:22] Uploading to bromine.eqiad.wmnet (via scp to bromine.eqiad.wmnet): [18:02:23] parsoid_0.5.1all_all.deb 100% 11MB 11.0MB/s 00:00 [18:02:26] parsoid_0.5.1all_amd64.changes 100% 766 0.8KB/s 00:00 [18:02:29] Successfully uploaded packages. [18:02:31] Error: trying to put version '0.5.0allubuntu1' of 'parsoid' in 'jessie-mediawiki|main|amd64', [18:02:34] while there already is the stricly newer '0.5.1all' in there. [18:02:41] somewhere there is still 0.5.0 in there [18:03:52] maybe its permissions to write to the db.. looking for that [18:04:35] but look here https://releases.wikimedia.org/debian/pool/main/p/parsoid/ [18:04:44] just 0.5.1 not 0.5.0 anymore [18:06:10] i did a "sudo apt-get purge parsoid ; sudo apt-get update ; sudo apt-get install parsoid" and still got parsoid_0.5.0allubuntu1_all.deb [18:10:00] subbu: how does your sources.list look ? [18:10:07] let me try it too [18:10:24] eb https://releases.wikimedia.org/debian jessie-mediawiki main [18:10:24] deb https://releases.wikimedia.org/debian trusty-mediawiki main [18:10:33] *deb [18:14:16] eh, going to a labs instance for that now [18:19:26] btw, https! i have to install apt-transport-https on stretch but in jessie its there..? [18:20:12] ok, yea, i also still get 0.5.0, meh. [18:29:00] (03CR) 10Alex Monk: [C: 04-1] "Yeah, this used to work until we changed the way that silver is configured. I'm wondering if we could make it use some dummy configuration" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/292537 (https://phabricator.wikimedia.org/T136926) (owner: 10Reedy) [18:39:47] (03CR) 10BBlack: [C: 031] ssl: delete rt.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/292589 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [18:44:49] paravoid: so installer finished and its now getting its keys signed and first puppet run [18:44:57] post puppet run we can drop the firewall temp rules i'd think [18:46:00] which seems to be running on its own, i didnt fire it off so the script says to do so, but i guess it did. [18:49:49] robh: ok, rolled back [18:49:53] and with that, I'm gone [18:50:34] thanks! puppet is still running on its own in background (non ideal but i guess the script does it) [18:52:37] mutante: congrat on the port of RT to Jessie :) Your mail announce on ops list is pretty clear and nice! [18:56:32] (03PS4) 10Ori.livneh: pybal: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291183 (owner: 10BryanDavis) [18:56:55] (03CR) 10Ori.livneh: [C: 032 V: 032] "the json / re imports are just leftovers from using a different collector as a template" [puppet] - 10https://gerrit.wikimedia.org/r/291183 (owner: 10BryanDavis) [18:58:40] robh: Did your bast4001 changes cause its SSH key to change? [18:59:24] I'm getting key verification errors and I didn't see anything about it on the ops list, but I also see that you were doing stuff with bast4001 just an hour ago [19:02:10] hashar: thank you :) [19:02:15] subbu: fixed! 
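The client-side verification above boils down to pointing apt at the releases repository and reinstalling. A sketch for a jessie client; the sources.list filename is invented, and apt-transport-https is included because the repo is served over https (the log suggests jessie already ships it while stretch needs it installed).
  sudo apt-get install -y apt-transport-https   # only needed where it is not already present
  echo 'deb https://releases.wikimedia.org/debian jessie-mediawiki main' | sudo tee /etc/apt/sources.list.d/wikimedia-releases.list
  sudo apt-get update
  apt-cache policy parsoid        # shows which version the exported index now offers
  sudo apt-get install --reinstall parsoid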
[19:02:45] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2354003 (10BBlack) Graph from logstash of insecure req rate over the past 28 days in 12h increments: {F4108994} [19:04:11] !log releases apt repo on bromine: export fresh jessie-mediawiki indexes [19:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:17] subbu: in the end besides the other issues import had worked but signing had failed, so reprepro list shows the new version but the index did not get updated, updating the index failed because i did not run it with reprepro env so it could not read the GPG key to sign it [19:06:28] Krenair: Looking at https://phabricator.wikimedia.org/T136920 now [19:06:30] curious.. [19:07:58] 06Operations, 10hardware-requests: eqiad: spare allocation to replace labmon1001 - https://phabricator.wikimedia.org/T136970#2354012 (10RobH) [19:08:01] Krenair: Also note how the query string doesn't contain a hash, which means it was unable to find the file on disk, too. [19:08:25] 06Operations, 10ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2354048 (10RobH) [19:08:39] 06Operations, 10hardware-requests: eqiad: spare allocation to replace labmon1001 - https://phabricator.wikimedia.org/T136970#2354079 (10RobH) [19:08:42] subbu: had to reprepro export jessie-mediawiki to refresh the indexes, while being user reprepro with the right env/homedir to find .gnupg .. and if there would not have been the issue with the same checksum before then this part would have happened automatically [19:08:46] YuviPanda: ^ https://phabricator.wikimedia.org/T136972 [19:08:54] can you review that to ensure I have all the steps as far as you can tell? [19:09:05] i made the hw-request for mark, and then a task for the setup of the actual host [19:10:01] looking, robh [19:10:22] i think continuing to fight with the out of warranty broken hardware is wasting your time at this point [19:10:29] robh: ya +1 [19:10:37] robh: yup! [19:10:55] ok, so if it looks good, i'll task it to cmjohnson1 to do the onsite parts (and racktables updates) [19:11:04] safer to have onsite update racktables since we're reusing a hostname. [19:11:08] 06Operations, 10hardware-requests: eqiad: spare allocation to replace labmon1001 - https://phabricator.wikimedia.org/T136970#2354097 (10yuvipanda) (I've filed T136968 to track allocating a service IP for this so we do not run into the host naming problem in the future) [19:11:15] +1 [19:12:02] 06Operations, 10ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2354100 (10RobH) Assigning this to @Cmjohnson to pull the disks of the old sytem and place into the new one. Also he should update the racktables entries, since he'll be the one modifying the hostnames o... [19:12:18] cmjohnson1: so https://phabricator.wikimedia.org/T136972 is assigned to you please review for the onsite steps and let me know if you have questions [19:12:37] im guessing the data cpy will take a bit, i imagine thats why YuviPanda wants to try to get it installed today [19:12:44] yeah [19:12:46] so need to isntall, then post install attach the usb disk [19:12:56] it took about 10ish hours last time [19:13:05] read should be a lot faster [19:13:20] i tested write and read via a nearly identical old usb drive toaster here [19:13:26] and read was a third of the time. 
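The fix mutante describes above, where the import had worked but the signed index never got regenerated, amounts to re-running reprepro's export as the reprepro user so GNUPGHOME resolves to the signing key. A sketch; the basedir path is a placeholder, and the remove/deleteunreferenced steps show the generic way a stale version such as the broken 0.5.0 gets dropped.
  # drop the stale version from the suite and clean orphaned files out of the pool
  sudo -u reprepro -H reprepro -b /srv/releases-repo remove jessie-mediawiki parsoid
  sudo -u reprepro -H reprepro -b /srv/releases-repo deleteunreferenced
  # regenerate and re-sign the Packages/Release indexes for the suite
  sudo -u reprepro -H reprepro -b /srv/releases-repo export jessie-mediawiki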
[19:13:35] ah [19:13:37] nice [19:13:41] (i just wrote a 8 gb file but still) [19:14:06] yeah, my time macine backups are via one of them [19:14:10] so i kinda use one weekly. [19:15:10] ah [19:15:12] nice [19:15:19] * YuviPanda is still averse to having physical things [19:17:36] robh: what is the priority of labmon? [19:17:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5661424 keys - replication_delay is 619 [19:17:55] cmjohnson1: the only thing we need onsite is the disk swap [19:17:57] can it wait? I have 21 servers sitting still that need to be picked up and racked [19:18:01] okay [19:18:12] i think YuviPanda wanted today, since its downtiem for users due to the hw failure [19:18:15] can I do that now? [19:18:17] im doing all the dns stuff [19:18:19] yeah go for it [19:18:23] okay..cool [19:18:26] thanks cmjohnson1 [19:18:35] and its an existing system so ports on switch are labeled already, checking shortly [19:18:57] 06Operations, 06Repository-Admins, 13Patch-For-Review: request Gerrit project/repo wikimedia/endowment - https://phabricator.wikimedia.org/T136736#2354158 (10Dzahn) cool, thanks @demon [19:19:39] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5649737 keys - replication_delay is 54 [19:21:07] robh: going to move restbase1004/labmon10xx to c5 [19:21:15] (03PS1) 10RobH: updating labmon1001 dns [dns] - 10https://gerrit.wikimedia.org/r/292608 [19:21:23] cmjohnson1: ok, just to even out racks? [19:21:29] i have no preference, anywhere in c is fine. [19:21:31] well it's in 10g rack now [19:21:36] mgmt stays same. ahh yes [19:21:39] move away! sorry about that [19:21:51] you can put in the place of existing labmon1001 if that makes it easier? [19:21:54] (your call) [19:22:04] yeah...did you kill off puppet yet? [19:22:19] labmon1001 is offline so you can pull [19:22:29] i have not killed its old puppetdb entry but will shortly [19:22:40] if its keepign the same slot and network cable then it makes life very easy ;D [19:22:49] cmjohnson1: you may wanna reset the bios and drac before you unrack old labmon1001 [19:22:54] that way you never have to rack it again to do so. [19:23:56] robh: just going to c5 for now (easier) [19:24:13] wfm [19:24:42] (03CR) 10RobH: [C: 032] updating labmon1001 dns [dns] - 10https://gerrit.wikimedia.org/r/292608 (owner: 10RobH) [19:25:00] (03PS1) 10Dzahn: Revert "endowment: comment out git:clone until repo exists" [puppet] - 10https://gerrit.wikimedia.org/r/292610 [19:25:18] you know you're going from Dell to HP...there is a chance it doesn't work right [19:25:46] cmjohnson1: no data to keep intact [19:25:51] its all on the usb disk [19:26:06] unless you mean there is a chance the one hard disks dont work, but that is slim [19:26:28] YuviPanda: did you already clean out the old puppetstoreddb entry for labmon? [19:26:36] yeah robh [19:26:43] ok, cool, you take care of puppet and salt keys? 
[19:26:44] I ran the script which cleans it up iirc [19:26:49] those should be gone too [19:26:51] but I'll verify [19:26:53] awesome [19:27:19] root@palladium:/home/yuvipanda# puppet cert list -a | grep labmon [19:27:21] cmjohnson1: lemme know the network port when you are done and i'll handle that as well [19:27:22] root@palladium:/home/yuvipanda# [19:27:24] so that's gone [19:27:26] I'll verify salt [19:27:51] ok, while chris is moving stuff im going to go ahead and make something to eat for lunch [19:28:09] and new insulin pump, robot parts expire every 3 days =P [19:28:50] yuvipanda@neodymium:~$ sudo salt-key -l all | grep labmon [19:28:53] yuvipanda@neodymium:~$ [19:28:55] all good too [19:29:24] cool, once its moved and disks are in, i'll provision the network port and then it can be installed [19:30:18] \o/ cool! [19:37:56] !log krinkle@tin Synchronized php-1.28.0-wmf.4/extensions/WikimediaEvents/extension.json: T136920 (duration: 00m 28s) [19:37:57] T136920: ext.wikimediaEvents.deprecate.js during debug at nowiki - https://phabricator.wikimedia.org/T136920 [19:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:45] mutante, okay .. thanks .. i didn't catch all the details, but I'll remember to make sure the checksum changes when I bump deb version next time. [19:41:50] and confirmed, it works. [19:42:21] (03PS2) 10Dzahn: Revert "endowment: comment out git:clone until repo exists" [puppet] - 10https://gerrit.wikimedia.org/r/292610 [19:42:24] subbu: :) yay, ok [19:42:51] and i will work on getting this back to the normal script until next release [19:43:18] (03CR) 10Dzahn: [C: 032] "repo exists now" [puppet] - 10https://gerrit.wikimedia.org/r/292610 (owner: 10Dzahn) [19:44:05] (03PS3) 10Dzahn: ssl: delete rt.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/292589 (https://phabricator.wikimedia.org/T119112) [19:44:37] (03CR) 10Dzahn: [C: 032] ssl: delete rt.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/292589 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [19:46:30] robh: use labmon1001 mgmt ip or re-label the restbase one? 
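The two greps above are the verification half of the cleanup needed whenever a hostname is reused; a hedged sketch of the full sequence, with the FQDN assumed rather than taken from the log:
  # on the puppetmaster (palladium): remove the old agent cert, then confirm nothing is left
  puppet cert clean labmon1001.eqiad.wmnet
  puppet cert list -a | grep labmon      # should print nothing
  # on the salt master (neodymium): delete the old minion key, then confirm
  salt-key -d labmon1001.eqiad.wmnet -y
  salt-key -l all | grep labmon          # should print nothing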
[19:46:47] i did dns [19:46:50] dont need to change mgmt ip [19:47:00] they had asset tag and hostname, i removed hostname for old server [19:47:02] and added for new [19:47:21] cmjohnson1: so that is re-label the restbase one i suppose, and its done =] [19:47:22] (03PS2) 10Jforrester: Enable VisualEditor by default for logged-in users on four Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292274 [19:47:24] (03PS1) 10Jforrester: Switch Wikivoyages to Single Edit Tab mode for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292614 [19:47:26] (03PS1) 10Jforrester: Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292615 [19:47:28] (03PS1) 10Jforrester: Cleanup: Move never-altered CentralAuthUseEventLogging into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292616 [19:47:30] (03PS1) 10Jforrester: Cleanup: Move never-altered DisableUnmergedEdits into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292617 [19:47:32] (03PS1) 10Jforrester: Cleanup: Move never-altered NewUserSuppressRC into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292618 [19:47:34] (03PS1) 10Jforrester: Cleanup: Move never-altered UseDismissableSiteNotice into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292619 [19:47:36] cool..robh..switch ports updated [19:47:36] (03PS1) 10Jforrester: Cleanup: Move never-altered UseAbuseFilter into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292620 [19:47:38] (03PS1) 10Jforrester: Cleanup: Move never-altered UseLocalisationUpdate into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292621 [19:47:40] (03PS1) 10Jforrester: Cleanup: Move never-altered CommonsMetadata* into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292622 [19:47:42] (03PS1) 10Jforrester: Cleanup: Move never-altered WikiLoveDefault into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292623 [19:47:44] (03PS1) 10Jforrester: Cleanup: Note a couple of items that are varied in Labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292624 [19:47:44] 06Operations, 10ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2354223 (10Cmjohnson) Moved old restbase1004 to rack c5, renamed labmon1001. Setup switch ports and vlan. ge-5/0/18 [19:47:48] What, me, bored? ;-) [19:47:55] (03CR) 10Dzahn: "deleted modules/secret/secrets/ssl/rt.wikimedia.org.key as well" [puppet] - 10https://gerrit.wikimedia.org/r/292589 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [19:48:55] cmjohnson1: hrmm, i cannot hit up the wmf4659's mgmt? [19:49:07] try all caps WMF [19:49:29] dns isnt case sensitive [19:49:42] you already updated dns? [19:49:53] 112 1H IN PTR wmf4659.mgmt.eqiad.wmnet. [19:49:53] 112 1H IN PTR labmon1001.mgmt.eqiad.wmnet. [19:49:54] the WMF4659.mgmt.eqiad.wmnet never changed [19:50:06] which is what i was using [19:50:20] it's not working yet cuz it's not connected [19:50:25] heh [19:50:35] nothing to ping [19:50:41] ahh, i thought you were done when you mentioned switch ports [19:50:42] my bad [19:50:46] lemme know when its ready to roll [19:51:01] nope...it's HP so it's like a 45 min post ....
;-) [19:51:28] 06Operations: revoke rt.wikimedia.org cert - https://phabricator.wikimedia.org/T136979#2354227 (10Dzahn) [19:51:35] no worries, i just wanted to hop on and ensure disks are seen and shit before you get stuck back into the other racking [19:51:41] 06Operations: revoke rt.wikimedia.org cert - https://phabricator.wikimedia.org/T136979#2354242 (10Dzahn) [19:51:44] mutante: we dont revoke certs like that [19:52:35] 06Operations: revoke rt.wikimedia.org cert - https://phabricator.wikimedia.org/T136979#2354256 (10RobH) We don't revoke certs unless we know they are compromised. Since they are only for a year, adding them to revocation lists has larger repercussions and updates than simply letting it expire. (This is my unde... [19:52:38] task updated [19:52:55] basically revoking a perfectly good cert puts overhead on the revocation system for no good reason [19:53:00] robh: ok, good, that was the purpose, at one point i was asked to make these tickets [19:53:02] so easier for all to simply let it expire [19:53:07] ok [19:53:07] cool [19:53:16] yeah having the task and reasoning is never bad [19:54:05] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2354261 (10Dzahn) [19:54:35] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#1818428 (10Dzahn) [19:54:37] 06Operations: revoke rt.wikimedia.org cert - https://phabricator.wikimedia.org/T136979#2354227 (10Dzahn) 05Resolved>03declined [19:55:05] (03PS5) 10Dzahn: varnish: mv wikimedia_vcl, netmapper_upd to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290875 [19:55:46] (03CR) 10Dzahn: [C: 032] "compiler-tested noop" [puppet] - 10https://gerrit.wikimedia.org/r/290875 (owner: 10Dzahn) [19:56:30] ok, sees all 4 disks [19:57:03] robh: you wanna drive? i see all 4 disks..i will leave it connected in case you need me [19:57:04] cannot get into bios via one-time options though, annoying older hp bios [19:57:17] cmjohnson1: i figured you had shit to do =] i can take over for now unless i hit a problem [19:57:18] i go with just annoying HP bios [19:57:23] yeah..just trying to help [19:57:26] just trying to get mac address [19:57:51] worst case i let it hit carbon and steal it from there, heh [19:58:13] robh: show /system1/network1/Integrated_NICs [19:58:20] doesnt show anything [19:58:25] doesn't know the nic is there yet [19:58:29] just shows mgmt [19:58:40] and it just reset again... [19:58:45] media fail pxe [19:58:50] cmjohnson1: says cable isnt connected [19:58:57] [19:59:11] PXE-E61: Media test failure, check cable [19:59:37] change it to pxe 1g nic [19:59:41] it's set for 10g nic [19:59:46] shit..i need to pull that out first [20:00:04] ok, i'll d/c since itll power off [20:00:21] feel free to go into bios via crash cart and set it properly ;] [20:00:32] since it wont go in for me remotely, goes to rbsu command line instead =P [20:01:55] cmjohnson1: you didnt wanna get 22 machines done today did you? not when you can spend all day on this one! [20:02:31] (03PS3) 10Dzahn: varnish: move errorpage.html from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/290876 [20:03:23] robh: yeah..it's not happening.. I wanted to get the fundraising stuff up and for the most part it's ready [20:03:48] (03CR) 10Dzahn: "ok if this would move into the module? or better into a role class?"
[puppet] - 10https://gerrit.wikimedia.org/r/290876 (owner: 10Dzahn) [20:05:14] (03PS3) 10Dzahn: add endowment.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/292208 (https://phabricator.wikimedia.org/T136735) [20:06:21] (03CR) 10Dzahn: [C: 032] add endowment.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/292208 (https://phabricator.wikimedia.org/T136735) (owner: 10Dzahn) [20:10:36] (03PS3) 10Dzahn: nginx: remove jessie conditional for mount [puppet/nginx] - 10https://gerrit.wikimedia.org/r/291278 [20:12:52] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2354312 (10Dzahn) [20:14:43] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2354315 (10Dzahn) strictly the ticket is resolved because it just says to configure it for "www". but would be nice if we can get this fixed too since the cer... [20:45:04] (03PS1) 10RobH: setting labmon1001's new mac address [puppet] - 10https://gerrit.wikimedia.org/r/292632 [20:45:34] (03CR) 10RobH: [C: 032 V: 032] setting labmon1001's new mac address [puppet] - 10https://gerrit.wikimedia.org/r/292632 (owner: 10RobH) [21:14:41] RECOVERY - DPKG on cp3012 is OK: All packages OK [21:15:18] RECOVERY - DPKG on cp3013 is OK: All packages OK [21:15:30] RECOVERY - DPKG on cp3014 is OK: All packages OK [21:16:19] RECOVERY - DPKG on cp3019 is OK: All packages OK [21:16:39] RECOVERY - DPKG on cp3020 is OK: All packages OK [21:16:59] RECOVERY - DPKG on cp3022 is OK: All packages OK [21:16:59] RECOVERY - DPKG on cp3015 is OK: All packages OK [21:16:59] RECOVERY - DPKG on cp3021 is OK: All packages OK [21:27:14] 06Operations, 10ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2354462 (10RobH) [21:27:24] 06Operations, 10ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2354048 (10RobH) a:05Cmjohnson>03yuvipanda [21:35:08] PROBLEM - carbon-cache@g service on labmon1001 is CRITICAL: Timeout while attempting connection [21:35:28] PROBLEM - carbon-cache@h service on labmon1001 is CRITICAL: Timeout while attempting connection [21:35:49] PROBLEM - carbon-frontend-relay service on labmon1001 is CRITICAL: Timeout while attempting connection [21:35:49] PROBLEM - Check size of conntrack table on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:09] PROBLEM - DPKG on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:17] PROBLEM - carbon-local-relay service on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:28] PROBLEM - configured eth on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:28] PROBLEM - Disk space on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:35] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2354477 (10Dzahn) [21:36:38] PROBLEM - MD RAID on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:38] PROBLEM - dhclient process on labmon1001 is CRITICAL: Timeout while attempting connection [21:36:40] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346094 (10Dzahn) https://endowment.wikimedia.org/ [21:36:57] PROBLEM - graphite-web uWSGI web app on labmon1001 is CRITICAL: Timeout while attempting connection [21:37:17] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: Connection 
timed out [21:37:28] PROBLEM - carbon-cache@a service on labmon1001 is CRITICAL: Timeout while attempting connection [21:37:28] PROBLEM - puppet last run on labmon1001 is CRITICAL: Timeout while attempting connection [21:37:47] PROBLEM - carbon-cache@b service on labmon1001 is CRITICAL: Timeout while attempting connection [21:37:47] PROBLEM - salt-minion processes on labmon1001 is CRITICAL: Timeout while attempting connection [21:37:59] PROBLEM - carbon-cache@c service on labmon1001 is CRITICAL: Timeout while attempting connection [21:38:17] PROBLEM - carbon-cache@d service on labmon1001 is CRITICAL: Timeout while attempting connection [21:38:37] PROBLEM - carbon-cache@e service on labmon1001 is CRITICAL: Timeout while attempting connection [21:38:47] PROBLEM - carbon-cache@f service on labmon1001 is CRITICAL: Timeout while attempting connection [21:44:54] uh oh [21:44:56] ^ is me [21:46:03] (03PS1) 10Jforrester: Enable VisualEditor by default on eleven Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T132495) [21:47:16] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of the Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292747 (https://phabricator.wikimedia.org/T136996) [21:47:52] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of the Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292748 (https://phabricator.wikimedia.org/T136995) [21:48:08] RECOVERY - carbon-cache@f service on labmon1001 is OK: OK - carbon-cache@f is active [21:48:27] RECOVERY - carbon-cache@g service on labmon1001 is OK: OK - carbon-cache@g is active [21:48:33] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of the Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) [21:48:37] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.006 second response time [21:48:47] RECOVERY - carbon-cache@h service on labmon1001 is OK: OK - carbon-cache@h is active [21:48:48] RECOVERY - carbon-cache@a service on labmon1001 is OK: OK - carbon-cache@a is active [21:49:07] RECOVERY - Check size of conntrack table on labmon1001 is OK: OK: nf_conntrack is 1 % full [21:49:07] RECOVERY - salt-minion processes on labmon1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:49:07] RECOVERY - carbon-cache@b service on labmon1001 is OK: OK - carbon-cache@b is active [21:49:07] RECOVERY - carbon-frontend-relay service on labmon1001 is OK: OK - carbon-frontend-relay is active [21:49:10] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of the French Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) [21:49:18] RECOVERY - carbon-cache@c service on labmon1001 is OK: OK - carbon-cache@c is active [21:49:27] RECOVERY - DPKG on labmon1001 is OK: All packages OK [21:49:28] RECOVERY - carbon-local-relay service on labmon1001 is OK: OK - carbon-local-relay is active [21:49:38] RECOVERY - carbon-cache@d service on labmon1001 is OK: OK - carbon-cache@d is active [21:49:38] RECOVERY - configured eth on labmon1001 is OK: OK - interfaces up [21:49:38] RECOVERY - Disk space on labmon1001 is OK: DISK OK [21:49:48] RECOVERY - dhclient process on labmon1001 is OK: PROCS OK: 0 processes with command name dhclient [21:49:48] RECOVERY - carbon-cache@e 
service on labmon1001 is OK: OK - carbon-cache@e is active [21:49:48] RECOVERY - MD RAID on labmon1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [21:49:49] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) [21:50:08] RECOVERY - graphite-web uWSGI web app on labmon1001 is OK: ● uwsgi-graphite-web.service - uwsgi-graphite-web uwsgi app [21:50:39] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:51:33] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of the German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) [21:52:01] (03PS2) 10Jforrester: Enable VisualEditor by default on eleven Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) [21:54:33] (03CR) 10Jforrester: [C: 04-2] "Subject to discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292747 (https://phabricator.wikimedia.org/T136996) (owner: 10Jforrester) [21:54:41] (03CR) 10Jforrester: [C: 04-2] "Subject to discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292748 (https://phabricator.wikimedia.org/T136995) (owner: 10Jforrester) [21:54:47] (03CR) 10Jforrester: [C: 04-2] "Subject to discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [21:54:54] (03CR) 10Jforrester: [C: 04-2] "Subject to discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) (owner: 10Jforrester) [21:54:57] (03PS4) 10Krinkle: ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [21:55:02] (03CR) 10Jforrester: [C: 04-2] "Subject to discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) (owner: 10Jforrester) [21:55:04] (03CR) 10Krinkle: [C: 031] ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [21:55:08] (03CR) 10Jforrester: [C: 04-2] "Subject to discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) (owner: 10Jforrester) [21:55:15] (03CR) 10Jforrester: [C: 04-1] "Tentatively scheduled for 13 June, subject to discussion." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) (owner: 10Jforrester) [21:57:54] !log started copying graphite data from usb back [21:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:58:47] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet has 1 failures [22:19:41] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [22:24:52] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:28:14] (03PS1) 10Gergő Tisza: Apply AbuseFilter configuration syntax change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 [22:31:39] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2354767 (10RobH) a:05Papaul>03RobH [22:46:49] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:50:51] (03PS1) 10Krinkle: Follows-up 3463cd6e which added a logo from the local wiki, but at the time Bashkir Wikipedia was having their 10th anniversary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292762 [22:51:28] (03PS2) 10Krinkle: Restore bawiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292762 [22:51:45] (03CR) 10Krinkle: [C: 032] Restore bawiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292762 (owner: 10Krinkle) [22:52:47] (03Merged) 10jenkins-bot: Restore bawiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292762 (owner: 10Krinkle) [22:53:22] !log krinkle@tin Synchronized static/images/project-logos/bawiki.png: (no message) (duration: 00m 24s) [22:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:57:06] !log Purged https://en.wikipedia.org/static/images/project-logos/bawiki.png [22:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:25] (03PS1) 10Dzahn: releases: delete unused role::releases::upload [puppet] - 10https://gerrit.wikimedia.org/r/292763 [23:07:20] (03PS2) 10Dzahn: releases: delete unused role::releases::upload [puppet] - 10https://gerrit.wikimedia.org/r/292763 [23:07:30] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/3048/" [puppet] - 10https://gerrit.wikimedia.org/r/292763 (owner: 10Dzahn) [23:21:14] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures [23:23:38] (03PS1) 10Dzahn: aptrepo: make conf/incoming configurable parameter [puppet] - 10https://gerrit.wikimedia.org/r/292765 (https://phabricator.wikimedia.org/T132757) [23:25:02] (03PS2) 10Dzahn: aptrepo: make conf/incoming configurable parameter [puppet] - 10https://gerrit.wikimedia.org/r/292765 (https://phabricator.wikimedia.org/T132757) [23:26:14] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet has 1 failures [23:34:58] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3049/" [puppet] - 10https://gerrit.wikimedia.org/r/292765 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:35:04] (03PS1) 10Dzahn: aptrepo: set separate incoming conf for releases [puppet] - 10https://gerrit.wikimedia.org/r/292767 (https://phabricator.wikimedia.org/T132757) [23:37:18] (03PS2) 10Dzahn: aptrepo: set separate incoming conf for releases [puppet] - 10https://gerrit.wikimedia.org/r/292767 (https://phabricator.wikimedia.org/T132757) [23:46:23] RECOVERY - 
puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [23:52:35] (03CR) 10Dzahn: [C: 032] "changes content on bromine, keeps it as is on carbon http://puppet-compiler.wmflabs.org/3050/" [puppet] - 10https://gerrit.wikimedia.org/r/292767 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:53:32] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
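For reference, the "separate incoming conf for releases" change merged above comes down to giving the releases repo its own rule set in reprepro's conf/incoming file. A hedged example of what such a stanza can look like, using reprepro's documented field names; the rule name, directories, and distribution list here are illustrative, not the values from the actual puppet change:
  Name: releases
  IncomingDir: /srv/releases/debian/incoming
  TempDir: /srv/releases/debian/tmp
  Allow: jessie-mediawiki trusty-mediawiki precise-mediawiki
  Cleanup: on_deny on_error
Uploads dropped into IncomingDir are then pulled into the repo with: reprepro processincoming releases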