[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T0000). Please do the needful.
[00:00:04] jdlrobson ebernhardson Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:08] but there is no lag, QPS are back to normal
[00:00:13] something awry
[00:00:14] \o
[00:00:17] Good evening.
[00:00:22] (PS6) EBernhardson: Better mediawiki REPL [puppet] - https://gerrit.wikimedia.org/r/268541
[00:00:24] [Tue Feb 09 23:59:47.483451 2016] [proxy:error] [pid 5147:tid 139822668613376] (111)Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1:9000 (*) failed
[00:00:30] thanks guys for staying late
[00:00:37] I'll do it
[00:00:37] lots of these in the job queue error logs
[00:00:44] <_joe_> apergos: that is HHVM
[00:00:47] I'm looking randomly at mw1015
[00:00:51] Wow everything is a config patch this time
[00:00:53] I like that
[00:00:56] <_joe_> and that's apache with its bug
[00:01:10] none from the previous log
[00:01:26] what do I do about it?
[00:01:36] operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2013595 (Dzahn) i have given the laptop back to OIT today to keep it in their room behind the metal door for us and give ops members access when we need it i put some sticky tape on it and wrote "eiximenis.corp.wm.org...
[00:01:37] _joe_:
[00:01:39] are you guys facing other important issues I can help with?
I will be around for some time
[00:01:40] <_joe_> apergos: check if hhvm restarted around that time
[00:02:08] <_joe_> apergos: it might be that hhvm crashed
[00:02:14] (CR) Catrope: [C: 2] Test HTML stripping in production mobile beta [mediawiki-config] - https://gerrit.wikimedia.org/r/269341 (https://phabricator.wikimedia.org/T124959) (owner: Jdlrobson)
[00:02:32] yes about 2 mins ago it seems
[00:02:36] that it restarted
[00:02:41] (CR) jenkins-bot: [V: -1] Better mediawiki REPL [puppet] - https://gerrit.wikimedia.org/r/268541 (owner: EBernhardson)
[00:02:47] :S
[00:03:01] <_joe_> ok there's your explanation
[00:03:06] ugh
[00:03:14] okay, thanks
[00:03:37] <_joe_> in this case there is no loadbalancer preventing requests from reaching the appserver quickly
[00:03:46] it's still spewing these, do I need to do anything about it?
[00:03:58] <_joe_> let me take a look
[00:04:00] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:04:03] (PS3) Subramanya Sastry: ruthenium: puppetize script to update parsoid + restart services [puppet] - https://gerrit.wikimedia.org/r/269559
[00:04:05] (PS7) EBernhardson: Better mediawiki REPL [puppet] - https://gerrit.wikimedia.org/r/268541
[00:04:26] (Merged) jenkins-bot: Test HTML stripping in production mobile beta [mediawiki-config] - https://gerrit.wikimedia.org/r/269341 (https://phabricator.wikimedia.org/T124959) (owner: Jdlrobson)
[00:05:50] <_joe_> apergos: hhvm was locked up
[00:05:59] <_joe_> I guess we don't have alarms for that
[00:06:02] <_joe_> that's bad
[00:06:04] how could you tell? and what did you do?
[00:06:17] <_joe_> hhvm was using 0% of cpu
[00:06:24] ah should have checked that
[00:06:25] <_joe_> I did 'ps -ef'
[00:06:29] yeah sorry
[00:06:33] <_joe_> then I did hhvm-dump-debug
[00:06:49] <_joe_> which gets a stack dump of all hhvm threads
[00:06:52] operations, Salt, Trebuchet: salt-minion processes terminate on deployment sync - https://phabricator.wikimedia.org/T122544#2013615 (greg)
[00:06:57] <_joe_> and well, "same old same old"
[00:07:03] gotcha
[00:07:29] <_joe_> ok now I'm really off
[00:07:38] Whoa uhm
[00:07:38] jynus: I see you still have s2 in read-only mode?
[00:07:47] (CR) Dzahn: "thanks for puppetizing and moving it out of a personal home dir! looks alright, except in the commit message you said "let it be owned by " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/269559 (owner: Subramanya Sastry)
[00:07:54] operations, Salt, Trebuchet: salt-minion processes terminate on deployment sync - https://phabricator.wikimedia.org/T122544#2013627 (ArielGlenn) p:Triage>High a:ArielGlenn
[00:08:23] * ebernhardson is going to make the safe assumption swat is canceled and move my patches into tomorrow's sway
[00:08:27] t
[00:08:29] ok good night joe
[00:08:46] Or... bot
[00:08:55] The commit message says " Set mediawiki back in read-only mode" but the actual diff disables r/o mode
[00:09:02] and that was an hour ago, so I guess I'll continue with SWAT
[00:09:03] yeah that was a typo
[00:09:08] it's really read/write
[00:09:09] (CR) Dzahn: ruthenium: puppetize script to update parsoid + restart services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/269559 (owner: Subramanya Sastry)
[00:09:11] oh ok
[00:09:28] jynus: how do those logs look, ok now?
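The lockup check _joe_ describes above (hhvm sitting at 0% CPU in the process list) can be sketched as a small filter over ps output. This is only an illustration: _joe_ ran 'ps -ef' and hhvm-dump-debug by hand, while the `ps -eo pid,pcpu,comm` column layout, the function name idle_pids and the sample PIDs below are assumptions made for the example. Note that ps reports CPU use averaged over the process lifetime, so this is a rough signal that works best shortly after a restart, as in the case discussed here.

```python
def idle_pids(ps_output: str, command: str, max_cpu: float = 0.0) -> list[int]:
    """Return PIDs of `command` whose %CPU is at or below max_cpu.

    Expects text shaped like `ps -eo pid,pcpu,comm` output (assumed layout).
    """
    pids = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore malformed rows
        pid, pcpu, comm = fields
        if comm.strip() == command and float(pcpu) <= max_cpu:
            pids.append(int(pid))
    return pids


# Invented sample data, not taken from mw1015.
SAMPLE = """\
  PID %CPU COMMAND
 1234  0.0 hhvm
 6021 12.5 apache2
"""

print(idle_pids(SAMPLE, "hhvm"))  # → [1234]
```

A hit from a check like this would only be a prompt to take a stack dump, as _joe_ did, not proof of a lockup on its own.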
[00:11:37] !log catrope@mira Synchronized wmf-config/InitialiseSettings.php: Test HTML stripping in production mobile beta (duration: 02m 12s)
[00:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:12:06] (CR) Ori.livneh: [C: -1] ruthenium: puppetize script to update parsoid + restart services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/269559 (owner: Subramanya Sastry)
[00:12:24] yes, apergos I am writing a summary report now, saying everything looks fine
[00:12:32] jdlrobson: ---^^ Your change just went out, please confirm
[00:12:33] ok sweet
[00:12:56] so swat is on, yeah
[00:13:00] Yeah we're on
[00:13:04] (PS4) Subramanya Sastry: ruthenium: puppetize script to update parsoid + restart services [puppet] - https://gerrit.wikimedia.org/r/269559
[00:13:04] ebernhardson: Yours are next, still wanna do em?
[00:13:08] RoanKattouw: sure
[00:13:41] jynus: well done
[00:13:55] ori, thanks
[00:13:56] (CR) Catrope: [C: 2] Reduce Kafka timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/267200 (https://phabricator.wikimedia.org/T125084) (owner: MaxSem)
[00:14:39] (Merged) jenkins-bot: Reduce Kafka timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/267200 (https://phabricator.wikimedia.org/T125084) (owner: MaxSem)
[00:14:45] ebernhardson: Was this refactoring https://gerrit.wikimedia.org/r/#/c/269568/1/tests/cirrusTest.php intended to be part of the increase replicas change?
[00:15:14] the failover was the "easy" part, preparing for MariaDB 10 was the long part, and Sean did a lot of the initial work
[00:15:57] RoanKattouw: yes, I am never completely sure how config will be loaded, that code tests our shard/replica counts to make sure they are sane
[00:16:02] OK
[00:16:02] but let me get you excited because this will open the door for easier failover/multimaster, parallel replication and better performance!
[00:16:11] RoanKattouw: as a side effect, it tests that the config that comes out of SiteConfiguration is what you thought it would be when writing it
[00:16:12] Just checking since the commit message didn't mention it
[00:16:29] Ugh right that file is in the tests/ directory
[00:16:36] plus it is a step forward towards multi-datacenter work
[00:16:42] (CR) Catrope: [C: 2] Increase completion suggester replicas for busy wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/269568 (https://phabricator.wikimedia.org/T125667) (owner: EBernhardson)
[00:16:43] I am clearly blind
[00:16:55] :)
[00:17:21] (PS5) Subramanya Sastry: ruthenium: puppetize script to update parsoid + restart services [puppet] - https://gerrit.wikimedia.org/r/269559
[00:17:31] (Merged) jenkins-bot: Increase completion suggester replicas for busy wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/269568 (https://phabricator.wikimedia.org/T125667) (owner: EBernhardson)
[00:17:34] jynus: Will there be an ops-l email or something about that? i.e. about how whatever you guys did today (I didn't pay too much attention) is good and exciting, in addition to the incident report
[00:17:39] apergos: neodymium.. it looks ok in icinga.. but personally i can't ssh to it ..eh
[00:17:47] what?
[00:17:59] well I'm on it right now
[00:18:06] and I just sshed over about 2 mins ago
[00:18:11] !log catrope@mira Synchronized wmf-config/logging.php: Reduce Kafka timeouts (duration: 02m 13s)
[00:18:11] (CR) Subramanya Sastry: ruthenium: puppetize script to update parsoid + restart services (2 comments) [puppet] - https://gerrit.wikimedia.org/r/269559 (owner: Subramanya Sastry)
[00:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:18:19] ebernhardson: That---^^ is part one
[00:18:39] RoanKattouw, I am writing a report now, but I leave out most of the technical details that may be boring
[00:18:40] mutante: and just opened a new window and sshed in
[00:18:52] Awesome
[00:19:00] neodymium.eqiad.wmnet no typo or anything?
[00:19:17] but everything else is public on operations/puppet
[00:19:23] and phabricator
[00:19:39] Dereckson: You ready for yours?
[00:19:42] Yes, I am.
[00:19:46] Cool
[00:19:58] apergos: yea, ssh neodymium.eqiad.wmnet, thanks for confirming.. ehmm
[00:20:00] I'll start merging them, and push them out all at once
[00:20:02] Except for the schema change one
[00:20:23] k
[00:20:32] jynus: This one https://gerrit.wikimedia.org/r/#q,260541,n,z says "schema change testing required by jynus", do you know what needs to be done there?
[00:20:43] RoanKattouw: everything looks to still be happy. thanks
[00:20:53] !log catrope@mira Synchronized wmf-config/InitialiseSettings.php: Increase completion suggester replicas for busy wikis (duration: 02m 11s)
[00:20:54] apergos: it must be my connection..other issue too now, it wasn't just this host
[00:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:21:02] ebernhardson: And there goes your part 2
[00:21:04] I don't see you even trying to get in
[00:21:08] how about bastion?
[00:21:14] (CR) Catrope: [C: 2] Enable CategoryMembershipChanges on fr.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/268879 (https://phabricator.wikimedia.org/T126051) (owner: Dereckson)
[00:21:22] (CR) Catrope: [C: 2] Set nlwiki collation to uca-nl [mediawiki-config] - https://gerrit.wikimedia.org/r/268409 (https://phabricator.wikimedia.org/T125774) (owner: Merlijn van Deen)
[00:21:31] (CR) Catrope: [C: 2] Add Recherche: to wgContentNamespaces on fr.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/268630 (https://phabricator.wikimedia.org/T125948) (owner: Dereckson)
[00:21:33] (CR) jenkins-bot: [V: -1] Enable CategoryMembershipChanges on fr.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/268879 (https://phabricator.wikimedia.org/T126051) (owner: Dereckson)
[00:21:46] (CR) Catrope: [C: 2] Get rid of $wg = $wmg for BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: Dereckson)
[00:21:49] RoanKattouw: second one has no effect on prod usage, just on a maint script we run weekly from cron. Should be all happy.
[00:21:51] apergos: no more outgoing ssh for me right now, but IRC is alive :p
[00:21:55] err, on web requests at least
[00:21:58] oh dear!
[00:22:07] (Merged) jenkins-bot: Set nlwiki collation to uca-nl [mediawiki-config] - https://gerrit.wikimedia.org/r/268409 (https://phabricator.wikimedia.org/T125774) (owner: Merlijn van Deen)
[00:22:08] OK
[00:22:16] maybe it's the yubi key :-P
[00:22:20] Dereckson: Mind rebasing https://gerrit.wikimedia.org/r/#/c/268879/ ?
[00:22:26] k
[00:22:36] bd808, back to my complaint: I do not complain about scap.
I complain about doing switchovers with code deployment :-)
[00:22:45] (Merged) jenkins-bot: Add Recherche: to wgContentNamespaces on fr.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/268630 (https://phabricator.wikimedia.org/T125948) (owner: Dereckson)
[00:22:59] (CR) Ori.livneh: [C: 2 V: 2] ruthenium: puppetize script to update parsoid + restart services [puppet] - https://gerrit.wikimedia.org/r/269559 (owner: Subramanya Sastry)
[00:23:07] (Merged) jenkins-bot: Get rid of $wg = $wmg for BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: Dereckson)
[00:23:09] instead of etcd-controlled proxies
[00:23:54] but I have almost tricked Aaron into convincing him to change the architecture
[00:24:00] (PS8) Ori.livneh: Better mediawiki REPL [puppet] - https://gerrit.wikimedia.org/r/268541 (owner: EBernhardson)
[00:24:09] (CR) Ori.livneh: [C: 2 V: 2] "Tested with catalog compiler: . Looks great! Thanks for doing this!" [puppet] - https://gerrit.wikimedia.org/r/268541 (owner: EBernhardson)
[00:24:18] *let me change the architecture
[00:24:54] Is anyone actually in favor of switchover with code deployment? It's just tech debt, no?
[00:25:32] subbu: ran puppet on ruthenium; your change has been applied
[00:25:34] well, I got some complaints about "losing control" from devels
[00:25:36] RoanKattouw: actually, we can probably skip it, wgRCWatchCategoryMembership is now default everywhere, and I don't think we're going to invert that
[00:25:50] operations, Performance-Team, scap, Epic, Tracking: During deployment old servers may populate new cache URIs (tracking) - https://phabricator.wikimedia.org/T47877#2013664 (greg)
[00:25:52] ori: is root:root correct though ? as opposed to parsoid
[00:26:17] well 0555, it will work
[00:26:28] I want to convince them by something tangible rather than pure "ideas"
[00:26:57] why not etcd?
let's join the modern age
[00:27:09] what?
[00:27:13] ori, thanks. on that note, i'll head home now.
[00:27:14] I just mentioned that
[00:27:18] I'm agreeing
[00:27:24] ah
[00:27:26] Date: Mon Feb 1 09:06:15 2016 +0000
[00:27:26] Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons""
[00:27:41] mutante: it's not something that should be writable by parsoid
[00:28:01] I do not thing people are against etcd, but about infrastructure vs. code
[00:28:04] *think
[00:28:10] !log catrope@mira Synchronized wmf-config/InitialiseSettings.php: Set collation to uca-nl on nlwiki; add Recherche: to content namespaces on frwikiversity; BetaFeatures wmg->wg rename (duration: 02m 12s)
[00:28:13] mutante: otherwise it provides a path to privilege escalation, because of the sudo rule
[00:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:15] Testing.
[00:28:32] ori: hmm, agreed, yea. it was just because the commit message said parsoid and the code said root
[00:28:42] mutante: yeah, sorry for missing that
[00:28:51] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[00:29:35] how fine grained can we make access control, can we say 'for this bank of config settings, this group of users can make changes'?
[00:29:35] (PS2) Ori.livneh: Remove ob_start() from docroot/noc/conf/highlight.php [mediawiki-config] - https://gerrit.wikimedia.org/r/269562 (owner: BryanDavis)
[00:29:38] ori: thanks for merging, yay that it's puppetized now
[00:29:43] (CR) Ori.livneh: [C: 2] Remove ob_start() from docroot/noc/conf/highlight.php [mediawiki-config] - https://gerrit.wikimedia.org/r/269562 (owner: BryanDavis)
[00:30:00] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.032 second response time on port 9042
[00:30:22] (Merged) jenkins-bot: Remove ob_start() from docroot/noc/conf/highlight.php [mediawiki-config] - https://gerrit.wikimedia.org/r/269562 (owner: BryanDavis)
[00:30:23] it would be nice to see the edit statistics, to see how many edits we actually lose, or if there was a torrent of edits after the fact that compensated for it
[00:30:27] because I can see folks not wanting to lose capabilities they currently have
[00:30:32] but if we can provide those, eh
[00:31:02] RoanKattouw: 268630 and 266470 tested
[00:31:09] !log catrope@mira Synchronized wmf-config/CommonSettings.php: BetaFeatures wmg->wg rename, part 2 (duration: 02m 13s)
[00:31:10] as to edits, we might have lost some bot edits but they'll just run again. people tend to retry
[00:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:31:14] actually, current mediawiki load balancer is nice, but there are proxies with way more and better features
[00:31:37] RoanKattouw: for nl, there is an update collation script to run, but I imagine as it's a large wiki, it would be smoother to run it outside the SWAT window?
[00:31:42] +1
[00:32:05] the main issue is speed, with a proxy, code does not change, and application is immediate
[00:32:21] from 70 seconds to 0 seconds
[00:32:26] imagine application instant across the cluster
[00:32:28] sigh
[00:32:38] to dreeeeeaaam the impossible dreeeeaaammm
[00:32:45] and on that note I go to bed
[00:32:47] imagine not having to retry backups
[00:32:55] :-) good night!
[00:33:05] have a quiet rest of your timezone!
[00:33:29] (Abandoned) Dereckson: Enable CategoryMembershipChanges on fr.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/268879 (https://phabricator.wikimedia.org/T126051) (owner: Dereckson)
[00:34:20] (CR) Dereckson: "Revert change has been reverted by change Ie834d2d8865e5b1f02c626c0e38feaccd48899c8." [mediawiki-config] - https://gerrit.wikimedia.org/r/264733 (owner: Addshore)
[00:35:03] (Abandoned) Jcrespo: Final configuration after failover (db1018 as the new s2-master) [mediawiki-config] - https://gerrit.wikimedia.org/r/269389 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[00:35:45] Dereckson: the BetaFeatures wmg -> wg went well?
[00:36:01] will you apply the nl charset conversion soon?
[00:36:08] wait
[00:36:16] puppet change on mira screwed up a lot of vars
[00:36:17] legoktm: yes, as far as Special:Preferences and Special:BetaFeatures are concerned.
[00:36:18] -MEDIAWIKI_DEPLOYMENT_DIR="/srv/mediawiki"
[00:36:18] -MEDIAWIKI_STAGING_DIR="/srv/mediawiki-staging"
[00:36:18] +MEDIAWIKI_DEPLOYMENT_DIR=""
[00:36:20] +MEDIAWIKI_STAGING_DIR=""
[00:36:23] checking to see why
[00:36:26] Dereckson: oh, awesome :)
[00:36:36] :-O
[00:36:50] legoktm: if you wish to see the next extension in alphabetical order, there are some reviews to do for CategoryTree
[00:37:12] * legoktm puts it on his list
[00:37:13] who is deploying? can you stop?
[00:37:18] ori: those appear to be mentioned in the puppet-compiler you linked for my patch, although i'm not sure how my patch could affect that
[00:37:46] legoktm: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/CategoryTree+owner:%22Dereckson+%253Cdereckson%2540espace-win.org%253E%22,n,z <-- these three
[00:38:06] ori: RoanKattouw
[00:38:48] RoanKattouw It means that I requested for this to be tested before later applying a schema change to all wikis
[00:39:10] RoanKattouw: please stop, the deployment master is in a bad state
[00:39:16] I'm looking to see why
[00:40:50] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[00:42:33] ori: Stopping. Last sync was 16:31 PT
[00:43:44] jynus: OK, so does that mean I apply that change to testwiki (after ori fixes the deployment system), then you look at how well it works?
[00:44:15] Dereckson: Given that the deployment system just broke anyway, I'll run the collation script
[00:44:17] :w
[00:44:20] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[00:44:40] I have no idea how that works; Dereckson, or the person he is testing as a proxy for, knows
[00:44:54] if they do not know how to test it, schema change will be rejected
[00:44:54] TTO has added a test procedure to the bug.
[00:45:00] I asked him how to test that.
[00:45:11] then Dereckson will know how to, RoanKattouw
[00:45:26] ori, it is definitely that last patch
[00:45:44] yeah
[00:45:52] debating whether to fix or revert
[00:46:05] i'll try a quick fix
[00:46:14] and depending on how important it is, just revert to not block, then fix
[00:46:18] jynus: Dereckson: I'm actually here, by coincidence.
I was never able to connect to IRC here before, so I wasn't expecting to be on IRC right now
[00:46:30] !log Running updateCollation.php on nlwiki
[00:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[00:47:05] * RoanKattouw waves at tto
[00:47:17] tto, sorry to be strict, but if a schema change is required on many wikis, I need some guarantees
[00:47:29] Which patch are we even talking about here?
[00:47:41] I don't remember requesting any schema changes
[00:47:45] and production usually gives many surprises compared to beta/testing
[00:47:48] RoanKattouw, hey :)
[00:48:14] (PS1) Ori.livneh: Fix-up(?) for I33966cc28 [puppet] - https://gerrit.wikimedia.org/r/269574
[00:48:19] I may be confusing you with someone else
[00:48:24] tto: https://gerrit.wikimedia.org/r/#q,260541,n,z
[00:48:27] (CR) Ori.livneh: [C: 2 V: 2] Fix-up(?) for I33966cc28 [puppet] - https://gerrit.wikimedia.org/r/269574 (owner: Ori.livneh)
[00:49:23] jynus, oh right, that should only need a schema change on testwiki, unless it has already been applied there
[00:49:44] https://gerrit.wikimedia.org/r/#/c/135312/73/maintenance/archives/patch-page_lang.sql
[00:49:50] yes, it is you
[00:50:10] See Nemo's comment on the config patch: "AFAICT jcrespo wants us to verify that the schema change worked, so let's enable Special:PageLanguage."
[00:50:24] tto: jcrespo = jynus
[00:50:28] yes, jcrespo is me
[00:50:34] :-)
[00:50:46] I'm aware :) So apparently the schema change has already occurred?
[00:50:52] yes
[00:51:02] So there's no issue then, surely
[00:51:20] schema change on test -> config change on test -> rest of schema changes -> now you are on your own
[00:51:27] :-)
[00:52:02] RoanKattouw: fixed, all yours
[00:52:43] * jynus doesn't understand
[00:53:17] jynus: you don't understand me, or RoanKattouw? If me: mira is back to normal.
[00:53:22] err, me or tto
[00:53:52] I do not understand the fix, or more exactly, how it failed before
[00:54:31] (But no need to explain)
[01:00:17] one thing before I leave: there is a very small, but greater than 0 possibility of replication on db1024 failing
[01:01:26] do not freak out, it should not be in production anymore, but a page will wake me up in any case
[01:02:36] jynus: my hunch is a bug in the hiera lookup code that gets triggered by the 'require' forcing a particular order of class evaluation
[01:02:44] still boggling at it
[01:02:48] anyways, go to sleep :P
[01:03:20] I see s2 master being hit by the nlwiki maintenance script, and it holds relatively well
[01:04:52] Good to hear
[01:05:09] It waits for slaves only every 100k rows :(
[01:05:28] Ouch.
[01:05:46] Or maybe it's just every 10 batches, with the batch size being 10k
[01:05:48] In any case, I'll go do the testwiki config change now
[01:05:58] (CR) Catrope: [C: 2] Set $wgPageLanguageUseDB = true for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) (owner: TTO)
[01:06:27] RoanKattouw: Are you still SWATing?
[01:06:43] Yes
[01:06:47] Last patch
[01:06:49] RoanKattouw: jynus: the script duration could be several days if nl. is as large as fr.wikip, according to https://phabricator.wikimedia.org/T56680
[01:06:54] Puppet broke the deployment system before I could do that one
[01:06:56] (Merged) jenkins-bot: Set $wgPageLanguageUseDB = true for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) (owner: TTO)
[01:07:13] Dereckson: Thankfully I'm running it in a screen :) but it seems to be at >10% already
[01:07:14] RoanKattouw: Damn, and it's full.
[01:07:16] it is a little smaller
[01:07:26] 950k out of 8.35M
[01:07:40] James_F: What were you wanting to do?
[01:07:40] that is the new MariaDB10!
[01:07:53] RoanKattouw: Backports for VE and Cite.
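The batching behaviour RoanKattouw guesses at above (write in batches of 10k rows, pause for replicas every 10 batches, so one wait per 100k rows) can be sketched as follows. This is not the code of updateCollation.php; the function name run_in_batches and the apply_batch / wait_for_replicas callbacks are hypothetical stand-ins for whatever the maintenance script actually does.

```python
from typing import Callable, Iterable, List


def run_in_batches(
    rows: Iterable[int],
    batch_size: int,
    wait_every: int,
    apply_batch: Callable[[List[int]], None],
    wait_for_replicas: Callable[[], None],
) -> int:
    """Apply rows in fixed-size batches; return how many replica waits happened."""
    waits = 0
    batches_done = 0
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            apply_batch(batch)
            batch = []
            batches_done += 1
            # Only pause for replication every `wait_every` full batches.
            if batches_done % wait_every == 0:
                wait_for_replicas()
                waits += 1
    if batch:  # flush the final partial batch
        apply_batch(batch)
    return waits


# Toy run: 25 rows, batches of 5, wait after every 2nd batch.
written: List[List[int]] = []
waits = run_in_batches(range(25), 5, 2, written.append, lambda: None)
print(len(written), waits)  # → 5 2
```

With batch_size=10_000 and wait_every=10 the pause falls every 100,000 rows, which reconciles the two figures in the chat; over the 8.35M rows mentioned for nlwiki that would be 835 batches and 83 waits.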
[01:07:57] jynus: And that's on s2 you say?
[01:08:01] yep
[01:08:06] we could maybe close https://phabricator.wikimedia.org/T58041 if the new infrastructure performance is so great
[01:08:07] so it is a good test
[01:08:09] RoanKattouw, testing that now
[01:08:10] RoanKattouw: https://gerrit.wikimedia.org/r/#/q/status:open+branch:wmf/1.27.0-wmf.13,n,z is always informative. :-)
[01:08:16] I should take note of that for my backfillUnreadWikis.php runs
[01:08:20] tto: not yet
[01:08:24] Not sure if I have any s2 wikis in there though
[01:08:31] in the initial batch that is
[01:08:44] tto: the commit is merged, but the code is not deployed
[01:08:49] Dereckson: Not deployed yet, eh? Shows how infrequently I come to SWAT
[01:08:52] nlwiki *is* on s2
[01:09:19] jynus: Yeah, sorry I'm being confusing. I'm running a different script on a few wikis.
[01:09:34] Although I guess it writes to the global echo table in the wikishared DB so it doesn't matter
[01:09:47] * RoanKattouw tsks at bd808 for merging https://gerrit.wikimedia.org/r/269562 without deploying it
[01:10:00] well, I am seeing an increase of updates on s2
[01:10:00] * bd808 looks
[01:10:09] * RoanKattouw deploys it
[01:10:10] I have not checked the exact source of that
[01:10:17] tto: if you're curious about how that works, there is some documentation at https://wikitech.wikimedia.org/wiki/How_to_deploy_code
[01:10:18] RoanKattouw: I didn't merge, ori did :P
[01:10:20] jynus: Probably me running the collation script on nlwiki on terbium
[01:10:29] yes!
[01:10:31] Oh, sorry, yes, author != merger
[01:10:53] Anyway, yes, me running that script is because of one of tto's config patches
[01:11:02] my bad
[01:11:10] Separately from that, for Echo, I'm also running a different script that fills that global table we created a while ago
[01:11:30] But that doesn't really depend on which cluster a wiki is on very much, cause that'll only read from ${foo}wiki and write to wikishared
[01:12:15] !log catrope@mira Synchronized docroot/noc/conf/highlight.php: Remove ob_start() from highlight.php (duration: 02m 13s)
[01:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:12:28] I do not see much of that
[01:12:44] x1 master seems as loaded as usual
[01:13:17] jynus: the script reads a lot of data, but only writes when there are non-ASCII characters
[01:13:27] that would explain it
[01:13:50] I'm not running the x1 ones right now
[01:13:54] And Dutch is not a language with a lot of diacritics
[01:13:55] But I ran one overnight on wikidata
[01:13:57] Took 13h
[01:14:26] it is nice to talk to you, I think we do not usually interact, I suppose due to almost no timezone overlap
[01:14:53] !log catrope@mira Synchronized wmf-config/InitialiseSettings.php: Set $wgPageLanguageUseDB = true on testwiki (duration: 02m 14s)
[01:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:15:15] Now tto you can test.
[01:16:01] Bewdy
[01:16:01] That's SWAT done, I think
[01:16:15] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 928
[01:16:22] jynus: Yeah this happens when you're online at 2am :P you see all sorts of different people you don't normally see
[01:17:35] why was db1048 replication stopped?
[01:17:48] What is m3?
[01:17:57] * RoanKattouw only knows the sN clusters, clearly hasn't kept up
[01:18:10] miscellaneous shard 3, aka phabricator dbs
[01:18:12] jynus, RoanKattouw: oh, "Hmm, that patch was merged, what's the status now? We'll have a good opportunity to see how slow it is these days with T125774: Set nlwiki collation order to uca-nl. :)"
[01:18:24] Aha
[01:18:37] Message from MatmaRex on the slow script bug.
[01:18:43] Dereckson: How slow what is, the script?
[01:18:52] It's almost halfway
[01:19:07] Yep, all good
[01:19:16] Sorry, no
[01:19:17] 1.5M out of 8.35M
[01:19:47] apergos: All of my patches are submitted. I can't think of anything else to do right now except to extend tomorrow's scheduled phabricator maintenance window. Then during the maintenance window I will plan to do the steps outlined in T125853.
[01:20:17] OK, I'm gonna be afk for a bit, the script seems to be humming along
[01:21:24] Before https://gerrit.wikimedia.org/r/#/c/254374/ it took one week for frwiki, some minutes for bswiki
[01:21:52] this makes no sense, why db1048 was stopped
[01:22:34] tto: I'll post on the phab task when the script is done. Probably gonna be another 3-5 hours
[01:22:52] * RoanKattouw sets reminder to check the script's status in ~4h
[01:22:55] Thanks
[01:23:06] !log started db1048 replication. For some reason, replication was stopped. Need further investigation.
[01:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:23:44] operations, Release-Engineering-Team, scap, Scap3: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#2013801 (greg)
[01:41:16] !log starting pt-heartbeat on db1018
[01:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:45:27] (PS1) Jcrespo: Updating new master on codfw configuration too (just in case) [mediawiki-config] - https://gerrit.wikimedia.org/r/269581 (https://phabricator.wikimedia.org/T125215)
[01:46:06] (CR) Jcrespo: [C: 2] Updating new master on codfw configuration too (just in case) [mediawiki-config] - https://gerrit.wikimedia.org/r/269581 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[01:46:24] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Seconds_Behind_Master: 0
[01:49:37] !log jynus@mira Synchronized wmf-config/db-codfw.php: Updating new master on codfw configuration (duration: 02m 15s)
[01:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:56:55] (PS1) Jcrespo: Reducing new s2 master weight for reads [mediawiki-config] - https://gerrit.wikimedia.org/r/269585
[01:58:06] (CR) Jcrespo: [C: 2] Reducing new s2 master weight for reads [mediawiki-config] - https://gerrit.wikimedia.org/r/269585 (owner: Jcrespo)
[02:01:23] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Reducing new s2 master weight for reads (duration: 02m 15s)
[02:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:01:50] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful
[02:07:19] PROBLEM - puppet last run on mc2003 is CRITICAL: CRITICAL: puppet fail
[02:33:50] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled,
last run 15 seconds ago with 0 failures
[02:43:31] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[02:44:21] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[02:59:19] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[03:00:10] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.000 second response time on port 9042
[03:20:40] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:46:59] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[04:10:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[04:26:50] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: puppet fail
[04:48:28] (CR) Subramanya Sastry: "https://github.com/wikimedia/integration-visualdiff/commit/63aadba7dc18447db5d6f34d29e38b97a91bb417 fixes the problem that necessitated th" [puppet] - https://gerrit.wikimedia.org/r/269314 (owner: Subramanya Sastry)
[04:54:39] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[05:03:42] operations, Analytics-Wikistats, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2014030 (Peachey88) Is the redirect used all that much? Perhaps someone from #operations could have a look at the stats for it.
[05:16:35] (03PS1) 10Ottomata: Another new cdh change fix for old role [puppet] - 10https://gerrit.wikimedia.org/r/269601 [05:18:47] (03CR) 10Ottomata: [C: 032] Another new cdh change fix for old role [puppet] - 10https://gerrit.wikimedia.org/r/269601 (owner: 10Ottomata) [05:25:59] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [05:55:07] (03PS1) 10Dzahn: parsoid: create module, move files and templates there [puppet] - 10https://gerrit.wikimedia.org/r/269602 [06:02:17] (03PS2) 10Dzahn: parsoid: create module, move files and templates [puppet] - 10https://gerrit.wikimedia.org/r/269602 [06:09:57] (03PS1) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [06:11:13] (03PS2) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [06:11:41] (03PS1) 10Ottomata: Use same values of kafka.max.pull.hrs and kafka.max.historical.days [puppet] - 10https://gerrit.wikimedia.org/r/269604 [06:12:24] (03PS3) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [06:12:30] (03CR) 10Ottomata: [C: 032 V: 032] Use same values of kafka.max.pull.hrs and kafka.max.historical.days [puppet] - 10https://gerrit.wikimedia.org/r/269604 (owner: 10Ottomata) [06:15:09] (03PS4) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [06:16:31] (03CR) 10jenkins-bot: [V: 04-1] parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [06:17:54] oh uh, jenkins [06:17:58] 06:15:52 fatal: cannot create directory at 'hieradata/role/common/analytics/hadoop': No space left on device [06:18:05] that's not a good reason [06:18:40] ! 
[06:18:55] /dev/mapper/vd-second--local--disk 61G 58G 1.7M 100% /mnt [06:19:31] I deleted some stuff [06:19:31] !log integration-slave-trusty-1014 - out of disk ? jenkins voted things -1 because it had no space left on device [06:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:19:41] legoktm: on that ^ ? [06:19:43] yeah [06:19:49] :) cool, thank you [06:20:58] I wonder why the shinken alerts didn't trigger [06:20:59] /dev/mapper/vd-second--local--disk 61G 52G 5.6G 91% /mnt [06:21:06] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [06:21:07] good enough for now [06:21:19] interesting that "no disk space" turns into "puppet typos -1" :) [06:21:24] yes [06:22:12] legoktm: random guess, the shinken alerts check / but not /mnt [06:22:34] no, we got alerts for other servers earlier [06:22:35] [20:50:10] PROBLEM - Free space - all mounts on integration-slave-trusty-1017 is CRITICAL: CRITICAL: integration.integration-slave-trusty-1017.diskspace._mnt.byte_percentfree (<28.57%) [06:22:38] (in -releng) [06:22:47] hmm, ok [06:28:20] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [06:29:07] ottomata: that's your kafka change ^ [06:29:30] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail [06:29:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
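Side note on the /mnt discussion above: the question of whether monitoring looks at every mount, rather than only /, can be sketched in a few lines of Python. This is a minimal illustration and not the actual shinken/graphite check. The function names are hypothetical. The 28.57 threshold is copied from the quoted "byte_percentfree (<28.57%)" alert, and the byte figures in the note are rough readings of the df output pasted in the channel.

```python
import shutil


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def low_mounts(usage_by_mount, threshold_pct):
    """Return the sorted mount points whose free percentage is below the threshold.

    usage_by_mount maps a mount point to a (total_bytes, free_bytes) tuple.
    """
    return sorted(
        mount
        for mount, (total, free) in usage_by_mount.items()
        if percent_free(total, free) < threshold_pct
    )


def probe(mounts):
    """Collect (total, free) for each mount point using shutil.disk_usage."""
    usage = {}
    for mount in mounts:
        stats = shutil.disk_usage(mount)
        usage[mount] = (stats.total, stats.free)
    return usage
```

Calling low_mounts(probe(["/", "/mnt"]), 28.57) checks both mounts. With a healthy / and a 61G /mnt that has only about 1.7M free, only /mnt would be reported, which is the case a check limited to / would miss.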
[06:30:03] (03PS5) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [06:30:05] (03PS3) 10Dzahn: parsoid: create module, move files and templates there [puppet] - 10https://gerrit.wikimedia.org/r/269602 [06:30:07] (03PS1) 10Dzahn: parsoid::testing: usr /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [06:30:49] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:27] oo camus [06:31:28] thanks [06:31:31] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:40] merged thanks [06:31:41] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [06:31:50] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:01] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [06:32:11] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:30] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] ottomata: where did you see camus? 
[06:33:20] its in a camus property file [06:33:21] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:45] ottomata: i see it now :) [06:36:43] (03PS6) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [06:50:18] (03PS1) 10Dzahn: ssh: fix lint, indentation [puppet] - 10https://gerrit.wikimedia.org/r/269609 [06:52:14] (03PS1) 10Dzahn: pybal: fix lint, indentation [puppet] - 10https://gerrit.wikimedia.org/r/269610 [06:53:28] (03PS1) 10Dzahn: keyholder: fix lint, indentation [puppet] - 10https://gerrit.wikimedia.org/r/269611 [06:54:38] twentyafterfour: I saw that alll the patches are in, I'll try to be around for the phab window if it's not at too ridiciulous a time for me given timezones [06:55:16] it's at 01:00 thursday UTC [06:55:35] apergos: but it can't happen if I don't get the patches merged before that time [06:55:59] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:28] yep I know [06:56:38] uh 0100 means 3 am for me [06:56:42] pretty hard to do [06:56:50] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:00] I"ll likely have to pass that to an sf ops, sorry :-( [06:57:20] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:49] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] RECOVERY - puppet last run on 
mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:47] ori: if you are still around is there a chance to have that same issue come up with the variables again during deploy? [06:59:05] (03CR) 10jenkins-bot: [V: 04-1] keyholder: fix lint, indentation [puppet] - 10https://gerrit.wikimedia.org/r/269611 (owner: 10Dzahn) [07:01:02] cannot create directory at 'modules/role/manifests/labs/openstack': No space left on device [07:01:05] baaahhhh [07:01:36] hashar..... [07:01:38] sigh [07:02:12] apergos: no, it was a one-off Puppet corner case. I am going to write to the list about it, because it's an interesting puzzle. But it won't recur organically, or at least not immanently -- it is probably the sort of thing that can only happen if you re-jigger the dependency tree just-so. [07:02:34] ori: thanks, I"ll look forward to that email when I'm a bit more awake :-) [07:13:00] <_joe_> ori: shaking the puppet dependency tree? [07:13:27] _joe_: you'll see what I mean in a moment, just finishing up the e-mail [07:14:19] <_joe_> ori: oh I guess you refer to tin [07:14:40] <_joe_> tin/mira, I saw the dependency error message and escaped in terror [07:15:00] <_joe_> most of those issues are either lame or involve puppet agent/parser bugs [07:20:26] !log puppet disabled on analytics1027 til tomorrow while cron is disabled and CirrusSearchRequestSet backfills into Hadoop from kafka [07:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:25:24] <_joe_> ottomata: heya, it's super-duper late there [07:35:38] heh yeah [07:35:57] _joe_: there was weirdness at the end of my usual work day with some kafka -> hadoop import stuff that discovery team uses [07:36:04] but, i had to run [07:36:10] so, i'm fixing it now :/ [07:36:37] and, the weirdness started 7 days ago because of refinery deployment mistake i made [07:36:40] almost 7 days ago [07:36:45] and no one noticed til today [07:36:47] _joe_: of course, 
writing it up made me figure it out [07:36:49] and the data is almost expired in kafka [07:36:59] so, gotta import it into hadoop asap before it disappears! [07:37:10] <_joe_> ori: that happens all the time :) [07:37:18] <_joe_> I usually try to explain what happens to people [07:37:26] <_joe_> that usually makes me figure it out [07:38:03] it was YuviPanda's fault [07:38:11] and Puppet [07:38:15] aaannnnd bedtime finally, goodnight yalls! [07:38:31] good night, ottomata|sleep [07:38:51] <_joe_> ori: is a PS incoming? [07:38:58] <_joe_> and ofc it's always YuviPanda's fault [07:39:21] where was this email written to [07:39:34] Well, my particular folly in this case was that I had the good sense to run the change in the catalog compiler, but lacked the good sense to actually look at the results carefully [07:40:11] <_joe_> ori: well, the catalog compiler won't tell you if there is a dependency circle [07:40:20] <_joe_> since those are resolved at execution time [07:40:27] it wasn't a circle [07:40:29] cycle [07:40:38] here's the e-mail I was going to send: [07:40:44] I won't send it now because I know what the problem is [07:40:48] but see if you can figure it out :P [07:40:52] is this from the time I touched the deployment puppet code and then kindof-fucked-it-up and kindof-not and then erased from my memory? 
[07:41:08] yes [07:41:14] so I'm cleaning up a bit of space on integration-slave-trusty-1016 [07:41:25] /mnt was full [07:41:34] apergos: that was a very short hour :P [07:41:35] <_joe_> ori: oh ok, this has nothing to do with what I saw on mira yesterday [07:41:39] I didn't sleep yet [07:41:48] I think I'll keep that period of time erased from memory [07:41:54] I will do so as soon as I'm satisfied we have plenty of room back [07:41:57] it's a very subtle mistake [07:42:09] that interacts badly with Puppet DWIM magic [07:42:15] _joe_: I fixed yesterday's cycle [07:42:26] it was a bad puppet change, I reverted [07:42:40] <_joe_> apergos: oh ok [07:42:46] <_joe_> speaking of puppet [07:42:50] ori: saving your pastebin to read for later [07:43:13] let me know if you want a spoiler [07:43:21] sure why not :-) [07:43:39] <_joe_> I have to check all jessie redises [07:43:57] <_joe_> they are probably incapable of listening to the external world [07:44:33] <_joe_> well, apart from jobqueues [07:44:48] oh joy [07:45:29] <_joe_> apergos: https://phabricator.wikimedia.org/T126395 [07:45:55] apergos: https://gerrit.wikimedia.org/r/#/c/186598/ moved some of the variables to a different class, but kept the template in place. The template uses the '@' prefix to refer to the variables, so they are looked up in instance scope first. They're not there. Because Puppet is a merciful God, it looks for them in the parent scope as well. Whether or not it finds them depends on whether a class that Puppet does not know is required has been evaluated yet or not. 
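Side note: the order dependence described in the message above can be mimicked with a toy model in Python. This is only a sketch under loud assumptions. It is not how Puppet implements scopes. The class names "config" and "templated" and the variable "deploy_dir" are hypothetical. The "parent scope" is modeled as a shared dictionary that only gains the variable once the class owning it has been evaluated.

```python
class Scope:
    """A toy variable scope with fallback to a parent scope."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.vars = {}

    def lookup(self, key):
        """Look in this scope first, then walk up through the parents."""
        scope = self
        while scope is not None:
            if key in scope.vars:
                return scope.vars[key]
            scope = scope.parent
        return None  # variable silently missing, as in the incident


def compile_catalog(order):
    """Evaluate the hypothetical classes in the given order.

    Returns what the template of the 'templated' class would see
    for the hypothetical variable 'deploy_dir'.
    """
    top = Scope("top")
    rendered = {}
    for cls in order:
        if cls == "config":
            # Assumption: once evaluated, the variable becomes reachable
            # through the parent scope of the template-bearing class.
            top.vars["deploy_dir"] = "/srv/deployment"
        elif cls == "templated":
            scope = Scope("templated", parent=top)
            # The variable was moved away, so the local lookup misses
            # and the result depends on what the parent holds right now.
            rendered["templated"] = scope.lookup("deploy_dir")
    return rendered
```

In this model, compile_catalog(["config", "templated"]) finds the value, while compile_catalog(["templated", "config"]) renders the template with the variable missing. That mirrors the claim that nothing breaks until some change alters the evaluation order.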
[07:46:50] so that set up the trap, but https://gerrit.wikimedia.org/r/#/c/269574/ set it off [07:47:21] <_joe_> lol [07:47:33] because it forced the template-bearing class (and all of its resources -- that's the gotcha with 'require') to be evaluated before the class with the variables [07:47:47] good fun all around [07:48:18] _joe_: oh wow [07:48:48] There is some ill-conceived method_missing code in Puppet somewhere that is doing this [07:48:48] I literally don't remember that patch [07:48:56] * YuviPanda is happy about that [07:49:40] ori: oh wow to you too [07:49:42] amazing [07:51:28] ori: awesome tracking it down tho [07:54:00] (03PS1) 10Giuseppe Lavagetto: role::memcached: explicitly bind redis to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/269616 (https://phabricator.wikimedia.org/T126395) [07:54:45] !log cleared out a bunch of clones from /mnt/jenkins-workspace/workspace on integration-slave-trusty-1016, /mnt was full preventing jenkins from completing e.g. https://integration.wikimedia.org/ci/job/operations-puppet-typos/50458/console [07:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:55:49] most of it was mwext clones from yesterday and today though so [07:56:02] maybe there oughta be more agressive cleanup [07:56:42] I'll ping hashar about it when I wake up. 
off to bed for 1 hour [07:57:01] (03CR) 10Giuseppe Lavagetto: [C: 032] role::memcached: explicitly bind redis to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/269616 (https://phabricator.wikimedia.org/T126395) (owner: 10Giuseppe Lavagetto) [08:24:56] !log removenode restbase1007-a finished, start cassandra-a on restbase1007 for bootstrap [08:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:25:59] RECOVERY - cassandra-a service on restbase1007 is OK: OK - cassandra-a is active [08:27:14] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2014161 (10elukey) Adding a related task handled by Joe as reference: https://phabricator.wikimedia.org/T126395 I didn't check that Redis/Memcached were correctly bound to 0.0.0.0 rather t... [08:27:35] AaronSchulz: yeah global then non-commons then commons sounds good to me re: https://gerrit.wikimedia.org/r/#/c/266609/ [08:28:32] <_joe_> godog: seems a sensible schedule to me too [08:30:03] _joe_: sweet! [08:41:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:42:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:44:04] (03PS1) 10Elukey: Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance. 
Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269626 (https://phabricator.wikimedia.org/T123711) [08:45:36] ---^ bad CR forgot to remove some lines, ENOCOFFEE [08:46:31] _joe_: the alert is still too sensitive; the ones we got above were for a spike that peaked at 250 5XXs/minute, which is small enough to be caused by a single appserver hanging [08:46:53] something definitely happened, but not something worth screaming about [08:47:13] <_joe_> I agree [08:47:34] <_joe_> I think we should probably double or triple those thresholds [08:47:49] (03PS2) 10Elukey: Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance. Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269626 (https://phabricator.wikimedia.org/T123711) [08:48:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:49:10] <_joe_> ori: I'll do it today [08:49:28] 6operations, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation, and 4 others: schedule a daily run of ContentTranslation analytics scripts - https://phabricator.wikimedia.org/T122479#2014195 (10Arrbee) 5Open>3Resolved [08:49:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:50:18] (03PS3) 10Elukey: Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269626 (https://phabricator.wikimedia.org/T123711) [08:53:55] Hi everyone, Riccardo Coccioli here [08:54:32] <_joe_> hi :) [08:55:31] hi volans ! 
welcome [08:56:16] hi godog, _joe_, thanks :) [08:57:04] 10Ops-Access-Requests, 6operations: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2014199 (10Joe) 3NEW [08:58:12] (03PS1) 10Muehlenhoff: Add some CVE IDs to the changelog which were already fixed in earlier kernels [debs/linux] - 10https://gerrit.wikimedia.org/r/269628 [08:58:14] (03PS1) 10Muehlenhoff: Update to 3.19.8-ckt14 [debs/linux] - 10https://gerrit.wikimedia.org/r/269629 [09:03:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add some CVE IDs to the changelog which were already fixed in earlier kernels [debs/linux] - 10https://gerrit.wikimedia.org/r/269628 (owner: 10Muehlenhoff) [09:03:50] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail [09:03:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.8-ckt14 [debs/linux] - 10https://gerrit.wikimedia.org/r/269629 (owner: 10Muehlenhoff) [09:04:22] (03Abandoned) 10Muehlenhoff: Also amend debian/changelog with a pointer to the patch [debs/linux] - 10https://gerrit.wikimedia.org/r/264957 (owner: 10Muehlenhoff) [09:04:48] the world is a small place (especially the internetz), welcome volans! 
(I met Riccardo before) [09:05:27] thanks elukey, yeah, small world [09:05:30] <_joe_> elukey: it definitely is :) [09:06:03] hi, volans [09:06:36] hi, jynus [09:09:48] volans: welcome :-) [09:10:12] sorry, I had a maintenance until late yesterday and I am a bit sleepy yet [09:11:02] no problem [09:14:18] (03CR) 10Filippo Giunchedi: [C: 04-1] "I'd split it in two unrelated commits, depooling a memcache server will cause an impact while switching to pxe boot to jessie won't, at le" [puppet] - 10https://gerrit.wikimedia.org/r/269626 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:15:26] 6operations, 6WMF-NDA-Requests: add Riccardo to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T126429#2014235 (10Joe) 3NEW [09:16:35] 10Ops-Access-Requests, 6operations: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2014241 (10Joe) p:5Triage>3Normal [09:17:10] 6operations, 10LDAP-Access-Requests: ldap/ops membership for Riccardo - https://phabricator.wikimedia.org/T126430#2014243 (10Joe) 3NEW [09:18:14] hi moritzm [09:18:51] 6operations: Add riccardo to icinga (contact/paging/permissions) - https://phabricator.wikimedia.org/T126431#2014249 (10Joe) 3NEW [09:21:06] 10Ops-Access-Requests, 6operations: Add Riccardo to ops mailing lists - https://phabricator.wikimedia.org/T126432#2014256 (10Joe) 3NEW [09:22:24] 6operations: Add Riccardo to ops email aliases - https://phabricator.wikimedia.org/T126433#2014262 (10Joe) 3NEW [09:22:28] wasn't there like a checklist? 
[09:24:09] (03CR) 10Elukey: puppetmaster: adapt wmf-reimage to use remote salt calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [09:24:52] 10Ops-Access-Requests, 6operations: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2014270 (10Joe) 3NEW [09:24:59] yep, the checklist is on office wiki https://office.wikimedia.org/wiki/Operations/On%28Off%29boarding [09:25:21] 6operations, 10LDAP-Access-Requests: ldap/ops membership for Riccardo - https://phabricator.wikimedia.org/T126430#2014277 (10Joe) [09:25:23] 6operations, 6WMF-NDA-Requests: add Riccardo to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T126429#2014278 (10Joe) [09:25:25] 10Ops-Access-Requests, 6operations: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2014270 (10Joe) [09:25:57] <_joe_> jynus: there is, for now I'm just opening tickets in the same fashion we did for ema [09:26:02] <_joe_> to track sensitive things [09:26:25] thanks, _joe_ ! [09:26:52] <_joe_> I am unsure how easy it is to change the email address in wikitech/gerrit if we want to speed things up a bit [09:27:32] he has an LDAP account already? [09:27:45] <_joe_> jynus: nope [09:27:52] <_joe_> wikitech == ldap [09:28:43] ah, I get you know, for a future email [09:31:20] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:57] brb in 15 minutes [09:33:02] (03PS1) 10Muehlenhoff: Enable ferm on remaining jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/269634 (https://phabricator.wikimedia.org/T104972) [09:33:55] _joe_: wikitech allows me to change my email, but other services (gerrit?) 
might cache stuff [09:34:57] <_joe_> valhallasw`cloud: yeah I kinda remember that's an annoyance [09:35:04] I need to change Pywikibot Conversion Bot's email anyway (it gets added to gerrit patchsets sometimes because it's my own email address) so let me fiddle with that [09:35:10] <_joe_> ah ok [09:35:12] <_joe_> thanks a lot [09:37:26] _joe_: works like a charm! [09:37:34] <_joe_> valhallasw`cloud: ok good to know [09:37:35] _joe_: you just need to login to gerrit again to update the cache [09:37:40] <_joe_> thanks a lot :) [09:41:36] (03CR) 10Filippo Giunchedi: [C: 04-1] puppetmaster: adapt wmf-reimage to use remote salt calls (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [09:46:10] (03CR) 10Elukey: "Yep completely right, thanks for the suggestion!" [puppet] - 10https://gerrit.wikimedia.org/r/269626 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:46:28] <_joe_> godog: rb1007 is still down? [09:46:34] (03Abandoned) 10Elukey: Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/269626 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:47:13] _joe_: only the bits in icinga, ack'ing now [09:47:26] <_joe_> godog: pybal complains :) [09:47:29] ACKNOWLEDGEMENT - Restbase root url on restbase1007 is CRITICAL: Connection refused Filippo Giunchedi cassandra bootstrap in progress [09:47:30] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused Filippo Giunchedi cassandra bootstrap in progress [09:47:30] ACKNOWLEDGEMENT - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi cassandra bootstrap in progress [09:47:48] <_joe_> maybe set it to pooled=inactive in conftool might help [09:48:27] I ran this yesterday on palladium, confctl --find --action set/pooled=no restbase1007.eqiad.wmnet [09:48:39] should have it been something else? [09:48:45] <_joe_> pooled=no means it's not going to be pooled [09:48:49] (03PS1) 10Elukey: Temporary removing mc1005 from the redis/memcached pools. [puppet] - 10https://gerrit.wikimedia.org/r/269639 (https://phabricator.wikimedia.org/T123711) [09:48:52] <_joe_> but pybal will continue to check [09:49:17] <_joe_> it's ok, if you do set/pooled=inactive, pybal will remove the server from the config [09:50:22] there are spikes of job runners saying Database is in read-only mode for s2 [09:50:32] <_joe_> uhm [09:50:46] should I retry restarting them again? [09:50:59] the previous restart suceeded [09:51:24] this is mw1015 at 9:49:26 [09:51:35] 700 failed RPC calls [09:51:44] <_joe_> jynus: uhm, in logstash? 
[09:51:48] yes [09:51:50] <_joe_> I can take a look at the appserver [09:51:57] <_joe_> it's pretty strange tbh [09:51:59] the code is updated [09:52:04] I checked them directly [09:52:13] either there is some cache there [09:52:35] or they keep running and it has not restarted (which contradicts my output) [09:52:51] check exceptions at https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [09:53:14] this is important and goal-related, because it is the same thing we will do for datacenter failover [09:53:30] there is something sticky on mediawiki [09:53:48] or it could be just lag [09:54:05] <_joe_> jynus: might be, I'm logging in now [09:54:30] links get updated, I checked that yesterday [09:54:48] as in the job succeeds instantly (almost) [09:55:21] I wonder if it has to do with the apc cache that A*ron recently changed? [09:55:40] mediawiki servers now store their master in cache [09:56:00] (03CR) 10Filippo Giunchedi: [C: 031] Temporary removing mc1005 from the redis/memcached pools. [puppet] - 10https://gerrit.wikimedia.org/r/269639 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:57:45] jynus: do you want me to hold off from removing mc1005 from the redis/memcached pool until the issue is resolved? [09:58:03] but there has not been more than 3 seconds of lag on s2 since 4am [09:58:19] no, actually if anything, that would help, not make it worse [09:58:20] <_joe_> elukey: I don't think that it has anything to do with memcached/redis, tbh [09:58:59] _joe_: I figured, but didn't want to add another variable in the equation. 
If you guys are ok, I'll merge the change [09:59:28] this happened on mw1015 only [09:59:36] <_joe_> jynus: that's super-strange [09:59:48] oh no, wait [10:00:03] mw1015 has 400 cases [10:00:15] the rest have 30 or fewer [10:00:47] there is an additional problem that is making everything more confusing [10:00:56] yesterday at 22 there was a log-related commit [10:00:59] on mediawiki [10:01:42] and now kibana has issues with the log (format?) [10:02:04] so we may be missing half of the clues [10:02:14] <_joe_> yeah I was trying to understand what's happening there [10:02:19] <_joe_> I can take a look on fluorine [10:02:19] see the "Error connecting to {db_server}: {error}" [10:02:35] did they commit Tim's reformat of the logs? [10:02:44] <_joe_> that was apache's logs [10:02:54] this was something related to monolog [10:03:09] <_joe_> oh ok, no idea, we can check ofc [10:03:10] I am not very familiar, is that mediawiki's log? [10:03:15] ok [10:03:16] <_joe_> yes [10:03:27] that is not the cause, however [10:03:38] just something making things difficult to read [10:04:50] <_joe_> ok, fluorine seems to have full logs [10:05:08] let me open a ticket about this, because this happened before the failover (but it is caused by it) [10:05:52] it started when I moved some of the hosts executing rpc traffic to a 3-tier slave [10:06:00] <_joe_> btw, I see a ton of errors in wfDBError.log [10:06:38] <_joe_> related to db1023.eqiad.wmnet [10:06:54] checking [10:07:06] <_joe_> that was at 8 am utc [10:07:20] <_joe_> at the time of the peak of errors, around 9:45 UTC [10:07:37] <_joe_> I see a flock of 2016-02-10 09:45:03 mw1234 enwiki 1.27.0-wmf.12 wfLogDBError ERROR: Error connecting to 10.64.48.21: Can't connect to MySQL server on '10.64.48.21' (4) {"db_server":"10.64.48.21","db_name":"enwiki","db_user":"wi [10:08:27] that is an api issue, known, and counter-measures work automatically [10:08:31] <_joe_> but that's s1 [10:08:31] <_joe_> yes [10:08:53] <_joe_> so, when is that 
read-only issue happened? [10:09:18] periodically, it is not on db log [10:09:27] <_joe_> oh, ok [10:09:29] it is on the exception log, as it is mediawiki [10:09:52] the databases do not "get" into read-only, only mediawiki does [10:10:42] see https://logstash.wikimedia.org/#dashboard/temp/AVLKqRdhptxhN1XapEwW [10:10:53] <_joe_> yes, I thought all "db related" errors went into that bucket [10:11:03] I thought so too! [10:11:10] <_joe_> ehehe fair enough [10:11:31] I have to fix that, but mediawiki logs many things! [10:12:09] so, this is not an emergency, because it is not affecting end users [10:13:47] <_joe_> yeah I can't really explain what we're seeing then [10:14:00] we need a mediawiki expert, right? [10:15:12] <_joe_> maybe, yes, I can start reading the code and see if I can get what's wrong there [10:15:20] I can do that, too [10:15:43] let me open a proper ticket, because this will happen again on datacenter failover [10:15:54] and we need it to be fixed by then [10:16:02] <_joe_> yes [10:16:22] is there anything that you could imagine on the infrastructure side, like hhvm cache or anything? [10:16:37] could that survive a deploy/restart [10:17:56] anyway, I do not want to steal more of your time, I will keep you updated, let's focus on more imminent things [10:20:36] !log ms-be2016 / ms-be2017 swift weight to 3500 [10:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:22:36] 6operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2014346 (10mark) >>! In T126089#2011782, @chasemp wrote: > Need some guidance here outlining how we can sort out new servers with breaking up the existing shelves. This is a diagram that describes the... [10:24:00] all right, proceeding with https://gerrit.wikimedia.org/r/#/c/269639 _joe_, jynus [10:24:26] (03CR) 10Elukey: [C: 032] Temporary removing mc1005 from the redis/memcached pools. 
[puppet] - 10https://gerrit.wikimedia.org/r/269639 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [10:25:07] 6operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2014348 (10mark) As for the general plan: I think we should simply buy new, higher specced versions of the current labstore1001, with more/faster CPUs/cores, more memory (128 GB perhaps?) and 10G Ethe... [10:26:06] !log Removing mc1005.eqiad from the redis/memcached pools [10:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:35] 6operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2014349 (10mark) One thing to consider is internal drives. The current systems have those (12 of them I think), but they aren't really used, as we can't access them "from the other server" after a fail... [10:28:56] 6operations, 7Swift: add ms-be1019 / 1020 / 1021 to swift - https://phabricator.wikimedia.org/T118183#2014350 (10fgiunchedi) I've been testing ms-be1019 with weight 4000 for the last week or so, the rebalance is almost complete so things should be settling down average wait, top 10 {F3330058} https://graphi... [10:33:00] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#2014353 (10fgiunchedi) given tests at weight 4000 on ms-be1019 in {T118183} I've bumped ms-be2016 and ms-be2017 weight to 3500 to lessen the load on older nodes and get more disk utilizatio... 
[10:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 690 [10:40:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 1882916 Threads: 2 Questions: 11604686 Slow queries: 12678 Opens: 4624 Flush tables: 2 Open tables: 400 Queries per second avg: 6.163 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:44:10] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2014374 (10Joe) p:5Triage>3Low [10:44:36] <_joe_> !log removing mw1051-mw1070 from the appservers pool (T126242) [10:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:58:35] !log disabling puppet, redis and memcached on mc1005 (preparation for Jessie migration) [10:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:49] (03PS1) 10Muehlenhoff: Initial stub role for yubi 2fa auth server [puppet] - 10https://gerrit.wikimedia.org/r/269655 [11:02:08] !log enabling semi-sync replication fo s2 slaves [11:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:26] (03PS1) 10Gehel: Archiva now uses HTTPS [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/269656 (https://phabricator.wikimedia.org/T109101) [11:07:45] (03PS2) 10Muehlenhoff: Initial stub role for yubi 2fa auth server [puppet] - 10https://gerrit.wikimedia.org/r/269655 [11:08:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Initial stub role for yubi 2fa auth server [puppet] - 10https://gerrit.wikimedia.org/r/269655 (owner: 10Muehlenhoff) [11:16:33] !log restbase1007 nodetool-a setstreamthroughput 350 [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:21:12] (03PS2) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 
(https://phabricator.wikimedia.org/T124761) [11:21:37] (03CR) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [11:24:31] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2014505 (10Joe) I just depooled mw1051-69 as well, the cluster still seems unimpressed... [11:25:55] (03PS1) 10Muehlenhoff: Enable role yubiauth for auth1001 [puppet] - 10https://gerrit.wikimedia.org/r/269658 [11:26:33] (03PS2) 10Muehlenhoff: Enable role yubiauth for auth1001 [puppet] - 10https://gerrit.wikimedia.org/r/269658 [11:26:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable role yubiauth for auth1001 [puppet] - 10https://gerrit.wikimedia.org/r/269658 (owner: 10Muehlenhoff) [11:29:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "various comments inline. Code looks good" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [11:30:28] (03CR) 10Filippo Giunchedi: puppetmaster: adapt wmf-reimage to use remote salt calls (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [11:31:32] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [11:36:56] !log restbase1001 nodetool setstreamthroughput 400 [11:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:37:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 'd rather we do not redo the job that pbuilder does anyway but rechecking the directory existence. DRY principle." 
[puppet] - 10https://gerrit.wikimedia.org/r/269095 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [11:37:21] !log re-enabled puppet on mc1005.eqiad [11:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:21] 10Ops-Access-Requests, 6operations: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2014520 (10Volans) I've created an account on wikitech/labs. I've created an account here on Phabricator and signed the L3 document. [11:42:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 [11:44:53] !log restbase1001 nodetool setstreamthroughput 500 [11:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:48:47] with your permission, I want to retry restarting mw1015 job queue processing again, as it seems the one having the most problems [11:51:04] !log restarting mw1015 jobqueue/chron processing again [11:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:53:59] <_joe_> jynus: you might want to restart hhvm as well as a good measurre :) [11:54:13] true [11:54:36] actually, let me wait to see if this had an impact [11:54:50] then restart HHVM [11:55:56] (03PS1) 10Elukey: Add mc1005.eqiad back into redis/memcached pools [puppet] - 10https://gerrit.wikimedia.org/r/269660 (https://phabricator.wikimedia.org/T123711) [11:55:57] 6operations, 7Graphite, 5Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2014532 (10fgiunchedi) [11:56:06] errors stopped at 11:51 but it could be a coincidence [11:57:02] <_joe_> probably isn't [11:57:43] which would make the whole situation more strange- I would understand something going on with hhvm [11:58:20] akosiaris, how would you feel about upgrading one of the maps servers to node4.2? 
[11:58:22] but I double checked and did exactly the same, including mw1015- and I can confirm the process restarted [11:58:36] cc: mobrovac ^^ [11:59:06] 6operations, 7Graphite: add more statsdlb instances for more throughput - https://phabricator.wikimedia.org/T126447#2014535 (10fgiunchedi) 3NEW a:3fgiunchedi [12:03:52] yurik: I would feel good. We should get it done at some point [12:04:24] akosiaris, cool. I think i got kartotherian & tilerator to run on it, time to test it out :) [12:05:22] it is a bit weird at times - npm install sometimes tries to compile unexpected mapnik version and crashes, but running it again is ok. Weird. [12:06:03] (03PS3) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) [12:06:16] <_joe_> yurik: you know that is music to the hears of an opsen, right? [12:06:52] <_joe_> if you said "npm geolocates you and sends a flock of ninja assassins to your desk" it would've made npm more appealing :D [12:07:04] ahahaha [12:07:43] of course, i love unexpected behavior myself )) [12:08:06] luckly, npm install only runs once in a docker [12:14:38] 6operations, 10OTRS: upgrade iodine to jessie - https://phabricator.wikimedia.org/T105125#2014550 (10akosiaris) [12:15:38] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2014555 (10akosiaris) [12:15:40] 6operations, 10OTRS, 5Patch-For-Review, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#2014557 (10akosiaris) [12:15:42] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#2014556 (10akosiaris) [12:15:47] 6operations, 10OTRS: upgrade iodine to jessie - https://phabricator.wikimedia.org/T105125#1436914 (10akosiaris) 5Open>3declined a:3akosiaris iodine will not be migrated to jessie. 
mendelevium was instead installed in order to replace it. Closing this as declined. I've updated the task topic as well to r... [12:16:31] 6operations, 6WMF-NDA-Requests: add Riccardo to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T126429#2014560 (10Joe) I added @Volans to the #operations and #WMF-NDA projects; in doubt if he needs to be added to the acl groups though. @DZahn, any idea? [12:18:22] 6operations, 10OTRS, 5Patch-For-Review, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#2014564 (10akosiaris) 5Open>3Resolved a:3akosiaris I am closing this as resolved. We are at OTRS 5.0.6 and there is no way we are going back to 3.2.x. Issues s... [12:20:47] PROBLEM - RAID on db1021 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [12:21:09] (03PS1) 10Krinkle: varnish: Use url instead of embedded data for logo on error page [puppet] - 10https://gerrit.wikimedia.org/r/269661 [12:22:54] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#2014584 (10akosiaris) 5Open>3Resolved I am gonna close this as resolved since we know what caused it and have a mitigation in place. The actual memory leak is tracked in T126448 now. [12:25:23] ^volans, look, db1021 with a degraded RAID [12:25:44] (03CR) 10Krinkle: [C: 032] "Rolling this out ahead of other deploys so we can rely on this existing later today and tomorrow when it (might) get used and enabled."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [12:26:17] jynus: yep [12:26:50] (03Merged) 10jenkins-bot: Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [12:26:52] RAID works as intended, although it caused some slave lag for a few minutes [12:26:59] (03PS1) 10Ema: Port Varnish systemd unit file to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269664 (https://phabricator.wikimedia.org/T122880) [12:28:58] 6operations, 10OTRS, 7HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#2014594 (10akosiaris) 5Open>3Invalid A side effect of the upgrade is that this has been turned obsolete. Now OTRS is behind misc-web, treated the same as all oth... [12:33:34] !log disabling puppet on mw1001-1009, mw1011-1016 to enable ferm in batches [12:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:47] (03PS2) 10Muehlenhoff: Enable ferm on remaining jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/269634 (https://phabricator.wikimedia.org/T104972) [12:34:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on remaining jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/269634 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [12:35:12] !log krinkle@mira Synchronized w/static.php: (no message) (duration: 02m 11s) [12:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:29] 6operations: Rise in "parent, LightProcess exiting" fatals on mw1019 since 1.27.0-wmf.11 deploy - https://phabricator.wikimedia.org/T124956#2014606 (10Krinkle) I also get them on mira when deploying a file with scap sync-file: ``` [12:27 UTC] krinkle at mira.codfw.wmnet in /srv/mediawiki-staging (master%) $ syn...
[12:36:39] PROBLEM - puppet last run on mw1001 is CRITICAL: Timeout while attempting connection [12:38:19] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:42:09] !log Purged MediaWiki/wmfstatic/* metrics in Graphite (spurious test data) [12:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:17] (03CR) 10BBlack: [C: 04-1] "We need a way to conditionalize this change for v4 and not v3" [puppet] - 10https://gerrit.wikimedia.org/r/269664 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [12:45:11] 6operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2014634 (10jcrespo) 3NEW [12:45:17] PROBLEM - puppet last run on mw1002 is CRITICAL: Timeout while attempting connection [12:46:08] 6operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2014643 (10jcrespo) There is also the previous disk with a SMART alert: ``` Enclosure Device ID: 32 Slot Number: 7 Drive's position: DiskGroup: 0, Span: 3, Arm: 1 Enclosure position: N/A Device Id: 7 WWN: 5000C5003240F... 
[12:46:57] ACKNOWLEDGEMENT - RAID on db1021 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T126451 [12:46:57] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:51:29] !log restarting hhvm at mw1015 - db errors continue [12:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:57:03] (03PS2) 10Ema: Maps VCL initial forward-porting to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [12:58:20] (03PS2) 10Gehel: Rebuild logstash-gelf for Ubuntu Trusty [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/269656 (https://phabricator.wikimedia.org/T109101) [12:58:40] _joe_, godog: whenever you have a minute https://gerrit.wikimedia.org/r/#/c/269660/ [13:04:03] 6operations, 10Traffic, 5Patch-For-Review: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#2014667 (10ema) With https://gerrit.wikimedia.org/r/#/c/269664/ and https://gerrit.wikimedia.org/r/#/c/269466/ applied the following procedure allows to upgrade a maps box to Varnish 4: ech... [13:14:19] wt... [13:15:03] the video player on commons is broken for me... [13:15:23] (03CR) 10Filippo Giunchedi: [C: 031] Add mc1005.eqiad back into redis/memcached pools [puppet] - 10https://gerrit.wikimedia.org/r/269660 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [13:15:35] (03CR) 10BBlack: [C: 031] "Nevermind, I misread!" 
[puppet] - 10https://gerrit.wikimedia.org/r/269664 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [13:15:50] https://commons.wikimedia.org/static/1.27.0-wmf.8/extensions/TimedMediaHandler/MwEmbedModules/EmbedPlayer/resources/skins/kskin/images/ksprite.png?b0bc6 404 [13:16:44] 10Ops-Access-Requests, 6operations: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2014684 (10Joe) [13:16:46] 10Ops-Access-Requests, 6operations: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2014685 (10Joe) [13:17:28] Krinkle: i remember there was a change for static resources? possibly related? [13:18:40] (03PS2) 10BBlack: varnish: Use url instead of embedded data for logo on error page [puppet] - 10https://gerrit.wikimedia.org/r/269661 (owner: 10Krinkle) [13:18:51] (03CR) 10Giuseppe Lavagetto: [C: 031] Add mc1005.eqiad back into redis/memcached pools [puppet] - 10https://gerrit.wikimedia.org/r/269660 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [13:18:54] hmm, it was fixed after i cleared localStorage... [13:19:08] wmf.8 seems odd, the train is past that [13:19:12] (not with a force refresh) [13:19:45] indeed. sounds like i had old CSS content in my localStorage. [13:19:57] (03CR) 10BBlack: [C: 032 V: 032] varnish: Use url instead of embedded data for logo on error page [puppet] - 10https://gerrit.wikimedia.org/r/269661 (owner: 10Krinkle) [13:20:02] <_joe_> thedj: the session issues have gone away, right? [13:21:04] _joe_: yup, all seem solved.
A few other reports on en.wp were probably also connected to that (someone complaining about centralauth continuously logging him in again on each pageview, and another person having trouble saving edits on en.mw.wp) [13:21:20] (03PS2) 10Elukey: Add mc1005.eqiad back into redis/memcached pools [puppet] - 10https://gerrit.wikimedia.org/r/269660 (https://phabricator.wikimedia.org/T123711) [13:21:32] (03CR) 10Elukey: [C: 032] Add mc1005.eqiad back into redis/memcached pools [puppet] - 10https://gerrit.wikimedia.org/r/269660 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [13:21:47] <_joe_> thedj: I guess so; some occasional issues can happen while elukey reimages the mc* servers [13:23:34] thedj: really sorry for the issue, please feel free to ping me if you re-encounter any error [13:24:16] hey, it happens. i'm just very grateful we had graphs that allowed us to map this to the server-admin log timeline :) [13:24:19] (03CR) 10ArielGlenn: [C: 031] "I'm liking it. We (I) can always turn off aes key rotation later if constant reauths from minions turn out to be a nuisance."
[puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:24:56] !log adding mc1005.eqiad back into service (redis/memcached) [13:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:09] thedj --^ :) [13:25:12] <_joe_> elukey: 16 more to go [13:25:30] <_joe_> elukey: I guess we should merge and test wmf-reimage [13:25:44] <_joe_> apergos: I'm unsure I understand your comment about aes key rotation [13:26:18] thedj: btw I've updated https://gerrit.wikimedia.org/r/#/c/266544/ but don't really have the bandwidth now to follow up [13:26:27] the old code specifically deleted salt keys without aes master key rotation (which would force all minions to reauth on next command) [13:26:51] 6operations, 10MediaWiki-Page-editing, 5Patch-For-Review: Unexplained edit token errors - https://phabricator.wikimedia.org/T126395#2014699 (10Joe) @thedj reports his issues are over, it's extremely likely the problem had to do with this misconfiguration. Resolved. 
[13:26:52] _joe_ definitely, even though the time needed to wait for pool/depool are loooong [13:26:53] we're on a new host with better code and I don't know that we reimage so often that it would be an issue [13:27:03] so I say let's go forward [13:27:14] also the last changeset looks good and quite readable [13:27:16] (03CR) 10Filippo Giunchedi: "minor, but LGTM" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:27:16] _joe_: [13:28:19] <_joe_> well, sometimes we might want to reimage 8 hosts at the same time [13:28:24] <_joe_> think of appservers [13:28:50] <_joe_> but yeah, we can fix that later if that's needed [13:29:49] oh good catches, I don't ever put the ; but I forget to censor for them too [13:29:56] go dog [13:30:35] (03PS4) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) [13:30:47] (03PS2) 10Ema: Port Varnish systemd unit file to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269664 (https://phabricator.wikimedia.org/T122880) [13:30:50] (03CR) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:31:31] (03CR) 10Ema: [C: 032 V: 032] Port Varnish systemd unit file to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269664 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [13:32:24] (03PS1) 10Elukey: Adding support for PXE Jessie Installer to mcXXXX hosts. [puppet] - 10https://gerrit.wikimedia.org/r/269668 (https://phabricator.wikimedia.org/T123711) [13:37:18] (03CR) 10ArielGlenn: [C: 031] "time to merge?" 
[puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:37:51] (03CR) 10Filippo Giunchedi: [C: 04-1] "extra [] test" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:38:20] hahaha [13:38:26] almost! [13:39:02] (03PS3) 10Ema: Maps VCL initial forward-porting to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [13:41:56] (03Abandoned) 10Aude: Enable data type mathematical expression on wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267405 (https://phabricator.wikimedia.org/T124931) (owner: 10Llyrian) [13:45:31] not sure, godog [13:46:30] (03CR) 10ArielGlenn: puppetmaster: adapt wmf-reimage to use remote salt calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:47:18] 6operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Site: 2 hardware request for redis jobrunners - https://phabricator.wikimedia.org/T126453#2014739 (10Joe) 3NEW a:3Joe [13:47:31] oh, I might have just proved your point [13:47:33] :-D [13:47:40] 6operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Site: 2 hardware request for redis jobrunners - https://phabricator.wikimedia.org/T126453#2014739 (10Joe) a:5Joe>3None [13:48:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[13:48:28] <_joe_> godog: meh, sorry, I should just look better before resubmitting [13:48:37] (03CR) 10ArielGlenn: puppetmaster: adapt wmf-reimage to use remote salt calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:48:51] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:48:53] !log restbase1002 nodetool setstreamthroughput 500 [13:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:49:08] <_joe_> ok who forgot to merge? [13:49:11] <_joe_> ema! [13:49:30] <_joe_> also, that change seemed dangerous to me [13:50:12] _joe_: oh yes sorry! I forgot to puppet-merge [13:50:30] <_joe_> ema: you're adding some flags to the startup of varnish 3 as well [13:50:39] <_joe_> did you check this won't trigger a restart via puppet? [13:50:48] _joe_: https://puppet-compiler.wmflabs.org/1705/cp1043.eqiad.wmnet/ [13:51:36] as discussed on #-traffic, flags are just shuffled around on varnish 3 unless we overlooked something [13:51:58] <_joe_> ema: ok, as I said, you should check this won't cause a restart of varnish via the base::service_unit puppet define [13:52:01] <_joe_> you are using [13:52:09] <_joe_> I guess within varnish::instance [13:52:25] _joe_: will do, thanks! [13:53:49] (03CR) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [13:54:03] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2014767 (10BBlack) Status updates: 1. Remaining code refs: 1. https://github.com/wikimedia/mediawiki-services... 
[13:54:15] (03PS5) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) [13:55:37] <_joe_> ema: you have refresh => false, in the base::service_unit call, which makes you safe [13:55:54] <_joe_> as in - a change to the unit won't restart your service [13:56:09] _joe_: thanks for checking, I'm testing on a VM just to be 100% sure [13:57:09] <_joe_> (note the beauty of how that is expressed in puppet: https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/service_unit.pp#L122-L126 [13:57:51] wtf [13:58:29] <_joe_> ~> vs ->, because 30 years of perl didn't teach us a single thing... [13:58:44] apparently not [13:59:02] <_joe_> well, with "us" I mean puppetlabs ofc [13:59:08] not just them [13:59:14] lots of folks like them [13:59:36] all right I'm gonna be bold and once again +1 a changeset that godog will reject :-P [14:00:14] (03CR) 10ArielGlenn: [C: 031] "You'll say I should know better. But I don't. LGMT" [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [14:00:37] hahaah that actually LGTM too [14:00:44] (03CR) 10Filippo Giunchedi: [C: 031] puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [14:00:48] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#2014776 (10Joe) I will switch back to tin on monday, February 15th, and document all the needed steps on wikitech. This ticket... [14:02:43] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) 
- https://phabricator.wikimedia.org/T121882#2014779 (10Joe) I think we can get the same systems we have in eqiad, yes; for the moment, it will only serve zookeeper probably, while I figure out how to replicate... [14:03:55] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2014780 (10Joe) a:5Joe>3RobH [14:07:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [14:08:16] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [14:08:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Adding support for PXE Jessie Installer to mcXXXX hosts. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269668 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [14:08:45] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet has 1 failures [14:10:38] <_joe_> elukey: ^^ I think you missed one host [14:11:27] yeah I saw it, sorry! I might have missed it during copy/paste [14:11:31] amending the code reivew [14:11:33] *review [14:11:56] (03PS6) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) [14:12:07] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [14:12:19] I was about to say. but I guess that was the rebase. [14:12:20] yay! [14:12:44] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [14:12:46] (03PS2) 10Elukey: Adding support for PXE Jessie Installer to mcXXXX hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/269668 (https://phabricator.wikimedia.org/T123711) [14:13:16] <_joe_> now, what? [14:13:25] (03CR) 10Elukey: "Added the missing host, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/269668 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [14:13:37] failed to write new configuration file .git/config.lock [14:13:53] <_joe_> yup [14:14:02] <_joe_> sounds like another jenkins corruption [14:14:11] <_joe_> hashar is not here, either [14:15:12] yeah I know [14:15:21] this isn't space either [14:17:43] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2014804 (10jcrespo) Failover was done successfully https://wikitech.wikimedia.org/wiki/Planned_Maintenance-February_9_2016 , pending issues being tracked on T126436. [14:17:51] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#2014808 (10jcrespo) [14:17:53] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2014809 (10jcrespo) [14:17:55] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2014806 (10jcrespo) 5Open>3Resolved a:3jcrespo [14:19:40] <_joe_> hashar: https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/37611/console [14:19:47] <_joe_> what should I do to fix such errors? 
[14:20:04] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [14:20:44] good morning [14:20:46] 6operations, 6WMF-NDA-Requests: add Riccardo to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T126429#2014816 (10Joe) 5Open>3Resolved [14:20:48] 10Ops-Access-Requests, 6operations: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2014818 (10Joe) [14:20:50] 10Ops-Access-Requests, 6operations: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2014819 (10Joe) [14:21:10] _joe_: looking. Looks like the git daemon on gallium is crazy [14:21:17] oh no stderr: error: failed to write new configuration file .git/config.lock [14:21:22] gotta clear out the lock manually :( [14:21:41] _joe_: 'recheck' might get it to run on another slave [14:23:00] <_joe_> James_F|Away: ok so I was right then, I did that I think [14:23:22] (03PS1) 10Filippo Giunchedi: admin: add gehel to ops [puppet] - 10https://gerrit.wikimedia.org/r/269674 (https://phabricator.wikimedia.org/T125651) [14:23:26] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:24:20] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2014824 (10fgiunchedi) 5Resolved>3Open reopening, missing `ops` group access, see related review https://gerrit.wikimedia.org/... [14:25:51] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2014829 (10ArielGlenn) the ops group is a different level of access, that shouldn't get tcaked onto this ticket. 
[14:25:54] <_joe_> hashar: also, recheck did obtain the effect of passing the tests, but gerrit still reports jenkins-bot's -1 [14:26:56] (03CR) 10Filippo Giunchedi: [C: 031] Adding support for PXE Jessie Installer to mcXXXX hosts. [puppet] - 10https://gerrit.wikimedia.org/r/269668 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [14:27:23] (03CR) 10Elukey: [C: 032] Adding support for PXE Jessie Installer to mcXXXX hosts. [puppet] - 10https://gerrit.wikimedia.org/r/269668 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [14:27:44] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [14:27:54] (03PS7) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) [14:28:02] <_joe_> GRRR [14:28:51] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2014835 (10Krenair) Maybe T126454 [14:35:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [14:35:48] (03PS1) 10Elukey: Temporary remove of mc1006.eqiad from the redis/memcached pool for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/269677 (https://phabricator.wikimedia.org/T123711) [14:40:15] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:40:36] (03CR) 10Giuseppe Lavagetto: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [14:40:57] (03CR) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [14:46:27] (03CR) 10Elukey: [C: 032] Temporary remove of mc1006.eqiad from the redis/memcached pool for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269677 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [14:47:16] akosiaris, i suspect that this will work on 4.2 -- https://gerrit.wikimedia.org/r/#/c/269675/ [14:47:45] i will try to deploy it today, and if it works ok, we might move ahead with the upgrade tests [14:48:03] !log removed mc1006 from the redis/memcached pool for Jessie migration [14:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:15] yurik: ok [14:51:53] 6operations, 10Traffic: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2014886 (10ema) [14:53:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down! 
Giuseppe Lavagetto lvs1008 cant reach ocg1002 [14:53:58] (03PS11) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [14:58:55] (03PS4) 10BBlack: disable SPDY for all cache_text [puppet] - 10https://gerrit.wikimedia.org/r/268893 (https://phabricator.wikimedia.org/T125979) [15:00:09] yurik, akosiaris: FYI: nodejs upstream has upgraded their LTS release from 4.2 to 4.3 in their latest security release (being a rather onorthodox LTS), I'll build that tomorrow morning (but should probably not make a difference) [15:01:06] lol [15:01:07] 10Ops-Access-Requests, 6operations: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2014897 (10mark) I approve. Considering we're a week late, and we've already vetted him for this access in the interview process, let's move forward. [15:01:43] (03CR) 10Subramanya Sastry: [C: 04-1] "/usr/lib/parsoid is referenced by multiple files and they all need to be updated." [puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn) [15:01:56] jouncebot: next [15:01:56] In 0 hour(s) and 58 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T1600) [15:02:00] moritzm: wat ? [15:02:09] that sounds absurd [15:02:17] (03PS1) 10Alexandros Kosiaris: otrs: Remove OTRS templates symlink [puppet] - 10https://gerrit.wikimedia.org/r/269683 [15:02:19] (03PS1) 10Alexandros Kosiaris: otrs: Background wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269684 (https://phabricator.wikimedia.org/T125912) [15:02:21] (03PS1) 10Alexandros Kosiaris: otrs: Background wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269685 (https://phabricator.wikimedia.org/T125911) [15:02:34] !log SPDY disable for cache_text: test starts in a few minutes! 
- https://phabricator.wikimedia.org/T125979 [15:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:12] (03PS2) 10Alexandros Kosiaris: otrs: Login wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269685 (https://phabricator.wikimedia.org/T125911) [15:03:18] (03CR) 10BBlack: [C: 032 V: 032] disable SPDY for all cache_text [puppet] - 10https://gerrit.wikimedia.org/r/268893 (https://phabricator.wikimedia.org/T125979) (owner: 10BBlack) [15:03:27] (03PS1) 10Ema: Omit thread_pool_add_delay on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269686 (https://phabricator.wikimedia.org/T126206) [15:04:10] (03PS2) 10Alexandros Kosiaris: otrs: Remove OTRS templates symlink [puppet] - 10https://gerrit.wikimedia.org/r/269683 [15:04:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Remove OTRS templates symlink [puppet] - 10https://gerrit.wikimedia.org/r/269683 (owner: 10Alexandros Kosiaris) [15:04:44] akosiaris: got yours [15:04:47] (03PS2) 10Alexandros Kosiaris: otrs: Background wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269684 (https://phabricator.wikimedia.org/T125912) [15:04:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Background wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269684 (https://phabricator.wikimedia.org/T125912) (owner: 10Alexandros Kosiaris) [15:05:09] (03PS3) 10Alexandros Kosiaris: otrs: Login wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269685 (https://phabricator.wikimedia.org/T125911) [15:05:19] bblack: yeah, thanks. 
got more coming ;-) [15:05:19] I think s3-labs link broke [15:05:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Login wikimedia AgentLogo file [puppet] - 10https://gerrit.wikimedia.org/r/269685 (https://phabricator.wikimedia.org/T125911) (owner: 10Alexandros Kosiaris) [15:05:32] (replication, I mean) investigating [15:05:52] it also comes w/o further pre-notification and explanation... : "Please note that our LTS "Argon" release line has moved from v4.2.x to v4.3.x due to the security fixes enclosed. There will be no further updates to v4.2.x. Users are advised to upgrade to v4.3.0 as soon as possible." [15:06:12] BTW, akosiaris , I restarted the OTRS slave because it looked like things where okish and we still had the other 2 backups [15:06:40] didn't tell you because you where busy at the time [15:06:42] jynus: yeah, I 've been meaning to ask you about that. [15:06:46] ok thanks [15:06:51] good to know [15:07:19] wtf node [15:07:55] so it's not an LTS [15:08:40] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2014960 (10elukey) As expected, we are loosing sessions each time a host goes in/out: http://graphite.wikimedia.org/render/?width=586&height=308&target=MediaWiki.edit.failures.session_loss.... [15:08:42] did it automatically upgrade itself too? ;) [15:09:26] I doubt node people can reach those levels of perfection [15:09:59] while [ 1 ]; curl https://upgrade.my.node.js/ | bash; done [15:10:40] quiz: if somebody held a gun to your head and asked you to read a program, would you prefer it to be perl or nodejs ? [15:10:47] perl [15:10:50] definitely perl :) [15:11:24] oh I forgot a "do" in there above [15:11:36] it's only my abhorrence for perl's syntax that stops me just short of saying immediately perl as well [15:11:38] that's the difference between the real Ops and the rest of the world. I always thought that Perl is one of the few write only languages ... 
[15:11:51] it's the world first write only language! [15:12:20] s/abhorrence/flexibility/ ! [15:12:56] the write-only complaint is legitimate, though. one of the chief real complaints about perl versions <6 is that the only language spec is the interpreter code itself [15:13:15] you can make art AND science out of Perl : http://www.ozonehouse.com/mark/periodic/ [15:13:19] (and that it's factually-impossible to translate that code into a sane language spec in any regular language) [15:13:54] (03PS1) 10Muehlenhoff: Add ferm rules for noc role [puppet] - 10https://gerrit.wikimedia.org/r/269687 [15:13:56] (03PS1) 10Muehlenhoff: Enable base::firewall for mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/269688 [15:14:03] haha, I had not seen that periodic table [15:14:32] good thing about periodic tables is that you can now guess what operators we will still discover ... [15:14:39] 6operations, 10MediaWiki-Page-editing, 5Patch-For-Review: Unexplained edit token errors - https://phabricator.wikimedia.org/T126395#2014976 (10TheDJ) 5Open>3Resolved Let's also set the resolved state in that case :) [15:14:41] Perl6 is quite beautiful though [15:15:35] I will avoid criticism for now on that one. It was just released after all. That christmas!! [15:15:42] heh [15:15:48] perl5 is fine too :P [15:16:15] https://en.wikipedia.org/wiki/Perl_6#Major_changes_from_Perl_5 [15:16:50] i loved a comment I overheard from one of the FOSDEM organizers near the perl booth [15:16:59] "Perl managed to keep a table/stand here this year" [15:17:07] 6operations, 5Patch-For-Review: Ferm rules for job runners - https://phabricator.wikimedia.org/T104972#2014986 (10MoritzMuehlenhoff) 5Open>3Resolved a:3MoritzMuehlenhoff All jobrunners now have ferm enabled. 
[15:18:14] <_joe_> mark: perl talks were full, ruby were half-empty [15:18:34] that's because the young hipsters haven't heard of perl yet ;) [15:18:42] think it's a new thing [15:18:47] <_joe_> ahahah [15:18:55] if you live long enough, you get to see popular things die off and become popular all over again [15:19:19] (03PS1) 10Volans: admin: add entry for Riccardo [puppet] - 10https://gerrit.wikimedia.org/r/269690 (https://phabricator.wikimedia.org/T126434) [15:19:21] now I can start showing up at Perl conferences again and say I was there before it was uncool which was after it was cool [15:19:21] (03PS1) 10Volans: admin: Added Riccardo to the Ops group [puppet] - 10https://gerrit.wikimedia.org/r/269691 (https://phabricator.wikimedia.org/T126434) [15:19:24] o O ( Java ) [15:19:40] lol [15:19:42] <_joe_> in fact yuvi is in love with perl 6 [15:19:51] see [15:19:56] <_joe_> which kinda proves mark's point [15:21:10] perl and I didn't have to think about it even for a second (vs nodejs) [15:21:39] (03PS1) 10Alexandros Kosiaris: otrs: Fix a typo. s/default/wmf/ [puppet] - 10https://gerrit.wikimedia.org/r/269692 (https://phabricator.wikimedia.org/T125911) [15:21:55] (03PS2) 10Alexandros Kosiaris: otrs: Fix a typo. s/default/wmf/ [puppet] - 10https://gerrit.wikimedia.org/r/269692 (https://phabricator.wikimedia.org/T125911) [15:22:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Fix a typo. 
s/default/wmf/ [puppet] - 10https://gerrit.wikimedia.org/r/269692 (https://phabricator.wikimedia.org/T125911) (owner: 10Alexandros Kosiaris) [15:22:10] !log disabled puppet/memcached/redis on mc1006.eqiad [15:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:30] (03PS1) 10Dereckson: Enable Math extension on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269694 (https://phabricator.wikimedia.org/T126338) [15:25:31] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125912#2015066 (10akosiaris) 5Open>3Resolved a:3akosiaris File uploaded and setting changed. Resolving [15:25:58] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2015071 (10akosiaris) 5Open>3Resolved a:3akosiaris File uploaded and setting changed. Resolving [15:26:24] 6operations: Decom berkelium/curium? - https://phabricator.wikimedia.org/T125962#2015075 (10MoritzMuehlenhoff) Also mentioned in Monday's ops meeting notes, will file for decom unless anyone objects by the end of the week. [15:30:59] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2015090 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [15:31:30] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2015099 (10MoritzMuehlenhoff) [15:31:31] (03CR) 10Subramanya Sastry: "I don't see whare parsoid.vcl.erb is reference in the repo at all. Is it used anywhere? Is there a reference to it in the private puppet? " [puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [15:34:20] (03CR) 10BBlack: "Pretty sure it's a dead file." 
[puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [15:34:25] 6operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Site: 2 hardware request for redis jobrunners - https://phabricator.wikimedia.org/T126453#2015103 (10mark) @RobH: can you proceed to get quotes for this, if we don't have standard machines close to this spec? [15:35:51] (03CR) 10Glaisher: [C: 031] Enable Math extension on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269694 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [15:37:23] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2015108 (10demon) [15:39:51] (03CR) 10Mark Bergsma: [C: 031] swiftrepl: name-based filter for objects [software] - 10https://gerrit.wikimedia.org/r/269387 (https://phabricator.wikimedia.org/T125791) (owner: 10Filippo Giunchedi) [15:40:46] (03PS1) 10Milimetric: [WIP] Re-organize analytics dumps to their own page [puppet] - 10https://gerrit.wikimedia.org/r/269696 [15:42:54] !log gerrit: flushed all caches, things will be slow for a bit while they warm [15:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:26] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2015132 (10demon) [15:43:30] 6operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015134 (10mark) >>! In T125510#1989639, @BBlack wrote: > Should also note: while the above list of steps 1-5 sounds roughly cor... [15:43:40] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#957178 (10demon) I think I got this right. 
It's been awhile, so please let me know immediately if it's not. [15:43:55] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused [15:44:06] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:46:30] !log re-enabled puppet on mc1006.eqiad [15:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:47] (03PS1) 10Jcrespo: CREATE OR REPLACE view + new s2 master [software] - 10https://gerrit.wikimedia.org/r/269700 [15:47:37] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.002 second response time on port 9042 [15:47:47] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [15:50:24] andre__: Maybe you can take a look at the backlog of T706? [15:51:38] !log recreating labsdb heartbeat views to correctly measure lag from the new s2-master [15:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:53] 6operations, 6Performance-Team, 7Availability, 7Epic, and 3 others: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#2015154 (10mark) p:5Normal>3High [15:55:05] 6operations, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015, 3codfw-rollout-Jan-Mar-2016: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#2015160 (10mark) [15:55:34] (03PS2) 10Dzahn: parsoid::testing: usr /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [15:56:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I realized I need to add the same file as ProductionServices.php for labs too." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:56:16] (03PS3) 10Dzahn: parsoid::testing: use /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [15:58:04] (03PS1) 10Dereckson: Namespaces configuration on ru.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269702 (https://phabricator.wikimedia.org/T123837) [15:59:15] (03PS1) 10Alexandros Kosiaris: otrs: Remove the otrs cron file [puppet] - 10https://gerrit.wikimedia.org/r/269703 [15:59:33] (03CR) 10jenkins-bot: [V: 04-1] otrs: Remove the otrs cron file [puppet] - 10https://gerrit.wikimedia.org/r/269703 (owner: 10Alexandros Kosiaris) [16:00:12] (03PS2) 10ArielGlenn: allow dataset servers to have rsync access to /srv/dumps on labstore [puppet] - 10https://gerrit.wikimedia.org/r/268692 (https://phabricator.wikimedia.org/T117180) [16:00:14] (03PS4) 10Dzahn: parsoid::testing: use /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [16:00:36] (03CR) 10ArielGlenn: "note that I won't remove the more specifc stanza right now as there are jobs that rely on it." [puppet] - 10https://gerrit.wikimedia.org/r/268692 (https://phabricator.wikimedia.org/T117180) (owner: 10ArielGlenn) [16:00:39] jouncebot: you lag [16:00:58] * James_F waves. [16:01:16] jouncebot: next [16:01:16] In 3 hour(s) and 58 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T2000) [16:01:21] gj jouncebot [16:01:32] (03CR) 10ArielGlenn: [C: 032] allow dataset servers to have rsync access to /srv/dumps on labstore [puppet] - 10https://gerrit.wikimedia.org/r/268692 (https://phabricator.wikimedia.org/T117180) (owner: 10ArielGlenn) [16:01:41] jouncebot: ntpdate [16:01:53] Helpful. 
[16:01:57] morning SWAT: Brad (anomie), Chad (ostriches), Tyler (thcipriani), Mark (marktraceur), or Alex (Krenair), do the needful :) [16:01:57] thcipriani|afk: looks like SWAT is empty today [16:02:10] I'm around. [16:02:17] uh, it has things in it [16:02:19] alright, I can SWAT [16:02:19] Umm. [16:02:19] hashar: why? I added. [16:02:22] No? [16:02:24] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T1600 [16:02:30] so that is jouncebot being lazy [16:02:37] I have five patches in SWAT now. :-) [16:02:38] jouncebot, next [16:02:38] In 3 hour(s) and 57 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T2000) [16:02:40] Which require a scap, boo. [16:02:42] yeah, can never trust bots [16:02:44] jouncebot: prev [16:02:46] jouncebot: previous [16:02:50] .. [16:03:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269416 (https://phabricator.wikimedia.org/T125306) (owner: 10KartikMistry) [16:03:26] jouncebot fixes only in swat [16:03:42] Hello. [16:04:21] (03Merged) 10jenkins-bot: CX: Enable specialcx campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269416 (https://phabricator.wikimedia.org/T125306) (owner: 10KartikMistry) [16:05:16] thcipriani: maybe we should mass CR+2 on swat start [16:05:22] then check out / deploy one by one [16:05:37] this way we have CI jobs run ahead of time [16:06:31] (03CR) 10Hashar: [C: 031] Beta: Rebase mw-config submodules [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) (owner: 10Thcipriani) [16:06:48] thcipriani, James_F > I've an appointment at 17:00 UTC, would it be possible to process my two config changes before the wmf.12/wmf13 ones? [16:06:53] (03CR) 10Subramanya Sastry: [C: 031] "LGTM. So, when this is merged and there is a puppet run, does the /usr/lib/parsoid version stay behind and need to be removed via rm -rf?" 
[puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn) [16:07:00] (thanks mosh for these extra [D) [16:07:07] (03PS5) 10Dzahn: parsoid::testing: use /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [16:07:07] :) [16:07:09] (03PS7) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [16:07:11] (03PS4) 10Dzahn: parsoid: create module, move files and templates there [puppet] - 10https://gerrit.wikimedia.org/r/269602 [16:07:13] (03PS1) 10Dzahn: (WIP) parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707 [16:07:18] arrg, not what i want [16:07:26] dependencies are hard [16:07:27] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [16:07:35] 6operations, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015, 3codfw-rollout-Jan-Mar-2016: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#2015187 (10faidon) 5Open>3declined a:3faidon Resolving this in favor of #codfw-rollout & #codfw-rollo... [16:07:50] (03PS1) 10BBlack: tlsproxy: nginx keepalives param for testing [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) [16:08:03] Dereckson: :D sure I'll get yours done next. [16:08:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269702 (https://phabricator.wikimedia.org/T123837) (owner: 10Dereckson) [16:08:25] (03CR) 10Dereckson: [C: 04-1] "Superseded by I63e7046c3aced9b78d92f8a1ca09211e731be062." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: 10Mdann52) [16:08:47] Thanks. 
[16:09:24] (03CR) 10Dzahn: "@subbu yes, theoretically i could have puppetized that too, by first setting the resource to "ensure => absent" then letting puppet run, t" [puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn) [16:09:25] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable specialcx campaign [[gerrit:269416]] (duration: 02m 22s) [16:09:27] ^ kart_ check please [16:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:49] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269694 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [16:12:16] jenkins not picking up changes :( [16:12:35] thcipriani: okay [16:13:05] thcipriani: Force push it for now and fix CI later? [16:13:55] thcipriani: Looks fine. We will see the actual 'campagin' later today with train. Thanks! [16:14:03] kart_: kk, thank you [16:14:05] 6operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015200 (10BBlack) >>! In T125510#2015134, @mark wrote: > Do you think it's reasonable to not use eqiad caches at all for a whil... [16:15:06] (03Merged) 10jenkins-bot: Namespaces configuration on ru.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269702 (https://phabricator.wikimedia.org/T123837) (owner: 10Dereckson) [16:15:28] James_F: these submodule bumps have l10n changes you said? [16:15:29] (03CR) 10Subramanya Sastry: [C: 031] "lgtm, how does this affect the currently-running production service? does it get restarted or will this only shuffle around files and leav" [puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [16:15:56] thcipriani: Yeah, the Cite and Citoid ones. Follow-up to the train yesterday but didn't get into last night's SWAT because it was full. [16:16:30] ah, gotcha. 
[16:16:30] thcipriani: i18n is broken in wmf.13 already, the patches make it possible post-scap to get it working. :-) [16:16:41] So not running the scap won't make it worse. [16:16:43] But… [16:17:23] (03CR) 10Subramanya Sastry: [C: 031] "Same qn. as on the previous patch." [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [16:18:26] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespaces configuration on ru.wikisource [[gerrit:269702]] (duration: 02m 10s) [16:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:32] ^ Dereckson check please [16:18:38] (03Merged) 10jenkins-bot: Enable Math extension on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269694 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [16:18:38] Testing. [16:19:07] thcipriani: works [16:19:41] oh good: error: insufficient permission for adding an object to repository database .git/objects [16:19:52] Umm. [16:19:58] This one has been triggered sooner this day. [16:20:38] 14:21:41 < hashar> _joe_: 'recheck' might get it to run on another slave [16:20:55] (that won't be helpful for submit pipeline) [16:21:57] that's weird, permission issue went away [16:24:56] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Math extension on Wikitech [[gerrit:269694]] (duration: 02m 14s) [16:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:01] ^ Dereckson check please [16:25:38] Failed to parse (Missing texvc executable. Please see math/README to configure.): \varphi(n) =n \prod_{p\mid n} \left(1-\frac{1}{p}\right), [16:26:49] We revert and we'll redeploy it when texvc will be available on the server used by wikitech? [16:27:13] Dereckson: kk [16:27:19] I'm preparing the revert change. [16:28:22] havent we migrated out of texvc on prod? [16:28:26] in favor of mathoid [16:28:43] hashar: wikitech is special. 
[16:28:48] (03CR) 10Elukey: [C: 031] tlsproxy: nginx keepalives param for testing [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [16:28:54] hashar: (Sadly.) [16:29:07] James_F: it is getting better :-} [16:29:23] Yup! [16:30:14] (03PS1) 10Dereckson: Revert "Enable Math extension on Wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269710 [16:31:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269710 (owner: 10Dereckson) [16:32:24] (03Merged) 10jenkins-bot: Revert "Enable Math extension on Wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269710 (owner: 10Dereckson) [16:32:32] how do we not have the git hook on mira mediawiki-config? [16:32:39] what git hook? [16:32:46] the commit-id one? [16:33:30] 2016-01-28 SAL: 16:37 Krenair: Downloaded and `chmod +x`'d mira:/srv/mediawiki-staging/.git/hooks/commit-msg [16:33:54] huh..didn't let me push the revert I made...missing commit id [16:34:05] Speaking about hooks, we'd need something in git review to add change-id for revert commits. [16:34:11] oh, I think the staging dir broke since then? [16:34:21] I'm filling a bug on the OpenStrack tracker. [16:34:25] I'll do it again [16:34:36] Dereckson: that might explain what happened, no change-id for revert commits [16:34:40] ah [16:34:44] I didn't try to commit --amend [16:35:14] yep, mira:/srv/mediawiki-staging/.git/hooks/commit-msg exists and contains the expected change id script [16:35:18] probably the revert thing [16:35:40] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Enable Math extension on Wikitech" (duration: 02m 14s) [16:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:54] Dereckson, I don't think we use git-review in production, I certainly don't [16:36:06] ^ Dereckson ok, reverted. Thanks for the patch. 
[16:39:08] (revert works, plain page ... again) [16:39:14] Thanks for the deploy. [16:40:49] there's some multimedia include for wikitech somewhere in puppet that *may* be the appropriate place to put texvc [16:40:53] James_F: Okie doke. Are you fine with all your changes going out with a scap? I have them all ready to go on mira. [16:41:06] Yup. [16:42:22] (03CR) 10Ema: [C: 04-1] tlsproxy: nginx keepalives param for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [16:42:24] !log thcipriani@mira Started scap: VE, Cite, and Citoid bumps [[gerrit:269592]] [[gerrit:269593]] [[gerrit:269575]] [[gerrit:269590]] [16:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:59] 6operations, 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Deploy Mathoid for Wikitech too, or texvc as fallback - https://phabricator.wikimedia.org/T126468#2015274 (10Dereckson) 3NEW [16:46:21] 7Blocked-on-Operations, 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 5Patch-For-Review: Enable math extension on wikitech - https://phabricator.wikimedia.org/T126338#2015281 (10Dereckson) [16:46:45] Krenair: I can look at that this evening if you wish. [16:47:17] Have to go. See you later. 
[16:50:36] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2015291 (10Joe) 3NEW a:3Joe [16:51:49] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [16:51:49] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [16:52:10] got this ^^^ [16:52:21] <_joe_> urandom: ok [16:53:37] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [16:55:17] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [16:55:28] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [16:56:55] (03PS1) 10Elukey: Add mc1006 back into redis/memcached pools after maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/269715 (https://phabricator.wikimedia.org/T123711) [16:57:07] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [16:58:01] 6operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2015342 (10Joe) [17:00:15] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2015351 (10Ottomata) - UDP port 8421 from all networks Hm, I think that's the only port that eventlog1001 listens on: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/eventlogging.pp#L124 [17:00:48] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2015355 (10Ottomata) Is this duplicate of https://phabricator.wikimedia.org/T113343 ? [17:01:22] (03CR) 10Elukey: [C: 032] Add mc1006 back into redis/memcached pools after maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/269715 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [17:02:55] !log readded mc1006 back into redis/memcached pool [17:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:52] (03CR) 10Ottomata: [C: 032 V: 032] Rebuild logstash-gelf for Ubuntu Trusty [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/269656 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [17:05:27] (03CR) 10Chad: [C: 032] "Harmless" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [17:05:33] 6operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Site: 2 hardware request for redis jobrunners - https://phabricator.wikimedia.org/T126453#2015368 (10RobH) a:3RobH [17:05:36] seeing a lot of Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [111]: Connection refused for some reason [17:06:40] (03Merged) 10jenkins-bot: Correct HTML code for WMF image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [17:06:42] (03CR) 10Chad: [C: 032] Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [17:07:37] thcipriani: checking but even memcached shows errors for the first 25 mins (like https://logstash.wikimedia.org/#/dashboard/elasticsearch/memcached) [17:08:01] (03Merged) 10jenkins-bot: Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [17:08:17] should happen something like the previous spikes: http://graphite.wikimedia.org/render/?width=586&height=308&target=MediaWiki.edit.failures.session_loss.count [17:08:52] !sal [17:08:52] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
[17:09:01] 6operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2015389 (10ori) MediaWiki is currently simply writing each change to both RCStream redis instances... [17:10:37] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [17:10:59] thcipriani: Err, I didn't realize you're mid-scap. I fetched some stuff to mw-config, but haven't merged it yet. I'll wait til you're done :) [17:11:15] ostriches: thanks :) almost done [17:12:17] !log thcipriani@mira Finished scap: VE, Cite, and Citoid bumps [[gerrit:269592]] [[gerrit:269593]] [[gerrit:269575]] [[gerrit:269590]] (duration: 29m 53s) [17:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:22] ^ James_F check please [17:12:26] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [17:12:28] ostriches: done [17:12:35] Checking. [17:13:31] thcipriani: Yup, all looks good. [17:13:37] James_F: kk, thanks. 
[17:15:12] !log demon@mira Synchronized errorpages/404.html: minor html fix (duration: 02m 17s) [17:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:07] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2015410 (10mark) [17:17:17] (03CR) 10Ema: "After this change, the world-visible directories under /var/lib/puppet/ would be:" [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [17:18:33] (03PS2) 10Alexandros Kosiaris: otrs: Remove the otrs cron file [puppet] - 10https://gerrit.wikimedia.org/r/269703 [17:18:56] (03CR) 10jenkins-bot: [V: 04-1] otrs: Remove the otrs cron file [puppet] - 10https://gerrit.wikimedia.org/r/269703 (owner: 10Alexandros Kosiaris) [17:19:48] (03PS1) 10ArielGlenn: pylint dumps mirroring tool: uncamelcase everything [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269724 [17:19:50] (03PS1) 10ArielGlenn: pylint dumps mirroring tool: fix whitespace issues [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269725 [17:19:52] (03PS1) 10ArielGlenn: pylint dumps mirroring tool: error whines [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269726 [17:19:54] (03PS1) 10ArielGlenn: pylint dumps mirroring tool: line too long [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269727 [17:21:00] !log demon@mira Synchronized multiversion/MWWikiversions.php: newlines in wikiversion.json (duration: 02m 21s) [17:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:15] 6operations, 7Mail: please delete majorgifts@ and majordonors@ - https://phabricator.wikimedia.org/T126475#2015422 (10eliza) 3NEW a:3Dzahn [17:23:11] 6operations, 10MediaWiki-Logging: Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [111]: Connection refused - https://phabricator.wikimedia.org/T126476#2015429 (10thcipriani) 3NEW [17:26:58] (03CR) 10Volans: 
Display a message in motd if puppet agent is disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [17:27:17] (03CR) 10Dzahn: "@subbu it should do nothing at all, a puppet run that does nothing is the expectation" [puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [17:27:59] (03CR) 10Dzahn: "let me run it in compiler.." [puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [17:28:03] (03PS2) 10Ottomata: Make MediaWiki camus run in essential queue [puppet] - 10https://gerrit.wikimedia.org/r/269445 (https://phabricator.wikimedia.org/T125967) (owner: 10Joal) [17:28:10] (03CR) 10Ottomata: [C: 032 V: 032] Make MediaWiki camus run in essential queue [puppet] - 10https://gerrit.wikimedia.org/r/269445 (https://phabricator.wikimedia.org/T125967) (owner: 10Joal) [17:29:33] (03CR) 10Ottomata: "We have restarted jobs! Let's verify!" [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [17:32:10] (03CR) 10Ottomata: "Ori, ping! Can you +1 this?" [puppet] - 10https://gerrit.wikimedia.org/r/259596 (owner: 10Ottomata) [17:32:56] (03PS2) 10Cmjohnson: admin: add user ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/269368 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [17:34:54] 6operations, 7Mail: please delete majorgifts@ and majordonors@ - https://phabricator.wikimedia.org/T126475#2015457 (10Dzahn) Hey there, i deleted this one: ``` -majordonors: rlewis ``` that was the only "major" i could find. majorgifts@ i could not find, also not in our history, the history just has:... [17:36:59] 6operations, 7Mail: remove wikibugs-irc mail alias ? 
- https://phabricator.wikimedia.org/T123432#2015477 (10Dzahn) [17:37:01] 6operations, 7Mail: please delete majorgifts@ and majordonors@ - https://phabricator.wikimedia.org/T126475#2015475 (10Dzahn) 5Open>3Resolved majorgifts@ should be on your side: [mx1001:~] $ sudo exim4 -bt majorgifts@wikimedia.org majorgifts@wikimedia.org router = ldap_account, transport = remote_smtp... [17:38:25] 6operations, 7Mail: please delete majorgifts@ and majordonors@ - https://phabricator.wikimedia.org/T126475#2015479 (10eliza) thank you [17:38:30] 6operations, 10MediaWiki-Logging: Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [111]: Connection refused - https://phabricator.wikimedia.org/T126476#2015480 (10elukey) I have been working on replacing mcXXXX hosts with Jessie today, details in https://phabricator.wikimedia.org/T123... [17:39:29] 6operations, 10MediaWiki-Logging: Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [111]: Connection refused - https://phabricator.wikimedia.org/T126476#2015488 (10elukey) p:5Triage>3Normal [17:40:25] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2015495 (10demon) Only thing left is to deattach the LDAP account from Phabricator and attach the new one. Should setup OAuth first though so you have an... [17:41:07] thcipriani: I updated https://phabricator.wikimedia.org/T126476, I was updating Redis/memcached instances today [17:41:45] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2015496 (10Dzahn) a:5Dzahn>3None giving back to pool since it's pending approval, to be handled by on-duty person and in next... 
[17:42:31] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2015499 (10ori) * https://github.com/Netflix/dynomite * https://github.com/areina/smitty * https://github.com/Stono/redis-twemproxy-agent (updates twemproxy automatically wh... [17:45:59] (03PS1) 10Dzahn: OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 (https://phabricator.wikimedia.org/T105125) [17:53:25] 6operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015546 (10BBlack) Copying in notes from meeting etherpad, which capture some assumptions/thinking beyond what's currently in th... [17:56:57] RECOVERY - Host elastic1021 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [17:57:10] 6operations: decom iodine - https://phabricator.wikimedia.org/T126483#2015562 (10Dzahn) 3NEW [17:57:19] (03CR) 10ArielGlenn: [C: 032 V: 032] pylint dumps mirroring tool: uncamelcase everything [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269724 (owner: 10ArielGlenn) [17:57:29] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015578 (10BBlack) Note this went live circa 13:05 -> 13:15 UTC Feb 10. So far preliminary data in our graphs looks (to me!) like in the aggregate of client reque... 
[17:57:48] (03PS1) 10Dzahn: install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) [17:58:14] (03PS2) 10Dzahn: OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 (https://phabricator.wikimedia.org/T105125) [17:58:31] (03CR) 10ArielGlenn: [C: 032 V: 032] pylint dumps mirroring tool: fix whitespace issues [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269725 (owner: 10ArielGlenn) [17:58:46] PROBLEM - RAID on ms-be1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:11] (03CR) 10ArielGlenn: [C: 032 V: 032] pylint dumps mirroring tool: error whines [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269726 (owner: 10ArielGlenn) [18:00:02] (03CR) 10ArielGlenn: [C: 032] pylint dumps mirroring tool: line too long [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269727 (owner: 10ArielGlenn) [18:00:36] RECOVERY - RAID on ms-be1008 is OK: OK: optimal, 14 logical, 14 physical [18:01:12] (03PS1) 10Dzahn: decom iodine, former OTRS server [dns] - 10https://gerrit.wikimedia.org/r/269739 (https://phabricator.wikimedia.org/T126483) [18:02:38] (03CR) 10Alexandros Kosiaris: [C: 031] OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 (https://phabricator.wikimedia.org/T105125) (owner: 10Dzahn) [18:03:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "misses the removal of hieradata" [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) (owner: 10Dzahn) [18:04:48] PROBLEM - very high load average likely xfs on ms-be1008 is CRITICAL: CRITICAL - load average: 119.53, 105.57, 70.13 [18:05:26] (03PS2) 10Dzahn: install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) [18:06:22] (03PS3) 10Dzahn: OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 
(https://phabricator.wikimedia.org/T105125) [18:06:26] PROBLEM - RAID on ms-be1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:57] (03CR) 10Dzahn: [C: 032] OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 (https://phabricator.wikimedia.org/T105125) (owner: 10Dzahn) [18:08:04] (03PS24) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [18:08:07] (03CR) 10Dduvall: Puppet provider for scap3 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [18:10:01] (03CR) 10Cmjohnson: [C: 032] admin: add user ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/269368 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [18:10:43] !log iodine - schedule downtime, stop puppet, stop salt, .. [18:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:32] (03PS4) 10Cmjohnson: admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [18:12:07] akosiaris: also "shutdown -h now" for iodine? ok? 
[18:12:57] mutante: actually I need it wiped [18:13:09] it might hold PII of agents [18:13:15] and not only that [18:13:25] akosiaris: yes, it will be wiped [18:13:32] ok then [18:13:52] ok, #greenpeace :) [18:15:05] !log iodine - shutdown, decom [18:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:00] (03PS4) 10Dzahn: OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 (https://phabricator.wikimedia.org/T105125) [18:17:20] cmjohnson: looks like our patches have a rebase dance with each other :p i can do both if you want [18:17:45] mutante: yes please..thanks....have to run to the store and get a new power cable [18:17:58] cmjohnson: ok, i'll take it, cu later [18:18:21] (03PS1) 10ArielGlenn: pylint dumps mirroring tool: self use, parens, attr initialization [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269743 [18:18:45] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269736 (https://phabricator.wikimedia.org/T105125) (owner: 10Dzahn) [18:19:10] 6operations, 10ops-eqiad: Hardware problem (probably memory) on elastic1021 - https://phabricator.wikimedia.org/T125973#2015682 (10Cmjohnson) replaced DIMM and node is back dcausse cmjohnson: thanks, node is back [18:20:10] 6operations, 10ops-eqiad: Hardware problem (probably memory) on elastic1021 - https://phabricator.wikimedia.org/T125973#2015684 (10Cmjohnson) 5Open>3Resolved Return shipping information USPS 9202 3946 5301 2430 8606 47 Fedex 9611918 2393026 52352323 [18:20:39] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1707/" [puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [18:24:28] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [18:26:16] jenkins awol [18:26:56] (03CR) 10Dzahn: [V: 032] OTRS: update site.pp, mendelevium, remove iodine [puppet] - 10https://gerrit.wikimedia.org/r/269736 
(https://phabricator.wikimedia.org/T105125) (owner: 10Dzahn) [18:27:10] (03PS1) 10ArielGlenn: dumps mirroring tool, cleanup usage message method [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269746 [18:27:25] mutante: what's happening with jenkins? [18:28:09] apergos: at 09:47 my time (about 45 min ago) it still gave me a Verified +2 , after that it stopped talking to me [18:28:27] yuck [18:28:51] it's busy: https://integration.wikimedia.org/ci/ [18:29:25] (03CR) 10ArielGlenn: [C: 032 V: 032] pylint dumps mirroring tool: self use, parens, attr initialization [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269743 (owner: 10ArielGlenn) [18:30:04] it'll catch up, ya'll [18:30:05] !log puppetstoredconfigclean.rb iodine.wikimedia.org, revoke puppet cert, delete salt key on new master [18:30:08] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps mirroring tool, cleanup usage message method [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269746 (owner: 10ArielGlenn) [18:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:21] bypassing and force-merging hurts in the long run :/ [18:31:04] (03PS3) 10Dzahn: install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) [18:31:19] there's a big swath of core changes [18:31:53] test-T62720-android-emulator ..runs since 5 months ?:) [18:32:06] 6operations, 6Labs, 10wikitech.wikimedia.org: Deploy Mathoid for Wikitech too, or texvc as fallback - https://phabricator.wikimedia.org/T126468#2015716 (10Krenair) [18:32:26] oh, nevermind, wrong column [18:34:54] (03PS3) 10Dzahn: Beta: Rebase mw-config submodules [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) (owner: 10Thcipriani) [18:35:05] adds more stuff to the queue [18:35:33] from -releng: [18:35:41] 18:20 < Krink.le> !log Creating a Trusty slave to support increased demand following MediaWIki php53(precise)>php55(trusty) bump 
[18:35:45] (03PS2) 10Dzahn: Add ferm rules for noc role [puppet] - 10https://gerrit.wikimedia.org/r/269687 (owner: 10Muehlenhoff) [18:35:56] ok, cool :) [18:37:25] greg-g I'm not bypassing, jenkins doesn't actually verify on that project (yet) [18:37:46] but I did bypass yesterday, I needed a revert to go in and it took jenkins 15 minutes. I didn't wait for it [18:39:22] apergos: ah [18:39:28] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:39:40] greg-g: can you pass on that to hashar btw? we've had some full disks today breaking jenkins tests and almost certainly that swath of core changes is at fault [18:40:24] will do [18:40:49] thanks. he knows about the disk full and was trying to track it down [18:40:53] like, what set it off [18:41:44] since it's across the board I will check all the integration testing instances and proactively clean up workspaces [18:41:51] hopefully that gets us through the rest of the day [18:42:32] (03PS25) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [18:46:32] (03CR) 10Dduvall: "I've fixed up the hash literal syntax for Ruby 1.8 compatibility." [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [18:49:41] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015756 (10Gilles) Are you sure this wasn't 15:10 UTC? Isn't that when the patch was merged? [18:53:04] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015771 (10BBlack) Yup, sorry, thinko while translating timezones. Updated above too! 
[18:53:54] Heads up, I got two people reporting they can't see thumbnails on pages, developing now [18:55:37] (03PS5) 10Cmjohnson: admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [18:56:38] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2015792 (10Steinsplitter) {F3330695} Logo looks like this. Shouldn't it be transparent? [18:57:35] there's a small increase in the background 500 and 503 status rates on cache_upload, could be related [18:58:14] but normal background rate and the small increase are all in the ~0.5-2.0/sec rate band globally, which is a tiny tiny tiny fraction of raw requests [18:58:14] akosiaris: any chance you are around and have the bandwidth to do one final pass on that scap3 provider? [18:58:33] apergos: I already did [18:58:51] I mean after the last commit about 10 mins ago [18:58:59] ah, ok [18:59:05] and thanks for your earlier review too [18:59:48] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015813 (10Krinkle) Last 12 hours compared to same time last week. Seems starting in the hour after 15:00 (red mark) there is a noticeable regression. {F3330702 s... 
[19:00:14] and cache_upload 4xx rates look fairly normal and un-disturbed [19:00:23] (03CR) 10Alexandros Kosiaris: [C: 031] Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [19:00:45] heh, that 1.8 syntax hash thing dan caught would have been a PITA [19:00:57] if this is really just breaking in the past few minutes, though, it may take a few more for stats to catch up [19:01:14] I just wish we could manage to get our puppetmasters on jessie soon [19:02:16] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2015827 (10EBernhardson) It would be good to evaluate https://github.com/sonian/elasticsearch-jetty and compare that to how an nginx proxy setup woul... [19:03:56] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2015848 (10akosiaris) That's probably for T125912 [19:04:30] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125912#2000454 (10akosiaris) From IRC ``` RD: Thanks for updating the logos- Will have to fix the top one to match the background - once I get the file I'll update the bug ``` [19:04:38] godog: there's a swift box with "high load, probably xfs" again, should i restart ? 
ms-be1008 [19:04:50] !log restbase disabled puppet in staging, testing brotli compression which requires JAVA_OPTS tuning [19:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:06] (03PS1) 10Chad: group1 wikis to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269752 [19:08:30] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015878 (10BBlack) I'd say some of those graphs, they fluctuate so much we need more data to confirm the pattern. But the TTFB, DOM-Complete, and onLoad ones cert... [19:08:51] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2015879 (10EBernhardson) @joe @mark Could anyone clarify the concerns around using a per-server nginx proxy for SSL termination? It was mentioned in... [19:10:32] MarkTraceur: any further news re thumbnails? [19:10:40] Not currently [19:14:02] (03PS1) 10Dzahn: delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) [19:18:05] (03PS2) 10Dzahn: delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) [19:18:44] (03CR) 10Dzahn: [C: 032] Beta: Rebase mw-config submodules [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) (owner: 10Thcipriani) [19:19:18] (03CR) 10Dzahn: [C: 032] Add ferm rules for noc role [puppet] - 10https://gerrit.wikimedia.org/r/269687 (owner: 10Muehlenhoff) [19:19:33] jenkins is better [19:19:41] (03PS3) 10Dzahn: Add ferm rules for noc role [puppet] - 10https://gerrit.wikimedia.org/r/269687 (owner: 10Muehlenhoff) [19:20:00] (03CR) 1020after4: [C: 031] "can we get a +2 please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [19:20:16] (03PS26) 1020after4: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [19:20:49] apergos: jenkins got to my stuff now.. fyi [19:21:25] good [19:21:41] twentyafterfour: give me 5 minutes and I'm here [19:21:42] (03PS4) 10Dzahn: install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) [19:21:49] I'm going ci instance space cleanup [19:23:19] apergos: awesome [19:24:03] might be 15 mins, the loop is going to take a little time [19:26:13] bblack: Sorry, he says incognito mode works so it must have been a cache problem [19:26:26] !log ms-be1008 - powercycling - the known XFS issue [19:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:27] (03CR) 1020after4: "since nothing actually references this provider (yet) it should be safe to deploy it. Next in line is https://gerrit.wikimedia.org/r/#/c/2" [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [19:28:17] (03CR) 10Krinkle: "Causes T126498 on new slaves." [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [19:28:40] (03PS2) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. 
[puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) [19:29:10] (03CR) 10Krinkle: "(Actually removed from CI puppetmaster now, it wasn't removed before)" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [19:29:19] RECOVERY - very high load average likely xfs on ms-be1008 is OK: OK - load average: 10.03, 2.50, 0.83 [19:30:48] RECOVERY - RAID on ms-be1008 is OK: OK: optimal, 14 logical, 14 physical [19:33:33] (03PS3) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) [19:40:42] (03CR) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) (owner: 1020after4) [19:42:53] twentyafterfour: 5 more minutes [19:43:10] (03PS1) 10Dereckson: Separate math and TeX packages class [puppet] - 10https://gerrit.wikimedia.org/r/269758 [19:44:04] (03CR) 10Dzahn: "thanks ottomata, will remove it" [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [19:44:24] (03PS1) 10Ottomata: Add $check_jar and $camus_jar parameterize to camus::job [puppet] - 10https://gerrit.wikimedia.org/r/269759 [19:45:36] !log did cleanup across all integration slaves, some were very close to out of room. 
results: https://phabricator.wikimedia.org/P2587 [19:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:59] hashar, fyi ^^ [19:46:09] twentyafterfour: ok, I'm now yours [19:47:11] (03CR) 10ArielGlenn: [C: 032] Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [19:47:45] can't merge til jenkins [19:47:51] so now we wait a while, it's backed up [19:47:53] (03PS1) 10Dereckson: Add texvc to role::labs::openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/269762 (https://phabricator.wikimedia.org/T126468) [19:48:47] apergos: ty [19:49:02] yw [19:49:05] btw, antoine and timo are working on adding more executers over in -releng [19:49:11] might have to do this again tomorrow morning [19:49:19] that would be awesome too [19:49:28] usage spikes are a killer [19:49:47] yeah, everybody's happy to have 5.5 :) [19:51:13] 6operations, 7Mail: Remove exim aliases (chris, cjohnson) - https://phabricator.wikimedia.org/T126505#2016071 (10JKrauska) 3NEW a:3Dzahn [19:52:28] jouncebot: refresh [19:52:30] I refreshed my knowledge about deployments. [19:52:34] jouncebot: next [19:52:34] In 0 hour(s) and 7 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T2000) [19:52:37] wow [19:53:05] might have missed a couple integrtion instances, I didn't see signed puppet certs for them [19:53:08] going back to check now [19:54:55] 6operations, 7Mail: Remove exim aliases (marc, mpelletier) - https://phabricator.wikimedia.org/T126507#2016096 (10JKrauska) 3NEW a:3Dzahn [19:55:17] indeed, trusty 1011-1013 and 1016 [19:55:23] doing those now [19:56:57] 6operations, 7Mail: Remove exim alias (atglenn) - https://phabricator.wikimedia.org/T126508#2016104 (10JKrauska) 3NEW a:3Dzahn [19:57:16] I never use(d) that alias anyways. evah [19:57:17] jenkins: mode +leroy [19:57:35] brown? 
[19:58:03] apergos: how about this one? ops-dumps: ariel [19:58:05] 6operations, 7Mail: Remove exim aliases (dz, daniel.zahn, mutante) - https://phabricator.wikimedia.org/T126509#2016112 (10JKrauska) 3NEW a:3Dzahn [19:58:12] used all the time!! [19:58:14] do not remove please [19:58:19] (03PS2) 10Dereckson: Separate math and TeX packages classes [puppet] - 10https://gerrit.wikimedia.org/r/269758 [19:58:20] ops-dumps is where all the crons go [19:58:26] so they don't get lost in general cronspam [19:58:56] apergos: reference https://en.wikipedia.org/wiki/Leeroy_Jenkins [19:59:07] maybe there are other aliases with my name on em? [19:59:43] https://www.youtube.com/watch?v=QvwDohEEQ1E that was mine :-P [20:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T2000). Please do the needful. [20:00:08] apergos: it's not just about deleting aliases, it's also about moving them over to OIT [20:00:22] so "delete" means "on our side" but not always entirely [20:00:25] sure, but if you can remove any I don't need [20:00:27] even better [20:00:28] as in gone gone [20:00:31] yep [20:00:38] the atglenn can go go [20:00:39] 6operations, 7Mail: Remove exim alias (wikipedia10) - https://phabricator.wikimedia.org/T126511#2016132 (10JKrauska) 3NEW a:3Dzahn [20:00:50] lol, ^ we better add wikipedia15 [20:00:59] nahhh [20:01:01] :-D [20:01:06] Lessss go [20:01:07] add wikipedia20, be proactive! [20:01:08] no, wait.. wikipediafifteen@ sorry [20:01:32] (03CR) 10Chad: [C: 032] group1 wikis to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269752 (owner: 10Chad) [20:01:54] (03PS4) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. 
[puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) [20:02:06] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269752 (owner: 10Chad) [20:03:44] !log demon@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.13 [20:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:03] twentyafterfour: just leave it for now, let's see if jenkins gets to it in a timely fashion at all [20:04:28] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures [20:05:18] (03PS1) 10Dereckson: Revert "Revert "Enable Math extension on Wikitech"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269770 [20:05:42] (03PS4) 10Dzahn: admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [20:06:10] The heck? [20:06:37] ostriches: ? [20:06:57] Warning: fopen() expects parameter 1 to be string, array given in /srv/mediawiki/multiversion/vendor/wikimedia/cdb/src/Reader/PHP.php on line 69 [20:07:04] Notice: Array to string conversion in /srv/mediawiki/multiversion/vendor/wikimedia/cdb/src/Reader/PHP.php on line 71 [20:07:12] I have a guess. 
[20:07:29] (03PS1) 10Chad: Revert "Add \n to wikiversions.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269772 [20:07:32] apergos: looks like an hour wait for CI [20:07:36] (03CR) 10Chad: [C: 032 V: 032] Revert "Add \n to wikiversions.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269772 (owner: 10Chad) [20:07:55] twentyafterfour: well I'm here for awhile yet [20:08:04] might not be good at 3 am but prolly good til close to that [20:08:45] and I mighta broke someone's jenkins job with my clenaup, find or not [20:08:53] but they'll resubmit it I guess [20:10:00] 7Puppet, 10Continuous-Integration-Infrastructure: Need a better way of testing puppet patches for contint/integration stuff - https://phabricator.wikimedia.org/T126370#2016189 (10scfc) The use case fits what Puppet calls "environments" (which I think is where `production` for the default branch comes from as t... [20:10:23] !log demon@mira Synchronized multiversion/MWWikiversions.php: rm newline addition (duration: 02m 19s) [20:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:35] !log demon@mira rebuilt wikiversions.php and synchronized wikiversions files: rebuild [20:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:39] Nope :\ [20:12:57] I've not seen this one before. [20:13:00] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2016208 (10RobH) [20:13:17] ostriches: ori was hunting and killing some of those errors on Monday I think. 
See https://gerrit.wikimedia.org/r/#/c/269330/ [20:13:48] Related to $wgInterwikiCache changing to a php file [20:14:03] This is in multiversion [20:14:14] (03CR) 1020after4: "I stole your changes to scap::target and pulled them out into a separate patch: https://gerrit.wikimedia.org/r/#/c/269560" [puppet] - 10https://gerrit.wikimedia.org/r/269143 (owner: 10Thcipriani) [20:14:17] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#2016238 (10RobH) [20:14:26] 6operations, 10Wikimedia-Mailing-lists: Add @dpatrick to ops mailing list - https://phabricator.wikimedia.org/T121441#2016241 (10RobH) [20:14:40] ostriches: that's where the CDB lib is loaded from for all of MW [20:14:51] because of autoloader ordering [20:15:07] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#2016260 (10RobH) [20:15:09] apergos: woo it actually verified https://gerrit.wikimedia.org/r/#/c/262742/ [20:15:10] I guess that patch should go into wmf.13 too? [20:15:38] oh, was it only a backport? [20:15:54] oh for cripes [20:15:57] gotta rebase it [20:15:59] grrr [20:16:04] (03PS5) 10Dzahn: admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [20:16:09] but maybe it will go faster now [20:16:13] ostriches: yeah it missed the branch cut apparently -- https://gerrit.wikimedia.org/r/#/c/269329/ [20:16:24] Yeah, let's port over to wmf.13 [20:16:28] On it [20:16:34] sweet [20:16:43] (03PS27) 10ArielGlenn: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [20:17:27] !log cleaned up integrations slave trusty 1001,10012,10013, 1016, missed in first round. 
[20:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:53] (03CR) 10Dzahn: [C: 031] Enable base::firewall for mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/269688 (owner: 10Muehlenhoff) [20:17:58] ugh... by the time it's tested again it'll need another rebase at the rate things are going [20:17:59] (03PS2) 10Dzahn: Enable base::firewall for mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/269688 (owner: 10Muehlenhoff) [20:18:22] gah some are at 40 or 50% use already, with only 3 hour old stuff in there [20:18:25] bad news [20:18:46] math extension on wikitech, yea [20:18:57] mutante: can you not merge stuff in puppet for a few minutes? cause [20:18:58] waits for all the formulas :) [20:19:07] apergos: i cant anyways [20:19:09] otherwise we're in rebase wars and this patch is taking jenkins for evah [20:19:16] heh ok [20:19:16] same here [20:20:36] Dereckson: do we actually expect math on wikitech, or is this just to reduce the difference to production [20:21:28] !log demon@mira Synchronized php-1.27.0-wmf.13/includes/interwiki/Interwiki.php: fix cache stuff (duration: 02m 18s) [20:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:24] mutante: Analytics Team wants it, for example for https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitization [20:24:21] Dereckson: ah:) ok! i'll do that later and add those packages, just waiting a bit because jenkins and letting apergos go first [20:24:31] !log tc per client shaping for labstore1001 test [20:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:10] k [20:25:40] (03PS1) 10Yurik: (WIP) Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 [20:26:37] any puppet gurus around? ^ need help with that [20:26:52] bd808: That was it, obviously. 
Since I stopped complaining :p [20:27:19] ^^ should be simple if you know how to convert an object to json string in puppet [20:28:39] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2016368 (10Gehel) Discussion indicates that a local nginx reverse proxy for SSL termination is a non optimal solution. I do not understand why. Seems... [20:29:13] ostriches: cool. glad I could remember that happening :) [20:29:25] while I wait might as well make some dinner [20:30:10] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:35:19] (03CR) 10Cmjohnson: [C: 032] admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [20:35:28] (03PS6) 10Cmjohnson: admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [20:35:44] 6operations, 7Mail: Remove exim alias: mmi - https://phabricator.wikimedia.org/T126520#2016390 (10JKrauska) 3NEW a:3Dzahn [20:36:05] (03PS2) 10Yurik: (WIP) Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 [20:40:32] 6operations, 7Mail: Remove exim aliases (marc, mpelletier) - https://phabricator.wikimedia.org/T126507#2016427 (10Dzahn) ``` -# Marc-Andre Pelletier -marc: marc@uberbox.org -mpelletier: marc@uberbox.org - ``` done [20:40:54] 6operations, 7Mail: Remove exim aliases (marc, mpelletier) - https://phabricator.wikimedia.org/T126507#2016428 (10Dzahn) [20:40:55] cmjohnson1: can you hold off on that merge til after this phab scap3 thing gets done? [20:41:03] yep [20:41:07] 6operations, 7Mail: Move most (all?) 
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2016433 (10Dzahn) [20:41:09] 6operations, 7Mail: Remove exim aliases (marc, mpelletier) - https://phabricator.wikimedia.org/T126507#2016432 (10Dzahn) 5Open>3Resolved [20:41:10] we've been in the rebase wars for a little while now and slow jenkins is slow [20:41:15] sorry about that [20:41:32] yeah..I need to hold off on that merge anyhow so no issue [20:41:37] whew [20:41:59] I'm sittin here doing the reload dance. come on jenkins [20:42:15] if anyone had jenkins as a stalkword they would be very sorry right about now :-D [20:45:14] 6operations, 7Mail: Remove exim aliases: (nitika, noopur, shiju, subhashish) - https://phabricator.wikimedia.org/T126523#2016454 (10JKrauska) 3NEW a:3Dzahn [20:45:21] Jenkins went overloaded since we mass switched to php55 [20:45:22] Jenkins isn't slow. Zuul queue is just backed up because there is not enough capacity. We're adding new worker nodes now. [20:46:01] it is merely a load shift from Precise to Trusty, but we havent provisioned enough new slaves to take in account the load change [20:46:21] also change receiving CR+2 have higher priority than the test changes [20:46:36] hashar: want me to send a mail to wikitech-l about this? [20:47:05] greg-g: yup sure! [20:47:15] 12:49 Code Review - Error [20:47:15] 12:49 Server Unavailable [20:47:16] kk, will do [20:47:18] ^ gerrit ? [20:47:30] wfm [20:47:35] mutante: yup that happens from time to time [20:47:41] papaul: ^ [20:47:48] mutante: got [20:47:50] it [20:47:55] it's unrelated though [20:48:08] papaul: sometime when doing actions in Gerrit you might get the gray box of "server unavailable". Just [OK] it and retry :} [20:48:12] no idea what is the root cause [20:48:25] hashar: thanks [20:50:33] (03CR) 10Mobrovac: [C: 04-1] (WIP) Add allowedDomains param to graphoid config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [20:51:33] hashar: No cause really. 
Poor error handling with gerrit's callbacks to its rest api mostly. [20:51:40] ie: random rest callback fails, blow up. [20:53:29] (03PS3) 10Yurik: (WIP) Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 [20:56:05] Krinkle: I've been watching the slaves get added on the integration page [20:56:52] ostriches: so most probably nothing to worry about on Gerrit server side ? [20:57:03] saw a precise host disappear at about the same time a new trusty instance showed up so... :-) [20:57:27] (03PS1) 10Ottomata: Rename analytics_new to analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/269825 (https://phabricator.wikimedia.org/T109859) [20:57:29] (03PS1) 10Ottomata: Add hue and client classes (these got lost?) [puppet] - 10https://gerrit.wikimedia.org/r/269826 (https://phabricator.wikimedia.org/T109859) [20:58:58] (03CR) 10Papaul: [C: 031 V: 031] admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160210T2100). [21:01:55] apergos, is is testing time today? [21:02:03] uh [21:02:04] heh [21:02:09] I wish [21:02:20] (03CR) 10Mobrovac: "Hmmm, the puppet compiler fails to compile it: https://puppet-compiler.wmflabs.org/1708/scb1001.eqiad.wmnet/change.scb1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [21:02:21] but I'm kind of in jenkins scap3 hell right now [21:02:33] when do you deploy, subbu? [21:02:58] (03CR) 10Ottomata: [C: 032 V: 032] Rename analytics_new to analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/269825 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:03:18] apergos, about to start shortly. 
[21:03:21] (03CR) 10Ottomata: [C: 032 V: 032] Add hue and client classes (these got lost?) [puppet] - 10https://gerrit.wikimedia.org/r/269826 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:03:24] (03Abandoned) 10RobH: new ticket.wikimedia.org certificate (renewal replacment) [puppet] - 10https://gerrit.wikimedia.org/r/260785 (owner: 10RobH) [21:03:28] apergos, but, we can do the test wednesday too. [21:03:39] in theory I have the fixes in but in practice I don't have the bandwidth right now [21:03:44] so tomorrow is better [21:04:18] (03PS28) 10ArielGlenn: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [21:04:21] hey ottomata [21:04:28] we're kind of in rebase hell on this scap3 thing [21:04:36] can you hold off on further puppet merges til we get this in? [21:04:46] it's in part cause jenkins is backed up [21:04:57] well zuul [21:04:59] apergos, sounds good. [21:05:13] been trying to get it in for an hour I guess now [21:05:49] huh. more than an hour. ugh [21:07:40] * twentyafterfour resists the urge to make a twss joke... oh wait [21:07:44] elukey: hmm. [21:07:46] "Login on Commons is also not working, as of 5 minutes ago. It told me I was centrally logged in and then that I had to login. I was logged in here on WP. When I attempted to login on Commons, I got sent to the main page 3 times. I finally gave up and left." [21:08:02] bd808, tgr, anomie ^ [21:08:08] apparently some people still have some session errors.
[21:08:21] this is from WP:VP/T [21:09:21] 6operations, 7Mail: Remove exim aliases (dz, daniel.zahn, mutante) - https://phabricator.wikimedia.org/T126509#2016582 (10Dzahn) ``` -# Daniel Zahn -dz: dzahn -daniel.zahn: dzahn -mutante: dzahn ``` heh, yep, done :) [21:09:32] nothing too strange here yet for session errors -- https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen [21:09:33] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2016585 (10Dzahn) [21:09:35] 6operations, 7Mail: Remove exim aliases (dz, daniel.zahn, mutante) - https://phabricator.wikimedia.org/T126509#2016584 (10Dzahn) 5Open>3Resolved [21:09:48] I managed to log into Commons without issue... [21:10:00] !log starting parsoid deploy [21:10:00] commons and loginwiki would have just gotten wmf.13 and session manager [21:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:08] what do people do for "pages", i.e. to get a message with a persistent notification? [21:11:00] anomie: yeah, my pre-existing session gets into commons fine too so not a catastrophic failure certainly [21:11:18] I can log in both directly and via autologin [21:11:40] sorry, I think this is rollback-worthy if there is independent confirmation [21:11:45] apergos: ah, ahha [21:11:47] maybe! ~:) [21:11:52] is it just that one report or are there any others? [21:11:53] apergos: legoktm tells me that we can just merge instead of waiting for jenkins? [21:11:59] ori: one report of one login failing? [21:12:03] * apergos gives ottomata the ole hairyeyeball [21:12:04] ori: You could help by trying to reproduce it instead of just complaining. 
[21:12:10] twentyafterfour: [21:12:18] believe it if we don't get this one in I shall [21:12:42] re-running the tests for a simple rebase isn't going to catch anything anyway [21:12:45] ori: only one report an hour ago [21:12:46] bd808: one report of one login failing immediately after a branch-update which changed a lot of login code suggests that there is a bug you are not aware of that is affecting an unknown number of people. [21:12:52] bd808: I guess the interesting scenario is preexisting central login with no commons session [21:12:56] and the patch is 100% standalone, no dependencies on other code in ops/puppet [21:13:00] no idea how to test that :( [21:13:28] I was trying to be nice because as greg-g says, bypassing and force-merging hurts in the long run [21:13:33] but I'm almost out of nice [21:14:18] and as I say that, jenkins shows up :-D [21:14:23] !log synced code + restarted parsoid on wtp1001 as a canary [21:14:26] ori: the problem with rollbacks is that they don't get you any closer to figuring out what's wrong [21:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:08] anomie: this one is climbing in logstash: `Session "{session}": Unverified user provided and no metadata to auth it` [21:16:07] apergos: for this particular patch it couldn't actually catch any new error because the patch is completely independent of the rest of the repo... [21:16:17] yeah I know [21:16:20] but. [21:16:39] tgr: a rollback is something you resort to, not a CI strategy you adopt. I am not suggesting this is the way to do this; I am suggesting it is the way to deal with the current issue [21:16:51] bd808: That may be expected, people with non-central sessions (or expired central sessions) who didn't check "remember me" are going to have unverified users and no metadata in SessionManager. [21:16:53] !log CI dust have settled.
Krinkle and I have pooled a lot more Trusty slaves to accommodate for the overload caused by switching to php55 (jobs run on Trusty) [21:16:54] (03CR) 10Legoktm: "Erm...it was removed, and then I put it back after fixing the death loop of doom (sorry for not commenting here). But we *need* this patch" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [21:16:57] ok folks anyone who was holding on merges (ottomata mutante cmjohnson1) take some turns [21:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:17:36] wtp1001 looked good. restarting parsoid on all nodes. [21:17:49] https://integration.wikimedia.org/zuul/ says there is no longer a long backlog [21:19:15] login frequency looks stable at ~1/sec [21:19:38] bd808, tgr: Maybe it has to do with mobile? I note the user reporting the problem was hitting the mobile site. [21:19:45] of course group2 can crowd out group1 trends so that might not mean much [21:20:31] (03PS1) 10Dereckson: Set timezone, logo, site name on hi.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269827 (https://phabricator.wikimedia.org/T126185) [21:21:00] !log finished deploying parsoid version 8976ab93 [21:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:26] (03PS1) 10Mobrovac: Graphoid: pass the config options directly to service::node [puppet] - 10https://gerrit.wikimedia.org/r/269828 [21:22:18] bd808, tgr: Looks likely to be mobile: SessionManager is setting the cookies with domain=commons.wikimedia.org where the mobile site is going to need commons.m.wikimedia.org. Any idea what hack mobile used so it would work with pre-SessionManager? [21:22:41] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/269560 is next to look at, yes? 
[21:22:59] 6operations, 7Mail: Remove exim alias (wikipedia10) - https://phabricator.wikimedia.org/T126511#2016662 (10Dzahn) 5Open>3Resolved done and removed [21:23:01] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2016664 (10Dzahn) [21:23:15] (03CR) 10Dereckson: [C: 031] "Task T126479 clarifies the situation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: 10Mdann52) [21:24:40] (03CR) 10Mobrovac: "The PCC shows that the only actual change is in the catalogue, not in its output: https://puppet-compiler.wmflabs.org/1709/scb1001.eqiad.w" [puppet] - 10https://gerrit.wikimedia.org/r/269828 (owner: 10Mobrovac) [21:24:55] (03PS6) 10Dzahn: admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [21:25:30] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2016678 (10Dzahn) [21:25:32] 6operations, 7Mail: Remove exim alias: mmi - https://phabricator.wikimedia.org/T126520#2016676 (10Dzahn) 5Open>3Resolved done ``` ## Communications ## -mmi: jove.oliver@gmail.com, kklaudt@hotmail.com, craig@minassianmedia.com ``` [21:25:51] (03PS2) 10Dzahn: Add texvc to role::labs::openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/269762 (https://phabricator.wikimedia.org/T126468) (owner: 10Dereckson) [21:26:33] (03PS4) 10Dereckson: Configuration changes to wuu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: 10Mdann52) [21:27:02] (03PS5) 10Dereckson: Set logo, timezone, autoconfirm on wuu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: 10Mdann52) [21:27:04] (03CR) 10Papaul: [C: 031 V: 031] install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 
(https://phabricator.wikimedia.org/T126483) (owner: 10Dzahn) [21:28:51] (03PS6) 10Dereckson: Set logo, timezone, autoconfirm on wuu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: 10Mdann52) [21:29:31] (03CR) 10Dereckson: "PS6: rebased, optipng" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: 10Mdann52) [21:29:53] 6operations, 7Mail: move grants aliases to OIT? - https://phabricator.wikimedia.org/T83791#2016697 (10Dzahn) 5Open>3Resolved Eliza said on http://wmf.zendesk.com/requests/10080 that this has been done. removed on our side --- ``` -## Grants ## -grant: grants -grants: awang, jtud, kharold -grantsad... [21:29:55] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2016699 (10Dzahn) [21:30:04] anomie: wmf-config/mobile.php? [21:30:23] it changes $wgCentralAuthCookieDomain [21:30:25] tgr: Yeah, found that. Trying to debug on mw1017 now. [21:30:26] it seems centralauth had internal logic for this as well [21:30:30] isMobileDomain [21:30:40] class_exists( 'MobileContext' ) [21:30:46] that kind of stuff :) [21:31:14] (03CR) 10Dzahn: [C: 032] admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [21:31:20] (03PS4) 10Yurik: (WIP) Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 [21:31:52] i'll answer the VP/T post [21:32:24] bd808: Ugh. Does the magic browser extension not work for commons.m.wikimedia.org or something like that? [21:32:38] anomie: ugh it probably doesn't [21:33:16] (03CR) 10Dereckson: [C: 031] Set logo, timezone, autoconfirm on wuu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: 10Mdann52) [21:33:32] anomie: looks like it doesn't work for m. 
at all -- https://github.com/wikimedia/FirefoxWikimediaDebug/blob/master/lib/main.js#L59-L72 [21:35:05] (03PS5) 10Yurik: Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 [21:35:19] anomie: the chrome one works anywhere [21:35:23] what should I test? [21:35:42] tgr: Log in on commons.m.wikimedia.org [21:35:49] auto or normal? [21:35:54] autologin seems to work [21:36:29] did so [21:36:30] tgr: Does the extension work in Chromium too? What's the link to it? [21:36:31] greg-g, services & train are on the same slot [21:36:34] "No active login attempt is in progress for your session. [21:37:07] https://github.com/wikimedia/ChromeWikimediaDebug [21:37:31] I use Chrome but I would expect something this simple to be portable [21:37:36] apergos: yes [21:38:14] https://gerrit.wikimedia.org/r/#/c/269560 - could use a review by ottomata and/or thcipriani [21:38:25] (03PS5) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) [21:38:26] 10Ops-Access-Requests, 6operations: let datacenter-ops read server logfiles - https://phabricator.wikimedia.org/T126018#2016756 (10Dzahn) merged as this: https://gerrit.wikimedia.org/r/#/c/266919/ allows to run commands as the syslog user to achieve this. talked with papaul and confirmed he could now read s... 
[21:39:49] 10Ops-Access-Requests, 6operations: let datacenter-ops read server logfiles - https://phabricator.wikimedia.org/T126018#2016757 (10Dzahn) 5Open>3Resolved (it had been ACKed in last meeting) [21:40:58] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: puppet fail [21:42:13] !log ephemerally lowering compactor thread count on restbase1002 from 10 to 8 (limit combined working space) [21:42:48] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2016764 (10Tgr) 5Open>3Resolved [21:43:54] (03PS1) 10Ottomata: Include hiera configuration for labs and prod analytics_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/269839 (https://phabricator.wikimedia.org/T109859) [21:44:13] (03CR) 10Thcipriani: [C: 04-1] "I like most of this patch set, but I think there ought to be a different patch-set that moves the scap provider over to apt." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) (owner: 1020after4) [21:47:11] (03PS1) 10Ottomata: Add hive and oozie database grant defines [puppet/cdh] - 10https://gerrit.wikimedia.org/r/269841 (https://phabricator.wikimedia.org/T109859) [21:48:10] (03PS2) 10Ottomata: Include hiera configuration for labs and prod analytics_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/269839 (https://phabricator.wikimedia.org/T109859) [21:48:17] (03CR) 10Ottomata: [C: 032] Add hive and oozie database grant defines [puppet/cdh] - 10https://gerrit.wikimedia.org/r/269841 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:52:28] (03PS3) 10Ottomata: Include hiera configuration for labs and prod analytics_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/269839 (https://phabricator.wikimedia.org/T109859) [21:54:11] (03CR) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) (owner: 1020after4) [21:55:52] (03CR) 10Ottomata: [C: 032] Include hiera configuration for labs and prod analytics_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/269839 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:56:16] yurik: bah, yeah, sorry, it's ok for this week [21:56:34] greg-g, so should i go ahead with maps depl? [21:56:42] sure, train is done anyways [21:57:38] (03PS1) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [21:58:19] so if https://gerrit.wikimedia.org/r/#/c/269560/5 can't merge with the scap package in apt, can I just manually install the scap package with apt-get in iridium? otherwise I won't be able to merge https://gerrit.wikimedia.org/r/#/c/269561/1 and enabling puppet on iridium will be blocked indefinitely [21:58:30] apergos: thcipriani ^ [21:59:16] https://gerrit.wikimedia.org/r/#/c/269561/1 needs scap installed from apt, but I don't think puppet can have two packages with the same name from different providers? [21:59:52] twentyafterfour: there are just a few places in puppet where the path /srv/deployment/scap/scap is hard-coded that need to be cleaned up was my point on the patch. 
[22:00:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 57.69% of data above the critical threshold [5000000.0] [22:00:58] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: puppet fail [22:01:28] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: puppet fail [22:01:49] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: puppet fail [22:01:50] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: puppet fail [22:01:59] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: puppet fail [22:02:39] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: puppet fail [22:02:50] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail [22:03:20] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: puppet fail [22:03:38] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: puppet fail [22:03:49] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [22:04:18] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: puppet fail [22:04:49] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: puppet fail [22:05:09] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: puppet fail [22:05:58] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: puppet fail [22:06:18] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: puppet fail [22:06:29] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: puppet fail [22:06:30] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:06:38] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: puppet fail [22:06:38] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: puppet fail [22:06:38] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: puppet fail [22:07:30] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: puppet fail [22:07:49] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: puppet 
fail [22:08:19] PROBLEM - puppet last run on mw1064 is CRITICAL: CRITICAL: puppet fail [22:08:30] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: puppet fail [22:08:49] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: puppet fail [22:09:49] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: puppet fail [22:09:53] There seems to be alot of puppet failures. [22:09:58] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: puppet fail [22:09:59] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [22:10:50] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: puppet fail [22:10:59] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: puppet fail [22:11:19] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: puppet fail [22:11:29] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:11:38] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: puppet fail [22:11:52] reimaging I guess [22:12:09] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: puppet fail [22:12:09] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: puppet fail [22:12:18] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: puppet fail [22:12:36] checking [22:12:40] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: puppet fail [22:13:08] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: syntax error on line 203, col -1: `' at /etc/puppet/modules/base/manifests/init.pp:11 on [22:13:08] node mw1064.eqiad.wmnet [22:13:11] or not. 
hrm [22:13:14] Could not find data item mediawiki_memcached_servers in any Hiera data file and no default supplied at /etc/puppet/modules/mediawiki/manifests/nutcracker.pp:19 on node mw1196.eqiad.wmnet [22:13:19] PROBLEM - puppet last run on mw1056 is CRITICAL: CRITICAL: puppet fail [22:13:19] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: puppet fail [22:13:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:13:41] joy [22:13:43] I am going to say that this is a sync issue with the puppet masters [22:13:46] and not a puppet code error [22:13:48] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: puppet fail [22:13:51] I'll restart puppet on the puppetmaster [22:15:09] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: puppet fail [22:15:18] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: puppet fail [22:15:22] uhhh, any puppet compiler pros know what this means? [22:15:26] https://puppet-compiler.wmflabs.org/1710/analytics1026.eqiad.wmnet/prod.analytics1026.eqiad.wmnet.err [22:15:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[22:15:29] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: puppet fail [22:15:43] !log Restarted apache on palladium and strontium [22:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:59] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: puppet fail [22:16:13] there will be a few additional failures, from hosts which were mid-run [22:16:15] but hopefully no more [22:16:20] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: puppet fail [22:16:20] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: puppet fail [22:16:29] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: puppet fail [22:16:58] PROBLEM - puppet last run on mw1033 is CRITICAL: CRITICAL: puppet fail [22:17:23] ottomata: i have not seen that error before, and since it's in base i would expect to have seen it all the time, and i just compiled earlier [22:17:30] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: puppet fail [22:17:44] ottomata: the line it talks about.. init.pp 11 , is where it gets the puppetmaster setting from hiera [22:17:49] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: puppet fail [22:17:50] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail [22:18:00] ottomata: therefore i would think it's maybe a n issue in labs to get the hiera value [22:18:10] in labs? [22:18:17] because the compiler runs there [22:18:17] oh cause this compiler runs in labs [22:18:27] buuut, hm is this a new change? [22:18:29] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2016905 (10scfc) This doesn't have anything to do with the `role` function. The autoload layout is only applied to `modules/`. What manifests are pulled from `manifests... [22:18:30] otto, I think you should revert your change [22:18:33] this used to work right, I didn't change it [22:18:36] oh? 
[22:18:39] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: puppet fail [22:18:46] the common thing for all the errors is failure to resolve hiera data [22:18:53] I think that is spill-over from your change [22:18:58] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: puppet fail [22:18:58] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: puppet fail [22:18:59] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [22:18:59] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: puppet fail [22:18:59] ??? [22:19:00] PROBLEM - puppet last run on mw2003 is CRITICAL: CRITICAL: puppet fail [22:19:00] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: puppet fail [22:19:00] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail [22:19:00] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail [22:19:00] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: puppet fail [22:19:01] oh, that would match this, yea [22:19:03] all these ^ [22:19:03] ? [22:19:07] it fails to do hiera [22:19:09] (03PS1) 10Ori.livneh: Revert "Include hiera configuration for labs and prod analytics_cluster role" [puppet] - 10https://gerrit.wikimedia.org/r/269849 [22:19:09] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail [22:19:09] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: puppet fail [22:19:10] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: puppet fail [22:19:10] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: puppet fail [22:19:13] ok [22:19:14] ottomata: yes [22:19:16] ori: +1 [22:19:16] can i revert? 
[22:19:18] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 11 failures [22:19:21] yes [22:19:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Include hiera configuration for labs and prod analytics_cluster role" [puppet] - 10https://gerrit.wikimedia.org/r/269849 (owner: 10Ori.livneh) [22:19:29] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: puppet fail [22:19:29] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [22:19:29] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 5 failures [22:19:30] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail [22:19:30] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: puppet fail [22:19:39] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: puppet fail [22:19:39] PROBLEM - puppet last run on mw1066 is CRITICAL: CRITICAL: Puppet has 15 failures [22:19:40] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: puppet fail [22:19:48] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: puppet fail [22:19:48] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: puppet fail [22:19:49] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 3 failures [22:19:49] PROBLEM - puppet last run on es2010 is CRITICAL: CRITICAL: Puppet has 13 failures [22:19:49] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [22:19:49] PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: Puppet has 10 failures [22:19:49] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: puppet fail [22:19:50] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: puppet fail [22:19:50] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: puppet fail [22:19:51] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 14 failures [22:19:58] PROBLEM - puppet last run on mw2096 is CRITICAL: CRITICAL: puppet fail [22:19:58] PROBLEM - puppet last run on 
logstash1002 is CRITICAL: CRITICAL: puppet fail [22:19:58] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: Puppet has 10 failures [22:19:59] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: Puppet has 10 failures [22:20:00] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 6 failures [22:20:08] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [22:20:09] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: puppet fail [22:20:09] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: puppet fail [22:20:09] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: Puppet has 14 failures [22:20:18] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: puppet fail [22:20:19] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: puppet fail [22:20:19] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: puppet fail [22:20:19] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet has 11 failures [22:20:19] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: puppet fail [22:20:28] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: puppet fail [22:20:29] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: Puppet has 8 failures [22:20:29] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 24 failures [22:20:30] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: puppet fail [22:20:38] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: puppet fail [22:20:38] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail [22:20:38] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: puppet fail [22:20:38] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: puppet fail [22:20:39] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: puppet fail [22:20:39] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Puppet has 4 failures [22:20:48] PROBLEM - puppet last run on 
lvs1004 is CRITICAL: CRITICAL: puppet fail [22:20:48] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: puppet fail [22:20:48] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet has 14 failures [22:20:48] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 6 failures [22:20:49] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [22:20:49] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [22:20:49] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: puppet fail [22:20:49] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: puppet fail [22:20:50] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: Puppet has 11 failures [22:20:50] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: puppet fail [22:20:51] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: Puppet has 9 failures [22:20:51] you know...hiera is nice and all, but man, is it confusing [22:20:58] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 14 failures [22:20:59] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [22:20:59] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 4 failures [22:20:59] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 18 failures [22:21:00] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 15 failures [22:21:00] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: Puppet has 10 failures [22:21:00] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: puppet fail [22:21:00] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: Puppet has 3 failures [22:21:00] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail [22:21:01] puppet is running successfully now [22:21:02] post-revert [22:21:06] thanks ori [22:21:09] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [22:21:09] PROBLEM - puppet last run 
on mw2163 is CRITICAL: CRITICAL: Puppet has 4 failures [22:21:09] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: puppet fail [22:21:09] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: puppet fail [22:21:09] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: puppet fail [22:21:09] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [22:21:09] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail [22:21:10] PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: Puppet has 8 failures [22:21:10] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: puppet fail [22:21:11] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: puppet fail [22:21:12] not totally sure I understand how my change broke stuff [22:21:17] i didn't include any new classes... [22:21:22] i stopped the bot for a moment [22:21:26] thanks [22:21:56] oh freenode [22:22:01] groan [22:22:02] mutante: if you're tailing icinga-wm's log file, let me know when the tide turns and the recoveries start coming in [22:22:16] I missed whatever you did, ori... what did you do? [22:22:27] jenkins told me i was cool! [22:22:28] :p [22:22:56] ori: ok, yes [22:22:56] I tried puppet-merge on strontium [22:22:58] greg-g: Can I get an urgent deploy slot now? We broke saving edits on mobile on Tuesday. :-( [22:22:59] apergos1: the common thing to all the errors was failure to resolve hiera keys; ottomata had just merged a hiera changed that touched eqiad.yaml, etc. 
so I reverted it [22:23:06] ah [22:23:07] re: hiera, on https://gerrit.wikimedia.org/r/#/c/269849/1/hieradata/eqiad.yaml the very last line [22:23:09] ottomata: here's a possibility: [22:23:10] it doesnt end in : [22:23:10] +# Increase NameNode heapsize independent from other daemons [22:23:11] +cdh::hadoop::namenode_opts: -Xmx4096m [22:23:11] so it was puppet code after all [22:23:11] meh [22:23:17] 6operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: codfw: (2) servers for redis jobrunners - https://phabricator.wikimedia.org/T126453#2016924 (10RobH) [22:23:23] hiera syntax, need quotes? [22:23:25] thanks for finding that [22:23:27] yaml? [22:23:34] ottomata: is that cut-off at the end in line 203? [22:23:36] mutante is right tho [22:23:37] yeah [22:23:43] that is more likely to be it [22:23:49] !log deployed and restarted tilerator & tileratorui services [22:23:50] James_F: yes [22:23:53] https://gerrit.wikimedia.org/r/#/c/269849/1/hieradata/eqiad.yaml [22:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:56] AH [22:23:57] yes [22:24:05] be false [22:24:07] greg-g: Thanks. Krenair, go go go. [22:24:20] successful run on mw1064 [22:24:26] crazy [22:24:26] ok [22:24:31] yeah, mutante is seeing the recoveries [22:24:34] jenkins should not have passed me [22:24:48] hmm... sounds like i should pause with maps depl [22:25:10] ottomata: it probably didn't because it was so slow earlier? [22:25:13] yurik: the puppet failure issue is fixed, but coordinate with James_F and Krenair. [22:25:16] James_F, going through jenkins [22:25:18] good eyes mutante, good eyes [22:25:22] mutante: it passed me though [22:25:32] yurik, maps deploy uses trebuchet right? 
[22:25:32] (03PS1) 10Krinkle: wmfstatic: Reject direct requests to w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269856 [22:25:35] jenkins-bot [22:25:35] 4:53 PM [22:25:35] Patch Set 3: Verified+2 [22:25:35] Main test build succeeded. [22:25:37] Krenair, yep [22:25:48] yurik, I believe it's separate enough from what we're deploying that there should be no issues [22:25:49] is there not a yaml syntax checker? [22:25:52] yurik: yeah, you should be able to continue while they're doing their deploy [22:26:00] ok [22:26:05] i thought there was some fire [22:26:09] ottomata: arr, yea, needs more checks for yaml then [22:26:15] yurik: was, looking better [22:26:21] cool :) [22:26:23] wait [22:26:29] where are we with auth stuff? [22:26:33] yikes, yeah, especially if a little syntax error can break puppet everywhere [22:26:38] bd808: do you have a sense of the magnitude of the issue? [22:26:52] makes sense though, i guess, since this is in the eqiad.yaml file [22:27:23] brb [22:27:45] ottomata / greg-g: https://phabricator.wikimedia.org/T91496 looks like the task, but it is closed as resolved -- is that a mistake, or is it really done but not working properly? [22:27:47] No test for mediawiki-config? [22:28:18] ori: as far as I know it is still one unreproduceable report. anomie is looking into it possibly being related to mobile domains. [22:28:28] ori closed as declined [22:28:40] yea :p [22:28:41] !log performing rolling restart of Cassandra in restbase staging (experimental gc settings) [22:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:29:01] have I missed something? 
The graphs and logs I'm looking at don't show widespread problems [22:29:02] 6operations, 10Continuous-Integration-Infrastructure: Provide lint for yaml files in operations repository - https://phabricator.wikimedia.org/T91496#2016943 (10ori) 5declined>3Open https://gerrit.wikimedia.org/r/#/c/269849/ just broke Puppet everywhere (because of invalid YAML in eqiad.yaml -- see the bot... [22:29:34] bd808: no idea, I am looking to you for an answer to that [22:29:43] ok, ori, puppet-compiler tells me if it doesn't work, will commit a fix and run that on a change before I merge [22:29:54] if there are no additional reports and the metrics all look fine, then great [22:29:59] ori: i see recoveries now [22:30:32] Krenair: Given that the CI active gate queue is 30 minutes long, and there are 65 events in the queue, I think we might need to force-merge. :-( [22:30:39] (03PS1) 10Ottomata: I will include the proper fix as an amendment to this revert-revert before merging. [puppet] - 10https://gerrit.wikimedia.org/r/269857 [22:30:39] James_F, right, doing [22:30:54] force-merge right after we reopen a jenkins bug, heh [22:30:56] bd808, ori: It looks like the issue here is that the hack they put in place for https://phabricator.wikimedia.org/T49647 doesn't work anymore with SessionManager. So they need a new hack. [22:31:18] https://phabricator.wikimedia.org/T49647#525483 has a good explanation of what's going on. [22:31:37] (03PS2) 10Ottomata: I will include the proper fix as an amendment to this revert-revert before merging. [puppet] - 10https://gerrit.wikimedia.org/r/269857 [22:31:45] James_F: greg-g told us that force merging hurts in the long term [22:31:48] ottomata: cool, thanks [22:31:53] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269856 (owner: 10Krinkle) [22:32:12] mutante: Yes, but not being able to edit via mobile hurts too. [22:32:27] mutante: This is a tragedy of the commons issue, though. 
[22:32:29] James_F: :o good reason [22:32:44] bd808: I'm guessing the impact isn [22:32:53] James_F, ready to test? [22:32:54] !log deployed and restarted kartotherian services [22:32:54] which change? [22:32:56] (Also, no-one /should/ be merging right now to wmf.13/wmf.12, so the delay after Jenkins notices shouldn't break other people's merge.) [22:32:59] Krenair: Yes. [22:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:33:04] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269856 (owner: 10Krinkle) [22:33:07] bd808: I'm guessing the impact isn't particularly huge, since this was very likely in wmf.11 as well and no one reported it then. [22:33:12] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#2016972 (10RobH) [22:33:13] ok, this looks safe...https://puppet-compiler.wmflabs.org/1712/ [22:33:14] 6operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: codfw: (2) servers for redis jobrunners - https://phabricator.wikimedia.org/T126453#2016968 (10RobH) 5Open>3stalled rdb2001-2004 were purchased, and we don't have spares near this. I'll track the quotes/pricing of this requ... [22:33:34] apergos: I'll probably reschedule the phabricator maintenance window to a few hours later and do it when giuseppe is online, so you don't have to stay late [22:33:44] anomie: we should open a bug if you haven't already and see if we can fix the hack properly [22:33:46] twentyafterfour: that's probably good [22:33:56] zuul/jenkins are idle, a recheck would have taken care of that issue, Krenair James_F [22:33:59] it looks like it might take some time to figure out these two patches anyways [22:34:06] (famous last words? :p) [22:34:07] apergos: exactly [22:34:10] greg-g: No, it just got restarted. [22:34:11] * yurik is done with maps... 
[22:34:13] mutante: don't forget to bring icinga-wm back :) [22:34:17] James_F: well then :) [22:34:19] greg-g: The queue was 65 items long. [22:34:31] greg-g: That it's just thrown that all away isn't great. :-( [22:34:36] yeah, bring it back, i'm going to merge again and we want to be flooded :) [22:34:37] apergos: thanks for all of the help though, sorry if it caused you stress [22:34:39] bd808: I think I'll just reopen T49647 [22:34:41] James_F: yeah :/ [22:34:42] no stress [22:34:52] we got one thing done thanks to a kosiaris' reiew [22:34:54] *review [22:34:58] ori: i can but it's still 148 CRITs we will be told about [22:34:59] two to go! [22:35:05] heh [22:35:06] mutante: ahhh [22:35:07] !log krenair@mira Synchronized php-1.27.0-wmf.12/extensions/MobileFrontend/resources/mobile.loggingSchemas/SchemaEdit.js: https://gerrit.wikimedia.org/r/#/c/269848/ (duration: 02m 18s) [22:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:35:15] James_F, [22:35:31] apergos: that one is huge too, it's the patch that unblocks migrating everything from trebuchet to scap3 [22:35:36] twentyafterfour: I do have an idea though [22:35:50] Krenair: Yup, worked. [22:35:55] it would mean a 5 minute phab outage [22:35:57] Krenair: Could you push to wmf.13 too? [22:36:02] it woul dgo like this: [22:36:04] stop apache [22:36:04] James_F, works for me as well [22:36:10] puppet enable and run [22:36:12] check all the repos [22:36:14] !log rolling restart of Cassandra staging complete (experimental gc settings) [22:36:16] start apache [22:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:20] apergos: wayyyyyyyyyyyy too much going on [22:36:22] James_F, wmf.13 was planned for swat later on. want it done now instead? should it go through jenkins? [22:36:29] not right now [22:36:29] Krenair: Now and bypass please. 
[22:36:33] later when things calm down [22:36:37] nod [22:36:40] we've been 6 days w/ no puppet rnu on that box [22:36:46] Let's get out of the way of other deployers ASAP. [22:36:46] (Sorry everyone.) [22:36:54] then disable puppet again and at least we would be caught up [22:37:08] you might do that during your window later if things are quiet, twentyafterfour [22:37:17] apergos: ok [22:38:09] I guess phd would be restarted too [22:38:12] anyways you get the idea [22:38:22] I can do most of T125853 during the window [22:38:24] James_F, sycning [22:39:10] great [22:40:44] !log krenair@mira Synchronized php-1.27.0-wmf.13/extensions/MobileFrontend/resources/mobile.loggingSchemas/SchemaEdit.js: https://gerrit.wikimedia.org/r/#/c/269755/ (duration: 02m 23s) [22:40:46] James_F, [22:40:47] UNNNN cmon jenkins! [22:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:25] ottomata: it's backed up 50min or so because mw-core changes [22:42:54] somebody already added another instance .. [22:42:58] Krenair: Hmm. Doesn't seem to be working in wmf.13 for some reasno. [22:43:31] ooook, well puppet compiler was happy with my change>>>>>>>> [22:43:45] mutante: do you think I should not merge? :) [22:43:52] i can wait til tomrrow [22:43:55] i suppose :) [22:44:03] a few instances at or over 50% disk usage [22:44:17] I thought the queue was empty [22:44:19] James_F, WFM on cawikibooks? [22:44:38] build queue 2 [22:44:45] Krenair: Failing on en.wikt [22:44:45] * James_F tries test2. [22:45:55] Krenair: Yeah, works on test2. [22:45:55] Odd. [22:45:59] (03CR) 10Krinkle: [C: 032] wmfstatic: Reject direct requests to w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269856 (owner: 10Krinkle) [22:46:09] Krenair: Thanks. 
[22:46:25] (03Merged) 10jenkins-bot: wmfstatic: Reject direct requests to w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269856 (owner: 10Krinkle) [22:46:31] greg-g: Deployment's free of us now; thank you. CC yurik mutante apergos Krinkle. [22:46:44] ottomata: hmm.. i don't know, i have also been waiting a lot for it all day and couldn't decide [22:46:46] thanks for the heads up [22:46:55] ottomata: maybe after this deployment is done [22:47:04] James_F: thanks [22:48:24] aye, kinda end of day for me here anyway, will proceed tomorrow [22:48:25] thanks yalls [22:48:35] I'm feeling that end of day thing too [22:48:53] almost 1 am so pretty much checking out [22:49:13] restarting the bot now [22:49:29] !log krinkle@mira Synchronized w/static.php: (no message) (duration: 02m 18s) [22:49:30] ottomata: apergos: ok! cya [22:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:50:19] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:50:20] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:50:38] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:50:48] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [22:50:58] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:50:58] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:51:09] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:51:20] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:51:40] RECOVERY - puppet last 
run on mw1040 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:51:47] (03PS2) 10Krinkle: [DONT MERGE] mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) [22:51:54] (03PS3) 10Krinkle: mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) [22:52:00] (03CR) 10Dzahn: [C: 031] Separate math and TeX packages classes [puppet] - 10https://gerrit.wikimedia.org/r/269758 (owner: 10Dereckson) [22:52:09] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:52:09] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:20] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2017086 (10Krinkle) [22:52:30] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:30] (03CR) 10Dzahn: [C: 031] Add texvc to role::labs::openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/269762 (https://phabricator.wikimedia.org/T126468) (owner: 10Dereckson) [22:52:31] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:50] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:28] (03PS3) 10Dereckson: Separate math and TeX packages classes [puppet] - 10https://gerrit.wikimedia.org/r/269758 [22:56:31] (03PS1) 10Anomie: Better hack for T49647 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269861 
(https://phabricator.wikimedia.org/T49647) [22:57:21] (03PS1) 10Alex Monk: Add new WEF enwiki IP rate limit exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269862 (https://phabricator.wikimedia.org/T126541) [22:57:28] (03PS3) 10Dzahn: cassandra: fix top-scope vars without namespaces [puppet] - 10https://gerrit.wikimedia.org/r/266975 [22:58:26] (03Abandoned) 10Yurik: Set CSP to false [puppet] - 10https://gerrit.wikimedia.org/r/268677 (owner: 10Yurik) [22:59:53] (03CR) 10Yurik: [C: 031] Graphoid: pass the config options directly to service::node [puppet] - 10https://gerrit.wikimedia.org/r/269828 (owner: 10Mobrovac) [23:01:22] greg-g: we've got 3 patches to fix T49647 [23:01:40] ....5 digits? [23:01:48] regression [23:02:05] not able to log in on certain mobile domains [23:02:20] ah, ok, you can go before swat if you want :) [23:02:32] thanks :) [23:03:01] ori: do you want to do the deploy honors since you hit +2? [23:03:22] happy to, if you like [23:03:25] up to you [23:03:47] works for me and thanks [23:04:04] (03PS5) 10Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 [23:04:19] (I'm not deploying that ^^^ now, don't worry ;) [23:04:38] heh [23:05:03] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2017126 (10Dzahn) @scfc is absolutely correct, it's the import lines in site.pp and yes, move them to modules/role/ is the fix. we know how to for all classes with the st... 
[23:07:38] sync-masters: 0% (ok: 0; fail: 0; left: 1) [23:07:40] * ori whistles [23:07:40] (03CR) 10Dzahn: "compiler errors are due to https://phabricator.wikimedia.org/T125943" [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [23:08:08] it's been taking ~90s [23:08:38] !log ori@tin Synchronized php-1.27.0-wmf.13/includes/WebResponse.php: I13fcc3ce4: Allow changing cookie options in WebResponseSetCookie hook (duration: 01m 37s) [23:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:18] (03CR) 10jenkins-bot: [V: 04-1] add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 (owner: 10Ori.livneh) [23:09:33] future deploy servers should have SSD and 16 cores obviously just to deal with rsync :/ [23:10:08] !log ori@tin Synchronized php-1.27.0-wmf.12/includes/WebResponse.php: I13fcc3ce4: Allow changing cookie options in WebResponseSetCookie hook (duration: 01m 30s) [23:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:37] (03CR) 10Gergő Tisza: [C: 031] Better hack for T49647 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269861 (https://phabricator.wikimedia.org/T49647) (owner: 10Anomie) [23:10:39] bd808: are we re-building cdbs on every sync-file or did i strace the wrong script? [23:10:46] tgr: should I deploy that as well? [23:11:05] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2017140 (10scfc) (@Dzahn, could you point to an example or a task for "role foo" not working with `modules/role/`?) 
[23:11:12] yes, and it needs to go after [23:11:16] ori: that config change is the bit that will fix the problem with the new hook [23:11:37] (03CR) 10Ori.livneh: [C: 032] Better hack for T49647 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269861 (https://phabricator.wikimedia.org/T49647) (owner: 10Anomie) [23:11:41] ori: hmmm.. the master sync may be rebuilding CDBs, yes [23:11:51] bd808: yeah, I think that is what is making it slow [23:11:52] which wold explain the slowness [23:11:57] jinx [23:12:24] (03PS5) 10Dzahn: install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) [23:14:29] bd808, yes, it's building cdbs [23:15:03] !log ori@mira Synchronized wmf-config/mobile.php: I6946eccf9c: Better hack for T49647 (duration: 02m 19s) [23:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:13] so I guess the reason we need that to happen is that if a scap was run from the otehr side it would build the json files from the cdbs [23:15:23] so the cdbs need to be built [23:15:37] but maybe we can figure out how to differ that? [23:17:30] who has access to colabwiki? [23:18:34] ori: is it live? I still cannot login [23:18:51] it's live, but let me check $random_appserver to be 100% sure [23:19:24] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2017177 (10Dzahn) @scfc So.. if it's a `role foo::bar` (or more levels than that), we move it to `modules/role/manifests/foo/bar.pp` and that works fine, i have moved a c... [23:19:26] tgr: yes, live. [23:19:48] yurik, contractors and wmf staff [23:19:52] yurik, why do you ask? [23:20:17] Krenair, i need to check two pages there - they use graph and data, and i'm trying to kill it everywhere [23:20:43] the cookies are on the right domain now [23:20:57] Krenair, can i access it somehow? 
[23:21:00] your manager might be able to approve an exception and get oit to create you an account there? not sure [23:21:01] but apparently there are more problems [23:21:02] no central cookies though [23:21:25] (03CR) 10Dzahn: [C: 032] install_server: decom iodine [puppet] - 10https://gerrit.wikimedia.org/r/269738 (https://phabricator.wikimedia.org/T126483) (owner: 10Dzahn) [23:22:50] tgr: I verified that all three of ./php-1.27.0-wmf.{12,13}/includes/WebResponse.php and ./wmf-config/mobile.php are on the latest rev [23:23:03] ...and: you already know what I think. [23:23:49] I'm getting "No active login attempt is in progress for your session." trying to log in at meta.m.wikimedia.org [23:24:09] bd808: revert commons/meta temporarily to .12, leave on .13 on mw1017? [23:24:31] Krenair, do you know who is the "official person who officiates the process of asking a manager", so that I can CC them :)) basically someone with the grant pers [23:24:31] the bug is reproducible so we don't gain anything by keeping it broken [23:24:47] ori: does that seam reasonable to you? (partial rollback?) [23:25:03] is it affecting any other wikis? [23:25:11] yurik, you might be able to email techsupport@ or something [23:25:12] in other words, with commons and meta on wmf12, would the problem be gone? [23:25:14] just wanted to ask that [23:25:17] from the perspective of users [23:25:23] 6operations, 5Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2017188 (10Dzahn) server has been removed from puppet, salt and icinga and shutdown here's a DNS change to remove it completely incl. mgmt https://gerrit.wikimedia.org/r/#/c/269739/ please wipe the disks and finish the... [23:25:33] the bug we fixed was specific to those wikis, but the bug that remains maybe not? 
[23:25:49] 6operations, 10ops-eqiad, 5Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2017196 (10Dzahn) [23:26:01] centralauth_Session=043d1ffeab117236589c5b4cb7feacc5; path=/; domain=commons.wikimedia.org; secure; httponly [23:26:17] I'm getting "meta.m.wikimedia.org" cookies [23:26:19] I think roll back wmf13 everywhere, and schedule a wmf13 train re-deploy for later today with greg-g, if you can fix it [23:27:03] fwiw I can log in on mw.org and enwikivoyage with no problems [23:27:08] tgr: what was the fix for "No active login attempt is in progress for your session." on loginwiki? [23:27:57] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2017201 (10Dzahn) [23:27:57] !log performing rolling restart of Cassandra in restbase staging (experimental gc settings) [23:28:07] (03CR) 10Mobrovac: "@Subbu, I can't find any comment of yours on this change... What's question?" [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [23:28:16] so, I haven't been able to read up/understand the current issue yet [23:28:22] yeah, but I don't see the point in having the cluster be in a state that is not consistent with the deployment train. 
The process does not have a good way of representing "most group1 wikis are on wmf13" [23:28:39] bd808: that was both the login page and the CA domain dance tried to set cookies on login.wikimedia.org, I think [23:28:41] that is going to confuse people [23:28:59] not applicable if you are logging in from another site [23:29:04] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (10Dzahn) iodine down, memcached servers done by elukey total count at 78 [23:29:19] greg-g: a subset of mobile domains are not completing authentication with loginwiki [23:29:26] (03PS1) 10Ori.livneh: Revert "group1 wikis to 1.27.0-wmf.13" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269872 [23:29:39] bd808: should the mw1017 log be on fluorine yet? [23:29:53] tgr: yeah it was on Monday anyway [23:30:34] why only revert commons/meta, tgr? [23:30:55] because those are the wikis that are definitively known to be affected [23:30:57] (03PS3) 10Dzahn: delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) [23:31:05] gotcha [23:31:15] but I don't think a partial rollback makes sense; it won't match the model people have in their heads of how the train deploys go [23:31:19] and it leaves room for confusion and error [23:31:23] I agree [23:31:33] (just trying to understand the reasoning) [23:31:57] (03PS2) 10Ori.livneh: Revert "group1 wikis to 1.27.0-wmf.13" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269872 [23:32:13] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2017204 (10scfc) For `role foo`, `modules/role/manifests/foo.pp` should definitely work. `/init.pp` only works for the "base" class, in this case "role"; cf. http://docs... 
[23:32:15] well, I think ori's right, sadly [23:32:41] loginwiki is still setting the cookie on the non-m domains [23:32:45] what's the issue? [23:32:56] (03PS8) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [23:33:21] aude: https://phabricator.wikimedia.org/T49647#2017019 [23:33:22] is it possible that HHVM did not pick up the change? [23:33:23] aude: resistance to stepping back looking at the big picture calmly [23:33:31] ori: unhelpful [23:33:48] tgr: I can restart on mw1017 if you want to test that [23:33:51] I'm fine with a role back. we were just trying to figure out how far [23:33:54] we have seen that with config changes [23:33:58] :/ [23:33:59] it's very rare, but maybe? [23:34:07] ori: worth a try [23:34:17] ori: I don't think you need to wait to revert, fwiw [23:34:30] !log Restarted HHVM on mw1017 [23:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:56] well, if we have waited this far, it is worth it to wait another minute in case it really is fixed by the change that went out and it's just sync issue [23:35:05] ori: I figured reverting on less sites means less chance of something going wrong, but you certainly know more about deploy processes than I do [23:35:05] but yeah [23:36:08] ori: no luck [23:36:10] 6operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2017211 (10Krenair) [23:36:12] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2017210 (10Krenair) [23:36:12] ok, reverting then [23:36:14] greg-g: ack? 
[23:36:16] ack [23:36:28] thanks [23:36:39] (CR) Ori.livneh: [C: 2] Revert "group1 wikis to 1.27.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/269872 (owner: Ori.livneh) [23:36:57] (CR) Mobrovac: [C: -1] "Since this is a refactor, the roles manifests/role/parsoid_* should also be moved (rt and vd clients and servers)." [puppet] - https://gerrit.wikimedia.org/r/269603 (owner: Dzahn) [23:37:04] (Merged) jenkins-bot: Revert "group1 wikis to 1.27.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/269872 (owner: Ori.livneh) [23:38:02] !log ori@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to php-1.27.0-wmf.12 [23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:40] > ori: unhelpful [23:38:42] right, sorry. [23:39:00] should we move all wikis to .13 on mw1017 to also get some advance testing for tomorrow? [23:39:08] or is that bound to blow something up? [23:39:12] s'ok, in some ways it's also helpful, but I think those involved were pretty level-headed today [23:39:33] seems sensible if it helps debug [23:40:00] tgr: all wikis to .13 on mw1017 sounds good to me [23:40:24] I've been thinking about how to make that always the case (mw1017 always on newest branch) [23:40:44] I wonder if we should have an entry in .gitignore for an mw1017-specific config file [23:40:45] bd808++ [23:40:48] so that local hacks don't get clobbered [23:41:49] bd808: but sometimes a problem surfaces on a wiki that is on an older branch [23:41:55] (PS1) Dzahn: OTRS: remove ssl cert and config [puppet] - https://gerrit.wikimedia.org/r/269877 (https://phabricator.wikimedia.org/T122320) [23:41:58] bd808: in general it can cause problems, stuff can leak to other wikis via job queue or internal HTTP requests [23:42:01] ori: true [23:42:14] and what tgr said, I hadn't thought of that [23:42:34] grumble.
we need a canary cluster [23:43:10] bd808: IMO we need to make sure the branch flag is propagated in internal requests [23:43:37] right now things can leak between branches due to commons, loginwiki, job runners etc [23:43:44] lots of complexity [23:44:05] although maybe having them run mixed branches would be worse [23:46:42] tgr: have you put mw1017 back to .13? [23:46:54] no, sec [23:47:48] PROBLEM - Disk space on ms-be1008 is CRITICAL: DISK CRITICAL - free space: / 2123 MB (3% inode=74%) [23:48:52] operations, Analytics-Wikistats, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2017268 (Dzahn) With all due respect, I think the Analytics team is the right one to do analytics of traffic on the statistics site. [23:49:13] done [23:49:40] !log switched mw1017 to wmf.13 (all groups) [23:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:31] PROBLEM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 336 MB (3% inode=81%) [23:54:00] PROBLEM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 306 MB (3% inode=81%) [23:54:25] (CR) Dzahn: "@mobrovac that next step you mention is done in https://gerrit.wikimedia.org/r/#/c/269707/" [puppet] - https://gerrit.wikimedia.org/r/269603 (owner: Dzahn) [23:54:49] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [23:55:55] (PS2) Dzahn: keyholder: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269611 [23:57:09] (CR) Dzahn: "path conflict path conflict path.."
[puppet] - https://gerrit.wikimedia.org/r/268717 (owner: Dzahn) [23:59:09] (CR) Anomie: Better hack for T49647 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/269861 (https://phabricator.wikimedia.org/T49647) (owner: Anomie)