[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T0000). Please do the needful.
[00:01:54] RECOVERY - Host elastic1004 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[00:04:33] ebernhardson: aaand now the commit message is wrong again :)
[00:05:24] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow by default in all talk namespaces on gomwiki (duration: 00m 28s)
[00:05:24] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:06:21] ori: err, lo
[00:06:22] l
[00:07:52] (PS8) EBernhardson: Define curl connection pools in hhvm for cirrus [puppet] - https://gerrit.wikimedia.org/r/277919
[00:10:34] (CR) Ori.livneh: [C: 2] "By the way, if this proves beneficial, I'd consider trying to upstream it to PHP7, which is the best way to ensure this doesn't end up bei" [puppet] - https://gerrit.wikimedia.org/r/277919 (owner: EBernhardson)
[00:22:53] Operations, Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2128190 (Krenair) >>! In T118176#2128021, @RobH wrote: > This is due to the maint-announcement possibly containing when routes/links are down and could open up possible abuse. Isn't an ordinary...
[00:32:04] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:57:01] !log rebooting elastic1005.eqiad.wmnet for kernel update
[00:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:00:03] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100%
[01:11:29] (PS1) Krinkle: Use 'include' instead of 'include_once' in missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/277941
[01:26:12] [16:17:18] legoktm: Did you modify JobQueueGroup.php on tin without checking in your changes? <-- I was told it was ori. I just had to stash/pop it while deploying earlier
[01:27:30] * ostriches at least commits
[01:29:37] demon@tin /srv/mediawiki-staging/php-1.27.0-wmf.16 (wmf/1.27.0-wmf.16)$ git log --oneline
[01:29:37] 3ff6c9c Live debugging hack for ori or somebody
[01:29:41] legoktm: ^
[01:30:43] ty :)
[01:33:28] Granted, wmf.17 is going out everywhere else tomorrow so it's not there :P
[01:43:18] (PS1) Andrew Bogott: WIP: Modify designatedashboard to recognize proxy records [puppet] - https://gerrit.wikimedia.org/r/277943
[01:46:54] (CR) Andrew Bogott: "'Enough rope to hang myself'" [puppet] - https://gerrit.wikimedia.org/r/277943 (owner: Andrew Bogott)
[01:55:44] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail
[02:00:10] (CR) Alex Monk: [C: 1] Use 'include' instead of 'include_once' in missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/277941 (owner: Krinkle)
[02:23:44] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:35:42] ostriches, it was later modified by me.
can be just reverted because doesn't log much interesting anyway
[02:38:36] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 17m 36s)
[02:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:44] Would it be possible to get this merged to test potential issues on wmflabs? https://gerrit.wikimedia.org/r/#/c/277452/1
[02:57:45] Dereckson, I suppose so
[02:59:07] Dereckson, well. it does need a rebase
[03:00:07] Rebasing.
[03:02:05] (PS2) Dereckson: Test Collection extension on zh.wikipedia.beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/277452 (https://phabricator.wikimedia.org/T128425)
[03:02:16] Krenair: here you are ^
[03:02:28] (CR) Alex Monk: [C: 2] Test Collection extension on zh.wikipedia.beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/277452 (https://phabricator.wikimedia.org/T128425) (owner: Dereckson)
[03:03:13] (Merged) jenkins-bot: Test Collection extension on zh.wikipedia.beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/277452 (https://phabricator.wikimedia.org/T128425) (owner: Dereckson)
[03:04:59] Thanks.
[03:05:40] meh, something has the scap lock. probably l10nupdate
[03:06:38] https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/4530/console indicates (more quickly than usual) it has been deployed successfully.
[03:07:03] beta will auto-deploy it anyway so you don't need to worry about that
[03:07:03] yeah
[03:13:23] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 17m 44s)
[03:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:13] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/277452 (labs only change, just keeping file in sync) (duration: 00m 27s)
[03:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:18:40] (PS2) Dereckson: Revert "(bug 45233) Groups permissions on pt.wikivoyage" [mediawiki-config] - https://gerrit.wikimedia.org/r/276917 (https://phabricator.wikimedia.org/T129487)
[03:22:55] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 17 03:22:55 UTC 2016 (duration 9m 33s)
[03:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:26:18] (CR) Sabya: "There is a problem I noticed.
If I am creating a new instance and applying role::labs::ores::precached, it does not have ores binaries yet" [puppet] - https://gerrit.wikimedia.org/r/277824 (owner: Sabya)
[03:29:52] Operations: elastic1005.eqiad.wmnet is non-responsive - https://phabricator.wikimedia.org/T130174#2128399 (EBernhardson)
[04:03:50] (Abandoned) Sabya: Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/277824 (owner: Sabya)
[04:11:34] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms
[04:28:14] (PS7) KartikMistry: Enable non-default MT for some languages [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849)
[04:44:44] RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.005 second response time on port 9042
[05:50:40] Operations, Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2128514 (Johan) @Joe: Yes, I realized the possibility of me having done so, thus apologizing for possibly misreading you. (: Mea culpa.
[06:02:43] (PS8) KartikMistry: Enable non-default MT for some languages [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849)
[06:11:27] !log rebooting elastic1006.eqiad.wmnet for kernel update
[06:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:14:43] PROBLEM - Host elastic1006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:16:14] akosiaris: how is https://gerrit.wikimedia.org/r/#/c/277463/ now? :)
[06:16:27] akosiaris: it will reduce further.
[06:16:35] (but not this week for sure)
[06:30:53] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:54] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:24] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:46:08] !log Powercycled elastic1006; unresponsive.
[06:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:50:44] RECOVERY - Host elastic1006 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[06:56:44] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:04] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:45] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:03:38] Operations: elastic1005.eqiad.wmnet is non-responsive - https://phabricator.wikimedia.org/T130174#2128532 (EBernhardson) Open>Resolved a:EBernhardson back up now. just took a few hours ...
[07:06:51] !log Short-circuiting RefreshLinksJob::run() to bail if the root job timestamp is older than March 1st [07:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:15:04] (03PS1) 10Ori.livneh: Increase the priority of refreshLinks jobs [puppet] - 10https://gerrit.wikimedia.org/r/277948 [07:15:31] (03CR) 10Ori.livneh: [C: 032 V: 032] Increase the priority of refreshLinks jobs [puppet] - 10https://gerrit.wikimedia.org/r/277948 (owner: 10Ori.livneh) [07:17:25] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: puppet fail [07:45:55] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:43] !log ori@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/jobs/RefreshLinksJob.php: Job queue bankruptcy: force all refreshlinks jobs to be non-recursive (duration: 00m 24s) [07:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:04] <_joe_> uhm non-recursive means what? [07:51:39] it purges the backlinks of the titles in the job, but not the backlinks of the backlinked titles [07:52:11] on the theory that the recursion is causing a cycle [07:52:31] <_joe_> well if recursion would just go one level deep that wouldn't happen, right? [07:52:44] <_joe_> so if jobs spawned by recursive jobs were non-recursive [07:52:48] it's not at all clear that it does (only go one level deep) [07:53:06] <_joe_> I can see how uncontrolled recursion could cause a shitshow :P [07:54:07] <_joe_> also the throughput always goes toghether with enqueueing [07:54:33] <_joe_> which brings us to the idea that jobs that get processed enqueue a similar amount of other jobs [07:55:22] <_joe_> and your intervention seem to be effective; the job queue is shrinking fast [07:55:27] <_joe_> :) [07:55:40] <_joe_> elukey: warm up the reimaging engines! 
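A toy Python model may help with the recursive versus non-recursive distinction discussed just above. It is only a sketch under stated assumptions: the backlink graph, the `run_queue` helper and the `max_jobs` cap are hypothetical and are not taken from MediaWiki. It illustrates how two templates that include each other can keep a recursive queue from ever draining, while the non-recursive variant stops after one level of backlinks.

```python
from collections import deque

# Hypothetical backlink graph: title -> titles that link to or transclude it.
BACKLINKS = {
    "Template:A": ["Template:B", "Page:1"],
    "Template:B": ["Template:A", "Page:2"],  # A and B include each other: a cycle
    "Page:1": [],
    "Page:2": [],
}


def run_queue(root, recursive, max_jobs=1000):
    """Process refresh jobs starting at `root` and return how many jobs ran.

    `max_jobs` is a safety cap so a runaway queue cannot loop forever.
    """
    queue = deque([(root, True)])  # (title, may_spawn_children)
    processed = 0
    while queue and processed < max_jobs:
        title, may_spawn = queue.popleft()
        processed += 1
        if not may_spawn:
            continue  # leaf job: refresh this title only
        for backlink in BACKLINKS[title]:
            # Recursive mode: children may spawn again.
            # Non-recursive mode: children are leaves.
            queue.append((backlink, recursive))
    return processed


print(run_queue("Template:A", recursive=False))  # drains after one level
print(run_queue("Template:A", recursive=True))   # only stops at the cap
```

In this model the non-recursive run finishes after three jobs, whereas the recursive run is ended only by the cap, which is roughly the behaviour the channel was speculating about.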
[07:55:50] <_joe_> we got work to do today I guess [07:56:32] <_joe_> (OTOH it's been a week that 10/15 engineers are looking at this issue and no one has a firm idea of why this crisis happened [07:57:46] <_joe_> this means the jobqueue is ubercomplicated and underinstrumented, and that we need to store/log way more information about jobs than we do) [08:00:09] the fun thing to do would be to try and redesign it, but that is going to be horrendous, because the magnificent new design that corrects all the design mistakes of the existing system will also run into all the edge cases and gotchas that the existing system already works around [08:00:16] o it would just break horribly again and again until it is either stabilized or reverted [08:00:19] *so [08:00:26] the responsible thing to do would be to mount a series of strategic interventions, correcting specific problems without taking away from what is already there [08:00:59] <_joe_> ori: of course it would [08:01:35] the problem is mostly psychological -- this is deeply unattractive work [08:01:41] because users only notice the job queue when it is broken [08:01:45] <_joe_> maybe using kafka as a transport would be a good idea in the long run, btw. 
We're amply abusing redis [08:01:58] yes, that is one of the strategic interventions i had in mind [08:02:01] <_joe_> I find that kind of work fascinationg [08:02:16] well, also, because it calls for strategic interventions rather than redesign [08:02:22] <_joe_> (the one that users notice only when it breaks) [08:02:23] it's not going to be something you're proud of [08:02:28] <_joe_> ori: ^^ that yes [08:02:31] well, maybe in a sick sort of way [08:02:36] but not something you put on a portfolio or something [08:03:05] not that that is a motivation, i just think that it represents just how there is little opportunity to actually be creative [08:03:31] <_joe_> but then again, we shouldn't just act on personal pleasure, or I'd probably be working 100% on pybal and kubernetes since 6 months ago :P [08:04:02] I left recursion on on mw1001 so I can continue to debug [08:04:13] and I just noticed something weird [08:04:14] <_joe_> makes sense [08:04:52] ok so: https://github.com/wikimedia/mediawiki/blob/3edaa196a7/includes/jobqueue/utils/BacklinkJobUtils.php [08:04:57] o/ [08:05:06] "Break down $job into approximately ($bSize/$cSize) leaf jobs and a single partition job that covers the remaining backlink range (if needed)." [08:05:45] hrm, wait [08:05:48] (hi elukey) [08:08:14] it gets called with empty ranges, which is weird, but that shouldn't result in a job getting enqueued, so i don't think that's it [08:09:24] https://grafana.wikimedia.org/dashboard/db/job-queue-health looks nicer now [08:09:30] 6Operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2128610 (10Joe) This task might not be "UBN!" in the mind of some of us, but it lies untouched since December and, for the record, **I won't fix... [08:11:41] (brb!) [08:13:17] now it's climbing back up? [08:13:20] how? 
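The BacklinkJobUtils comment quoted above ("approximately ($bSize/$cSize) leaf jobs and a single partition job that covers the remaining backlink range") can be pictured with a short Python sketch. This is not the MediaWiki implementation: `partition_backlinks`, `b_size` and `c_size` are hypothetical stand-ins for the idea, and the sketch assumes the backlinks are already available as a sorted list of page IDs.

```python
def partition_backlinks(page_ids, b_size, c_size):
    """Split sorted backlink IDs into leaf chunks plus an optional remainder range.

    The first `b_size` IDs are cut into leaf jobs of at most `c_size` IDs each;
    everything after that is described by a single (start, end) range, or None.
    """
    batch = page_ids[:b_size]
    leaves = [batch[i:i + c_size] for i in range(0, len(batch), c_size)]
    rest = page_ids[b_size:]
    remainder = (rest[0], rest[-1]) if rest else None
    return leaves, remainder


leaves, remainder = partition_backlinks(list(range(1, 11)), b_size=6, c_size=2)
print(leaves)     # three leaf chunks of two IDs each
print(remainder)  # range covering the IDs left over

# An empty input yields no leaf jobs and no remainder job.
print(partition_backlinks([], b_size=6, c_size=2))
```

Under these assumptions an empty range produces nothing to enqueue, which matches the remark that a call with empty ranges should not by itself result in a job.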
[08:13:24] <_joe_> rotfl [08:13:55] <_joe_> it's fighting back [08:24:08] <_joe_> also I suspect our graphs are no good, honestly [08:24:12] <_joe_> I have to check those [08:25:19] oh, it's just wikis on wmf17 [08:25:22] i lost track of the train [08:25:29] <_joe_> ok :P [08:25:33] !log ori@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/jobs/RefreshLinksJob.php: Job queue bankruptcy: force all refreshlinks jobs to be non-recursive (duration: 00m 25s) [08:26:18] !log ori@tin Synchronized php-1.27.0-wmf.17/includes/jobqueue/jobs/RefreshLinksJob.php: Job queue bankruptcy: force all refreshlinks jobs to be non-recursive (duration: 00m 25s) [08:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:26:21] now it'll clear out [08:26:24] just watch [08:26:56] <_joe_> it was funnier to imagine that the jobqueue gained conscience [08:27:51] 6 million [08:28:09] 5 million [08:28:13] \o/ [08:28:41] <_joe_> what problems will this cause to users? [08:28:43] <_joe_> if any? [08:28:50] ori: usual stupid question - what are we loosing forcing refreshlinks to be non-recursive? 
[08:29:07] elukey: jobs ;) [08:29:28] "what links here" and red-links may be somewhat stale [08:29:35] I figure that: [08:29:38] <_joe_> elukey: in theory, we're losing consistent refreshes of all pages [08:30:05] - when the job queue gets this backlogged, users take matters into their own hands and write bots that walk through all pages and purge them (yes, that's a thing) [08:30:27] <_joe_> but well, we're currently refreshing multiple times a day pages that don't need to [08:30:39] - whatever the negative impact may be, it is going to be substantially less bad than the delay the size of the queue was imposing on other job types, which are more significant to users [08:30:50] things people actually wait on [08:31:39] but _joe_ is right that this is "in theory", because the 20m was mostly the same set of pages circulating over and over [08:32:27] i'm not worried about those so much as the collateral -- the tail of legitimate recursive refreshlinks jobs [08:35:14] but again: users are (sadly) accustomed to this, which is why there is an api endpoint for forcing purges and forcelinksupdates [08:35:20] example usage: https://phabricator.wikimedia.org/T115325#1767120 [08:36:21] so anything from changing title of a page to change a very used template can trigger refreshlink jobs, that basically "sed" wikisource replacing stuff? Or am I completely wrong? [08:36:43] PROBLEM - Check size of conntrack table on mw1168 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [08:36:58] yep [08:37:20] all right, thanks! Now it is more clear :) [08:38:28] the things users were actually complaining about had to do with other job types [08:38:45] like new articles not showing up in search results [08:39:06] one thing that I'd need to do very soon is starting to edit wikipedia. Shame on me, I know. I need to get more insights from the other side of the infrastructure. [08:39:36] it's fun! 
:) [08:39:40] <_joe_> I don't think it's your duty [08:39:41] ahh yes makes sense, I'd be complaining too after long hours writing :) [08:40:20] the queue size is 79k now, of which 67k are RestbaseUpdateJobOnDependencyChange [08:40:26] so the jobrunners are probably hammering restbase [08:40:38] _joe_ I don't see it as a duty but a way to better understand my job, and also to do something good :) I've never edited wikipedia because I always thought that I didn't have anything to add, I think it is a standard misconception [08:40:49] anyways, I'm going to undo the hack [08:41:11] <_joe_> ori: if the problem comes back, at least we're sure the problem was there [08:41:20] yeah [08:42:02] this is the least proud engineering work I can recall [08:42:15] I feel like one of the monkeys screaming and jumping around the obelisk in 2001 [08:42:49] <_joe_> it was a monolith, not an obelisk :P [08:42:56] but I think we postponed this kind of intervention as long as we could [08:42:57] as long you don't start hitting us with a bone we're all good! 
[08:43:40] heh [08:44:02] <_joe_> ori: we also confirmed the problem was infinite recursion [08:44:08] <_joe_> I mean we kind of knew that [08:44:13] !log ori@tin Synchronized php-1.27.0-wmf.17/includes/jobqueue/jobs/RefreshLinksJob.php: Revert: Job queue bankruptcy: force all refreshlinks jobs to be non-recursive (duration: 00m 39s) [08:44:16] <_joe_> but this is definitive proof [08:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:44:46] you know what we could do [08:44:55] !log ori@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/jobs/RefreshLinksJob.php: Revert: Job queue bankruptcy: force all refreshlinks jobs to be non-recursive (duration: 00m 29s) [08:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:05] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [08:45:14] let's do refreshlinks on a total separate set of runners [08:45:28] <_joe_> ori: why? [08:45:30] so that other job types are never delayed on account of refreshlinks spikes [08:45:50] +1 [08:45:59] <_joe_> it's enough to have a separate queue on the jobrunners AIUI [08:46:10] right, yes [08:46:12] that's even simpler [08:46:17] <_joe_> so let's just do that [08:46:21] <_joe_> I can do it today [08:46:34] <_joe_> if I don't get worse than I am now, that is [08:47:12] (03PS1) 10Ori.livneh: Revert "Increase the priority of refreshLinks jobs" [puppet] - 10https://gerrit.wikimedia.org/r/277952 [08:47:15] RECOVERY - Check size of conntrack table on mw1168 is OK: OK: nf_conntrack is 59 % full [08:47:22] <_joe_> this too ^^ [08:47:33] what do you mean? 
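The proposal above, a separate queue so that other job types are never delayed by refreshLinks spikes, can be illustrated with a small simulation. This is a sketch only: the `drain` helper, the queue names and the per-group slot counts are hypothetical and say nothing about how the jobrunners are actually configured.

```python
from collections import deque


def drain(queues, slots, ticks):
    """Run `ticks` rounds; in each round every queue group may run up to
    slots[group] jobs, independently of how long the other queues are."""
    done = {name: 0 for name in queues}
    for _tick in range(ticks):
        for name, queue in queues.items():
            for _slot in range(slots[name]):
                if not queue:
                    break
                queue.popleft()
                done[name] += 1
    return done


queues = {
    "refreshLinks": deque(range(1000)),  # simulated spike
    "other": deque(range(10)),           # everything users actually wait on
}
print(drain(queues, {"refreshLinks": 2, "other": 2}, ticks=5))
```

With dedicated slots per group, the ten "other" jobs finish within five rounds even though the refreshLinks backlog is a hundred times larger.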
[08:47:45] <_joe_> this is good, too [08:47:51] ah :) [08:47:58] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Increase the priority of refreshLinks jobs" [puppet] - 10https://gerrit.wikimedia.org/r/277952 (owner: 10Ori.livneh) [08:48:03] I'm going to update the task and go to sleep [08:48:11] <_joe_> thanks a lot :) [08:48:40] for nothing, anyone can "fix" a runaway train by blowing up the engine [08:48:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [08:49:22] <_joe_> well, as I said, we've verified _that_ was the engine [08:49:34] <_joe_> we had no idea until around yesterday [08:50:11] was kinda obvious after we ruled out the users [08:50:55] MaxSem: was my description of the potential impact accurate? [08:51:37] as in, blown up engine? [08:53:50] "what links here" and red-links may be somewhat stale [08:54:57] also categorylinks, which impacts maint categories and stuff, but overall I think you're right that the rest of the job queue not working is/was more severe [08:55:07] pretty sure templatelinks were also affected - so template updates too [08:55:41] I did log / tail the page titles getting refreshed and it was a finite set repeating endlessly [08:56:00] so I don't think the number of updates that were dropped is as big as the queue size may have suggested [08:56:30] good morning [08:56:40] morning, hashar [08:57:18] e.g. mw1169 commonswiki 1.27.0-wmf.16 runJobs DEBUG: refreshLinks Template:City recursive=1 table=templatelinks range=array(4) division=70 [08:57:22] ori: seems you got the job madness stuff figured out! ;-} [08:57:31] not really [08:57:37] hashar, he cheated! [08:57:45] iddqd, idkfa [08:58:00] MaxSem: as long as you end up at the games End Credit screen, cheating is acceptable [08:58:26] up up down down left right left right A B [08:58:32] no, B A! 
[08:58:33] but cyberdaemon is much more fun to kill without BFG [08:58:33] damn it [08:59:31] !log restarting elasticsearch server elastic1007.eqiad.wmnet [08:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:45] hashar: i'm updating the task with the details [09:01:51] ori: awesome [09:02:02] (03PS2) 10Ori.livneh: varnishprocessor.py: remove unused import (varnishlog) [puppet] - 10https://gerrit.wikimedia.org/r/277794 (owner: 10Ema) [09:02:10] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishprocessor.py: remove unused import (varnishlog) [puppet] - 10https://gerrit.wikimedia.org/r/277794 (owner: 10Ema) [09:02:53] !log reimporting missing rows from production to labs (expect some lag during the day) [09:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:50] "the number of updates that were dropped is as big as the queue size may have suggeste" [09:11:17] the db writes was not changed, the reads were (x5 on the master) [09:23:39] (03PS1) 10Lokal Profil: Clarifying i18n parameters [dumps/dcat] - 10https://gerrit.wikimedia.org/r/277955 [09:30:34] (03PS9) 10KartikMistry: Enable non-default MT for some languages [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) [09:30:53] PROBLEM - Host mw1256 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:53] PROBLEM - Host mw1250 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:15] PROBLEM - Host mw1258 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:15] PROBLEM - Host mw1259 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:44] RECOVERY - Host mw1258 is UP: PING OK - Packet loss = 0%, RTA = 3.90 ms [09:31:44] RECOVERY - Host mw1250 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [09:31:54] RECOVERY - Host mw1256 is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [09:31:54] that's me, these are depooled, but too narrow downtime [09:32:14] PROBLEM - HHVM rendering on mw1254 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds [09:33:54] RECOVERY - HHVM rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 67530 bytes in 2.129 second response time [09:34:28] !log Copying data from db2009 to db2008 T130098 [09:34:29] T130098: Create a new x1 slave in codfw (just in case) - https://phabricator.wikimedia.org/T130098 [09:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:38:54] RECOVERY - Host mw1259 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [09:38:57] (03CR) 10Mobrovac: "What about cxserver and citoid?" [puppet] - 10https://gerrit.wikimedia.org/r/277798 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [09:39:14] (03PS1) 10Gehel: Enabling HTTPS access to elasticsearch via LVS [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) [09:42:44] _joe_: can you revisit https://gerrit.wikimedia.org/r/#/c/277463/? Reduced to 10% of file size. [09:47:40] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2128754 (10ori) 5Open>3Resolved a:3ori The code path which was inserting the vast majority of jobs is [[ https://github.com/wikimedia/mediawi... [09:47:59] hashar, _joe_ ^ [09:48:32] ori: awesome!!!! [09:48:44] ori: time for you to sleep I guess ? ;-} [09:48:57] yep, bye [09:49:09] \O/ [09:49:48] <_joe_> ori: you're awesome [09:50:08] huge drop https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=15&fullscreen !!! [09:50:24] <_joe_> hashar: yes that's ori blowing the engine up (cit.) 
[09:52:11] (03PS2) 10Giuseppe Lavagetto: cache::text: route all restbase traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277798 (https://phabricator.wikimedia.org/T127974) [09:53:06] there is dbstore1002 lag, that is part of the maintenance for labs, I will not ack it as it should be there during the maintenance [09:53:13] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:43] <_joe_> kart_: I don't have time now, sorry, I'm in the midst of a big change [09:59:32] (03CR) 10Ema: [C: 032 V: 032] cache::text: route all restbase traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277798 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [10:00:04] _joe_ mobrovac: Respected human, time to deploy Services test switch from eqiad to codfw (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T1000). Please do the needful. [10:02:24] <_joe_> jouncebot: we're on it [10:03:30] <_joe_> ema: I disabled puppet everywhere, I'll just apply it to cp1008 for now [10:03:52] _joe_: OK [10:04:43] _joe_: I haven't merged the change yet though, let me do that [10:04:58] <_joe_> oh [10:05:06] <_joe_> ok submit it :) [10:05:30] <_joe_> I am doing it [10:05:33] (03CR) 10Elukey: "Some comments, but overall looks good!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [10:08:10] <_joe_> !log running puppet across eqiad varnishes to switch traffic of restbase eqiad => codfw [10:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:58] <_joe_> mobrovac: traffic should start flowing to codfw [10:11:10] \o/ [10:11:16] i'm monitoring the logs and metrics [10:11:44] <_joe_> do you see it? 
[10:11:52] <_joe_> the traffic I mean [10:12:28] not yet [10:12:49] <_joe_> httpry says there is some :) [10:13:40] cool [10:13:58] yep, started [10:14:00] seeing it noe [10:14:02] now [10:14:09] yep network traffic going up on all restbase machines in codfw [10:14:16] <_joe_> !log external traffic is now flowing through restbase in codfw [10:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:08] wooooo [10:16:04] mobrovac: restbase request example with curl? [10:16:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [10:16:42] ema: curl https://en.wikipedia.org/api/rest_v1/page/html/{your_favourite_page} [10:16:49] <_joe_> anyone would care to review https://gerrit.wikimedia.org/r/#/c/277803/ ? [10:16:53] <_joe_> mobrovac: ^^ [10:17:50] looking [10:17:58] <_joe_> this is a prerequisite to the next patch, that makes eqiad call codfw too [10:18:12] <_joe_> (and the second is just temporary) [10:18:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [10:18:34] <_joe_> elukey: can you look at kafka? 
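The curl example given above takes a page title as the last path segment. A minimal Python sketch for building such a URL follows; the `restbase_html_url` helper and its `domain` default are hypothetical, and only the `/api/rest_v1/page/html/` path comes from the log. It builds the string and does not send a request.

```python
from urllib.parse import quote


def restbase_html_url(title, domain="en.wikipedia.org"):
    """Build the rest_v1 HTML URL for a page title (spaces become underscores)."""
    # safe="" also escapes "/" so titles with slashes stay in one path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{domain}/api/rest_v1/page/html/{encoded}"


print(restbase_html_url("Main Page"))
```

The resulting string can be passed to curl in place of the {your_favourite_page} placeholder form.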
[10:18:48] <_joe_> it's 99.999% unrelated to the restbase switch [10:18:51] <_joe_> but one never knows [10:19:08] (03CR) 10Mobrovac: [C: 031] Use the local restbase cluster in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277803 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [10:20:05] <_joe_> ok let's merge both patches then [10:20:14] <_joe_> the other one has a +1 from gabriel already [10:20:43] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:55] !log rebooting osmium for kernel upgrade [10:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:06] (03PS2) 10Giuseppe Lavagetto: Use the local restbase cluster in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277803 (https://phabricator.wikimedia.org/T127974) [10:21:17] _joe_: why does the next patch "lie" that codfw is eqiad? [10:21:23] ah so that only restbase is switched [10:21:29] (03CR) 10Giuseppe Lavagetto: [C: 032] "Long overdue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277803 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [10:21:47] <_joe_> mobrovac: it says that in eqiad it should use the codfw lb to reach restbase [10:21:53] (03Merged) 10jenkins-bot: Use the local restbase cluster in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277803 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [10:22:40] <_joe_> mobrovac: I'm merging https://gerrit.wikimedia.org/r/#/c/277804/ too; it will make mediawiki reach the codfw rb [10:22:47] <_joe_> will it handle the load? 
:P [10:23:40] :) [10:24:11] (03PS2) 10Giuseppe Lavagetto: Switch temporarily eqiad to use the codfw restbase cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277804 (https://phabricator.wikimedia.org/T127974) [10:24:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Switch temporarily eqiad to use the codfw restbase cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277804 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [10:25:13] _joe_: looking but it is surely noise :) [10:25:26] _joe_: working on alternative solution, no worry. [10:25:44] <_joe_> kart_: cool! sorry I just have no bandwidth right now [10:25:47] <_joe_> maybe a bit later [10:28:05] <_joe_> mobrovac: look out for cassandra load in codfw in particular [10:28:17] !log oblivian@tin Synchronized wmf-config/ProductionServices.php: switching mediawiki to use restbase in codfw (duration: 00m 32s) [10:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:38] (03PS1) 10Giuseppe Lavagetto: Revert "Switch temporarily eqiad to use the codfw restbase cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277960 [10:28:58] (03PS1) 10Giuseppe Lavagetto: Revert "cache::text: route all restbase traffic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/277961 [10:29:10] <_joe_> this ^^ is what will rollback [10:29:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [10:29:48] <_joe_> load is definitely going up :P [10:30:00] yup [10:30:28] <_joe_> https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [10:30:31] <_joe_> parsoid too [10:30:31] (03CR) 10Gehel: [C: 031] Increase purged entry point s-maxage from 12 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/277112 (owner: 10GWicke) [10:30:37] <_joe_> wtf where is parsoid codfw? 
[10:30:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[10:30:59] <_joe_> they're not in ganglia, wtf
[10:34:56] (PS1) Giuseppe Lavagetto: parsoid: collect ganglia metrics in codfw too [puppet] - https://gerrit.wikimedia.org/r/277962
[10:34:58] (PS1) Giuseppe Lavagetto: sca/scb: collect ganglia metrics in codfw too [puppet] - https://gerrit.wikimedia.org/r/277963
[10:34:59] <_joe_> akosiaris: ^^
[10:35:03] <_joe_> remember next time :)
[10:35:21] (CR) Giuseppe Lavagetto: [C: 2] parsoid: collect ganglia metrics in codfw too [puppet] - https://gerrit.wikimedia.org/r/277962 (owner: Giuseppe Lavagetto)
[10:35:32] (CR) Giuseppe Lavagetto: [V: 2] parsoid: collect ganglia metrics in codfw too [puppet] - https://gerrit.wikimedia.org/r/277962 (owner: Giuseppe Lavagetto)
[10:35:53] (CR) Giuseppe Lavagetto: [C: 2 V: 2] sca/scb: collect ganglia metrics in codfw too [puppet] - https://gerrit.wikimedia.org/r/277963 (owner: Giuseppe Lavagetto)
[10:36:30] (CR) Gehel: [C: -1] "Seems we have removed the logrotate rule. I might just be missing something simple, but it seems to me that we do want to keep logrotate i" [puppet] - https://gerrit.wikimedia.org/r/275749 (owner: EBernhardson)
[10:39:22] mobrovac: there's been a network traffic drop in restbase codfw, is everything fine?
[10:39:59] <_joe_> ema: that's an issue with the ganglia collector I fear [10:40:15] right, and it went back up again now [10:40:32] <_joe_> yeah the collector restarted because I added configs [10:40:34] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277964 [10:41:21] _joe_: sure that must be it, the number of CPUs also briefly went down :) [10:41:55] <_joe_> ema: exactly, sorry but I need to collect data for parsoid and sc* clusters as well :P [10:45:32] !log rebooting francium for kernel upgrade [10:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:46:05] 6Operations, 6Performance-Team, 10Traffic: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#2128900 (10mark) @Ori Do you still think you can get to this soon? [10:47:07] (03PS1) 10Giuseppe Lavagetto: citoid: switch traffic temporarily to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277965 [10:47:09] (03PS1) 10Giuseppe Lavagetto: cxserver: switch traffic to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/277966 [10:47:57] _joe_: there is no codfw backend listed for citoid [10:48:22] (and cxserver) [10:48:35] <_joe_> ema: and you're right, I'm doing too many things at the same time [10:48:37] <_joe_> amending [10:48:50] cool [10:53:09] (03PS2) 10Giuseppe Lavagetto: citoid: switch traffic temporarily to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277965 [10:53:11] (03PS2) 10Giuseppe Lavagetto: cxserver: switch traffic to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/277966 [10:53:54] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:54:37] <_joe_> ema: done! [10:54:45] awesome [10:54:57] <_joe_> ema: care to review? 
[10:55:04] I'm on it [10:55:19] (03CR) 10Ema: [C: 031] citoid: switch traffic temporarily to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277965 (owner: 10Giuseppe Lavagetto) [10:55:21] <_joe_> mobrovac: how are latencies? [10:55:38] <_joe_> the parsoid cluster is way overprovisioned in codfw :P [10:55:40] ^ icinga warning for labmon1001 is caused by a package upgrade and https://phabricator.wikimedia.org/T127957 [10:55:56] (03CR) 10Ema: [C: 031] cxserver: switch traffic to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/277966 (owner: 10Giuseppe Lavagetto) [10:56:42] <_joe_> mobrovac: also, I'll wait for your ack before moving on with citoid/cxserver [10:57:04] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [10:57:11] wth? ^ [10:57:13] euh [10:57:22] * mobrovac looking [10:57:26] <_joe_> ugh [10:57:34] <_joe_> godog might help too maybe? [10:57:55] he's off today [10:58:01] cass died there [10:58:05] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [10:58:44] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused [10:58:58] typical - OOM heap [10:59:06] rb1007 too? [10:59:17] looks like [10:59:19] * mobrovac restarting cass on restbase2004 [11:00:32] !log restbase restarted cassandra on restbase2004 - OOM [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:00:47] <_joe_> mobrovac: should I switch back at least mediawiki? [11:01:30] this is likely happening because the updates are still done in eqiad [11:01:44] <_joe_> oh restbase itself is pointing to eqiad? 
[11:01:54] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [11:02:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [11:02:26] !log restarting elasticsearch server elastic1008.eqiad.wmnet [11:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:02:38] <_joe_> ok this is not a good sign [11:02:49] <_joe_> mobrovac: I'm switching the mediawiki load back to eqiad [11:02:53] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042 [11:03:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [11:03:42] on rb1007 cassandra seems to be running though, why Connection refused? [11:03:51] <_joe_> seems to be back [11:04:08] _joe_: on rb2004, yes [11:04:15] ema: that's cassandra-b [11:04:20] we have 2 instances there [11:04:43] on the same port? 
[11:04:46] !log restbase restarted cassandra-a on restbase1007 [11:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:52] ema: yes, different IP [11:05:00] oh gotcha [11:06:10] <_joe_> anyways, it seems we have something to tweak at the cassandra level [11:08:15] yeah, the disparity between eqiad and codfw is not giving us good results [11:08:24] <_joe_> ook [11:08:30] <_joe_> let's roll back mediawiki for now [11:08:33] +1 [11:09:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Switch temporarily eqiad to use the codfw restbase cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277960 (owner: 10Giuseppe Lavagetto) [11:11:05] !log oblivian@tin Synchronized wmf-config/ProductionServices.php: switching mediawiki to use restbase in eqiad again (duration: 00m 32s) [11:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:12:37] <_joe_> uhm why is traffic not going back? [11:13:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:14:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:15:14] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.002 second response time on port 9042 [11:15:31] <_joe_> ok traffic from the jobrunners is back to eqiad [11:17:06] oh ok, if the jobrunners were sending the updates to codfw too, that explains it all [11:17:58] <_joe_> mobrovac: yes, it was a full blown test :) [11:18:11] <_joe_> I'll switch citoid next mobrovac [11:18:15] <_joe_> tell me when to go [11:18:51] _joe_: i think we can go ahead [11:18:57] with citoid [11:18:57] <_joe_> cool [11:19:29] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2128949 (10hashar) @ori awesome really, thank you for the summary. 
I was trying to figure out whether it was a cycle and which cycle it might be b... [11:20:14] (03CR) 10Giuseppe Lavagetto: [C: 032] citoid: switch traffic temporarily to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277965 (owner: 10Giuseppe Lavagetto) [11:23:18] Is there anyone who understands LVS who could have a look at https://gerrit.wikimedia.org/r/#/c/277956/ ? I'm a bit lost... [11:23:49] <_joe_> akosiaris: I can't find citoid in codfw on the LVSs? [11:23:54] <_joe_> where is it located? [11:25:10] <_joe_> mobrovac: luckily enough I checked... [11:25:44] <_joe_> mobrovac: I'm not switching citoid or cxserver now, sorry [11:26:21] <_joe_> I'm reverting my change as I cannot figure out why the IP for citoid.svc.eqiad.wmnet is not present on any of the primary lvs servers in codfw [11:26:37] <_joe_> sorry s/eqiad/codfw/ [11:27:18] hm [11:27:20] <_joe_> it should be on lvs2003 [11:27:27] <_joe_> but ipvsadm -L doesn't show it [11:28:13] _joe_: according to hiera/role/codfw/scb it's 10.2.1.19 [11:28:32] <_joe_> yeah but the problem is more general [11:28:53] iirc, akosiaris restarted lvs200[45] after adding those scb codfw addresses [11:29:08] <_joe_> ok [11:29:12] <_joe_> just found out [11:29:32] <_joe_> alex has restarted pybal only on the backup hosts [11:29:39] <_joe_> so it should be ok [11:30:33] <_joe_> but I'll restart it on lvs2003 too when he acks this [11:30:38] <_joe_> akosiaris: ^^ [11:30:44] <_joe_> let's switch then [11:31:23] !log rebooting install2001 for kernel upgrade [11:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:35] <_joe_> moritzm: no please [11:31:44] yes it is ok [11:31:56] <_joe_> it's the ganglia aggregator for codfw, I kinda need ganglia there [11:31:57] I just did not want to do it at 9 pm [11:32:05] <_joe_> akosiaris: fair enough [11:32:08] _joe_: sorry, already pressed enter. 
will be back in a minute [11:32:37] <_joe_> moritzm: ok [11:33:08] <_joe_> !log switching citoid to use codfw as well [11:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:52] install2001 back up [11:35:16] _joe_: the switch is live now? [11:35:23] <_joe_> mobrovac: yes [11:35:26] kk [11:37:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some inline comments. First draft looks good I think" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [11:37:50] gehel: ^ [11:38:20] akosiaris: Thanks a lot! I'm out of my depth here... [11:38:50] elukey: kafka1012 and 1020 seem to have problems [11:38:58] I am the perp of that file structure (well, the current version at least) more or less so ping me directly whenever needed [11:40:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0] [11:40:39] akosiaris: If I understand correctly, I do not have to change anything else than this configuration.yaml file. Correct? No need to add an entry to balancer.pp as there is already a "search" entry. (I'm trying to take stream as an example as there is also an HTTP and an HTTPS version) [11:41:25] !log restarting elasticsearch server elastic1009.eqiad.wmnet [11:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:41:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [11:42:08] ema: checking [11:42:34] but they were lagging earlier on. 
It is an outstanding issue with Kafka 0.8, but I'll double check [11:43:16] sometimes a replica goes nuts and thinks that it needs to fetch ALL the updates like it was just bootstrapped [11:43:27] triggering the lag [11:44:00] !log rebooting acamar for kernel upgrade [11:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:44:43] from https://grafana.wikimedia.org/dashboard/db/kafka everything looks good [11:45:30] * elukey wants kafka 0.9 [11:51:07] the weird thing is that it doesn't happen for EventBus (that is considerably smaller than analytics, but still) [11:51:25] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw2089.codfw.wmnet because of too many down!: apaches_80 - Could not depool server mw2191.codfw.wmnet because of too many down!: api_80 - Could not depool server mw2203.codfw.wmnet because of too many down!: kartotherian_6533 - Could not depool server maps-test2001.codfw.wmnet because of too many down!: graphoid_19000 [11:51:48] gehel: yes you got the easier LVS addition you can think of [11:52:03] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2049.codfw.wmnet because of too many down!: graphoid_19000 - Could not depool server scb2001.codfw.wmnet because of too many down!: parsoid_8000 - Could not depool server wtp2012.codfw.wmnet because of too many down!: citoid_1970 - Could not depool server scb2001.codfw.wmnet because of too many down!: mathoid_10042 - C [11:52:05] I think you don't need to add anything else [11:52:21] _joe_: that you ^ ? 
[11:52:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [11:53:05] <_joe_> akosiaris: It will recover in a few, and no it wasn't me [11:55:04] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [11:55:13] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [11:55:34] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [11:55:56] <_joe_> it was a real issue btw [11:57:16] but who was trying to depool all of these servers? [11:57:36] <_joe_> pybal [11:57:42] <_joe_> because fetches failed [11:58:07] <_joe_> Mar 17 11:50:05 lvs2003 pybal[24455]: [restbase_7231 IdleConnection] WARN: restbase2001.codfw.wmnet (ena [11:58:10] <_joe_> bled/down/pooled): Connection failed. [11:58:22] <_joe_> somehow it had trouble connecting to any backend [11:59:02] h [11:59:05] hm [11:59:06] dns ? [11:59:09] no bueno [11:59:14] <_joe_> akosiaris: possible [11:59:17] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2129095 (10hashar) I have looked again auth the p75 edit response time but over 100 days. It was around 550 ms until January 14th when it dropped to... 
[12:03:22] (03CR) 10Giuseppe Lavagetto: [C: 032] cxserver: switch traffic to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/277966 (owner: 10Giuseppe Lavagetto) [12:05:00] csteipp_afk: ping [12:05:26] <_joe_> !log switching cxserver to codfw [12:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:42] 6Operations, 10ops-codfw, 6DC-Ops: mw2066 to mw2074 don't reboot cleanly - https://phabricator.wikimedia.org/T130008#2129140 (10Aklapper) [12:13:44] 6Operations, 10ops-codfw, 6DC-Ops: rack new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2129141 (10Aklapper) [12:13:46] 6Operations, 10ops-eqiad, 6DC-Ops: db1053 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T129829#2129142 (10Aklapper) [12:13:48] 6Operations, 10ops-codfw, 6DC-Ops: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2129146 (10Aklapper) [12:13:55] 6Operations, 10ops-codfw, 6DC-Ops: ms-be2010.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T129117#2129150 (10Aklapper) [12:13:57] 6Operations, 10ops-codfw, 6DC-Ops, 13Patch-For-Review: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196#2129149 (10Aklapper) [12:13:59] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2129151 (10Aklapper) [12:14:02] 6Operations, 10ops-codfw, 6DC-Ops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2129152 (10Aklapper) [12:14:07] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2129157 (10Aklapper) [12:14:09] 6Operations, 10ops-codfw, 6DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2129160 (10Aklapper) [12:14:12] 6Operations, 
10ops-codfw, 10DBA, 6DC-Ops, 13Patch-For-Review: es2010 controller issue - https://phabricator.wikimedia.org/T127769#2129162 (10Aklapper) [12:14:16] 6Operations, 10ops-eqiad, 6DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2129167 (10Aklapper) [12:14:20] 6Operations, 10ops-eqiad, 6DC-Ops, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2129169 (10Aklapper) [12:14:22] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2129172 (10Aklapper) [12:14:26] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Decommission mw1037 - https://phabricator.wikimedia.org/T126350#2129173 (10Aklapper) [12:14:28] 6Operations, 10ops-codfw, 6DC-Ops: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2129174 (10Aklapper) [12:14:30] 6Operations, 10ops-eqiad, 6DC-Ops: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2129175 (10Aklapper) [12:14:32] 6Operations, 10ops-codfw, 6DC-Ops: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2129177 (10Aklapper) [12:14:34] 6Operations, 10ops-esams, 6DC-Ops, 10Traffic: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2129178 (10Aklapper) [12:14:36] 6Operations, 10ops-eqiad, 6DC-Ops, 10Traffic, 13Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#2129181 (10Aklapper) [12:14:38] 6Operations, 10ops-codfw, 6DC-Ops: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#2129183 (10Aklapper) [12:14:40] 6Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2129185 (10Aklapper) [12:14:42] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: decom caesium - 
https://phabricator.wikimedia.org/T125165#2129186 (10Aklapper) [12:14:44] 6Operations, 10ops-codfw, 6DC-Ops: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#2129187 (10Aklapper) [12:14:46] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2129188 (10Aklapper) [12:14:49] 6Operations, 10ops-codfw, 6DC-Ops: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2129193 (10Aklapper) [12:14:51] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#2129195 (10Aklapper) [12:14:53] 6Operations, 10ops-codfw, 6DC-Ops, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#2129194 (10Aklapper) [12:14:57] 6Operations, 10ops-codfw, 6DC-Ops: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#2129201 (10Aklapper) [12:15:03] 6Operations, 10ops-eqiad, 6DC-Ops, 10hardware-requests: Decommission neptunium - https://phabricator.wikimedia.org/T122101#2129208 (10Aklapper) [12:15:08] 6Operations, 10ops-codfw, 6DC-Ops, 10EventBus, and 3 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2129214 (10Aklapper) [12:15:16] 6Operations, 10ops-eqiad, 6DC-Ops: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#2129218 (10Aklapper) [12:15:18] 6Operations, 10ops-codfw, 6DC-Ops: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#2129219 (10Aklapper) [12:15:22] 6Operations, 10ops-codfw, 6DC-Ops: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#2129224 (10Aklapper) [12:15:24] 6Operations, 10ops-ulsfo, 6DC-Ops: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2129226 (10Aklapper) [12:15:26] 6Operations, 10ops-esams, 6DC-Ops, 
10netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#2129230 (10Aklapper) [12:15:28] 6Operations, 10ops-eqiad, 6DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2129228 (10Aklapper) [12:15:30] 6Operations, 10ops-eqiad, 6DC-Ops, 10hardware-requests: Decommission rubidium - https://phabricator.wikimedia.org/T118213#2129231 (10Aklapper) [12:15:32] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Decommission plutonium - https://phabricator.wikimedia.org/T118586#2129229 (10Aklapper) [12:15:34] 6Operations, 10ops-esams, 6DC-Ops: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#2129233 (10Aklapper) [12:15:36] 6Operations, 10ops-eqiad, 6DC-Ops, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#2129239 (10Aklapper) [12:15:38] 6Operations, 10ops-eqiad, 6DC-Ops, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#2129237 (10Aklapper) [12:15:41] 6Operations, 10ops-codfw, 6DC-Ops: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#2129245 (10Aklapper) [12:15:43] 6Operations, 10ops-eqiad, 6DC-Ops, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#2129243 (10Aklapper) [12:15:45] 6Operations, 10ops-codfw, 6DC-Ops: solve mtp panel issue for row uplinks - https://phabricator.wikimedia.org/T112774#2129244 (10Aklapper) [12:15:47] 6Operations, 10ops-eqiad, 6DC-Ops, 10Traffic, and 2 others: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#2129246 (10Aklapper) [12:15:50] 6Operations, 10ops-eqiad, 6DC-Ops: What to do with decommissioned ciscos? 
- https://phabricator.wikimedia.org/T103374#2129247 (10Aklapper) [12:15:52] 6Operations, 10ops-codfw, 6DC-Ops, 10Incident-20150617-LabsNFSOutage: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2129248 (10Aklapper) [12:15:54] 6Operations, 10ops-eqiad, 6DC-Ops, 6Labs, 10Labs-Infrastructure: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#2129249 (10Aklapper) [12:15:57] 6Operations, 10ops-esams, 6DC-Ops: Check power supply balance settings on cp3030+ - https://phabricator.wikimedia.org/T98984#2129250 (10Aklapper) [12:15:59] 6Operations, 10ops-eqiad, 6DC-Ops, 10Incident-Labs-NFS-20151216, and 2 others: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#2129251 (10Aklapper) [12:16:02] 6Operations, 10ops-esams, 6DC-Ops, 10Traffic: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#2129252 (10Aklapper) [12:16:05] 6Operations, 10ops-esams, 6DC-Ops: Audit racktables - https://phabricator.wikimedia.org/T94819#2129253 (10Aklapper) [12:16:07] 6Operations, 10ops-esams, 6DC-Ops: Remove unused fibers - https://phabricator.wikimedia.org/T94704#2129254 (10Aklapper) [12:16:09] 6Operations, 10ops-esams, 6DC-Ops, 13Patch-For-Review: decommission cp3001 & cp3002 - https://phabricator.wikimedia.org/T94215#2129255 (10Aklapper) [12:16:11] !log restart pybal on lvs2002 [12:16:11] 6Operations, 10ops-esams, 6DC-Ops: cp3011 hardware fault - https://phabricator.wikimedia.org/T92306#2129256 (10Aklapper) [12:16:13] 6Operations, 10ops-esams, 6DC-Ops: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#2129259 (10Aklapper) [12:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:15] 6Operations, 10ops-esams, 6DC-Ops: Rack ms-be3006 and ms-be3007 - https://phabricator.wikimedia.org/T91637#2129257 (10Aklapper) [12:16:17] 6Operations, 10ops-esams, 6DC-Ops: Upgrade cp3011-3014 
with 10G cards - https://phabricator.wikimedia.org/T88684#2129258 (10Aklapper) [12:16:19] 6Operations, 10ops-esams, 6DC-Ops: setup the 2 new esams ms-be systems - https://phabricator.wikimedia.org/T86784#2129261 (10Aklapper) [12:16:21] 6Operations, 10ops-codfw, 6DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#2129262 (10Aklapper) [12:16:24] 6Operations, 10ops-esams, 6DC-Ops: Setup management switch in OE12 - https://phabricator.wikimedia.org/T84700#2129265 (10Aklapper) [12:16:26] 6Operations, 10ops-eqiad, 6DC-Ops: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#2129271 (10Aklapper) [12:16:46] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2129274 (10Aklapper) >>! In T119944#2056247, @Aklapper wrote: >>! In T119944#2044189, @faidon wrote: >> * For #DC-Ops, tag along all of the #ops-$site + #procuremen... [12:17:58] andre__: impressive mass edit :) [12:18:32] :) [12:19:23] !log restart pybal on lvs2001 [12:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:19:39] 6Operations, 10ops-eqiad, 6DC-Ops: db1053 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T129829#2129278 (10jcrespo) 5Open>3Resolved a:3jcrespo 32:4 has 3 media errors. Should be ok. [12:24:16] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2129300 (10faidon) 5Open>3Resolved Thanks for everything :) [12:24:35] (03PS2) 10Gehel: Enabling HTTPS access to elasticsearch via LVS. 
[puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) [12:25:15] !log restarting elasticsearch server elastic1010.eqiad.wmnet [12:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:30] (03CR) 10Gehel: Add systemd unit for logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [12:33:45] (03PS1) 10Giuseppe Lavagetto: cache::text: switch traffic to restbase, cxserver and citoid back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/277976 [12:37:07] (03CR) 10Giuseppe Lavagetto: [C: 032] cache::text: switch traffic to restbase, cxserver and citoid back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/277976 (owner: 10Giuseppe Lavagetto) [12:38:14] <_joe_> !log switching all services back to eqiad [12:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:55] <_joe_> mobrovac: {{done}} [12:42:21] kk [12:42:24] thnx _joe_ [12:43:23] yup, all's back [12:45:07] (03PS1) 10Muehlenhoff: Update to 4.4.6 [debs/linux44] - 10https://gerrit.wikimedia.org/r/277977 [12:46:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.6 [debs/linux44] - 10https://gerrit.wikimedia.org/r/277977 (owner: 10Muehlenhoff) [12:49:34] RECOVERY - DPKG on labmon1001 is OK: All packages OK [12:49:46] (03PS1) 10Elukey: Add automatic failover to Hadoop Namenodes. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) [12:58:36] (03CR) 10jenkins-bot: [V: 04-1] Add automatic failover to Hadoop Namenodes. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [12:59:50] (03CR) 10Gehel: Enabling HTTPS access to elasticsearch via LVS. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:00:45] !log rolling reboot of parsoid servers in codfw for kernel upgrade [13:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:56] 6Operations, 6Commons, 6Multimedia, 10UploadWizard: Uploading files (<100MB) with UploadWizard chunked uploads failing with 500 & 503 - https://phabricator.wikimedia.org/T130204#2129457 (10zhuyifei1999) [13:01:36] (03CR) 10Mobrovac: [C: 031] "LGTM. The puppet compiler is happy too - https://puppet-compiler.wmflabs.org/2081/" [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [13:02:04] !log restarting elasticsearch server elastic1011.eqiad.wmnet [13:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:29] (03PS2) 10Elukey: Add automatic failover to Hadoop Namenodes. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) [13:29:24] (03PS1) 10Mforns: Correct destination dir in browser reports rsync [puppet] - 10https://gerrit.wikimedia.org/r/277987 (https://phabricator.wikimedia.org/T127326) [13:36:31] !log restarting elasticsearch server elastic1012.eqiad.wmnet [13:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:54] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [13:38:24] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [13:40:00] (03CR) 10Ottomata: [C: 032] Correct destination dir in browser reports rsync [puppet] - 10https://gerrit.wikimedia.org/r/277987 (https://phabricator.wikimedia.org/T127326) (owner: 10Mforns) [13:41:03] grr ^ [13:42:10] !log restbase restarting cassandra on restbase2004 - OOM again [13:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[13:43:14] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [13:43:44] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042 [13:55:16] (03CR) 10Ottomata: [C: 031] "THIS IS AWESOOOME!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [14:02:48] (03PS3) 10Elukey: Add automatic failover to Hadoop Namenodes. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) [14:09:03] !log rolling reboot of parsoid servers in eqiad for kernel upgrade [14:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:20] (03CR) 10Ottomata: "Oo found a couple of things" (033 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [14:17:25] 6Operations, 6Discovery, 7Elasticsearch, 7Epic: Collect threaddumps from elasticsearch at regular intervals - https://phabricator.wikimedia.org/T130209#2129701 (10Gehel) [14:24:44] (03CR) 10EBernhardson: "removing logrotate was entirely unintentional, will fix." [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [14:27:36] !log restarting elasticsearch server elastic1013.eqiad.wmnet [14:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:45] (03PS5) 10Gehel: Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [14:33:17] 6Operations, 6Labs: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2129752 (10chasemp) re: > Multi-week historical copies as space allows I'm open to what makes sense here but I didn't explain the purpose clearly above I think. This is not primarily intended as any kin... 
[14:35:49] (03CR) 10Gehel: "@ebernhardson: I re-enabled logrotate and creation of log directory (my guess: they were both removed by accident)." [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [14:41:05] (03CR) 10EBernhardson: [C: 031] Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [14:54:36] o/ akosiaris [14:54:52] Was hoping to check in re. your work on getting the ORES servers set up? [14:55:13] And see if there's anything you need from me or Amir1 [15:00:04] anomie ostriches thcipriani marktraceur aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T1500). Please do the needful. [15:00:04] dcausse jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:22] o/ [15:00:30] \o [15:00:33] I can SWAT today. [15:02:34] thcipriani: this is my first SWAT :P I pushed my patch to mediawiki-config (just updating portals) but it hasn't been merged into the repo, and I'm not sure who should do the +2'ing. [15:03:12] jan_drewniak: I'll do the +2, your patch seems fine. [15:03:35] I'll +2, pull down to the deployment machine, and then let you know when it's been sync'd :) [15:03:49] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277477 (https://phabricator.wikimedia.org/T129502) (owner: 10DCausse) [15:04:57] thcipriani: synced by running the 'sync-portals' script? thanks! 
[15:05:22] (03Merged) 10jenkins-bot: Enable ICU Folding on greek wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277477 (https://phabricator.wikimedia.org/T129502) (owner: 10DCausse) [15:06:09] (03PS6) 10DCausse: Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [15:07:19] !log restarting elasticsearch server elastic1014.eqiad.wmnet [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:49] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ICU Folding on greek wikipedia PART I [[gerrit:277477]] (duration: 00m 30s) [15:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:15] halfak: hey man, it's progressing, albeit slowly. I 've allocated some easy stuff like LVS and such. I am looking at the role classes now.. they 're quire a lot [15:09:23] s/quire/quite/ [15:09:29] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: Enable ICU Folding on greek wikipedia PART II [[gerrit:277477]] (duration: 00m 31s) [15:09:30] ^ dcausse check please [15:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277964 (owner: 10Jdrewniak) [15:10:12] akosiaris, gotcha. Anything that I can help with? [15:10:55] thcipriani: I see the config change, thanks! [15:11:01] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277964 (owner: 10Jdrewniak) [15:11:07] dcausse: cool. Thanks for checking! [15:11:12] akosiaris, Otherwise when should we expect to start experimenting with deployments? [15:11:37] halfak: probably. Mostly question answering I think... I 'll be pinging you [15:11:56] er, deployment ? hopefully by Monday ? [15:12:32] Wooo! [15:12:36] Monday would be great. 
[15:12:51] We're likely have a few other things in place and ready early next week. [15:12:56] And we'll have Yuvi back :) [15:14:03] btw, I am wonder what's https://wikitech.wikimedia.org/wiki/Nova_Resource:Revscoring and what's https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores for [15:14:20] is the first one ORES + the mediawiki extension ? and the second ones just ORES ? [15:14:27] jan_drewniak: okie doke. starting sync. [15:14:34] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: Bumping portals to master [[gerrit:277964]] (duration: 00m 28s) [15:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:51] (03PS2) 10Ema: Port varnishreqstats and varnishstatsd to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) [15:15:03] !log thcipriani@tin Synchronized portals: SWAT: Bumping portals to master [[gerrit:277964]] (duration: 00m 29s) [15:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:11] ^ jan_drewniak check please [15:16:20] sweet! looks good :D thanks! [15:16:42] jan_drewniak: cool, thank you for checking! [15:17:01] halfak: btw, I am a bit worried about https://upload.wikimedia.org/wikipedia/commons/9/91/ORES.request_sequence_diagram_%28not_cached%29.svg [15:17:42] I know you 've discussed it with yuvi extensively so I won't push on this one, but it's an unsual architectural decision to have the client waiting ... [15:17:51] akosiaris, \o/ glad the diagram is helpful! [15:18:02] akosiaris, to have the client waiting? [15:18:21] (03CR) 10DCausse: [C: 031] "Just added a new option to increase timeout, these are errors I've seen while inspecting the logs of some failed indices." 
[puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [15:18:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [15:18:46] akosiaris, FWIW, a lot of this sequence diagram happens within celery itself. [15:19:04] Hi. [15:19:09] (03PS1) 10Dereckson: Add Portal namespace to ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278009 (https://phabricator.wikimedia.org/T130108) [15:19:13] akosiaris, But I suspect that you are concerned about the async result of the call to celery [15:19:21] Yuvi had concerns too. [15:19:21] thcipriani: would it still be possible to append this patch to the SWAT? ^ [15:19:44] (03Abandoned) 10Giuseppe Lavagetto: Revert "cache::text: route all restbase traffic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/277961 (owner: 10Giuseppe Lavagetto) [15:19:54] Dereckson: sure. put it on the calendar, please. [15:19:56] But we've been running this against 100 external requests/minute and 200 total requests for almost a year now. [15:20:02] * halfak gets a performance plot. [15:20:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [15:20:08] That should be in the wikitech page [15:20:26] yeah I know it has been discussed already [15:20:29] thcipriani: thanks, I'm adding it. 
[15:20:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278009 (https://phabricator.wikimedia.org/T130108) (owner: 10Dereckson) [15:20:54] (03Merged) 10jenkins-bot: Add Portal namespace to ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278009 (https://phabricator.wikimedia.org/T130108) (owner: 10Dereckson) [15:21:13] it's just that I fear that it might break down spectacularly when it breaks down due to that architecture decision [15:21:48] as clients will be piling up keeping connections open [15:22:23] Added to calendar. [15:22:38] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add Portal namespace to ne.wikipedia [[gerrit:278009]] (duration: 00m 28s) [15:22:41] instead of getting an HTTP code of maybe 202 ? [15:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:45] ^ Dereckson check please [15:23:06] anyway... Obviously not something to change now, but something to keep in mind [15:23:27] akosiaris, we ran into this when I set up backpressure. [15:23:37] or even 503 with a Retry-After ... [15:23:47] I did a lot of work to make sure that the bottleneck is the celery queue and that we 503 when that queue gets too full. [15:23:50] thcipriani: tested, works fine. [15:23:59] Dereckson: cool. thank you! [15:24:03] Thank you for the deploy. [15:24:07] queue too full == a request has to wait 1 second before being started by a worker. [15:24:22] <_joe_> I'm late for swat right? [15:24:25] akosiaris, FWIW, Yuvipanda insisted on this backpressure strategy. [15:24:29] <_joe_> thcipriani: right? :( [15:24:52] But I hear you. For now, it seems that the async celery pattern is working OK. I'm most worried about redis as our SPOF :( [15:25:01] Working on that next. [15:25:03] _joe_: if it doesn't require a full scap, we've probably got time for another patch.
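The backpressure behaviour halfak describes above (answer with a 503 once the celery queue is too full, optionally with the Retry-After header akosiaris suggests) can be sketched roughly as follows. This is a minimal illustration and not ORES code: the class name `BoundedScoringQueue`, the `max_pending` limit and the one-second `Retry-After` value are all hypothetical, and a plain in-memory deque stands in for the celery queue.

```python
from collections import deque


class BoundedScoringQueue:
    """Hypothetical stand-in for a task queue that applies backpressure."""

    def __init__(self, max_pending, retry_after_seconds=1):
        self.max_pending = max_pending
        self.retry_after_seconds = retry_after_seconds
        self.pending = deque()

    def admit(self, request_id):
        """Queue the request, or return a rejection response.

        Returns None when the request was queued (the client keeps
        waiting for a worker), or a (status, headers) tuple when the
        queue is full and the request is turned away immediately.
        """
        if len(self.pending) >= self.max_pending:
            return 503, {"Retry-After": str(self.retry_after_seconds)}
        self.pending.append(request_id)
        return None

    def start_next(self):
        """A worker picks up the oldest queued request, freeing a slot."""
        return self.pending.popleft()
```

Turning requests away at admission time keeps the number of waiting clients tied to the queue limit instead of letting open connections pile up, which is the failure mode raised above.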
[15:25:15] halfak: yeah, it was the obvious alternative to always returning 503 with a Retry-After if the result is not in the cache [15:25:34] <_joe_> thcipriani: https://gerrit.wikimedia.org/r/#/c/277786/ [15:26:05] halfak: I think we may be able to handle the redis being a SPOF in production, I'll have to look a bit into it but we will have 2 replicated redis anyway [15:26:08] akosiaris, +1. We skip the queue if the result is in the cache [15:26:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277786 (owner: 10Giuseppe Lavagetto) [15:26:26] akosiaris, yeah. Yuvipanda has a task open for looking at twemproxy. [15:26:32] We're also looking at RabbitMQ [15:26:33] _joe_: could you add that to the Deployments page? [15:26:43] halfak: that's exactly what I was going to suggest [15:26:45] <_joe_> thcipriani: will do! [15:26:59] But yuvipanda was worried about us not having experience with RabbitMQ, so it'll be nice if twemproxy works out. [15:27:06] :) [15:27:18] <_joe_> rabbitmq? EWWW [15:27:20] <_joe_> stay clear [15:27:23] lol [15:27:23] lol [15:27:25] <_joe_> redis is way better [15:27:29] <_joe_> trust me :P [15:27:35] * _joe_ still has PTSD [15:27:43] * halfak does like redis generally [15:27:47] from rabbitmq ? same here [15:27:51] I think most of our redis issues are a missing timeout somewhere. [15:28:00] (03Merged) 10jenkins-bot: Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277786 (owner: 10Giuseppe Lavagetto) [15:28:12] that has even hit mediawiki in production, so it's not hard to believe [15:28:16] (03CR) 10Ema: [C: 031] "Great stuff!"
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [15:28:48] halfak: er btw, what's the diff between https://upload.wikimedia.org/wikipedia/commons/e/eb/ORES.request_sequence_diagram_%28cached%29.svg and https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores ? [15:28:52] <_joe_> thcipriani: tell me when to test it [15:28:54] err [15:28:58] bad paste [15:29:09] _joe_: will do [15:29:17] halfak: I meant https://wikitech.wikimedia.org/wiki/Nova_Resource:Revscoring vs https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores [15:29:31] first one is an ORES install + the mediawiki extension or something ? [15:29:47] revscoring is the oldest project. We used to host everything underneath that. See Nova_Resource:Wikilabels too [15:30:02] Now, we use Nova_Resource:Revscoring for experimentation [15:30:27] So right now, we have some VMs there for experimenting with new puppet configs, twemproxy, etc. [15:30:29] !log thcipriani@tin Synchronized wmf-config/ProductionServices.php: SWAT: Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain PART I [[gerrit:277786]] (duration: 00m 26s) [15:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:40] I'm planning to use that project to experiment with kicking the redis server and making changes. [15:30:55] ok, that clears it up for me. thanks for that [15:31:06] :) [15:31:08] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain PART II [[gerrit:277786]] (duration: 00m 25s) [15:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:55] forgot to rebase. One sec.
[15:32:04] !log thcipriani@tin Synchronized wmf-config/ProductionServices.php: SWAT: Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain PART I [[gerrit:277786]] (duration: 00m 25s) [15:32:04] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [15:32:44] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain PART II [[gerrit:277786]] (duration: 00m 25s) [15:32:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:32:50] ^ _joe_ check please [15:32:56] <_joe_> ok [15:33:54] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042 [15:34:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:37:09] <_joe_> thcipriani: I see no errors, so should be ok [15:38:00] _joe_: cool. Thanks for checking! Fatalmonitor/logstash didn't explode either :) [15:38:53] <_joe_> that's one of the two things I checked [15:39:42] (03PS1) 10Ottomata: Add $socket parameter to mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/278013 (https://phabricator.wikimedia.org/T127991) [15:41:33] urandom: I'm reviewing your patches for the upcoming puppet SWAT (this is my first...() [15:41:54] kk [15:42:12] urandom: https://gerrit.wikimedia.org/r/#/c/277843/ has a comment about someone being "ready to start a bootstrap". I have no idea what that means... [15:42:20] * gehel is still the new guy... 
[15:42:24] (03CR) 10Eevans: [C: 031] restbase1012.eqiad.wmnet: enable instance 'b' [puppet] - 10https://gerrit.wikimedia.org/r/277843 (https://phabricator.wikimedia.org/T125842) (owner: 10Eevans) [15:42:29] gehel: it's +1 now [15:42:31] (03CR) 10Ottomata: [C: 032] Add $socket parameter to mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/278013 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [15:42:59] gehel: i just wanted to make sure it didn't get merged until we're ready to start the bootstrap [15:43:14] urandom: still, so that I can learn something today, what is "starting a bootstrap" in this context? [15:43:29] <_joe_> urandom: are you going to need opsens assistance for this? [15:44:07] !log restarting elasticsearch server elastic1015.eqiad.wmnet [15:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:11] _joe_: nope [15:44:18] _joe_: just the merge [15:45:11] gehel: we're moving to a config where we run multiple instances of cassandra on each host, this changeset will add a second to restbase1012 [15:45:23] gehel: it will add the configuration, i should say [15:45:47] afterward, i need to start the bootstrap, which copies data over and onlines the node [15:46:01] urandom: ok, makes sense... [15:49:03] akosiaris: Would you have some time to review https://gerrit.wikimedia.org/r/#/c/277956/ again? I think it now makes sense... [15:50:37] akosiaris: I'm also unsure about the merge / deploy of this one. I'd appreciate having someone available when I deploy, to make sure I do not screw it up... [15:51:39] gehel: yeah, I will. It's actually quite easy these days [15:52:25] akosiaris: probably, but I don't understand much about LVS and it seems to be central enough that a mistake could hurt a bit... [15:52:32] better safe than sorry... [15:53:11] yeah I am gonna be around and help. When do you want to deploy that ?
[15:53:25] looks good btw [15:53:27] lemme +1 it [15:53:36] (03CR) 10Alexandros Kosiaris: [C: 031] Enabling HTTPS access to elasticsearch via LVS. [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [15:53:59] 6Operations, 6Services, 10procurement: codfw: (3) RESTbase nodes - https://phabricator.wikimedia.org/T130218#2130010 (10mark) [15:54:35] 6Operations, 6Services, 10procurement: codfw: (3) RESTbase nodes - https://phabricator.wikimedia.org/T130218#2130025 (10mark) p:5Triage>3High [15:55:17] akosiaris: I have the puppet SWAT starting in 5', which should not take too long. If we can add this one at the end of the list, it would be great! [15:55:42] oh not a trivial change .. does not belong at the SWAT [15:55:54] but that's semantics [15:56:04] we can merge it otherwise whenever you want [15:56:22] akosiaris: let me ping you once SWAT is over... [15:57:10] akosiaris: I kind of understand most of what's in the config file, I just have a doubt about "monitors.ProxyFetch" ... [15:57:41] ok, that will take a while to explain, ping me when SWAT is over [15:58:34] <_joe_> !log performing a load test on mw1239 [15:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:04] _joe_ gehel: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T1600). Please do the needful. [16:00:04] urandom ebernhardson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:20] * urandom is available! [16:00:56] urandom: let's start with 277112 [16:01:43] gehel: sure, for it and 277836, i'll need to do a rolling restart of restbase [16:01:53] urandom: first time I'm doing this, so I'll ask you to double check what I'm doing... 
[16:01:53] gehel: i'll do that once both have been applied [16:02:15] _joe_: gehel too late to slip in https://gerrit.wikimedia.org/r/#/c/277423/ ? [16:02:31] urandom: do you need me to run puppet on those nodes? Or is that something you'll do during your rolling restart? [16:02:39] gehel: i can do that [16:03:06] thcipriani: lemme have a look ... [16:03:19] thanks [16:04:48] thcipriani: looks simple enough to me, but I have to admit I do not understand the context. _joe_ could you also have a look at https://gerrit.wikimedia.org/r/#/c/277423/ ? Please ? [16:04:57] (03PS3) 10Gehel: Increase purged entry point s-maxage from 12 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/277112 (owner: 10GWicke) [16:05:24] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2130053 (10EBernhardson) [16:06:23] urandom: I'm rebasing and merging your first 2 patches, then the 3rd, correct? Or do you want to do all 3 at the same time? [16:06:42] (03CR) 10Gehel: [C: 032] Increase purged entry point s-maxage from 12 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/277112 (owner: 10GWicke) [16:06:47] gehel: I'm prepared for whatever order they land :) [16:07:29] (03PS3) 10Gehel: Set up outgoing request filter config. [puppet] - 10https://gerrit.wikimedia.org/r/277836 (owner: 10Ppchelko) [16:07:52] urandom: then let's do 1+2, then 3... [16:08:01] gehel: k [16:09:03] (03CR) 10Gehel: [C: 032] Set up outgoing request filter config. [puppet] - 10https://gerrit.wikimedia.org/r/277836 (owner: 10Ppchelko) [16:09:41] (03PS1) 10ArielGlenn: remove some typo in redirect filter option for abstracts wrapper [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/278017 [16:10:11] urandom: 277112 and 277836 have been merged. I'll let you puppet apply and restart, let me know when you're ready for the third one.
[16:10:43] gehel: you can do it whenever [16:11:18] (03PS2) 10Gehel: restbase1012.eqiad.wmnet: enable instance 'b' [puppet] - 10https://gerrit.wikimedia.org/r/277843 (https://phabricator.wikimedia.org/T125842) (owner: 10Eevans) [16:11:29] (03CR) 10ArielGlenn: [C: 032 V: 032] remove some typo in redirect filter option for abstracts wrapper [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/278017 (owner: 10ArielGlenn) [16:11:35] gehel: thanks! [16:13:25] <_joe_> !log load testing done [16:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:47] (03CR) 10Gehel: [C: 032] restbase1012.eqiad.wmnet: enable instance 'b' [puppet] - 10https://gerrit.wikimedia.org/r/277843 (https://phabricator.wikimedia.org/T125842) (owner: 10Eevans) [16:14:20] urandom: third patch coming up! [16:14:29] merged and ready... [16:15:23] (03PS7) 10Gehel: Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [16:16:56] gehel: looks good so far [16:17:01] ebernhardson, dcausse: I'm getting ready to deploy https://gerrit.wikimedia.org/r/#/c/275749/ can you please help my bad memory: which host is used for those crons? [16:17:21] gehel: terbium [16:17:40] dcausse: thx! I'll run puppet there to validate the change is effective... [16:17:47] thanks! 
[16:17:54] (03CR) 10Gehel: [C: 032] Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [16:18:01] gehel: terbium [16:18:20] * ebernhardson is late to the party :P [16:18:34] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: puppet fail [16:18:52] !log bootstrapping restbase1012-b.eqiad.wmnet : T125842 [16:18:52] <_joe_> gehel: always go check that things run smoothly [16:18:53] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [16:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:08] <_joe_> and look at the puppet failures [16:19:16] <_joe_> I'll look at the analytics box for you [16:19:23] _joe_: the analytics box... [16:19:35] _joe_: I should have a look too, good exercise for me ... [16:19:56] (03PS1) 10Dereckson: Add rollbacker group to kk.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278021 (https://phabricator.wikimedia.org/T130215) [16:20:06] <_joe_> uhm seems ok? wtf? [16:20:26] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:21:08] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2130115 (10ArielGlenn) 5declined>3Open [16:21:31] _joe_: yep, can't find a single error in puppet logs ... [16:21:51] <_joe_> gehel: me neither [16:22:07] <_joe_> gehel: someone was running puppet manually maybe [16:22:31] _joe_: even manually, shouldn't it produce logs? [16:22:39] <_joe_> nope [16:22:43] <_joe_> it logs to stdout [16:22:54] <_joe_> bbiab [16:23:05] dcausse, ebernhardson: I'm activating the crons on terbium after a noop run. All looks good. Could you double check? 
[16:23:09] it will record to syslog though [16:23:16] just not puppet.log [16:23:39] gehel: sounds good I can see the new timeout [16:24:07] but reports are sent in both modes, no ? [16:25:20] urandom: does it look all good on your side? [16:25:42] gehel: i'm about to start restbase restarts [16:25:54] the 3rd patch is good, the bootstrap is underway [16:26:05] gehel: 1 and 2 are applied [16:26:54] _joe_: could you double check me on https://gerrit.wikimedia.org/r/#/c/277423/ when you are back? [16:27:03] !log do a for i in mathoid citoid graphoid cxserver; do sudo confctl --tags dc=eqiad,cluster=sca,service=$i --action delete sca1002.eqiad.wmnet ; done. Same for sca1001.eqiad.wmnet [16:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:18] thcipriani: sorry, first time I'm doing this, I'd like a double check before I merge... [16:27:41] gehel: no problem, probably a good policy, generally :) [16:27:55] !log restarting restbase on xenon.eqiad.wmnet (canary), to apply https://gerrit.wikimedia.org/r/#/c/277112/ and https://gerrit.wikimedia.org/r/#/c/277836/ [16:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:30] thcipriani: Do you need me to do anything else than merge? Puppet apply somewhere? Or will you take care of that? [16:29:35] (03PS1) 10Volans: Add db2008 commented out to x1 shard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278022 (https://phabricator.wikimedia.org/T130098) [16:30:02] gehel: I don't have permissions to run puppet, this _should_ be a no-op, but would apply to all service nodes [16:30:15] ^ mobrovac any preference for how this rolls out? [16:31:07] this is a somewhat counter-intuitive change...btw [16:31:14] it's a noop, so no need [16:31:24] no need to force puppet at all i mean [16:31:29] thcipriani: all service nodes == /aqs100[123]\.eqiad\.wmnet/ ? 
[16:31:32] in order to pass a boolean to scap::target the patch goes through a dance of strings ... [16:31:50] (03CR) 10Luke081515: [C: 031] Add rollbacker group to kk.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278021 (https://phabricator.wikimedia.org/T130215) (owner: 10Dereckson) [16:33:40] gehel: there are some other nodes in there as well, but should all be a noop. [16:34:14] (03CR) 10Alexandros Kosiaris: "some comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [16:35:02] (03PS1) 10Dzahn: puppet-lint: fix some more indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/278023 [16:35:55] !log performing rolling restart of restbase staging to apply https://gerrit.wikimedia.org/r/#/c/277112/ and https://gerrit.wikimedia.org/r/#/c/277836/ [16:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:11] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: Connection refused [16:37:24] ^^^ this is me, I will ack it [16:37:31] urandom: ok [16:37:57] urandom: that's the new instance? So nothing to worry about for the moment. Correct? [16:38:00] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: Connection refused eevans This node is bootstrapping - The acknowledgement expires at: 2016-03-18 16:37:42. [16:38:17] gehel: correct, it's to be expected; it'll clear when the bootstrap finishes [16:39:10] thcipriani: I'm finding the comment from akosiaris on your patch most interesting. I'll exercise my caution here... and will not merge at the moment. Sounds OK to you? [16:39:51] gehel: sure. I can address those comments first. Thanks. [16:40:07] seems like a good time to re-read GRASP ... 
[16:41:19] !log restarting elasticsearch server elastic1016.eqiad.wmnet [16:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:47] !log rolling restart of restbase staging complete [16:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:52] akosiaris: about that LVS deployment, can you walk me through it? [16:44:00] gehel: yup [16:44:17] but first, proxyfetch [16:44:26] yes [16:44:27] so, pybal does checks on hosts [16:44:32] various kinds [16:44:41] the most obvious being IdleConnection [16:44:48] which just open a TCP connection and keeps it open [16:45:05] (03PS3) 10Gehel: Enabling HTTPS access to elasticsearch via LVS. [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) [16:45:06] !log performing restart of restbase1003.eqiad.wmnet (canary) to apply https://gerrit.wikimedia.org/r/#/c/277112/ and https://gerrit.wikimedia.org/r/#/c/277836/ [16:45:07] proxyFetch actually tries to fetch an HTTP resource [16:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:29] if any of the monitors declared in the LVS config stanza fail [16:45:36] the backend appserver will be depool [16:45:39] depooled* [16:46:02] akosiaris: so why "Proxy" in this case? [16:46:21] a, naming ? [16:46:52] akosiaris: yeah, always naming :P [16:47:35] ah, because it would fetch from proxies in the first setup IIRC [16:47:40] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2130215 (10Dzahn) @Johan Under normal circumstances we should be able to do these with minimal user impact. Nevertheless, because T130147 happened there might have been a bit more for a few... 
[16:47:50] back then it was squid only that it would pool/depool [16:47:59] akosiaris: also, looking at the HTTP search config, I see a ProxyFetch.url: http://localhost/, which I understand to be an HTTP check on localhost port 80. But there is nothing listening on port 80 on the elasticsearch servers [16:48:11] ah, that is even better [16:48:33] so... the port and IP/hostname is actually defined in the configuration [16:49:05] that specific thing, is in the configuration a URL, but nothing apart the scheme and the request part are honored [16:49:27] akosiaris: code never lies, it's just misleading ... [16:49:29] and localhost there means that the HTTP request will have a Host: localhost header AND NOTHING ELSE [16:49:58] so http://localhost:4234 is the same as http://localhost:234 and http://blahblah [16:50:05] effectively that is [16:50:23] minus the Host: header issue (assuming the backend appserver actually cares about that) [16:50:54] I 've had very recently a discussion with bblack about that... it's actually quite misleading to have that URL there [16:51:06] !log rolling restart of restbase production to apply https://gerrit.wikimedia.org/r/#/c/277112/ and https://gerrit.wikimedia.org/r/#/c/277836/ [16:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:22] especially since the only things honored are the scheme and the request part [16:51:34] akosiaris: yep, I scratched my head for some time on that one... [16:52:02] well, right now it's actually clearer if you look at the rest of the entries since all of them spot localhost [16:52:10] s/spot/use/ [16:53:00] gehel: https://phabricator.wikimedia.org/rOPUPd6462f274c76de5aec232f5577b2f76a161a8773 .. 
which was fixed just a few days ago [16:53:08] 2 to be exact [16:53:13] for some definition of fixed [16:53:54] we should split that up in probably scheme, host_header, request [16:53:56] akosiaris: all that means that you can only have checks that work on the same port as the service. So no indirect checks ... [16:54:17] indirect checks ? [16:54:33] on the LVS level ? that would be dangerous I think [16:54:47] like udp service you check it on http on another port because has some admin console [16:55:21] yeah but I don't think we actually have a use case for that [16:55:31] and our LVS is DR (direct routing) [16:55:31] something like that. Or I have a probe in my app that exposes the health of my app, but I only want to expose it through SSL [16:55:47] statsd, mysql doing a query :-P ) [16:55:49] which means no PAT anyway [16:55:50] so I'll check the health of HTTP through HTTPS [16:56:31] akosiaris: that question is academic more than anything else... [16:56:45] that has the obvious problem that HTTP might not be working while the HTTPS is [16:56:50] <_joe_> gehel it's easy to add new monitor to our load balancer [16:56:58] HTTPS probe is* [16:57:14] and the probe misbehaves and all that jazz [16:57:21] but I agree, we digress [16:57:25] wanna merge the patch ? [16:57:50] akosiaris: now that I understand why there is a chance that it actually works, yes! I'd be happy to merge. [16:58:14] (03CR) 10Mobrovac: Pass deploy user from service::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [16:58:35] akosiaris: to activate it, what do I have to do beside merging? [16:58:58] run puppet on lvs hosts, and restart pybal. BUT!!!!! [16:59:05] there is a huge BUT here obviously [16:59:27] so our LVSes are group into pairs, one active, one passive [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. 
Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T1700). [17:00:05] lvs2001, lvs2004 is one, lvs2002, lvs2005 another and lvs2003, lvs2006 another [17:00:25] and so on. eqiad is special (as always) as there are 4 right now instead of 2 [17:00:44] lvs1001, lvs1004, lvs1007, lvs1010 and so on [17:00:48] (03PS1) 10Dereckson: Namespaces shorcuts aliases on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278029 (https://phabricator.wikimedia.org/T127591) [17:01:51] akosiaris: lemme check if I understand: for search the LVS in eqiad are 03, 06, 09 and 12. Correct? [17:01:52] so, you first restart pybal on all passive nodes (the lower ones are always the active ones as a rule of thumb), although that is governed by BGP actually. but for now follow the rule of thumb [17:02:04] hmm lemme make sure [17:02:19] akosiaris: I'm taking that rfom balancer.pp [17:02:24] yes that is correct [17:02:40] * gehel understands some of LVS now [17:03:08] even in eqiad, there is always only 1 active (so 3 passives) [17:03:18] yeah eqiad is in a migration phase [17:04:46] (03CR) 10Jcrespo: [C: 031] "As it is a pure comment change, let's wait to merge it at the same time than other codfw change, such as https://gerrit.wikimedia.org/r/27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278022 (https://phabricator.wikimedia.org/T130098) (owner: 10Volans) [17:05:52] gehel: actually, hmmm we might be missing a part ... conftool [17:06:09] damn, it was too easy... [17:07:14] hehe, yeah, I thihnk it should be easy though [17:07:41] hmm so termination is done by another set of nginxs though, right ? [17:07:53] installed on the elasticsearch hosts ? 
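The restart order akosiaris outlines above (pybal on the passive LVS hosts first, using the rule of thumb that the lowest-numbered host in a group is the active one) can be sketched as below. This is only an illustration: the function names are hypothetical, restarting the active host last is an assumption rather than something stated in the discussion, and since the active host is really decided by BGP the rule of thumb may not always hold.

```python
import re


def host_number(hostname):
    """Extract the numeric suffix of a host name such as 'lvs1006'."""
    match = re.search(r"(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no numeric suffix in {hostname!r}")
    return int(match.group(1))


def pybal_restart_order(lvs_group):
    """Return the hosts of one LVS group in a cautious restart order.

    Rule of thumb from the discussion: the lowest-numbered host is the
    active one. Passive hosts come first; putting the active host last
    is an assumption made for this sketch.
    """
    if not lvs_group:
        return []
    ordered = sorted(lvs_group, key=host_number)
    active, passives = ordered[0], ordered[1:]
    return passives + [active]
```

For the eqiad search group mentioned above (03, 06, 09 and 12), this would list lvs1006, lvs1009 and lvs1012 ahead of lvs1003.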
[17:09:08] akosiaris: yes [17:09:14] I 'll add a comment on the patch of what needs to be added [17:09:27] (03CR) 10Gilles: [C: 031] Remove obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277359 (owner: 10Aaron Schulz) [17:10:50] !log rolling restart of restbase production complete [17:10:52] akosiaris: I'll be in standup in 20 minutes and I'd like to have a bit more free time when I deploy that one. Any chance you'd be available tomorrow (we don't have a rule against this kind of deployment on Fridays, do we?) [17:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:13] no we don't [17:11:26] I 'll be around tomorrow European morning [17:12:29] akosiaris: Need to take my son to daycare in the morning, but I should be available around 10am CET at the latest. Ok for you ? [17:12:36] yup [17:12:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "lacks a conftool stanza." [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:13:04] conftool lesson tomorrow then ;-) [17:13:08] akosiaris: thanks a lot for taking the time to explain all this! [17:13:12] yw [17:13:19] * gehel loves learning new things! [17:14:00] (03CR) 10Alexandros Kosiaris: "s/9300/9243/ on the above obviously" [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:14:10] urandom: rolling restart is completed? All looks good? Can I consider this Puppet SWAT a success? [17:14:17] (03PS1) 10Dereckson: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278032 (https://phabricator.wikimedia.org/T129768) [17:14:31] * volans too... but tomorrow he'll not be able to listen the next part unfortunately, will read the backlog :) [17:14:34] gehel: yup! [17:14:40] gehel: thanks for your help! [17:14:47] thanks Alexandros [17:15:33] * gehel is your humble servant... 
[17:22:54] (03CR) 10Alexandros Kosiaris: Pass deploy user from service::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [17:24:38] Can anyone point me to where access request documentation is at? Cannot find it on wikitech [17:25:45] nuria: yes, looking for [17:25:52] mutante: tanks! [17:25:59] *thanks [17:26:18] nuria: basically, you click + in maniphest and under regular "create task" there is also "create access request" [17:26:36] mutante: in phab correct? [17:26:37] it's not much more than a ticket to ops with that additional tag "ops-access-request" [17:26:40] yes [17:27:06] https://wikitech.wikimedia.org/wiki/Requesting_shell_access has the docs [17:27:14] nuria: https://wikitech.wikimedia.org/wiki/Requesting_shell_access [17:27:26] mutante: k [17:27:33] mutante: ETOOLATE this time :) [17:28:23] 6Operations, 10Ops-Access-Requests: Access Request - https://phabricator.wikimedia.org/T130226#2130333 (10Nuria) [17:28:24] yes ;p [17:29:04] 6Operations, 10Ops-Access-Requests: Access Request - https://phabricator.wikimedia.org/T130226#2130348 (10Dzahn) a:3Dzahn [17:29:36] 6Operations, 10Ops-Access-Requests: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2130350 (10Krenair) [17:38:43] 6Operations, 10Wikimedia-Site-Requests, 10Wikimedia-Video: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#2130382 (10MarcoAurelio) Or send the videos for manual upload to [[ https://wikitech.wikimedia.org/wiki/Clusters | EQIAD or other appropriate place ]]? 
[17:40:20] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [17:40:21] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [17:50:50] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [17:54:40] (03PS1) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [17:55:08] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2130484 (10Capt_Swing) Thank you @Dzahn! [17:55:24] (03PS2) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [17:57:16] ^^^^^ urandom: related to your restart? Anything I can do to help? [17:57:29] (03PS1) 10Volans: DB: Expose Puppet SSL certs and generate CA cert [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278042 (https://phabricator.wikimedia.org/T111654) [17:57:40] mutante: hi, can you get out some errors logs for me? [17:57:50] !log restarting elasticsearch server elastic1017.eqiad.wmnet [17:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:50] (03CR) 10Volans: "Do we have a way to test it with the puppet compiler before merging it into the submodule?" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278042 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [17:58:52] (03PS1) 10ArielGlenn: add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) [17:58:59] MatmaRex: what do you need? [17:59:09] urandom: cass died yet again on rb2004, i can restart it, but i'm not really in shape to look into it further [17:59:10] starting ocg deploy (at tail end of window) [17:59:14] urandom: could you? 
[17:59:56] (03PS1) 10Dzahn: admin: add mforns, nuria to statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278044 (https://phabricator.wikimedia.org/T130226) [18:00:03] mutante: i sent you the URL via PM [18:00:05] andrewbogott moritzm: Dear anthropoid, the time has come. Please deploy Wikitech maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T1800). [18:00:41] !log restbase restarted cassandra on restbase2004, OOM [18:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:37] moritzm: I'm here — want me to do anything or just stand by? [18:01:41] 7Blocked-on-Operations, 6Operations, 10Datasets-General-or-Unknown: Snapshot hosts need to be manually added to dataset1001's exports - https://phabricator.wikimedia.org/T111586#2130503 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/278043/ as a first take. At least it ought to sort out the export -r iss... [18:02:00] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.040 second response time on port 9042 [18:02:00] 6Operations, 6Commons, 6Multimedia, 10UploadWizard: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2130504 (10matmarex) [18:02:09] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [18:02:45] andrewbogott: I'll start with silver? [18:02:52] sure [18:03:12] 6Operations, 6Commons, 6Multimedia, 10UploadWizard: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2129457 (10matmarex) These errors are just failures to generate the thumbnails and should not interfere with uploads, perhaps there's something e... 
[18:03:19] !log rebooting silver for kernel update [18:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:48] 6Operations, 10ArchCom-RfC, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2130508 (10RobLa-WMF) I'm adding #ArchCom-RFC to add this to #ArchCom's managed backlog [18:05:46] (03CR) 10Mobrovac: Pass deploy user from service::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [18:06:20] !log rebooting holmium for kernel update [18:07:49] PROBLEM - Host labs-ns3.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.12) [18:08:20] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [18:08:27] mobrovac: yeah, i'm on it [18:08:38] kk thnx [18:08:49] i guess you restarted it? [18:09:03] 6Operations, 6Commons, 6Multimedia, 10UploadWizard: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2129457 (10Dzahn) i have been asked to check logs. this is from oxygen /srv/log/webrequests/5xx.json where that filename appears. all the same I... [18:09:27] hmm, wikitech is offline? :) [18:10:18] FlorianSW: yep, silver is rebooting for kernel update [18:10:35] ema: yeah, have seen it now when scrolling through the log :) [18:10:39] thanks anyway :) [18:10:40] RECOVERY - Host labs-ns3.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [18:10:44] FlorianSW: sure! [18:12:15] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2130517 (10matmarex) [18:14:50] (03CR) 10Jcrespo: "No. :-( On the other side, merging to the submodule has no production efect. I will review it tomorrow." 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278042 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [18:16:19] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:18:57] silver reboot took a little longer, the reboot hung and needed to be powercycled via the serial console [18:19:55] moritzm: it looks pretty alive now :) [18:23:20] !log rebooting labcontrol1002 for kernel update [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:39] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:26:50] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [18:28:46] !log rebooting labnet1001 for kernel update [18:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:53] (03PS4) 10Elukey: Add automatic failover to Hadoop Namenodes. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) [18:36:58] !log updated OCG to version c1a8232594fe846bd2374efd8f7c20d7e97ac449 [18:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:21] 6Operations, 10Wikimedia-Site-Requests, 10Wikimedia-Video: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#2130601 (10Krenair) If swift can't handle the size of the videos, server-side upload isn't going to be able to help. [18:47:04] (03PS1) 10Ottomata: Set PATH in flock mylvmbackup subshell [puppet] - 10https://gerrit.wikimedia.org/r/278051 (https://phabricator.wikimedia.org/T127991) [18:47:24] 6Operations, 10Wikimedia-Site-Requests, 10Wikimedia-Video: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#1457394 (10Dzahn) quote from that non-public ticket : "The files were uploaded to labs by coren, and converted by me to webm. 
the output of the conversion is larg... [18:47:49] (03PS1) 10Volans: [WIP] DB: Use generated CA for the TLS transition [puppet] - 10https://gerrit.wikimedia.org/r/278052 (https://phabricator.wikimedia.org/T111654) [18:50:46] (03CR) 10Ottomata: [C: 032] Set PATH in flock mylvmbackup subshell [puppet] - 10https://gerrit.wikimedia.org/r/278051 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [18:50:51] (03CR) 10Volans: "This will be sent for real after https://gerrit.wikimedia.org/r/#/c/278042/ will be merged." [puppet] - 10https://gerrit.wikimedia.org/r/278052 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [18:51:04] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2130708 (10Dzahn) also check backend logs on fluorine but can't find "UploadStash" related issues in exception.log or fatal.log, except th... [18:57:37] (03PS1) 10Dereckson: Enable Translate extension on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278054 (https://phabricator.wikimedia.org/T129888) [19:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T1900). 
[19:05:11] (03PS1) 10Ottomata: Don't set password in /root/.my.cnf if it is not defined [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278055 (https://phabricator.wikimedia.org/T127991) [19:05:52] (03PS2) 10Ottomata: Don't set password in /root/.my.cnf if it is not defined [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278055 (https://phabricator.wikimedia.org/T127991) [19:07:08] (03CR) 10Ottomata: [C: 032] Don't set password in /root/.my.cnf if it is not defined [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278055 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:08:26] (03PS1) 10Ottomata: Update mariadb module with .my.cnf root password change, set analytics-meta instance root password to undef [puppet] - 10https://gerrit.wikimedia.org/r/278056 (https://phabricator.wikimedia.org/T127991) [19:09:51] (03PS21) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [19:13:24] 6Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1881753 (10Nuria) Kafka update is needed for future uses of kafkaconnect instead of camus [19:15:31] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2129457 (10ori) ``` 2016-03-17 11:39:20 mw1200 commonswiki 1.27.0-wmf.17 exception ERROR: [d1f325d3] /w/api.php?action=query&format=json&p... 
[19:28:38] (03CR) 10Ottomata: [C: 032] Update mariadb module with .my.cnf root password change, set analytics-meta instance root password to undef [puppet] - 10https://gerrit.wikimedia.org/r/278056 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:32:36] (03PS1) 10Ottomata: Remove extra newline created by last change in root.my.cnf.erb [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278064 [19:33:20] (03CR) 10Ottomata: [C: 032] Remove extra newline created by last change in root.my.cnf.erb [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278064 (owner: 10Ottomata) [19:34:33] (03PS1) 10Ottomata: Update mariadb submodule with formatting change, set password to false [puppet] - 10https://gerrit.wikimedia.org/r/278065 (https://phabricator.wikimedia.org/T127991) [19:37:24] grrr.. [19:37:34] 19:34:44 sync-dir failed: Command 'find '/srv/mediawiki-staging/php-1.27.0-wmf.17' -name '*.php' -or -name '*.inc' -or -name '*.phtml' -or -name '*.php5' | xargs -n1 -P6 -exec php -l >/dev/null' returned non-zero exit status 123 [19:37:35] P6 simple grabbing of tickets via xml - https://phabricator.wikimedia.org/P6 [19:38:02] heh, stashbot [19:39:58] (03PS2) 10BryanDavis: Logging: Add ApiRequest kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273559 (https://phabricator.wikimedia.org/T108618) [19:40:13] (03CR) 10Ottomata: [C: 032] Update mariadb submodule with formatting change, set password to false [puppet] - 10https://gerrit.wikimedia.org/r/278065 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:41:07] !log restarting elasticsearch server elastic1018.eqiad.wmnet [19:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:22] (03CR) 10BryanDavis: "All blocking patches and tasks are done. This is finally ready to go out!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/273559 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [19:46:17] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2129457 (10matmarex) @dzahn @ori I split off this error to {T130253}, I think it's unrelated. [19:54:28] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2131011 (10ori) Managed to reproduce this and to capture the 503 response via varnishlog: {P2786, lines=60, highlight="51-53"} [19:56:51] (03CR) 10EBernhardson: [C: 031] "looks sane to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273559 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [20:00:11] !log restarting elasticsearch server elastic1019.eqiad.wmnet [20:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:59] !log Restarting Cassandra on restbase1007-a.eqiad.wmnet (compaction seems stalled) [20:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:14:50] !log Starting Cassandra on restbase2004.codfw.wmnet (OOM. again.) [20:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:15] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2131149 (10ori) The content-type is image/png, so there is no point in gzipping, but the backend sends a gzipped-encoded response anyhow. 
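For context on the 19:37:34 sync-dir failure: GNU xargs exits with status 123 when any invocation of the command it runs exits with a status between 1 and 125, so the message says that at least one lint call failed but not which file was responsible. A rough sketch of a per-file check that names the failing files follows. The helper `failing_files` is hypothetical, and it assumes the lint command reports a problem through a non-zero exit status.

```python
import subprocess


def failing_files(paths, lint_cmd):
    """Run lint_cmd once per path and return the paths whose check failed.

    lint_cmd is a list such as ["php", "-l"]; the path is appended as the
    last argument. Assumption: a non-zero exit status means the check failed.
    """
    failures = []
    for path in paths:
        result = subprocess.run(
            [*lint_cmd, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append(path)
    return failures


# Example call (not executed here, it needs a php binary and real files):
# failing_files(["includes/Foo.php", "includes/Bar.php"], ["php", "-l"])
```

Unlike the parallel `xargs -P6` pipeline in the log, this runs the checks one at a time, which is slower but makes the failing file obvious.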
[20:17:39] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused [20:18:14] ^^^ on this [20:23:00] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.003 second response time on port 9042 [20:24:16] !log twentyafterfour@tin Started scap: sync php-1.27.0-wmf.17 [20:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:18] (03PS2) 10Andrew Bogott: Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 [20:27:20] (03PS1) 10Andrew Bogott: Keystone policy.json: Allow anyone to read endpoints or services. [puppet] - 10https://gerrit.wikimedia.org/r/278080 [20:28:35] urandom: let me know if you need help on those Cassandra OOM ... [20:29:36] (03CR) 10jenkins-bot: [V: 04-1] Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 (owner: 10Andrew Bogott) [20:29:50] gehel: thanks [20:32:02] (03PS3) 10Andrew Bogott: Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 [20:33:06] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2131287 (10Paladox) You can use a hack to use npm 4.3 By doing this is package.json Example "scripts": { "test": "npm install npm && node node_modules/npm/cli.js -v && node node_modules/npm/cli.js run-script grunt",... [20:34:10] (03PS3) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [20:34:57] (03PS4) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [20:35:58] (03PS5) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [20:36:20] (03CR) 10Dzahn: [C: 032] "adding empty group. 
access request is separate." [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) (owner: 10Dzahn) [20:36:40] godog: hey, Can you package scap3 and release it for debian [20:36:59] (03CR) 10Andrew Bogott: [C: 032] Keystone policy.json: Allow anyone to read endpoints or services. [puppet] - 10https://gerrit.wikimedia.org/r/278080 (owner: 10Andrew Bogott) [20:37:14] (03PS6) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [20:37:36] (03PS7) 10Dzahn: admin: add group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278041 (https://phabricator.wikimedia.org/T130226) [20:38:54] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2131355 (10aaron) See also T128730 for comments on doing single-DC maintenance. [20:40:03] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2131362 (10Dzahn) p:5Triage>3Normal [20:40:48] !log Issuing `nodetool scrub -s -- local_group_wikipedia_T_parsoid_html data` on restbase2004.eqiad.wmnet : T130254 [20:40:49] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [20:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:03] 6Operations, 6Performance-Team, 6Release-Engineering-Team, 7Availability, and 2 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2131383 (10Krinkle) a:3aaron [20:42:02] 6Operations, 6Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2131387 (10aaron) p:5Triage>3Low [20:46:16] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo 
rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2131444 (10Dzahn) Hi @Nuria, i uploaded 2 changes. One is adding a new group, called "statistics-web-roots" (like an existing group statistics-web-users but... [20:47:28] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2131459 (10aaron) >>! In T124673#2099319, @Joe wrote: > While complex and requiring a series of manual steps, this should be manageable to do. I'd... [20:48:13] (03CR) 10Gehel: [C: 031] puppet-lint: fix some more indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/278023 (owner: 10Dzahn) [20:51:14] !log twentyafterfour@tin Finished scap: sync php-1.27.0-wmf.17 (duration: 26m 57s) [20:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:40] (03PS2) 10Dzahn: puppet-lint: fix some more indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/278023 [20:53:04] gehel: thank you, that type of change gets path conflicts so quick, cant leave them too long in gerrit [20:53:20] i think the bots added you automatically [20:53:28] from that special wiki page, right [20:53:37] (03PS4) 10Andrew Bogott: Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 [20:53:54] mutante: yep, I'm watching a few modules... and the change was easy enough to review... 
[20:54:32] *nod* i like that watch feature [20:56:26] (03PS5) 10Andrew Bogott: Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 [20:56:30] (03CR) 10Dzahn: [C: 032] puppet-lint: fix some more indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/278023 (owner: 10Dzahn) [20:56:32] 6Operations, 10Phabricator, 6Project-Admins, 6Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2131646 (10mcruzWMF) Ping @Aklapper [20:59:25] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2131690 (10Paladox) @Hashar helped me to figure out an easyer way of installing npm please ignore the top bit. Please do the following Add this to package.json "dependencies": { "npm": "~3" } you can pick your own npm ver... [21:00:20] hashar ^^^ :) [21:02:06] (03PS1) 1020after4: all wikis to 1.27.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278082 [21:02:46] (03CR) 1020after4: [C: 032] all wikis to 1.27.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278082 (owner: 1020after4) [21:04:59] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [5000000.0] [21:05:05] Session "[30]CentralAuthSessionProvider Metadata merge failed: exception 'MediaWiki\Session\MetadataMergeException' with message 'Key "CentralAuthSource" changed' in /srv/mediawiki/php-1.27.0-wmf.17/includes/session/SessionProvider.php:194 [21:05:17] is this something we should be worried about or ignoring? 
[21:05:30] there are quite a lot of errors like that right now [21:06:35] (03PS2) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [21:06:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [21:07:30] (03PS3) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [21:07:35] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all to 1.27.0-wmf.17 [21:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:06] 6Operations, 7Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2131840 (10Dzahn) fix most of the indentation warnings https://gerrit.wikimedia.org/r/#/c/278023/ moving roles that setup microsites https://gerrit.wikimedia.org/r/#/c/275... [21:09:59] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T93645" [puppet] - 10https://gerrit.wikimedia.org/r/278023 (owner: 10Dzahn) [21:10:18] (03PS4) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [21:17:43] (03PS5) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [21:19:29] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:19:59] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [21:21:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:21:49] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[21:22:43] twentyafterfour: that error means somebody (usually a broken bot running an old version of pywikibot) has bad session cookies [21:23:18] * bd808 looks to see if it's one of the known problem bots [21:25:14] well errors seem stable, nothing crazy showing up in fatalmonitor [21:25:25] oh, 800 in the last hour is "normal" [21:25:29] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2132147 (10Cmjohnson) [21:25:34] :) [21:25:40] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2084820 (10Cmjohnson) 5Open>3Resolved @jcrespo db1077 and db1078 are finished with install. Resolving tasks [21:25:41] when a really broken bot shows up it's more like 800/minute [21:26:10] ok I think everything is cool, I'm going to grab more coffee :) [21:26:12] * twentyafterfour steps away for a minute [21:30:16] (03PS6) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [21:36:14] 6Operations, 10Phabricator, 6Project-Admins, 6Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2132208 (10Aklapper) @mcruzWMF, @Ocaasi: Thanks! I've added you now. If there are any questions about [[ https://www.mediawiki.org/wiki/Phab... [21:37:35] (03CR) 10Dzahn: [C: 031] "Yes, seems clearly better than the existing approach. Yea, you still have to manually add them, but not having to run "exportfs -r" is a b" [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) (owner: 10ArielGlenn) [21:39:03] greg-g: I've got a config change that I'd like to roll out. https://gerrit.wikimedia.org/r/#/c/273559 -- adding new logging to Kafka. Seems like a quiet time to roll it out. 
[21:39:20] there are already 7 patches in swat [21:40:29] bd808: hokay [21:40:49] 7Blocked-on-Operations, 6Operations, 10Datasets-General-or-Unknown: Snapshot hosts need to be manually added to dataset1001's exports - https://phabricator.wikimedia.org/T111586#1611081 (10Dzahn) Not having to manually run that command is definitely an improvement. And it looks much easier to read in the hie... [21:42:31] greg-g: thx [21:43:48] man the root partition on tin is still pretty low on disk space. [21:43:57] did anyone figure out what is hogging it? [21:45:24] apparently /var/lib/l10nupdate [21:45:49] is there no job cleaning up the old branches caches there? [21:46:14] there's no need to keep them around as long as the actual deployed branches [21:46:39] this comes up about once per deploy lately. yesterday it has been said that they get deleted but after 50 days afair [21:46:42] !log reducing cluster.routing.allocation.disk.watermark.high to 70% on eqiad elasticsearch cluster [21:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:47:51] !log deleted /var/lib/l10nupdate/caches/cache-1.27.0-wmf.1[345] on tin. Freed ~4G of disk [21:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:48:23] (03PS7) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [21:48:41] mutante: the deployed branches live that long, yes. I wonder if we missed an un-puppeted clean up script for l10nupdate in the rebuild though [21:49:05] once a branch rolls off of prod there is no need to have the l10nupdate cache around anymore. 
[21:49:33] it is only needed while a version is active and will get nightly l10nupdate patches [21:50:16] (03PS3) 10BryanDavis: Logging: Add ApiRequest kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273559 (https://phabricator.wikimedia.org/T108618) [21:50:22] (03CR) 10BryanDavis: [C: 032] Logging: Add ApiRequest kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273559 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [21:50:43] bd808: that might as well be the case about that script. i remember deleting some of them manually in the distant past but that's about it [21:51:18] it seems like the sort of thing somebody may have hacked up and forgot to ever put in puppet [21:51:36] (03Merged) 10jenkins-bot: Logging: Add ApiRequest kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273559 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [21:53:38] yes [21:55:49] 6Operations, 6Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132321 (10Dzahn) [21:55:54] bd808: ^ [21:56:06] lazy copy/paste from IRC [21:56:30] 6Operations, 6Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132333 (10Dzahn) [21:56:45] thanks mutante [21:57:16] (03PS1) 10Aaron Schulz: Bump jobqueue "connectTimeout" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278090 [21:57:33] (03PS6) 10Andrew Bogott: Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 [21:57:55] np. i'll step outside for a bit. they are sanding my wall.. 
so much dust [21:58:30] (03PS2) 10Aaron Schulz: Bump jobqueue "connectTimeout" to 300ms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278090 [21:58:38] be back from outside [21:58:59] (03CR) 10Andrew Bogott: [C: 032] Modify designatedashboard to recognize proxy records [puppet] - 10https://gerrit.wikimedia.org/r/277943 (owner: 10Andrew Bogott) [22:01:26] (03CR) 10Ori.livneh: "why?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278090 (owner: 10Aaron Schulz) [22:03:11] 6Operations, 10ArchCom-RfC, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2132338 (10cscott) @Joe not sure this is a "horrible service to our community" when the service is running just fine and doing... [22:03:18] 6Operations, 10Deployment-Systems, 6Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132339 (10greg) [22:04:16] 6Operations, 10Deployment-Systems, 6Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132321 (10Reedy) Could we just purge those when we purge the usual localisation caches in /srv/mediawiki-staging ? 
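The cleanup discussed for T130317 amounts to finding `cache-<version>` directories whose branch is no longer deployed. A dry-run sketch is given below, assuming the directory layout seen in the 21:47:51 log entry (`/var/lib/l10nupdate/caches/cache-1.27.0-wmf.13` and so on). The function name and the way the active versions are supplied are hypothetical, and nothing is deleted.

```python
from pathlib import Path


def stale_l10n_caches(cache_root, active_versions):
    """List cache-<version> directories whose version is not in active_versions.

    This only reports candidates; removing them is left to the operator.
    """
    active = set(active_versions)
    stale = []
    for entry in sorted(Path(cache_root).glob("cache-*")):
        # Strip the "cache-" prefix to recover the branch name.
        version = entry.name[len("cache-"):]
        if entry.is_dir() and version not in active:
            stale.append(entry)
    return stale


# Example call (paths taken from the log, active versions are an assumption):
# stale_l10n_caches("/var/lib/l10nupdate/caches",
#                   {"1.27.0-wmf.16", "1.27.0-wmf.17"})
```

Passing the active versions in explicitly keeps the sketch independent of how the deployed branches are tracked, which the log does not spell out.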
[22:04:21] !log bd808@tin Synchronized wmf-config/event-schemas: Add ApiRequest kafka logging (T108618) (duration: 00m 38s) [22:04:22] T108618: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618 [22:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:17] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Add ApiRequest kafka logging (T108618) (duration: 00m 34s) [22:05:18] T108618: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618 [22:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:36] 17 Avro failed to serialize record for ApiRequest : {"timeSpentBackend":"Ex [22:05:36] pected integer, but recieved double"} in /srv/mediawiki/php-1.27.0-wmf.17/includ [22:05:36] es/debug/logger/monolog/AvroFormatter.php on line 97 [22:05:39] grr [22:06:28] (03PS6) 10Andrew Bogott: Have the site-branding link link back to horizon rather than to wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/276264 [22:07:07] (03PS2) 10Tim Landscheidt: diamond: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/273451 [22:08:10] (03PS1) 10BryanDavis: Temporarily disable ApiRequest logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278093 [22:08:40] (03CR) 10BryanDavis: [C: 032] Temporarily disable ApiRequest logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278093 (owner: 10BryanDavis) [22:09:17] (03CR) 10Andrew Bogott: [C: 032] Have the site-branding link link back to horizon rather than to wikitech. 
[puppet] - 10https://gerrit.wikimedia.org/r/276264 (owner: 10Andrew Bogott) [22:09:33] (03Merged) 10jenkins-bot: Temporarily disable ApiRequest logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278093 (owner: 10BryanDavis) [22:10:29] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Disable ApiRequest kafka logging (T108618) (duration: 00m 31s) [22:10:30] T108618: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618 [22:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:12:06] (03PS4) 10Tim Landscheidt: ores: Move role classes to module role [puppet] - 10https://gerrit.wikimedia.org/r/270102 [22:12:23] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 28s) [22:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:33] (03PS6) 10Tim Landscheidt: shinken: Only regenerate configuration when there are changes [puppet] - 10https://gerrit.wikimedia.org/r/267423 [22:14:27] !log ori@tin Synchronized php-1.27.0-wmf.17/includes/specials/SpecialUploadStash.php: Debug code for T130204 (duration: 00m 28s) [22:14:28] T130204: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204 [22:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:22] (03Abandoned) 10Andrew Bogott: Added labsconsole.wikimedia.org, the new horizon vhost [dns] - 10https://gerrit.wikimedia.org/r/277571 (owner: 10Andrew Bogott) [22:15:44] 6Operations, 10ArchCom-RfC, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2132428 (10cscott) Technical discussion: I think the actual blocking bug for T84723 is not this one -- the provided script work... 
[22:17:39] !log ori@tin Synchronized php-1.27.0-wmf.17/includes/specials/SpecialUploadStash.php: Revert: Debug code for T130204 (it worked!) (duration: 00m 25s) [22:17:40] T130204: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204 [22:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:07] 6Operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2132499 (10cscott) So, from my perspective, all that needs to be done here is to have the OCG service check a per-machine status flag (either... [22:20:31] (03PS4) 10Andrew Bogott: Add makedomain tool, for creation of domains in designate. [puppet] - 10https://gerrit.wikimedia.org/r/277456 [22:20:54] 6Operations, 10ArchCom-RfC, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2132510 (10cscott) p:5Unbreak!>3High [22:22:19] (03PS5) 10Andrew Bogott: Add makedomain tool, for creation of domains in designate. [puppet] - 10https://gerrit.wikimedia.org/r/277456 [22:23:39] ori: are you done on tin? 
[22:23:44] yep [22:23:51] colio [22:23:55] *coolio [22:24:11] (03PS1) 10BryanDavis: Revert "Temporarily disable ApiRequest logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278177 [22:24:23] (03CR) 10BryanDavis: [C: 032] Revert "Temporarily disable ApiRequest logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278177 (owner: 10BryanDavis) [22:25:16] * bd808 waits patiently for zuul [22:25:37] seems like I should change my nick to gozar while doing that [22:30:50] (03Merged) 10jenkins-bot: Revert "Temporarily disable ApiRequest logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278177 (owner: 10BryanDavis) [22:32:23] (03PS1) 10BryanDavis: Disable ApiRequest properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278180 (https://phabricator.wikimedia.org/T108618) [22:33:01] (03CR) 10BryanDavis: [C: 032] Disable ApiRequest properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278180 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [22:33:36] (03CR) 10Andrew Bogott: [C: 032] Add makedomain tool, for creation of domains in designate. 
[puppet] - 10https://gerrit.wikimedia.org/r/277456 (owner: 10Andrew Bogott) [22:36:58] !log restarting elasticsearch server elastic1020.eqiad.wmnet [22:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:36] (03Merged) 10jenkins-bot: Disable ApiRequest properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278180 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [22:42:18] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Disable ApiRequest properly (T108618) (duration: 00m 27s) [22:42:19] T108618: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618 [22:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:46:23] * bd808 is still waiting on zuul for backport to land [22:47:48] !log resetting cluster.routing.allocation.disk.watermark.high to 90% on eqiad elasticsearch cluster - shards have moved round, cluster is mostly balanced [22:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:52] 6Operations, 6Discovery, 7Elasticsearch: Icinga should alert on free disk space < 15% - https://phabricator.wikimedia.org/T130329#2132816 (10Gehel) [22:52:54] !log bd808@tin Synchronized php-1.27.0-wmf.17/includes/api/ApiMain.php: Cast API timeSpentBackend to an int (duration: 00m 25s) [22:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:53:27] seems like I should change my nick to gozar while doing that <--- :> (and gozer*) Also, my landlord pulled out The Bag of Keys yesterday, and I made a Keymaster reference, but it just left him baffled. [22:55:14] quiddity: old people! [22:55:23] and kids these days too! 
[22:56:04] 6Operations, 10Traffic: Images not showing up at Commons - https://phabricator.wikimedia.org/T128961#2132866 (10Platonides) I assume it was https://upload.wikimedia.org:443/foo, instead of requesting https in port 80, but other than that, such request is not only acceptable but **required** from a http 1.1 ser... [22:56:05] lol (yeah, he's probably my age, mid-30s. Just not a fan of pop-culture.) [22:58:02] 6Operations, 7Availability, 5MW-1.27-release-notes, 13Patch-For-Review, and 4 others: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#2132871 (10aaron) p:5Normal>3High [22:59:29] 6Operations, 6Performance-Team, 6Release-Engineering-Team, 7Availability, and 2 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2132873 (10aaron) p:5Normal>3High [23:00:04] RoanKattouw ostriches Krenair MaxSem awight: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160317T2300). Please do the needful. [23:00:04] Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] Hi. [23:03:45] does anyone want to handle swat? [23:03:51] or suppose i can [23:04:31] * aude looks at the patches [23:06:01] ewww, translate extension [23:06:11] otherwise I'd do it :p [23:06:19] hi. i just added a last-minute one. [23:06:36] Krenair: excepted the table creation, what's tricky with Translate? [23:06:53] Krenair: if you want to [23:07:01] i think translate requires some tables [23:07:08] Yup. [23:07:17] It's not the table creation I'd be worried about, it's the end resulting syntax used in wikitext [23:08:15] https://github.com/wikimedia/mediawiki-extensions-Translate/tree/master/sql is slightly confusing [23:08:32] is there not one single schema file? 
[23:09:07] ugh, I had forgotten [23:09:15] I made mistakes running those before [23:10:04] time to look back through my bash history [23:10:09] think i'm unsure enough about translate that i shouldn't do this one [23:10:47] Last Translate deployment, I think Reedy added the table beforehand. [23:10:49] wikidata has https://phabricator.wikimedia.org/P2789 (6 tables, afaik for translate) [23:11:07] Dereckson: tables should always be added in advance of enabling [23:11:20] Oh [23:11:26] aude: I wrote a script for this [23:11:28] I did it on pre-reimage tin [23:11:44] oh yes [23:11:47] Reedy: oh [23:11:49] it's in createExtensionTables.php [23:11:51] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/createExtensionTables.php [23:11:51] okay [23:12:07] mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=foobarwiki Translate [23:12:30] wikilove also needs a table [23:12:41] for https://gerrit.wikimedia.org/r/#/c/276934/ [23:12:44] wikilove is in that script too [23:12:53] :o [23:13:03] Reedy is the script master [23:13:03] ah [23:13:11] Reedy saves the day [23:13:12] too many hacks to keep in one head [23:13:18] Because fuck doing this manually every time :) [23:13:25] #lazydevops [23:14:12] 180 active users and only like 6 voted? [23:14:24] There are still 180 active users on species? 
[23:14:43] I know, I was surprised too [23:15:06] baring in mind that the 'active' bar here is an edit in the last 30 days [23:16:19] https://species.wikimedia.org/wiki/Wikispecies:Village_Pump/Archive_20 "A proposal for a new page layout" has more or less the same number of participants [23:16:32] oh non, with the comments section more [23:17:12] https://species.wikimedia.org/wiki/Wikispecies:Village_Pump/Archive_30#AN_and_RfC_are_live_plus_a_request in 2015 [23:18:03] And https://species.wikimedia.org/wiki/Wikispecies:Village_Pump/Archive_31#Species_of_the_Month five [23:18:27] Seems this is the normal amount of participants of a Wikispecies discussion so. [23:18:38] okay [23:18:52] (03PS2) 10Alex Monk: Enable Translate extension on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278054 (https://phabricator.wikimedia.org/T129888) (owner: 10Dereckson) [23:19:00] (03CR) 10Alex Monk: [C: 032] Enable Translate extension on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278054 (https://phabricator.wikimedia.org/T129888) (owner: 10Dereckson) [23:19:30] (03Merged) 10jenkins-bot: Enable Translate extension on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278054 (https://phabricator.wikimedia.org/T129888) (owner: 10Dereckson) [23:19:32] krenair@tin:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php specieswiki translate [23:19:32] Creating translate tables...done! [23:19:32] krenair@tin:~$ [23:21:16] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278054/ (duration: 00m 47s) [23:21:19] Dereckson, ^ [23:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:02] Species seems still alive. [23:22:22] That was my observation as well [23:23:34] And the rights are well in Special:ListGroupRights, so looks good to me. 
[23:23:52] (03PS2) 10Alex Monk: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278032 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:23:56] (03CR) 10Alex Monk: [C: 032] Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278032 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:24:31] (03Merged) 10jenkins-bot: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278032 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:25:30] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278032/ (duration: 00m 32s) [23:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:48] Testing. [23:26:00] (03PS1) 10Dzahn: puppet-lint: fix or disable remaining alignment warns [puppet] - 10https://gerrit.wikimedia.org/r/278195 [23:26:24] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: fix or disable remaining alignment warns [puppet] - 10https://gerrit.wikimedia.org/r/278195 (owner: 10Dzahn) [23:26:47] Works. [23:27:57] er... [23:29:32] Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [23:29:44] Bad database. [23:30:05] I'm preparing a fix. [23:30:09] what? 
[23:30:14] oho [23:30:25] oh* [23:30:30] yes, I missed that [23:31:11] (03PS8) 10Dzahn: move microsite roles into a common place [puppet] - 10https://gerrit.wikimedia.org/r/275034 [23:31:30] (03PS4) 10Thcipriani: Pass deploy user from service::node [puppet] - 10https://gerrit.wikimedia.org/r/277423 [23:31:38] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/2090/ the only diff is the role name in the motd" [puppet] - 10https://gerrit.wikimedia.org/r/275034 (owner: 10Dzahn) [23:32:54] !leroy [23:33:31] (03PS1) 10Dereckson: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278198 (https://phabricator.wikimedia.org/T129768) [23:33:46] (03PS2) 10Alex Monk: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278198 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:33:48] Dereckson, could be worse [23:33:51] (03PS3) 10Dereckson: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278198 (https://phabricator.wikimedia.org/T129768) [23:34:02] (edit conflict) [23:34:09] I gave a wiki the logo of another project once, like 3-4 years ago [23:36:00] (03CR) 10Alex Monk: [C: 032] Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278198 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:36:18] (03CR) 10Dzahn: "confirmed noop on bromine and rutherfordium (people, annual, transparency, annual, releases)" [puppet] - 10https://gerrit.wikimedia.org/r/275034 (owner: 10Dzahn) [23:36:50] (03Merged) 10jenkins-bot: Set विक्सनरी as alias for NS_PROJECT on ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278198 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:37:34] I gave hywikisource a wikiquote logo, for a couple of days in July 2012 [23:38:00] add randomizer script for April 1st 
[23:39:13] well.. need a better prank [23:39:42] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278198/ (duration: 00m 37s) [23:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:11] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [23:40:30] .. on tin? looking [23:41:15] MatmaRex, around? [23:41:15] oh, it's my fault, will fix [23:41:23] Krenair: yeah [23:42:18] 278198 tested [23:42:43] !log krenair@tin Synchronized php-1.27.0-wmf.17/includes/specials/SpecialUploadStash.php: https://gerrit.wikimedia.org/r/#/c/278190/ (duration: 00m 27s) [23:42:45] MatmaRex, ^ [23:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:01] MatmaRex: you work on UploadStash? [23:43:19] Dereckson: no [23:43:22] Dereckson: only when i have to [23:43:46] (03CR) 10Dereckson: "Follow-up: I8642071873f63182a4a86a887e1a8d745cddd72a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278032 (https://phabricator.wikimedia.org/T129768) (owner: 10Dereckson) [23:43:48] Krenair: yay, UploadWizard has thumbnails back. thanks [23:43:50] so yes then? :p [23:43:53] np [23:44:50] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia, 13Patch-For-Review: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2129457 (10Tgr) Could be the same issue as T90599? [23:44:54] (03CR) 10Alex Monk: [C: 032] Namespaces shorcuts aliases on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278029 (https://phabricator.wikimedia.org/T127591) (owner: 10Dereckson) [23:45:04] 6Operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2132943 (10matmarex) 5Open>3Resolved a:3matmarex Fixed and deployed. 
[23:45:41] (03Merged) 10jenkins-bot: Namespaces shorcuts aliases on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278029 (https://phabricator.wikimedia.org/T127591) (owner: 10Dereckson) [23:46:05] MatmaRex: we had a server upload request of a 2 Gb file, a member of my local hackerspace tried this, each time, the file successfully were in the stash, but then no way to publish it, it were stuck at 60% after 40 minutes. Any idea if it's a known issue or if a bug should be filled? [23:46:40] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278029/ (duration: 00m 28s) [23:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:55] Dereckson: a bug should always be filed [23:47:13] 'ОУ' => NS_USER_TALK, [23:47:14] heh [23:47:22] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [23:47:46] Dereckson: it's not really my area though, try AaronSchulz. [23:48:36] we can disable M, it's a redirect to meta [23:49:16] (03PS2) 10Alex Monk: Add rollbacker group to kk.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278021 (https://phabricator.wikimedia.org/T130215) (owner: 10Dereckson) [23:49:24] Dereckson, ready for the next one? [23:49:26] Krenair: short for Обсуждение участника [23:50:02] (stillg testing ru.wikibooks, but yes you can pass to the next one) [23:51:23] (03CR) 10Alex Monk: [C: 032] Add rollbacker group to kk.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278021 (https://phabricator.wikimedia.org/T130215) (owner: 10Dereckson) [23:51:28] Okay, tested, 278029 works fine (excepted M which is not useful, as m: is for meta.) [23:51:46] can you upload a commit to get rid of the m? 
[23:51:48] (03Merged) 10jenkins-bot: Add rollbacker group to kk.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278021 (https://phabricator.wikimedia.org/T130215) (owner: 10Dereckson) [23:52:14] (03PS1) 10Dzahn: releases: fix role name for upload from deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/278200 [23:53:09] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278021/ (duration: 00m 26s) [23:53:12] (03CR) 10Dzahn: [C: 032] releases: fix role name for upload from deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/278200 (owner: 10Dzahn) [23:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:18] Testing. [23:54:13] Krenair: 278021 tested, works. [23:55:09] (03PS3) 10Alex Monk: Revert "(bug 45233) Groups permissions on pt.wikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276917 (https://phabricator.wikimedia.org/T129487) (owner: 10Dereckson) [23:55:16] (03CR) 10Alex Monk: [C: 032] Revert "(bug 45233) Groups permissions on pt.wikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276917 (https://phabricator.wikimedia.org/T129487) (owner: 10Dereckson) [23:55:47] (03Merged) 10jenkins-bot: Revert "(bug 45233) Groups permissions on pt.wikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276917 (https://phabricator.wikimedia.org/T129487) (owner: 10Dereckson) [23:56:48] (03PS1) 10Dzahn: releases/upload: fix typo in class name [puppet] - 10https://gerrit.wikimedia.org/r/278202 [23:57:14] (03PS1) 10Dereckson: Removed extraneous namespace shorcut alias on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278206 [23:57:30] (03CR) 10Dzahn: [C: 032] releases/upload: fix typo in class name [puppet] - 10https://gerrit.wikimedia.org/r/278202 (owner: 10Dzahn) [23:57:42] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/276917 (duration: 00m 26s) 
[23:57:46] (03PS2) 10Dereckson: Remove extraneous namespace shorcut alias on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278206 [23:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:02] (03PS3) 10Alex Monk: Remove extraneous namespace shorcut alias on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278206 (owner: 10Dereckson) [23:58:08] (03CR) 10Alex Monk: [C: 032] Remove extraneous namespace shorcut alias on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278206 (owner: 10Dereckson) [23:58:16] Krenair: 276917 tested, and works