[03:34:43] serviceops, Mobile-Content-Service, Page Content Service, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway) >>! In T229286#5381330, @Mholloway wrote: > I see that the new `/page/ta...
[06:49:02] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[06:49:31] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[06:50:45] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[07:29:53] serviceops, Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (elukey) Added patch to the Debian bug in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763858#10
[07:34:34] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)), Patch-For-Review: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on cumin1001....
[07:36:02] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)), Patch-For-Review: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1348.eqiad.wmnet'] ` O...
[07:37:10] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)), Patch-For-Review: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on cumin1001....
[08:51:42] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)), Patch-For-Review: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1348.eqiad.wmnet'] ` a...
[09:46:43] <_joe_> a reimage without HHVM went flawlessly on mw1348
[09:46:52] Hi! if we want some of the requests from our service to use the varnish cache then api-ro.discovery.wmnet is the wrong place, right? Where should we be hitting instead?
[09:47:28] <_joe_> tarrow: why do you need the varnish cache in a service? It seems wrong
[09:48:06] <_joe_> tarrow: let me explain. Say you just edited a page, then you want to show the termbox. if it calls a cached api endpoint, chances are the cache has still not been purged, and you get stale content
[09:48:28] right
[09:48:47] <_joe_> since what you produce goes on to be used by MediaWiki for generating page content, it should not use any caching external to mediawiki
[09:49:01] <_joe_> unless you want to take care of synchronization issues :)
[09:49:05] cool
[09:50:24] _joe_: Thanks; So would you say the diagram at the top of this ticket should be changed? https://phabricator.wikimedia.org/T212189
[09:51:05] <_joe_> yeah that varnish shouldn't be there
[09:51:16] <_joe_> I must have missed it when looking at that diagram, sorry
[09:51:56] no problem!
Just trying to make sure we're following what we said we'd do
[09:52:52] <_joe_> right
[09:53:13] We started questioning this because we're seeing occasional timeouts between our service and api-ro and wondered if it was because we weren't caching when we're supposed to
[09:55:10] <_joe_> W
[09:55:29] <_joe_> Did you have a task about those timeouts?
[09:57:08] yep: https://phabricator.wikimedia.org/T229313
[10:08:07] <_joe_> so codfw only?
[10:10:10] well, I just saw 1 error from eqiad
[10:10:23] Maybe 50-100 total in the last 3 weeks from codfw
[10:10:53] usually clustered together like at 8pm yesterday
[10:11:50] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)), Patch-For-Review: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (Joe) Open→Resolved I tested reimaging one application server and it went f...
[10:11:55] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)), Patch-For-Review: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 (Joe)
[10:14:11] <_joe_> it might be due to some congestion on the link, or some other cross-dc issue
[10:14:43] <_joe_> 100 errors in 3 weeks is well below the limit where I start investigating though :)
[10:15:38] _joe_: right; it's just that right now the only traffic hitting it is the healthcheck service. We're wondering if that will spike once we have real traffic
[10:16:13] <_joe_> possibly, but when you have real traffic there, it will be because mw is active-active, so api-ro will point to codfw :)
[10:16:21] <_joe_> so no more cross-dc latencies etc.
[10:19:02] Cool! I'll put it to the back of my mind then :)
[10:21:05] Before we go live we're thinking of load testing our service; plan for that TBC. Any history of people doing that? Mostly totally simulated? Replaying historic traffic?
[13:32:47] serviceops, Operations, ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) a:Papaul→wiki_willy @jijiki I will be talking to @wiki_willy to see what our options are on this. @wiki_willy this system is out of warranty since April 2019 and we do have a proble...
[16:04:26] serviceops, Operations, ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (wiki_willy) @Papaul - if you can't find a spare from any of those decom servers, we can order it, since it's still a while before the 5yr mark. Thanks Willy
[16:19:59] are you folks aware of those termbox alerts that have been flapping?
[16:48:10] paravoid: hi, yes! We were just talking about them earlier. Ticket is: https://phabricator.wikimedia.org/T229313
[16:49:07] Seems like it's due to the cross DC request timing out
[20:55:16] added the appserver roles on scandium, generated mcrouter certs
[20:55:23] puppet works and installs all the things.. just... also hhvm
[22:05:59] _joe_: had a typo in Hiera key so hhvm got installed... then fixed it and puppet/manually removed hhvm again. so far so good.. the next thing is i think profile::parsoid conflicts with $has_lvs = false and it gets lvs::realserver
[22:06:49] <_joe_> uh
[22:07:11] <_joe_> then we need to fix that
[22:07:15] yep
[22:07:28] <_joe_> I can't believe that's the case, parsoid testing has no realserver
[22:08:00] i am looking though currently it changes /etc/default/wikimedia-lvs-realserver on each run
[22:08:23] <_joe_> mutante: ok, did you set has_lvs to false?
[22:08:37] yep, i did
[22:08:51] in
[22:08:51] hieradata/role/common/parsoid/testing.yaml
[22:09:13] and profile::parsoid has Boolean $has_lvs = hiera('has_lvs', true),
[22:11:01] <_joe_> ok
[22:11:06] <_joe_> I see what the problem is
[22:11:15] <_joe_> it should've had has_lvs set to false before
[22:11:28] in a separate change?
[22:11:34] <_joe_> my suggestion is remove the package wikimedia-lvs-realserver, with --purge
[22:11:42] <_joe_> and then run puppet again
[22:11:50] ok! will do, thanks
[22:12:00] <_joe_> no I mean scandium has no lvs, so it should've been set all along
[22:12:04] i did the same for hhvm, btw. with --purge
[22:12:23] oh, as in "was already installed before this change" ack
[22:14:46] puppet reinstalled it
[22:18:05] hieradata/role/common/parsoid.yaml also has profile::lvs::realserver::pools: parsoid
[22:18:13] i will keep looking
[22:48:46] profile::mediawiki::php::restarts -> require profile::lvs::realserver
[22:49:37] without a "has_lvs" around it ("# realserver gives us the pools the server is included in, plus the lvs configuration") so that's why it's not ONLY the configuration class
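
The last two messages point at an unguarded `require profile::lvs::realserver` inside `profile::mediawiki::php::restarts`. As a rough illustration only, a guard around it might look something like the sketch below. It assumes the class could take the same `has_lvs` Hiera key that `profile::parsoid` is quoted as using earlier in the log. The class body is hypothetical and is not the actual code from the puppet repository.

```puppet
# Hypothetical sketch only; the real class contains more than this.
class profile::mediawiki::php::restarts (
    # Same lookup that profile::parsoid is quoted as using in the log.
    Boolean $has_lvs = hiera('has_lvs', true),
) {
    if $has_lvs {
        # Only pull in the realserver profile (pools plus LVS configuration)
        # on hosts that actually sit behind LVS.
        require ::profile::lvs::realserver
    }
    # Restart logic would follow here. On hosts with has_lvs: false it would
    # have to cope with no pools being defined (assumption, not from the log).
}
```

Under these assumptions, `has_lvs: false` in hieradata/role/common/parsoid/testing.yaml would stop puppet from pulling `wikimedia-lvs-realserver` back onto scandium. Whether the restart logic can work without any pools is not settled in the log.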