[04:20:10] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10Joe) >>! In T224857#5231820, @thcipriani wrote: > What affect have the `opcode_invalidate` calls for specific files via sy...
[05:27:22] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10greg) >>! In T224857#5232237, @Joe wrote: > Just to be clear, I don't think the following things we do today are advisable...
[06:18:13] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10mmodell) >>! In T224857#5229836, @Joe wrote: >>>! In T224857#5229810, @ArielGlenn wrote: > >> Before we start redefining...
[06:39:02] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10jijiki) Even though I do agree with most of the things everyone has addressed here, our current problem remains the same:...
[06:39:57] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10Joe) >>! In T224857#5232382, @mmodell wrote: >>>! In T224857#5229836, @Joe wrote: >>>>! In T224857#5229810, @ArielGlenn wr...
[06:57:50] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10greg) We're tangenting. :)
[10:36:48] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10mobrovac)
[10:43:14] jijiki: with the R200 memcached profile patch merged, are the mw/eqiad reboots good to go or is there something else pending?
[10:43:40] moritzm: yes, a clean patch I am merging now
[10:43:44] and enabling puppet
[10:43:52] so I will need 15' tops
[10:43:55] is that fine?
[10:43:57] and <3!
[10:45:05] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10mobrovac) >>! In T220401#5230212, @akosiaris wrote: > > [...] In fact, some numbers I 've heard (I have no actual pro...
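For context, the "R200 memcached profile patch" mentioned at 10:43 is the change tracked in T208844 below: raising memcached's -R limit, which caps how many requests a single connection may issue per event-loop iteration before the daemon yields to other connections. A minimal sketch of what the resulting invocation amounts to; everything besides -R 200 (user, port, -m cache size) is an illustrative placeholder rather than the production configuration:

  # -R bounds requests served per connection per event (stock default: 20).
  # Raising it to 200 lets the chatty mw object-cache connections batch
  # more work per wakeup before yielding to other clients.
  memcached -d -u memcache -p 11211 -m 8192 -R 200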
[10:45:26] jijiki: ack :-)
[11:01:12] <_joe_> I think jijiki was referring to mc* restarts for that patch, but I might be wrong
[11:02:00] I am enabling puppet, moritz will start restarting them slowly and propagate the change
[11:02:24] <_joe_> jijiki: mc vs mw
[11:02:38] oh dear you are right
[11:02:48] moritzm: so my change is for the mc* hosts
[11:03:00] although Luca and I have one more patch to push
[11:03:12] that we would like to push before the mw* restarts
[11:03:33] nono, this is about the mw/eqiad reboots, mc is for later
[11:04:15] ok so I will summon elukey to join us after lunch
[11:04:58] it is that one
[11:05:00] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468363/
[11:07:08] ack
[11:12:00] moritzm: sorry we threw a big boulder in front of you, we'll remove it after lunch
[11:43:56] here I am
[11:44:33] the patch just got a -1 :P
[11:45:38] anyway, that patch will cause mcrouter to restart, so coupling it with the reboots would be ideal but not mandatory. I'd like to test on canaries first for a couple of days, then roll out. Not sure what the schedule is for the mw1* reboots though, don't want to block
[11:45:45] moritzm: --^
[11:47:07] mw1* reboots were up for today actually
[11:48:10] elukey: oh right, I remembered we said yesterday to check if it restarts mcrouter
[11:48:17] and then I doodled graphs :p
[11:51:28] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki)
[11:51:38] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) 05Open→03Resolved
[11:52:37] moritzm: I think that we can manage to roll-restart mcrouter; proper testing would probably delay the reboots to after the offsite, so not worth waiting. jijiki what do you think?
[11:53:28] I think it is doable
[11:53:52] depooling, restarting, pooling takes about 15' and it will not affect much I think
[11:54:29] elukey: we can do the canaries at least
[11:54:51] but yes maybe I was overreacting
[11:55:13] * jijiki lunch
[12:01:34] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) This task will be completed after the next round of reboots for the mc* hosts (as FYI fo...
[12:02:45] elukey: in an interview, will get back to you later
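The depool / restart / pool cycle that jijiki estimates at about 15' above, sketched per host. The conftool-backed depool and pool wrappers do exist on WMF appservers; the mcrouter unit name and the drain pauses are assumptions, so read this as an illustration of the flow rather than the exact procedure used:

  #!/bin/bash
  # Per-host rolling-restart sketch: take the appserver out of rotation,
  # restart mcrouter to pick up the new config, then repool it.
  set -e
  depool                                # remove this host from the serving pools
  sleep 30                              # let in-flight requests drain (assumed pause)
  systemctl restart mcrouter            # unit name assumed
  systemctl is-active --quiet mcrouter  # bail out (set -e) if it did not come back
  pool                                  # put the host back in rotation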
[12:56:12] jijiki, elukey: so TLDR is to start the mw1 reboots now and not wait for the patch?
[13:11:20] jijiki: yeah +1
[13:13:19] ok I'll start the reboots at ~13:30 UTC unless I hear any pushback
[13:24:56] <_joe_> jijiki, elukey I'm looking at https://grafana.wikimedia.org/d/GuHySj3mz/php7-transition?refresh=30s&panelId=5&fullscreen&orgId=1&from=1559623228273&to=1559654569450 and I kinda think we need to add a daily restart to php-fpm while I'm not done with my script
[13:25:10] <_joe_> every descent is a deploy, basically
[13:25:27] ouch :(
[13:25:30] <_joe_> so in the long run, we might well run out of opcache
[13:25:41] <_joe_> also
[13:25:46] <_joe_> looking at opcache stats
[13:26:33] <_joe_> php7adm /opcache-info | jq .memory_usage.current_wasted_percentage
[13:26:41] <_joe_> it's at 6.2 percent now
[13:26:50] <_joe_> in theory, when it reaches 10%, it should restart
[13:27:07] <_joe_> will such a restart bring back the corruptions?
[13:27:24] <_joe_> now, I'll have to ask you two to look into this
[13:27:35] <_joe_> I really don't have time to follow this thread right now
[13:57:43] <_joe_> akosiaris, fsero blubberoid is down in eqiad, can someone take a look? not in a hurry, but still
[13:58:01] sure let me take a look
[13:58:08] how did you notice, joe?
[13:58:17] <_joe_> icinga, #-operations
[13:58:41] <_joe_> it just recovered
[13:58:48] _joe_: will you relax?
[13:58:50] it's blubberoid
[13:58:53] call jijiki
[13:58:54] :P
[13:58:56] <_joe_> ahahah
[13:59:08] <_joe_> no I meant, it's interesting to see what is going on I think
[13:59:10] seriously, I am handling it
[13:59:18] blubberoid pod was created 4 m ago
[13:59:21] <_joe_> given I don't expect blubberoid to have many callers
[13:59:23] and caused it as well
[13:59:25] I guess akosiaris is updating the workers to 1.12
[13:59:26] <_joe_> ahah
[13:59:28] <_joe_> ok
[13:59:32] right?
[13:59:46] in any case, to prevent it, it wouldn't harm anybody to add more blubber on top of it
[13:59:52] that means add another replica
[13:59:59] fsero: no, I messed up calico configuration and am fixing it now. It involves however a restart of all pods
[14:00:14] mmm now I'm intrigued
[14:02:49] <_joe_> akosiaris: should we depool all services in eqiad?
[14:05:28] no, it's not that bad
[14:06:18] bgp is a pretty decent protocol at these things
[14:08:25] the thing that was lost was the aggregated /26 per node. But everything kept on going due to the /32s being announced fine
[14:08:48] that and the fact we don't have much intraservice communication
[14:15:43] <_joe_> wait until restbase is there
[14:40:37] _joe_: we will have a look with Luca
[14:40:44] do we have a task attached to it?
[14:41:49] <_joe_> I guess you can attach a comment to https://phabricator.wikimedia.org/T224491
[14:44:02] ok good, tx !
[14:44:41] 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10jijiki)
[14:45:19] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10thcipriani) >>! In T224857#5232389, @jijiki wrote: > If there are not any better short-term solutions/ideas, depooling-dep...
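The two opcache numbers watched in this conversation (the wasted percentage here at 13:26, free memory again later in the afternoon), gathered into one local check. php7adm /opcache-info is the helper Joe shows above; the field names follow PHP's standard opcache_get_status() structure, which is an assumption about what the endpoint returns:

  # One-shot view of opcache health on the local appserver.
  php7adm /opcache-info | jq '{
    wasted_pct: .memory_usage.current_wasted_percentage,
    free_mb: (.memory_usage.free_memory / 1048576 | floor)
  }'

PHP triggers the automatic opcache restart Joe refers to when the cache fills up and current_wasted_percentage exceeds opcache.max_wasted_percentage (10 per the chat; the stock default is 5), and that restart is exactly the event he worries could resurface the T224491 corruption.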
[14:45:26] I got 2 mw servers generating database connection errors
[14:45:51] that would be relatively normal, if it wasn't that it was only 2 app servers, and none of the others
[14:46:18] <_joe_> I see alerts on kubernetes in codfw akosiaris
[14:46:28] <_joe_> all services?
[14:46:29] _joe_: same thing
[14:46:48] https://logstash.wikimedia.org/goto/908c79c679c998531dc0a8b0eba031a7 if someone is interested
[14:46:51] paged
[14:46:55] <_joe_> jynus: I am but not now
[14:47:14] should recover pretty quickly
[14:47:18] and should not have paged
[14:47:35] <_joe_> if it is paging (still didn't get it) it means something went wrong?
[14:47:37] _joe_: more than fair
[14:48:15] btw, we should have this service only page jijiki
[14:50:46] "Lost connection to MySQL server at 'reading authorization packet', system error: 11" <--- first time I see that error
[14:51:01] ah, I saw that before, sorry
[14:51:08] I misread it
[14:51:12] interesting dashboard there (bookmarked)
[14:51:26] apergos: the mysql one?
[14:51:27] * jijiki creates a new label on gmail
[14:51:38] uh huh, jy nus
[14:51:50] There is the better https://logstash.wikimedia.org/app/kibana#/dashboard/87348b60-90dd-11e8-8687-73968bebd217
[14:52:12] but for connections the previous one is more friendly
[14:53:06] (nothing worrying right now, so please don't spend time unless you can afford it)
[14:53:29] huh these deadlocks are interesting
[14:53:47] nah, just taking a quick look
[14:53:56] yeah, it is reported, Amir promised a deploy soon
[14:54:30] the background noise, especially from the job queue or api, is not very high, but relatively constant
[14:55:07] gotcha
[15:24:46] Anyone fancy merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511145/ for me?
[15:25:02] Then I can remove it as a cherry-pick on beta (basically makes this config the same as it is for the prod file)
[15:29:24] https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/manifests/web/prod_sites.pp#L211-L215
[15:34:32] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10akosiari...
[15:34:38] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) 05Open→03Resolved a:03akosiaris And LVS done today. ` akosiaris@deploy1001:~$ curl -i https://sess...
[15:36:54] * apergos whistles innocently
[15:38:42] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5233899, @akosiaris wrote: > > And LVS done today. > > ` > akosiaris@deploy1001:~$ curl -i h...
[15:39:03] :)
[15:42:30] akosiaris: where do session storage production settings live?
[15:42:47] akosiaris: aka, the stuff that is templated into config.yaml
[15:44:32] urandom: deploy1001:/srv/scap-helm/sessionstore
[15:45:10] it's the same format as https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/values.yaml fwiw, just overriding values
[15:47:03] akosiaris: is this the canonical location?
[15:47:15] it's not in git somewhere?
[15:47:19] urandom: it will be
[15:47:45] we are almost there, putting the final pieces into place
[15:48:05] akosiaris: OK, I was asking in reference to T224995
[15:48:24] which asks that a comment be added for `default_ttl`
[15:48:30] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) Doesn't that sort of duplicate #jade?
[15:49:35] akosiaris: I guess the repo will be initialized with what is in these files?
[15:49:50] urandom: then I guess that's the place to add it for now. Aside from passwords, we will be adding all of this into the deployment-charts/ repo
[15:50:07] gotcha
[15:52:06] urandom: I'm working on that ATM, just need to stitch things together
[15:52:24] kk
[15:52:27] thanks!
[15:52:50] <_joe_> jijiki: can you keep an eye on the opcache free memory?
[15:53:10] <_joe_> if it goes below 40 MB, we need to do a rolling restart of php-fpm
[15:53:39] sure
[15:54:38] <_joe_> both api and appservers ofc, mw1348 is at 80 free MBs
[15:54:48] <_joe_> we need an alert on this
[15:55:40] I will see what to do
[16:11:33] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10Legoktm) Is whatever restart/depool solution that was proposed above smart enough to only do so when a *.php file is touch...
[16:24:48] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) Hmm, I guess they're aimed in a similar way at editor ratification of others' judgments (in the case of JADE,...
[16:41:40] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10jijiki) >>! In T224857#5233723, @thcipriani wrote: >>>! In T224857#5232389, @jijiki wrote: >> If there are not any better...
[16:56:53] FYI this is the effect of nutcracker's conns to memcached dropping (after rebooting mw1* hosts) - https://grafana.wikimedia.org/d/000000316/memcache?panelId=40&fullscreen&orgId=1
[17:04:13] <_joe_> also that's resetting the opcache, which is good
[17:05:12] _joe_: I was thinking
[17:05:49] does it make sense to have a cronjob every X and, if opcache free is less than 60,
[17:06:01] depool and restart
[17:06:16] and to avoid swat and trains
[17:06:46] we can actually check at, say, 4 UTC, or between 4-10 UTC
[17:07:17] <_joe_> jijiki: there is a whole system in place to do that across a timespan for a cluster
[17:07:35] <_joe_> there is a use in the profile::mediawiki::webserver (cron_splay)
[17:07:46] ok I will take a look
[17:07:47] tx
[17:07:55] <_joe_> and a script called hhvm-needs-restart doing pretty much what you said for hhvm on the api cluster
[17:08:12] <_joe_> it's a bit sad to revert to doing such things, but meh :P
[17:08:18] btw I will restart mw1348
[17:08:28] before signing off
[17:08:44] <_joe_> it should be restarted as in rebooted by moritzm AIUI
[17:08:50] well, desperate times
[17:09:05] well, it will take 2' so
[17:09:30] <_joe_> tbh I'd be interested in seeing what happens if the opcache is full
[17:10:01] ok, then it is up to you :)
[17:10:25] <_joe_> let's reconvene tomorrow? I need to get off the keyboard
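A minimal sketch of the conditional restart discussed just above, in the spirit of the hhvm-needs-restart script Joe mentions; in practice it would be driven through cron_splay so hosts restart staggered inside the 4-10 UTC window jijiki suggests, away from SWAT and the train. The 60 MB threshold comes from the chat; the php7.2-fpm unit name and the depool/pool wrappers are assumptions:

  #!/bin/bash
  # php-fpm-needs-restart sketch (hypothetical name): depool and restart
  # php-fpm only when opcache free memory has dropped below a threshold.
  set -e
  threshold_mb=60
  free_mb=$(php7adm /opcache-info | jq '.memory_usage.free_memory / 1048576 | floor')
  if [ "$free_mb" -lt "$threshold_mb" ]; then
      depool
      sleep 30                      # drain in-flight requests (assumed pause)
      systemctl restart php7.2-fpm  # unit name assumed; fresh workers, empty opcache
      pool
  fi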
[17:11:54] yea, me too
[17:43:47] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) > it looks like JADE is focused specifically on revisions, and I wouldn't expect the machine vision-derived labels t...
[17:45:37] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) >>! In T224917#5234081, @Mholloway wrote: > We'd still need to interface with the third-party machine vision provide...
[17:59:03] jijiki, _joe_ I have accidentally rebooted one of the mcrouter proxy boxes - mw1286. I know this causes problems in codfw but not sure what they were. just know that box was on my list of *do not reboot* and I just rebooted :S
[18:00:01] <_joe_> ok so, you can safely reboot the eqiad proxies as that's the active datacenter
[18:00:09] <_joe_> proxies are important in the inactive one
[18:00:22] <_joe_> and we need to be able to pool/depool them easily btw
[18:00:37] ahh ok, so they can be rebooted as normal in eqiad
[18:00:44] <_joe_> we should just loadbalance them
[18:00:46] <_joe_> yes
[18:01:31] ok great, thanks
[18:33:28] hi all, I rebooted mw1227 and when it came back up hhvm was using between 1000-2500% cpu. I have now depooled the server and it is still using ~110% cpu. anyone able to provide insight into what it may be?
[18:33:58] it is slowly going down now
[19:12:21] sorry for the ping in -ops. those machines have all calmed down now. is this behaviour expected after a reboot? I guess they all have to recompile their opcache, so it makes sense if they start up with higher than normal cpu, I guess
[20:04:41] 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10pmiazga) @Niedzielski noticed another, pretty similar issue: {T225018}. System fails with `PHP Fatal error...
[20:55:28] for HHVM the bytecode cache is on disk and persists across reboots, but we sometimes have HHVM load spikes on single API servers. there should be a task, but I didn't find it in a quick search; they usually recover in 10-15 min
[20:56:37] 10serviceops, 10Continuous-Integration-Config, 10Epic, 10Release-Engineering-Team (Kanban): Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10Jdforrester-WMF) p:05Triage→03Normal
[21:17:53] on the server in question there were some errors like the following in the hhvm logs, so it could have just been a coincidence, thanks
[21:17:57] error: entire web request took longer than 200 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.7/extensions/Scribunto/includes/engines/LuaSandbox/Engine.php on line 282
[21:19:37] also thanks for the clarification re persistent cache
[21:20:26] * jbond42 feels like he wants the super technical deep dive version of akosi.aris and jij.iki presentation
[21:47:46] jbond42: you just need to remember that it's hacks and special cases all the way down. There is no bottom. :)
[21:49:02] jbond42: etc: there are complaints of officewiki server errors from -staff, I haven't repro'd (not much officewiki use today) but just in case it's related to opcache issues
[22:16:50] bd808: lmao, turtles all the way down
[22:17:24] greg-g: do you have more info [although not sure related to my work]
[22:27:29] jbond42: no, sorry :(
[22:31:29] 10serviceops, 10Continuous-Integration-Config, 10Epic, 10Release-Engineering-Team (Kanban): Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10Ladsgroup) Here's my two cents: - I have done a similar thing with ores, in couple of months we ma...