[04:20:10] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10Joe) >>! In T224857#5231820, @thcipriani wrote: > What affect have the `opcode_invalidate` calls for specific files via sy...
[05:27:22] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10greg) >>! In T224857#5232237, @Joe wrote: > Just to be clear, I don't think the following things we do today are advisable...
[06:18:13] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10mmodell) >>! In T224857#5229836, @Joe wrote: >>>! In T224857#5229810, @ArielGlenn wrote: > >> Before we start redefining...
[06:39:02] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10jijiki) Even though I do agree with most of the things everyone has addressed here, our current problem remains the same:...
[06:39:57] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10Joe) >>! In T224857#5232382, @mmodell wrote: >>>! In T224857#5229836, @Joe wrote: >>>>! In T224857#5229810, @ArielGlenn wr...
[06:57:50] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10greg) We're tangenting. :)
[10:36:48] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10mobrovac)
[10:43:14] jijiki: with the R200 memcached profile patch merged, are the mw/eqiad reboots good to go or is there something else pending?
[10:43:40] moritzm: yes, a clean patch I am merging now
[10:43:44] and enabling puppet
[10:43:52] so I will need 15' tops
[10:43:55] is that fine?
[10:43:57] and <3!
[10:45:05] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10mobrovac) >>! In T220401#5230212, @akosiaris wrote: > > [...] In fact, some numbers I 've heard (I have no actual pro...
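For context, the "R200 memcached profile patch" mentioned at 10:43 is the change tracked in T208844 below: raising memcached's -R limit, which caps how many requests a single connection may issue per event-loop iteration before the daemon yields to other connections. A minimal sketch of what the resulting invocation amounts to; everything besides -R 200 (user, port, -m cache size) is an illustrative placeholder rather than the production configuration:

  # -R bounds requests served per connection per event (stock default: 20).
  # Raising it to 200 lets the chatty mw object-cache connections batch
  # more work per wakeup before yielding to other clients.
  memcached -d -u memcache -p 11211 -m 8192 -R 200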
[10:45:26] jijiki: ack :-)
[11:01:12] <_joe_> I think jijiki was referring to mc* restarts for that patch, but I might be wrong
[11:02:00] I am enabling puppet, moritz will start restarting them slowly and propagate the change
[11:02:24] <_joe_> jijiki: mc vs mw
[11:02:38] oh dear you are right
[11:02:48] moritzm: so my change is for the mc* hosts
[11:03:00] although Luca and I have one more patch to push
[11:03:12] that we would like to push before the mw* restarts
[11:03:33] nono, this is about the mw/eqiad reboots, mc is for later
[11:04:15] ok so I will summon elukey to join us after lunch
[11:04:58] it is that one
[11:05:00] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468363/
[11:07:08] ack
[11:12:00] moritzm: sorry we threw a big boulder in front of you, we'll remove it after lunch
[11:43:56] here I am
[11:44:33] the patch just got a -1 :P
[11:45:38] anyway, that patch will cause mcrouter to restart, so coupling it with the reboots would be ideal but not mandatory. I'd like to test on canaries first for a couple of days, then roll out. Not sure what the schedule is for the mw1* reboots though, don't want to block
[11:45:45] moritzm: --^
[11:47:07] mw1* reboots were up for today actually
[11:48:10] elukey: oh right, I remembered we said yesterday to check if it restarts mcrouter
[11:48:17] and then I doodled graphs :p
[11:51:28] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki)
[11:51:38] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) 05Open→03Resolved
[11:52:37] moritzm: I think that we can manage to roll-restart mcrouter; proper testing would probably delay the reboots to after the offsite, so not worth waiting. jijiki what do you think?
[11:53:28] I think it is doable
[11:53:52] depooling, restarting, pooling takes about 15' and it will not affect much I think
[11:54:29] elukey: we can do the canaries at least
[11:54:51] but yes maybe I was overreacting
[11:55:13] * jijiki lunch
[12:01:34] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) This task will be completed after the next round of reboots for the mc* hosts (as FYI fo...
[12:02:45] elukey: in an interview, will get back to you later
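The depool / restart / pool cycle that jijiki estimates at about 15' above, sketched per host. The conftool-backed depool and pool wrappers do exist on WMF appservers; the mcrouter unit name and the drain pauses are assumptions, so read this as an illustration of the flow rather than the exact procedure used:

  #!/bin/bash
  # Per-host rolling-restart sketch: take the appserver out of rotation,
  # restart mcrouter to pick up the new config, then repool it.
  set -e
  depool                                # remove this host from the serving pools
  sleep 30                              # let in-flight requests drain (assumed pause)
  systemctl restart mcrouter            # unit name assumed
  systemctl is-active --quiet mcrouter  # bail out (set -e) if it did not come back
  pool                                  # put the host back in rotation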
[12:56:12] jijiki, elukey: so TLDR is to start the mw1 reboots now and not wait for the patch?
[13:11:20] jijiki: yeah +1
[13:13:19] ok I'll start the reboots at ~13:30 UTC unless I hear any pushback
[13:24:56] <_joe_> jijiki, elukey I'm looking at https://grafana.wikimedia.org/d/GuHySj3mz/php7-transition?refresh=30s&panelId=5&fullscreen&orgId=1&from=1559623228273&to=1559654569450 and I kinda think we need to add a daily restart to php-fpm while I'm not done with my script
[13:25:10] <_joe_> every descent is a deploy, basically
[13:25:27] ouch :(
[13:25:30] <_joe_> so in the long run, we might well run out of opcache
[13:25:41] <_joe_> also
[13:25:46] <_joe_> looking at opcache stats
[13:26:33] <_joe_> php7adm /opcache-info | jq .memory_usage.current_wasted_percentage
[13:26:41] <_joe_> it's at 6.2 percent now
[13:26:50] <_joe_> in theory, when it reaches 10%, it should restart
[13:27:07] <_joe_> will such a restart bring back the corruptions?
[13:27:24] <_joe_> now, I'll have to ask you two to look into this
[13:27:35] <_joe_> I really don't have time to follow this thread right now
[13:57:43] <_joe_> akosiaris, fsero blubberoid is down in eqiad, can someone take a look? not in a hurry, but still
[13:58:01] sure let me take a look
[13:58:08] how did you notice, joe?
[13:58:17] <_joe_> icinga, #-operations
[13:58:41] <_joe_> it just recovered
[13:58:48] _joe_: will you relax?
[13:58:50] it's blubberoid
[13:58:53] call jijiki
[13:58:54] :P
[13:58:56] <_joe_> ahahah
[13:59:08] <_joe_> no I meant, it's interesting to see what is going on I think
[13:59:10] seriously, I am handling it
[13:59:18] blubberoid pod was created 4 m ago
[13:59:21] <_joe_> given I don't expect blubberoid to have many callers
[13:59:23] and caused it as well
[13:59:25] I guess akosiaris is updating the workers to 1.12
[13:59:26] <_joe_> ahah
[13:59:28] <_joe_> ok
[13:59:32] right?
[13:59:46] in any case, to prevent it, it wouldn't harm anybody to add more blubber on top of it
[13:59:52] that means add another replica
[13:59:59] fsero: no, I messed up calico configuration and am fixing it now. It involves however a restart of all pods
[14:00:14] mmm now I'm intrigued
[14:02:49] <_joe_> akosiaris: should we depool all services in eqiad?
[14:05:28] no, it's not that bad
[14:06:18] bgp is a pretty decent protocol at these things
[14:08:25] the thing that was lost was the aggregated /26 per node. But everything kept on going due to the /32s being announced fine
[14:08:48] that and the fact we don't have much intraservice communication
[14:15:43] <_joe_> wait until restbase is there
[14:40:37] _joe_: we will have a look with Luca
[14:40:44] do we have a task attached to it?
[14:41:49] <_joe_> I guess you can attach a comment to https://phabricator.wikimedia.org/T224491
[14:44:02] ok good, tx !
[14:44:41] 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10jijiki)
[14:45:19] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10thcipriani) >>! In T224857#5232389, @jijiki wrote: > If there are not any better short-term solutions/ideas, depooling-dep...
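The two opcache numbers watched in this conversation (the wasted percentage here at 13:26, free memory again later in the afternoon), gathered into one local check. php7adm /opcache-info is the helper Joe shows above; the field names follow PHP's standard opcache_get_status() structure, which is an assumption about what the endpoint returns:

  # One-shot view of opcache health on the local appserver.
  php7adm /opcache-info | jq '{
    wasted_pct: .memory_usage.current_wasted_percentage,
    free_mb: (.memory_usage.free_memory / 1048576 | floor)
  }'

PHP triggers the automatic opcache restart Joe refers to when the cache fills up and current_wasted_percentage exceeds opcache.max_wasted_percentage (10 per the chat; the stock default is 5), and that restart is exactly the event he worries could resurface the T224491 corruption.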
[14:45:26] I got 2 mw servers generating database connection errors
[14:45:51] that would be relatively normal, if it wasn't that it was only 2 app servers, and none of the others
[14:46:18] <_joe_> I see alerts on kubernetes in codfw akosiaris
[14:46:28] <_joe_> all services?
[14:46:29] _joe_: same thing
[14:46:48] https://logstash.wikimedia.org/goto/908c79c679c998531dc0a8b0eba031a7 if someone is interested
[14:46:51] paged
[14:46:55] <_joe_> jynus: I am but not now
[14:47:14] should recover pretty quickly
[14:47:18] and should not have paged
[14:47:35] <_joe_> if it is paging (still didn't get it) it means something went wrong?
[14:47:37] _joe_: more than fair
[14:48:15] btw, we should have this service only page jijiki
[14:50:46] "Lost connection to MySQL server at 'reading authorization packet', system error: 11" <--- first time I see that error
[14:51:01] ah, I saw that before, sorry
[14:51:08] I misread it
[14:51:12] interesting dashboard there (bookmarked)
[14:51:26] apergos: the mysql one?
[14:51:27] * jijiki creates a new label on gmail
[14:51:38] uh huh, jy nus
[14:51:50] There is the better https://logstash.wikimedia.org/app/kibana#/dashboard/87348b60-90dd-11e8-8687-73968bebd217
[14:52:12] but for connections the previous one is more friendly
[14:53:06] (nothing worrying right now, so please don't spend time unless you can afford it)
[14:53:29] huh these deadlocks are interesting
[14:53:47] nah, just taking a quick look
[14:53:56] yeah, it is reported, Amir promised a deploy soon
[14:54:30] the background noise, especially from the job queue or api, is not very high, but relatively constant
[14:55:07] gotcha
[15:24:46] Anyone fancy merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511145/ for me?
[15:25:02] Then I can remove it as a cherry-pick on beta (basically makes this config the same as it is for the prod file)
[15:29:24] https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/manifests/web/prod_sites.pp#L211-L215
[15:34:32] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10akosiari...
[15:34:38] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) 05Open→03Resolved a:03akosiaris And LVS done today. ` akosiaris@deploy1001:~$ curl -i https://sess...
[15:36:54] * apergos whistles innocently
[15:38:42] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5233899, @akosiaris wrote: > > And LVS done today. > > ` > akosiaris@deploy1001:~$ curl -i h...
[15:39:03] :)
[15:42:30] akosiaris: where do session storage production settings live?
[15:42:47] akosiaris: aka, the stuff that is templated into config.yaml
[15:44:32] urandom: deploy1001:/srv/scap-helm/sessionstore
[15:45:10] it's the same format as https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/values.yaml fwiw, just overriding values
[15:47:03] akosiaris: is this the canonical location?
[15:47:15] it's not in git somewhere?
[15:47:19] urandom: it will be
[15:47:45] we are almost there, putting the final pieces into place
[15:48:05] akosiaris: OK, I was asking in reference to T224995
[15:48:24] which asks that a comment be added for `default_ttl`
[15:48:30] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) Doesn't that sort of duplicate #jade?
[15:49:35] akosiaris: I guess the repo will be initialized with what is in these files?
[15:49:50] urandom: then I guess that's the place to add it for now. Aside from passwords, we will be adding all of this into the deployment-charts/ repo
[15:50:07] gotcha
[15:52:06] urandom: I'm working on that ATM, just need to stitch things together
[15:52:24] kk
[15:52:27] thanks!
[15:52:50] <_joe_> jijiki: can you keep an eye on the opcache free memory?
[15:53:10] <_joe_> if it goes below 40 MB, we need to do a rolling restart of php-fpm
[15:53:39] sure
[15:54:38] <_joe_> both api and appservers ofc, mw1348 is at 80 free MBs
[15:54:48] <_joe_> we need an alert on this
[15:55:40] I will see what to do
[16:11:33] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10Legoktm) Is whatever restart/depool solution that was proposed above smart enough to only do so when a *.php file is touch...
[16:24:48] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) Hmm, I guess they're aimed in a similar way at editor ratification of others' judgments (in the case of JADE,...
[16:41:40] 10serviceops, 10Release-Engineering-Team, 10Scap, 10PHP 7.2 support, 10User-jijiki: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10jijiki) >>! In T224857#5233723, @thcipriani wrote: >>>! In T224857#5232389, @jijiki wrote: >> If there are not any better...
[16:56:53] FYI this is the effect of nutcracker's conns to memcached dropping (after rebooting mw1* hosts) - https://grafana.wikimedia.org/d/000000316/memcache?panelId=40&fullscreen&orgId=1
[17:04:13] <_joe_> also that's resetting the opcache, which is good
[17:05:12] _joe_: I was thinking
[17:05:49] does it make sense to have a cronjob every X and, if opcache free is less than 60,
[17:06:01] depool and restart
[17:06:16] and to avoid swat and trains
[17:06:46] we can actually check at, say, 4 UTC, or between 4-10 UTC
[17:07:17] <_joe_> jijiki: there is a whole system in place to do that across a timespan for a cluster
[17:07:35] <_joe_> there is a use in the profile::mediawiki::webserver (cron_splay)
[17:07:46] ok I will take a look
[17:07:47] tx
[17:07:55] <_joe_> and a script called hhvm-needs-restart doing pretty much what you said for hhvm on the api cluster
[17:08:12] <_joe_> it's a bit sad to revert to doing such things, but meh :P
[17:08:18] btw I will restart mw1348
[17:08:28] before signing off
[17:08:44] <_joe_> it should be restarted as in rebooted by moritzm AIUI
[17:08:50] well, desperate times
[17:09:05] well, it will take 2' so
[17:09:30] <_joe_> tbh I'd be interested in seeing what happens if the opcache is full
[17:10:01] ok, then it is up to you :)
[17:10:25] <_joe_> let's reconvene tomorrow? I need to get off the keyboard
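A minimal sketch of the conditional restart discussed just above, in the spirit of the hhvm-needs-restart script Joe mentions; in practice it would be driven through cron_splay so hosts restart staggered inside the 4-10 UTC window jijiki suggests, away from SWAT and the train. The 60 MB threshold comes from the chat; the php7.2-fpm unit name and the depool/pool wrappers are assumptions:

  #!/bin/bash
  # php-fpm-needs-restart sketch (hypothetical name): depool and restart
  # php-fpm only when opcache free memory has dropped below a threshold.
  set -e
  threshold_mb=60
  free_mb=$(php7adm /opcache-info | jq '.memory_usage.free_memory / 1048576 | floor')
  if [ "$free_mb" -lt "$threshold_mb" ]; then
      depool
      sleep 30                      # drain in-flight requests (assumed pause)
      systemctl restart php7.2-fpm  # unit name assumed; fresh workers, empty opcache
      pool
  fi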
[17:11:54] yea, me too
[17:43:47] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) > it looks like JADE is focused specifically on revisions, and I wouldn't expect the machine vision-derived labels t...
[17:45:37] 10serviceops, 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) >>! In T224917#5234081, @Mholloway wrote: > We'd still need to interface with the third-party machine vision provide...
[17:59:03] jijiki, _joe_ I have accidentally rebooted one of the mcrouter proxy boxes - mw1286. I know this causes problems in codfw but not sure what they were. just know that box was on my list of *do not reboot* and I just rebooted :S
[18:00:01] <_joe_> ok so, you can safely reboot the eqiad proxies as that's the active datacenter
[18:00:09] <_joe_> proxies are important in the inactive one
[18:00:22] <_joe_> and we need to be able to pool/depool them easily btw
[18:00:37] ahh ok, so they can be rebooted as normal in eqiad
[18:00:44] <_joe_> we should just loadbalance them
[18:00:46] <_joe_> yes
[18:01:31] ok great, thanks
[18:33:28] hi all, I rebooted mw1227 and when it came back up hhvm was using between 1000-2500% cpu. I have now depooled the server and it is still using ~110% cpu. anyone able to provide insight into what it may be?
[18:33:58] it is slowly going down now
[19:12:21] sorry for the ping in -ops. those machines have all calmed down now. is this behaviour expected after a reboot? I guess they all have to recompile their opcache, so it makes sense if they start up with higher than normal cpu, I guess
[20:04:41] 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10pmiazga) @Niedzielski noticed another, pretty similar issue: {T225018}. System fails with `PHP Fatal error...
[20:55:28] for HHVM the bytecode cache is on disk and persists across reboots, but we sometimes have HHVM load spikes on single API servers. there should be a task, but I didn't find it in a quick search; they usually recover in 10-15 min
[20:56:37] 10serviceops, 10Continuous-Integration-Config, 10Epic, 10Release-Engineering-Team (Kanban): Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10Jdforrester-WMF) p:05Triage→03Normal
[21:17:53] on the server in question there were some errors like the following in the hhvm logs, so it could have just been a coincidence, thanks
[21:17:57] error: entire web request took longer than 200 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.7/extensions/Scribunto/includes/engines/LuaSandbox/Engine.php on line 282
[21:19:37] also thanks for the clarification re persistent cache
[21:20:26] * jbond42 feels like he wants the super technical deep dive version of akosi.aris and jij.iki presentation
[21:47:46] jbond42: you just need to remember that it's hacks and special cases all the way down. There is no bottom. :)
[21:49:02] jbond42: etc: there are complaints of officewiki server errors from -staff, I haven't repro'd (not much officewiki use today) but just in case it's related to opcache issues
[22:16:50] bd808: lmao, turtles all the way down
[22:17:24] greg-g: do you have more info [although not sure related to my work]
[22:27:29] jbond42: no, sorry :(
[22:31:29] 10serviceops, 10Continuous-Integration-Config, 10Epic, 10Release-Engineering-Team (Kanban): Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10Ladsgroup) Here's my two cents: - I have done a similar thing with ores, in couple of months we ma...