[07:07:52] serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[07:44:18] kindly requesting a comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/512742 (see my comment there, PS1 and PS3 are 2 different implementations). I'm worried that PS1 is too noisy for the services team
[07:47:51] just because you said "kindly"
[07:47:53] :D
[07:48:42] lol, but there is no comment, just a review :-P
[07:49:01] I was hoping you could suggest mob.rovac what's best for them ;)
[07:52:31] we can do trial and error
[07:55:53] yeah but now I'm confused, did you +1 PS3 or PS1?
[07:56:21] given that mar.ko's comment is asking for PS1 :)
[08:02:24] PS3
[08:09:45] I am looking a little further
[08:12:09] having the alerts for just the endpoints might be "too late"
[08:12:29] for example, the other day i found out restbase-dev1006 has a broken disk only once i tried to deploy restbase there
[08:12:50] had we had the other alerts, that would have been caught earlier probably
[08:15:27] the issue here is if you are going to be getting more alerts than you should
[08:15:48] which can be annoying
[08:16:00] sending you more alerts costs nothing :p
[08:16:27] full list at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=restbase1007 for example mobrovac
[08:17:26] i'm aware of all of the alerts that can happen
[08:17:40] and most of them are never noise
[08:18:11] alright, let's go with PS1 then
[08:18:11] there could be a couple i'd perhaps not be interested in, but that's a minor price to pay imho
[08:18:13] and see how it goes
[08:18:14] yay
[08:18:15] thnx
[08:18:30] we are always happy to have more people responding to pages :p
[08:18:36] lol
[08:18:42] so, we should be thanking you
[08:18:49] and possibly buy you lot some beers :p
[08:19:16] i'll take a good gin, thnx
[08:19:45] someone is feeling english
[08:20:00] I'll ping you at gin o'clock
[08:33:52] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (Volans) Restarted gerrit because it was stuck and showed the same behaviour of the above graph: https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?panelId...
[08:39:13] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (Volans) As suggested by @dcausse let's try to capture a jstack next time it happens: `sudo -u gerrit2 jstack $(pidof java)`
[08:44:38] serviceops, Operations, Performance-Team: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (jijiki)
[08:49:48] PS1 restored, compiler link in the comment
[08:49:53] ready to be merged I guess
[08:55:30] mobrovac: merged, within 1h you'll get all of them, enjoy! :-P
[08:55:44] grazie volans!
[08:55:49] prego!
[09:02:22] we can break restbase and see if notifications work properly
[09:02:24] * jijiki runs
[09:11:08] serviceops, Operations, Performance-Team: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (jijiki) p:Triage→Normal
[09:18:13] serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[09:24:22] serviceops, Operations, Performance-Team: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) Couple of notes: * We'd need to write a meaningful runbook to instruct people what metrics to check (mcrouter, redis, etc.) * Refactor https://grafana.wikimedia...
[09:24:40] serviceops, Operations, Performance-Team, User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey)
[09:46:08] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (hashar) We have plenty of stacktraces already and the one in this task description matches :-] The sendEmail thread is on hold waiting for some lock which is a...
[10:38:32] serviceops, Operations, User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (elukey) Some clarification about: > From the [[ https://grafana.wikimedia.org/d/000000316/memcache?panelId=21&fullscreen&orgId=...
[10:50:32] <_joe_> so good news. One php7 server had an opcache corruption (I think) which autocured with the next refresh after 60 seconds
[11:13:07] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (Paladox) This looks to be the same as what someone else had https://bugs.chromium.org/p/gerrit/issues/detail?id=7645
[13:15:36] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (Eevans)
[13:17:03] serviceops, Operations, User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (elukey)
[13:29:40] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (dcausse) Here SendEmail-1 is not "blocked" it's just waiting for jobs, however the dump says: `Locked ownable synchronizers:- <0x00000001c13617f8> (a java.util....
[13:56:11] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (thcipriani) >>! In T224448#5216469, @dcausse wrote: > Here SendEmail-1 is not "blocked" it's just waiting for jobs, however the dump says: > `Locked ownable syn...
[14:13:26] please add goal updates & other team updates to the weekly meeting notes
[14:13:31] ASAP :)
[14:34:16] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (dcausse) >>! In T224448#5216526, @thcipriani wrote: >>>! In T224448#5216469, @dcausse wrote: >> Here SendEmail-1 is not "blocked" it's just waiting for jobs, ho...
[14:47:22] serviceops, Operations, User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Jdforrester-WMF)
[14:49:54] serviceops, Gerrit, Release-Engineering-Team: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 (thcipriani) >>! In T224448#5216643, @dcausse wrote: >>>! In T224448#5216526, @thcipriani wrote: >>>>! In T224448#5216469, @dcausse wrote: >>> Here SendEmail-1 i...
[14:53:05] serviceops, Gerrit, Operations, Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Cmjohnson)
[16:16:49] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (jijiki)
[16:43:14] serviceops, Release-Engineering-Team, Continuous-Integration-Config, Epic: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Krinkle) > Production branch pruning needs a commit to delete. Can you elaborate? The link to branch prunin...
[16:44:48] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle)
[16:46:56] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle) I'm going to assume for now that T224493 is the same issue, because it too only...
[16:49:25] serviceops, Release-Engineering-Team, Continuous-Integration-Config, Epic: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[16:50:50] serviceops, Release-Engineering-Team, Continuous-Integration-Config, Epic: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF) >>! In T223602#5217787, @Krinkle wrote: >> Production branch pruning needs a commit to dele...
[17:00:14] serviceops, Release-Engineering-Team, Continuous-Integration-Config, Epic: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Krinkle) >>! In T223602#5217833, @Jdforrester-WMF wrote: >>>! In T223602#5217787, @Krinkle wrote: >>> Produc...
[17:39:06] serviceops, Release-Engineering-Team, Continuous-Integration-Config, Epic: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[17:54:33] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle) ##### See also * – someone exper...
[18:19:09] serviceops, Operations, User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (elukey) Interesting data that might support what Joe thinks (namely that HHVM for some reason uses more gets than get): ` tcpdu...
[19:54:51] serviceops, Operations, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (kchapman)
[19:57:42] serviceops, Operations, observability, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (Dzahn)
[20:04:34] serviceops, Operations, HHVM, Performance-Team (Radar), User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (kchapman)
[20:46:18] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle)
[20:46:52] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle) (Added another exception from another server, also around 2,000 events, for a d...
[20:48:54] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle)
[21:14:23] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle)
[21:14:57] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle) (Added two more different errors that are assumed to be corruptions. This time...
[21:19:15] serviceops, Operations, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle)