[08:11:53] serviceops, TechCom-RFC (TechCom-Approved): RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Joe) @Aklapper the RfC has been edited to reflect what's on phabricator at https://www.mediawiki.org/wiki/Requests_for_comment/Standards_for_exter...
[08:13:30] <_joe_> fsero, akosiaris the envoy maintainer reached out, he's got wind we're thinking of using envoy and he wants to chat
[08:13:49] <_joe_> I'll write them an email today, do you want to be in cc:?
[08:14:50] Yep!
[08:17:51] _joe_: sure
[08:48:14] hello everybody, as FYI I just merged the change to set the mcrouter async replication behavior as default for all appservers/apis
[08:50:42] serviceops, DBA, Operations: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (akosiaris)
[08:50:59] serviceops, DBA, Operations: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (akosiaris) p:Triage→Normal
[08:52:05] thank you luca!
[08:52:20] serviceops, Analytics, EventBus, Patch-For-Review: Allow eventgate-analytics service to reach schema.svc.{eqiad,codfw}.wmnet:8190 - https://phabricator.wikimedia.org/T229051 (fsero) Open→Resolved merged and applied
[09:09:00] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (Marostegui)
[10:06:05] <_joe_> anyone is doing $things with the appservers?
[10:06:25] <_joe_> jijiki or elukey? I need to disable puppet to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520502/
[10:07:03] I am ok, I don't know if elukey is still deploying his patch
[10:07:59] I am ok, I didn't force any puppet run but all appservers should have the new config by now
[10:08:36] (started one hour and 20 mins ago, just checked)
[10:08:38] <_joe_> ok cool
[10:09:36] _joe_ wow the patch looks awesome
[10:09:41] wasn't aware of it \o/
[10:10:10] <_joe_> elukey: it would be much better if our version of mtail supported histograms correctly :/
[10:10:35] :(
[10:17:57] <_joe_> uhm I just hit a bump in the road
[10:18:10] <_joe_> mtail can't read the log file as it's owned by root:adm
[10:18:20] _joe_: I had the same problem
[10:18:30] <_joe_> so I shall add the mtail user to group adm I guess
[10:18:32] I think I added the mtail user
[10:18:34] to the group
[10:18:35] <_joe_> other ideas?
[10:18:48] that is what we have done elsewhere, so let's be consistent
[10:18:54] mm acls?
[10:19:00] setfacl?
[10:19:21] <_joe_> I don't think that's supported by puppet atm
[10:19:30] <_joe_> and logrotate
[10:19:57] _joe_: you can check thumbor's mtail
[10:20:06] if we want to keep the same pattern regarding that
[10:20:13] mmm not sure about puppet but logrotate does support that as it would not notice acls
[10:20:15] :P
[10:20:24] if you come up with something different, maybe we should change the others as well
[10:20:36] <_joe_> fsero: not even when it recreates the file?
[10:21:19] if you set an acl on the directory stating the default owners and modes should be ok
[10:21:38] lemme look into one example
[10:21:39] <_joe_> anyways, where is that defined jijiki?
[10:21:47] let me find the patches
[10:22:20] <_joe_> fsero: I am aware of how setfacl works, our logrotate now has create 640 root adm
[10:22:49] https://gerrit.wikimedia.org/r/#/c/504335/
[10:23:10] <_joe_> oh there is mtail::group
[10:23:18] <_joe_> I didn't remember
[10:23:48] <_joe_> well "group" is 'root' in theory
[10:23:55] <_joe_> but it appears that's not really applied
[10:23:59] mtail's?
[10:24:06] <_joe_> yeah
[10:24:13] I am not sure that is true
[10:24:15] <_joe_> class mtail by default
[10:24:18] let's check
[10:24:22] <_joe_> sets $group = 'root'
[10:25:37] <_joe_> [Service]
[10:25:38] <_joe_> User=mtail
[10:25:40] <_joe_> Group=root
[10:25:42] <_joe_> EnvironmentFile=-/etc/default/mtail
[10:25:44] <_joe_> this is in the systemd unit
[10:27:30] <_joe_> so I'
[10:27:41] <_joe_> m not sure how changing that to "adm" would fix things.
[10:27:42] _joe_: in any case you are right regarding puppet https://tickets.puppetlabs.com/browse/MODULES-962 not supported so
[10:29:07] <_joe_> I mean it works, now I got to understand why "root" didn't :P
[10:29:56] <_joe_> apache_http_requests_duration_seconds{bucket="0.25",code="200",handler="proxy:unix:/run/php/fpm-www.sock|fcgi://localhost",method="GET",prog="apache2-mediawiki.mtail"} 18
[10:30:00] <_joe_> nice
[10:36:51] serviceops, TechCom-RFC (TechCom-Approved): RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (daniel) >>! In T208524#5372064, @Joe wrote: > I didn't add it to the development policy page though. @daniel I'll create a separate page and link...
[10:39:56] <_joe_> I really hope mtail doesn't blow up under the load of the appservers
[10:40:46] <_joe_> I will leave puppet disabled on the appservers for now
[10:40:54] <_joe_> just install it on a few
[10:41:02] <_joe_> and see if it blows up or anything
[10:58:09] <_joe_> gotta love htop reporting of golang programs
[10:58:14] <_joe_> 1000 threads :D
[12:31:18] <_joe_> TIL we sometimes respond with 205 to requests to MediaWiki
[12:32:21] <_joe_> Special:RecentChanges is responsible for it
[12:37:44] <_joe_> I'm activating puppet on all appservers
[13:47:58] serviceops, Analytics, EventBus: Allow eventgate-analytics service to reach schema.svc.{eqiad,codfw}.wmnet:8190 - https://phabricator.wikimedia.org/T229051 (Ottomata) Thank you!
[14:52:26] _joe_: do you want help creating charts in grafana from those metrics?
[14:52:36] This is the kind of task I should do right now
[14:54:38] <_joe_> fsero: yeah I'm experimenting a bit
[14:58:38] <_joe_> fsero: https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver
[14:59:05] <_joe_> fsero: the most interesting things would be to see the quantiles per handler
[14:59:14] <_joe_> and well the error rate
[16:19:05] serviceops, Scap, PHP 7.2 support, Patch-For-Review, and 3 others: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (greg)
[16:21:46] serviceops, Continuous-Integration-Config, Epic, Patch-For-Review, and 2 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[18:22:37] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Dzahn) server is still alerting https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=thumbor1004&service=Memory+correctable+errors+-EDAC- needs...
[19:01:52] serviceops, Release-Engineering-Team-TODO, Release-Engineering-Team (CI & Testing services): Rebuild integration/config images based on jessie - https://phabricator.wikimedia.org/T219748 (Jdforrester-WMF) Full audit and jjb job bumps: * docker-registry.wikimedia.org/releng/composer-php56 - None. *...
[20:53:41] serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201908): Rebuild integration/config images based on jessie - https://phabricator.wikimedia.org/T219748 (Jdforrester-WMF) a:Jdforrester-WMF
[21:20:26] serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201908): Rebuild integration/config images based on jessie - https://phabricator.wikimedia.org/T219748 (Jdforrester-WMF) Stalled→Resolved Oy.
[21:20:29] serviceops, Release Pipeline, docker-pkg, Patch-For-Review: Remove backports from wikimedia-jessie - https://phabricator.wikimedia.org/T219580 (Jdforrester-WMF)
[22:11:47] serviceops, Reading-Infrastructure-Team-Backlog: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (jijiki)
[22:12:06] serviceops, Reading-Infrastructure-Team-Backlog: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (jijiki) p:Triage→Normal
[23:09:48] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (WDoranWMF)