[00:17:21] serviceops, Operations, Release-Engineering-Team, Patch-For-Review, and 3 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle) @jijiki Can we depool mwdebug1002 again so that it can...
[11:20:05] <_joe_> hey serviceops, since we're all in EU timezones, should we have the meeting earlier today?
[11:20:19] we could
[11:20:28] mark: akosiaris
[11:20:33] <_joe_> mark: akosiaris apergos
[11:20:51] ohhhh
[11:20:51] <_joe_> unless mutant.e is back today?
[11:20:56] yes please
[11:21:05] I think he is still on vacation
[11:21:13] he is yeah
[11:21:17] cool
[11:22:08] if we get a +1 back frm mark then I'm in
[11:22:47] my first thought was still "oh no, will we be able to find an available meeting room"
[11:22:53] proposed time?
[11:23:41] his calendar says he's free in 40 min so we might not get an answer before that
[11:23:42] <_joe_> 15:00Z ?
[11:24:07] <_joe_> everyone should take the time to think of goals
[11:24:10] no his calendar shows booked then
[11:24:35] 2-3 or 4-30 pm utc seem to be open
[11:24:54] er 4 to 4:30
[11:31:00] serviceops, Operations, Patch-For-Review, Puppet, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (jbond) Have been trying to document some of the [[ https://wikitech.wikimedia.org/wiki/User:Jbond/Encryption#Conf_to...
[11:38:25] serviceops, Operations, observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (jijiki)
[11:38:26] serviceops, Operations, observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (jijiki)
[12:03:48] hey
[12:04:00] hey mark
[12:04:02] yeah moving is a good idea
[12:04:05] did you guys figure out a time?
[12:04:09] what's a good time for you??
[12:04:16] whenever it's currently open :)
[12:05:00] hey _joe_ rlazarus effie akosiaris can you do in half an hour? i.e. 12.30 utc?
[12:05:02] which is in the next 3 hours pretty much
[12:05:10] 👍
[12:05:14] +1
[12:05:54] <_joe_> ok
[12:05:56] i'll move, we'll see if we get declines
[12:06:00] <_joe_> we'll have lunch later I guess
[12:06:18] <_joe_> I brought cannoli to ease the starvation
[12:06:22] :-)
[12:06:33] thankfully italians have a pasta room
[12:06:57] <_joe_> I still didn't show it to reuven
[12:07:08] I've been missing an entire pasta room???
[12:07:10] he hasn't reached that stage yet
[12:07:21] I thought we were friends :(
[12:07:26] you haven't even showed me your pasta room :( :( :(
[12:08:36] ah mark (wow I could not even tab complete that, had to type every letter), is the offsite location settled yet?
[12:08:48] i don't know
[12:08:53] joel would know
[12:09:03] europe
[12:09:07] ok, I'll check tonight, not at all urgent
[12:09:09] is pretty certain i think
[13:53:41] effie: ok to call the mc gutter pool mc-gp100[1-3] and mc-gp200[1-3]? IIRC this was your proposal, that seems ok for Joe as well. Just wanted to confirm so I'll add to the tasks (the codfw rack/setup/deploy is already created)
[13:54:24] oh that is great news !
[13:55:01] I think that naming makes sense, if you agree as well and joe agrees
[13:55:07] we are good to go
[13:55:22] and moritzm agrees
[14:01:09] ack!
[14:21:32] serviceops, Operations, Traffic: Appservers behind TLS should support chunked Transfer-Encoding - https://phabricator.wikimedia.org/T240576 (ema)
[14:46:14] serviceops, Operations, Traffic: Appservers behind TLS should support chunked Transfer-Encoding - https://phabricator.wikimedia.org/T240576 (Joe) While it should be easy to swap nginx for envoy, we need to also convert `profile::services_proxy` to use envoy at the same time. It should not be impossi...
[15:04:16] serviceops, Operations, observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (akosiaris) p:Triage→High I am gonna triage as high because there is the fear that we are currently losin...
[15:11:28] serviceops, Operations, observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (Ottomata) This error message might be a false positive. I think librdkafka spits it out (is rsyslog using librd...
[15:55:39] serviceops: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (akosiaris)
[15:55:43] serviceops, Operations, Kubernetes: Migrate Kubernetes etcd clusters to Stretch/Buster - https://phabricator.wikimedia.org/T224574 (akosiaris)
[15:55:45] serviceops, Operations, Kubernetes: Migrate etcd networking cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224577 (akosiaris)
[16:39:11] serviceops, Operations, observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (fgiunchedi) rsyslog does indeed use librdkafka so it might be that! re: losing logs AFAICT that's not happening...
[16:39:28] serviceops, Operations, Release-Engineering-Team, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (jijiki) First of all, we found that this fatal error is...
[16:54:43] serviceops, Operations, Release-Engineering-Team, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Joe) It appears to me that we try to send something on...
[17:12:24] serviceops, Product-Infrastructure-Team-Backlog, Core Platform Team Workboards (Clinic Duty Team): PCS internal request rates tripled on 2019-11-19 - https://phabricator.wikimedia.org/T238832 (Mholloway) Traffic fell to a more manageable level again (though still higher than the level prior to the ju...
[17:35:59] serviceops, Operations, Traffic: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (ema)
[17:36:35] serviceops, Product-Infrastructure-Team-Backlog, Core Platform Team Workboards (Clinic Duty Team): PCS internal request rates tripled on 2019-11-19 - https://phabricator.wikimedia.org/T238832 (cscott) Probably related to {T239983} and the general API server unhappiness in that time period. Seems to...
[17:37:45] serviceops, Proton, Product-Infrastructure-Team-Backlog (Kanban): Profile proton memory usage for Helm chart - https://phabricator.wikimedia.org/T238830 (MSantos)
[17:39:31] serviceops, Operations, Traffic: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (ema) This is a severe case of PEBKAC: `curl` uses HTTP/2 by default, that's why the response has no TE:chunked. Forcing curl to use HTTP/1.1 we can see that inde...
[17:39:37] serviceops, Operations, Traffic: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (ema) p:Triage→Normal
[17:40:30] serviceops, Product-Infrastructure-Team-Backlog, Core Platform Team Workboards (Clinic Duty Team): PCS internal request rates tripled on 2019-11-19 - https://phabricator.wikimedia.org/T238832 (cscott) PCS load is still higher than before Nov 19. Hopefully it will return to baseline when we turn off...
[18:27:01] serviceops, Operations: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (jcrespo) p:Triage→High
[18:40:54] serviceops, Operations: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (jcrespo) Question, is this fully implemented and only missing validation that nothing breaks (in which case I will reduce the priority), or is it still WIP? (just checking phabricato...
[19:20:12] serviceops, Operations: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (jijiki) Open→Resolved I think we can mark this as resolved, our solution seems to be working. If something breaks, we will open a new task or reopen this one. Thank you!