[10:59:58] serviceops, Operations, ops-eqiad: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (jijiki) @Jclark-ctr if you feel this will work better, we are happy. Either way, this racking is still better than the original one (30 servers in D 5). Thank you!
[14:09:26] _joe_: o/
[14:10:03] _joe_: cergen also allows not providing a password
[14:10:06] if you don't provide a password
[14:10:07] _joe_: seems like codfw k8s cluster is now up and runnings
[14:10:12] the key will be output unencrypted
[14:10:12] running*
[14:10:24] effie: ^
[14:10:29] perhaps we should just do that? instead of outputting an unencrypted key when a pw is provided?
[14:10:44] I 'll monitor for the next couple of hours, but up to now LGTM
[14:11:06] I haven't yet sent traffic though to it
[14:13:20] akosiaris: no servers were harmed in the process?
[14:13:47] of course and they were harmed
[14:14:08] but they survived and came back stronger
[14:14:12] and upgraded!
[14:14:12] <_joe_> ottomata: uhhhh ok :P
[14:16:12] ok cool, i only remembered that while implementing the change, that makes a bit more sense to me. let's try that when making the next tls service
[14:16:17] i'll update your instructions
[15:25:20] serviceops, Operations: kubestagetcd1003 alerts daily via email to root@ for 'unexpected non snapshot file' - https://phabricator.wikimedia.org/T240932 (akosiaris) Open→Declined that host is to be removed pretty soon. The staging and codfw clusters have been migrated to etcd3 and different set of...
[15:27:11] serviceops, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (akosiaris) staging and codfw have successfully been migrated to etcd3. calico is still on the v2 protocol (however on the same set of hosts), we need...
[15:32:54] serviceops, Analytics, Analytics-Kanban, Operations: cergen should output unencrypted key file for use with envoyproxy kubernetes sidecars - https://phabricator.wikimedia.org/T240990 (Ottomata) Open→Declined > Instead of modifying cergen to always output an unencrypted key file, could we...
[15:34:29] (akosiaris: FYI uncommitted change in /srv/deployment-charts on deploy1001: in helmfile.d/admin/codfw/cluster-helmfile.sh)
[15:37:03] ottomata: ah indeed, thanks. mine from the codfw k8s cluster rebootstrapping from today
[15:37:04] removed
[15:49:41] hm akosiaris maybe i missed somewhere that needs a run-puppet-agent?
[15:49:52] i merged the puppet patch to change to port 4392
[15:50:35] ottomata: yeah puppet agent runs and pybal restart. Lemme do that
[15:50:40] ahhh
[15:50:41] k
[15:54:57] serviceops, Operations: kubestagetcd1003 alerts daily via email to root@ for 'unexpected non snapshot file' - https://phabricator.wikimedia.org/T240932 (elukey) @akosiaris I'd just remove `0000000000001131-0000000016c32fdb.snap.broken` then to avoid daily cronspam if possible, otherwise no problem :)
[15:58:17] akosiaris: yes I know "what is the deal with these guys and cronspam", be patient --^ :D
[15:58:55] checking those emails is a constant nerd snipe
[15:59:05] self inflicted :D
[16:00:06] elukey: I 'll do something better
[16:00:10] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/559113 and delete that VM
[16:00:39] akosiaris: thankssss
[16:06:14] ottomata: done
[16:07:07] ottomata: sorry for taking over that one, it can be a delicate dance at times, e.g. at this instance it required manually deleting the old LVS allocations using ipvsadm
[16:07:17] i appreciate it!
[16:07:36] just thought from your comment on the puppet patch that all that was needed was service deploy and puppet change
[16:07:41] otherwise i would have coordinated with you first
[16:07:47] sorry about that
[16:09:58] nice it works thanks akosiaris !
[16:11:07] ottomata: cool. thanks as well
[16:11:53] akosiaris: i'll be enabling TLS for the other two eventgate services today, and submit matches for LVS to switch to it. we can wait til whenever to merge the LVS ones though
[16:11:59] patches*
[16:13:16] ok
[16:22:25] serviceops, Operations: kubestagetcd1003 alerts daily via email to root@ for 'unexpected non snapshot file' - https://phabricator.wikimedia.org/T240932 (akosiaris) vm removed from the fleet. this should a problem no more :-)
[17:02:28] serviceops: Define the plan for the upgrade of kubernetes cluster to a security supported release - https://phabricator.wikimedia.org/T241076 (akosiaris)
[17:10:42] _joe_, can you take a quick look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/558737 ? if that sounds fine from your end, we'll get it swatted today.
[17:11:31] <_joe_> sorry I thought I reviewed it this morning
[17:11:44] <_joe_> I keep forgetting to submit a vote when I look at code
[17:59:18] one of the blubber pipeline gate-and-submit jobs failed with `Error: pods is forbidden: User "jenkins" cannot list resource "pods" in API group "" in the namespace "ci"` during `helm install` https://integration.wikimedia.org/ci/job/blubber-pipeline-rehearse/38/console
[17:59:40] yet `KUBECONFIG=/etc/kubernetes/ci-staging.config kubectl -n ci get pods` seems to work just fine from contint1001
[18:01:07] akosiaris, et al ^ not sure why this would have happened. was there some kind of change recently to the jenkins user access?
[18:22:14] serviceops, DBA, Operations, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[18:22:58] serviceops, DBA, Operations, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[18:33:36] well, the job is now working it seems, so i will just assume the failure was due to now-exorcised demons
[18:49:02] * James_F grins.
[19:19:53] <_joe_> marxarelli: I guess that happened while alex rebuilt the cluster
[19:20:26] <_joe_> 12:15 PM yesterday fits the timeline
[19:43:42] _joe_: ah, rgr. i wasn't sure the staging cluster was a part of that, but i see now it was
[19:44:20] <_joe_> so that build happened before grants on the cluster were restored
[20:52:50] serviceops, Operations, Release-Engineering-Team, Patch-For-Review, and 3 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle) I also see the UdpSoc...
[21:46:36] _joe_: If "we could run three different php-fpm pools", we could call them 'blue', 'green', … and 'staging'. ;-)
[21:47:36] <_joe_> or group0, group1 and group2
[21:47:42] <_joe_> because traditions
[21:47:53] Yes yes, but that wouldn't be grounds for trolling SRE.
[21:49:32] <_joe_> I was convinced preload was in 7.3 too
[21:50:30] Well, the RfC for it closed only on 2018-11-14: https://wiki.php.net/rfc/preload
[21:50:58] But yeah, I was surprised it was new, too.
[21:51:14] <_joe_> I definitely knew we don't have it right now
[21:51:44] <_joe_> I am interested in understanding if it's even possible to just ship the opcode preload and not the php code
[21:51:50] <_joe_> that would be sweet
[21:52:13] With a big one-off build step to migrate into opcode?
[21:52:32] <_joe_> run but - shocker - CI
[21:52:34] <_joe_> yes
[21:52:37] <_joe_> *by
[21:52:42] Yeah, that'd be interesting.
[21:53:09] <_joe_> btw I'm not sure the comments about classes are correct, if this works in any ways like "normal" opcache
[21:53:20] <_joe_> "normal" opcache works by code paths
[21:53:24] <_joe_> not by classes
[21:53:45] <_joe_> so I'm not sure the conclusion brian and florian came to is correct
[21:53:53] Now we're mostly enforcing one file per class, though.
[21:54:09] And if we need to for performance reasons, we can entirely enforce that.
[21:54:17] <_joe_> no that doesn't matter
[21:54:37] <_joe_> the problem they think there is is that class A is defined both in wmf.X and wmf.Y
[21:54:46] <_joe_> but opcache sees those as different code paths
[21:54:58] Oh, right, yeah, as long as the wmf.X bit is in the path.
[21:55:12] <_joe_> so yeah I'm not sure I understand their point
[21:55:16] (Rather than using the /php/ symlink. No idea if anything in practice still uses that.)
[21:55:29] It'd be worth finding out, it sounds like.
[21:55:30] <_joe_> the symlinks get resolved anyways
[21:55:39] <_joe_> yeah if only I had time
[21:55:54] * James_F grumbles about priorities and resourcing.
[21:56:14] <_joe_> I've already overflowed my vast amount of backlog over rlazarus and effie
[21:56:28] <_joe_> I have enough for a couple more people before we can get to this
[21:56:30] <_joe_> :P
[23:38:52] James_F: I did some testing a long time ago of HHVM's byte code only stuff -- https://static-bugzilla.wikimedia.org/bug67168.html -- https://github.com/bd808/bug-67168 -- there may or may not be anything interesting there if you decide to try similar things with PHP 7