[07:02:14] 10serviceops, 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, and 2 others: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10MoritzMuehlenhoff) >>! In T190568#5319370, @Dzahn wrote: > Next we need to make a decision whether we keep phab1003 as the p...
[08:23:53] 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 (10akosiaris) For what is worth there was 1 extra step (step 0 actually in the order) and it's `Remove disco...
[09:42:54] ottomata: urandom tarrow could you please read this page https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm we would very much like to kill scap-helm ASAP
[09:43:14] sure
[09:43:43] should it read August, not July?
[09:44:16] well, we deprecated it in July, we remove it in August
[09:44:17] :)
[09:44:28] we are giving one month for the change
[09:48:06] fsero: Reads fairly clearly to me. Maybe add a section about rolling back mistakes? Is it: make a revert commit and `helmfile apply`? How does that work with the wait for cron to get the "new" (reverted) commit?
[09:49:26] tarrow: fair enough. cron is enabled to pull changes every minute, so in the worst case it takes 1m and some seconds from merge to deployment before reverted changes are in prod
[09:49:46] cool, makes sense
[09:49:50] i could also add an extremely fast way to do it via helm rollback, which should be completely discouraged but is useful for emergencies
[09:50:18] because instead of 1m and some secs, via helm rollback it could be seconds
[09:50:43] however for most cases it should be fine to wait one minute and some seconds
[09:52:08] what we should improve is to add Pod Disruption Budgets to our deployments; that way, if X% of pods is failing, the current deploy will halt and the service will survive with some of the pods
[09:52:26] Cool, 1 min is probably fine. Could be worth putting that in the doc, because if it was, say, every 5 mins then I'd be more tempted as the deployer to work around it
[09:54:29] It's probably obvious (but not to me) how we now do the equivalent of `helm status` and `helm get values`
[10:01:58] there is a helmfile status
[10:02:05] which is equivalent to helm status
[10:03:02] and you don't need get values anymore, since they are in code in the values.yaml file :)
[10:03:32] or private/secrets.yaml if they are secrets that come from the private repo
[10:07:35] isn't values needed in the event you didn't immediately `apply` what is now in git? Or you want to check that you did apply them?
[10:07:48] I guess you could use `diff`?
[10:10:47] you should use diff
[10:11:05] it would render what's different between what is stored on disk and the state in the cluster
[10:11:14] and if the changes make sense to you, move on with apply
[10:12:37] Cool; I can pop my questions and answers into the draft if you like. Or did you already start?
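[editor's note] The rollback flow discussed above can be sketched as follows. Only `helmfile diff`, `helmfile apply`, `helm rollback`, and the roughly one-minute cron pull come from the conversation; the repository branch, commit ID, release name, and revision number below are hypothetical placeholders.

```shell
# Normal rollback path: revert in git, let cron pull it, then diff and apply.
git revert abc1234          # revert the bad change (hypothetical commit ID)
git push origin master      # once merged, cron pulls to the deploy host within ~1 minute
helmfile diff               # render what differs between the state on disk and the cluster
helmfile apply              # if the diff looks right, roll the cluster back

# Emergency-only path (discouraged above, but takes seconds rather than a
# minute-plus): roll back the release in place, bypassing git entirely.
helm rollback myservice 3   # "myservice" and revision 3 are placeholders
```

Note that the `helm rollback` shortcut leaves the cluster out of sync with git until the revert commit lands, which is exactly why it is discouraged outside emergencies.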
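[editor's note] A minimal sketch of the PodDisruptionBudget idea fsero raises above, assuming a hypothetical service name and threshold; this is not the actual WMF chart. Strictly speaking, a PDB guards against voluntary disruptions such as node drains, while a Deployment's own rollout pace is bounded by its `maxUnavailable` strategy setting; in combination they give the "keep some pods serving" behavior described.

```shell
# Illustrative only; names and the 80% threshold are assumptions.
# policy/v1beta1 was the current PDB API version at the time.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myservice-pdb
spec:
  minAvailable: 80%          # keep at least 80% of matching pods available
  selector:
    matchLabels:
      app: myservice
EOF
```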
[10:14:15] i did not start, and i'd appreciate it if you amend the page :)
[10:14:44] Right ho :)
[10:35:59] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helm packages deployment tool, at least for cluster applications. - https://phabricator.wikimedia.org/T212130 (10fsero) 05Open→03Resolved
[10:36:01] 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10fsero)
[10:36:40] 10serviceops, 10Patch-For-Review: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero) p:05Triage→03High
[10:36:52] 10serviceops, 10Patch-For-Review: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero) a:03fsero
[10:43:35] fsero: Is there a recommended way to `helmfile apply` to both eqiad and codfw at once?
[10:44:35] tarrow: no. obviously you can have two ssh sessions and do it at once if you like, but i would discourage a simultaneous deploy
[10:44:44] in the case of a bug or an issue
[10:45:03] if you do it all at once you are left with no good pods serving traffic
[10:45:19] meanwhile doing it in one DC first and then the other would allow us to depool one DC
[10:45:26] without affecting end users
[10:50:12] OK, so there are always pods in each DC serving traffic? I was (probably wrongly) under the impression that until switchover it was just eqiad being hit.
My main concern was ensuring we don't forget to `helmfile apply` to one DC
[11:03:18] Yeah, that's a risk; I'd say that if you forget, something should remind you :)
[11:03:35] Like an alert or phab tickets
[11:04:12] additionally, we may consider running a cron and notifying owners; the problem is how to do that
[12:29:20] 10serviceops, 10Patch-For-Review: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fgiunchedi) I ran an audit of all images that might suffer from this problem, assuming `latest` is the tag we're looking for (which is for the majority of im...
[13:53:14] fsero: awesome, this looks great!
[13:53:16] i think i understand
[13:54:51] you've taken the custom values files from deploy1001, committed them to helmfile.d, and then helmfile knows how to get all the values, namespace, service name, etc. when running commands, making things much easier to type and less error prone! (ya?)
[13:58:53] Yep
[13:58:59] That's the gist of it
[14:06:25] fsero: the timing is good, I guess, because I never applied to production the deployment I fought so hard on in staging the other day
[15:21:45] 10serviceops, 10Operations: Confd died on bast3002 - https://phabricator.wikimedia.org/T227592 (10Aklapper)
[17:34:54] 10serviceops, 10Operations, 10Release Pipeline, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo)
[18:39:23] we got new hardware assigned from dcops to replace gerrit server cobalt with gerrit1001; it needed more RAM and we will have to migrate
[18:39:59] Yay for more RAM.
[18:45:14] mutante: exciting!
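[editor's note] The one-DC-at-a-time deploy flow fsero recommends in the exchange above might look like the sketch below. The `-e eqiad` / `-e codfw` environment names are an assumption about how the helmfile.d environments are laid out; the substance is the ordering: diff and apply one datacenter, verify, then the other, so that one DC can always be depooled.

```shell
# Sketch only; environment names are assumptions.
helmfile -e eqiad diff     # review the pending change for the first DC
helmfile -e eqiad apply
# ... verify the service is healthy in eqiad (dashboards, probes) ...
helmfile -e codfw diff     # then, and only then, the second DC
helmfile -e codfw apply    # easy to forget -- nothing reminds you of this step yet
```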
[18:45:51] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=4&fullscreen&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&from=now-90d&to=now
[18:46:06] cdanis: it turns out that one has only 32GB
[18:46:10] hasn't been as bad lately (since heap size was reduced a bit), but I'm sure gerrit would be happy to use more RAM if we let it ;)
[18:46:18] talking to dcops already
[18:46:32] it will have to be another box than that one
[19:15:47] 10serviceops, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki)
[19:23:35] merged code to enable apache forensic logging on the phab server. the code was copied from what ori uploaded for the app servers
[19:23:55] so i can confirm it works.
[19:24:22] [phab1003:~] $ sudo tail -f /var/log/apache2/forensic.log
[19:24:47] but will probably turn it off again in Hiera if it grows too quickly
[19:25:03] either way, it's an improvement to be able to switch it on on demand
[19:36:33] it is something that seems potentially nice to deploy everywhere; being able to find long-running/stuck queries more easily is handy
[19:45:58] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg)
[19:47:54] cdanis: your comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/511751 might be helpful, as currently it is abandoned
[19:50:19] {{done}}
[19:54:38] thx
[19:59:25] mutante: it is not just the size of the log
[19:59:46] have we checked how much more disk io this is causing?
[20:00:05] iirc it is writing 3 lines for each request?
[20:01:18] IIRC, 3 lines total (including the normal logging) -- forensic logging adds one long line at the start of every request (unique ID plus a ton of request details), and then one short line at the end of every request (just the ID)
[20:03:37] but I wouldn't expect it to be a big I/O hit unless apache was doing a ton of fsync() or something silly
[20:04:52] jijiki: i turned it off again but left the Hiera key there to show it exists and can be enabled on demand for debugging
[20:06:18] cdanis: it would be interesting to measure the general overhead
[20:06:27] but I see your point, yes
[20:47:11] urandom: i saw the change for the restbase1017 a-b-c listen addresses.. wanted to confirm they have the new IP.. noticed restbase1017 is currently at a BusyBox shell .. reinstall failed?
[20:47:30] urandom: is that expected, and can we merge anytime, even before that is fixed and reinstalled?
[20:47:45] mutante: I think it can be merged at any time
[20:48:06] it's at busybox because... I dunno, I guess cmjohnson just ran out of time
[20:48:16] mutante: it's in limbo pending a reimage
[20:48:28] oh, i see you asked about the reimage yourself, heh
[20:48:47] ok.. looking
[20:49:20] given it's already down.. merging should be fine, ACK, doing that
[20:49:37] mutante: thanks, that'll be one less thing :)
[20:51:55] confirmed the DNS part was done
[20:53:18] 10serviceops, 10Cassandra, 10Operations, 10ops-eqiad, and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Dzahn)
[20:54:24] 10serviceops, 10Cassandra, 10Operations, 10ops-eqiad, and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` restbase1017.eqiad.wmnet ` The log can be found in `/va...
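[editor's note] The Apache feature discussed above is `mod_log_forensic`, which matches cdanis's description: a `+<id>|<request line>|<headers>` line when each request starts and a bare `-<id>` line when it completes. At WMF this is toggled through Puppet/Hiera as described; the manual Debian-style enablement below is an editor's sketch for context, not the actual puppetization.

```shell
# Manual sketch (Debian/Ubuntu apache2 layout); WMF drives this via Puppet/Hiera.
sudo a2enmod log_forensic
echo 'ForensicLog ${APACHE_LOG_DIR}/forensic.log' |
    sudo tee /etc/apache2/conf-available/forensic.conf
sudo a2enconf forensic
sudo systemctl reload apache2

# One "+id|..." line per request start, one "-id" line per completion;
# a "+" entry with no matching "-" is a request that is stuck or still running.
sudo tail -f /var/log/apache2/forensic.log
```

The unmatched-"+" property is what makes the log handy for finding long-running or stuck requests, as noted in the conversation.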
[20:58:25] debian installer running
[21:40:48] had to manually restart hhvm on mw1290 and mw1235; they had been in "socket timeout" alerts in icinga for hhvm or hhvm/apache/nginx for about 5h and 1.5h, and did not fix themselves
[21:41:14] but a manual restart made them recover
[21:43:28] mwdebug1002 - WARNING: opcache free space is below 100 MB .. hmm
[21:50:43] i learned about this: !log mwdebug1002 - php7adm /opcache-free because icinga showed a warning for opcache free space below 100MB
[21:50:51] but i did not find a cron / timer yet that runs it
[22:11:05] 22:09:07 | restbase1017.eqiad.wmnet | Still waiting for Puppet after 50.0 minutes
[22:12:24] hrmm.. i am really unlucky running that script
[22:17:01] 100.0% (1/1) of nodes failed to execute command 'source /usr/loca...PUPPET_SUMMARY}"': restbase1017.eqiad.wmnet
[22:17:17] 9105 0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[22:17:20] sigh
[23:17:02] 10serviceops, 10Cassandra, 10Operations, 10ops-eqiad, and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1017.eqiad.wmnet'] ` Of which those **FAILED**: ` ['restbase1017.eqiad.wmnet'] `
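[editor's note] The manual remediations mutante describes above reduce to two commands on the affected hosts. `php7adm` is WMF's local PHP admin tool and `/opcache-free` is the endpoint named in the log; everything else (host names, the idea of wiring it into a timer) is restated from the conversation, not new tooling.

```shell
# Hung HHVM with "socket timeout" alerts (mw1290, mw1235 above): manual restart.
sudo systemctl restart hhvm

# "opcache free space is below 100 MB" warning (mwdebug1002 above):
# reset the PHP7 opcache via the local admin endpoint.
php7adm /opcache-free
# As noted in the log, no cron/timer runs this automatically yet.
```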