[01:29:37] serviceops, Operations, WMF-Legal, Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (JbuattiWMF) Hello friends, and happy new year. Legal was hoping to check back in on #1 and #2 (but #1 in particular). Thanks s...
[06:11:13] serviceops, observability, Wikimedia-Incident: Add alert for app servers in prod serving outdated MediaWiki branches - https://phabricator.wikimedia.org/T242023 (Joe) Open→Stalled This isn't going to happen until some effort is put in making scap's management of data saner. An alert can't be...
[06:11:15] serviceops, Release-Engineering-Team-TODO, Scap, Release-Engineering-Team (Deployment services): Define a mediawiki "version" - https://phabricator.wikimedia.org/T218412 (Joe)
[09:10:27] serviceops, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), and 3 others: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 (Pginer-WMF)
[09:15:51] _joe_ akosiaris rlazarus should we discuss racking wrt the new servers in codfw and eqiad on our Thu meeting
[09:16:01] or set up a different one?
[09:16:28] racking?
[09:16:47] do we want to follow something different than the usual "spread them as much as possible"?
[09:16:59] the difference here is that
[09:17:07] they are the same specs
[09:17:16] but we have mw, k8s, and wtp
[09:17:33] there are a few servers purposed for parsoid in there
[09:18:22] I can try and do some of the legwork and finalise it on Thu
[09:34:06] serviceops, Operations: Migrate Zookeeper/etcd conf cluster in codfw to Buster - https://phabricator.wikimedia.org/T224560 (MoritzMuehlenhoff)
[09:34:16] serviceops, Operations, Wikimedia-Etherpad: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (MoritzMuehlenhoff)
[09:47:51] akosiaris: can I poke you about the termbox test service? We managed to get it into a bit of a mess yesterday while you were out of the office. (but nothing really urgent since this is on test)
[09:48:36] tarrow: sure. it would definitely be more fun than reading holiday emails
[09:48:42] I believe we tried to helmfile apply a new values file for a new version of the termbox image
[09:49:22] unfortunately (for a reason I'll come on to later) the image will not work. We then reverted and ran helmfile apply
[09:49:54] sadly it seems that helmfile now always times out on trying to apply
[09:50:22] and we have a pod (with the new "broken" image) stuck in a crashloopbackoff loop
[09:50:59] happily the old pod from last month is still running fine
[09:53:46] ah, yes I see it. lemme figure out what happened
[09:53:51] cool
[09:54:50] so the "problem" with the new image is that we had to move the entrypoint. We should also figure out how to sort that in a bit but right now I'd like to clean up after myself (/us)
[09:55:28] hmm helmfile sync gets stuck as well at PENDING_UPGRADE
[09:55:31] I guess I could just kubectl delete but that seems scary
[09:56:05] yep, I guess because it never managed to apply the first "upgrade"
[09:57:29] helmfile delete followed by a helmfile sync would fix it
[09:57:35] right
[09:57:39] and it's fine for a test thing
[09:57:47] shall I try that?
[09:57:54] but I am worried about production, which is why I want to dig a bit deeper
[09:58:08] what if that happened when deploying in production...
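
A minimal sketch of the diagnosis and recovery discussed above, assuming Helm 2-era commands run from wherever the service's helmfile lives; the termbox namespace is an assumption, not confirmed in the log:

    # A failed upgrade leaves the release stuck in PENDING_UPGRADE,
    # so every subsequent helmfile apply/sync times out against it
    helm list | grep termbox
    kubectl get pods -n termbox   # shows the pod in CrashLoopBackOff

    # The recovery that worked here: fine for a test service, but
    # destructive in production, since it deletes the release outright
    helmfile destroy
    helmfile sync
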
[09:58:51] yeah, that would be "not great"
[09:59:54] fwiw I already tried a `helmfile sync` without a `helmfile delete` first and that times out in the same way as an apply
[10:20:32] Also, just to make sure you aren't confused by it: This https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/562272 was merged yesterday but without bumping the chart version, or rebuilding or anything. Maybe it should be reverted but we decided to just stop touching stuff until we had a better idea what was going on.
[10:21:08] yeah, I saw that. Since you didn't bump the version it's a noop
[10:22:03] sigh, I had hoped that helmfile sync --args '--recreate-pods --force' would solve this, but unfortunately no
[10:22:27] the pods were recreated, but the replicaset is still there
[10:22:49] so we still have a crashloopbackoff pod, albeit a fresh one
[10:23:07] funnily enough, helm allows really easy rollbacks
[10:23:27] but helmfile doesn't expose that on purpose, to not mess with gitops deploys
[10:26:56] yeah, we even tried an old helm-style rollback but that was also a noop.
[10:28:53] yeah I saw that
[10:28:55] 6 Mon Jan 6 17:11:16 2020 DEPLOYED termbox-0.0.3 Rollback to 1
[10:29:00] I am surprised it did not help at all
[10:29:41] * akosiaris rerunning it
[10:29:50] ah so now the state of the release is DEPLOYED
[10:30:12] but interestingly enough the problematic replicaset is still there
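
For reference, a sketch of the rollback-and-inspect sequence tried above, in Helm 2 syntax; the namespace and the ReplicaSet name are placeholders:

    # helm exposes rollbacks directly; helmfile deliberately does not,
    # so as not to fight gitops-driven deploys
    helm rollback termbox 1
    helm list termbox   # release state should now read DEPLOYED

    # A rollback can still leave the broken ReplicaSet behind, with its
    # pod crash-looping, until it is scaled down or deleted
    kubectl get replicasets -n termbox
    kubectl describe replicaset <broken-replicaset-name> -n termbox
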
[10:51:27] serviceops, Operations, Patch-For-Review: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (jijiki) Open→Resolved a: jijiki There was a bug in the mo...
[11:13:08] tarrow: sigh, I failed. In the end, helmfile destroy and helmfile sync fixed it, but that's not great.
[11:13:30] I 'll try and reproduce locally in a kube env to not block you any more on this though
[11:13:47] I think I have all the info I need (it's essentially a failed upgrade anyway)
[11:21:09] tarrow: but to answer your question from yesterday. The reason we override the command and args in the helm chart is just to pass the -c /etc/termbox/config.yaml thing. That file is the one that gets mounted as you mentioned and it differs a bit from the main one (it has the templated monitoring section, and the logging as well). That last part should probably be removed now that we can rely just on stdout since the infrastructure takes care of the logs
[11:22:27] so if we could get to a point where the helm-provided config file is identical to the one in the image, we could indeed ditch it from the chart
[11:25:19] <_joe_> well the whole point of overriding the config file is to support different environments
[11:28:43] serviceops, Arc-Lamp, Performance-Team: Backups for arclamp application data - https://phabricator.wikimedia.org/T235481 (Gilles) a: dpifke
[11:32:20] serviceops, Arc-Lamp, Performance-Team: Decom the ArcLamp pipeline for HHVM/Xenon - https://phabricator.wikimedia.org/T233884 (Gilles) a: dpifke
[11:37:50] serviceops, Performance-Team, Release-Engineering-Team: Create warmup procedure for MediaWiki app servers - https://phabricator.wikimedia.org/T230037 (Gilles) a: dpifke
[12:06:26] _joe_: Makes sense that we need to support different environments, but ideally I think we should have just one source of getting the "environment configuration" into the container. I think if we could do it only by ENV variables that would be nicer. Having both ENV vars and the custom config at run time means that settings end up being specified in more than one place and it's not clear which takes precedence
[14:26:38] akosiaris: rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/549177
[14:26:46] no more cache/text_ats.yaml! :)
[14:33:25] yes that duplication is finally gone \o/
[14:36:38] ottomata: merging. I 'll clean up as well
[14:36:46] woo thanks
[14:38:51] akosiaris: lemme know when that is applied, i'll check some stuff
[14:54:11] ottomata: done
[14:54:39] looks like it's working just by browsing https://schema.wikimedia.org/#!/
[14:55:28] looks gooood thank you!
[14:55:56] akosiaris: would you want to do some of the other LVS stuff today too?
[14:56:04] those are a bit more risky, but should be ok
[14:57:03] ottomata: like?
[14:57:21] I 've been away for almost 2 weeks, assume my memory cache is empty :P
[14:58:13] sent you an email yesterday with a summary :)
[14:58:25] but
[14:58:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/559167
[14:58:27] and
[14:58:37] ah, still going through my email
[14:58:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/559168
[14:58:54] ok, I 'll have a look
[14:59:53] those are active services. the https port works fine with .svc and discovery urls
[15:52:36] akosiaris: so if we wanted to bump our chart version to use a new entrypoint, are there docs somewhere about how to do that? Looks like we need to build that tgz somewhere, right? Do we have to do something special after the commit is merged to get it into a chart repository?
[16:56:59] tarrow: after bumping the Chart.yaml version, helm package termbox; helm repo index . should suffice. Those last 2 parts we aim to automate soon
[16:57:23] :) thanks!
[16:57:26] so that only bumping the version in Chart.yaml is required
[16:58:05] Would the deployment-charts README.md be a good place for me to stick those details for the next people?
[17:00:00] yup, definitely
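
The bump-and-publish workflow spelled out above, as a sketch; only `helm package termbox` and `helm repo index .` are quoted from the log, the surrounding steps and paths are assumptions:

    # 1. Bump the version: field in the chart's Chart.yaml
    $EDITOR termbox/Chart.yaml

    # 2. Package the chart into a .tgz and regenerate the repository
    #    index (the two steps serviceops aims to automate)
    helm package termbox
    helm repo index .

    # 3. Commit the result so the chart repository serves the new version
    git add . && git commit -m "termbox: bump chart version"
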
[17:27:24] ottomata: regarding https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/559167/, shouldn't we somehow coordinate this with the clients of the service?
[17:27:53] It's a scheme + port change; whatever talks to it needs to be made aware really soon after the change is merged
[17:28:12] same goes for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/559168/ as well
[17:28:31] and there are bound to be refused queries as well
[17:29:59] alternatively we can create a new LVS service, move over the clients and then remove the old one
[17:30:12] I 'll reply to the email as well
[17:35:33] akosiaris: wow you are very right.
[17:35:49] yeah we should make a new service and migrate separately.
[17:35:51] will make patches for that.
[17:35:52] thanks
[20:13:16] serviceops, observability, Wikimedia-Incident: Add alert for app servers in prod serving outdated MediaWiki branches - https://phabricator.wikimedia.org/T242023 (Krinkle) Enforcing full integrity and equality of the /srv/mediawiki directory would be awesome but that's imho an incremental improvement...
[22:26:13] serviceops, Release-Engineering-Team, wikitech.wikimedia.org, cloud-services-team (Kanban): Test Wikitech is still running wmf.8 (should be on wmf.11) - https://phabricator.wikimedia.org/T241251 (Jdforrester-WMF) Open→Resolved a: Andrew Yup, this is now resolved. Thanks, Andrew.
[22:28:53] serviceops, Release-Engineering-Team, wikitech.wikimedia.org, cloud-services-team (Kanban): Test Wikitech is still running wmf.8 (should be on wmf.11) - https://phabricator.wikimedia.org/T241251 (Jdforrester-WMF) Resolved→Open Ignore me, not in the dsh group yet.
[22:42:31] serviceops, Release-Engineering-Team, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): Test Wikitech is still running wmf.8 (should be on wmf.11) - https://phabricator.wikimedia.org/T241251 (bd808) Open→Resolved a: Andrew→bd808 Should be in the dsh group...
[23:45:04] serviceops, Release-Engineering-Team, wikitech.wikimedia.org, cloud-services-team (Kanban): Test Wikitech is still running wmf.8 (should be on wmf.11) - https://phabricator.wikimedia.org/T241251 (Dzahn) <+icinga-wm> RECOVERY - mediawiki-installation DSH group on cloudweb2001-dev is OK
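
A sketch of the migrate-then-remove approach agreed at 17:29–17:35 above for the LVS scheme + port change; the hostname, port, and path are placeholders, not the real service records:

    # Stand up the new HTTPS LVS service alongside the old plaintext one,
    # then verify it answers on the .svc and discovery names before any
    # client is switched over
    curl -sv https://example.svc.eqiad.wmnet:4443/ -o /dev/null

    # Move clients over one at a time; the old LVS service is removed only
    # once nothing queries the old scheme and port, so no requests are refused
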