[01:00:28] serviceops, Wikimedia-production-error: Spike of fatal error "Cannot declare class Wikimedia\MWConfig" on mw1379 (2020-06-01) - https://phabricator.wikimedia.org/T254209 (Krinkle)
[03:43:45] serviceops, Operations, Thumbor, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles) It's waiting for someone to do it. @jijiki when she gets back from leave, possibly?
[10:55:04] serviceops, observability: "PHP opcache hit ratio" alert shouldn't bother on mwdebug*/scandium/etc - https://phabricator.wikimedia.org/T254025 (Dzahn) re: alert on scandium. It triggered again today and I left a permanent comment with a link to this ticket.
[12:05:34] serviceops, Recommendation-API, Release-Engineering-Team, Services, Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (akosiaris) @bmansurov namespaces, rules, tokens have been created. Chart has been merged and published. You are free to de...
[12:06:01] serviceops, Mobile-Content-Service, Page Content Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (akosiaris)
[12:06:35] serviceops, Mobile-Content-Service, Page Content Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (akosiaris) @mholloway, @bearND namespaces, rules, tokens have been created. Chart has b...
[12:57:17] serviceops: Migrate ORES redis database functionality to the redis misc cluster - https://phabricator.wikimedia.org/T245591 (akosiaris)
[12:59:11] serviceops, Operations, Traffic, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Vgutierrez) can we close this task or at least change the task title to focus on the Icinga alerts? there is no issue with cert renewal itself :) ` w...
[13:08:41] serviceops, Operations, Traffic, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Dzahn) Yes, it should be renamed. But I think it is the traffic team's decision what to do about the monitoring, per this being the "primary automated mon...
[13:22:08] Hey! I'm doing some sizing estimation for the new SPARQL Endpoint for Commons cluster: T254232
[13:22:58] This seems too small to make sense in a rack; it looks like wasted space. Do we have an option (Ganeti / k8s / ...?) for a small stateful service?
[13:23:19] at least for the short term, until we have a better idea of the sizing / data growth?
[13:23:48] cc: akosiaris ^
[13:24:47] k8s+stateful isn't currently an option you want to go with. A lot of pain lies that way. But a couple of VMs in Ganeti is plausible
[13:25:14] what initial capacity asks would you have?
[13:25:28] complete guess: 6x single Xeon (2C/4T), 32G RAM, 100G usable HDD space, 1G NIC
[13:25:56] the current loaded data is 2.5G (but that's 6 months old; it has probably grown 5x since then)
[13:26:12] 32G per VM isn't possible unfortunately. The rest seems pretty ok
[13:26:19] hard to estimate the request load, this is a new service :/
[13:26:39] we can probably get away with 16G at the moment
[13:27:06] and how many VMs? 2? 3?
[13:27:29] at least 2, not sure how we manage redundancy on Ganeti
[13:27:47] same as you would with hardware. Split across rows
[13:27:55] I was planning on 3 real servers per cluster, so that we can still do maintenance if we lose one
[13:28:17] don't we have redundancy built into Ganeti?
[13:28:32] * gehel knows almost nothing about Ganeti
[13:28:49] we have some. If a node dies, we can start the VM on a different node and the data will be there
[13:29:14] but to be robust enough against the usual failure scenarios you need multi-row
[13:29:15] losing data isn't much of an issue, this is a secondary store
[13:29:25] which means 1 VM per row at least
[13:29:55] we are now adding 3rd and 4th rows so that should be doable
[13:30:15] but we are a bit packed currently so you might have to wait a bit
[13:30:25] my understanding was that rows are separate failure domains, so 2 rows should provide redundancy against a row-level failure. Or do we consider it likely to lose 2 rows at the same time?
[13:31:47] your understanding is correct
[13:32:07] the "1 VM per row at least" had me confused
[13:32:13] but you did mention 3 real servers per cluster, which is why I pointed out we are now expanding to a 3rd and 4th row
[13:32:19] ah, I can explain that
[13:32:36] Oh, we only have Ganeti in 2 rows?
[13:32:47] yes, currently. We are finally fixing that these days
[13:32:55] ok, makes sense
[13:33:06] and we are also going to add capacity and rebalance some things to make room for more VMs
[13:33:19] what's your ETA?
[13:33:28] this would at some point be a production service, is that an issue?
[13:33:46] we can probably delay until Q2
[13:34:16] we have a test instance on WMCS which should be enough to let people start playing with it for a while and start implementing a few bots
[13:34:36] but we'll want something more robust in terms of monitoring at some point
[13:34:43] no it's not an issue and Q2 sounds awesome. I was aiming to have the prereqs done before Q1
[13:34:55] Q1 would be better ;)
[13:35:25] as far as having capacity in Ganeti goes, sure. I am betting there are going to be more blockers :P
[13:35:51] in all honesty, if we can make it happen in Q1, I give it a higher chance of success than in Q2
[13:35:56] blockers? what do you mean? never heard of those before
[13:36:00] ahahahahaha
[13:36:42] and Ganeti is in both codfw and eqiad, right? we would have DC-level redundancy?
[13:36:47] yes
[13:36:55] ok, I'm updating the task
[13:37:32] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw217...
[13:39:45] akosiaris: so the RAM limit in Ganeti would be 16G?
[13:40:05] IIRC yes, there is such a limit
[13:40:25] but the nodes have 64GB of RAM anyway, so even if the software did not have such a limit, the node would :P
[13:41:36] 16G might be short (no real data, just a hunch), this needs some more checking on our side
[13:42:01] gehel: btw, thanks for reaching out so early on. It's nice to have a heads-up and not an "ASAP"
[13:42:11] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[13:42:19] he, he, he... this is going to turn into ASAP soon enough!
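(Editor's note: the capacity numbers in the exchange above, [13:25:28]-[13:26:39], are explicitly guesses. As a rough illustration only, the Python sketch below turns those guesses into a projection: 2.5G of loaded data six months ago, roughly 5x growth since, and 100G of usable disk per VM. The 5x-per-six-months growth rate is an assumption carried over from the chat, not a measured trend.)

```python
# Back-of-the-envelope sizing for the Commons SPARQL endpoint VMs, using only
# the guesses from the chat above. The growth model is a deliberate
# over-simplification, not a real forecast.

data_six_months_ago_gb = 2.5   # "the current loaded data is 2.5G (but that's 6 months old)"
growth_per_half_year = 5       # assumed ~5x per six months, as guessed in the chat
usable_disk_gb = 100           # requested usable disk space per VM

current_data_gb = data_six_months_ago_gb * growth_per_half_year  # ~12.5G today

months = 0
data_gb = current_data_gb
while data_gb < usable_disk_gb:
    data_gb *= growth_per_half_year
    months += 6

print(f"at ~{current_data_gb:.0f}G today, the {usable_disk_gb}G ask is exceeded "
      f"by the ~{months}-month checkpoint under a {growth_per_half_year}x/6mo guess")
```

(Under that growth guess the 100G ask would be outgrown within roughly a year, which is presumably why both sides keep stressing that sizing and data growth need another look before anything is committed.)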
[13:44:33] and it will be nice not to commit to buying real hardware until we have some idea of what is actually needed!
[13:45:20] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[13:45:42] akosiaris: we still need to check a few things on this. But when we are ready, do you need anything from our side to provision the hardware for next year?
[13:46:22] hardware? you mean the VMs? Yes, a task https://phabricator.wikimedia.org/project/view/1234/
[13:46:49] no, I mean the real underlying hardware
[13:46:58] or is that transparent to me?
[13:47:15] the latter
[13:47:27] cool!
[13:47:32] all the hardware is racked already, we are just adding it to the clusters
[13:47:38] * gehel likes it when he does not have to care about the details
[13:48:19] how do you manage capacity on Ganeti? just plan for x% growth every year?
[13:52:35] yeah, that
[13:53:19] gehel: that was my tool this year. https://grafana.wikimedia.org/d/xEDjLvgMz/cluster-resource-predictions?orgId=1&var-datasource=codfw%20prometheus%2Fglobal&var-cluster=ganeti&var-months=12
[13:53:37] Oh right, I saw that one go past.
[13:53:51] so history is a predictor of the future!
[13:54:03] yeah, what could possibly go wrong? :P
[14:42:30] serviceops, Performance-Team (Radar): Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle)
[20:29:41] serviceops, Operations, Performance-Team, Traffic, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (Krinkle)
[23:16:51] serviceops, Recommendation-API, Release-Engineering-Team, Services, Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (bmansurov) Thanks, @akosiaris!
[23:19:53] serviceops, Recommendation-API, Release-Engineering-Team, Services, Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (Reedy)
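(Editor's note: the capacity-planning exchange at [13:48:19]-[13:53:51] ("just plan for x% growth every year", "history is a predictor of the future") amounts to extrapolating past usage forward. The linked Grafana dashboard is built on Prometheus data; the standalone sketch below only illustrates the idea with entirely made-up sample numbers. PromQL's predict_linear() performs the analogous computation directly on a range vector.)

```python
# Minimal illustration of "history is a predictor of the future" capacity
# planning: fit a straight line to past usage and extrapolate N months ahead.
# The samples below are fabricated; the real data lives in Prometheus/Grafana.

# (month index, memory used across the cluster in GB) -- made-up samples
samples = [(0, 310.0), (1, 322.0), (2, 331.0), (3, 345.0), (4, 352.0), (5, 366.0)]

def linear_fit(points):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

slope, intercept = linear_fit(samples)
months_ahead = 12
predicted = slope * (samples[-1][0] + months_ahead) + intercept
print(f"~{slope:.1f} GB/month growth; predicted usage in {months_ahead} months: ~{predicted:.0f} GB")
```

(The obvious caveat is the one the chat jokes about: a straight line through last year's usage says nothing about new services suddenly landing on the cluster.)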