[01:00:28] serviceops, Wikimedia-production-error: Spike of fatal error "Cannot declare class Wikimedia\MWConfig" on mw1379 (2020-06-01) - https://phabricator.wikimedia.org/T254209 (Krinkle)
[03:43:45] serviceops, Operations, Thumbor, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles) It's waiting for someone to do it. @jijiki when she gets back from leave, possibly?
[10:55:04] serviceops, observability: "PHP opcache hit ratio" alert shouldn't bother on mwdebug*/scandium/etc - https://phabricator.wikimedia.org/T254025 (Dzahn) re: alert on scandium. It triggered again today and I left a permanent comment with a link to this ticket.
[12:05:34] serviceops, Recommendation-API, Release-Engineering-Team, Services, Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (akosiaris) @bmansurov namespaces, rules, tokens have been created. Chart has been merged and published. You are free to de...
[12:06:01] serviceops, Mobile-Content-Service, Page Content Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (akosiaris)
[12:06:35] serviceops, Mobile-Content-Service, Page Content Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (akosiaris) @mholloway, @bearND namespaces, rules, tokens have been created. Chart has b...
[12:57:17] serviceops: Migrate ORES redis database functionality to the redis misc cluster - https://phabricator.wikimedia.org/T245591 (akosiaris)
[12:59:11] serviceops, Operations, Traffic, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Vgutierrez) can we close this task or at least change the task title to focus on the Icinga alerts? there is no issue with cert renewal itself :) ` w...
[13:08:41] serviceops, Operations, Traffic, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Dzahn) Yes, it should be renamed. But I think it is the traffic team's decision what to do about the monitoring, per this being the "primary automated mon...
[13:22:08] Hey! I'm doing some sizing estimation for the new SPARQL Endpoint for Commons cluster: T254232
[13:22:58] This seems too small to make sense in a rack; it looks like wasted space. Do we have an option (Ganeti / k8s / ...?) for a small stateful service?
[13:23:19] at least for the short term, until we have a better idea of the sizing / data growth?
[13:23:48] cc: akosiaris ^
[13:24:47] k8s+stateful isn't currently an option you want to go with. A lot of pain lies that way. But a couple of VMs in Ganeti is plausible
[13:25:14] what initial capacity asks would you have?
[13:25:28] complete guess: 6x single Xeon (2C/4T), 32G RAM, 100G usable HDD space, 1G NIC
[13:25:56] the current loaded data is 2.5G (but that's 6 months old; it has probably grown 5x since then)
[13:26:12] 32G per VM isn't possible unfortunately. The rest seems pretty ok
[13:26:19] hard to estimate the request load, this is a new service :/
[13:26:39] we can probably get away with 16G at the moment
[13:27:06] and how many VMs? 2? 3?
[13:27:29] at least 2, not sure how we manage redundancy on Ganeti
[13:27:47] same as you would with hardware. Split across rows
[13:27:55] I was planning on 3 real servers per cluster, so that we can still do maintenance if we lose one
[13:28:17] don't we have redundancy built into Ganeti?
[13:28:32] * gehel knows almost nothing about Ganeti
[13:28:49] we have some. If a node dies, we can start the VM on a different node and the data will be there
[13:29:14] but to be robust enough against the usual failure scenarios you need multi-row
[13:29:15] losing data isn't much of an issue, this is a secondary store
[13:29:25] which means 1 VM per row at least
[13:29:55] we are now adding 3rd and 4th rows so that should be doable
[13:30:15] but we are a bit packed currently so you might have to wait a bit
[13:30:25] my understanding was that rows are separate failure domains, so 2 rows should provide redundancy against a row-level failure. Or do we consider it likely to lose 2 rows at the same time?
[13:31:47] your understanding is correct
[13:32:07] the "1 VM per row at least" had me confused
[13:32:13] but you did mention 3 real servers per cluster, which is why I pointed out we are now expanding to a 3rd and 4th row
[13:32:19] ah, I can explain that
[13:32:36] Oh, we only have Ganeti in 2 rows?
[13:32:47] yes, currently. We are finally fixing that these days
[13:32:55] ok, makes sense
[13:33:06] and we are also going to add capacity and rebalance some things to make room for more VMs
[13:33:19] what's your ETA?
[13:33:28] this would at some point be a production service, is that an issue?
[13:33:46] we can probably delay until Q2
[13:34:16] we have a test instance on WMCS which should be enough to let people start playing with it for a while and start implementing a few bots
[13:34:36] but we'll want something more robust in terms of monitoring at some point
[13:34:43] no it's not an issue and Q2 sounds awesome. I was aiming to have the prereqs done before Q1
[13:34:55] Q1 would be better ;)
[13:35:25] as far as having capacity in Ganeti goes, sure. I am betting there are going to be more blockers :P
[13:35:51] in all honesty, if we can make it happen in Q1, I give it a higher chance of success than in Q2
[13:35:56] blockers? what do you mean? never heard of those before
[13:36:00] ahahahahaha
[13:36:42] and Ganeti is in both codfw and eqiad, right? we would have DC-level redundancy?
[13:36:47] yes
[13:36:55] ok, I'm updating the task
[13:37:32] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw217...
[13:39:45] akosiaris: so the RAM limit in Ganeti would be 16G?
[13:40:05] IIRC yes, there is such a limit
[13:40:25] but the nodes have 64GB of RAM anyway, so even if the software did not have such a limit, the node would :P
[13:41:36] 16G might be short (no real data, just a hunch), this needs some more checking on our side
[13:42:01] gehel: btw, thanks for reaching out so early on. It's nice to have a heads-up and not an "ASAP"
[13:42:11] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[13:42:19] he, he, he... this is going to turn into ASAP soon enough!
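(Editor's note: the capacity numbers in the exchange above, [13:25:28]-[13:26:39], are explicitly guesses. As a rough illustration only, the Python sketch below turns those guesses into a projection: 2.5G of loaded data six months ago, roughly 5x growth since, and 100G of usable disk per VM. The 5x-per-six-months growth rate is an assumption carried over from the chat, not a measured trend.)

```python
# Back-of-the-envelope sizing for the Commons SPARQL endpoint VMs, using only
# the guesses from the chat above. The growth model is a deliberate
# over-simplification, not a real forecast.

data_six_months_ago_gb = 2.5   # "the current loaded data is 2.5G (but that's 6 months old)"
growth_per_half_year = 5       # assumed ~5x per six months, as guessed in the chat
usable_disk_gb = 100           # requested usable disk space per VM

current_data_gb = data_six_months_ago_gb * growth_per_half_year  # ~12.5G today

months = 0
data_gb = current_data_gb
while data_gb < usable_disk_gb:
    data_gb *= growth_per_half_year
    months += 6

print(f"at ~{current_data_gb:.0f}G today, the {usable_disk_gb}G ask is exceeded "
      f"by the ~{months}-month checkpoint under a {growth_per_half_year}x/6mo guess")
```

(Under that growth guess the 100G ask would be outgrown within roughly a year, which is presumably why both sides keep stressing that sizing and data growth need another look before anything is committed.)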
[13:44:33] and it will be nice not to commit to buying real hardware until we have some idea of what is actually needed!
[13:45:20] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[13:45:42] akosiaris: we still need to check a few things on this. But when we are ready, do you need anything from our side to provision the hardware for next year?
[13:46:22] hardware? you mean the VMs? Yes, a task https://phabricator.wikimedia.org/project/view/1234/
[13:46:49] no, I mean the real underlying hardware
[13:46:58] or is that transparent to me?
[13:47:15] the latter
[13:47:27] cool!
[13:47:32] all the hardware is racked already, we are just adding it to the clusters
[13:47:38] * gehel likes it when he does not have to care about the details
[13:48:19] how do you manage capacity on Ganeti? just plan for x% growth every year?
[13:52:35] yeah, that
[13:53:19] gehel: that was my tool this year. https://grafana.wikimedia.org/d/xEDjLvgMz/cluster-resource-predictions?orgId=1&var-datasource=codfw%20prometheus%2Fglobal&var-cluster=ganeti&var-months=12
[13:53:37] Oh right, I saw that one go past.
[13:53:51] so history is a predictor of the future!
[13:54:03] yeah, what could possibly go wrong? :P
[14:42:30] serviceops, Performance-Team (Radar): Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle)
[20:29:41] serviceops, Operations, Performance-Team, Traffic, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (Krinkle)
[23:16:51] serviceops, Recommendation-API, Release-Engineering-Team, Services, Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (bmansurov) Thanks, @akosiaris!
[23:19:53] serviceops, Recommendation-API, Release-Engineering-Team, Services, Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (Reedy)
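(Editor's note: the capacity-planning exchange at [13:48:19]-[13:53:51] ("just plan for x% growth every year", "history is a predictor of the future") amounts to extrapolating past usage forward. The linked Grafana dashboard is built on Prometheus data; the standalone sketch below only illustrates the idea with entirely made-up sample numbers. PromQL's predict_linear() performs the analogous computation directly on a range vector.)

```python
# Minimal illustration of "history is a predictor of the future" capacity
# planning: fit a straight line to past usage and extrapolate N months ahead.
# The samples below are fabricated; the real data lives in Prometheus/Grafana.

# (month index, memory used across the cluster in GB) -- made-up samples
samples = [(0, 310.0), (1, 322.0), (2, 331.0), (3, 345.0), (4, 352.0), (5, 366.0)]

def linear_fit(points):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

slope, intercept = linear_fit(samples)
months_ahead = 12
predicted = slope * (samples[-1][0] + months_ahead) + intercept
print(f"~{slope:.1f} GB/month growth; predicted usage in {months_ahead} months: ~{predicted:.0f} GB")
```

(The obvious caveat is the one the chat jokes about: a straight line through last year's usage says nothing about new services suddenly landing on the cluster.)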