[00:33:26] Are there problems with the wiki replicas right now?
[00:34:00] I can't connect from PAWS, pymysql waits a while then raises InternalError: (1105, '(proxy) all backends are down')
[00:34:36] Connecting from k8s works fine every time
[01:51:09] AntiComposite: I'm able to access it from bastion using sql enwiki
[01:51:14] enwiki_p*
[04:29:17] should update status on T245804
[04:29:18] T245804: Reassign base URLs for toolinfo records' web service links - https://phabricator.wikimedia.org/T245804
[07:38:42] morning
[07:39:13] anyone playing with the new toolforge.org domain?
[11:50:42] godog: you around?
[12:20:14] !log tools.zppixbot switched wiki to zppixbot.toolforge.org - T250080
[12:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:20:17] T250080: Switch ZppixBotWiki to use new domain - https://phabricator.wikimedia.org/T250080
[12:24:31] arturo: sure, what's up?
[12:26:30] godog: we have a meeting today in the WMCS to discuss the future of our monitoring/alerting stack in the cloud
[12:26:54] we have been using shinken, but it is mostly an abandoned project upstream and we will stop using it
[12:27:02] !log tools.zppixbot updated website to deploy T250083
[12:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:27:04] T250083: Update Documentation to use ZppixBot.toolforge.org - https://phabricator.wikimedia.org/T250083
[12:27:18] we are looking for replacements for a cloud-ready solution (you know, multi-tenant, blabla)
[12:27:27] and I wonder if you might have any recommendation godog
[12:28:26] arturo: it largely depends on the exact use cases / requirements but right off the bat I'd recommend exploring alertmanager and prometheus alerting
[12:28:40] like we'll be doing in production, per the alerting roadmap
[12:29:55] godog: I don't even know if we can monitor arbitrary endpoints with prometheus
[12:30:11] !log tools.zppixbot set meeting_log_baseurl = https://zppixbot.toolforge.org - T250078
[12:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:30:13] T250078: Switch meetbot to use new domain - https://phabricator.wikimedia.org/T250078
[12:30:17] I know about the xxxx_up metric
[12:31:01] !log tools.zppixbot-test set meeting_log_baseurl = https://zppixbot-test.toolforge.org - T250078
[12:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[12:31:40] arturo: for basic http checks you can use the blackbox exporter, and custom things can be implemented e.g. via a daemon that runs the checks and exposes the results
[12:32:12] arturo: happy to assist/discuss more if you'd like, e.g. the full list of use cases
[12:32:57] what I envision as our idea use case is something like Monitoring-as-a-Service
[12:33:04] ideal*
[12:33:09] !log tools.zppixbot reboot bot & webservice to switchover - T250076
[12:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:33:13] T250076: Switch the redirect for ZppixBot’s tests web service - https://phabricator.wikimedia.org/T250076
[12:33:15] !log tools.zppixbot-test reboot bot & webservice to switchover - T250076
[12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[12:33:24] * RhinosF1 is done !logging
[12:33:40] CloudVPS users click a button somewhere to enable monitoring for their projects. We then reuse that for toolforge, too, of course godog
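For reference, a minimal sketch of the blackbox-exporter approach described above, seen from the Prometheus side; the exporter address, probe module and target list here are illustrative assumptions, not an existing WMCS configuration:

```yaml
# Hypothetical Prometheus scrape job probing tool web endpoints through a
# blackbox exporter. The exporter host/port and the targets are assumptions.
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]            # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://zppixbot.toolforge.org
          - https://www.toolforge.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the target URL to the exporter
      - source_labels: [__param_target]
        target_label: instance           # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter.example.wmflabs:9115  # hypothetical exporter host
```

The exporter exposes a probe_success metric per target, which plays much the same role as the xxxx_up metrics mentioned above and can be alerted on (e.g. probe_success == 0).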
[12:34:04] RhinosF1: are you using the `--canonical` switch?
[12:35:46] arturo: once I finish restarting both tools
[12:39:25] arturo: zppixbot is on it now
[12:39:28] arturo: ah! ok I think I got it more or less
[12:42:45] arturo: zppixbot-test is now
[12:43:20] RhinosF1: ok!
[12:51:10] looks like everything is working
[12:57:51] godog: how would we handle multi-tenancy with alermanager?
[12:57:54] alartmanager*
[12:58:02] alertmanager**
[13:01:25] I think it is more of how you store prometheus metrics
[13:04:02] arturo: part of the answer has to do with the metrics themselves yeah
[13:05:23] it is a bit of a long-winded discussion with a bunch of options depending on what's needed though, happy to follow up with a bit more structure/time
[13:05:39] fair
[13:38:17] arturo: the openstack sd config is a simple way to inventory and add "multi-tenancy" tags for the cloudvps instances
[13:38:32] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#openstack_sd_config
[13:41:02] you can then reference the project tags with alertmanager for routing notifications
[15:26:27] jeh: yes, interesting. We do something similar with the toolforge kubernetes
[15:27:31] I wonder what would be our prometheus server
[15:27:52] if using cloudmetrics, I don't think it can reach private instance addresses?
[15:28:01] Also, you can do things with labels on the query end https://github.com/openshift/prom-label-proxy, which can be used for individual alertmanagers
[15:28:17] I have no idea how well it works, but this is how people make Thanos multitenant, apparently
[15:30:11] obviously, also openshift
[15:30:30] I think we'd want the prometheus server inside CloudVPS, like shinken
[15:30:58] that makes sense, or at least a proxy
[15:30:59] But yeah, for prometheus itself, to use labeling as the mechanism, we'd need a big beast of a prometheus server (or cortex or thanos...which are scalable, long-term-storage prometheus)
[15:31:02] but what about storage?
[15:31:37] That's what cortex and thanos resolve in their own ways. Thanos uses object storage (which would be ceph for us), and cortex uses cassandra plus object stores, depending.
[15:31:58] ok
[15:32:06] Prometheus alone would just need a huge disk or something :-/
[15:32:20] it really depends on what data we want to collect and the retention period
[15:32:24] cloudmetrics right now has 2.3T disk, with 734G in use
[15:34:10] and that's without the VM info we want in there
[15:34:17] we have to be mindful of how many labels are on a metric, but I'm not sure I agree that we'd need a beast of a prometheus server or some other product
[15:34:28] Fair enough
[15:35:44] jeh: that brings up scope a bit in general. Maybe the first thing to talk about in half an hour is how much do we want to monitor in this particular stage of things :)
[15:36:21] replacing shinken isn't that hard compared to replacing cloudmetrics, shinken, tools-prometheus and monitoring for other tenants
[15:36:51] All that other stuff may be totally unnecessary until we want them
[15:37:04] And it still might not need much!
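A hedged sketch of how the openstack_sd_config discovery and label-based Alertmanager routing discussed above could fit together; the region, credentials, ports and receiver names are assumptions for illustration, not actual WMCS settings:

```yaml
# prometheus.yml (fragment, hypothetical): discover Cloud VPS instances via the
# OpenStack API and carry the project id through as a "project" label.
scrape_configs:
  - job_name: cloudvps_node
    openstack_sd_configs:
      - role: instance
        region: eqiad1-r                                      # assumed region name
        identity_endpoint: https://keystone.example.org:5000/v3  # assumed endpoint
        username: prometheus
        password: secret
        domain_name: default
        all_tenants: true            # discover instances across all projects
        port: 9100                   # assumed node-exporter port on each VM
    relabel_configs:
      - source_labels: [__meta_openstack_project_id]
        target_label: project
---
# alertmanager.yml (fragment, hypothetical): route alerts per project label.
route:
  receiver: wmcs-default
  group_by: [alertname, project]
  routes:
    - match:
        project: tools
      receiver: toolforge-admins
receivers:
  - name: wmcs-default         # notification configs (email/irc/webhook) omitted
  - name: toolforge-admins
```

With every series carrying a project label like this, the same label can also be enforced on the query side (the prom-label-proxy approach linked above) to scope what each tenant can see.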
[15:37:39] bstorm_: defining scope and what multi-tenant features we want sounds good
[15:38:44] getting all the way to monitoring everything inside Cloud VPS would be nice, but the more immediate need is replacing shinken, which I think is only tracking tools, deployment-prep, and maybe a couple of other projects
[15:41:11] * bstorm_ starts writing up an agenda
[15:44:47] We'll probably want an etherpad for design and notes, but I put a general agenda in gdoc
[15:46:31] Edits welcome, obviously :)
[17:32:20] !log tools updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 T246123
[17:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:32:24] T246123: Switch PodSecurityPolicy API versioning in maintain-kubeusers from extensions/v1beta1 to policy.k8s.io/v1beta1 - https://phabricator.wikimedia.org/T246123
[18:19:55] !log tools updating the maintain-kubeusers:latest image T246123
[18:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:19:59] T246123: Switch PodSecurityPolicy API versioning in maintain-kubeusers from extensions/v1beta1 to policy.k8s.io/v1beta1 - https://phabricator.wikimedia.org/T246123
[18:26:12] !log tools Deployed new code and RBAC for maintain-kubeusers T246123
[18:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:26:16] T246123: Switch PodSecurityPolicy API versioning in maintain-kubeusers from extensions/v1beta1 to policy.k8s.io/v1beta1 - https://phabricator.wikimedia.org/T246123
[20:13:14] !log tools.fourohfour kubectl delete ingress default-route-www.toolforge.org
[20:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fourohfour/SAL
[20:17:12] !log tools.fourohfour webservice stop; webservice start
[20:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fourohfour/SAL
[20:24:25] !log tools.lexeme-forms deployed 44b5df2897 (edit mode: show lemma, show conflicts, add missing statements)
[20:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[20:28:40] !log tools.www kubectl create --validate=true -f ingress.yaml
[20:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.www/SAL
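For context on the ingress !log entries above, an Ingress for a Toolforge webservice on a cluster of that era would look roughly like the sketch below; the namespace, service name and port are illustrative guesses and not the actual contents of the www tool's ingress.yaml:

```yaml
# Hypothetical Toolforge ingress routing a toolforge.org host to a tool's
# webservice Service; apiVersion matches the pre-v1 Ingress API in use at
# the time of this log.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: default-route-www.toolforge.org   # ingress name seen in the !log delete above
  namespace: tool-www                     # assumed per-tool namespace
spec:
  rules:
    - host: www.toolforge.org
      http:
        paths:
          - path: /
            backend:
              serviceName: www            # assumed Service created by webservice
              servicePort: 8000           # assumed webservice listening port
```

This would be applied exactly as in the log, e.g. `kubectl create --validate=true -f ingress.yaml`, after deleting the old ingress object.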