[00:33:26] Are there problems with the wiki replicas right now?
[00:34:00] I can't connect from PAWS, pymysql waits a while then raises InternalError: (1105, '(proxy) all backends are down')
[00:34:36] Connecting from k8s works fine every time
[01:51:09] AntiComposite: I'm able to access it from bastion using sql enwiki
[01:51:14] enwiki_p*
[04:29:17] should update status on T245804
[04:29:18] T245804: Reassign base URLs for toolinfo records' web service links - https://phabricator.wikimedia.org/T245804
[07:38:42] morning
[07:39:13] anyone playing with the new toolforge.org domain?
[11:50:42] godog: you around?
[12:20:14] !log tools.zppixbot switched wiki to zppixbot.toolforge.org - T250080
[12:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:20:17] T250080: Switch ZppixBotWiki to use new domain - https://phabricator.wikimedia.org/T250080
[12:24:31] arturo: sure, what's up?
[12:26:30] godog: we have a meeting today in the WMCS to discuss the future of our monitoring/alerting stack in the cloud
[12:26:54] we have been using shinken, but it is mostly an abandoned project upstream and we will stop using it
[12:27:02] !log tools.zppixbot updated website to deploy T250083
[12:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:27:04] T250083: Update Documentation to use ZppixBot.toolforge.org - https://phabricator.wikimedia.org/T250083
[12:27:18] we are looking for replacements for a cloud-ready solution (you know, multi-tenant, blabla)
[12:27:27] and I wonder if you might have any recommendation godog
[12:28:26] arturo: it largely depends on the exact use cases / requirements but right off the bat I'd recommend exploring alertmanager and prometheus alerting
[12:28:40] like we'll be doing in production, per the alerting roadmap
[12:29:55] godog: I don't even know if we can monitor arbitrary endpoints with prometheus
[12:30:11] !log tools.zppixbot set meeting_log_baseurl = https://zppixbot.toolforge.org - T250078
[12:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:30:13] T250078: Switch meetbot to use new domain - https://phabricator.wikimedia.org/T250078
[12:30:17] I know about the xxxx_up metric
[12:31:01] !log tools.zppixbot-test set meeting_log_baseurl = https://zppixbot-test.toolforge.org - T250078
[12:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[12:31:40] arturo: for basic http checks you can use the blackbox exporter, and custom things can be implemented e.g. via a daemon that runs the checks and exposes the results
[12:32:12] arturo: happy to assist/discuss more if you'd like, e.g. the full list of use cases
[12:32:57] what I envision as our idea use case is something like Monitoring-as-a-Service
[12:33:04] ideal*
[12:33:09] !log tools.zppixbot reboot bot & webservice to switchover - T250076
[12:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[12:33:13] T250076: Switch the redirect for ZppixBot’s tests web service - https://phabricator.wikimedia.org/T250076
[12:33:15] !log tools.zppixbot-test reboot bot & webservice to switchover - T250076
[12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[12:33:24] * RhinosF1 is done !logging
[12:33:40] CloudVPS users click a button somewhere to enable monitoring for their projects. We then reuse that for toolforge, too, of course godog
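For reference, a minimal sketch of the blackbox-exporter approach described above, seen from the Prometheus side; the exporter address, probe module and target list here are illustrative assumptions, not an existing WMCS configuration:

```yaml
# Hypothetical Prometheus scrape job probing tool web endpoints through a
# blackbox exporter. The exporter host/port and the targets are assumptions.
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]            # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://zppixbot.toolforge.org
          - https://www.toolforge.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the target URL to the exporter
      - source_labels: [__param_target]
        target_label: instance           # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter.example.wmflabs:9115  # hypothetical exporter host
```

The exporter exposes a probe_success metric per target, which plays much the same role as the xxxx_up metrics mentioned above and can be alerted on (e.g. probe_success == 0).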
[12:34:04] RhinosF1: are you using the `--canonical` switch?
[12:35:46] arturo: once I finish restarting both tools
[12:39:25] arturo: zppixbot is on it now
[12:39:28] arturo: ah! ok I think I got it more or less
[12:42:45] arturo: zppixbot-test is now
[12:43:20] RhinosF1: ok!
[12:51:10] looks like everything is working
[12:57:51] godog: how would we handle multi-tenancy with alermanager?
[12:57:54] alartmanager*
[12:58:02] alertmanager**
[13:01:25] I think it is more of how you store prometheus metrics
[13:04:02] arturo: part of the answer has to do with the metrics themselves yeah
[13:05:23] it is a bit of a long-winded discussion with a bunch of options depending on what's needed though, happy to follow up with a bit more structure/time
[13:05:39] fair
[13:38:17] arturo: the openstack sd config is a simple way to inventory and add "multi-tenancy" tags for the cloudvps instances
[13:38:32] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#openstack_sd_config
[13:41:02] you can then reference the project tags with alertmanager for routing notifications
[15:26:27] jeh: yes, interesting. We do something similar with the toolforge kubernetes
[15:27:31] I wonder what would be our prometheus server
[15:27:52] if using cloudmetrics, I don't think it can reach private instance addresses?
[15:28:01] Also, you can do things with labels on the query end https://github.com/openshift/prom-label-proxy, which can be used for individual alertmanagers
[15:28:17] I have no idea how well it works, but this is how people make Thanos multitenant, apparently
[15:30:11] obviously, also openshift
[15:30:30] I think we'd want the prometheus server inside CloudVPS, like shinken
[15:30:58] that makes sense, or at least a proxy
[15:30:59] But yeah, for prometheus itself, to use labeling as the mechanism, we'd need a big beast of a prometheus server (or cortex or thanos...which are scalable, long-term-storage prometheus)
[15:31:02] but what about storage?
[15:31:37] That's what cortex and thanos resolve in their own ways. Thanos uses object storage (which would be ceph for us), and cortex uses cassandra plus object stores, depending.
[15:31:58] ok
[15:32:06] Prometheus alone would just need a huge disk or something :-/
[15:32:20] it really depends on what data we want to collect and the retention period
[15:32:24] cloudmetrics right now has 2.3T disk, with 734G in use
[15:34:10] and that's without the VM info we want in there
[15:34:17] we have to be mindful of how many labels are on a metric, but I'm not sure I agree that we'd need a beast of a prometheus server or some other product
[15:34:28] Fair enough
[15:35:44] jeh: that brings up scope a bit in general. Maybe the first thing to talk about in half an hour is how much do we want to monitor in this particular stage of things :)
[15:36:21] replacing shinken isn't that hard compared to replacing cloudmetrics, shinken, tools-prometheus and monitoring for other tenants
[15:36:51] All that other stuff may be totally unnecessary until we want them
[15:37:04] And it still might not need much!
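A hedged sketch of how the openstack_sd_config discovery and label-based Alertmanager routing discussed above could fit together; the region, credentials, ports and receiver names are assumptions for illustration, not actual WMCS settings:

```yaml
# prometheus.yml (fragment, hypothetical): discover Cloud VPS instances via the
# OpenStack API and carry the project id through as a "project" label.
scrape_configs:
  - job_name: cloudvps_node
    openstack_sd_configs:
      - role: instance
        region: eqiad1-r                                      # assumed region name
        identity_endpoint: https://keystone.example.org:5000/v3  # assumed endpoint
        username: prometheus
        password: secret
        domain_name: default
        all_tenants: true            # discover instances across all projects
        port: 9100                   # assumed node-exporter port on each VM
    relabel_configs:
      - source_labels: [__meta_openstack_project_id]
        target_label: project
---
# alertmanager.yml (fragment, hypothetical): route alerts per project label.
route:
  receiver: wmcs-default
  group_by: [alertname, project]
  routes:
    - match:
        project: tools
      receiver: toolforge-admins
receivers:
  - name: wmcs-default         # notification configs (email/irc/webhook) omitted
  - name: toolforge-admins
```

With every series carrying a project label like this, the same label can also be enforced on the query side (the prom-label-proxy approach linked above) to scope what each tenant can see.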
[15:37:39] bstorm_: defining scope and what multi-tenant features we want sounds good
[15:38:44] getting all the way to monitoring everything inside Cloud VPS would be nice, but the more immediate need is replacing shinken, which I think is only tracking tools, deployment-prep, and maybe a couple of other projects
[15:41:11] * bstorm_ starts writing up an agenda
[15:44:47] We'll probably want an etherpad for design and notes, but I put a general agenda in gdoc
[15:46:31] Edits welcome, obviously :)
[17:32:20] !log tools updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 T246123
[17:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:32:24] T246123: Switch PodSecurityPolicy API versioning in maintain-kubeusers from extensions/v1beta1 to policy.k8s.io/v1beta1 - https://phabricator.wikimedia.org/T246123
[18:19:55] !log tools updating the maintain-kubeusers:latest image T246123
[18:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:19:59] T246123: Switch PodSecurityPolicy API versioning in maintain-kubeusers from extensions/v1beta1 to policy.k8s.io/v1beta1 - https://phabricator.wikimedia.org/T246123
[18:26:12] !log tools Deployed new code and RBAC for maintain-kubeusers T246123
[18:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:26:16] T246123: Switch PodSecurityPolicy API versioning in maintain-kubeusers from extensions/v1beta1 to policy.k8s.io/v1beta1 - https://phabricator.wikimedia.org/T246123
[20:13:14] !log tools.fourohfour kubectl delete ingress default-route-www.toolforge.org
[20:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fourohfour/SAL
[20:17:12] !log tools.fourohfour webservice stop; webservice start
[20:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fourohfour/SAL
[20:24:25] !log tools.lexeme-forms deployed 44b5df2897 (edit mode: show lemma, show conflicts, add missing statements)
[20:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[20:28:40] !log tools.www kubectl create --validate=true -f ingress.yaml
[20:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.www/SAL
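For context on the ingress !log entries above, an Ingress for a Toolforge webservice on a cluster of that era would look roughly like the sketch below; the namespace, service name and port are illustrative guesses and not the actual contents of the www tool's ingress.yaml:

```yaml
# Hypothetical Toolforge ingress routing a toolforge.org host to a tool's
# webservice Service; apiVersion matches the pre-v1 Ingress API in use at
# the time of this log.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: default-route-www.toolforge.org   # ingress name seen in the !log delete above
  namespace: tool-www                     # assumed per-tool namespace
spec:
  rules:
    - host: www.toolforge.org
      http:
        paths:
          - path: /
            backend:
              serviceName: www            # assumed Service created by webservice
              servicePort: 8000           # assumed webservice listening port
```

This would be applied exactly as in the log, e.g. `kubectl create --validate=true -f ingress.yaml`, after deleting the old ingress object.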