[11:07:59] How hard is it to get max_user_connections raised? https://phabricator.wikimedia.org/T232210
[11:09:07] is it possible to find out what max_user_connections is?
[11:09:44] because all of my Flask tools use 4 workers by default, as far as I'm aware, so I think they couldn't use more than 4 parallel connections either
[11:09:49] but I don't know how OABot is run
[11:09:51] modules/role/templates/mariadb/mysqld_config/tools.my.cnf.erb:max_user_connections = 20
[11:09:52] possibly?
[11:09:57] sounds like it
[11:10:14] Nemo_bis: are you serving more than 20 requests in parallel?
[11:10:28] because otherwise my guess would be that some connections might not be closed properly
[11:10:28] modules/role/templates/mariadb/mysqld_config/tools.my.cnf.erb:# max_user_connections set for T216170
[11:10:29] T216170: toolsdb - Per-user connection limits - https://phabricator.wikimedia.org/T216170
[11:10:35] ticket is about toolsdb
[11:12:06] https://phabricator.wikimedia.org/T216170#4955787 sounds like what I was thinking of too
[11:15:05] yeah I don't see any `close` or `with` in https://github.com/dissemin/oabot/blob/7fea5f33f7be0687e32b2e023443af929ac896c9/src/oabot/userstats.py
[11:18:30] I don't think we serve that many requests
[11:18:40] I *thought* I had increased it from 4 to 8, but I don't remember how :)
[11:18:44] Right, let me check that
[11:18:56] I'm leaving a comment on Phabricator right now
[15:11:32] !log tools `sudo kill -9 10635` on tools-k8s-master-01 (T194859)
[15:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:11:37] T194859: Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable - https://phabricator.wikimedia.org/T194859
[15:29:34] !log ores manually disabling uwsgi statsd on ores-web-01
[15:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:39:47] !log ores lowering web workers per node to 36 https://wikitech.wikimedia.org/w/index.php?title=Hiera:Ores&diff=1836902&oldid=1826595
[15:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:40:05] !log ores manually running puppet on ores-web-01
[15:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:41:36] Aha! It looks like we were running out of swap.
[15:41:59] Strange, because we'd see 1.5 GB of memory left, but now that there's plenty of swap, I think it is working.
[15:42:04] I don't see statsd complaining either.
[15:43:15] !log ores manually running puppet on ores-web-02 and 03
[15:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:51:29] I spoke too soon. More OOMs despite plenty of RAM free.
[16:46:07] !log ores ran apt-get update/upgrade on ores-web-01
[16:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[16:54:53] halfak: just a data point for you: as of about a month ago, new virtual instances in Cloud VPS projects have no swap partition. We made that change because on most modern systems paging RAM out to a swap partition makes things worse rather than better.
[16:55:32] so I would suggest that if you see any of your nodes using swap, that's a sign the node is overloaded
[17:18:39] !log ores staging ores-wmflabs-deploy:f76823b
[17:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[17:18:57] thanks bd808
[17:19:24] It's possible this problem started a month ago and has only now gotten worse with increased usage.
[17:19:33] But I have another hypothesis I'm working from :)
[18:02:23] Hello! Is there a problem with Wikimedia's Amsterdam datacenter? The address 91.198.174.192 isn't really working
[18:02:34] 64 bytes from 91.198.174.192: icmp_seq=44 ttl=55 time=846 ms
[18:02:39] 64 bytes from 91.198.174.192: icmp_seq=53 ttl=55 time=835 ms
[18:02:45] 60 packets transmitted, 2 received, 96% packet loss, time 60350ms
[18:02:52] Wurgl, yes, they're depooling it now
[18:03:03] okay
[18:03:21] cloud doesn't run anything outside eqiad FWIW
[18:04:19] I know, but where should I ask when I can't access the Wikipedia pages to find the right channel?
[18:04:26] good point
[18:04:33] +1
[18:08:40] Wurgl, so you know for future reference, you can get more info about this stuff in #wikimedia-operations :)
[18:08:55] thanks
[18:37:06] I believe they also have periodic backups
[18:37:51] Zppix: ehh... nothing anyone should rely on, unfortunately
[18:38:03] True
[18:39:14] we do have replication on the NFS cluster to a different host in the same data center, and a periodic snapshot system for NFS as well, but instance storage in both Cloud VPS and Toolforge is a single copy on local disks
[18:40:09] ok
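For reference, the connection-leak theory discussed around 11:15 (no `close` or `with` in oabot's userstats.py) is the usual way a tool with only a handful of workers ends up hitting the 20-connection max_user_connections limit on ToolsDB. A minimal sketch of the pattern that avoids it, assuming pymysql and the standard Toolforge `replica.my.cnf` credentials file; the hostname, database name, table, and function names below are placeholders, not taken from oabot's actual code:

```python
# Sketch only: open a ToolsDB connection per request and close it
# deterministically with context managers, so idle connections never
# accumulate against max_user_connections.
import contextlib
import pymysql


def get_connection():
    # Toolforge tools keep their ToolsDB credentials in ~/replica.my.cnf.
    # Host and database names here are placeholders.
    return pymysql.connect(
        host="tools.db.svc.wikimedia.cloud",
        read_default_file="~/replica.my.cnf",
        database="s12345__oabot",
        charset="utf8mb4",
    )


def count_edits(username):
    # closing() guarantees the connection is released even if the query
    # raises, which is what a bare connect()-then-query pattern misses.
    with contextlib.closing(get_connection()) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT COUNT(*) FROM edits WHERE user = %s", (username,)
            )
            (count,) = cur.fetchone()
    return count
```

With each request opening and closing its own connection like this, 4 to 8 uwsgi workers stay well under the 20-connection limit; a connection pool would also work, but explicit closing is the smaller change.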