[11:07:59] <Nemo_bis>	 How hard is it to get max_user_connections raised? https://phabricator.wikimedia.org/T232210
[11:09:07] <Lucas_WMDE>	 is it possible to find out what max_user_connections is?
[11:09:44] <Lucas_WMDE>	 because all of my Flask tools use 4 workers by default, as far as I’m aware, so I think they couldn’t use more than 4 parallel connections either
[11:09:49] <Lucas_WMDE>	 but I don’t know how OABot is run
[11:09:51] <Krenair>	 modules/role/templates/mariadb/mysqld_config/tools.my.cnf.erb:max_user_connections = 20
[11:09:52] <Krenair>	 possibly?
[11:09:57] <Lucas_WMDE>	 sounds like it
[11:10:14] <Lucas_WMDE>	 Nemo_bis: are you serving more than 20 requests in parallel?
[11:10:28] <Lucas_WMDE>	 because otherwise my guess would be that some connections might not be closed properly
[11:10:28] <Krenair>	 modules/role/templates/mariadb/mysqld_config/tools.my.cnf.erb:# max_user_connections set for T216170
[11:10:29] <stashbot>	 T216170: toolsdb - Per-user connection limits - https://phabricator.wikimedia.org/T216170
[11:10:35] <Krenair>	 ticket is about toolsdb
[11:12:06] <Lucas_WMDE>	 https://phabricator.wikimedia.org/T216170#4955787 sounds like what I was thinking of too
[11:15:05] <Lucas_WMDE>	 yeah I don’t see any `close` or `with` in https://github.com/dissemin/oabot/blob/7fea5f33f7be0687e32b2e023443af929ac896c9/src/oabot/userstats.py
[11:18:30] <Nemo_bis>	 I don't think we serve as many requests
[11:18:40] <Nemo_bis>	 I *thought* I had icnrease from 4 to 8 but I don't remember how :) 
[11:18:44] <Nemo_bis>	 Right, let me  check that
[11:18:56] <Lucas_WMDE>	 I’m leaving a comment on Phabricator right now
[15:11:32] <bd808>	 !log tools `sudo kill -9 10635` on tools-k8s-master-01 (T194859)
[15:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:11:37] <stashbot>	 T194859: Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable - https://phabricator.wikimedia.org/T194859
[15:29:34] <halfak>	 !log ores manually disabling uwsgi statsd on ores-web-01
[15:29:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:39:47] <halfak>	 !log ores lowering web workers per node to 36 https://wikitech.wikimedia.org/w/index.php?title=Hiera:Ores&diff=1836902&oldid=1826595
[15:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:40:05] <halfak>	 !log ores manually running puppet on ores-web-01
[15:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:41:36] <halfak>	 Aha!  It looks like we were running out of swap. 
[15:41:59] <halfak>	 Strange because we'd see 1.5GB of memory left, but now that there's plenty of swap, I think it is working. 
[15:42:04] <halfak>	 I don't see statsd complaining either. 
[15:43:15] <halfak>	 !log ores manually running puppet on ores-web-02 and 03
[15:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[15:51:29] <halfak>	 I spoke too soon.  More oom despite plenty of RAM free.  
[16:46:07] <halfak>	 !log ores ran apt-get update/upgrade on ores-web-01
[16:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[16:54:53] <bd808>	 halfak: just a point of data for you, as of about a month ago new virtual instances in Cloud VPS projects have no swap partition. We made that change because in most modern systems paging ram to a swap partition makes things worse rather than better.
[16:55:32] <bd808>	 so I would suggest that if you see any of your nodes using swap that's a sign that the node is overloaded
[17:18:39] <halfak>	 !log ores staging ores-wmflabs-deploy:f76823b
[17:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[17:18:57] <halfak>	 thanks bd808 
[17:19:24] <halfak>	 It's possible this problem started a month ago and has only now gotten worse with increased usage. 
[17:19:33] <halfak>	 But I have another hypothesis I'm working from :) 
[18:02:23] <Wurgl>	 Hello! Is there a problem with amsterdam wikimedia? The address 91.198.174.192 does not really work
[18:02:34] <Wurgl>	 64 bytes from 91.198.174.192: icmp_seq=44 ttl=55 time=846 ms
[18:02:39] <Wurgl>	 64 bytes from 91.198.174.192: icmp_seq=53 ttl=55 time=835 ms
[18:02:45] <Wurgl>	 60 packets transmitted, 2 received, 96% packet loss, time 60350ms
[18:02:52] <Krenair>	 Wurgl, yes, they're depooling it now
[18:03:03] <Wurgl>	 okay
[18:03:21] <Krenair>	 cloud doesn't run in anything outside eqiad FWIW
[18:04:19] <Wurgl>	 I know, but where to ask, when I cannot access wikipedia-pages to look for the perfect channel?
[18:04:26] <Krenair>	 good point
[18:04:33] <Wurgl>	 +1
[18:08:40] <Krenair>	 Wurgl, so you personally know in future, you can get more info about this stuff in #wikimedia-operations :)
[18:08:55] <Wurgl>	 thanks
[18:37:06] <Zppix>	 I believe they also have periodic backups
[18:37:51] <bd808>	 Zppix: ehh... nothing anyone should rely on unfortunately
[18:38:03] <Zppix>	 True
[18:39:14] <bd808>	 we do have replication on the NFS cluster to a different host in the same data center. And a periodic snapshot system for NFS as well, but instance storage in both Cloud VPS and Toolforge is single copy on local disks
[18:40:09] <mmecor>	 ok