[14:12:27] andrewbogott: how constrained was the previous machine for outreachdashboard? I'm noticing that the latency for its update queue is much lower today than it has been for many weeks. Not sure if it's because it's got higher throughput after the move, or some other reason.
[15:09:42] ragesoss: Your VM was on a physical server that was severely oversubscribed. We are not exactly sure at the moment why, but the OpenStack scheduler put >90 VMs on the same host -- T192422
[15:09:42] T192422: labvirt1015 is hosting more than 90 VMs - https://phabricator.wikimedia.org/T192422
[15:15:18] cool. well, that seems to explain some of why parallelizing bottlenecks would often be much more effective on the Wiki Education server than the labs one.
[16:11:35] bd808: it actually wasn't on the one with 90 VMs, it was on labvirt1006. ragesoss's VM is SO busy that it overwhelmed any attempt to do CPU balancing :)
[16:11:44] If you look at https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?panelId=91&fullscreen&orgId=1 (which takes a while to load)
[16:12:17] you can see 1006 take a dive and 1016 shoot upwards when we move that one VM
[16:12:45] like bd808 said in Berlin, you give people access to servers and they'll find a way to use up all the resources.
[16:13:41] Yep, not complaining -- just, it's enough of an outlier that it's hard to handle automatically. If you look at 1006 w/out that VM it looks like a perfectly good candidate to receive the next VM. And there's no good way to tell it 'but take heed the next VM is a CPU-gobbling monster'
[16:14:55] * ebernhardson is also looking to add some CPU gobbling monsters to cloud as part of a GSOC project too :(
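
(Editor's note, for context on T192422: the conversation is about the scheduler piling >90 VMs onto one hypervisor. A minimal sketch of how an admin could enumerate per-hypervisor VM counts with openstacksdk follows. The "admin" clouds.yaml entry name is hypothetical, and it assumes an SDK/API version that still reports running_vms and vcpus_used on hypervisor records; this is an illustration, not the tooling the Cloud Services team actually used.)

import openstack

# Hypothetical clouds.yaml entry with admin credentials.
conn = openstack.connect(cloud="admin")

# Sort hypervisors by VM count so an oversubscribed host like
# labvirt1015 (T192422) stands out at the top of the list.
hypervisors = sorted(
    conn.compute.hypervisors(details=True),
    key=lambda hv: hv.running_vms or 0,
    reverse=True,
)

for hv in hypervisors:
    print(hv.name, "vms:", hv.running_vms, "vcpus:", hv.vcpus_used, "/", hv.vcpus)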
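
(Editor's note on "severely oversubscribed": from inside a guest, the quickest signal of a crowded hypervisor is CPU steal time, i.e. time the host spent running other tenants' VMs instead of yours. A self-contained Linux-only sketch that samples /proc/stat is below; the 5-second interval is an arbitrary choice, and this is one way to observe the symptom described in the log, not a diagnostic the participants mention using.)

import time

def cpu_steal_fraction(interval=5.0):
    """Sample the aggregate 'cpu' line in /proc/stat twice and return the
    approximate fraction of CPU time stolen by the hypervisor over the
    interval. A persistently high value suggests an oversubscribed host."""
    def snapshot():
        with open("/proc/stat") as stat:
            values = [int(v) for v in stat.readline().split()[1:]]
        # Fields: user nice system idle iowait irq softirq steal guest guest_nice
        steal = values[7] if len(values) > 7 else 0
        return steal, sum(values)

    steal_1, total_1 = snapshot()
    time.sleep(interval)
    steal_2, total_2 = snapshot()
    return (steal_2 - steal_1) / max(total_2 - total_1, 1)

if __name__ == "__main__":
    print(f"CPU steal over the last sample: {cpu_steal_fraction() * 100:.1f}%")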