[00:03:28] !log tools Added tools-k8s-worker-25 to 2020 Kubernetes cluster (T244791)
[00:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:03:31] T244791: Scale up 2020 Kubernetes cluster for final migration of legacy cluster workloads - https://phabricator.wikimedia.org/T244791
[00:07:17] !log tools Added tools-k8s-worker-26 to 2020 Kubernetes cluster (T244791)
[00:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:13:30] !log tools Added tools-k8s-worker-27 to 2020 Kubernetes cluster (T244791)
[00:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:13:33] T244791: Scale up 2020 Kubernetes cluster for final migration of legacy cluster workloads - https://phabricator.wikimedia.org/T244791
[00:15:16] !log tools Added tools-k8s-worker-28 to 2020 Kubernetes cluster (T244791)
[00:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:17:26] !log tools Added tools-k8s-worker-29 to 2020 Kubernetes cluster (T244791)
[00:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:25:36] !log tools Added tools-k8s-worker-30 to 2020 Kubernetes cluster (T244791)
[00:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:25:39] T244791: Scale up 2020 Kubernetes cluster for final migration of legacy cluster workloads - https://phabricator.wikimedia.org/T244791
[00:25:42] !log tools Added tools-k8s-worker-31 to 2020 Kubernetes cluster (T244791)
[00:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:29:38] !log tools Added tools-k8s-worker-32 to 2020 Kubernetes cluster (T244791)
[00:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:32:08] !log tools Added tools-k8s-worker-33 to 2020 Kubernetes cluster (T244791)
[00:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:32:15] T244791: Scale up 2020 Kubernetes cluster for final migration of legacy cluster workloads - https://phabricator.wikimedia.org/T244791
[00:34:00] !log tools Added tools-k8s-worker-34 to 2020 Kubernetes cluster (T244791)
[00:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:38:13] !log tools Added tools-k8s-worker-35 to 2020 Kubernetes cluster (T244791)
[00:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:38:16] T244791: Scale up 2020 Kubernetes cluster for final migration of legacy cluster workloads - https://phabricator.wikimedia.org/T244791
[04:01:00] Do we have any documentation for getting Microsoft SQL Studio working on the labs replicas?
[04:25:12] Why would we?
[04:25:22] It's for managing MS SQL Servers..
[04:26:37] Well, primarily anyway
[04:29:48] I can't imagine that setting it up would be that different than the MySQL Workbench instructions
[04:31:09] Setting up the ssh tunnel for the odbc connection might need to be a separate thing
[04:31:44] Between the non-free-ness and the MS-SQL-Server-ness, I don't think it's really come up.
[04:32:04] I can see postgresql > mysql
[04:32:11] don't see much point in mssql :)
[04:33:02] for the apps that only work on it...
[04:33:07] But that ship sailed years ago
[04:33:48] You could certainly say that for nearly any other SQL implementation... None of which helps anyone or anything
[04:35:11] mmm... yeah, but as you said, why encourage use of non-free software?
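The MySQL Workbench instructions referenced above boil down to an SSH port forward to the replica host; an ODBC-based client such as SSMS would simply point at the same forwarded port. A minimal sketch, assuming a Toolforge shell account; the bastion and replica hostnames below are placeholders, so use whatever the current wikitech database documentation lists:

    # Forward a local port through the Toolforge bastion to a wiki replica.
    # <shell-user>, the bastion name and the replica name are assumptions.
    ssh -N -L 3306:enwiki.analytics.db.svc.eqiad.wmflabs:3306 <shell-user>@login.toolforge.org
    # Then point the local client (Workbench, an ODBC DSN, etc.) at 127.0.0.1:3306,
    # authenticating with the credentials from the tool's replica.my.cnf.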
[04:35:54] I can see in general why someone would use mssql but in free+open source world, makes less sense
[04:36:07] SSMS is free, for some definitions of free ;)
[04:36:20] hmm, ok
[04:37:11] I've used postgresql for years and honestly, they keep up with improvements and do such a good job that I see no need to use anything else (although for something I know only mysql is supported)
[04:37:15] some things*
[04:38:05] free and best for my needs so no downside
[04:39:51] :)
[10:32:22] !log admin running `root@cloudcontrol1004:~# designate server-create --name ns0.openstack.eqiad1.wikimediacloud.org.` (T243766)
[10:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:32:25] T243766: Cloud DNS: proposal for new DNS service names - https://phabricator.wikimedia.org/T243766
[10:32:30] !log admin running `root@cloudcontrol1004:~# designate server-create --name ns1.openstack.eqiad1.wikimediacloud.org.` (T243766)
[10:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:35:10] !log admin running `root@cloudcontrol2001-dev:~# designate server-create --name ns1.openstack.codfw1dev.wikimediacloud.org.` (T243766)
[10:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:14:09] !log tools.stashbot Restarting bot to pick up config changes for T245242
[15:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[15:14:12] T245242: Allow !log in #wikimedia-sre - https://phabricator.wikimedia.org/T245242
[16:38:43] bd808: Dropped that channel registration
[17:05:48] DSquirrelGM: thanks
[17:08:15] my main thinking behind it was since we're migrating to that new domain, might be a good idea to at least have it registered
[17:11:04] DSquirrelGM: yea, good point. a ticket with CRoslof on it would get that done
[17:11:12] he handles the registrations
[17:11:40] oh.. you don't even mean the domain.. nevermind :)
[17:12:12] no - another channel
[17:12:48] gotcha now
[17:35:34] Krenair: are you about? I have a deployment-prep puzzle
[17:47:32] Seems like Beta Commons currently has CSRF token errors on all requests, not sure if that's something that got merged recently or if it's a config problem somewhere
[18:03:36] All the tokens in the HTML are blank
[19:41:53] andrewbogott, hi
[19:42:04] what sort of puzzle?
[19:42:44] Krenair: I fixed a puppet failure on deployment that I definitely caused, but that was replaced with a new issue which... maybe was there before? I'm not sure.
[19:43:14] the latest error is at T243226
[19:43:14] T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226
[19:43:22] the new puppet infra in deployment-prep is a bit dubious
[19:43:30] If it's expected/in progress then it's all good, I just want to make sure I didn't break it more this morning
[19:44:26] ok
[19:44:30] first exception out of the puppetdb logs
[19:44:32] Caused by: org.postgresql.util.PSQLException: Connection to deployment-puppetdb03.deployment-prep.eqiad.wmflabs:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
[19:45:02] oh, that just means that puppetdb is down entirely?
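That connection-refused error points at PostgreSQL on the puppetdb host rather than at puppetdb itself, which is what the next few minutes of the log confirm. A diagnostic sketch, assuming the Debian PostgreSQL 11 cluster layout found on deployment-puppetdb03:

    # Is anything listening on the database port at all?
    ss -tlnp | grep 5432
    # On Debian the plain postgresql.service is a no-op wrapper (ExecStart=/bin/true);
    # the real service is the per-cluster template instance.
    systemctl status postgresql@11-main.service
    journalctl -u postgresql@11-main.service --since '1 hour ago'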
[19:45:04] where is postgres
[19:45:22] yes, puppetdb relies on postgres and presumably can't serve any requests without it
[19:47:15] hmmmmmmmmmmmmmmmmm
[19:47:20] root@deployment-puppetdb03:/var/log/puppetdb# grep Start /lib/systemd/system/postgresql.service
[19:47:20] ExecStart=/bin/true
[19:51:29] ok
[19:51:47] magic service name, found by looking at /etc/postgres tree and /lib/systemd/system/postgresql@.service
[19:52:01] root@deployment-puppetdb03:/var/log/puppetdb# service postgresql@11/main status
[19:52:01] Invalid unit name "postgresql@11/main.service" was escaped as "postgresql@11-main.service" (maybe you should use systemd-escape?)
[19:52:01] ● postgresql@11-main.service - PostgreSQL Cluster 11-main
[19:52:01] Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
[19:52:02] Active: failed (Result: protocol) since Fri 2020-02-14 19:46:39 UTC; 4min 44s ago
[19:52:23] FATAL: could not map anonymous shared memory: Cannot allocate memory
[19:52:29] knew this server had a RAM problem
[19:53:58] of course I blame the JVM
[19:55:57] changing role::puppetmaster::puppetdb::shared_buffers from 768MB to 600MB
[20:09:55] !log monitoring added profile::prometheus::statsd_exporter::mappings: [] setting to the thanks-be prefix to fix puppet runs
[20:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Monitoring/SAL
[20:15:16] halfak: I'd like to get puppet running on ores-web0[4,5,6]. The first complaint is profile::ores::logstash_host being undefined...
[20:15:18] any preference?
[20:15:48] * andrewbogott will set it to localhost otherwise
[20:16:53] andrewbogott: 'localhost' seems correct anyways.
[20:16:54] !log ores adding profile::ores::logstash_host: localhost to project-wide puppet config
[20:16:54] hieradata/role/common/ores.yaml:profile::ores::logstash_host: localhost
[20:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[20:17:18] andrewbogott: ^ we should probably just copy that to hieradata/labs
[20:17:26] ./project/foo
[20:17:29] +1
[20:17:48] or that :)
[20:18:04] !log ores adding profile::rsyslog::kafka_shipper::kafka_brokers: [] to project-wide puppet setting
[20:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[20:18:29] patches welcome :) I'm just trying to get things updated for now
[20:19:15] I'll get a patch together quick.
[20:19:20] Thanks for the clear notes, mutante
[20:21:05] halfak: is ores-misc-01.ores-staging yours as well?
[20:21:15] yes
[20:21:18] that one is suffering for lack of profile::poolcounter::exporter_port
[20:21:44] Oh. Should be the same as ores-staging-01.ores-staging
[20:22:17] halfak: the thing is that in production you can do ./role/common/ores.yaml but you can't do role-based in cloud VPS, but you can do ./cloud/eqiad1/ores/common.yaml in the operations/puppet repo and put everything there
[20:22:25] Hmm. I see "profile::ores::logstash_host: localhost" in hieradata/role/common/ores.yaml already.
[20:22:36] instances not using it will just ignore it.. so that is usually fine
[20:22:37] Oh!
[20:22:53] and also if all instances in your project run the same role anyways
[20:23:18] halfak: yea.. but cloud VPS doesn't get the role-based thing.. that's why this fails without duplicating it in another place
[20:23:51] either in the repo or in horizon
[20:24:08] it's up to you which one you prefer
[20:24:28] Hmm. They do all have a shared role for this.
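For the repo-based option mentioned at 20:22:17, a sketch of the project-scoped cloud hieradata file, assuming a checkout of operations/puppet and using the two keys already logged in this conversation:

    # Cloud VPS reads per-project hieradata from this path instead of the
    # role-based files used in production.
    mkdir -p hieradata/cloud/eqiad1/ores
    printf '%s\n' \
      'profile::ores::logstash_host: localhost' \
      'profile::rsyslog::kafka_shipper::kafka_brokers: []' \
      > hieradata/cloud/eqiad1/ores/common.yaml

The Horizon alternative is the project's "Puppet" panel, where the same keys go into the project-wide hiera configuration as YAML.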
[20:24:44] halfak: ores-staging-01 doesn't use profile::poolcounter::exporter_port, so there's nothing set for me to copy to misc-01
[20:25:00] Aha.
[20:25:49] Hmm. We should have something reasonable in one of our nodes in the base ores project.
[20:26:08] * halfak looks into adding this in horizon
[20:26:32] halfak: i would do puppet/hieradata/cloud/eqiad1/ores/common.yaml and put all the Hiera things (for cloud) in that one file. all existing Hiera settings in Horizon i would then also move over there so it's only 1 place. that is if i know these are things that are unlikely to change a lot.
[20:27:07] Yeah. Shouldn't change too often.
[20:27:28] Any problem with having them in both places for a bit?
[20:28:39] It's ok to have them in both places but the horizon page is probably the better place
[20:28:44] but also i have +2 and wouldn't have to wait for a change. using Horizon web UI instead means it's quicker to change but putting it in repo means it's easier to reason about (imho). both have pros and cons. mixing both options also works but can be confusing
[20:28:50] it's tracked in git now, and you can change it yourself w/out asking an SRE to merge
[20:29:19] (arturo just ripped some stuff out of the puppet repo and moved to Horizon, so that's his preference. I haven't really thought about it all that much)
[20:29:40] Oh. Well, I'm certainly down for what y'all recommend. Seems like the recommendations are conflicting.
[20:32:02] * andrewbogott doesn't much care but thinks consistency is a good idea
[20:32:06] I'm going to try the hiera rout for now.
[20:32:10] *horizon
[20:32:15] *route
[20:32:22] definitely friday :|
[20:32:33] there are different ways to do it ..within the Horizon option
[20:32:44] I'm editing the "Project puppet"
[20:33:07] you can do based on instance name (not recommended) or project based (recommended in this case) or prefix based (you don't need it because all are doing the same)
[20:33:39] halfak: yea, that's better than doing it on individual instances for usre
[20:33:42] sure
[20:33:55] Looks like we don't have "profile::poolcounter::exporter_port" anywhere that I can see.
[20:34:14] Oh I have "profile::ores::web::poolcounter_nodes:\n- ores-poolcounter-1:7531"
[20:34:23] Maybe that's been split into hosts and ports now?
[20:34:27] common/profile/poolcounter.yaml:profile::poolcounter::exporter_port: 9106
[20:34:35] this is the production default
[20:35:19] and in this case it's using "common/profile/poolcounter.yaml" but common is not _that_ common
[20:35:39] Krenair: anything I can do to help/support/take over? (I realize it's Friday night and you might have more interesting things to do than fix puppet)
[20:36:13] I'm a little bit lost in what's going on here with poolcounter.
[20:36:41] halfak: profile::poolcounter wants 2 things, a list of nodes and a port. yes it's split
[20:36:45] the list of nodes is this:
[20:36:51] $prometheus_nodes = lookup('prometheus_nodes'),
[20:37:59] so you'd need prometheus node addresses in cloud.. BUT it has already been set in cloud hieradata
[20:38:04] cloud/eqiad1.yaml:prometheus_nodes: []
[20:38:20] so it should not complain about that not existing.. only about it being an empty array
[20:38:49] Sorry I'm confused. Now we're talking about Prometheus config? I'm still working out what's going on with poolcounter. :|
[20:39:15] halfak: they are related. the poolcounter config wants to have a list of the prometheus node FQDNs
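One way to see what these keys actually resolve to for a given instance, before deciding where to set them, is `puppet lookup` on the project puppetmaster. A sketch only; the FQDN is an assumption based on the instance and project names above, and the command needs the node's facts to be available to the master. (The profile::poolcounter parameters under discussion are pasted next in the log.)

    # --explain walks every hieradata layer consulted for the key.
    puppet lookup profile::poolcounter::exporter_port \
        --node ores-misc-01.ores-staging.eqiad.wmflabs --explain
    puppet lookup prometheus_nodes \
        --node ores-misc-01.ores-staging.eqiad.wmflabs --explain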
[20:39:33] class profile::poolcounter(
[20:39:33] $prometheus_nodes = lookup('prometheus_nodes'),
[20:39:33] $exporter_port = lookup('profile::poolcounter::exporter_port'),
[20:39:33] ) {
[20:39:44] And when I run `sudo puppet agent -tv` on ores-staging-01 it looks like it errors on nginx startup :|
[20:39:45] since you asked if it's split into port and host.. yes it is
[20:39:47] Aha.
[20:39:50] Gotcha.
[20:39:54] and those 2 things are the only parameters wanted by poolcounter
[20:40:39] andrewbogott, I think puppetdb came back for a bit, but everything is glacially slow
[20:40:51] btw, if you see "lookup()" that is the modern replacement for the hiera() call
[20:41:02] you can read it as the same
[20:41:10] * andrewbogott chooses a random deployment-prep VM to test
[20:41:49] ah yeah, maybe what I'm seeing on deployment-prep is slowness -- everything I've tested says it's already in progress
[20:42:04] Krenair: was this a version upgrade or something, that got slow?
[20:42:17] yeah
[20:42:26] got a new puppetdb machine and a new puppetmaster, they run buster now
[20:43:05] after long enough, all deployment-prep hosts will be stuck running slow puppet until we break a connection limit somewhere and something starts rejecting connections. or alternatively maybe they all just run puppet indefinitely
[20:43:07] * Krenair shrug
[20:43:47] ok -- I'll ignore for now and maybe things will settle out
[20:44:19] doubt it
[20:44:22] stuff shouldn't be slow
[20:45:16] want me to adjust quotas so you can rebuild a bigger puppetdb host?
[20:47:05] chicocvenancio: in the paws SAL you have 'shutdown unused instances' -- did that include paws-puppetmaster-01? I'm looking at paws-proxy-02 which is sad for lack of that master
[20:49:21] (and it looks like that project has a project-wide puppet setting that makes that the puppetmaster)
[20:53:57] Alright. ores-staging-01 is now happy.
[20:54:04] Looks like ores-web-04 is happy.
[20:55:19] It's weird that misc thinks it needs "profile::poolcounter::exporter_port" I don't know why ores-misc-01 needs it but ores-staging-01 doesn't.
[20:56:43] halfak: look at the VM puppet configs
[20:56:51] ores-misc applies a poolcounter role
[20:56:52] Yeah. Checking that now :)
[20:56:55] ores-staging does not
[20:57:54] Aha. We have role::poolcounter::server in ores-misc-01. Now to figure out how we manage this in the "ores" project and copy-paste :)
[20:58:26] Aha! We don't use poolcounter -- and for good reason. I'm just going to trim that role :)
[20:58:49] that'll make things simpler :)
[20:59:40] * halfak watches puppet agent TV with great suspense.
[21:00:07] \o/
[21:00:10] Looks like we're good.
[21:00:27] sorry for the trouble and thanks for the help andrewbogott & mutante :)
[21:05:52] yw halfak
[21:05:56] glad it works
[21:06:10] :)
[21:31:43] !log paws restarting paws-puppetmaster-01 so its clients can connect
[21:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[21:46:29] andrewbogott, it's not that simple, you need working puppet to be able to properly build a new puppetdb in a way that doesn't disrupt things
[21:46:39] is it possible to take it down and change the flavour?
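On the "change the flavour" question: resizing a Cloud VPS instance to a larger flavor is an OpenStack operation run with admin credentials. A rough sketch only; the flavor name is an assumption, and the confirm step's spelling differs between client versions (newer clients use `openstack server resize confirm`):

    # Resize the instance to a larger flavor, then confirm once it comes back.
    openstack server resize --flavor m1.large deployment-puppetdb03
    openstack server resize --confirm deployment-puppetdb03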
[21:48:07] some of this issue might be that deployment-puppetmaster03 is on cloudvirt1014 which is having raid issues
[21:48:53] oh, as is deployment-puppetdb02
[21:49:48] Krenair: I'm going to move those two if you don't mind, see if that unsticks anything
[21:50:17] oh yeah that might be it
[21:50:18] thanks
[21:50:29] puppetdb02 won't be it, puppetmaster03 maybe though
[21:50:51] (new puppetdb host is 03)
[21:50:54] no hang on
[21:50:58] what am I talking about
[21:51:26] old hosts: puppetdb02 and puppetmaster03, I think?
[21:51:38] yes
[21:51:44] so those shouldn't be the problem
[21:52:00] we're looking for puppetdb03 and puppetmaster04 problems
[21:53:09] !log deployment-prep moving deployment-puppetdb02 and deployment-puppetmaster03 off of cloudvirt1014 (which will be drained next week anyway)
[21:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[21:53:59] Krenair: I'll need to move those two next week anyway, so moving them now. But I'll check out the other two as soon as that's done
[21:54:04] ok
[21:54:42] ssh to deployment-puppetmaster04 is a bit slow but then load is high
[21:55:06] presumably due to abnormally high numbers of clients at once due to the slowness?
[21:56:36] yeah
[21:56:48] I'm killing apache there just to make sure that causes load to drop
[21:58:19] puppetdb03 meanwhile appears fine
[21:58:25] load is definitely dropping
[21:58:28] on the master
[21:58:32] 21:58:29 up 8 days, 10:42, 4 users, load average: 2.29, 5.49, 6.68
[21:59:19] restarting apache there
[21:59:37] so it seems like we really do just have a capacity problem
[22:00:30] are 04 and 03 a cluster working together or unrelated?
[22:02:27] !log tools.openstack-browser Restarting webservice as everything was waiting and then returning 504 etc
[22:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.openstack-browser/SAL
[22:04:36] andrewbogott, unrelated for these purposes
[22:04:42] 04 has a CA named for 03 for legacy reasons
[22:04:43] ok
[22:05:05] they have a copy of the same CA basically but 03 should not be involved in serving requests unless I've misconfigured it
[22:05:20] 04 load shot back up: 22:05:12 up 8 days, 10:48, 4 users, load average: 12.38, 9.52, 7.91
[22:07:18] Resizing is technically possible -- I haven't seen it work but I think other folks have succeeded.
[22:07:59] ah :/
[22:08:07] the host that 04 is on now has quite a low load so that shouldn't be an issue
[22:12:58] looking at apache server-status through ssh deployment-puppetmaster04 -L 8080:localhost:80
[22:13:04] 50 requests currently being processed, 4 idle workers
[22:16:36] from what I can tell, individual requests for file metadata are sometimes taking several seconds?
[22:20:49] hm, http://deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140/puppet/v3/file_metadatas/plugins?environment=production&links=follow&recurse=true&source_permissions=ignore&ignore=.svn&ignore=CVS&ignore=.git&checksum_type=md5
[22:20:53] looks like maybe a big query
[22:23:07] if I try that on 03 I get a quick permissions error
[22:23:21] {"message":"Not Authorized: Forbidden request: /puppet/v3/file_metadata/plugins [search]","issue_kind":"RUNTIME_ERROR"}
[22:23:22] real 0m0.045s
[22:23:34] it takes 04 ages to come up with that
[22:23:46] it isn't a network timeout though?
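The timing comparison being made here can be reproduced from any host that can reach both masters; an authorization error is expected, and what matters is how long each master takes to return it. A sketch using only hostnames already in the log:

    # Time the same file_metadata search against both puppetmasters.
    for master in deployment-puppetmaster03 deployment-puppetmaster04; do
        curl -sk -o /dev/null \
            -w "${master}: HTTP %{http_code} in %{time_total}s\n" \
            "https://${master}.deployment-prep.eqiad.wmflabs:8140/puppet/v3/file_metadatas/plugins?environment=production&recurse=true"
    done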
[22:23:59] no
[22:24:03] 03 response header
[22:24:04] < Backend-Timing: D=2617 t=1581718990401500
[22:24:06] 04 response header
[22:24:11] < Backend-Timing: D=114061668 t=1581718880453797
[22:24:32] if you turn the db backend off does 04 go back to failing quickly?
[22:24:52] db backend?
[22:24:57] you mean puppetdb?
[22:25:45] yeah
[22:26:00] Maybe you're already certain that the issue is slow puppetdb
[22:26:03] in which case nevermind
[22:26:13] not at all
[22:28:09] I'm tempted to move 04 to a different cloudvirt just to see if a reboot and a new filesystem helps. It's so slow even with ssh...
[22:30:28] root@deployment-puppetmaster04:~# time curl 'https://deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140/puppet/v3/file_metadatas/plugins?environment=production&links=follow&recurse=true&source_permissions=ignore&ignore=.svn&ignore=CVS&ignore=.git&checksum_type=md5'
[22:30:29] {"message":"Not Authorized: Forbidden request: /puppet/v3/file_metadata/plugins [search]","issue_kind":"RUNTIME_ERROR"}
[22:30:29] real 3m50.819s
[22:31:37] I'm up for moving it to a different cloudvirt just to see if it helps
[22:31:40] andrewbogott
[22:31:41] wow
[22:31:47] does that happen for /every/ query or just that one?
[22:31:56] hm
[22:32:27] I'm ready to copy puppetmaster04 but lmk when you're clear
[22:32:35] s/copy/migrate/
[22:32:50] apache is very quick to say no if you try using the http: URLs straight out of the logs (it's a TLS port, don't know why logging is wrong)
[22:33:16] I'm running a curl for https://deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140/puppet/v3/file_metadata/modules/base/check-microcode.py?environment=production&links=manage&checksum_type=md5&source_permissions=ignore and it's sitting there thinking about it
[22:33:37] cancelled it
[22:33:41] andrewbogott, go for it
[22:33:44] ok
[22:34:02] another thing to try (when it comes back) is some kind of generic file transfer, in case all we're seeing is network congestion
[22:35:02] ~10 minutes to move
[22:35:09] * andrewbogott goes to wash some dishes
[22:35:14] so right now it's on cloudvirt1013
[22:35:27] where is it going?
[22:46:25] 1017
[22:46:48] load is 1.09 so far
[22:46:55] but probably clients haven't had a chance to find it yet
[22:48:02] alright
[22:48:08] ran puppet on puppetdb03
[22:48:08] Notice: Applied catalog in 5.91 seconds
[22:48:16] yeah, it seems fine so far
[22:48:24] but let's give it a bit and see if it gets bogged down again
[22:49:37] root@deployment-puppetmaster04:~# time curl 'https://deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140/puppet/v3/file_metadata/modules/base/check-microcode.py?environment=production&links=manage&checksum_type=md5&source_permissions=ignore'
[22:49:37] {"message":"Not Authorized: Forbidden request: /puppet/v3/file_metadata/modules/base/check-microcode.py [find]","issue_kind":"RUNTIME_ERROR"}
[22:49:38] real 0m0.022s
[22:49:54] even deploy01 gets a decent puppet run time now - Notice: Applied catalog in 17.43 seconds
[22:50:19] so far this is not very satisfying
[22:50:38] If it's really better I'm going to want to move it back to 1013 to see
[22:50:42] ok
[22:50:48] what else lives on 1013?
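Answering "what else lives on 1013?" is an admin-side OpenStack query. A sketch, assuming admin credentials on a cloudcontrol host:

    # List every VM scheduled on a given hypervisor, across all projects.
    openstack server list --all-projects --host cloudvirt1013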
[22:51:04] did itself quick too: Notice: Applied catalog in 4.25 seconds
[22:51:31] if we want to up the load a bit I have a daily cumin cron that runs puppet across all deployment-prep instances, can run that now
[22:51:46] not a lot
[22:52:06] https://www.irccloud.com/pastebin/6TxBjpAb/
[22:53:27] Krenair: sure, do your worst, let's see what happens
[22:53:49] at script start: 22:53:41 up 11 min, 4 users, load average: 1.13, 0.97, 0.67
[22:54:32] load is up to 6.5
[22:54:37] still nowhere near where it was
[22:55:08] 9.21
[22:55:29] hm
[22:55:46] now if you do that puppetdb query is it super slow?
[22:55:58] this should take 6-7 minutes
[22:56:00] what puppetdb query?
[22:56:08] you mean the puppetmaster queries?
[22:56:41] looks like it yeah
[22:56:56] ok
[22:57:25] so it looks like moving it didn't really fix anything, it just gave us breathing space for a bit
[22:57:30] possibly
[22:57:39] the thing is those individual runs were quick
[23:01:47] load is dropping, maybe things are finishing?
[23:03:27] it's just suddenly sped up
[23:03:37] dropping v.e.r.y g..r..a..d..u..a..l..l..y..
[23:04:14] maybe it finished compiling most of the catalogs or something
[23:04:25] I haven't observed the normal impact of this script on puppetmaster performance
[23:04:39] haven't seen this kind of problem on our puppetmasters before actually, even the global ones
[23:04:47] me neither
[23:05:12] though the global ones are much larger, think we made them both xlarge
[23:05:19] this is only a medium on its own
[23:05:29] now load is really dropping
[23:06:22] looks like it's done, cumin is just waiting for one of the buggy hosts
[23:07:05] so -- whether or not I move it back to 1013 depends mostly on how curious you are/how late you want to stay up
[23:07:11] I'm fine with leaving well enough alone for now
[23:07:24] (which, I regard that last test as 'well enough')
[23:08:05] it's done
[23:08:14] The number of deployment-prep VMs with 'unknown' puppet status just dropped from 30-ish to 5
[23:08:54] I'm happy for you to move it about if you like
[23:09:11] ok, let's see what happens on 1013
[23:09:16] puppet is not something that needs to be running 24/7, it's just important we don't leave it long-term broken
[23:10:13] yep, agreed
[23:16:42] ok, it's back on 1013 and up
[23:16:44] and not looking great
[23:17:21] waiting for ssh...
[23:17:27] ew
[23:17:32] krenair@deployment-puppetmaster04:~$ uptime
[23:17:33] 23:17:25 up 1 min, 3 users, load average: 4.53, 1.52, 0.55
[23:18:10] 7 requests currently being processed, 5 idle workers
[23:18:12] weird
[23:18:12] https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-eqiad1-hypervisor?orgId=1&var-hypervisor=cloudvirt1013&refresh=30s&from=now-6h&to=now
[23:18:55] that dashboard looks totally happy
[23:18:56] and yet...
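The numbers being quoted here are plain `uptime` output plus Apache's mod_status worker counts. A small sketch for pulling both from the puppetmaster in one shot, assuming server-status is only exposed on localhost (which is why the log tunnels to port 80 for it):

    # Load average plus busy/idle Apache workers on the puppetmaster.
    ssh deployment-puppetmaster04.deployment-prep.eqiad.wmflabs \
        'uptime; curl -s "http://localhost/server-status?auto" | grep -E "BusyWorkers|IdleWorkers"'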
[23:19:25] model name : Intel Core Processor (Haswell, no TSX, IBRS)
[23:19:27] on this host
[23:20:03] 1017 was model name : Intel Core Processor (Broadwell, IBRS)
[23:20:27] puppetmaster03 shows model name : Intel Core Processor (Broadwell, IBRS) too
[23:20:44] https://phabricator.wikimedia.org/T223971#5232702 complained about these CPUs
[23:21:00] OK, if you're convinced (as I am) that 1013 is cursed, going to push this back to 1017 and file a vague 'cloudvirt1013 is cursed' ticket
[23:21:15] or
[23:21:21] well, for science, I should move it to 1012 first
[23:21:22] maybe also look at what else is on 1013 and how they're doing
[23:21:27] to see if it's the CPU model
[23:21:30] sure
[23:21:38] you do that I'll poke around in k8s
[23:21:54] same cpu on 1012? (I'm not following where you're getting the make/model)
[23:22:35] this was from cat /proc/cpuinfo
[23:23:24] I don't know how you picked 1012
[23:23:33] Just guessing, I'll make sure
[23:23:44] 1012 is Haswell
[23:24:10] ok, great
[23:24:12] moving to 1012
[23:24:25] for deployment-prep all it hosts is cumin02 and a random memc box (seriously why do we have so many deployment-memc instances? who makes all these)
[23:27:38] okay tools-k8s-worker-22,23,31
[23:27:54] these all live on cloudvirt1013
[23:28:43] and don't host anything except calico-node/kube-proxy/cadvisor daemonset pods
[23:29:30] they all have low enough though non-trivial load averages
[23:30:25] let's compare that to tools-k8s-worker-34
[23:30:44] tools-k8s-worker-34 also has just those daemonset pods, except it has load average: 0.01, 0.08, 0.08
[23:31:43] tools-k8s-worker-34 lives on cloudvirt1021
[23:32:00] whereas
[23:32:01] krenair@tools-k8s-worker-22:~$ uptime
[23:32:01] 23:31:53 up 23:34, 2 users, load average: 1.69, 1.63, 1.48
[23:32:07] krenair@tools-k8s-worker-23:~$ uptime
[23:32:07] 23:29:12 up 23:31, 2 users, load average: 0.62, 0.60, 0.63
[23:32:15] krenair@tools-k8s-worker-31:~$ uptime
[23:32:15] 23:32:11 up 23:08, 2 users, load average: 1.83, 1.33, 0.99
[23:32:40] 1012 isn't looking great
[23:33:39] ok, back to 1017 it goes
[23:34:09] I wonder if that particular set of CPUs is uniquely bad or if it's just 'older machines are slower'
[23:34:13] I picked on tools-k8s-workers for these as I figure they're all going to be similarly sized and configured, and it's easy to check what we expect to be running on them
[23:34:17] Guess I need to re-read hashar's bug
[23:41:20] Krenair: the puppetmaster is back on 1017 and seems OK. I'm going to punch out for now and see about dinner (and mull over whether we need to add visibility/scheduling control for these older CPUs)
[23:41:33] Thank you, as always, for your service w/deployment-prep :)
[23:41:35] ok
[23:41:42] thank you for looking into it
[23:46:14] I think what we've found is it does seem to relate to the host
[23:46:40] we picked two hosts with an older CPU model and it didn't go well
[23:47:31] we picked one host with a newer CPU that we already knew ran deployment-puppetmasters well enough, and it was fine
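For reference, the per-worker checks described above (which pods a node hosts and what CPU the guest reports) can be gathered with kubectl plus /proc/cpuinfo. The worker name is one of the ones from the log and the FQDN suffix is an assumption:

    # Pods scheduled on a given Toolforge worker (only the daemonsets, if the
    # node is otherwise idle).
    kubectl get pods --all-namespaces -o wide \
        --field-selector spec.nodeName=tools-k8s-worker-22
    # CPU model the guest sees (Haswell vs Broadwell in this investigation).
    ssh tools-k8s-worker-22.tools.eqiad.wmflabs 'grep -m1 "model name" /proc/cpuinfo'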