[06:12:32] I am going to switch phabricator
[06:12:34] primary master
[07:00:06] s3 backups will end late today
[07:49:25] federico3: can you give me the syntax to test your parsercache depool/repool cookbook, so I can test it?
[07:49:31] I can do a live test
[07:49:37] And if all works fine we can merge and update the doc
[07:49:39] I'm testing it right now as we speak :D
[07:50:09] good
[07:50:09] I found a little glitch. Which section do you want to depool?
[07:50:16] Let's try pc1 for instance
[07:51:26] test-cookbook -c 1165546 sre.mysql.parsercache show pc1
[07:51:26] detected: Hosts found: pc1011.eqiad.wmnet pc2011.codfw.wmnet
[07:51:37] want to depool them?
[07:51:43] how is the depool done?
[07:52:00] I am not understanding the code well
[07:52:16] do you do just a dbctl HOST depool?
[07:54:31] test-cookbook -c 1165546 sre.mysql.parsercache --help will pull the latest version
[07:54:50] depool pc1 will depool the hosts in the section
[07:55:05] or do you want to depool just one host?
[08:00:25] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on es1047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:25] What I mean is, how do you depool the hosts
[08:00:28] ^ expected
[08:00:51] Like, what's the dbctl command you are running under the hood
[08:02:43] ah, I'm calling dbctl.instance.depool() from spicerack's Dbctl module
[08:04:02] I am not sure that's working
[08:04:10] Check: https://phabricator.wikimedia.org/T388389
[08:04:26] Check my initial task description, where I mention how parsercache hosts need to be depooled
[08:04:57] They cannot be depooled via a normal dbctl instance $HOST depool
[08:06:29] you mean the spicerack module does not issue the same command as https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Depooling_a_parsercache_host_and_section ?
[08:07:49] To depool a host we normally use dbctl instance depool
[08:07:56] But for parsercache we have to set-weight 0
[08:08:06] And I am not sure if the spicerack module does that
[08:08:15] as a workaround I can run the dbctl commands directly from the cookbook but probably later on we want to fix the spicerack module
[08:08:30] I don't know if the module can do a set-weight
[08:08:52] if it can, then great; if not, let's create a task to follow up, but for now we can just include the dbctl commands in your cookbook
[08:16:21] the depool() method indeed would get confused by pc sections; right now I think I can use the .weight() method
[08:18:25] that's good, we only have to set it to 0 when depooling and to 1 when repooling
[08:19:04] marostegui: the CR is updated
[08:20:06] checking
[08:20:14] I still have to dry run it
[08:24:03] fwiw dry-run is now passing but I'm not seeing other cookbooks calling weight(...) so this would be a first and we are testing it on prod 😢
[08:26:08] BTW the script is currently setting the weight on one host, then doing dbctl config commit, then doing the same on the other host. Do we want to do just one commit instead?
[08:28:10] federico3: yes, let's do both at the same time
[08:28:16] I added some comments on the patch just now too
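
A minimal sketch of the depool pattern being converged on above, built only from the dbctl commands named in the conversation (host names from the earlier show pc1 output; exact argument forms may differ from the wikitech page linked at 08:06):

    # Depool section pc1 by zeroing the weight on both of its hosts, then
    # commit once, so MediaWiki never sees one host at 0 and the other at 1.
    dbctl instance pc1011 set-weight 0
    dbctl instance pc2011 set-weight 0
    dbctl config diff                        # review the pending change
    dbctl config commit -m "Depool pc1"      # one commit covers both hosts
    # Repooling is the mirror image: set-weight 1 on both hosts, one commit.
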
[08:29:42] if one of the 2 calls to weight() fails do we still want to commit?
[08:32:18] no
[08:32:31] Leaving one host with 0 and the other with 1 would be very weird from a MW point of view
[08:32:36] It should be both or none I'd say
[08:32:59] ok
[08:33:16] later on we could add a rollback to avoid leaving dbctl dirty
[08:36:59] dirty in which sense?
[08:37:24] having set a weight only on one host but without running commit
[08:37:43] yeah, that'd be bad
[08:56:04] FIRING: MysqlPredictiveFreeDiskSpace: Host es1047:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[09:00:20] ^ expected
[09:12:20] marostegui: ok I sent a bunch of changes to make the cookbook more resilient
[09:12:36] ok, checking
[09:14:24] I haven't made the downtime optional yet; the use of -t should already be optional
[09:26:04] RESOLVED: MysqlPredictiveFreeDiskSpace: Host es1047:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[11:19:28] federico3: your patch never arrived to gerrit?
[11:19:38] [11:12:21] marostegui: ok I sent a bunch of changes to make the cookbook more resilient
[11:19:38] ^
[11:20:03] But there's no update to the patch since 10:21, so I am not sure if you pushed it?
[11:21:04] let me send a minor update and trigger git review again
[11:23:22] I see it now, but I don't see my comments addressed yet, which is fine, but just making sure it is intended
[11:24:01] yes, see comment above
[11:25:10] ok, let's follow those discussions on the patch, otherwise the information gets scattered across irc and gerrit and future checks of the patch will miss context
[11:25:57] I can address the other comments before we merge but I thought we wanted to test the cookbook shortly
[11:27:08] Sure, is it testable already?
[11:27:20] like the set-weight added from what I can see
[11:27:24] Let me test
[11:27:48] yes, FWIW I did dry runs but the only real test we can do at this point is the depool
[11:28:17] I am testing
[11:29:31] marostegui: meanwhile, how did you run the updates for the read-only es* hosts? I've been looking at using auto_schema for it but maybe it's better to use the host update cookbook. Or did you use some other script?
[11:30:20] https://phabricator.wikimedia.org/P78820 it looks good
[11:30:35] It is weird that only the last dbctl paste is shown, it can be confusing
[11:30:47] It made me have to double-check whether the other host was changed too
[11:31:17] last paste? It's one commit for both hosts
[11:31:35] https://phabricator.wikimedia.org/P78819
[11:31:38] That is the paste
[11:31:48] I don't see the codfw host there
[11:32:32] that's really sus, yet it was correctly depooled?
[11:32:38] In fact
[11:32:42] It wasn't committed
[11:32:48] It is there pending to be committed
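
A hedged sketch of the pre-flight check federico3 mentions just below, which would catch exactly this half-committed state (it assumes dbctl config diff prints nothing when no change is pending):

    # Abort early if dbctl already has an uncommitted ("dirty") change,
    # instead of silently folding it into the cookbook's own commit.
    if [ -n "$(dbctl config diff)" ]; then
        echo "dbctl config has uncommitted changes; aborting" >&2
        exit 1
    fi
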
[11:33:09] maybe it is pending from your previous test?
[11:33:30] no, I never ran the script in "real" mode, only dry run
[11:33:50] The pool has the same behaviour
[11:33:52] plus the script is checking for pending "dbctl config diff" changes before running so it would have stopped
[11:33:53] It only commits one host
[11:33:58] https://phabricator.wikimedia.org/P78822
[11:34:00] aha found it
[11:34:01] the other one is pending
[11:34:18] https://www.irccloud.com/pastebin/pvWobMW3/
[11:34:33] it seems that it commits only changes in a datacenter
[11:34:54] could be, but yeah, it definitely only commits one host
[11:34:57] if we call commit without a datacenter will it commit everything in one transaction?
[11:35:11] you can group everything in one commit
[11:35:52] how does the commit work, e.g. does it interact with etcd in 2 dcs?
[11:36:25] besides, I think it makes sense to print out the diff before committing
[11:37:46] federico3: https://phabricator.wikimedia.org/P78825
[11:38:03] federico3: makes sense to print the diff yes
[11:38:49] federico3: and this is the pooling back: https://phabricator.wikimedia.org/P78825#316592
[11:39:06] I should have a change ready in a second
[11:39:47] great!
[11:53:13] marostegui: just updated the CR. It should show the diff
[11:55:39] let me check
[11:56:41] federico3: There should be a check to catch if the section is pooled, because I have pooled pc1 a few times and it goes through all the steps and makes me believe it pooled it, but the reality is that it was pooled already
[11:56:53] anyway, testing the depool now
[11:57:21] ok, I can add that. Also, do we want to check how many *other* sections are pooled before depooling?
[11:57:46] The change went well, the commit went through but the diff wasn't showing for me (although I can check it on phabricator)
[11:58:20] federico3: yeah, that would be good to list other depooled sections, but just as a FYI, not to impose anything there, just letting the operator know about it
[11:58:32] (BTW did you test without task id ?)
[11:59:12] yeah
[11:59:16] I never added it
[11:59:39] ok, do you mind closing the PR comments that are addressed when you have a sec?
[12:00:09] did you see "Changes:" in the log before "Committing dbctl config" ?
[12:05:12] yes and it was empty
[12:05:58] I closed the addressed comments
[12:51:24] Amir1: I can take over the pc hosts if you like
[12:51:38] As I was doing the 10G part until I had to be out for a few days
[12:51:43] Do you want me to get pc2015 ready?
[12:52:24] marostegui: the reason it's postponed is that dc ops can't go to codfw this week
[12:52:31] ah ok!
[12:52:44] But don't worry, I will take over
[12:52:59] Otoh, I'm out next week so yeah. Please take over!
[12:53:06] Thank you ❤️
[12:53:07] JennH: Monday for pc2015 sounds good?
[13:12:24] marostegui: how did you run the updates for the read-only es* hosts? I've been looking at using auto_schema for it but maybe it's better to use the host update cookbook. Or did you use some other script?
[13:13:35] federico3: I used the host update cookbook
[13:15:15] marostegui: did it do all the required steps with no glitches?
I only used it on db*
[13:16:26] Sorry, I got confused with the other hosts. No, for those hosts I didn't use the cookbook
[13:16:37] I used the host reboot cookbook
[13:19:29] federico3: essentially I depooled manually, ran full-upgrade -y via cumin, stopped mariadb, ran the reboot cookbook, and then started mariadb via cumin too and repooled
[13:19:42] I have a script for it somewhere, I will look for it
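
A hedged reconstruction of that manual flow, one host at a time; the reboot cookbook name and the exact depool/repool commands are assumptions, not quoted from the conversation:

    HOST=es1047.eqiad.wmnet
    dbctl instance es1047 depool                        # assumed depool step
    dbctl config commit -m "Depool es1047 for upgrade"
    cumin "$HOST" 'apt-get full-upgrade -y'             # upgrade packages
    cumin "$HOST" 'systemctl stop mariadb'              # stop mariadb first
    cookbook sre.hosts.reboot-single "$HOST"            # reboot via cookbook
    cumin "$HOST" 'systemctl start mariadb'             # start mariadb again
    dbctl instance es1047 pool
    dbctl config commit -m "Repool es1047 after upgrade"
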
[13:38:17] I thought I pushed it to the repo years ago, but it seems that I didn't :)
[13:38:19] I will do it
[13:38:30] It was done quite some time ago
[13:39:25] We should ideally make the upgrade host work with these hosts, but we've never done it yet
[13:39:32] I meant the cookbook
[13:41:31] yes, I'll look at the script and merge it into the upgrade cookbook
[13:42:43] I don't think you can reuse much of it :)
[13:42:48] It is a bunch of one-liners basically
[13:42:53] But the "logic" is there
[14:11:09] hey folks; we are doing a backfill for a flink app. I noticed that the state snapshot size we checkpoint to swift (mw-page-content-change bucket) has increased to >100MB (from a few KBs/MBs). Snapshot size is a function of traffic increase because of the backfill. I'm monitoring, but please holler if space is an issue.
[14:13:36] FWIW: for this app we have a retention of 10 snapshots, so there's an upper bound to how much space we can hog
[14:56:40] jynus: let me know which day/time would be the worst to do a m1 switchover, so I can avoid it :)
[14:56:47] Not planning to do it this week, but next week most likely
[15:02:34] gmodena: a G or so in swift is fine
[15:04:54] any time before next tuesday at 0 hours would be ok
[15:12:53] Emperor: ack - thanks.
[15:59:53] marostegui: thanks - I can go on updating other es* hosts
[17:10:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:59:54] I got tired of icinga: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167691 🙈