[10:07:55] !log tools.qrank increased CPU quota (T277457)
[10:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.qrank/SAL
[10:08:03] T277457: Request increased quota for qrank Toolforge tool - https://phabricator.wikimedia.org/T277457
[10:34:03] !log toolsbeta re-create toolsbeta-bastion-05 (T275865)
[10:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[10:34:08] T275865: Toolforge: migrate bastions to Debian Buster - https://phabricator.wikimedia.org/T275865
[11:56:06] !log toolsbeta created puppet prefix 'toolsbeta-buster-sgeexec' (T277653)
[11:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:56:10] T277653: Toolforge: migrate grid to Debian Buster - https://phabricator.wikimedia.org/T277653
[12:00:24] !log toolsbeta create VM toolsbeta-buster-sgeexec-01 (T277653)
[12:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:38:02] I'm having serious problems with outreachdashboard.wmflabs.org https://phabricator.wikimedia.org/T277651
[12:38:10] !log toolsbeta created puppet prefix 'toolsbeta-buster-gridmaster' (T277653)
[12:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:38:14] T277653: Toolforge: migrate grid to Debian Buster - https://phabricator.wikimedia.org/T277653
[12:38:32] If anyone is up for helping me debug it, I would really really appreciate it.
[12:39:55] !log toolsbeta created VM toolsbeta-buster-gridmaster (T277653)
[12:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:40:30] ragesoss: how can we help?
[12:41:25] arturo: see https://phabricator.wikimedia.org/T277651 . Basically, the server becomes unresponsive after a short time. Can't even log in via SSH. I can't figure out what might be causing it.
[12:42:17] I'm going to restart it again now so I can log back in.
[12:42:55] I see
[12:45:02] could it be a misconfiguration somewhere?
[12:45:19] the queue being full may indicate a leak somewhere, or a cleanup not happening
[12:47:14] I was looking for any sort of indication of that, but it appeared that there was no queue backlog up until the moment the system became unresponsive.
[12:48:58] the passenger-status screenshot shows nothing in the queue at all right as it went down the most recent time... it stopped responding via SSH at the same time web requests I tried manually stopped resolving, but there was no queue backlog at that point. Then after a few minutes, web requests resolve again but with the queue full message.
[12:49:51] it is strange that it stops responding to SSH
[12:50:01] did you check kernel logs? dmesg etc
[12:51:45] no
[12:52:36] is the server memory exhausted, is the kernel OOM killer doing anything?
[12:53:19] * arturo needs to change workplace and won't be able to follow up now
[12:54:19] I uploaded the `top` screenshot to the task, which was from right after it stopped responding.
[12:54:41] Doesn't look like it was out of memory.
[13:05:48] Not sure what to look for in terms of kernel logs, but I'm running `tail -F /var/log/messages` right now to see if anything interesting shows up if/when it goes down again.
[13:06:19] It's been up for about 20 minutes this time with no problems yet.
[13:06:31] It went down after 25 minutes last time.
[13:08:13] What is the command `jbd2/vda3-8` that was running under root? Something related to Wikimedia Cloud infrastructure, I assume?
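For reference, arturo's suggestions above (check dmesg, look for OOM-killer activity) could look roughly like the sketch below. This is a generic Debian-era sketch rather than anything taken from the affected instance; it assumes a persistent systemd journal is available, and log locations may differ.

```bash
# Generic checks for a VM that locks up: OOM-killer activity, hung-task
# warnings, and filesystem/journal errors. Run these after the next reboot.
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
sudo dmesg -T | grep -iE 'hung_task|blocked for more than|jbd2|ext4.*error'

# If the journal is persistent, the previous boot's kernel messages are
# recoverable even after a hard reboot:
sudo journalctl -b -1 -k --no-pager | tail -n 200

# Live view while waiting for the next lockup (similar in spirit to the
# `tail -F /var/log/messages` already running above):
sudo journalctl -kf
```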
[13:08:53] That doesn't normally show up on `top` but it did right before it became unresponsive last time.
[13:12:14] it's the filesystem journaling.
[13:20:02] corrupted filesystem somehow?
[13:20:20] perhaps rebuilding the VM is worth a try
[13:22:05] might have just been coincidence, but if that shows up again when the problem recurs, it seems like a strong possibility.
[13:22:36] if you have a database, it might be the right move to migrate the data out into a cinder volume
[13:22:42] the right moment*
[13:22:48] what is a cinder volume?
[13:23:02] https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances#Cinder
[13:23:58] Is that different from `/dev/mapper/vd-second--local--disk 138G 80G 51G 62% /srv` ?
[13:25:02] ah, that's the deprecated 'With LVM' way of attaching block storage.
[13:25:29] yup
[13:26:19] so, yeah... perhaps this is the right moment to rebuild and move the database.
[13:28:58] I'm actually working with a performance consultant for the next couple of weeks, and we're working on moving this system to a multinode configuration so we can run the background jobs on a different VPS from the web requests.
[13:30:49] that's great
[13:31:34] although, we're not ready to do that today, so rebuilding and migrating immediately is not my idea of a good time.
[13:58:29] is physical disk failure a possibility?
[13:58:58] Any way to check for that, and move things automatically?
[14:01:49] ragesoss: the VMs use a distributed storage system so physical disk failure would be compensated for
[14:02:29] andrewbogott: but some other kind of filesystem corruption is still a possibility?
[14:03:35] Right — Ceph provides block storage (essentially virtual drives); the actual filesystem is local to the VM.
[14:04:00] and that's the case for the deprecated pre-Cinder block storage as well?
[14:04:08] yep
[14:04:20] ragesoss: if you can afford the downtime I'd experiment with switching off the webservice and see if the system still locks up over time, might help assign blame
[14:04:58] Does your phab task include the fqdn of the affected host? I don't see it in there
[14:05:02] andrewbogott: I think I'm going to take it offline anyway, because I want to focus on making a recent copy of the database so I don't lose anything.
[14:05:11] what is fqdn?
[14:05:13] that's reasonable
[14:05:27] fully-qualified domain name, like ..etc
[14:05:33] so I can try to ssh :)
[14:06:20] programs-and-events-dashboard.globaleducation.eqiad.wmflabs
[14:07:04] happen to have an approximate timestamp from the last time it locked up?
[14:07:52] oh, nm, I see when you rebooted it last
[14:08:00] Mar 17 14:06:37
[14:08:02] 12:22 server time.
[14:12:27] in your logs I see non-continuous clusters of messages like this:
[14:12:27] programs-and-events-dashboard sidekiq-short[15282]: 2021-03-16T22:15:30.863Z pid=15282 tid=2pym class=CourseDataUpdateWorker jid=0a8e64ddb5262b2540a07d31 elapsed=1.31 INFO: done
[14:12:44] Do you know what that is? Is the grouping just related to when the queue was clogged vs. not clogged?
[14:13:01] those are background jobs.
[14:13:44] they pull data about editor activity from assorted APIs and do database operations with it.
[14:14:23] yeah — I'm wondering if they're coming in a herd and overwhelming things
[14:14:33] do you mind leaving the service up for a while so I can see it when things lock up?
[14:14:36] they are typically the lion's share of system load, but they are handled by dedicated processes.
[14:14:38] It happens every few minutes right?
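One quick way to test the "coming in a herd" theory is to bucket those sidekiq completion lines by minute. This is only a sketch; it assumes lines like the sample above land in /var/log/syslog (the actual log path or journald unit on the instance may differ).

```bash
# Count sidekiq job completions per minute to see whether jobs arrive in
# bursts. Adjust the log path (or switch to journalctl) if needed.
grep 'sidekiq' /var/log/syslog \
  | grep 'INFO: done' \
  | sed -E 's/.* ([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}).*/\1/' \
  | sort | uniq -c | sort -k2
```

If the per-minute counts spike right before each lockup, the herd theory gains weight; if they stay flat, the background jobs probably are not the trigger.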
[14:14:52] no, it has been up for over an hour since the last time there was a problem.
[14:14:55] I can leave it up, sure.
[14:15:13] oh, huh — how long does it typically run between lockups?
[14:15:28] It was happening after a few minutes every time I restarted it last night, and again the first time I restarted it this morning.
[14:15:46] then it went 20 minutes until it froze.
[14:16:25] current uptime is 90 minutes.
[14:16:25] 'k
[14:17:29] Once it locks up does it ever recover so that you can ssh again? Or is it a goner until a reboot?
[14:17:52] it's a goner until reboot.
[14:17:55] ok
[14:18:53] one thing I see is that from time to time the database is maxing out disk access. That shouldn't really be a problem but it might be causing some kind of interaction where there are multiple things blocking on IO
[14:19:42] I'm going to get some breakfast and we'll see if it dies while I'm eating :)
[14:19:47] :D
[14:19:48] thanks
[14:26:48] To clarify from above... the main storage of the VPS (as opposed to the block storage) is also distributed and resilient to hardware failure, or not?
[14:30:34] It's all distributed
[14:30:37] Yes, the hard drive (i.e. block device) where the OS of that VM is installed is hosted in Ceph (using RBD), a resilient-to-hardware-failure storage solution
[14:31:05] Cinder/lvm/local drives/whatever are all backed with Ceph.
[14:45:33] maxing out disk access does seem like a possible reason for the system locking up, according to Nate Berkopec (the performance consultant I'm working with).
[14:46:33] possibly that's related to jbd2 running and maxing out one CPU at the time it crashed last.
[15:03:00] everything is still working so far *shrug*
[15:03:27] It seems most likely that this is going to turn out to be a performance scaling thing rather than something interesting with cloud-vps infra. I'm still watching though
[15:08:17] thanks. looks like MySQL has been dying repeatedly over the last few hours since it's been "up".
[15:09:42] ragesoss: I think the first "fix" I would work on if I were in your position is separating the database, redis, and core app into 3 different instances.
[15:09:57] bd808: that's exactly the plan.
[15:10:51] I think I might have found the immediate cause (or something related to it)... a corrupted MySQL index.
[15:12:21] ragesoss: in the meantime you can create a quota request for the cinder storage you need for your future/rebuilt database: https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances#Storage_Quotas
[15:12:37] I'm going to take the app down andrewbogott. will try using `CHECK TABLE` and `OPTIMIZE TABLE`.
[15:12:47] ragesoss: ok!
[15:16:38] andrewbogott: is it intentional that in Horizon I see a "__DEFAULT__" Cinder volume type in addition to the "standard" one?
[15:17:43] Majavah: not on purpose, can you make me a phab task? I'm in a meeting atm
[15:17:52] sure andrewbogott
[15:17:57] thx
[15:20:19] T277666
[15:20:20] T277666: Remove "__DEFAULT__" Cinder volume type from Horizon - https://phabricator.wikimedia.org/T277666
[15:43:47] wasn't there a website where you can see cpu, ram, network etc. usage of VMs?
[15:45:02] gifti: one place for that sort of thing is this dashboard -- https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?orgId=1
[15:50:38] thank you!
[16:03:29] ragesoss: since that VM refuses to fail while I'm watching it I'm going to move on to other things. Feel free to ping if you need support when rebuilding into a multi-node cluster.
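To follow up on the "database is maxing out disk access" observation above, a couple of stock tools can confirm whether the virtual disk is the bottleneck when the slowdown hits. This is a generic sketch; sysstat and iotop may need to be installed first (sudo apt install sysstat iotop).

```bash
# Per-device utilization every 5 seconds; %util pinned near 100 together
# with a growing await column means the disk is saturated:
iostat -x 5

# Which processes are generating the I/O (mysqld, sidekiq, jbd2, ...):
sudo iotop -o -d 5
```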
[16:03:46] And don't forget that cinder volumes can be moved between VMs, so they're a good way to get your database off the troubled node and onto a fresh one :)
[16:05:00] andrewbogott: thank you.
[16:18:22] I'm trying to repair the corrupt index, but after a little while, this happens:
[16:18:22] MariaDB [dashboard]> OPTIMIZE TABLE revisions;
[16:18:22] ERROR 2013 (HY000): Lost connection to MySQL server during query
[16:18:46] also...
[16:18:46] MariaDB [dashboard]> check table revisions;
[16:18:46] ERROR 2013 (HY000): Lost connection to MySQL server during query
[16:27:41] have you been able to get a mysqldump?
[16:28:57] legoktm: no. I think I have a good dump from the most recent automated backup, March 14. But running the backup script now results in a much smaller file and this:
[16:28:58] mysqldump: Error 2013: Lost connection to MySQL server during query when dumping table `revisions` at row: 6311728
[16:29:51] is mysql crashing or is it actually timing out?
[16:30:18] I don't know.
[16:31:04] is it alright if I ssh in to just look?
[16:31:11] legoktm: yes
[16:33:50] have you seen what's in /var/log/mysql/error.log ? it links to https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
[16:36:07] ragesoss: ^
[16:37:29] reading now. I hadn't looked at the error.log until now.
[16:40:14] trying innodb recovery is what I would do next, though I'd try to backup /srv/mysql/ first just in case
[16:42:23] I logged out of the instance now
[16:43:53] legoktm: 'trying innodb recovery': I change the innodb_force_recovery setting, restart the service... and then what?
[16:44:16] cross your fingers :p
[16:44:38] if mysql starts, try taking a dump and see how far it lets you go
[16:44:41] ragesoss: it's about as exciting as watching an fsck run :)
[16:44:54] I do that and then run the backup again, you mean?
[16:44:56] okay.
[16:45:02] will try it.
[16:45:34] level 1 says "Lets the server run even if it detects a corrupt page. Tries to make SELECT * FROM tbl_name jump over corrupt index records and pages, which helps in dumping tables." which is the issue right now according to the error log
[16:46:17] and perhaps that will also let me run `optimize table` to fix the index?
[16:46:29] do you know how long you will support buster images?
[16:48:41] gifti: likely until mid 2023 (its expected EOL date)
[16:48:52] https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy#Timeline_of_previous_distribution_deprecations
[16:49:58] huh, i thought it was beginning of next year
[16:50:31] ragesoss: possibly. you might also want to consider importing the dump from scratch
[16:50:47] * Majavah is still removing Jessies
[16:51:04] legoktm: drop the whole db, then import the dump in its place?
[16:51:18] gifti: I would guess that we will start nudging folks to move from Buster to Bullseye around January 2022. Just a guess at this point.
[16:53:27] ok
[16:58:09] !log admin disabling all flavors with >20Gb root storage with "update flavors set disabled=1 where root_gb>20;" in nova_eqiad1_api
[16:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:58:15] ragesoss: yeah. I guess if the index repair works and you can run mysql without crashing with recovery = 0, that might be good enough. it sounded like you were going to move to a new instance anyways, and you'd have to dump/re-import for that later
[17:12:15] bd808: is it possible to use something like this on WM Cloud? https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs
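A rough outline of the forced-recovery path legoktm describes above, as a hedged sketch: the datadir path comes from the chat (/srv/mysql), but the service name and config file location are assumptions for a stock Debian MariaDB install and worth double-checking on the instance.

```bash
# Back up the raw datadir first, just in case (path mentioned in the chat):
sudo systemctl stop mariadb        # unit may be "mysql" on older installs
sudo cp -a /srv/mysql "/srv/mysql.bak-$(date +%F)"

# Enable the lowest forced-recovery level by adding this under [mysqld]
# in the MariaDB config (file location varies, e.g.
# /etc/mysql/mariadb.conf.d/50-server.cnf):
#   innodb_force_recovery = 1

sudo systemctl start mariadb

# With the server in recovery mode, try to dump the damaged table
# (database and table names from the chat; adjust credentials as needed):
sudo mysqldump dashboard revisions > revisions.sql
```

Remember that innodb_force_recovery is meant to be temporary; the setting should be removed again once the data has been dumped.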
[17:16:47] !log admin set default cinder quota for projects to 80Gb with "update quota_classes set hard_limit=80 where resource='gigabytes';" on database 'cinder'
[17:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:24:50] ragesoss: not currently, no. I think we still have a feature request ticket somewhere in the backlog to open up our OpenStack APIs to the point where terraform could be used to manage instances, but today it is not possible (AFAIK).
[17:25:12] bd808: that's what I thought, alas.
[17:26:44] T215074 was maybe what I was thinking about in the backlog. So we may have the auth tech available now.
[17:26:45] T215074: Support Keystone Application Credentials - https://phabricator.wikimedia.org/T215074
[17:28:31] !log admin restarted the backup-glance-images job to clear errors in systemd T271782
[17:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:28:42] T271782: cloudvirt1024 reports backup job failures - https://phabricator.wikimedia.org/T271782
[17:53:11] bd808: so it might actually be possible now? Or you haven't yet updated to a compatible OpenStack version?
[17:56:39] being able to use Terraform would be *extremely* helpful for this switch to a multi-node architecture.
[18:27:20] ragesoss: we are running a new enough version of OpenStack now, but I have no idea what other things would need to be changed to make it all work.
[18:28:34] The automation that we do have is ops/puppet.git which does not handle the instance creation but can (with custom Puppet manifests for the project) handle provisioning software and config within the instances
[19:30:40] even if I have to create the instances ahead of time with horizon, if the rest of the orchestration can be done with tooling it sounds like that will be really helpful.
[19:36:31] ragesoss: yeah, that's exactly what puppet can do for you. It takes over all of the package installs and file permissions and things like that.
[19:37:55] ragesoss: breadcrumbs start at https://wikitech.wikimedia.org/wiki/Puppet, but if you are seriously interested in automation it might help to have a chat to talk over some of the pros and cons.
[19:38:56] ragesoss: you would very likely want https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster in your project to smooth out some of the cons
[19:40:14] bd808: we're aiming for something that can largely work for both Wiki Ed production on linode and for cloud. sorting out deployment, rather than initial provisioning, is probably the main thing.
[19:40:52] maybe I can find a time for you, me and my consultant Nate Berkopec to chat.
[20:26:15] !log tools moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv
[20:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:49:44] andrewbogott: are you planning on leaving the default cinder volume quota at four volumes? ie do projects with more than four volumes need to get more quota?
[20:50:20] Majavah: that's a good point. I don't think there's any reason to be aggressive in capping that
[20:50:27] I'll increase it a bit.
[20:51:57] Majavah: done
[20:52:16] andrewbogott: thanks!
[20:53:07] deployment-prep is down to only a few hard and complicated Jessies, when the time comes to migrate off Stretch I'm curious how much cinder quota it will need
[20:53:28] Majavah: lots! and we will be happy to see it :)
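For reference, the quota adjustments above were made on the admin side. With OpenStack CLI credentials that can see the project (which regular Cloud VPS members normally do not have; quota bumps go through the request process linked earlier), the equivalent checks and changes look roughly like this sketch. The project name here is just a placeholder.

```bash
# Inspect a project's quotas and its existing volumes:
openstack quota show globaleducation
openstack volume list --project globaleducation

# Raise the per-project volume count and Cinder storage quota (admin-only):
openstack quota set --volumes 8 --gigabytes 80 globaleducation
```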
[20:54:37] bd808: I know it's "lots" but I don't know anything more specific, I guess we'll see when the default runs out :P
[20:54:52] db, elasticsearch, and swift should all move towards cinder for their persistent data in beta cluster
[20:55:30] I feel kinda sad creating two large LVM based betacluster database servers just last week when recovering from the db05 disk issue
[20:56:14] the "easy" thing for us to do will probably be a quick audit of the existing instances and then a grant of cinder quota == the disk that the instances could consume today minus the 20G base image size
[20:57:21] !log tools deployed changes to rbac for kubernetes to add kubectl top access for tools
[20:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:57:41] Majavah: moving them should be a lot easier next time. 1) create cinder volume, 2) mount to instance, 3) stop mariadb, 4) copy state data to cinder volume, 5) profit!
[20:58:23] that would be way more than necessary I think, for example the old mwlog server had dozens of gigs of space in /srv while I gave its replacement a 2G cinder volume and that's not even half full
[21:06:32] bd808: side note, do you happen to have any ideas who to ping to get beta cluster's logstash working again? It's been empty for a few weeks now and I haven't been able to magically unbreak it
[21:15:48] Majavah: g.odog and s.hdubsh would probably be good folks to ping. AFAIK they take care of the production ELK stack these days
[21:16:59] Majavah: those nicks are https://meta.wikimedia.org/wiki/User:FGiunchedi_(WMF) and https://meta.wikimedia.org/wiki/User:CWhite_(WMF)
[21:44:00] waiting to see whether a long database operation works or not is one of the least pleasant things with computers.
[21:57:31] legoktm: I think recovery mode, dumping the table, dropping it, then reloading it... seems to have worked.
[21:57:36] thanks again!
[21:57:40] wooo :D
[21:58:00] glad I could help
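bd808's five-step "move the data onto Cinder" recipe above, spelled out as a rough shell sketch. The device name, mount point, and datadir here are assumptions (an attached Cinder volume often shows up as something like /dev/sdb, but verify with lsblk), and the Horizon steps for creating and attaching the volume are not shown.

```bash
# 1) create the cinder volume and attach it to the instance in Horizon,
#    then find the new block device (often /dev/sdb, but verify):
lsblk

# 2) format and mount it (device and mount point are assumptions):
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /srv/dbdata
sudo mount /dev/sdb /srv/dbdata
# add an /etc/fstab entry so the mount survives reboots

# 3) stop mariadb, then 4) copy the state data over:
sudo systemctl stop mariadb
sudo rsync -a /srv/mysql/ /srv/dbdata/mysql/

# 5) point datadir at the new location in the MariaDB config (or
#    bind-mount it over the old path) and start the service again:
sudo systemctl start mariadb
```

Because the volume is independent of the instance, it can later be detached and attached to a replacement VM, which is the "cinder volumes can be moved between VMs" point made earlier.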
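And the dump/drop/reload sequence that ragesoss reports success with at the end, again only as a hedged sketch: the database and table names are the ones from the chat, while credentials, config paths, and the service name are assumptions.

```bash
# With innodb_force_recovery = 1 still set, dump just the damaged table:
sudo mysqldump dashboard revisions > revisions.sql

# Drop the corrupt copy (DDL like DROP TABLE is still permitted at the
# low forced-recovery levels):
sudo mysql -e 'DROP TABLE revisions;' dashboard

# Remove innodb_force_recovery from the config, restart, and reload:
sudo systemctl restart mariadb
sudo mysql dashboard < revisions.sql
```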