[00:00:06] tools.fountain@tools-bastion-03:~$ sql tools
[00:00:10] ERROR 2003 (HY000): Can't connect to MySQL server on 'tools.db.svc.eqiad.wmflabs' (111)
[00:00:56] Yes. Toolsdb has just been restarted readonly by the DBA team to try and recover from the issues earlier
[00:02:53] is there a phabricator issue to track?
[00:08:17] There's an incident report.
[00:08:57] The database should be working read-only right now...can anyone confirm?
[00:09:01] is there a link or something? I want to redirect some angry users to it lol (https://phabricator.wikimedia.org/T216201)
[00:09:04] bd808, you have some data on there, right?
[00:09:13] Yup...working on it leloiandudu
[00:09:26] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190214-labsdb1005
[00:09:33] bstorm_: thanks!
[00:10:15] bstorm_: yes. just tested and I can connect and read data from my tool's db
[00:10:30] Thanks!
[00:10:47] bstorm_: my tool is working now too
[00:10:59] It won't be able to write :)
[00:11:08] So don't be too sure
[00:11:10] I know :) thanks!
[00:17:56] leloiandudu: I just made T216208. You can make things subtasks of that or just merge them in as duplicates as needed
[00:17:56] T216208: ToolsDB overload and cleanup - https://phabricator.wikimedia.org/T216208
[00:22:23] I believe we are read-write again
[01:15:28] bstorm_: thanks!
[01:28:08] !log paws Re-enabled PAWS vhost on paws-proxy-02
[01:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[01:32:06] bstorm_: AnomieBOT's database table seems to be locked on toolsdb, so no queries are working. Is that something known?
[01:33:59] Eh...no.
[01:34:07] Does restarting it work?
[01:34:52] anomie: The database was read only for a while, broken for a while....etc. I'm hoping that if you restart, things are good. We resolved a lot of issues, but things could have come back.
[01:36:02] * anomie qdels all the jobs
[01:36:32] * anomie waits for them to actually go away
[01:39:07] * anomie is still waiting...
[01:41:51] * anomie logs into the exec nodes and kill -9's the bot's processes
[01:42:15] heh...
[01:42:59] bstorm_: When I `sql toolsdb` and execute `show full processlist`, I still see queries pending.
[01:43:59] Hrm...
[01:45:13] Let's see... we're at 600+ connections, not a disaster
[01:45:50] There are way too many in "opening tables" state
[01:47:18] There's like a hundred from a db called phragile trying to get a single record from a DB...
[01:47:21] ugh
[01:48:33] The db looks really locked up :(
[01:50:14] phragile is a tool that WMDE built a long time ago to do burndown charts and such for Phab
[01:51:27] !log tools.phragile Killed webservice to see if that releases open database connections
[01:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.phragile/SAL
[01:54:04] Literally everything that isn't trying to insert or update is stuck in "opening tables"
[01:54:54] any idea what table_open_cache is set to?
[01:55:06] I can find out
[01:55:45] 5000
[01:57:13] if you have a responsive admin session, `show engine innodb status\G` may tell you what locks are stuck. I think "opening tables" is either about a lock or the system running out of cache space
[01:57:51] if it's across a bunch of users/tables then the lock is maybe on the metadata tables?
[01:59:24] https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_open_cache
[02:00:04] I do...but that command hangs
[02:01:09] locks on the metadata table is a likely RCA in general here. However, I no longer see a strong culprit. hoping I'll find something
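An aside on the triage above: the pattern being described — pull `show full processlist` and look for connections piled up in the "opening tables" state — can be scripted. Below is a minimal sketch with pymysql, assuming admin-capable credentials in a MySQL options file; the host name is taken from the error at 00:00:10, but the credentials path is illustrative, not the actual ToolsDB admin setup.

```python
# Summarize ToolsDB connections by state and by user, mirroring the manual
# "show full processlist" triage in the log above. Sketch only: assumes
# pymysql and an admin-capable options file; the credentials path is illustrative.
import os.path
from collections import Counter

import pymysql

conn = pymysql.connect(
    host="tools.db.svc.eqiad.wmflabs",  # host name from the error at 00:00:10
    read_default_file=os.path.expanduser("~/.my.cnf"),  # hypothetical credentials file
    cursorclass=pymysql.cursors.DictCursor,
)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        rows = cur.fetchall()
finally:
    conn.close()

print(f"{len(rows)} connections")
for state, n in Counter(r["State"] or "(none)" for r in rows).most_common(10):
    print(f"  state {state!r}: {n}")
for user, n in Counter(r["User"] for r in rows).most_common(10):
    print(f"  user {user!r}: {n}")
```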
[02:02:13] the tools-db-usage tool was apparently trying for the last hour to refresh its data
[02:02:40] I just killed the hung job and am restarting the webservice...
[02:03:07] it queries the information_schema space and may have locked things :/
[02:04:03] interactive session from Toolforge is working for me, anything better on the server?
[02:04:39] I'm wondering if it never came up right after switching to RW
[02:04:58] I cannot get this query to finish
[02:05:16] I have an interactive shell on mysql
[02:05:28] But I have to kill "show engine innodb status"
[02:05:32] !log tools.tools-db-usage Disabled cache refresh cron
[02:05:33] bd808: Unknown project "tools.tools-db-usage"
[02:05:49] !log tool.tools-db-usage Disabled cache refresh cron
[02:05:49] bd808: Unknown project "tool.tools-db-usage"
[02:05:52] processlist now up to 706 rows
[02:05:53] liar!
[02:06:12] !log tools.tool-db-usage Disabled cache refresh cron
[02:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.tool-db-usage/SAL
[02:06:17] We are heading for a lockout again
[02:06:33] arrgh
[02:06:57] `| 71572 | s53941 | 10.68.23.58:56852 | NULL | Query | 158 | Filling schema table | SHOW SESSION STATUS LIKE 'Ssl_cipher' | 0.000 |` ?
[02:07:06] the St Valentine's Day ToolsDB massacre
[02:07:49] | 71224 | root | localhost | NULL | Killed | 346 | init | show engine innodb status | 0.000 |
[02:07:53] That's my query
[02:07:58] It's a zombie
[02:09:41] That weird query on show session status is from wd-depicts
[02:11:11] 740
[02:13:32] That's the number of connections
[02:13:51] 747 to give an idea of the growth
[02:14:05] If I kill anything, it just goes zombie
[02:14:10] I honestly don't know what to do with this
[02:14:46] Works better in readonly mode :0
[02:15:02] Hmm...looking around more
[02:15:07] I'm going to guess that without per-tool concurrency limits folks just keep hitting refresh on tools that are blocked which opens a new connection which adds to the problem. repeat until force quit/db crash
[02:15:54] Oracle has a shiny fix for this in server side connection pooling
[02:16:10] but in our world even that would not be a real fix
[02:16:33] we need containerized per-tool dbs with hard resource limits
[02:17:08] or a multi-million dollar datastore like AWS and GCE have
[02:22:21] Well, the thing is that nobody's queries are working...so they never die. Naturally they just pile up until we are all locked out completely
[02:22:32] Then the alarms go off. Really, the problem is already there
[02:22:45] I just don't know what locked it up
[02:22:54] We've cut off at least one avenue already
[02:25:43] !log tools.wd-depicts Restarted webservice; investigating toolsdb query issues
[02:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-depicts/SAL
[02:26:07] 784 connections
[02:28:42] If I restart or kill connections, they hang.
[03:43:17] !log tools.admin Restarted webservice
[03:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[03:50:24] !log admin updated VPS base images for Jessie and Stretch, now featuring Stretch 9.7
[03:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
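An aside on the point at 02:15 about per-tool concurrency limits: until something enforces a limit server-side, a tool can impose its own hard cap with a tiny client-side connection pool, so refresh storms queue inside the tool instead of opening ever more ToolsDB connections. This is a sketch of the idea, not an existing Toolforge facility; the cap, host, and credentials path are illustrative.

```python
# Client-side cap on ToolsDB connections for a single tool: at most MAX_CONNS
# are ever open; extra requests wait for a free connection instead of piling
# new ones onto the server. Illustrative sketch only.
import os.path
import queue
from contextlib import contextmanager

import pymysql

MAX_CONNS = 2  # hard per-tool cap (illustrative)

_pool = queue.Queue(maxsize=MAX_CONNS)
for _ in range(MAX_CONNS):
    _pool.put(
        pymysql.connect(
            host="tools.db.svc.eqiad.wmflabs",
            read_default_file=os.path.expanduser("~/replica.my.cnf"),  # hypothetical path
        )
    )

@contextmanager
def toolsdb():
    """Borrow a pooled connection; block (don't open a new one) if all are busy."""
    conn = _pool.get()  # waits here during a refresh storm
    try:
        conn.ping(reconnect=True)  # revive the connection if the server dropped it
        yield conn
    finally:
        _pool.put(conn)

# usage:
# with toolsdb() as conn, conn.cursor() as cur:
#     cur.execute("SELECT 1")
```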
[09:40:27] is it allowed to host tools which send out emails after users signed up to receive them?
[09:40:41] i.e. with consent
[11:24:41] sveta: I would avoid collecting user data, but if using the wiki email feature I don't see a problem
[11:34:43] !log admin T216190 cleanup from nova database `nova service-delete 35`
[11:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:34:46] T216190: Rebuild labvirt1012 and cloudvirt1012 - https://phabricator.wikimedia.org/T216190
[12:02:17] !log admin more nova service cleanups in the database (labvirts that were reallocated to eqiad1)
[12:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:22:17] !log admin T216239 draining labvirt1009 with a command like this: `root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 2c0cf363-c7c3-42ad-94bd-e586f2492321 labvirt1001`
[12:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:22:21] T216239: CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009 - https://phabricator.wikimedia.org/T216239
[12:39:28] I don't know if it's related to the current outages but I get this
[12:39:37] https://www.irccloud.com/pastebin/70Z8EV4J/
[12:39:53] * bawolff as well
[12:40:11] I want to migrate away from trusty
[12:42:26] Things seem to be working for me now
[12:54:04] Amir1: bawolff yes, server was rebooted because I reallocated it to another virt server
[12:54:18] okay thanks!
[13:10:38] !log admin T216239 labvirt1019 has been drained
[13:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:10:45] T216239: CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009 - https://phabricator.wikimedia.org/T216239
[13:15:03] !log tools.mrmetadata migrating the webservice to stretch+k8s
[13:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.mrmetadata/SAL
[14:39:31] Seems like tools-db will be overloaded soon. Same as yesterday :-(
[14:44:07] Can confirm
[14:58:31] My team has a number of tools that use that DB. Is there a way we could audit the query load and determine if problematic load is coming from our tools?
[14:58:43] My team = Community Tech
[14:59:02] I'm looking for ways that we as users of the DB can help.
[15:00:41] Yesterday I said here in the chat: "I had such behaviour when the disk had bad blocks …" but they do not believe me
[15:27:30] Wurgl: I would hope that the file storage system they have (RAID or whatever system) would be more robust than one drive having bad blocks
[15:27:38] causing all the problems
[15:28:12] Wurgl: but agree that this is really putting a damper on productivity
[15:30:04] we are working on a HW replacement/alternative for toolsdb
[15:36:04] Thanks arturo.
[15:37:13] * bstorm_ getting out of bed
[15:37:57] yeah, it is a mess, but we have a plan :)
[16:28:06] !log tools.wmde-access moved cronjob from trusty to stretch (following [[wikitech:News/Toolforge Trusty Move a cron job]])
[16:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wmde-access/SAL
[19:19:05] Are we still in a really bad state? Access to my Wikidata game returning 502 Bad Gateway
[19:19:10] https://tools.wmflabs.org/wd-depicts/index.html
[19:19:16] 502 Bad Gateway
[19:20:33] chicocvenancio: good idea, thank you
[19:23:43] and I assume the ops team knows that PAWS is also known to be dead? 504 Gateway Time-out
[19:29:16] fuzheado: yep, cloud team is aware, it's still the toolsdb problem
[19:29:30] OK, thanks. Just wanted to make sure it wasn't me.
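An aside on the Community Tech question at 14:58 about auditing load from your own tools: each tool connects to ToolsDB with its own sNNNNN credential user, so filtering the processlist to that user shows which of your connections are open and how long their queries have been running. Below is a minimal sketch; the user name and credentials path are illustrative.

```python
# List this tool's own ToolsDB connections, longest-running first, so a
# maintainer can spot stuck or runaway queries. Illustrative sketch only.
import os.path

import pymysql

TOOL_DB_USER = "s53941"  # illustrative: the sNNNNN user from the tool's replica.my.cnf

conn = pymysql.connect(
    host="tools.db.svc.eqiad.wmflabs",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # hypothetical path
    cursorclass=pymysql.cursors.DictCursor,
)
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT ID, TIME, STATE, LEFT(INFO, 120) AS QUERY "
            "FROM information_schema.PROCESSLIST "
            "WHERE USER = %s ORDER BY TIME DESC",
            (TOOL_DB_USER,),
        )
        for row in cur.fetchall():
            print(row["ID"], row["TIME"], row["STATE"], row["QUERY"])
finally:
    conn.close()
```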
[20:13:47] !help
[20:13:47] SoWhy: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[20:23:24] !help I'm having a problem with ToolsDB. I wrote a script that accidentally didn't close the connection. Now I'm getting: `pymysql.err.InternalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')`
[20:23:24] audiodude: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[20:24:07] audiodude: toolsdb is pretty broken today, so step one is probably to wait until Monday or Tuesday and try again
[20:24:44] okay sounds reasonable. Yes I saw the emails and I couldn't piece together what exactly was affected, so I guess that makes sense.
[20:26:58] I got an email about "Trusty" today but I cannot follow the steps outlined in the FAQ. Is this related possibly?
[20:28:33] SoWhy: related to what?
[20:28:57] to what andrew said about toolsdb being broken
[20:29:05] it's not related
[20:29:39] what step can you not follow? I can explain it if you want
[20:29:49] okay. The email told me to switch servers or something
[20:29:52] yes
[20:30:01] I logged into the trusty and stopped the webserver
[20:30:27] but when I log into stretch and try to start it, it tells me "Looks like you already have another webservice running, with a gridengine backend
[20:30:29] is it the 'sowhy' tool?
[20:30:42] yes (creative, I know)
[20:31:13] https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#'webservice_stop'_says_service_is_not_running,_but_'webservice_start'_says_service_is_running <= does this description match your problem?
[20:32:25] it does and it works
[20:32:35] I will now bow my head in shame
[20:32:47] lol
[20:32:58] really though, I can't believe I missed that.
[20:33:13] it's fine
[20:35:10] I've been a Wikipedian for 14 years now but that toolforge stuff makes me look like a newbie every day^^
[20:41:32] anyways, I'm off again now that that's fixed. Thanks for the assist zhuyifei1999_!
[20:43:37] (np)
[21:35:22] https://pastebin.com/BMgjZs8x I'm getting this on my tool when trying to process OAuth login for the metawiki on beta cluster
[21:46:22] DatGuy: 172.16.5.23 is deployment-db03.deployment-prep.eqiad.wmflabs.
[21:46:32] maybe ask in release engineering?
[21:46:43] -releng
[21:59:23] DatGuy: the database servers for deployment-prep (beta cluster) were knocked offline by the first hardware outage we had this week. I don't know if anyone in RelEng or the community has had time to fix them yet or not. That error looks like there is at least some work remaining
[22:00:20] alright. I'll notify in -releng just to make sure they know
[22:00:39] yeah, beta meta is offline
[22:03:23] although at least one of the replicas is up
[22:03:36] so it seems like mediawiki is failing worse there than it should?
[22:09:52] filed T216287
[22:09:53] T216287: MediaWiki should work in readonly mode when the master is down - https://phabricator.wikimedia.org/T216287
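An aside on the connection leak reported at 20:23: a script that leaves a connection open with an uncommitted transaction can hold InnoDB row locks until the server times the session out, and later writers touching the same rows then fail with error 1205 ('Lock wait timeout exceeded'). Below is a minimal pattern that always finishes the transaction and closes the connection; the database, table, and query names are illustrative.

```python
# Always finish the transaction and close the connection, even on error, so
# an aborted script doesn't leave InnoDB row locks behind for other writers
# to time out on (error 1205). Database/table/query names are illustrative.
import os.path

import pymysql

conn = pymysql.connect(
    host="tools.db.svc.eqiad.wmflabs",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # hypothetical path
    database="s53941__example_p",  # illustrative tool database name
)
try:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE items SET processed = 1 WHERE id = %s",  # illustrative query
            (42,),
        )
    conn.commit()      # release the row locks on success...
except Exception:
    conn.rollback()    # ...and on failure, instead of leaving them held
    raise
finally:
    conn.close()
```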
[22:49:34] !log clouddb-services created mariadb security group and lvs for a new database T193264
[22:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[22:49:37] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T193264
[23:42:28] !log clouddb-services Added BryanDavis (self), Arturo Borrero Gonzalez, Marostegui, and Jcrespo as admins in project
[23:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL