[05:00:12] we seem to be running out of space on xtools-prod01.xtools.eqiad.wmflabs but `df` says there's plenty on `/`... how should we be checking for low disk space?
[05:00:57] (we're getting `file_put_contents(/var/www/var/cache/prod/pools/...): failed to open stream: No space left on device`)
[05:30:36] samwilson: I agree that there looks to be plenty of space there. And if I manually create a big file with dd it works fine. So I expect that either this was transient (and the drive was really filled up briefly) or the error message is in error and it's encountering some other failure that is reporting incorrectly.
[05:31:00] (that is to say, df should be a valid way to check free space)
[05:31:32] andrewbogott: cool. that's good to know. yeah, i guess it was just a short lived thing
[05:32:06] our fix is ultimately going to be shifting some more caching to redis, but for now nuking some of the cache seemed to get things working again
[06:47:06] bd808: lol
[06:47:07] I wonder who set them in -labs.
[06:47:07] there are also those about vim and stuffs
[09:57:31] jynus: I sent you two memos regarding https://phabricator.wikimedia.org/T172143. not sure if you saw them
[10:03:43] memos?
[10:05:17] zhuyifei1999_: you seem to have a problem with quarry, you should ping the maintainers of that tool, I am not one of those
[10:05:43] * zhuyifei1999_ is the maintainer, sigh
[10:05:58] memoserv
[10:06:01] * zhuyifei1999_ help
[10:06:05] uh
[10:06:16] ^ was /ms help
[10:06:40] whatever you are trying to do, which I do not understand, "sudo mysql quarry" seems like a really bad idea
[10:07:02] why?
[10:07:36] I didn't do a copy of .my.cnf of root's
[10:08:17] mysql client should never run as root; but as I said, I do not know what you are trying to achieve
[10:08:20] the thing basically commits a new row to the mariadb, then reads the row id and sends it to redis
[10:08:54] on another host, a process reads the id of the row and tries to find that row
[10:09:11] it gets a NoResultFound
[10:09:33] "the mariadb"
[10:09:36] ?
[10:09:51] quarry's internal database on quarry-main-01
[10:10:45] 5.5.39-MariaDB-0ubuntu0.14.04.1
[10:12:12] the relevant code: https://phabricator.wikimedia.org/diffusion/ANQW/browse/master/quarry/web/app.py;62676f2805b4b5bb5ca714140cd5c12dea5b9478$247 https://phabricator.wikimedia.org/diffusion/ANQW/browse/master/quarry/web/worker.py;62676f2805b4b5bb5ca714140cd5c12dea5b9478$50
[10:12:20] ok, I do not admin that, so with more reason I cannot help with it- you should ask on the mail list for programming help
[10:13:04] I think they may help you there (not sure): https://lists.wikimedia.org/mailman/listinfo/cloud
[10:13:46] all those things you wrote (e.g. the link to the code) should be on the ticket
[10:13:48] not some programming help, but the question is: in what cases would a connection to mariadb fail to find a row committed by another connection?
[10:14:19] 2 main options-
[10:14:26] (the select happens almost immediately after commit)
[10:14:27] the row was not really committed
[10:14:55] or the second connection has an ongoing transaction and is on a repeatable read or similar isolation level
[10:15:03] hmm
[10:15:23] would a failed commit generate an id?
[10:15:25] rows committed after a transaction has started
[10:15:41] I mean the auto-increment id
[10:15:47] yes, that can happen
[10:15:57] those are returned before a commit
[10:16:13] hmm
[10:16:18] e.g. BEGIN; INSERT; COMMIT <--- commit fails
[10:16:51] your problem is relatively easy to rebug
[10:16:58] I'll try either and see which one is the root cause
[10:17:02] rebug?
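A minimal sketch of the point made between 10:15:23 and 10:16:18: an auto-increment id is handed out at INSERT time, before the commit, so a transaction that never commits can leave behind an id that points at no row. This is an illustration under assumptions, not Quarry's code: it uses Python's built-in `sqlite3` module in place of MariaDB, and the table name `query_run` and the function `insert_then_rollback` are hypothetical.

```python
import sqlite3


def insert_then_rollback():
    """Insert a row, read its auto-increment id, then roll back.

    Returns (new_id, row), where row is whatever a later lookup by that id finds.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical table standing in for the real one.
    conn.execute(
        "CREATE TABLE query_run (id INTEGER PRIMARY KEY AUTOINCREMENT, status TEXT)"
    )
    conn.commit()

    # The id is available as soon as the INSERT has run, before any commit.
    cur = conn.execute("INSERT INTO query_run (status) VALUES ('queued')")
    new_id = cur.lastrowid

    # Simulate "BEGIN; INSERT; COMMIT <--- commit fails" by rolling back.
    conn.rollback()

    # The id was already handed out, but no row with that id exists.
    row = conn.execute(
        "SELECT id FROM query_run WHERE id = ?", (new_id,)
    ).fetchone()
    conn.close()
    return new_id, row
```

Calling `insert_then_rollback()` gives back an id together with `None` for the row. A worker that received such an id through redis would look it up and find nothing, which matches the first of the two options listed in the chat.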
[10:17:11] (debug) slow down the execution
[10:17:21] and check the database contents on every step
[10:18:30] it happens very rarely https://quarry.wmflabs.org/query/runs/all, try to find an old 'queued', like in https://quarry.wmflabs.org/query/runs/all?limit=50&from=21319
[10:18:44] you may have a race condition then
[10:18:59] yes, that's the title of the ticket
[10:19:11] asynchronous execution is not easy
[10:19:25] your commit may be taking too much time
[10:19:34] or something else
[10:19:42] maybe inserts fail sometimes
[10:19:45] sqlalchemy's commit returned...
[10:19:54] independently of the specific bug
[10:20:03] I would try the id
[10:20:24] if it fails, wait e.g. 1 second, and try once more
[10:20:33] hmm
[10:20:54] check the transaction logic
[10:21:18] when things are solidified on the database (COMMIT) and if you have long running open connections to it
[10:21:57] log the queries executed- you can enable the general log on the database
[10:22:26] (not for long time on production, but for debugging) general_log = 1
[10:22:39] that will tell you what queries are happening on the database
[10:22:51] where do you set that?
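The suggestion at 10:20:03 to 10:20:24 (look the id up, and if that fails wait about a second and try once more) could be sketched roughly as follows. This is a hedged illustration, not the fix that was deployed: `find_with_retry` and its `lookup` callable are hypothetical names, and `lookup` is assumed to return `None` when the row is missing instead of raising `NoResultFound`.

```python
import time


def find_with_retry(lookup, row_id, attempts=2, delay=1.0, sleep=time.sleep):
    """Call lookup(row_id) up to `attempts` times.

    Sleeps `delay` seconds between tries and returns the first non-None
    result, or None if every attempt comes back empty.
    """
    for attempt in range(attempts):
        row = lookup(row_id)
        if row is not None:
            return row
        # Only wait if another attempt is still to come.
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

If the worker's connection is sitting inside a long-running transaction at a repeatable read isolation level, the second attempt would see the same snapshot as the first, so `lookup` would also need to end that transaction (for example with a rollback) before querying again.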
[10:22:54] as I said, I cannot help you, but that should point you in the right direction
[10:23:16] you should have a /etc/my.cnf or /etc/mysql/my.cnf or similar file, then restart mysql
[10:23:56] yeah, the latter exists, thanks
[10:23:59] https://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_general_log
[10:24:57] remember to turn it off and restart after that again, or it will eat all your disk
[10:25:12] it should log all connections and queries executed
[10:26:42] k thanks
[10:52:32] !log quarry deploying d9cc1c8 to quarry-runner-01 & 02 T172143
[10:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[10:52:35] T172143: Weird race condition makes query stuck in queued forever - https://phabricator.wikimedia.org/T172143
[11:28:18] !log tools.wikibug restarted everything because it was being weird
[11:28:18] legoktm: Unknown project "tools.wikibug"
[11:28:25] !log tools.wikibugs restarted everything because it was being weird
[11:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[11:38:49] is there a task/plan on debian stretch in tool labs? specifically for PHP 7 since MW will require it soon(TM)
[11:38:58] aahhh, I mean toolforge
[14:22:03] (PS1) Filippo Giunchedi: secrets: create digicert 2017 empty certs [labs/private] - https://gerrit.wikimedia.org/r/401504
[14:23:27] (CR) Filippo Giunchedi: [V: 2 C: 2] secrets: create digicert 2017 empty certs [labs/private] - https://gerrit.wikimedia.org/r/401504 (owner: Filippo Giunchedi)
[15:44:25] legoktm: Grid engine vs. Debian is something we talk about pretty often. Even moving to Jessie will involve scary changes to our OGE versions… It's definitely a thing on our radar but will be ugly.
[15:58:38] legoktm: I don't think we have made a task for adding Stretch yet, but it is something we are thinking about. Stretch will probably come to Kubernetes first.
We have some things to work out about Grid Engine and OpenStack before we can move the main VMs
[16:07:54] legoktm: possibly related, we should look into putting wikibugs back on k8s. I remember vaguely that it was set up there sort of weirdly at one point which led to it being moved back to the grid.
[16:08:20] stashbot is on k8s which may or may not give us a template to follow to move wikibugs
[16:08:20] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[17:17:59] chasemp: You about to discuss the racking of https://phabricator.wikimedia.org/T178937 (labvirt102[12])
[17:18:12] robh: heyo, sure
[17:18:14] I'm creating the racking task now, just wanted to confirm all the (usually the same) settings
[17:18:36] so row B is still required for the proper vlan (same as rest of labvirt) right?
[17:18:42] robh: atm yes
[17:18:43] and otherwise make these the same as the rest
[17:18:49] (see, real easy for this!)
[17:18:50] ;]
[17:18:50] sidenote hopefully we move beyond that this year
[17:18:53] heh
[17:18:57] yes please
[17:48:50] Hey cloud folks. Any of y'all interested in traveling to https://meta.wikimedia.org/wiki/Wiki_Education_Brazil/Projects/Hackathon_2018 ?
[17:49:46] Looks like it'll be in March
[17:49:54] Date not set yet
[17:50:10] halfak: I imagine so but I don't know how it will fit in w/ the travel budget
[17:50:45] Let's talk. I *might* have some extra budget we could use if y'all don't have it.
[17:51:03] halfak: do you know if they are targeting early or late march?
[19:34:53] Nemo_bis: hello! please see T183971
[19:34:53] T183971: dumps project is using 2T of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183971
[19:38:24] !help Hi, happy holidays.
Please help me with https://phabricator.wikimedia.org/T183933
[19:38:24] Zoranzoki21: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[19:40:05] Hi madhuvishy, wasn't there an ongoing discussion on another ticket already?
[19:41:56] How to install pip on toolforge?
[19:42:07] Nemo_bis: ah yes https://phabricator.wikimedia.org/T174468
[19:42:58] i've merged them both
[19:45:24] Zoranzoki21: you cannot install pip on the bastions, please use a venv in your tool project directory
[19:45:55] What?
[19:45:57] What t
[19:46:00] *to use?
[19:46:02] And how?
[19:46:11] I have to install requirements for pywikibot
[19:46:26] So I can run the bot from there
[19:47:12] are you following https://www.mediawiki.org/wiki/Manual:Pywikibot/Installation/Toolforge?
[19:49:33] Zoranzoki21: ^
[19:50:33] Yes.. It is resolved now
[19:51:07] Thank you
[21:58:24] Hi all, I've tried to create a diffusion repo via toolsadmin.wikimedia.org but it throws 500...
[21:58:38] The repository name was tool-weapon-of-mass-description
[21:58:50] Urbanecm: sadly a known bug that I have not deployed the fix for yet
[21:59:33] bd808, ok, should I wait for the fix to be deployed or have someone create the repo for me?
[21:59:38] T182142
[21:59:38] T182142: Diffusion repository creation fails via toolsadmin - https://phabricator.wikimedia.org/T182142
[22:00:28] Urbanecm: you can do either. There is a manual process to register a repo with a tool if you have someone make you one outside of toolsadmin
[22:00:47] I have a github repo currently
[22:01:15] I'll try to deploy the fix for this "soon" (maybe tomorrow, definitely by end of day Thursday)
[22:01:50] That's great