[00:38:07] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[00:53:51] Data-Services, cloud-services-team, DBA: Identify tools hosting databases on labsdb100[13] and notify maintainers - https://phabricator.wikimedia.org/T175096#3626138 (bd808) New tool https://tools.wmflabs.org/tool-db-usage/ can be used to see this data better. It also decodes the owners where possibl...
[00:54:03] PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:08:09] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:19:05] RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:05:40] PROBLEM - Puppet errors on tools-exec-1417 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[02:18:27] Cloud-Services, cloud-services-team (Kanban), Patch-For-Review: nova-fullstack is losing instances on creation - https://phabricator.wikimedia.org/T165555#3626197 (Andrew) Open>Resolved This has almost totally stopped happening; when it does happen it's usually for a good (but new) reason. S...
[02:45:46] PROBLEM - Puppet errors on tools-exec-1412 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:10:39] RECOVERY - Puppet errors on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:29:04] PROBLEM - Puppet errors on tools-exec-1413 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[03:59:03] RECOVERY - Puppet errors on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:25:44] RECOVERY - Puppet errors on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:28:43] PROBLEM - Puppet errors on tools-exec-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:03:42] RECOVERY - Puppet errors on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:31:51] cloud-services-team (Kanban), Operations: puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3626230 (Joe) If you want to better understand what puppet_ca does on an agent, and why removing it afterwards "doesn't break anything" there are good reads in the puppet docs: - https://docs...
[06:34:21] PROBLEM - Puppet errors on tools-exec-1409 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:50:05] PROBLEM - Puppet errors on tools-bastion-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[07:09:19] RECOVERY - Puppet errors on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:25:04] RECOVERY - Puppet errors on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:47:49] PROBLEM - Puppet errors on tools-exec-1412 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:02:51] PROBLEM - Puppet errors on tools-worker-1019 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[08:33:53] PROBLEM - Puppet errors on tools-exec-1431 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:37:52] RECOVERY - Puppet errors on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:48:30] (PS5) Jean-Frédéric: Use isort to sort imports [labs/tools/heritage] - https://gerrit.wikimedia.org/r/378741
[08:49:17] (CR) Jean-Frédéric: "> Uploaded patch set 5." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/378741 (owner: Jean-Frédéric)
[08:51:55] PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[09:08:54] RECOVERY - Puppet errors on tools-exec-1431 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:26:54] RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:29:39] PROBLEM - Puppet errors on tools-exec-1411 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[10:43:45] PROBLEM - Puppet errors on tools-exec-1412 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:04:37] RECOVERY - Puppet errors on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:09:27] Tool-Matthewrbowker's-tools, User-Matthewrbowker: MATTHEWRBOWKER-8 On-IRC help system for WikiWelcomer - https://phabricator.wikimedia.org/T61067#3626684 (Liuxinyu970226)
[11:11:00] Tool-Article-request, User-Matthewrbowker: Search Component for Article Requests - https://phabricator.wikimedia.org/T59871#3626685 (Liuxinyu970226)
[11:11:49] Tools, XTools, User-Matthewrbowker: wikiviewstats webservice crashing all the time - https://phabricator.wikimedia.org/T122506#3626686 (Liuxinyu970226)
[11:11:50] Tool-Matthewrbowker's-tools, User-Matthewrbowker: MATTHEWRBOWKER-1 Add functionality for !link - https://phabricator.wikimedia.org/T61074#3626687 (Liuxinyu970226)
[11:23:47] RECOVERY - Puppet errors on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:29:54] PROBLEM - Puppet errors on tools-exec-1431 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[12:45:24] (CR) Lokal Profil: [WIP]Group unused images per source page (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379141 (https://phabricator.wikimedia.org/T117327) (owner: Lokal Profil)
[12:46:33] (PS1) Jean-Frédéric: Rename docker-compose files for default use case [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744
[12:57:18] (PS1) Jean-Frédéric: Refactor bin scripts to use default variables [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379745
[13:09:54] RECOVERY - Puppet errors on tools-exec-1431 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:17:37] (PS1) Jean-Frédéric: Use a virtualenv for Python dependencies [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379751 (https://phabricator.wikimedia.org/T176465)
[14:13:46] cloud-services-team (Kanban), Operations: puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3627211 (Andrew) As far as I can see, the docs only describe setting ca_server once, for agents, in the [main] block. I am missing an explanation of why we would set it twice, and what settin...
[14:14:07] Tool-stewardbots, Need-volunteer, WorkType-Maintenance: SULWatcher: avoid logging account creations more than once - https://phabricator.wikimedia.org/T156546#3627212 (MarcoAurelio)
[14:14:32] Tool-stewardbots, WorkType-Maintenance: Delete old data and/or stop logging to stewardbots' SULWatcher SQL DB - https://phabricator.wikimedia.org/T151113#3627213 (MarcoAurelio)
[15:55:26] cloud-services-team (FY2017-18), Goal: Hire first line technical support contractor - https://phabricator.wikimedia.org/T168488#3627555 (Quiddity)
[16:09:10] (CR) Lokal Profil: Use isort to sort imports (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/378741 (owner: Jean-Frédéric)
[16:15:39] (CR) Lokal Profil: "does this also mean we no longer need the local pywikibot install?" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379751 (https://phabricator.wikimedia.org/T176465) (owner: Jean-Frédéric)
[16:18:48] (CR) Jean-Frédéric: "> does this also mean we no longer need the local pywikibot install?" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379751 (https://phabricator.wikimedia.org/T176465) (owner: Jean-Frédéric)
[16:34:40] (CR) Lokal Profil: [C: -1] "thanks for doing this :)" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379745 (owner: Jean-Frédéric)
[16:43:28] (PS2) Jean-Frédéric: Refactor bin scripts to use default variables [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379745
[16:44:47] PROBLEM - Puppet errors on tools-exec-1412 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[16:45:09] (CR) Lokal Profil: [C: -1] Rename docker-compose files for default use case (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744 (owner: Jean-Frédéric)
[16:45:17] (PS2) Jean-Frédéric: Use a virtualenv for Python dependencies [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379751 (https://phabricator.wikimedia.org/T176465)
[16:50:27] (PS2) Jean-Frédéric: Rename docker-compose files for default use case [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744
[16:51:07] (CR) Jean-Frédéric: ">" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744 (owner: Jean-Frédéric)
[16:54:49] (CR) Lokal Profil: [C: 2] Refactor bin scripts to use default variables (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379745 (owner: Jean-Frédéric)
[16:57:41] (Merged) jenkins-bot: Refactor bin scripts to use default variables [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379745 (owner: Jean-Frédéric)
[17:01:10] (CR) jenkins-bot: Refactor bin scripts to use default variables [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379745 (owner: Jean-Frédéric)
[17:03:15] (CR) Lokal Profil: [C: 2] Rename docker-compose files for default use case [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744 (owner: Jean-Frédéric)
[17:04:16] (Merged) jenkins-bot: Rename docker-compose files for default use case [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744 (owner: Jean-Frédéric)
[17:05:07] (CR) jenkins-bot: Rename docker-compose files for default use case [labs/tools/heritage] - https://gerrit.wikimedia.org/r/379744 (owner: Jean-Frédéric)
[17:10:00] !log tools.heritage Deploy latest from Git master: 5510585, 508a947
[17:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL
[17:19:45] RECOVERY - Puppet errors on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:20:35] Data-Services, XTools: s51187 and p50380g50692 database users are generating excessive lag on replica service - https://phabricator.wikimedia.org/T172882#3627827 (russblau) Is it possible to give s51290 read-only access to labsdb1001 until I have a chance to update the troublesome scripts?
[18:46:07] PROBLEM - SSH on tools-paws-worker-1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:09] hi! on an m1.large VPS instance like xtools-prod01.xtools.eqiad.wmflabs, how many simultaneous connections (threads) would it be able to handle?
[19:07:50] I noticed on my local Apache I can only do so many at once
[19:10:56] musikanimal: that likely mostly depends on your apache settings
[19:11:29] mkay yeah, I just ran `ps -ef | grep apache` and got 10 + the root process, so I guess that's the number
[19:11:53] http://oxpedia.org/wiki/index.php?title=Tune_apache2_for_more_concurrent_connections
[19:12:07] beautiful
[19:13:14] "by default 150 connections"...?
[19:13:21] that should be plenty
[19:13:32] but unless you're transferring large files you're likely solving the wrong problem by tuning that value
[19:14:09] yes, unless you do things without timeouts (e.g. sql queries). But you should fix those by setting timeouts, not by allowing more connections to hang for long periods of time
[19:14:51] alright, we do have a query killer. I'm not sure if we're setting timeouts in a config somewhere
[19:15:46] but to your point about more connections hanging... I notice that say, while the Edit Counter is running, I can't load XTools in another tab
[19:15:51] but other users can
[19:16:06] so it's like it's limiting it just to my connection, or that Apache process, perhaps
[19:17:32] I don't really see this as a problem. I don't want people loading the Edit Counter on 10 people in 10 different tabs. They can wait. However we have plans to speed up the Edit Counter by having the app make several MySQL connections, running queries asynchronously
[19:18:29] that's probably your browser, not apache
[19:18:45] interesting
[19:19:06] well I can restart apache and I'm able to load the app again, if that means anything
[19:19:18] try a second browser
[19:19:49] okay yeah, I'll do that. Like I said that's not actually my concern, though. Let me explain... and thank you in advance for your advice...
[19:20:50] we have our main app server. The plan is that for the Edit Counter (which is a very expensive tool), it will use multi curl to make several requests to the "API server" (a different instance), effectively running queries asynchronously
[19:21:12] I've tested this on https://xtools-dev.wmflabs.org and it works great, and that's without the "API server", just async calls to the same app server
[19:21:38] However, on my local, it keeps locking up -- where it gets stuck in a loading state
[19:23:24] maybe there are too many connections for my apache, and it somehow gets stuck. I know the problem is related to this because I can change the code so it only does 2 or 3 queries at one time and it will load
[19:23:51] Even if you would hit the connection limit, the connections would not drop -- they would just wait for previous connections to finish.
[19:23:52] so anyway, it not working on my local is fine, but I obviously don't want this scenario to happen in production
[19:24:17] yeah, that would make sense, but for some reason on my local it doesn't
[19:24:30] PAWS: Implement a 'signing OAuth Proxy' for PAWS - https://phabricator.wikimedia.org/T120469#3628092 (PokestarFan) p:High>Unbreak! This is a problem
[19:24:31] I can `tail` the logs and see that it only ran so many of the queries, and just stopped
[19:24:53] not sure why... but I hope this won't happen on production :/
[19:25:21] Hard to say why this would happen, but I would debug it before pushing it to prod.
[19:25:49] My approach would be to add more debug logging, and/or figuring out how to use xdebug (and then just checking where the processes are hanging)
[19:26:01] I think it's a server-level thing. I put debug output all over the place, and it's as if the 3rd connection is what pushes it over the edge
[19:26:07] if I understand correctly, you have one php script curl another script a dozen times?
[19:26:24] one PHP app that curls itself (but another endpoint), yes
[19:26:56] basically we have an internal API
[19:27:04] only responds to 127.0.0.1
[19:27:41] the fact that I had no issues on xtools-dev is promising, as it is only a m1.small, and it's not using a separate API server
[19:27:46] Sure. But because it's a separate request, it's in a completely different process, so race conditions shouldn't be an issue
[19:28:08] My hunch would then be that the issue is in the way you submit the requests
[19:28:45] they're Guzzle promises, but internally it's the multi curl stuff http://php.net/manual/en/function.curl-multi-init.php
[19:29:07] the promise somehow never resolves
[19:29:10] I guess...
[19:29:29] which is odd because I do have a timeout set for that, and it apparently doesn't get hit
[19:29:57] as in, it doesn't appear to actually time out as it should
[19:30:01] maybe windows vs linux? the notes note some oddities.
[19:30:17] I'm on Mac, but yeah that might be it
[19:30:52] I guess I'm just going to do some load testing on production, and see if it ends up being a problem
[19:31:05] PROBLEM - Puppet errors on tools-exec-1439 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[19:31:19] the curl multi stuff looks complicated to me. Isn't there an easier way to do async web requests with php? :/
[19:31:30] hehe I don't think so, not that I'm aware of
[19:31:42] but that's what Guzzle is for, it does the work for you, supposed to anyway
[19:31:56] https://github.com/guzzle/promises
[19:32:06] Ah, right.
[19:32:47] well I have one more question
[19:33:27] back to how I can't load XTools if I have another tab with XTools still processing...
[19:33:32] this doesn't happen on Toolforge
[19:33:46] any idea why?
[19:34:00] load balancing, perhaps?
[19:34:05] I think it just depends on the number of open connections, and that depends on how the tool is designed.
[19:34:20] If the tool updates the front page by polling the backend, connections are not kept in use
[19:34:36] but if it just uses one connection to push small bits of data every now and then it's continuously in use
[19:35:53] ok, I think that makes sense
[19:36:23] because the new XTools *does* make queries to the replicas on every request (to get replag)
[19:36:27] maybe that's it
[19:36:55] whereas the old XTools does not
[19:38:42] I dunno, like I said that issue I don't really care about. I like that it's making people be patient! hah
[19:39:04] thank you very much for your help! :)
[19:40:15] You're welcome!
[19:50:00] PAWS: Implement a 'signing OAuth Proxy' for PAWS - https://phabricator.wikimedia.org/T120469#3628202 (Multichill) p:Unbreak!>High Restored high priority, see https://www.mediawiki.org/wiki/Phabricator/Help#Setting_task_priority
[20:06:03] RECOVERY - Puppet errors on tools-exec-1439 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:51:05] PROBLEM - Puppet errors on tools-bastion-03 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[21:26:04] RECOVERY - Puppet errors on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:04:31] PROBLEM - Puppet errors on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[22:36:11] Toolforge: Raise tool memory limit - similarity - https://phabricator.wikimedia.org/T176527#3628719 (Surlycyborg)
[22:39:30] RECOVERY - Puppet errors on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:41:00] Toolforge: Raise tool memory limit - similarity - https://phabricator.wikimedia.org/T176527#3628719 (Betacommand) I've done some large scale processing but needing more than 4GB of RAM is a sign of bad programming. Fix that first, don't try to come back later to try and reduce usage then, it's too late
[23:33:13] PROBLEM - Puppet errors on tools-exec-1441 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]