Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generated too large parser.c ~ 83M #59

Open
pplam opened this issue Jul 6, 2022 · 7 comments
Open

Generated too large parser.c ~ 83M #59

pplam opened this issue Jul 6, 2022 · 7 comments

Comments

@pplam
Copy link
Contributor

pplam commented Jul 6, 2022

With all PRs merged, tree-sitter will generate a very large parser.c file (about 83M). And it takes about 1 min to load the parser by tree-sitter python binding.

Unfortunately I don't have an idea at the moment on how to reduce its size 😭

@m-novikov Have you ever encountered a similar problem, or have any suggestions for optimization?

@imotai
Copy link
Contributor

imotai commented Jul 6, 2022

I think we can generate the parser.c when building this project

@pplam
Copy link
Contributor Author

pplam commented Jul 6, 2022

Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.

With compared to the tree-sitter-python parser, the size of the grammar.js file 2x larger (1833 lines vs 967 lines).

But the generated parser.c is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python.

I don't have an idea on this yet 😭

@imotai
Copy link
Contributor

imotai commented Jul 6, 2022

Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.

With compared to the tree-sitter-python parser, the size of the grammar.js file 2x larger (1833 lines vs 967 lines).

But the generated parser.c is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python.

I don't have an idea on this yet 😭

Have you try the release mode ? the follow library produced by cargo build --release seems not so large.
image

@m-novikov
Copy link
Owner

There is related issue in tree-sitter/tree-sitter#1041
While removing parser from the repo is a good stop-gap solution, we would need to investigate the cause, generation times also getting progressively longer.

@pplam
Copy link
Contributor Author

pplam commented Jul 8, 2022

There is related issue in tree-sitter/tree-sitter#1041 While removing parser from the repo is a good stop-gap solution, we would need to investigate the cause, generation times also getting progressively longer.

@m-novikov Yes, I've seen this similar issue too. But there is not much useful information here.

By remove the generated code from the repo, we just ignore this issue.

On the other hand, when a user using tree-sitter-sql, for example, using it in Python code via the tree-sitter python package. The user use it by the following steps:

  1. Build language (for tree-sitter-sql) by the tree-sitter package, it will look for the parser.c and scanner.cc files in the src dir and compiles them;
  2. Create a new tree-sitter.Parser instance and set language to the language built in step 1;
  3. The src/node-types.json and src/grammar.json can also be loaded and used by users;

If we remove the src dir from the repo, the users still need to generate it by themselves. It will take 40s+ to generate this files (for PR#58). Also the size of parser.c is large (83M+), this leads to a long time (about 1min) to build language (the step 1).

So, to solve the problem, we need to do something to reduce the size of generated code. By now, I did some investigations on this.

  1. Collect generated files size for each commit to see which commit increased the size too much. The following are the top commits:
hexsha commit_message grammar.js grammar.json node-types.json parser.c
8d09bf6 Support CTE, CASE, etc. New supportings:   - CTEs syntax;   - DELETE statement;   - ALTER TABLE RENAME COLUMN;   - Conditional expressions;   - LIMIT clause;   - AT TIME ZONE expression; Improvements:   - The ORDER BY clause;   - The JOIN clause;   - The select_subexpression;   - The UPDATE statement;   - Array type;   - $.type: multi-words types (VARYING, PRECISION, WITH/WITHOUT TIME ZONE);   - The $.field_access expression to $.json_access expression; 0.02M, +0.0M, +2.46% 0.14M, +0.0M, +2.21% 0.11M, +0.01M, +10.77% 3.95M, +1.59M, +67.12%
f1a4b8c Improve create function statement & support create trigger statement 0.03M, +0.0M, +19.18% 0.18M, +0.04M, +30.48% 0.12M, +0.01M, +13.91% 11.89M, +7.94M, +200.86%
bf3ea72 Improve function call to support aggregate expressions 0.03M, -0.0M, -0.48% 0.28M, +0.0M, +1.69% 0.15M, +0.0M, +0.82% 21.78M, +2.82M, +14.88%
1cdb6ab Support window definitions & window function calls 0.04M, +0.0M, +4.05% 0.29M, +0.01M, +4.06% 0.16M, +0.01M, +4.31% 25.25M, +3.47M, +15.91%
d68da8a Improved the FROM clause to make it more general 0.04M, +0.0M, +3.28% 0.3M, +0.01M, +3.5% 0.16M, +0.01M, +5.24% 34.1M, +8.85M, +35.07%
84cedb1 Improve the ALTER TABLE statement 0.05M, +0.01M, +21.2% 0.38M, +0.06M, +19.64% 0.22M, +0.04M, +19.67% 38.05M, +2.45M, +6.88%
33a6c85 Improve the INSERT statement 0.05M, +0.0M, +1.97% 0.38M, +0.01M, +1.84% 0.22M, +0.01M, +2.39% 54.43M, +16.26M, +42.61%
42f750d Support more comparison predicates 1. BETWEEN...AND expression; 2. ISNULL and NOTNULL; 3. IS UNKNOWN and IS NOT UNKNOWN; 0.05M, +0.0M, +1.14% 0.39M, +0.0M, +0.78% 0.24M, +0.02M, +10.1% 61.25M, +6.33M, +11.53%
2660004 Support composite values & resolve syntax conflicts 0.05M, +0.0M, +1.15% 0.39M, +0.0M, +1.03% 0.25M, +0.01M, +3.89% 62.51M, +1.26M, +2.05%
87107e9 Support combining queries, JSON comparison operators and type cast as column default 0.05M, +0.0M, +2.02% 0.4M, +0.01M, +1.29% 0.25M, +0.0M, +0.47% 79.69M, +17.18M, +27.48%
5574bf3 Support VACUUM statement & improve alias 0.05M, +0.0M, +2.28% 0.41M, +0.01M, +3.04% 0.25M, +0.0M, +0.69% 82.5M, +2.81M, +3.53%
  1. Then I checked changes of grammar.js for these top commits, and tried simply comment out some rules or clauses to see the effect. Following are the modifications and result sizes I've tried:
>>>>>>>>>> Sizes in the latest version

-rw-r--r--  1 tim  staff   421K Jul  8 21:49 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   264K Jul  8 21:50 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    83M Jul  8 21:50 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter


>>>>>>>>>> With the select statement modified

1. Removed select_statement
-rw-r--r--  1 tim  staff   413K Jul  8 21:39 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   257K Jul  8 21:39 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    29M Jul  8 21:39 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

2. Removed window_clause in select_statement
-rw-r--r--  1 tim  staff   420K Jul  8 21:42 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   263K Jul  8 21:42 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    82M Jul  8 21:42 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

3. Removed from_clause in select_statement
-rw-r--r--  1 tim  staff   421K Jul  8 21:44 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   264K Jul  8 21:45 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    62M Jul  8 21:45 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

4. Removed group_by_clause in select_statement
-rw-r--r--  1 tim  staff   421K Jul  8 21:51 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   263K Jul  8 21:51 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    75M Jul  8 21:51 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

5. Removed window_clause, from_clause, group_by_clause from select_statement
-rw-r--r--  1 tim  staff   420K Jul  8 21:53 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   263K Jul  8 21:53 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    53M Jul  8 21:53 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter


>>>>>>>> With expressions modified

1. Removed select_subexpression from _simple_expression
-rw-r--r--  1 tim  staff   421K Jul  8 22:00 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   259K Jul  8 22:00 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    83M Jul  8 22:00 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

2. Removed select_subexpression from all rules
-rw-r--r--  1 tim  staff   420K Jul  8 22:07 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   259K Jul  8 22:07 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    78M Jul  8 22:07 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

From the above data, we can see that the SELECT statement introduced a large amount of size.

Unfortunately , on how to optimize it, I still have no good idea. Could you guys give me some suggestion?

@pplam
Copy link
Contributor Author

pplam commented Jul 8, 2022

Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.
With compared to the tree-sitter-python parser, the size of the grammar.js file 2x larger (1833 lines vs 967 lines).
But the generated parser.c is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python.
I don't have an idea on this yet 😭

Have you try the release mode ? the follow library produced by cargo build --release seems not so large. image

It seems that the parser.c, scanner.cc etc. files are not generated by this manner, but those file are required when using the tree-sitter-sql, at least in my case (use it in Python). Please take a look at my previous comment for more details.

@m-novikov
Copy link
Owner

parser.c is generated file, but scanner.cc is not. Use in python implies that user run tree-sitter generate before. You can see it in #63 binding.gyp in this PR has a step to generate files required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants