Faster recursive loading of indexed content #61

josojo · 2023-09-14T07:17:45Z

Currently, loading traced repos ( by running TracedRepo.load_from_disk) of simple repos like the https://github.com/yangky11/lean4-example takes quite some time on my machine.

I reduced the time by a factor of nearly 10x by loading only the needed traced files. This means we first only read the traced files of the main repo and then only the traced libraries that we really depend on when building the dependency graph.

This PR is quite important for me as it allows me to iterate much quicker on the code, as simple tests take much shorter. If we use the same tracing technique, then the unit tests would run on small machines at reasonable times. But this is for a future PR.

Since this recursive loading is not guaranteed to be faster(it's not so easy to parallelize), I put the recursive loading behind a FLAG.

yangky11 · 2023-09-15T02:02:13Z

That's great, thanks! I've been a bit swamped with LeanInfer recently and will take a look when I get a chance.

josojo · 2023-09-15T13:40:17Z

@CodiumAI-Agent /review

CodiumAI-Agent · 2023-09-15T13:41:09Z

PR Analysis

🎯 Main theme: Improving the loading speed of traced repositories by implementing a recursive loading strategy.
📝 PR summary: This PR aims to improve the loading speed of traced repositories by loading only the necessary traced files. It introduces a new environment variable to control whether to use this recursive loading strategy. The changes are mainly in the 'traced_data.py' and 'constants.py' files.
📌 Type of PR: Enhancement
🧪 Relevant tests added: No
🔒 Security concerns: No security concerns found

PR Feedback

💡 General suggestions: The PR is well-structured and the changes are clear. The addition of the environment variable to control the loading strategy is a good approach. However, it would be beneficial to add some tests to ensure that the new loading strategy works as expected and does not introduce any bugs.
🤖 Code feedback:
- relevant file: src/lean_dojo/data_extraction/traced_data.py
  suggestion: Consider handling the case where the XML file does not exist when loading traced dependencies recursively. This can be done by checking if the file exists before trying to load it. [important]
  relevant line: '+ new_traced_file = TracedFile.from_xml(root_dir, xml_to_index, repo)'
- relevant file: src/lean_dojo/data_extraction/traced_data.py
  suggestion: It would be better to use os.path.join() for constructing file paths. This would ensure that the code is platform independent. [medium]
  relevant line: '+ str(root_dir) + "/" + dep_path_str.replace("/src/", "/lib/")'
- relevant file: src/lean_dojo/data_extraction/traced_data.py
  suggestion: The assertion len(json_files) == self.traced_files_graph.number_of_nodes() is skipped when LOAD_TRACED_DEPENDENCIES_RECURSIVELY is True. If this assertion is important, consider finding a way to keep it for both cases. [medium]
  relevant line: '+ if not LOAD_TRACED_DEPENDENCIES_RECURSIVELY:'

How to use

Tag me in a comment '@CodiumAI-Agent' and add one of the following commands:
/review [-i]: Request a review of your Pull Request. For an incremental review, which only considers changes since the last review, include the '-i' option.
/describe: Modify the PR title and description based on the contents of the PR.
/improve [--extended]: Suggest improvements to the code in the PR. Extended mode employs several calls, and provides a more thorough feedback.
/ask <QUESTION>: Pose a question about the PR.
/update_changelog: Update the changelog based on the PR's contents.

To edit any configuration parameter from configuration.toml, add --config_path=new_value
For example: /review --pr_reviewer.extra_instructions="focus on the file: ..."
To list the possible configuration parameters, use the /config command.

yangky11 · 2023-09-16T16:14:18Z

Merging to dev and will update main after testing.

yangky11 · 2023-09-16T16:30:11Z

src/lean_dojo/data_extraction/traced_data.py

+                xml_to_index = (
+                    str(root_dir) + "/" + dep_path_str.replace("/src/", "/lib/")
+                ).replace(".lean", ".trace.xml")
+                new_traced_file = TracedFile.from_xml(root_dir, xml_to_index, repo)


What if this new_traced_filedepends on another file? Maybe we should add it to traced_files to make sure it is processed recursively?

You are correct that it alters the returning dependency tree. I think we should change it and add the new_traced_file into the for loop. Good catch.

But at the same time, I am still wondering at which point we really need second-order dependencies. If I want to prove a theorem 2, that uses theorem 1, I am not interested in the premises of theorem 1, right? However, I don't know the code well enough to decide whether all relevant information can be retrieved from first-order dependencies.

yangky11 · 2023-09-16T16:31:15Z

@josojo Please see the comment in the code. Since it has been merged into dev, I can make necessary changes in the code, but I want to check with you first.

yangky11 · 2023-09-16T17:43:01Z

@josojo Could you please take a look at #63 to see if it's reasonable? Thanks!

josojo added 3 commits September 14, 2023 00:39

Faster recurisve loading of indexed content

d4adeee

linting

af491da

more secure way to load the env variable

6385cf8

yangky11 changed the base branch from main to dev September 16, 2023 16:13

yangky11 merged commit 335f987 into lean-dojo:dev Sep 16, 2023
1 check passed

yangky11 reviewed Sep 16, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster recursive loading of indexed content #61

Faster recursive loading of indexed content #61

josojo commented Sep 14, 2023

yangky11 commented Sep 15, 2023

josojo commented Sep 15, 2023

CodiumAI-Agent commented Sep 15, 2023

yangky11 commented Sep 16, 2023

yangky11 Sep 16, 2023 •

edited

Loading

josojo Sep 16, 2023

yangky11 commented Sep 16, 2023

yangky11 commented Sep 16, 2023

Faster recursive loading of indexed content #61

Faster recursive loading of indexed content #61

Conversation

josojo commented Sep 14, 2023

yangky11 commented Sep 15, 2023

josojo commented Sep 15, 2023

CodiumAI-Agent commented Sep 15, 2023

PR Analysis

PR Feedback

How to use

yangky11 commented Sep 16, 2023

yangky11 Sep 16, 2023 • edited Loading

Choose a reason for hiding this comment

josojo Sep 16, 2023

Choose a reason for hiding this comment

yangky11 commented Sep 16, 2023

yangky11 commented Sep 16, 2023

yangky11 Sep 16, 2023 •

edited

Loading