Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster recursive loading of indexed content #61

Merged
merged 3 commits into from
Sep 16, 2023

Conversation

josojo
Copy link
Contributor

@josojo josojo commented Sep 14, 2023

Currently, loading traced repos ( by running TracedRepo.load_from_disk) of simple repos like the https://github.com/yangky11/lean4-example takes quite some time on my machine.

I reduced the time by a factor of nearly 10x by loading only the needed traced files. This means we first only read the traced files of the main repo and then only the traced libraries that we really depend on when building the dependency graph.

This PR is quite important for me as it allows me to iterate much quicker on the code, as simple tests take much shorter. If we use the same tracing technique, then the unit tests would run on small machines at reasonable times. But this is for a future PR.

Since this recursive loading is not guaranteed to be faster(it's not so easy to parallelize), I put the recursive loading behind a FLAG.

@yangky11
Copy link
Member

That's great, thanks! I've been a bit swamped with LeanInfer recently and will take a look when I get a chance.

@josojo
Copy link
Contributor Author

josojo commented Sep 15, 2023

@CodiumAI-Agent /review

@CodiumAI-Agent
Copy link

PR Analysis

  • 🎯 Main theme: Improving the loading speed of traced repositories by implementing a recursive loading strategy.
  • 📝 PR summary: This PR aims to improve the loading speed of traced repositories by loading only the necessary traced files. It introduces a new environment variable to control whether to use this recursive loading strategy. The changes are mainly in the 'traced_data.py' and 'constants.py' files.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: No
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions: The PR is well-structured and the changes are clear. The addition of the environment variable to control the loading strategy is a good approach. However, it would be beneficial to add some tests to ensure that the new loading strategy works as expected and does not introduce any bugs.

  • 🤖 Code feedback:

    • relevant file: src/lean_dojo/data_extraction/traced_data.py
      suggestion: Consider handling the case where the XML file does not exist when loading traced dependencies recursively. This can be done by checking if the file exists before trying to load it. [important]
      relevant line: '+ new_traced_file = TracedFile.from_xml(root_dir, xml_to_index, repo)'

    • relevant file: src/lean_dojo/data_extraction/traced_data.py
      suggestion: It would be better to use os.path.join() for constructing file paths. This would ensure that the code is platform independent. [medium]
      relevant line: '+ str(root_dir) + "/" + dep_path_str.replace("/src/", "/lib/")'

    • relevant file: src/lean_dojo/data_extraction/traced_data.py
      suggestion: The assertion len(json_files) == self.traced_files_graph.number_of_nodes() is skipped when LOAD_TRACED_DEPENDENCIES_RECURSIVELY is True. If this assertion is important, consider finding a way to keep it for both cases. [medium]
      relevant line: '+ if not LOAD_TRACED_DEPENDENCIES_RECURSIVELY:'

How to use

Tag me in a comment '@CodiumAI-Agent' and add one of the following commands:
/review [-i]: Request a review of your Pull Request. For an incremental review, which only considers changes since the last review, include the '-i' option.
/describe: Modify the PR title and description based on the contents of the PR.
/improve [--extended]: Suggest improvements to the code in the PR. Extended mode employs several calls, and provides a more thorough feedback.
/ask <QUESTION>: Pose a question about the PR.
/update_changelog: Update the changelog based on the PR's contents.

To edit any configuration parameter from configuration.toml, add --config_path=new_value
For example: /review --pr_reviewer.extra_instructions="focus on the file: ..."
To list the possible configuration parameters, use the /config command.

@yangky11 yangky11 changed the base branch from main to dev September 16, 2023 16:13
@yangky11
Copy link
Member

Merging to dev and will update main after testing.

@yangky11 yangky11 merged commit 335f987 into lean-dojo:dev Sep 16, 2023
1 check passed
xml_to_index = (
str(root_dir) + "/" + dep_path_str.replace("/src/", "/lib/")
).replace(".lean", ".trace.xml")
new_traced_file = TracedFile.from_xml(root_dir, xml_to_index, repo)
Copy link
Member

@yangky11 yangky11 Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if this new_traced_filedepends on another file? Maybe we should add it to traced_files to make sure it is processed recursively?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct that it alters the returning dependency tree. I think we should change it and add the new_traced_file into the for loop. Good catch.

But at the same time, I am still wondering at which point we really need second-order dependencies. If I want to prove a theorem 2, that uses theorem 1, I am not interested in the premises of theorem 1, right? However, I don't know the code well enough to decide whether all relevant information can be retrieved from first-order dependencies.

@yangky11
Copy link
Member

@josojo Please see the comment in the code. Since it has been merged into dev, I can make necessary changes in the code, but I want to check with you first.

@yangky11
Copy link
Member

@josojo Could you please take a look at #63 to see if it's reasonable? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants