Project structure #69

josojo · 2023-09-23T11:31:57Z

I have now played with LeanDojo for two weeks, and the work done is really nice!

One has to recognize that it was built for the specific requirements of ReProver, and it might not be the best fit for a general framework that enables extracting data for prompt generation plus AI result testing via lean-REPL.

Some implications are:

- Tracing goes through all dependencies of a project and indexes all of them, which takes very long, even though this is only needed to determine all possible premises that could be applied in a theorem for ReProver. This is not a standard requirement for most other projects.
- Tests take very long to run (over a day for some of them), which prevents quick iteration on the code.
- Starting most scripts has a big overhead. E.g., the dojo component (for interacting with the very nice REPL env) requires a traced repo, although this is fundamentally not needed.
- The mixture of Lean 4 and Lean 3 code makes maintenance quite hard: fixing type checks is nontrivial due to convoluted types.

To better suit the general goal of building a framework that enables extracting data for prompt generation plus AI result testing via lean-REPL, I would propose the following restructurings:

1) Currently most processes depend on a whole LeanGitRepo. I think we should change this to a minimal working example (MWE, see https://leanprover-community.github.io/mwe.html). The reason is that an MWE is the smallest piece needed to run the current Lean extraction and REPL scripts.
2) Interactions: If one wants to interact with a theorem from mathlib4, then with the structure given by 1) one would no longer parse the whole of mathlib4, but use a small script to build an MWE from the mathlib4 file that contains the theorem. This MWE is then used to interact with the REPL. An e2e test would then only cost 8 seconds according to my measurements and prototype.
3) Tracing: By default, tracing should not build the whole dependency tree and search through all dependencies for possible premises. Most projects only need premises resolved to their full names. For an MWE that depends on mathlib4, this tracing also takes 6 seconds according to my measurements.
   Tracing to determine all possible premises that can be applied for theorem proving can be done in a new process called extensivePremisesTracing.
4) For 2) and 3) we can build tests that are executable within seconds, making the project more maintainable.
5) If one no longer cares about Lean 3, then everything becomes more readable and easier to maintain.

My question is now:
What do you think about these proposals? Should we try to push these changes into the current repo? Would you like to make the repo more of a general toolset for AI prompt generation and testing? Or should I (or hopefully we! ❤️ ) start from scratch?

josojo · 2023-09-24T18:19:41Z

Author

I was too curious not to try it out and came up with a minimal work-in-progress model for running the lean_dojo interaction part in CI. Here is a small repo that runs a lean_dojo REPL interaction test of a mathlib4 theorem in 6 seconds using the MWE approach:
https://github.com/josojo/lean_ai_helper/actions/runs/6291640665/job/17080212329
(Most of the CI-time is used for caching the mathlib4 downloads.)


yangky11 · 2023-09-25T02:51:17Z

Maintainer

Hi @josojo, thank you for the thorough suggestions on LeanDojo. Many of them make a lot of sense to me. Please see my individual comments below, and feel free to follow up on any of those points!

> it was built for the specific requirements of the Reprover and it might not be the best fit for a general framework that enables extracting data for prompt generation + ai result testing via lean-REPL.

I'd say LeanDojo is designed to operate offline. For example, it can only trace public repos on GitHub. If you generate some new code, you have to commit it to GitHub before using LeanDojo to process it, which may be inconvenient in a setup where the data is generated on the fly.

> tracing goes through all dependencies of a project and indexes all of them - which takes very long -, even though this is only needed to determine all possible premises that could be applied in a theorem for the reprover. But this is not a standard requirement for most other projects.

We can add additional flags to turn these features on and off (in the form of environment variables or some config files).
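
As a rough illustration of what such a switch could look like, here is a minimal sketch in Python. The variable name `TRACE_DEPENDENCIES` and the helper functions are hypothetical and are not part of LeanDojo's actual configuration; the sketch only shows the pattern of gating the expensive dependency indexing behind an environment variable while keeping the current behavior as the default.

```python
import os
from typing import List

# Hypothetical variable name, chosen for illustration only.
FLAG = "TRACE_DEPENDENCIES"


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean switch from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def repos_to_trace(root: str, dependencies: List[str]) -> List[str]:
    """Return the repos to index: the root plus its dependencies, or the root alone."""
    if env_flag(FLAG):
        return [root, *dependencies]
    return [root]
```

With the variable unset, everything is traced as before; setting `TRACE_DEPENDENCIES=0` would restrict tracing to the root project.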

> tests take very long to run (over a day for some of them); This prevents quick iterations of code.

Most of the time is spent in tracing the repos. I don't see an obvious way to speed up that part (besides the remote caching mechanism that we already have). BTW, it may be possible to speed up testing using pytest-parallel if you have many CPUs, though we didn't try. We usually only run unit tests when updating the main branch.

> starting most scripts has a big overhead:
> E.g, the dojo component(for interacting with the very nice REPL env) requires a traced repo, although this is fundamentally not needed.

Yes, this part does have a lot of room for improvement. Currently, it needs the traced repo only for locating the target proof in the source file.
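
To make the idea concrete, here is a minimal sketch of how a proof could be located without a traced repo, assuming a plain textual search is acceptable. This is not how LeanDojo does it today. The function `find_theorem_line` is hypothetical, only matches the short name as written after `theorem` or `lemma`, and would be fooled by declarations inside comments or by names that only become unique within a namespace.

```python
import re
from typing import Optional


def find_theorem_line(source: str, name: str) -> Optional[int]:
    """Return the 1-based line number where `theorem <name>` or `lemma <name>` starts.

    Returns None if no such declaration is found in `source`.
    """
    pattern = re.compile(
        r"\b(?:theorem|lemma)\s+" + re.escape(name) + r"(?![\w.'!?])"
    )
    match = pattern.search(source)
    if match is None:
        return None
    # Count the newlines before the keyword to get its line number.
    return source.count("\n", 0, match.start()) + 1
```

The negative lookahead keeps a search for `foo` from matching `foo_bar` or `foo'`.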

> The mixture of Lean4 + Lean 3 code makes the maintenance quite hard => fixing type checks is nontrivial due to convoluted types.

> If one no longer cares about lean3, then everything becomes more readable and easier to maintain.

I'm open to dropping the Lean 3 support (we can put it in a legacy branch). However, we need to get a better sense of whether and how Lean 3 is still being used. We can ask in Lean's Zulip?

> currently most processes depend on a whole LeanGitRepo. I think we should change this to a https://leanprover-community.github.io/mwe.html. The reason is a mwe is the smallest piece needed to run the current lean extraction and REPL scripts.

I agree. It would be great if LeanDojo could work with a local MWE instead of requiring a GitHub repo.

> Interactions: If one wants to interact with a theorem from mathlib4, - with the structure given by 1) - then one would not longer parse the whole mathlib4, but use a small script to build a mwe from the mathlib4 file that contains the theorem.

How would you build the mwe automatically and in a completely general way?

> Tracing: Tracing should by default not build the whole dependency tree searching through all dependencies for possible premises. Most projects only need a resolution of premises to its full name.
> Tracing to determine all possible premises that can be applied for theorem proving can be done in a new process called extensivePremisesTracing

I wouldn't change the current default behavior, but we can add options to support this simplified tracing.

4 replies

yangky11 Sep 25, 2023
Maintainer

Implementing these features seems a nontrivial undertaking, and I personally do not have the capacity to augment LeanDojo substantially (unless as a part of our ongoing research projects). However, if you'd like to implement these features within LeanDojo, I'm happy to help you with any questions about LeanDojo's current design and codebase. We can work in a new branch and replace the current main branch only when the new features are sufficiently mature and documented.

josojo Sep 25, 2023
Author

Thanks for all the feedback!

> However, if you'd like to implement these features within LeanDojo, I'm happy to help you with any questions about LeanDojo's current design and codebase. We can work in a new branch and replace the current main branch only when the new features are sufficiently mature and documented.

Nice! I think I will prototype it all a little bit more and validate some of the proposals. Then, I am very happy to port it to this great project so that many AI experiments can benefit from it.

josojo Sep 25, 2023
Author

> I'd say LeanDojo is designed to operate offline. For example, it can only trace public repos on GitHub. If you generate some new code, you have to commit it to GitHub before using LeanDojo to process it, which may be inconvenient in a setup where the data is generated on the fly

Agreed.

> We can add additional flags to turn these features on and off (in the form of environment variables or some config files).

Agreed. These flags should then also be used to make the tests more independent of each other, so that they run more quickly. Let's do that.

> Most of the time is spent in tracing the repos. I don't see an obvious way to speed up that part (besides the remote caching mechanism that we already have).

Yes, tracing to get all premises will not fundamentally become faster. But I am mostly planning to rebuild the tests, so that we can get quicker feedback on code changes. Let's see what I can come up with.

> I'm open to dropping the Lean 3 support (we can put it in a legacy branch). However, we need to get a better sense of whether and how Lean 3 is still being used. We can ask in Lean's Zulip?

I think comparing the activity on https://github.com/leanprover-community/mathlib and https://github.com/leanprover-community/mathlib4 speaks for itself regarding the adoption rate. And for legacy projects, people can use legacy branches, as you say.

> How would you build the mwe automatically and in a completely general way?

For any theorem within a GitHub project (like mathlib4), your parsed data contains the full names of all premises used in its proof, so one can simply restrict the imports and the preceding definitions and theorems to the ones that are required. But yeah, for an arbitrary theorem, one cannot easily come up with an MWE.
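
A minimal sketch of the import-restriction step, under the assumption that the parsed data can be reduced to a mapping from each premise's full name to the module that defines it. The function names, the mapping, and the module names in the usage below are hypothetical. The sketch only covers premises named explicitly in the proof; anything that influences a proof indirectly through an import would not be captured by it.

```python
from typing import Dict, Iterable, List


def minimal_imports(
    used_premises: Iterable[str],
    premise_to_module: Dict[str, str],
    current_module: str,
) -> List[str]:
    """Return the sorted, de-duplicated modules that define an explicitly used premise.

    Premises defined in `current_module` itself need no import and are skipped,
    as are premises with no known defining module.
    """
    modules = set()
    for premise in used_premises:
        module = premise_to_module.get(premise)
        if module is not None and module != current_module:
            modules.add(module)
    return sorted(modules)


def render_header(imports: List[str]) -> str:
    """Render the import lines that would open the generated MWE file."""
    return "\n".join(f"import {module}" for module in imports)
```

For example, `render_header(minimal_imports(["Nat.add_comm"], {"Nat.add_comm": "Demo.Arith"}, "Demo.Current"))` yields the single line `import Demo.Arith`.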

yangky11 Sep 25, 2023
Maintainer

> your parsed data contains the full names of all used premises in a proof, one can simply restrict the imports and previous definitions + theorems to the ones that are required

I think it's actually more complicated than that. Even if a proof does not explicitly rely on premises from an imported file (say, A), it could be impacted indirectly. For example, maybe the proof uses the simp tactic, and the file A might change the set of lemmas used by the simplifier.

josojo · 2023-10-15T14:32:01Z

Author

I prototyped many of the proposals above, and most of them proved their value.
https://github.com/josojo/lean_ai_helper is a relatively stable refactoring of this repo that allows quicker iteration:

- e2e tests for tracing and REPL are super quick: CI finishes in minutes.
- Everything is strictly typed.
- No Lean 3 dependencies.
- IMO MWEs are a nice unit to work with and iterate on.

But I think I will not continue, since it seems https://github.com/semorrison/lean-training-data is an even better approach, as it is faster and all native in Lean. At least, I want to play around with it a little more before I continue the work of porting the gained knowledge to this project.
