Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cached recipes #1906

Draft
wants to merge 36 commits into
base: master
Choose a base branch
from
Draft

Cached recipes #1906

wants to merge 36 commits into from

Conversation

nmay231
Copy link

@nmay231 nmay231 commented Feb 16, 2024

Closes #867, #1861, cc #1685

Putting this out there to get some preliminary feedback, ask questions, and take advantage of self-inflicted peer pressure 😁

What's in this PR?

Adds a [cached] recipe attribute that marks that recipe as cacheable. A cacheable recipe only reruns when the body of the recipe changes after variables and other {{interpolations}} are evaluated. See examples in #867.

TODOs

  • Set up basic recipe caching.
  • Finish work on how dependencies work with cached recipes.
  • Add --no-cache to force run cached recipes.
  • Add --delete-cache to delete the cache file for this project.
  • Update README.
  • Future PRs (I may or may not get to these):
    • Prevent double work in the evaluator when checking if the recipe needs to be run or not.
    • Expand sha256_file() and blake3_file() to take directories and/or globs, or add a more explicit cache_files()/hash_files() with this functionality.
    • Add cache_env_vars() to allow caching by a glob of different environment variables, e.g. CARGO_.*.
    • Add cache_ignore() to ignore certain content while caching that will always/often change such as uuid() and just_pid().
    • Allow arguments like [cached("expression")] that 'stratify' a recipe so you can cache a recipe by independent targets.

Decisions, decisions... Edge cases and questions

  • Besides backtick expressions and calls to uuid() and just_pid(), are there other impure interpolations that you can think of off the top of your head? I've already scanned the README, but wanted to double check.
  • Cached recipes with dependencies do not have perfectly obvious behaviors. Here is how I am going to go with it (I'm looking for feedback on these):
    • The prior dependencies of cached recipes must also be cached, otherwise, it's ambiguous if you would run a dependency if the main recipe matched the last cached run.
    • Subsequent dependencies (&& after) do not need to be cached and do not run if the recipe matches the last run. I don't think they need to run because they are typically clean-up jobs and are only needed if the recipe actually ran. If the user wants them to run every time, they should set up a separate recipe that depends on the main recipe and subsequent in that order.
    • An unchanged recipe with a changed dependency will re-run both.
    • A changed recipe with an unchanged dependency only re-run the former and not the dependency.
  • Did I miss any stupidly obvious issue that needs addressing? I am writing this at 3 in the morning, so I wouldn't be surprised.

@casey
Copy link
Owner

casey commented Feb 16, 2024

Nice! Some feedback:

  • This initial PR should be as simple as possible, so it's easy to review and land, and additional features can be added in follow-up PRs. So I would leave --no-cache, --delete-cache, and customizing the cache filename for later.

  • What is the Error::InvalidCachedRecipe error for?

  • Can you describe the structure of the cache file JSON?

src/cache.rs Outdated Show resolved Hide resolved
src/cache.rs Outdated Show resolved Hide resolved
@tgross35
Copy link
Contributor

tgross35 commented Feb 16, 2024

This looks awesome!

I still think it could be better to have an explicit "rerun recipe if the value of this thing changes" function so the user can determine what is important, rather than having just need to determine whether a function is impure. This would also mean that you could have backticked expressions that don't invalidate the cache.

Don't think this would be too bad if the indicator function also returns the value, so you could do:

config:
    # There is probably a snappier name than rerun_if_changed
    echo "running for configuration {{ rerun_if_changed(env("PROJECT_CONFIG")) }}"
    # ...

@casey
Copy link
Owner

casey commented Feb 16, 2024

I still think it could be better to have an explicit "rerun recipe if the value of this thing changes" function so the user can determine what is important, rather than having just need to determine whether a function is impure. This would also mean that you could have backticked expressions that don't invalidate the cache.

This is definitely an interesting idea, but I'd like to explore how things look without explicit annotations. If it works well without having to annotate anything, that would be ideal.

@nmay231
Copy link
Author

nmay231 commented Feb 16, 2024

@casey

What is the Error::InvalidCachedRecipe error for?

Having a call to uuid(), and the like in a cached recipe means the recipe will always change and never be cached. So they should be disallowed completely until cache_ignore() is implemented.

I also disallowed backtick interpolation because currently the command would be executed twice (one of the backlog problems). Even when it is evaluated once, it won't be run in the same execution order as run_linewise which only runs interpolations line by line. If the user determines their backtick expression is pure (like a version check rustc --version) then they can put the expression outside the recipe body. If it is impure (relies on previous state) then they can wrap it in cache_ignore() to only run during the main execution.

Can you describe the structure of the cache file JSON?

The structure itself is JustfileCache. As mentioned in my response to @tgross35 above, I probably need to change things a bit.

After this comment, I'll make it so there is a unique cache file per justfile path and working directory combo. The filename on my system is ~/.cache/justfiles/{project_name}-{&path_hash[..16]}.json. The working_directory and (soon to be added) justfile_path will only be there for user debuggability.

I would leave [...] customizing the cache filename for later.

I'm actually using the filename customization to not pollute the contributor's workspace during tests, so I'll leave it for now.

@casey
Copy link
Owner

casey commented Feb 16, 2024

Having a call to uuid(), and the like in a cached recipe means the recipe will always change and never be cached. So they should be disallowed completely until cache_ignore() is implemented.

I think it's fine to just not detect this and the recipe doesn't get cached.

I also disallowed backtick interpolation because currently the command would be executed twice (one of the backlog problems). Even when it is evaluated once, it won't be run in the same execution order as run_linewise which only runs interpolations line by line. If the user determines their backtick expression is pure (like a version check rustc --version) then they can put the expression outside the recipe body. If it is impure (relies on previous state) then they can wrap it in cache_ignore() to only run during the main execution.

Let's punt on linewise recipes and only allow [cache] on shebang recipes for now. We can allow backticks in that case and figure out linewise recipes later. Shebang recipes will be easiest to cache, since we just evaluate the text once and can make a decision to execute or skip based on the evaluated text. I don't want to worry about a cache_ignore function for the time being.

The structure itself is JustfileCache. As mentioned in my response to @tgross35 above, I probably need to change things a bit.

I think we should aim for an initial very simple version which has shortcomings and then fix those one by one.

For now, what do you think about just having a single .justcache directory, which has files named after the sha hash of the evaluated recipe body, which contain the evaluated text of the recipe body?

My thinking is that:

  • It's useful to be able to see what's cached with ls .justcache and seeing evaluated recipe bodies with cat .justcache/HASH.
  • What exactly should be included in the cache key is tricky. There is the evaluated recipe body, the working directory, and justfile settings. Anything else?

We can initially land this feature behind set unstable, and then we can think about what needs to be in the cache keys one at a time. So for example, we can go, "Okay, the working directory needs to be in the cache key, let's do a PR for that.", and make incremental improvements to the feature, until it's ready to be made stable.

@nmay231
Copy link
Author

nmay231 commented Feb 17, 2024

I think it's fine to just not detect this and the recipe doesn't get cached.

Alrighty. I'm just trying to prevent open issues like "[cached] doesn't work". If you prefer a simpler implementation, I'll do that.

For now, what do you think about just having a single .justcache directory, which has files named after the sha hash of the evaluated recipe body, which contain the evaluated text of the recipe body?

I am very opposed to this. While it sounds good in principle to be able to read the previous cached recipes, most of the time the hash is gonna change a lot because the recipe contains sha256_file(...) while the rest of the recipe body remains the same. I don't want to have to delete the cache all the time.

That said, I would not mind if the exact format changed somewhat. For example, I wouldn't mind having the cache file/directory in the working directory instead of the user's general cache directory.

What exactly should be included in the cache key is tricky. There is the evaluated recipe body, the working directory, and justfile settings. Anything else?

The recipe name and/or justfile path should also be included in the cache input. If we do put the cache file/directory in the working directory, that would not need to be part of the input. I can't think of anything else.

We can initially land this feature behind set unstable

👍

@amarao
Copy link
Contributor

amarao commented Feb 21, 2024

Don't forget to put version check into cache. Eventually it will be version 2 of the cache, and someone will run older just with newer cache.

Expected behavior: either invalidate cache, or refuse to run.
Bad behavior: ignore version 2 in the cache file and try to re-use it.

The main issue is that it's impossible to add behavior to older versions (they are old by definition), so safeguard should be from from the very beginning.

@nmay231
Copy link
Author

nmay231 commented Mar 12, 2024

@casey Two quick things:

  1. Do we really need to constrain this to only shebang recipes to prevent double evaluation if this is unstable only? I feel it's better to allow shebang and linewise recipes since it requires --unstable anyways. Then users can keep the benifits of linewise recipes (early exit on failure, linewise command printing) while trying out caching. Of course, I'll work on preventing double evaluation in a separate PR.
  2. Is the current cache format a show-stopper for you, or is the json file okay? I already expressed why I didn't like it in the previous comment. I did simplify it to a file in a local .justcache/ directory, where there's a json file for every justfile * working_dir combo.

(Still need to handle recipe dependencies and add docs, but this should be done other than that)

@nmay231
Copy link
Author

nmay231 commented Mar 26, 2024

@amarao I decided to not include an explicit version in the cache format to simplify things and instead let the schema be the version. Obviously, we still handle outdated/invalid schemas by invalidating the cache.

Besides, running just with a cache file that is the wrong version is much rarer than you might think since .justcache/ should be local to each machine and not included in git history.

@nmay231
Copy link
Author

nmay231 commented Apr 15, 2024

I still have the two previous questions that need to be answered, but it's ready besides that. Whenever you have time for this full-time unpaid job, of course :)

One of those questions is about double evaluation, but that feels much easier to handle in the context of how cache_ignore() will be implemented (since that will basically need to be evaluated once during cache checks and once again during runtime). I will probably implement that and future features as a stacked diff (multiple PRs).

The only other thing to consider (that I know about) is that this might heavily conflict with #1562, however, that seems to be paused at the moment due to the author's lack of time.

@nmay231 nmay231 marked this pull request as ready for review April 15, 2024 21:23
@casey
Copy link
Owner

casey commented May 20, 2024

  1. Do we really need to constrain this to only shebang recipes to prevent double evaluation if this is unstable only? I feel it's better to allow shebang and linewise recipes since it requires --unstable anyways. Then users can keep the benifits of linewise recipes (early exit on failure, linewise command printing) while trying out caching. Of course, I'll work on preventing double evaluation in a separate PR.

I think I'd rather constrain it to shebang recipes at the moment. Linewise recipes seem much more complicated. One issue is that linewise recipes are evaluated line-by-line, so an error in an early line will prevent a later line from running. If a later line relies on an earlier line, for example because it creates a file that is read in an expression in a later line, then evaluating it ahead of time would produce an error. We could, in the case of cached recipes, evaluate linewise recipe lines ahead of time, and only re-run the recipe if no lines have changed, which is probably what we want to do, but would be a larger change, and I'd rather get this PR in sooner rather than later.

  1. Is the current cache format a show-stopper for you, or is the json file okay? I already expressed why I didn't like it in the previous comment. I did simplify it to a file in a local .justcache/ directory, where there's a json file for every justfile * working_dir combo.

Ultimately, I think I'd like to use redb an embedded pure-rust key/value store, for the cache. I'm happy to do that myself though, so whatever format is easiest for you for now and easiest to review is fine. Since redb databases block on concurrent access, this would also prevent errors where two instances of just try to modify the cache at the same time, and potentially also allow for concurrent access, if something like #1562 is ever merged.

One of those questions is about double evaluation, but that feels much easier to handle in the context of how cache_ignore() will be implemented (since that will basically need to be evaluated once during cache checks and once again during runtime). I will probably implement that and future features as a stacked diff (multiple PRs).

I'm not sure I understand this point. Can't recipes be evaluated only once, before running, and the evaluated recipe body used both for cache checks and execution?

@casey
Copy link
Owner

casey commented May 20, 2024

There are some conflicts, but they shouldn't be too bad to resolve. Let me know when this is ready for review!

@casey casey marked this pull request as draft May 25, 2024 08:09
@casey
Copy link
Owner

casey commented May 25, 2024

Converted to a draft just so I can keep track of what's ready to go. Feel free to undraft!

@nmay231
Copy link
Author

nmay231 commented Jun 9, 2024

Sounds good! I recently changed jobs, but I'll try to get back to this this week.

One issue is that linewise recipes are evaluated line-by-line, so an error in an early line will prevent a later line from running. If a later line relies on an earlier line, for example because it creates a file that is read in an expression in a later line, then evaluating it ahead of time would produce an error.

True, but that's only if the user is using backtick expressions to mutate things "during compile time" which should be discouraged. This is one reason I wanted to disallow backtick expressions initially until cache_ignore() was fully implemented. (To prevent confusion, I mean expr {{`with`}} backticks since regular backticks expr `with` backticks are passed to the shell directly and safely run line-by-line.)

In any case, I agree that cached line-by-line recipes should have the same semantics as shebang recipes (all expressions evaluated at start before passed to shell). Ideally, cached line-by-line recipes would run exactly the same way as cached shebang recipes, but you can still reap the benefits of changing line-echos with @, spreading commands over multiple lines with \, etc. In other words, line-by-line recipes would be the odd one out.

One of those questions is about double evaluation, but that feels much easier to handle in the context of how cache_ignore() will be implemented (since that will basically need to be evaluated once during cache checks and once again during runtime). I will probably implement that and future features as a stacked diff (multiple PRs).

I'm not sure I understand this point. Can't recipes be evaluated only once, before running, and the evaluated recipe body used both for cache checks and execution?

I initially thought I would do two evaluation runs to implement cache_ignore(), but it would be much simpler to produce two strings, one for hashing and comparing to the cache, and one to pass to the shell to run.

EDIT: So much for that deadline, haha. I think I'll have time to look at this again soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

files as dependency
4 participants