Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collection functionality #150

Open
rofinn opened this issue Oct 28, 2021 · 7 comments
Open

Garbage collection functionality #150

rofinn opened this issue Oct 28, 2021 · 7 comments

Comments

@rofinn
Copy link
Collaborator

rofinn commented Oct 28, 2021

It'd be really cool if DataDeps had a gc function which could identify old datadeps that aren't referenced/needed anymore and suggest deleting them. I'm not quite sure how it work, but it wouldn't need to be perfect. It could just make a best guess, ask the user, and they can always redownload things they actually need.

@oxinabox
Copy link
Owner

oxinabox commented Oct 29, 2021

Yes, I have some thoughts on this:

Some options:

  1. just use Artifacts.jl instead of DataDeps.jl.
  2. We workout how to hook into what Artifacts use for this.
  3. Do something like generate Artifacts on the fly from DataDep after they are downloaded.

@rofinn
Copy link
Collaborator Author

rofinn commented Oct 29, 2021

Do you mean Artifacts.jl in Pkg.jl? FWIW, I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.

@oxinabox
Copy link
Owner

Yeah, is Artfiacts part of Pkg right now? I thought it was a seperate stdlib.

I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.

Agreed.
But option 2 or 3 might let us still do that.

@oxinabox
Copy link
Owner

I think maybe the dynamic artifact creation in OhMyArtifacts.jl might work to achieve option 3.
but I haven't looked closely.
(cc @chengchingwen)

@chengchingwen
Copy link
Contributor

I think it's doable but would take some effort to integrate the two. There are some problems and behavior differences need to be resolved. Things like:

  1. What do we want to bind the life-time of a DataDep to? OhMyArtifacts.jl (also Scratch.jl) bind to the Project(.toml) who call it, so if we want to build on top of that, we would need to figure out who defined that DataDep. Otherwise all the DataDep will live until DataDeps.jl been removed.
  2. It doesn't seem to be possible to migrate the old downloaded DataDep, should we just delete them all?
  3. DataDep name need to be unique, but entry name in Artifacts.jl can bind to different artifact.
  4. Currently each entry in OhMyArtifacts.jl is a single file (instead of a directory), I would need to come up with a better design for holding the directory structure.

OTOH using Scratch.jl might probably be the most effortless and behavior consistent approach (though 1. 2. remains), we can just store each DataDep in a scratchspace folder and bind the lifetime to the caller project with get_scratch!(DataDeps, datadep_name, THAT_PACKAGE) and somehow allow DataDeps.register or DataDep constructor to pass in the caller package. Then when that package get GCed, the scratchspace will also be removed.

@chengchingwen
Copy link
Contributor

@oxinabox I just tag a new release for OhMyArtifacts.jl. It should be able to handle directory now. I can try to make a PR to switch the backend to it, do you have a pointer on what files I would need to modified?

@oxinabox
Copy link
Owner

oxinabox commented Sep 1, 2022

  1. We can work this out by using the __location__ from where the datadep macro is called I think then searching for a Project.toml. Though really i guess we want the location where register was called. For hacks we could use the stacktrace() and go find it?

  2. No leave them alone. But we could put something in the release notes

  3. It is a flaw in DataDeps that they need to be globally unique, since it means clashes are possible. Unlike with the smarter content hashing that Artifacts use. One thing we do need to preserve is the ability to access datadeps via a global name (though that is allowed to error when there is a clash I guess), since i am sure some people install a package and then access the datadeps it registers by name.

Maybe scratch spaces, as you say are the better way to do this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants