docs: remove note on disk space for caching #5534

Merged
ltalirz merged 5 commits into main from docs-caching on May 24, 2022

ltalirz (Member) commented May 23, 2022

The new repository implementation automatically de-duplicates identical files.
Re-running a cached calculation should therefore not store additional copies of the results
in the repository, and should not increase disk usage beyond what is needed to store
the metadata for the new calculation nodes, data nodes & links in the database.

@ltalirz ltalirz requested a review from sphuber May 23, 2022 11:13
sphuber (Contributor) commented May 23, 2022

I wouldn't remove this, or at the very least just adjust the text. Even though content in the file repository is now deduplicated, this is just for the psql_dos implementation. It is not guaranteed for other storage backends. Also, the content in the database for the psql_dos is not deduplicated. If you have nodes with a lot of content, that will be cloned in postgres.

ltalirz (Member, Author) commented May 23, 2022

Thanks for the comment @sphuber

Well, the gist of the sentence is certainly no longer correct and needs to change.

In my view, it would also be fine to delete since the duplication of metadata on the DB level is unlikely to substantially impact disk usage - and there is value to users not spending time worrying about things that are unlikely to impact them.

Given that this is under "topics", however, where people go to learn how things work, I'm also fine with providing a more detailed explanation here.
Let me give it a try.

sphuber (Contributor) left a comment

Thanks @ltalirz. Since it is possible to provide different storage implementations, I would keep it general, with an explicit note that the default case has automatic deduplication for files.

Comment on lines 162 to 163
#. While caching saves unnecessary computations, it does not directly prevent duplication of data: the cached calculation and its output nodes are duplicated.
In practice, however, AiiDA's file repository implementation will detect that any files associated with these nodes are already present and simply point to those, reducing duplication to metadata stored at the database level.
Suggested change

- #. While caching saves unnecessary computations, it does not directly prevent duplication of data: the cached calculation and its output nodes are duplicated.
-    In practice, however, AiiDA's file repository implementation will detect that any files associated with these nodes are already present and simply point to those, reducing duplication to metadata stored at the database level.
+ #. While caching saves unnecessary computations, it does not necessarily prevent duplication of data: the cached calculation and its output nodes are duplicated in the storage.
+    Whether the duplicated nodes actually result in the _size_ of the storage increasing, depends on the storage implementation, which may implement automatic deduplication mechanisms to save space.
+    This is actually the case for the default storage implementation `psql_dos`; this storage automatically detects files that already exist and will not store them again.

ltalirz (Member, Author) commented May 23, 2022

Thanks, I wanted to phrase it slightly differently, but I've tried to incorporate your points.
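
For readers following this thread, the distinction discussed above (file content is deduplicated, database rows are not) can be illustrated with a small sketch. This is a toy model only: `ToyObjectStore` and `ToyDatabase` are hypothetical names invented for illustration, they are not part of AiiDA, and the use of SHA-256 as the content key is an assumption rather than a description of how `psql_dos` works internally.

```python
import hashlib


class ToyObjectStore:
    """Toy content-addressed file store (hypothetical; not AiiDA's API)."""

    def __init__(self):
        self._objects = {}  # sha256 hex digest -> file content

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._objects.setdefault(key, content)
        return key

    def size(self) -> int:
        """Total number of bytes held by the store."""
        return sum(len(c) for c in self._objects.values())


class ToyDatabase:
    """Toy node table: every node gets its own row, with no deduplication."""

    def __init__(self):
        self.rows = []

    def add_node(self, attributes: dict, file_keys: list) -> int:
        self.rows.append({"attributes": dict(attributes), "files": list(file_keys)})
        return len(self.rows) - 1


store = ToyObjectStore()
db = ToyDatabase()

output = b"total energy: -1.234\n" * 1000

# Original calculation: stores the file content and one database row.
original = db.add_node({"label": "original"}, [store.put(output)])
size_after_original = store.size()

# Cached rerun: a new row is created, but the file content already exists.
cached = db.add_node({"label": "cached"}, [store.put(output)])
size_after_cached = store.size()

print(size_after_cached == size_after_original)  # True: no new file content
print(len(db.rows))  # 2: the node metadata is duplicated
```

In this sketch the second `put` call adds nothing to the file store, while the database still gains a full row for the cached node. A storage backend without a content-addressed store would not get the first property, which is why the suggested wording keeps the statement general.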

@ltalirz ltalirz merged commit 9d903ca into main May 24, 2022
@ltalirz ltalirz deleted the docs-caching branch May 24, 2022 07:06
2 participants