docs: remove note on disk space for caching #5534

ltalirz · 2022-05-23T11:13:07Z

The new repository implementation includes automatic de-duplication of identical files.
Re-running a cached calculation should therefore not result in copies of the results stored
in the repository, and in no increase in disk space usage besides what is needed for storing
metadata for the new calculation nodes, data nodes & links in the database.
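
For illustration only, here is a minimal sketch of the idea behind content-based de-duplication. It is a toy model and not AiiDA's actual repository code: the `ToyStorage` class, its methods, and the use of SHA-256 as the content key are all assumptions made for this example. File contents are stored once per unique hash, while each node still gets its own metadata record.

```python
import hashlib


class ToyStorage:
    """Hypothetical toy model: files keyed by content hash, metadata kept per node."""

    def __init__(self):
        self.objects = {}  # sha256 hex digest -> file content (stored once)
        self.nodes = []    # one metadata record per node (never de-duplicated)

    def store_node(self, label, files):
        """Store a node; `files` maps file names to their content as bytes."""
        file_keys = {}
        for name, content in files.items():
            key = hashlib.sha256(content).hexdigest()
            # Write the content only if an identical object is not already present.
            self.objects.setdefault(key, content)
            file_keys[name] = key
        # The node record itself is always added, even for a cached clone.
        self.nodes.append({"label": label, "files": file_keys})

    def object_bytes(self):
        """Total size of the unique file contents held in the store."""
        return sum(len(content) for content in self.objects.values())


storage = ToyStorage()
output = {"output.txt": b"total energy: -1.234\n"}

storage.store_node("original calculation", output)
size_before = storage.object_bytes()

# Re-running from the cache clones the node but adds no new file content.
storage.store_node("cached clone", output)
size_after = storage.object_bytes()
```

In this sketch the second `store_node` call grows only the `nodes` list, which corresponds to the metadata overhead mentioned above; the file content is stored once.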

sphuber · 2022-05-23T11:15:52Z

I wouldn't remove this; at the very least, I would just adjust the text. Even though content in the file repository is now deduplicated, this holds only for the `psql_dos` implementation. It is not guaranteed for other storage backends. Also, the content in the database for `psql_dos` is not deduplicated. If you have nodes with a lot of content, that content will be cloned in postgres.

ltalirz · 2022-05-23T11:42:52Z

Thanks for the comment @sphuber

Well, the gist of the sentence is certainly no longer correct and needs to change.

In my view, it would also be fine to delete it, since the duplication of metadata at the DB level is unlikely to substantially impact disk usage, and there is value in users not spending time worrying about things that are unlikely to affect them.

Given that this is under "topics", however, where people go to learn how things work, I'm also fine with providing a more detailed explanation here.
Let me give it a try.

sphuber

Thanks @ltalirz. Since it is possible to provide different storage implementations, I would keep it general, with an explicit note that the default case has automatic deduplication for files.

sphuber · 2022-05-23T14:25:49Z

docs/source/topics/provenance/caching.rst

+#. While caching saves unnecessary computations, it does not directly prevent duplication of data: the cached calculation and its output nodes are duplicated.
+   In practice, however, AiiDA's file repository implementation will detect that any files associated with these nodes are already present and simply point to those, reducing duplication to metadata stored at the database level.


Suggested change

-#. While caching saves unnecessary computations, it does not directly prevent duplication of data: the cached calculation and its output nodes are duplicated.
-   In practice, however, AiiDA's file repository implementation will detect that any files associated with these nodes are already present and simply point to those, reducing duplication to metadata stored at the database level.
+#. While caching saves unnecessary computations, it does not necessarily prevent duplication of data: the cached calculation and its output nodes are duplicated in the storage.
+   Whether the duplicated nodes actually result in the _size_ of the storage increasing, depends on the storage implementation, which may implement automatic deduplication mechanisms to save space.
+   This is actually the case for the default storage implementation `psql_dos`; this storage automatically detects files that already exist and will not store them again.

thanks, I wanted to phrase it slightly differently but I've tried to incorporate your points

docs/source/topics/provenance/caching.rst

Co-authored-by: Sebastiaan Huber <[email protected]>

ltalirz requested a review from sphuber May 23, 2022 11:13

incorporate feedback from code review

e87ac76

sphuber requested changes May 23, 2022

incorporate suggestions from code review

b86a721

sphuber requested changes May 23, 2022

ltalirz and others added 2 commits May 23, 2022 23:23

Update docs/source/topics/provenance/caching.rst

906e60b

Co-authored-by: Sebastiaan Huber <[email protected]>

Merge branch 'main' into docs-caching

13c9b55

ltalirz merged commit 9d903ca into main May 24, 2022

ltalirz deleted the docs-caching branch May 24, 2022 07:06
