
verdi: add repository maintenance commands to operate on the repository #4321

Closed
sphuber opened this issue Aug 26, 2020 · 15 comments
Labels: priority/critical-blocking (must be resolved before next release), topic/repository, topic/verdi, type/accepted feature (approved feature request)

@sphuber (Contributor) commented Aug 26, 2020

Once the new repository implementation is merged, the disk object store (DOS) will need occasional maintenance to keep performance up, and this maintenance needs to be controllable from verdi. Operations include, but are not necessarily limited to:

  • Packing all loose objects
  • Deleting packed loose objects
  • Repacking
  • Deleting unreferenced objects
  • Vacuuming the index database

Some of these operations can be executed while AiiDA is running, whereas others can only be executed as a maintenance operation during which the DOS cannot be used. They also differ in cost: some are quick and cheap, whereas others (such as repacking) can be very expensive. The current idea is to have a single command that covers the majority of users and simply needs to be called periodically to optimize the repository. However, it would be good to also expose the individual sub-operations listed above as separate commands, to provide more control and to allow them, for example, to be called from cron jobs.

Current proposal of subcommands for the new command verdi repository:

  • backup: perform an rsync of the DOS.
    • Open question: should this rsync the state as-is, should it automatically perform some maintenance operations first, or should it propose those operations to the user and ask for confirmation?
  • maintenance: one-stop-shop command for most users. Will optimize the DOS in the most effective way possible, either automatically or through prompting.
  • pack: pack all loose objects.
  • clean: delete packed loose objects, delete unreferenced objects and vacuum the index database.
    • --delete-loose/--no-delete-loose
    • --delete-unreferenced/--no-delete-unreferenced
    • --vacuum-index/--no-vacuum-index
    • --repack: fully repack the DOS; not always necessary and potentially expensive, but can greatly improve performance.

This is an incomplete proposal and the names are in no way definitive. Let's discuss below and I will keep updating the design here as things are agreed upon.
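To make this a bit more concrete, here is a minimal sketch of what the command group could look like, assuming click (which verdi is already built on); the command and option names simply mirror the proposal above and the bodies are placeholders, not an actual implementation:

# Minimal sketch only: names mirror the proposal above, bodies are placeholders.
import click


@click.group('repository')
def repository():
    """Commands to maintain the disk object store (DOS) of the repository."""


@repository.command('maintenance')
@click.option('--non-interactive', is_flag=True, help='Do not prompt; apply only safe default operations.')
def maintenance(non_interactive):
    """One-stop-shop command: optimize the DOS, prompting before expensive steps."""
    click.echo(f'running maintenance (non-interactive={non_interactive}) ...')  # placeholder


@repository.command('clean')
@click.option('--delete-loose/--no-delete-loose', default=True)
@click.option('--delete-unreferenced/--no-delete-unreferenced', default=False)
@click.option('--vacuum-index/--no-vacuum-index', default=False)
@click.option('--repack', is_flag=True, help='Fully repack the DOS (expensive).')
def clean(delete_loose, delete_unreferenced, vacuum_index, repack):
    """Delete packed loose objects, delete unreferenced objects and vacuum the index."""
    click.echo(f'clean: loose={delete_loose} unreferenced={delete_unreferenced} '
               f'vacuum={vacuum_index} repack={repack}')  # placeholder


if __name__ == '__main__':
    repository()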

@sphuber (Contributor, Author) commented Aug 26, 2020

Pinging @ramirezfranciscof and @giovannipizzi

@giovannipizzi (Member) commented:
Do you prefer that we add comments here, or that we modify the initial comment directly?
One comment: I would move --repack to be an option of clean - it is the most aggressive operation and is needed to recover space after deleting/cleaning, so logically for me it is the 'last' step of clean. I would keep pack only for packing loose objects and for things that can be executed while AiiDA is in use.

sphuber changed the title from "verdi: add vrepository maintenance commands to" to "verdi: add repository maintenance commands to operate on the repository" on Aug 27, 2020
@sphuber (Contributor, Author) commented Aug 27, 2020

I would add comments here as new posts; once we agree, I will update the OP to reflect the current design proposal.

@ramirezfranciscof (Member) commented:
  1. I would keep backup separate and have it only do the rsync, with no extra optimization operations. Are there any reasons to include the other operations?
  2. I'm guessing maintenance is the equivalent of pack + clean --all?
  3. How would clean work by default? Are all flags on?

@giovannipizzi (Member) commented Aug 27, 2020

Some tentative answers:

  1. OK to rsync only, but there are a few things that it might be good to do before backing up - e.g., calling clean_storage to remove loose objects that have been packed in the meantime. Maybe we can add these as (optional) flags.
  2. I think it should prompt through all possible steps, down to the most aggressive repacking, with hints on whether each step is needed (e.g. quickly compute whether the SQLite index needs vacuuming, or how many holes there are and whether repacking is worthwhile). If the user answers no at some point and it does not make sense to go to the next step, it would just stop asking questions. Essentially, an interactive guide through all the maintenance commands.
  3. I would have clean just call clean_storage internally, with no vacuum, by default (quick and quite safe). Then add flags for --vacuum, --delete-unreferenced and --repack, in order of complexity. I do not know whether we also want the --no versions of these, because it would be good, for example, to always re-vacuum after repacking (I don't remember whether this is already enforced); so if --repack is specified, the index would be vacuumed even if --vacuum is not given (but only after the repack). It might make the logic a bit more complex, but we can try to do the right thing for the user (in terms of which operations to do and in which order); see the sketch below this list.
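To illustrate the ordering in point 3, a rough sketch of what clean could do internally with the disk-objectstore Container; the method names (clean_storage, repack) are as I recall them from the library and the flag variables are stand-ins for the CLI options, so treat exact signatures as assumptions rather than a final design:

# Rough sketch only: clean_storage/repack are disk-objectstore Container
# methods as recalled; verify exact signatures against the library.
from disk_objectstore import Container

container = Container('/path/to/container')  # hypothetical container location

# Stand-ins for the CLI flags discussed above.
vacuum_requested = False
repack_requested = False

# Default behaviour: quick and quite safe - remove loose objects that have
# already been packed, without vacuuming the SQLite index.
container.clean_storage(vacuum=False)

if repack_requested:
    # Expensive: rewrite the pack files to remove holes left by deletions.
    container.repack()
    # Re-vacuum the index after repacking, regardless of --vacuum, but only after.
    container.clean_storage(vacuum=True)
elif vacuum_requested:
    container.clean_storage(vacuum=True)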

giovannipizzi added this to the v2.0.0 milestone on Mar 24, 2021
chrisjsewell self-assigned this on May 5, 2021
sphuber added the priority/critical-blocking (must be resolved before next release) label and removed the priority/important label on May 5, 2021
@chrisjsewell (Member) commented May 6, 2021

Correct me if I'm wrong, but there currently seems to be no performant way to retrieve all file hashes from the AiiDA DB.

Take this example:

from io import StringIO
import pprint

from aiida.orm import Dict

n = Dict()
n.put_object_from_filelike(StringIO("a"), "a/b/c/d")
n.put_object_from_filelike(StringIO("a"), "a/b/c/e")
n.put_object_from_filelike(StringIO("b"), "a/b/c/f")
n.store()
pprint.pprint(n.repository_metadata)
{'o': {'a': {'o': {'b': {'o': {'c': {'o': {'d': {'k': 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb'},
                                           'e': {'k': 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb'},
                                           'f': {'k': '3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d'}}}}}}}}}

To my knowledge, this is the only place these hashes are stored in the DB (i.e. DbNode.repository_metadata = JSONField(default=dict, null=True)), and so there is no "direct" SQL approach to retrieve every hash.
Instead, you have to query through every node and run a (complex/costly) function to flatten the recursive dictionary. Something like:

import collections.abc

def flatten(dictionary, parent_key=False, separator='/'):
    """Recursively flatten a nested dictionary into a flat {'joined/path': value} mapping."""
    items = []
    for key, value in dictionary.items():
        new_key = str(parent_key) + separator + key if parent_key else key
        if isinstance(value, collections.abc.MutableMapping):
            items.extend(flatten(value, new_key, separator).items())
        elif isinstance(value, list):
            for k, v in enumerate(value):
                items.extend(flatten({str(k): v}, new_key).items())
        else:
            items.append((new_key, value))
    return dict(items)

from aiida.orm import Node, QueryBuilder

qb = QueryBuilder()
qb.append(Node, project=["repository_metadata"])
hashes = set()
for data, in qb.iterall():
    hashes.update(flatten(data).values())

Why did we choose to store the metadata in this way, rather than something simpler like:

repository_metadata = {
    "a/b/c/d": "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb"
}

It seems like there are multiple places where we unnecessarily have to go through this flattening/unflattening process, when all you want to do is map an object path to its key/hash.
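For comparison, with a flat mapping like the one above, collecting every hash would not need the recursive flatten step at all; something like the following (illustrative only, since this flat schema does not actually exist):

# Illustrative only: assumes the hypothetical flat {path: key} schema above,
# which is NOT how repository_metadata is currently stored.
from aiida.orm import Node, QueryBuilder

qb = QueryBuilder()
qb.append(Node, project=['repository_metadata'])
hashes = {key for metadata, in qb.iterall() for key in metadata.values()}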

Related, I find it a bit odd that we have aiida.repository.common.File that can be either a file or a directory.

(Also, a minor nitpick: we use pathlib.Path in the repository code, whereas technically we should be using pathlib.PurePosixPath.)

Lastly, I'm not going to push this, because I know it's falling on deaf ears, but as I already said in #2046 (comment), if you were coming at this purely from a performance perspective, you would take the SQLite table (https://github.com/aiidateam/disk-objectstore/blob/a2561a40cf2e9a4d58325387f913ccf08111f5fd/disk_objectstore/models.py#L8) and move it into the AiiDA (postgres) database, from where you could implement a more powerful many-to-one relationship between the node files and their existence/location in the pack files.

@giovannipizzi (Member) commented:
You're not falling on deaf ears; it's just that before changing a design decision we should make sure we don't then run into other performance bottlenecks, so it would be good to discuss and weigh the pros and cons before changing anything.

Two comments:

  1. You don't necessarily need to move the whole SQLite table to PSQL; the table you mention in Proposal to add DbFile table #4919 would already make queries fast (but could make other things slower, so it needs benchmarking, as discussed there). Also, importantly: the SQLite DB only contains references to the packed objects; you still need to look at the disk to know whether a loose version exists. Therefore, I would still keep the SQLite DB as an internal implementation detail of the object store, and possibly look into strategies like Proposal to add DbFile table #4919 or similar if performance needs to be improved.
  2. Do you know how long (in seconds) it takes to iterate over everything for a large DB? When I tested, it was still quite fast and acceptable even for large DBs. It would be important to create the structure for a large DB (I, or even better @sphuber or @mbercx, can provide you with all the JSON data from one of the big DBs after migration), and I'd be interested to compare the speed of various operations (not only querying, but also storing and fetching all UUIDs) both with the current schema and with any future proposal (Proposal to add DbFile table #4919 or others), to decide whether we need to change anything. A rough timing sketch for fetching all hashes follows below.
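For reference, a rough timing sketch (not a proper benchmark) of the "fetch all hashes" operation, reusing the flatten helper from the comment above, just to get the order of magnitude being asked about:

# Rough timing sketch: how long does collecting every file hash take on a
# given profile? Reuses the `flatten` helper defined in the earlier comment.
import time

from aiida.orm import Node, QueryBuilder

start = time.perf_counter()

qb = QueryBuilder()
qb.append(Node, project=['repository_metadata'])

hashes = set()
for metadata, in qb.iterall():
    hashes.update(flatten(metadata).values())

print(f'collected {len(hashes)} unique hashes in {time.perf_counter() - start:.1f} s')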

@giovannipizzi (Member) commented:
(Note that I had benchmarked a preliminary implementation of the repository in a test repo (now archived) and the performance was acceptably fast for all operations that I tested at the time, including converting to the new format and back to the old one. Admittedly, I don't know whether I benchmarked extracting all UUIDs on its own, but it would be good to get an order of magnitude: if it is on the order of 30 s even for very large DBs, I think that is acceptable for a maintenance clean operation; if it is much longer, I agree we need to find a better solution.)

@sphuber (Contributor, Author) commented May 6, 2021

We should also keep in mind that querying for specific file names was never a requirement of the design. It has never been a feature and I don't remember anyone asking for it. In addition, collecting all file hashes from the database is, as @giovannipizzi stated, only necessary for maintenance operations, so we can expect that it does not have to be executed very often and that it can take a bit longer if necessary. On the other hand, introducing a new table might slow down everyday operations, such as retrieving files from a node, because a join would then be involved. I am not saying that penalty is necessarily unacceptable, but we should keep in mind that there are probably trade-offs to both designs.

@chrisjsewell (Member) commented May 6, 2021

Thanks for the responses, although both seem aimed more at #4919 than at the question I asked here, which was: "why did we choose to store the metadata in this way, rather than a simpler (path -> key) mapping?" (given that this is mainly what operations like get_object/put_object require).
Related to that: is it necessary to store empty folders?

@sphuber (Contributor, Author) commented May 6, 2021

Empty folders have been a use case at least for RemoteData nodes of CalcJob plugins, which is why we kept the possibility of representing them. Granted, I don't think the current Repository interface actually provides a direct mkdir method, but that would be trivial to add. Similarly, there was a discussion about whether the door should be kept open to store metadata on both files and directories. This may have been a premature optimization, but it is not completely unthinkable given the recent discussion on #2046.

Related, I find it a bit odd that we have aiida.repository.common.File that can be either a file or a directory.

Traditionally, a "file" can be understood to be either a regular file or a directory. The file_type attribute is there to distinguish which of the two it is in each specific case.

@giovannipizzi (Member) commented:
why did we choose to store the metadata in this way, rather than just a simpler (path -> key)

To guarantee consistency (see #4919 (comment)), and because we envisioned that when fetching a node you anyway need the whole structure: it is then just a one-row fetch (of a row that, by the way, you already fetched into memory when you asked for that node), rather than an additional JOIN query at the SQL level.

is it necessary to store empty folders?

Yes, see #4919 (comment)

@chrisjsewell (Member) commented May 7, 2021

Traditionally, a "file" can be understood to be both a file or directory. The file_type attribute is there to distinguish which one it is for each specific case.

I'd be interested to see where you got that "traditional" description from.
Anyhow, IMO it is definitely a "not ideal" abstraction, as illustrated by:

(a) the fact that you have to enforce polymorphic constraints on a File:

        if file_type == FileType.DIRECTORY and key is not None:
            raise ValueError('an object of type `FileType.DIRECTORY` cannot define a key.')

        if file_type == FileType.FILE and objects is not None:
            raise ValueError('an object of type `FileType.FILE` cannot define any objects.')

and (b) the fact that there are these three methods on the Repository:

def get_object(self, path: FilePath = None) -> File:
def get_directory(self, path: FilePath = None) -> File:
def get_file(self, path: FilePath) -> File:

It would probably be better to mirror the pathlib.Path API.
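For illustration, a hypothetical sketch of what a pathlib-style wrapper could look like; none of these names exist in AiiDA, and the Repository helpers it calls are assumptions, so this only shows the shape of the API being suggested:

# Hypothetical sketch only: none of these names exist in AiiDA. It merely
# illustrates mirroring the pathlib.Path API instead of a polymorphic File
# object plus get_object/get_directory/get_file.
from typing import Iterator


class RepositoryPath:
    """A path inside a node repository, behaving like pathlib.PurePosixPath."""

    def __init__(self, repository, path: str = ''):
        self._repository = repository  # the backing Repository instance (assumed)
        self._path = path

    def joinpath(self, *parts: str) -> 'RepositoryPath':
        return RepositoryPath(self._repository, '/'.join((self._path, *parts)).strip('/'))

    def is_dir(self) -> bool:
        return self._repository.has_directory(self._path)  # hypothetical helper

    def is_file(self) -> bool:
        return self._repository.has_file(self._path)  # hypothetical helper

    def iterdir(self) -> Iterator['RepositoryPath']:
        # list_object_names is assumed to return the entry names below a path
        for name in self._repository.list_object_names(self._path):
            yield self.joinpath(name)

    def read_bytes(self) -> bytes:
        return self._repository.get_object_content(self._path)  # assumed helper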

@sphuber (Contributor, Author) commented Dec 10, 2021

@ramirezfranciscof can this now be closed, since #4965 has been merged? There may be more specific features that we originally put here, such as verdi storage backup, but I suggest we move those to separate issues. All critical functionality for v2.0 is in there, no?

@ramirezfranciscof (Member) commented:
As far as I understood it, yes.
