`verdi`: add repository maintenance commands to operate on the repository #4321
Pinging @ramirezfranciscof and @giovannipizzi. Do you prefer to add comments here, or that we modify the initial comment directly?
I would add comments here as new posts, and once we agree I will update the OP to reflect the current design proposal.

Some tentative answers:
Correct me if I'm wrong, but there currently seems to be no performant way to retrieve all file hashes from the AiiDA DB. Take this example:

```python
from io import StringIO
import pprint

from aiida.orm import Dict

n = Dict()
n.put_object_from_filelike(StringIO("a"), "a/b/c/d")
n.put_object_from_filelike(StringIO("a"), "a/b/c/e")
n.put_object_from_filelike(StringIO("b"), "a/b/c/f")
n.store()
pprint.pprint(n.repository_metadata)
```

```
{'o': {'a': {'o': {'b': {'o': {'c': {'o': {'d': {'k': 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb'},
                                           'e': {'k': 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb'},
                                           'f': {'k': '3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d'}}}}}}}}}
```

To my knowledge, this is the only place these hashes are stored in the DB, i.e. collecting them all requires something like:

```python
import collections.abc

from aiida.orm import Node, QueryBuilder


def flatten(dictionary, parent_key=False, separator='/'):
    """Flatten the nested repository metadata into a mapping of path -> hash key."""
    items = []
    for key, value in dictionary.items():
        new_key = str(parent_key) + separator + key if parent_key else key
        if isinstance(value, collections.abc.MutableMapping):
            items.extend(flatten(value, new_key, separator).items())
        elif isinstance(value, list):
            for k, v in enumerate(value):
                items.extend(flatten({str(k): v}, new_key).items())
        else:
            items.append((new_key, value))
    return dict(items)


qb = QueryBuilder()
qb.append(Node, project=["repository_metadata"])
hashes = set()
for data, in qb.iterall():
    hashes.update(flatten(data).values())
```

Why did we choose to store the metadata in this way, rather than just the simpler:

```python
repository_metadata = {
    "a/b/c/d": "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb"
}
```

It seems like there are multiple places where we have to unnecessarily go through this flattening/unflattening process, when all you want to do is map an object path to its key/hash.

Related, I find it a bit odd that we have […] (also, on a minor nitpick, we use […]).

Lastly, I'm not going to push this, because I know it's falling on deaf ears, but as I already said in #2046 (comment): if you were coming at this purely from a performance perspective, you would take the SQLite table (https://github.com/aiidateam/disk-objectstore/blob/a2561a40cf2e9a4d58325387f913ccf08111f5fd/disk_objectstore/models.py#L8) and move it to the AiiDA (Postgres) database, from where you could implement a more powerful many-to-one relationship between the node files and their existence/location in the pack files.
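To make that last suggestion concrete, here is a minimal sketch in SQLAlchemy, assuming hypothetical table and column names (this is not actual aiida-core or disk-objectstore code): the pack index, with roughly the fields of the linked `Obj` model, lives in Postgres next to the node tables, and a link table provides the many-to-one relation from node file paths to packed objects.

```python
# Hypothetical sketch: pack index moved into the AiiDA database, plus a
# link table mapping node file paths to packed objects. All names made up.
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class PackedObject(Base):
    """One row per object stored in the pack files."""
    __tablename__ = 'db_packed_object'
    id = Column(Integer, primary_key=True)
    hashkey = Column(String, unique=True, index=True, nullable=False)
    compressed = Column(Boolean, nullable=False)
    size = Column(Integer, nullable=False)     # uncompressed size in bytes
    offset = Column(Integer, nullable=False)   # byte offset inside the pack file
    length = Column(Integer, nullable=False)   # stored (possibly compressed) length
    pack_id = Column(Integer, nullable=False)  # which pack file holds the object


class NodeFile(Base):
    """Many-to-one link from a file path in a node's repository to its object."""
    __tablename__ = 'db_node_file'
    id = Column(Integer, primary_key=True)
    node_id = Column(Integer, index=True, nullable=False)  # would be an FK to the node table
    path = Column(String, nullable=False)                  # e.g. 'a/b/c/d'
    object_id = Column(Integer, ForeignKey('db_packed_object.id'), nullable=False)
    object = relationship(PackedObject)
```

With a layout like this, collecting every referenced hash becomes a single indexed query over the two tables instead of flattening the JSON metadata of every node.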
You're not falling on deaf ears; it's just that before changing a design decision we should make sure we don't run into other performance bottlenecks, so it would be good to discuss the pros and cons before changing anything. Two comments:
(Note that I had benchmarked a preliminary implementation of the repository in this test (now archived) repo, and the performance was acceptably fast for all operations that I tested at the time. This included converting to the new format, and back to the old one.)
We should also keep in mind that querying for specific file names was never a requirement for the design. It has never been a feature and I don't remember anyone asking for it. In addition, collecting all file hashes from the database is indeed only necessary for a maintenance operation, as @giovannipizzi stated, so we can expect that it does not have to be executed very often and can also take a bit longer if necessary. On the other hand, introducing a new table might slow down everyday operations, such as retrieving files from a node, because a join would now be involved. I am not saying that penalty is by definition unacceptable, but we should keep in mind that there are probably tradeoffs to both designs.
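For concreteness, the everyday operation at stake is something like the following sketch (the pk is hypothetical, and a configured AiiDA profile is assumed). Today the path resolves through `repository_metadata`, which already arrived with the node row, so no join is needed:

```python
from aiida.orm import load_node

node = load_node(42)  # hypothetical node pk
# The path is resolved via node.repository_metadata, fetched together with
# the node itself; a separate file table would turn this into an SQL join.
content = node.get_object_content('path/to/file')
```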
Thanks for the responses, although it seems like both of them are more aimed at #4919 rather than the question I asked here, which was:
Empty folders have been a use case for at least […].

Traditionally, a "file" can be understood to mean either a file or a directory. The […]
To guarantee consistency (see #4919 (comment)), and because we envisioned that when fetching a node you anyway need the whole structure: it is just a one-row fetch (which, by the way, you already have in memory when you asked for that node), rather than one more JOIN query at the SQL level.

Yes, see #4919 (comment).
I'd be interested to see where you got that "traditional" description from? My issue is (a) that you have to enforce certain polymorphic constraints on a […]:

```python
if file_type == FileType.DIRECTORY and key is not None:
    raise ValueError('an object of type `FileType.DIRECTORY` cannot define a key.')
if file_type == FileType.FILE and objects is not None:
    raise ValueError('an object of type `FileType.FILE` cannot define any objects.')
```

and (b) the fact that there are these three methods on the […]:

```python
def get_object(self, path: FilePath = None) -> File:
def get_directory(self, path: FilePath = None) -> File:
def get_file(self, path: FilePath) -> File:
```

It would probably be better to mirror the […]
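One way to read point (a): if files and directories were separate types, those constraints would be unrepresentable rather than checked at runtime. A minimal sketch of that alternative, with hypothetical class names (not the actual aiida-core API):

```python
# Hypothetical sketch: model files and directories as separate types so the
# polymorphic checks in (a) cannot be violated by construction.
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class File:
    """A leaf object: always has a content key (hash), never children."""
    name: str
    key: str  # hash of the content in the object store


@dataclass
class Directory:
    """A container: never has a key, only named children."""
    name: str
    objects: Dict[str, Union['File', 'Directory']] = field(default_factory=dict)


root = Directory('a')
root.objects['d'] = File('d', 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb')
```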
@ramirezfranciscof can this now be closed, since #4965 has been merged? There may be more specific features that we put here originally, such as […]

As far as I understood it, yes.
Once the new repository implementation is merged, the disk object storage (DOS) will need some maintenance now and again to improve performance, and this needs to be controllable from `verdi`. Operations include, but are potentially not limited to: […]

Some of these operations can be executed while AiiDA is in operation, whereas others can only be executed as a maintenance operation during which the DOS cannot be used. There is also a difference in efficiency: some operations can be very quick, whereas others (such as repacking) can be very expensive. Currently the idea is that there should be one command, used by the majority of users, that just needs to be called periodically to optimize the repository. However, it would be good to also expose the sub-operations listed above as additional commands, to provide more control so that they can, for example, be called in cron jobs.
Current proposal of subcommands for the new command `verdi repository`:

- `backup`: perform an `rsync` of the DOS. Open question: should it `rsync` the state as is, should it automatically perform some maintenance operations first, or should it propose these operations to the user and ask for confirmation?
- `maintenance`: single-stop-shop command for most users. Will automatically, or through prompting, optimize the DOS in the most effective way possible.
- `pack`: pack all loose objects.
- `clean`: delete packed loose objects, delete unreferenced objects and vacuum the database.
  - `--delete-loose/--no-delete-loose`:
  - `--delete-unreferenced/--no-delete-unreferenced`:
  - `--vacuum-index/--no-vacuum-index`:
  - `--repack`: will fully repack the DOS, which is not always necessary and potentially expensive, but can greatly improve performance.

This is an incomplete proposal and names are in no way definitive. Let's discuss below and I will keep updating the design here once certain things become agreed upon.
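A minimal sketch of how these subcommands could hang together as a click command group (`verdi` is built on click); all names, options, and defaults here are just the tentative ones from this proposal, not an actual implementation:

```python
# Hypothetical wiring of the proposed `verdi repository` subcommands.
import click


@click.group('repository')
def repository():
    """Maintain the disk object store (DOS) of a profile."""


@repository.command('pack')
def pack():
    """Pack all loose objects."""
    click.echo('packing loose objects')


@repository.command('clean')
@click.option('--delete-loose/--no-delete-loose', default=True,
              help='Delete loose objects that have already been packed.')
@click.option('--delete-unreferenced/--no-delete-unreferenced', default=True,
              help='Delete objects no longer referenced by any node.')
@click.option('--vacuum-index/--no-vacuum-index', default=False,
              help='Vacuum the index database after cleaning.')
@click.option('--repack', is_flag=True,
              help='Fully repack the DOS: expensive but can improve performance.')
def clean(delete_loose, delete_unreferenced, vacuum_index, repack):
    """Delete packed loose objects and unreferenced objects; optionally vacuum."""
    click.echo(f'clean: loose={delete_loose} unreferenced={delete_unreferenced} '
               f'vacuum={vacuum_index} repack={repack}')


if __name__ == '__main__':
    repository()
```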