
ADD: New repository CLI #4965

Merged · 19 commits · Dec 9, 2021
Conversation

@ramirezfranciscof (Member) commented May 31, 2021

This PR incorporates the new command line tool to control the maintenance tasks of the repository (fixes #4321).

The main challenge of this task is reconciling (1) the need for specific control over the backend and the processes happening underneath with (2) a general frontend interface that is simple for the user yet versatile enough to control many possibly different backends. This tension is common in software design, but here the specifics of the underlying processes matter enough that hiding them behind a generic interface could render the whole feature unusable.

Current Implementation

There is a single command, verdi repository maintain, which performs a full maintenance procedure on the repository backend. This procedure might be time-intensive and requires the user to stop using the database to prevent any data corruption, so a proper warning is displayed asking for confirmation (there is also the possibility of guaranteeing the safety with profile locking, see below).

The command also has the flag --live to indicate that only maintenance tasks that can be done while still using AiiDA should be executed. Again, a warning lets the user know that this is not the full procedure and that they should run the full command once they have the time.

I think this characteristic is essentially the critical minimal information the user needs to know: can I do this while still using AiiDA, or do I need some "downtime"? Even considerations of performance control (a quick maintenance vs. an in-depth one) are secondary, at least in the sense that they are more relevant for "downtime" maintenance than for the "live" option.
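As a minimal sketch of this live/downtime split (the operation names and the routing function here are hypothetical, not the actual aiida-core implementation), a backend could declare which operations are safe to run while the profile is in use:

```python
# Hypothetical operation names for illustration only.
LIVE_SAFE = {'pack_loose'}                       # safe while AiiDA is in use
FULL_ONLY = {'clean_loose', 'repack', 'vacuum'}  # require "downtime"

def maintain(live: bool = False) -> list:
    """Return the ordered list of operations that would be executed.

    With ``live=True`` only the live-safe subset runs; a full maintenance
    runs everything.
    """
    operations = sorted(LIVE_SAFE)
    if not live:
        operations += sorted(FULL_ONLY)
    return operations
```

The point of the sketch is that the user-facing decision collapses to a single boolean, while the backend keeps ownership of which operations fall on which side of the line.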

Finally, I added a --pass_down option that accepts a string that gets sent to the backend and can be used for testing, performance analysis, or power-user control of specific backends. You can see how this is currently being used to have finer-grained control of the different stages of maintenance in the objectstore repository backend.

The case of deletion

One last thing that is common to all repository backends is the "propagation of the deletion of files". The issue is that deleting data only takes direct effect in the AiiDA database (removing the reference from there) without affecting the content of the backend. The deletion therefore needs to be specifically propagated to the repository backend, a process that is currently performed every time the maintenance is run (with or without --live). This is not strictly linked to the underlying maintenance operations though, and in principle we could separate this command so that users can propagate the deletion to the backend independently.

I chose to initially make this part of the maintenance so as to present a simpler interface, since it is not even guaranteed that this will have any beneficial effect on its own (for example, if the backend also soft-deletes and keeps the unreferenced files around until a full maintenance is executed). This means, however, that it can't be controlled by users externally, since it is performed on the AiiDA side and is not part of the backend (and thus should not be influenced by anything in pass_down). In principle, for the objectstore, users can perform only this propagation if they pass_down the options to cancel all backend operations, but (1) this is very backend-specific and (2) they can't currently choose to skip it this way. I'm considering adding a --skip-propagation flag for this purpose, but since it is an addition it is not critical to have it now (unless there is some specific performance issue with this part).
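Conceptually, the propagation step amounts to a set difference between what the backend stores and what AiiDA still references. A minimal illustration (not the actual aiida-core code; the function name is made up):

```python
def unreferenced_keys(referenced_keys, backend_keys) -> set:
    """Return the backend object keys that AiiDA no longer references.

    These are the objects that the propagation step would ask the
    repository backend to remove.
    """
    return set(backend_keys) - set(referenced_keys)
```

Everything else in the propagation (querying the database for the referenced keys, issuing the actual deletions) is bookkeeping around this difference.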

Tasks:

  • Incorporate daemon locking mechanism (see this PR)
  • Add documentation section for new command.

@ramirezfranciscof (Member, Author) commented May 31, 2021

Current Draft State

This draft does not yet implement the simplified interface described in the OP. Instead, the interface is left "completely" transparent, in the sense that it allows the user to easily enable/disable each of the following operations:

  • "Pack" files: the disk-objectstore will initially keep all individual files "loose", and through this command you can tell it to store them in packs (this duplicates the data).
  • "Clean" loose files: remove the loose files that have already been packed (necessary to free the space duplicated by packing).
  • "Transmit" deletion of files: when users delete files, AiiDA just removes its internal reference to them in the objectstore. To actually delete them from there, one needs to run this operation, which takes stock of all objects AiiDA still keeps track of and asks the repository to specifically remove all the ones that are not.
  • "Repack" files: when deleting objects in the disk-objectstore, the loose files are removed but the pack files will just lose their reference in the internal database of the objectstore. This re-organizes the packs and removes all data that is not being referenced there.
  • "Vacuum" the repository DB: this will re-order the rows of the database so that the tables are optimized for the current status of the packs (which may have changed due to deletion or repacking).
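A minimal sketch of this transparent interface (the operation names and the `operations` mapping are hypothetical, not the actual disk-objectstore API):

```python
def run_maintenance(operations, pack=True, transmit=True, clean=True,
                    repack=True, vacuum=True):
    """Execute the enabled operations in order and return their names.

    ``operations`` maps each operation name to a zero-argument callable
    that performs it on the backend.
    """
    enabled = [('pack', pack), ('transmit', transmit), ('clean', clean),
               ('repack', repack), ('vacuum', vacuum)]
    executed = []
    for name, flag in enabled:
        if flag:
            operations[name]()  # invoke the backend operation
            executed.append(name)
    return executed
```

Each flag maps directly to one of the five operations above, so any subset can be timed in isolation.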

The point of this transparency is to be able to determine some basic heuristics for how long the disk-objectstore takes to perform each process, and to better figure out which ones can be grouped together without much penalty, especially within the subgroups that can be performed with the daemon running (pack/transmit) and the ones that can't (clean/repack/vacuum). For this purpose I have also created this companion script (you can download and pip-install the full repository in an AiiDA environment, as the script makes use of other tools in there). If you want to test it yourself, you can run it as follows:

(aiida) $  ./test00_timing.py --mid-files

Test Procedure

  1. Create the test repository (takes around 30 min for me).
  2. Do a first deletion of 50 nodes and then run the whole series of "transmit" => "clean" => "vacuum" (twice, to get a baseline of how expensive these commands are when they have nothing to do, and to check that there are effectively no cross-related effects).
  3. Pack the files of the remaining 150 nodes (twice, to get a baseline for "unnecessary" runs) and run the series of "transmit/clean/vacuum" twice.
  4. Delete another set of 49/50 nodes (now packed) and run the series of "transmit/clean/vacuum" twice.
  5. Repack the files of the remaining 100/101 nodes (twice as well) and run the series of "transmit/clean/vacuum" twice.
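The timing pattern behind the output below can be sketched as a small wrapper (`timed` is a hypothetical helper for illustration, not necessarily what the companion script uses):

```python
import time

def timed(label, operation):
    """Run ``operation`` once, print its wall-clock duration, and return
    the elapsed time together with the operation's result."""
    start = time.perf_counter()
    result = operation()
    elapsed = time.perf_counter() - start
    print(f'>>> Elapsed time ({label}): {elapsed}')
    return elapsed, result
```

Running each step twice through such a wrapper gives both the real cost and the "nothing to do" baseline reported below.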

Preliminary results

You can see my full output below, but my summary of it would be:

  • Deleting around 250 loose files (in 50 nodes) totalling 2 GB takes less than 20 seconds. The baseline (the query process of finding unreferenced files) seems to take only 1-2 seconds (although this DB of 200 nodes / 1000 files is rather small; I want to try a larger one).
  • Packing 6 GB of files takes half a minute. Extrapolating a bit, packing a 1 TB repository could take an hour and a half, which seems pretty decent. It is a risky extrapolation, and having the disk writing for over an hour is not great for the computer, so periodic packing should be advised. The baseline when there is nothing to do is around one second.
  • Cleaning packed files seems to be faster than deleting a smaller number of loose files? (20 secs vs 2.5 secs?) Maybe this is because the backend only exposes an option to delete a single object, so I have to call it for every file (see line 39 of aiida/repository/control.py), while it would be more efficient to give the object store a list of files to delete and let it take care of it (i.e. expose self.container.delete_objects(...) in the methods of the abstract repository backend; for example, see line 101 in aiida/repository/backend/disk_object_store.py).
  • Deleting packed nodes is significantly more expensive than deleting loose nodes (90 secs vs 20 secs, although again I would like to check what happens if I unify the deletion in a single call).
  • Repacking seems to be "expensive" even if there is not "much to do" (two consecutive calls take 15 secs each, with no reduction in the second one). Again, as with all these results, I would like to see how it changes with many small files.
  • Vacuuming seems to always take around 1 second no matter what there is to vacuum.
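The batched-deletion idea mentioned above can be sketched as follows (`CountingBackend` is a hypothetical stand-in that only counts calls, not an actual repository backend):

```python
class CountingBackend:
    """Minimal stand-in for a repository backend; just counts calls."""

    def __init__(self):
        self.calls = 0

    def delete_object(self, key):
        self.calls += 1  # one backend round-trip per object

    def delete_objects(self, keys):
        self.calls += 1  # one round-trip for the whole batch

def delete_one_by_one(backend, keys):
    """Current pattern: one backend call per unreferenced object."""
    for key in keys:
        backend.delete_object(key)

def delete_batched(backend, keys):
    """Proposed pattern: hand the backend the full list in one call,
    letting it optimize the removal internally."""
    backend.delete_objects(list(keys))
```

If per-call overhead dominates (as the transmit timings suggest), the batched variant reduces n round-trips to one.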
Full output
Running test for "--mid-files"...

> Now setting up the repository...
>>> Deleting all nodes...
>>> Cleaning up database and repository...
>>> Populating the database...
> Elapsed time: 1636.7547591612674
> The database currently holds 200 nodes and the repo occupies 8.1G

> Now deleting unpacked nodes...
> Elapsed time: 0.10452547576278448
> The database currently holds 150 nodes and the repo occupies 8.1G
> Now cleaning deleted unpacked nodes...
>>> Elapsed time (transmit): 17.87814964680001
>>> Elapsed time (clean): 1.1770451520569623
>>> Elapsed time (vacuum): 1.1977200941182673
> The database currently holds 150 nodes and the repo occupies 5.8G
>>> Elapsed time (transmit): 1.6738419053144753
>>> Elapsed time (clean): 1.1854597837664187
>>> Elapsed time (vacuum): 1.1461118678562343
> The database currently holds 150 nodes and the repo occupies 5.8G

> Now packing the nodes...
> Elapsed time (pack): 37.37593990098685
> The database currently holds 150 nodes and the repo occupies 12G

> Now packing the nodes...
> Elapsed time (pack): 1.2292702598497272
> The database currently holds 150 nodes and the repo occupies 12G
> Now cleaning repacked nodes...
>>> Elapsed time (transmit): 1.7674363791011274
>>> Elapsed time (clean): 2.378125532064587
>>> Elapsed time (vacuum): 1.111530366819352
> The database currently holds 150 nodes and the repo occupies 5.7G
>>> Elapsed time (transmit): 1.69200792722404
>>> Elapsed time (clean): 0.9920809953473508
>>> Elapsed time (vacuum): 1.1101957960054278
> The database currently holds 150 nodes and the repo occupies 5.7G

> Now deleting packed nodes...
> Elapsed time: 0.09688124433159828
> The database currently holds 101 nodes and the repo occupies 5.7G
> Now cleaning deleted packed nodes...
>>> Elapsed time (transmit): 92.88936862908304
>>> Elapsed time (clean): 1.0394411101005971
>>> Elapsed time (vacuum): 1.0659767519682646
> The database currently holds 101 nodes and the repo occupies 5.8G
>>> Elapsed time (transmit): 1.5534060336649418
>>> Elapsed time (clean): 1.0056658750399947
>>> Elapsed time (vacuum): 1.0443645529448986
> The database currently holds 101 nodes and the repo occupies 5.8G

> Now repacking nodes...
> Elapsed time: 16.907301522791386
> The database currently holds 101 nodes and the repo occupies 3.7G

> Now repacking nodes...
> Elapsed time: 17.614450520370156
> The database currently holds 101 nodes and the repo occupies 3.7G
> Now cleaning deleted packed nodes...
>>> Elapsed time (transmit): 1.517629874870181
>>> Elapsed time (clean): 1.0210560159757733
>>> Elapsed time (vacuum): 1.0760509828105569
> The database currently holds 101 nodes and the repo occupies 3.7G
>>> Elapsed time (transmit): 1.5449293879792094
>>> Elapsed time (clean): 1.0809445828199387
>>> Elapsed time (vacuum): 1.1022412879392505
> The database currently holds 101 nodes and the repo occupies 3.7G

I would like to try this with a lot of small files (see the --small-files flag) to confirm the timings, but so far I haven't been able to run it to completion (several problems with the connection to my remote machine closing before the DB creation finishes; this one seems to take a lot more time). I would like to wait until I have this before making any generalized statements supporting or supplementing the implementation proposed in the OP, but I wanted to show the current state in case somebody has feedback or questions.

@codecov bot commented Sep 1, 2021

Codecov Report

Merging #4965 (f0aa923) into develop (29915bd) will increase coverage by 0.03%.
The diff coverage is 91.97%.


@@             Coverage Diff             @@
##           develop    #4965      +/-   ##
===========================================
+ Coverage    81.43%   81.46%   +0.03%     
===========================================
  Files          529      530       +1     
  Lines        37002    37113     +111     
===========================================
+ Hits         30128    30229     +101     
- Misses        6874     6884      +10     
Flag Coverage Δ
django 76.92% <91.97%> (+0.05%) ⬆️
sqlalchemy 75.92% <91.97%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/backends/general/migrations/utils.py 85.28% <50.00%> (-0.73%) ⬇️
aiida/repository/backend/sandbox.py 97.27% <50.00%> (-2.73%) ⬇️
...da/tools/archive/implementations/sqlite/backend.py 81.67% <50.00%> (-0.53%) ⬇️
aiida/cmdline/commands/cmd_storage.py 96.97% <92.86%> (-1.14%) ⬇️
aiida/backends/control.py 97.23% <97.23%> (ø)
aiida/repository/backend/disk_object_store.py 95.29% <97.73%> (+1.74%) ⬆️
aiida/backends/__init__.py 92.31% <100.00%> (+1.40%) ⬆️
aiida/repository/backend/abstract.py 98.56% <100.00%> (+0.09%) ⬆️
aiida/transports/plugins/local.py 81.41% <0.00%> (-0.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 29915bd...f0aa923.

@chrisjsewell (Member) left a comment
Heya, also need list_objects for the archive, so just checking we are on the same page

Review threads (resolved): aiida/backends/general/migrations/utils.py · aiida/repository/backend/abstract.py (×3) · aiida/repository/backend/disk_object_store.py · aiida/repository/backend/sandbox.py · tests/repository/backend/test_abstract.py
@chrisjsewell (Member):
Also need list_objects for the archive, so just checking we are on the same page

might end up extracting this in to a separate PR/commit

@ramirezfranciscof (Member, Author):
Outstanding Discussion: (writing...)

@ramirezfranciscof (Member, Author):
Applied the changes discussed earlier today @sphuber @chrisjsewell, so this is ready for review. Of the previously mentioned problems, I am still having this one:

Problem 3: the verdi tests are detecting a "potential speed problem". From what I understand, this is because they detect that the orm is getting imported somewhere it shouldn't in the cmdline, but (1) I have not added any global import in cmd_storage, (2) I have not modified any other cmdline module here, (3) I haven't modified anything in the modules that are globally imported in cmd_storage, and (4) I haven't added a global import orm or from orm anywhere. So I am a bit at a loss here; I'm not sure what else this error could mean.

  warnings.warn(f'Creating AiiDA configuration folder `{path}`.')
Critical: potential `verdi` speed problem: `aiida.orm` module is imported which is not in: ('aiida.backends', 'aiida.cmdline', 'aiida.common', 'aiida.manage', 'aiida.plugins', 'aiida.restapi')
Error: Process completed with exit code 1.
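Not a diagnosis of this particular failure, but when `aiida.orm` is needed only for type hints (as in the snippet quoted further down), a common way to keep it out of the runtime import graph is `typing.TYPE_CHECKING`; a sketch:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only seen by static type checkers; never imported at runtime,
    # so it cannot trip an import-speed check.
    from aiida.orm.implementation import Backend

def repository_maintain(
    full: bool = False,
    dry_run: bool = False,
    backend: Optional['Backend'] = None,  # quoted: valid without the import
    **kwargs,
) -> dict:
    """Stub body for illustration only."""
    return {}
```

The string annotation keeps the signature valid at runtime while the real import happens only during type checking.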

Any ideas?

@sphuber (Contributor) left a comment
Thanks @ramirezfranciscof . I think the interface is a lot better now. Just some comments on bits of the new implementation

Review threads: aiida/backends/control.py (×5) · aiida/repository/backend/disk_object_store.py (×3)
Comment on lines +189 to +190
files_numb = self.container.count_objects()['packed']
files_size = self.container.get_total_size()['total_size_packfiles_on_disk'] * BYTES_TO_MB
@sphuber (Contributor):
This information is not really adding anything specific to the maintenance operation is it? It just gives the current size, but that doesn't tell what it will be nor what will be saved. Only the latter would be really interesting IMO

@ramirezfranciscof (Member, Author):
Well, it can help give you an idea of how long it might take to do the repacking. But ok, I can take it out if you prefer.

Review thread: tests/cmdline/commands/test_storage.py
@ramirezfranciscof (Member, Author) commented Dec 9, 2021

Hey @sphuber, I think we might have counted differently, but this is ready for another review.

By the way, related to the error: the only place where I added something related to the orm is here, and I need it only for the typing:

from typing import Optional

from aiida.common.log import AIIDA_LOGGER
from aiida.orm.implementation import Backend

__all__ = ('MAINTAIN_LOGGER',)

MAINTAIN_LOGGER = AIIDA_LOGGER.getChild('maintain')

def repository_maintain(
    full: bool = False,
    dry_run: bool = False,
    backend: Optional[Backend] = None,
    **kwargs,
) -> dict:

@sphuber (Contributor) left a comment

Thanks @ramirezfranciscof . Looks good to be merged now; just spotted some docstrings you forgot to update after last changes. If you correct those, I will approve and merge.

Review threads: aiida/repository/backend/disk_object_store.py (×3)
@ramirezfranciscof (Member, Author):
All tests seem to be passing. I am now again having the problem I describe here, but I just committed ignoring the hooks. I would still like to know why this is happening, but I want this merged more.

@sphuber all good now? (Note that there is an outstanding comment in your previous review where I was waiting for confirmation on whether my reply convinced you or you still want the info taken out.)

@sphuber (Contributor) left a comment

Thanks @ramirezfranciscof. The ignore statement for mypy because of the different signatures is not a big deal; it is the same as the one for pylint. Let's keep that for now and merge this.
