
ADD: New repository CLI #4965

Merged · 19 commits · Dec 9, 2021
Conversation

@ramirezfranciscof (Member) commented May 31, 2021

This PR incorporates the new command line tool to control the maintenance tasks of the repository (fixes #4321).

The main challenge of this task is reconciling (1) the need for specific control over the backend and the processes happening underneath with (2) a general frontend interface that is simple for the user yet versatile enough to control many possibly different backends. This tension is common in software design, but here the specifics of the underlying processes matter enough that hiding them behind a generic interface could render the whole feature unusable.

Current Implementation

There is a single command, verdi repository maintain, which performs a full maintenance procedure on the repository backend. This procedure might be time-intensive and requires the user to stop using the database to prevent any data corruption, so a proper warning is displayed asking for confirmation (there is also the possibility of guaranteeing the safety with profile locking, see below).

The command also has the flag --live to indicate that only maintenance tasks that can be done while still using AiiDA should be executed. Again, a warning lets the user know that this is not the full procedure and that they should run the full command once they have the time.

I think this characteristic is essentially the critical minimal information the user needs to know: can I do this while still using AiiDA, or do I need some "downtime"? Even considerations of performance control (a quick maintenance vs. an in-depth one) are secondary, at least in the sense that they are more relevant for "downtime" maintenance than for the "live" option.
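As a minimal sketch of this live/downtime split (the operation names and the routing function here are hypothetical, not the actual aiida-core implementation), a backend could declare which operations are safe to run while the profile is in use:

```python
# Hypothetical operation names for illustration only.
LIVE_SAFE = {'pack_loose'}                       # safe while AiiDA is in use
FULL_ONLY = {'clean_loose', 'repack', 'vacuum'}  # require "downtime"

def maintain(live: bool = False) -> list:
    """Return the ordered list of operations that would be executed.

    With ``live=True`` only the live-safe subset runs; a full maintenance
    runs everything.
    """
    operations = sorted(LIVE_SAFE)
    if not live:
        operations += sorted(FULL_ONLY)
    return operations
```

The point of the sketch is that the user-facing decision collapses to a single boolean, while the backend keeps ownership of which operations fall on which side of the line.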

Finally, I added a --pass_down option that accepts a string that gets sent to the backend and can be used for testing, performance analysis, or power-user control of specific backends. You can see how this is currently being used to have finer-grained control of the different stages of maintenance in the objectstore repository backend.

The case of deletion

One last thing that is common to all repository backends is the "propagation of the deletion of files". The issue is that deleting data only takes direct effect in the AiiDA database (removing the reference from there) without affecting the content of the backend. The deletion therefore needs to be specifically propagated to the repository backend, a process that is currently performed every time the maintenance is run (with or without --live). This is not strictly linked to the underlying maintenance operations though, and in principle we could separate this command so that users can propagate the deletion to the backend independently.

I chose to initially make this part of the maintenance so as to present a simpler interface, since it is not even guaranteed that this will have any beneficial effect on its own (for example, if the backend also soft-deletes and keeps the unreferenced files around until a full maintenance is executed). This means, however, that it can't be controlled by users externally, since it is performed on the AiiDA side and is not part of the backend (and thus should not be influenced by anything in pass_down). In principle, for the objectstore, users can perform only this propagation if they pass_down the options to cancel all backend operations, but (1) this is very backend-specific and (2) they can't currently choose to skip it this way. I'm considering adding a --skip-propagation flag for this purpose, but since it is an addition it is not critical to have it now (unless there is some specific performance issue with this part).
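Conceptually, the propagation step amounts to a set difference between what the backend stores and what AiiDA still references. A minimal illustration (not the actual aiida-core code; the function name is made up):

```python
def unreferenced_keys(referenced_keys, backend_keys) -> set:
    """Return the backend object keys that AiiDA no longer references.

    These are the objects that the propagation step would ask the
    repository backend to remove.
    """
    return set(backend_keys) - set(referenced_keys)
```

Everything else in the propagation (querying the database for the referenced keys, issuing the actual deletions) is bookkeeping around this difference.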

Tasks:

  • Incorporate daemon locking mechanism (see this PR)
  • Add documentation section for new command.

@ramirezfranciscof (Member, Author) commented May 31, 2021

Current Draft State

This draft does not yet implement the simplified interface described in the OP. Instead, the interface is left "completely" transparent, in the sense that it allows the user to easily enable/disable each of the following operations:

  • "Pack" files: the disk-objectstore will initially keep all individual files "loose", and through this command you can tell it to store them in packs (this duplicates the data).
  • "Clean" loose files: remove the loose files that have already been packed (necessary to free the space duplicated by packing).
  • "Transmit" deletion of files: when users delete files, AiiDA just removes its internal reference to them in the objectstore. To actually delete them from there, one needs to run this operation, which takes stock of all objects AiiDA still keeps track of and asks the repository to specifically remove all the ones that are not.
  • "Repack" files: when deleting objects in the disk-objectstore, the loose files are removed but the pack files will just lose their reference in the internal database of the objectstore. This re-organizes the packs and removes all data that is not being referenced there.
  • "Vacuum" the repository DB: this will re-order the rows of the database so that the tables are optimized for the current status of the packs (which may have changed due to deletion or repacking).
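A minimal sketch of this transparent interface (the operation names and the `operations` mapping are hypothetical, not the actual disk-objectstore API):

```python
def run_maintenance(operations, pack=True, transmit=True, clean=True,
                    repack=True, vacuum=True):
    """Execute the enabled operations in order and return their names.

    ``operations`` maps each operation name to a zero-argument callable
    that performs it on the backend.
    """
    enabled = [('pack', pack), ('transmit', transmit), ('clean', clean),
               ('repack', repack), ('vacuum', vacuum)]
    executed = []
    for name, flag in enabled:
        if flag:
            operations[name]()  # invoke the backend operation
            executed.append(name)
    return executed
```

Each flag maps directly to one of the five operations above, so any subset can be timed in isolation.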

The point of this transparency is to be able to determine some basic heuristics for how long the disk-objectstore takes to perform each process, and to better figure out which ones can be grouped together without much penalty, especially within the subgroups that can be performed with the daemon running (pack/transmit) and the ones that can't (clean/repack/vacuum). For this purpose I have also created this companion script (you can download and pip-install the full repository in an AiiDA environment, as the script makes use of other tools in there). If you want to test it yourself, you can run it as follows:

(aiida) $  ./test00_timing.py --mid-files

Test Procedure

  1. Create the test repository (takes around 30 min for me).
  2. Do a first deletion of 50 nodes and then run the whole series of "transmit" => "clean" => "vacuum" (twice, to get a baseline of how expensive these commands are when they have nothing to do, and to check that there are effectively no cross-related effects).
  3. Pack the files of the remaining 150 nodes (twice, to get a baseline for "unnecessary" runs) and run the series of "transmit/clean/vacuum" twice.
  4. Delete another set of 49/50 nodes (now packed) and run the series of "transmit/clean/vacuum" twice.
  5. Repack the files of the remaining 100/101 nodes (twice as well) and run the series of "transmit/clean/vacuum" twice.
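The timing pattern behind the output below can be sketched as a small wrapper (`timed` is a hypothetical helper for illustration, not necessarily what the companion script uses):

```python
import time

def timed(label, operation):
    """Run ``operation`` once, print its wall-clock duration, and return
    the elapsed time together with the operation's result."""
    start = time.perf_counter()
    result = operation()
    elapsed = time.perf_counter() - start
    print(f'>>> Elapsed time ({label}): {elapsed}')
    return elapsed, result
```

Running each step twice through such a wrapper gives both the real cost and the "nothing to do" baseline reported below.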

Preliminary results

You can see my full output below, but my summary of it would be:

  • Deleting around 250 loose files (in 50 nodes) totalling 2 GB takes less than 20 seconds. The baseline (the query process of finding unreferenced files) seems to take only 1-2 seconds (although this DB of 200 nodes / 1000 files is rather small; I want to try a larger one).
  • Packing 6 GB of files takes half a minute. Extrapolating a bit, packing a 1 TB repository could take an hour and a half, which seems pretty decent. It is a risky extrapolation, and having the disk writing for over an hour is not great for the computer, so periodic packing should be advised. The baseline when there is nothing to do is around one second.
  • Cleaning packed files seems to be faster than deleting a smaller number of loose files? (20 secs vs 2.5 secs?) Maybe this is because the backend only exposes an option to delete a single object, so I have to call it for every file (see line 39 of aiida/repository/control.py), while it would be more efficient to give the object store a list of files to delete and let it take care of it (i.e. expose self.container.delete_objects(...) in the methods of the abstract repository backend; for example, see line 101 in aiida/repository/backend/disk_object_store.py).
  • Deleting packed nodes is significantly more expensive than deleting loose nodes (90 secs vs 20 secs, although again I would like to check what happens if I unify the deletion in a single call).
  • Repacking seems to be "expensive" even if there is not "much to do" (two consecutive calls take 15 secs each, with no reduction in the second one). Again, as with all these results, I would like to see how it changes with many small files.
  • Vacuuming seems to always take around 1 second no matter what there is to vacuum.
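The batched-deletion idea mentioned above can be sketched as follows (`CountingBackend` is a hypothetical stand-in that only counts calls, not an actual repository backend):

```python
class CountingBackend:
    """Minimal stand-in for a repository backend; just counts calls."""

    def __init__(self):
        self.calls = 0

    def delete_object(self, key):
        self.calls += 1  # one backend round-trip per object

    def delete_objects(self, keys):
        self.calls += 1  # one round-trip for the whole batch

def delete_one_by_one(backend, keys):
    """Current pattern: one backend call per unreferenced object."""
    for key in keys:
        backend.delete_object(key)

def delete_batched(backend, keys):
    """Proposed pattern: hand the backend the full list in one call,
    letting it optimize the removal internally."""
    backend.delete_objects(list(keys))
```

If per-call overhead dominates (as the transmit timings suggest), the batched variant reduces n round-trips to one.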
Full output
Running test for "--mid-files"...

> Now setting up the repository...
>>> Deleting all nodes...
>>> Cleaning up database and repository...
>>> Populating the database...
> Elapsed time: 1636.7547591612674
> The database currently holds 200 nodes and the repo occupies 8.1G

> Now deleting unpacked nodes...
> Elapsed time: 0.10452547576278448
> The database currently holds 150 nodes and the repo occupies 8.1G
> Now cleaning deleted unpacked nodes...
>>> Elapsed time (transmit): 17.87814964680001
>>> Elapsed time (clean): 1.1770451520569623
>>> Elapsed time (vacuum): 1.1977200941182673
> The database currently holds 150 nodes and the repo occupies 5.8G
>>> Elapsed time (transmit): 1.6738419053144753
>>> Elapsed time (clean): 1.1854597837664187
>>> Elapsed time (vacuum): 1.1461118678562343
> The database currently holds 150 nodes and the repo occupies 5.8G

> Now packing the nodes...
> Elapsed time (pack): 37.37593990098685
> The database currently holds 150 nodes and the repo occupies 12G

> Now packing the nodes...
> Elapsed time (pack): 1.2292702598497272
> The database currently holds 150 nodes and the repo occupies 12G
> Now cleaning repacked nodes...
>>> Elapsed time (transmit): 1.7674363791011274
>>> Elapsed time (clean): 2.378125532064587
>>> Elapsed time (vacuum): 1.111530366819352
> The database currently holds 150 nodes and the repo occupies 5.7G
>>> Elapsed time (transmit): 1.69200792722404
>>> Elapsed time (clean): 0.9920809953473508
>>> Elapsed time (vacuum): 1.1101957960054278
> The database currently holds 150 nodes and the repo occupies 5.7G

> Now deleting packed nodes...
> Elapsed time: 0.09688124433159828
> The database currently holds 101 nodes and the repo occupies 5.7G
> Now cleaning deleted packed nodes...
>>> Elapsed time (transmit): 92.88936862908304
>>> Elapsed time (clean): 1.0394411101005971
>>> Elapsed time (vacuum): 1.0659767519682646
> The database currently holds 101 nodes and the repo occupies 5.8G
>>> Elapsed time (transmit): 1.5534060336649418
>>> Elapsed time (clean): 1.0056658750399947
>>> Elapsed time (vacuum): 1.0443645529448986
> The database currently holds 101 nodes and the repo occupies 5.8G

> Now repacking nodes...
> Elapsed time: 16.907301522791386
> The database currently holds 101 nodes and the repo occupies 3.7G

> Now repacking nodes...
> Elapsed time: 17.614450520370156
> The database currently holds 101 nodes and the repo occupies 3.7G
> Now cleaning deleted packed nodes...
>>> Elapsed time (transmit): 1.517629874870181
>>> Elapsed time (clean): 1.0210560159757733
>>> Elapsed time (vacuum): 1.0760509828105569
> The database currently holds 101 nodes and the repo occupies 3.7G
>>> Elapsed time (transmit): 1.5449293879792094
>>> Elapsed time (clean): 1.0809445828199387
>>> Elapsed time (vacuum): 1.1022412879392505
> The database currently holds 101 nodes and the repo occupies 3.7G

I would like to try this with a lot of small files (see the --small-files flag) to confirm the timings, but so far I haven't been able to run it to completion (several problems with the connection to my remote machine closing before the DB creation finishes; this one seems to take a lot more time). I would like to wait until I have this before making any generalized statements supporting or supplementing the implementation proposed in the OP, but I wanted to show the current state in case somebody has feedback or questions.

@codecov bot commented Sep 1, 2021

Codecov Report

Merging #4965 (f0aa923) into develop (29915bd) will increase coverage by 0.03%.
The diff coverage is 91.97%.


@@             Coverage Diff             @@
##           develop    #4965      +/-   ##
===========================================
+ Coverage    81.43%   81.46%   +0.03%     
===========================================
  Files          529      530       +1     
  Lines        37002    37113     +111     
===========================================
+ Hits         30128    30229     +101     
- Misses        6874     6884      +10     
Flag Coverage Δ
django 76.92% <91.97%> (+0.05%) ⬆️
sqlalchemy 75.92% <91.97%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/backends/general/migrations/utils.py 85.28% <50.00%> (-0.73%) ⬇️
aiida/repository/backend/sandbox.py 97.27% <50.00%> (-2.73%) ⬇️
...da/tools/archive/implementations/sqlite/backend.py 81.67% <50.00%> (-0.53%) ⬇️
aiida/cmdline/commands/cmd_storage.py 96.97% <92.86%> (-1.14%) ⬇️
aiida/backends/control.py 97.23% <97.23%> (ø)
aiida/repository/backend/disk_object_store.py 95.29% <97.73%> (+1.74%) ⬆️
aiida/backends/__init__.py 92.31% <100.00%> (+1.40%) ⬆️
aiida/repository/backend/abstract.py 98.56% <100.00%> (+0.09%) ⬆️
aiida/transports/plugins/local.py 81.41% <0.00%> (-0.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 29915bd...f0aa923.

@chrisjsewell (Member) left a comment
Heya, also need list_objects for the archive, so just checking we are on the same page

Review threads (resolved): aiida/backends/general/migrations/utils.py · aiida/repository/backend/abstract.py (×3) · aiida/repository/backend/disk_object_store.py · aiida/repository/backend/sandbox.py · tests/repository/backend/test_abstract.py
@chrisjsewell (Member):
Also need list_objects for the archive, so just checking we are on the same page

might end up extracting this in to a separate PR/commit

@ramirezfranciscof (Member, Author):
Outstanding Discussion: (writing...)

@ramirezfranciscof (Member, Author):
Applied the changes discussed earlier today @sphuber @chrisjsewell, so this is ready for review. Of the previously mentioned problems, I am still having this one:

Problem 3: the verdi tests are detecting a "potential speed problem". From what I understand, this is because they detect that the orm is getting imported somewhere it shouldn't in the cmdline, but (1) I have not added any global import in cmd_storage, (2) I have not modified any other cmdline module here, (3) I haven't modified anything in the modules that are globally imported in cmd_storage, and (4) I haven't added a global import orm or from orm anywhere. So I am a bit at a loss here; I'm not sure what else this error could mean.

  warnings.warn(f'Creating AiiDA configuration folder `{path}`.')
Critical: potential `verdi` speed problem: `aiida.orm` module is imported which is not in: ('aiida.backends', 'aiida.cmdline', 'aiida.common', 'aiida.manage', 'aiida.plugins', 'aiida.restapi')
Error: Process completed with exit code 1.
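Not a diagnosis of this particular failure, but when `aiida.orm` is needed only for type hints (as in the snippet quoted further down), a common way to keep it out of the runtime import graph is `typing.TYPE_CHECKING`; a sketch:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only seen by static type checkers; never imported at runtime,
    # so it cannot trip an import-speed check.
    from aiida.orm.implementation import Backend

def repository_maintain(
    full: bool = False,
    dry_run: bool = False,
    backend: Optional['Backend'] = None,  # quoted: valid without the import
    **kwargs,
) -> dict:
    """Stub body for illustration only."""
    return {}
```

The string annotation keeps the signature valid at runtime while the real import happens only during type checking.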

Any ideas?

@sphuber (Contributor) left a comment
Thanks @ramirezfranciscof . I think the interface is a lot better now. Just some comments on bits of the new implementation

Review threads: aiida/backends/control.py (×5) · aiida/repository/backend/disk_object_store.py (×3)
Comment on lines +189 to +190
files_numb = self.container.count_objects()['packed']
files_size = self.container.get_total_size()['total_size_packfiles_on_disk'] * BYTES_TO_MB
@sphuber (Contributor):
This information is not really adding anything specific to the maintenance operation is it? It just gives the current size, but that doesn't tell what it will be nor what will be saved. Only the latter would be really interesting IMO

@ramirezfranciscof (Member, Author):
Well, it can help give you an idea of how long it might take to do the repacking. But ok, I can take it out if you prefer.

Review thread: tests/cmdline/commands/test_storage.py
@ramirezfranciscof (Member, Author) commented Dec 9, 2021

Hey @sphuber, I think we might have counted differently, but this is ready for another review.

By the way, related to the error: the only place where I added something related to the orm is here, and I need it only for the typing:

from typing import Optional

from aiida.common.log import AIIDA_LOGGER
from aiida.orm.implementation import Backend

__all__ = ('MAINTAIN_LOGGER',)

MAINTAIN_LOGGER = AIIDA_LOGGER.getChild('maintain')

def repository_maintain(
    full: bool = False,
    dry_run: bool = False,
    backend: Optional[Backend] = None,
    **kwargs,
) -> dict:

@sphuber (Contributor) left a comment

Thanks @ramirezfranciscof . Looks good to be merged now; just spotted some docstrings you forgot to update after last changes. If you correct those, I will approve and merge.

Review threads: aiida/repository/backend/disk_object_store.py (×3)
@ramirezfranciscof (Member, Author):
All tests seem to be passing. I am now again having the problem I describe here, but I just committed ignoring the hooks. I would still like to know why this is happening, but I want this merged more.

@sphuber all good now? (Note that there is an outstanding comment in your previous review where I was waiting for confirmation on whether my reply convinced you or you still want the info taken out.)

@sphuber (Contributor) left a comment

Thanks @ramirezfranciscof. The ignore statement for mypy because of the different signatures is not a big deal; it is the same as the one for pylint. Let's keep that for now and merge this.
