Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce lazy_tree (super dictionaries) #1733

Merged
merged 45 commits into from
Jul 12, 2024

Conversation

braingram
Copy link
Contributor

@braingram braingram commented Jan 12, 2024

Description

This PR adds:

  • lazy_tree option to AsdfConfig
  • lazy_tree argument to asdf.open (defaults to AsdfConfig.lazy_tree)
  • Converter.lazy attribute used to indicate if a converter supports "lazy" objects
  • asdf.lazy_nodes "lazy" container classes for list, dict, ordered dict

By default the "lazy" option is False.

When lazy_tree is True and an ASDF file is opened the tagged nodes in the tree are not immediately converted to custom objects. Instead, the containers in the tree (dicts, lists, OrderedDicts) are replaced with AsdfNode subclasses that act like these containers and convert tagged values to custom objects when they are accessed (See #1705 for discussion of this feature). During conversion, if asdf encounters a Converter that either defines lazy=False or does not define lazy the remainder of the branch will be converted to non-"lazy" objects and passed to the Converter. If instead the Converter defines lazy=True the "lazy" object (ie a AsdfDictNode for a dict) will be passed to the Converter.

Checklist:

  • pre-commit checks ran successfully
  • tests ran successfully
  • for a public change, a changelog entry was added
  • for a public change, documentation was updated
  • for any new features, unit tests were added

@braingram braingram marked this pull request as ready for review January 17, 2024 16:31
@braingram braingram requested a review from a team as a code owner January 17, 2024 16:31
@braingram braingram added this to the 3.1.0 milestone Jan 17, 2024
asdf/_tests/conftest.py Show resolved Hide resolved
asdf/_tests/core/_converters/test_complex.py Show resolved Hide resolved
asdf/_tests/test_lazy_nodes.py Show resolved Hide resolved
asdf/exceptions.py Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
@braingram
Copy link
Contributor Author

converted to draft until #1733 (comment) is addressed

@braingram braingram force-pushed the feature/lazy_tree branch 2 times, most recently from b14e530 to e359dde Compare May 13, 2024 15:58
@braingram
Copy link
Contributor Author

The following branch of roman_datamodels adds lazy to the node converters (and makes some minor node changes to account for AsdfDictNode not passing isinstance(..., dict) etc) to allow lazy loading of roman trees:
https://github.com/spacetelescope/roman_datamodels/compare/main...braingram:roman_datamodels:lazy?expand=1

@braingram braingram marked this pull request as ready for review May 14, 2024 16:39
@braingram braingram requested a review from eslavich May 14, 2024 16:39
@braingram
Copy link
Contributor Author

braingram commented Jul 1, 2024

JWST regtests: https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-Developers-Pull-Requests/1571/
passed with 2 unrelated (and common random) failures

romancal regtests: https://github.com/spacetelescope/RegressionTests/actions/runs/9746576095
(I ran the romancal tests with photutils==1.12.0 since 1.13.0 is currently breaking main: spacetelescope/romancal#1291)
ran with no failures

@braingram
Copy link
Contributor Author

@nden @perrygreenfield the regtests all pass with this PR (except for the 2 jwst tests that randomly and frequently fail).

@nden
Copy link
Contributor

nden commented Jul 3, 2024

If I understand how this works from the description, once I request a quantity array, all quantity arrays are loaded into memory. Is this correct?

@nden
Copy link
Contributor

nden commented Jul 3, 2024

No, there's something else, not sure what. The above comment is true for quantity arrays. For numpy arrays, it works as expected. Loading one array does not load any other arrays.

@braingram
Copy link
Contributor Author

Thanks for giving it a try. What file did you use for testing? If it's a roman file things will behave differently if you're using roman_datamodels main vs the "lazy" branch linked above. I think this points to this feature (and PR) needing more documentation.

Here's a non-roman example (please let me know if you give it a try and find anything different from the example) it doesn't require any special versions of anything (except for using asdf from the source branch for this PR).

import asdf
import numpy as np
import astropy.units as u

# make 5 quantiy arrays
qs = [u.Quantity(np.zeros(3+i) + i, u.m) for i in range(5)]

# save them to an ASDF file
af = asdf.AsdfFile()
af["qs"] = qs
af.write_to("test.asdf")

# open the file with a "lazy_tree"
with asdf.open("test.asdf", lazy_tree=True) as af:
    # When opened asdf always reads the first and last block
    # (this is true for lazy and non-lazy trees). Since we
    # are using a 'lazy_tree' only these blocks will be loaded
    # and since these are lazy blocks just the headers will be read.

    print("before accessing quantities")
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

    # Since we're using a 'lazy_tree' the 'qs' 'list' will be
    # a special AsdfListNode object
    print(f"'qs' type = {type(af['qs'])}")

    # Accessing the first quantity will convert the tagged
    # representation to a quantity
    print(f"qs[0] = {af['qs'][0]=}")
    # but no other blocks will be loaded
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

    # Accessing the second quantity will cause a block to load
    print(f"qs[1] = {af['qs'][1]=}")
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

When I run the example I get the following output:

before accessing quantities
Loaded blocks: [True, False, False, False, True]
'qs' type = <class 'asdf.lazy_nodes.AsdfListNode'>
qs[0] = af['qs'][0]=<Quantity [0., 0., 0.] m>
Loaded blocks: [True, False, False, False, True]
qs[1] = af['qs'][1]=<Quantity [1., 1., 1., 1.] m>
Loaded blocks: [True, True, False, False, True]

For the above example the "containers" (list-like AsdfListNode and dict-like AsdfDictNode) objects in the tree are made "lazy" (since lazy_tree=True) and the contained objects only deserialized when they are accessed. For the above example the index 1 item in the "qs" "list" isn't converted to a quantity until it's accessed with qs[1]. At that time asdf turns the tagged representation for index 1 into a quantity (which triggers loading the index 1 block). For the non-accessed items in the "list" (like qs[2]) they're never converted to a quantity in the above example (so for qs[2] the index 2 block is never loaded).

Roman files are a bit different because they use STNode subclasses for containers. If we create a fake "roman" file:

im = roman_datamodels.maker_utils.mk_level2_image()
af = asdf.AsdfFile()
af['roman'] = im
af.write_to("roman.asdf")

If we load it with asdf.open we'll see the following loaded blocks

>> af = asdf.open("roman.asdf", lazy_tree=True)
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'

This is only because we haven't accessed the "roman" key from the lazy AsdfDictNode. Accessing "roman" (with the main branch of roman_datamodels) results in many blocks being loaded (every block that maps to a quantity):

>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'111100001101111'

This is because the converter that deserializes roman_datamodels.stnode.WfiImage doesn't set lazy=True (by default asdf assumes nothing in extensions is lazy). Since the converter for this object isn't lazy asdf will convert everything within the sub-tree that it hands to the converter before calling the converter (so the converter never sees a lazy node, matching the current asdf behavior). Accessing af["roman"] triggers asdf to convert everything within the WfiImage sub-tree.

If instead, we use the modified version of roman_datamodels (which sets lazy=True for the converter handling WfiImage) things are much "lazier".

>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'

Here accessing just the top level "roman" doesn't trigger asdf to load everything within that sub-tree (since the converter has lazy=True) but if we access the data 1 block will be loaded (the exact one may differ depending on the order of the blocks).

>> af["roman"]["data"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000100001'

Copy link
Contributor

@perrygreenfield perrygreenfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but it should note in the documentation that currently use of the info method on a lazy true forces loading of everything, for now anyway.

@braingram
Copy link
Contributor Author

Thanks!

I updated the lazy_tree documentation in:
5843ada

Does the updated description sound good? (emphasis added to the new part below, see the commit for the full text and context).

lazy_tree : bool, optional
When True the ASDF tree will not be converted to custom objects
when the file is loaded. Instead, objects will be "lazily" converted
only when they are accessed. Note that the tree will not contain dict
and list instances for containers and instead return instances of classes
defined in asdf.lazy_nodes. Since objects are converted when they
are accessed, traversing the tree (like is done during AsdfFile.info
and AsdfFile.search) will result in nodes being converted.

Copy link
Contributor

@perrygreenfield perrygreenfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation update LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants