Converter block storage #1508
Conversation
Force-pushed from a400378 to 3122d9c
Based on the discussion during the ASDF tag-up I replaced … Thanks @eslavich and @perrygreenfield
Force-pushed from 08d4e8c to 0729620
Apologies in advance for what will probably be a novel. TL;DR: the asdf_zarr prototype is now up to date with this PR and is ready to be moved to asdf-format. Also, a follow-up PR might allow us to remove the need for reserved blocks.

I've spent a few days trying to sort out a reasonable set of functions to expose the storage options to converters. This would include changes to ….

First, the way I was using the settings previously in the asdf_zarr prototype is likely not something we should aim to support. I was looking at the array storage to determine whether the zarr chunk store should be included as internal asdf blocks (if the array storage was 'internal') or left as an external store (like a DirectoryStore, if the array storage was 'external'). This conflates options for the Converter with options for the blocks. I rewrote asdf_zarr to instead require a conversion of zarr arrays with a to_internal function (without copying data until the asdf file is written). There might be a wider discussion of whether we want to provide a means of passing options to converters, but I think that this should not be required for chunked array support.

Second, the block management is complicated (as mentioned above) and …:

```python
af = asdf.open(fn, mode="rw")  # open a file with a custom object 'a' at id(obj) == 42
del af['a']  # this does not delete the blocks that would be associated with 'a'
b = Foo()  # make a new object; let's assume python puts it at id(b) == 42
af['b'] = b  # now the blocks for 'a' will be incorrectly assigned to 'b'
```

The alternative strategy would be to have asdf internally manage keys for each block that hold weakrefs to objects (here is a partial example from some of my exploration). This would allow objects to be garbage collected when they are removed from the tree and be resilient to memory address reuse (since 'matching' the key would involve resolving the weakref, which would return None for the missing object). This has the benefit of not requiring that developers of converters generate their own BlockKey instances, and since Converter.from_yaml_tree returns the object that I think we would want to attach to the blocks, we might be able to have asdf check which blocks were accessed during from_yaml_tree, take the returned object, and create and assign block keys outside the converter. This should mean that …
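To make the weakref idea concrete, here is a minimal sketch of such a key (the class name and methods are illustrative, not the actual asdf implementation): a key resolves its weakref on every match, so an object that has been garbage collected can never be confused with a new object that happens to reuse the same memory address.

```python
import weakref

# Hypothetical sketch (not the asdf API): a block key that holds a weakref
# to the object it belongs to, so a freed object's blocks can never be
# matched to a new object that reuses the same id().
class WeakBlockKey:
    _counter = 0  # unique identity, independent of id()

    def __init__(self, obj):
        self._ref = weakref.ref(obj)
        WeakBlockKey._counter += 1
        self._id = WeakBlockKey._counter

    def is_alive(self):
        # resolving the weakref returns None once the object is collected
        return self._ref() is not None

    def matches(self, obj):
        target = self._ref()
        return target is not None and target is obj
```

A stale key simply stops matching anything once its object is collected, which is the resilience to address reuse described above.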
Force-pushed from e36a4d0 to 906072f
asdf/block.py (outdated)

```python
sctx = ctx._create_serialization_context()
tag = converter.select_tag(node, sctx)
for key in converter.reserve_blocks(node, tag, sctx):
```
Neither the unit test converters nor the asdf_zarr converter use the serialization context or tag; maybe we ought to skip those? I'm not sure it's safe to hand over the serialization context anyway, since _find_used_blocks is called on the original AsdfFile's block manager, and we don't want to modify that one.
The original thought was that different tag versions might require different block usage. I guess one option might be to pass in the tag but not the serialization context, but this felt at odds with the other parts of the Converter API. The serialization context would still need to be provided to select_tag, so it's possible a converter could mess with blocks at that point as well.
One solution to this might be a more aware serialization context: something that knows if this is a validation, a write, a read, etc. and only allows modifications that are consistent with that operation (e.g. on validate, don't modify, allocate, or free blocks). I made a brief attempt at something like this but ran into issues with the block manager (it's difficult to interact with it without producing a change in its state).
So some options to address this might be:
- drop the serialization context (and possibly tag) argument to reserve_blocks
- leave in the argument(s) and add tests to cover their usage

As there is no current use of the serialization context in reserve_blocks, I'm not sure what test should be written to cover its usage. Are there examples of where this argument is used in other converters?
I'm good with either option at this point. I'm also not seeing SerializationContext listed in our user or developer API documentation, but attempting to add a fix for that to this PR seems like scope creep.
I vote to drop it :)
Dropped the ctx arg here: 93c2eac
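For reference, a minimal sketch of what a converter's reserve_blocks could look like after dropping the serialization context argument (the class, tag, and key scheme are illustrative, not the exact asdf API):

```python
# Illustrative sketch only: the converter class and tag are hypothetical,
# and the key scheme (one key per block, here based on id()) follows the
# discussion above rather than the final asdf implementation.
class BlockBackedConverter:
    tags = ["asdf://example.com/tags/block-backed-1.0.0"]
    types = ["example.BlockBacked"]

    def reserve_blocks(self, obj, tag):
        # with the serialization context dropped, only the object and its
        # tag are needed to enumerate the block keys this object will use
        return [id(obj)]
```

The tag argument is kept here since, per the discussion above, different tag versions might require different block usage.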
Force-pushed from e2a32b0 to 84f797a
asdf/asdf.py (outdated)

```python
for k in modified_keys:
    if k in self._tree:
        del self._tree[k]
self._tree.update(pre_write_tree)
```
It would be safer, I think, to shallow copy the tree before assigning it to naf._tree, and then reassign modified_keys to deep copies of themselves in that shallow-copied tree. That way the original tree stays unmodified even if there's an error and this code doesn't execute.
Great idea! Added that here: 0c44742
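The suggested copy strategy can be sketched as follows (the function name is illustrative, not the asdf code):

```python
import copy

# Illustrative sketch of the suggested strategy: shallow copy the tree so
# the original mapping is never mutated, then deep copy only the values
# that will be modified during the write.
def copy_tree_for_write(tree, modified_keys):
    new_tree = copy.copy(tree)  # original tree object left untouched
    for key in modified_keys:
        if key in new_tree:
            new_tree[key] = copy.deepcopy(new_tree[key])
    return new_tree
```

Unmodified values stay shared (cheap), modified values are isolated (safe), and the original tree survives even if the write errors out partway through.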
```python
def __hash__(self):
    return self._key

def __eq__(self, other):
```
We're supposed to check the type of "other" and return NotImplemented if we don't know how to compare it, see https://docs.python.org/3/reference/datamodel.html#object.__eq__
Thanks! Added that here: 11159ad
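For illustration, a minimal key class following that contract (a sketch, not the actual asdf implementation):

```python
# Minimal sketch of the __eq__/__hash__ contract: return NotImplemented
# for types we do not know how to compare, so Python can try the other
# operand's __eq__ and, failing that, fall back to identity comparison.
class Key:
    def __init__(self, key):
        self._key = key

    def __hash__(self):
        return self._key

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        return self._key == other._key
```

Returning NotImplemented (rather than False) matters when the other operand knows how to compare itself against a Key: Python will give its reflected __eq__ a chance instead of silently reporting inequality.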
```diff
@@ -76,7 +73,7 @@ def to_tree(cls, node, ctx):
     array = np.array(words, dtype=np.uint32)
     if node._storage == "internal":
-        cls._value_cache[ctx][abs_value] = array
+        cls._value_cache[abs_value] = array
```
I can drop it in the replacement PR for #1475 too
Force-pushed from 72513f4 to 0c44742
Force-pushed from 0c44742 to 6cac202
I don't know what roman_datamodels is complaining about, I think this looks great
Force-pushed from 6cac202 to b917c43
¯\_(ツ)_/¯ I think the roman failure was related to: … which means it should go away with this most recent rebase. I expect the devdeps jobs to fail because of: …
@CagtayFabry Thanks for taking a look! I'm not sure if this PR solves #895, as the data callbacks for writing ASDF blocks are expected to return ndarrays (of dtype 'uint8'). However, #1013 isn't addressed with this PR. There are some issues with the block management that make addressing this difficult (mainly that block settings, data, and state are all stored in the same objects). I've been working on updating the block management in ASDF to allow us to move ndarray to a converter (and drop the legacy type system internally). I think that #1013 should be addressable with the rewrite, as it more cleanly separates block options (like storage settings) from other aspects of block management. In addition to the internal changes, the …
I see, thank you for explaining @braingram |
I'm not quite happy with these and think a more flexible (supporting more than storage/compression that matches arrays) and simpler interface might be possible.
remove unnecessary uri assignment and note about uri
and deep copy key/values that will be modified
Force-pushed from b917c43 to 2c9b8e1
Newly failing weldx CI appears related to a new version of pint: …
This PR adds block storage support to Converters.
Relevant docs are here: https://asdf--1508.org.readthedocs.build/en/1508/asdf/extending/converters.html#block-storage
The key contribution of this PR is allowing Converters to read/write blocks during from_yaml_tree/to_yaml_tree. This is accomplished by (privately) providing the BlockManager to the SerializationContext (which is passed as an argument to Converter methods) and introducing a few new public methods:

- SerializationContext.load_block (EDIT: removed in 92a786a)
- …

The docs update describes the public API, and several tests were added to cover these new methods and features (this is also mostly compatible with the asdf_zarr chunking prototype with one exception, the storage settings, described below).
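To illustrate the shape of the new API, here is a hedged sketch of a converter that stores its payload in an ASDF block via a data callback. The Foo type is hypothetical, and the context method names (find_available_block_index, get_block_data_callback) are taken from the linked docs as I understand them, so treat this as a sketch rather than the exact final API:

```python
import numpy as np

class Foo:
    """Hypothetical user type whose payload lives in an ASDF block."""
    def __init__(self, payload):
        self.payload = payload

# Sketch of a block-backed converter: to_yaml_tree registers a callback
# that returns the block payload as a uint8 ndarray (as required for
# block data callbacks), and from_yaml_tree retrieves a callback to
# lazily read the block back when the data is actually needed.
class FooConverter:
    tags = ["asdf://example.com/tags/foo-1.0.0"]
    types = [Foo]

    def to_yaml_tree(self, obj, tag, ctx):
        index = ctx.find_available_block_index(
            lambda: np.asarray(obj.payload, dtype="uint8")
        )
        return {"block_index": index}

    def from_yaml_tree(self, node, tag, ctx):
        callback = ctx.get_block_data_callback(node["block_index"])
        return Foo(callback())
```

Because the converter hands over a callback instead of the data itself, reading the payload can be deferred until the block is actually written or accessed.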
There was one major change required to get this to work. With this PR, when AsdfFile.write_to is called, a new AsdfFile object is created to be used during the write. This allows the write to modify blocks (in this case overwriting data callbacks used to defer reading data until a block is ready to be written) without affecting blocks created during initial reads of a file. This change required updates to many tests which expected the blocks to change after write_to.

A few minor changes include:
- … (test_write_to_before_update)
- … the ERA001 rule that is supposed to remove commented code; this failed several times, removing some but not all of the commented code, so I just disabled it

There are a few things missing from or related to this PR that should be discussed and likely included as follow-up PRs.
I will open issues for the above if reviews do not lead to related changes or fixes.