Converter block storage #1508
Conversation
Force-pushed from a400378 to 3122d9c
Based on the discussion during the ASDF tag-up I replaced … Thanks @eslavich and @perrygreenfield
Force-pushed from 08d4e8c to 0729620
Apologies in advance for what will probably be a novel. TL;DR: the asdf_zarr prototype is now up to date with this PR and is ready to be moved to asdf-format. Also, a follow-up PR might allow us to remove the need for reserved blocks.

I've spent a few days trying to sort out a reasonable set of functions to expose the storage options to converters. This would include changes to ….

First, the way I was using the settings previously in the asdf_zarr prototype is likely not something we should aim to support. I was looking at the array storage to determine whether the zarr chunk store should be included as internal asdf blocks (if the array storage was 'internal') or left as an external store (like a DirectoryStore, if the array storage was 'external'). This conflates options for the Converter with options for the blocks. I rewrote asdf_zarr to instead require a conversion of zarr arrays with a to_internal function (without copying data until the asdf file is written). There might be a wider discussion of whether we want to provide a means of passing options to converters, but I think that this should not be required for chunked array support.

Second, the block management is complicated (as mentioned above) and …:

```python
af = asdf.open(fn, mode="rw")  # open a file with a custom object 'a' at id(obj) == 42
del af['a']  # this does not delete the blocks that would be associated with 'a'
b = Foo()  # make a new object; let's assume python puts it at id(b) == 42
af['b'] = b  # now the blocks for 'a' will be incorrectly assigned to 'b'
```

The alternative strategy would be to have asdf internally manage keys for each block that hold weakrefs to objects (here is a partial example from some of my exploration). This would allow objects to be garbage collected when they are removed from the tree and be resilient to memory address reuse (since 'matching' the key would involve resolving the weakref, which would return None for the missing object). This has the benefit of not requiring that developers of converters generate their own BlockKey instances, and since Converter.from_yaml_tree returns the object that I think we would want to attach to the blocks, we might be able to have asdf check which blocks were accessed during from_yaml_tree, take the returned object, and create and assign block keys outside the converter. This should mean that …
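To make the weakref idea concrete, here is a minimal sketch of such a key (the class name and methods are illustrative, not the actual asdf implementation): a key resolves its weakref on every match, so an object that has been garbage collected can never be confused with a new object that happens to reuse the same memory address.

```python
import weakref

# Hypothetical sketch (not the asdf API): a block key that holds a weakref
# to the object it belongs to, so a freed object's blocks can never be
# matched to a new object that reuses the same id().
class WeakBlockKey:
    _counter = 0  # unique identity, independent of id()

    def __init__(self, obj):
        self._ref = weakref.ref(obj)
        WeakBlockKey._counter += 1
        self._id = WeakBlockKey._counter

    def is_alive(self):
        # resolving the weakref returns None once the object is collected
        return self._ref() is not None

    def matches(self, obj):
        target = self._ref()
        return target is not None and target is obj
```

A stale key simply stops matching anything once its object is collected, which is the resilience to address reuse described above.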
Force-pushed from e36a4d0 to 906072f
asdf/block.py (outdated)

```python
sctx = ctx._create_serialization_context()
tag = converter.select_tag(node, sctx)
for key in converter.reserve_blocks(node, tag, sctx):
```
Neither the unit test converters nor the asdf_zarr converter use the serialization context or tag; maybe we ought to skip those? I'm not sure it's safe to hand over the serialization context anyway, since _find_used_blocks is called on the original AsdfFile's block manager, and we don't want to modify that one.
The original thought was that different tag versions might require different block usage. I guess one option might be to pass in the tag but not the serialization context, but this felt at odds with the other parts of the Converter API. The serialization context would still need to be provided to select_tag, so it's possible a converter could mess with blocks at that point as well.
One solution to this might be a more aware serialization context: something that knows if this is a validation, a write, a read, etc. and only allows modifications that are consistent with that operation (e.g. on validate, don't modify, allocate, or free blocks). I made a brief attempt at something like this but ran into issues with the block manager (it's difficult to interact with it without producing a change in its state).
So some options to address this might be:
- drop the serialization context (and possibly tag) argument to reserve_blocks
- leave in the argument(s) and add tests to cover their usage

As there is no current use of the serialization context in reserve_blocks, I'm not sure what test should be written to cover its usage. Are there examples of where this argument is used in other converters?
I'm good with either option at this point. I'm also not seeing SerializationContext listed in our user or developer API documentation, but attempting to add a fix for that to this PR seems like scope creep.
I vote to drop it :)
Dropped the ctx arg here: 93c2eac
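For reference, a minimal sketch of what a converter's reserve_blocks could look like after dropping the serialization context argument (the class, tag, and key scheme are illustrative, not the exact asdf API):

```python
# Illustrative sketch only: the converter class and tag are hypothetical,
# and the key scheme (one key per block, here based on id()) follows the
# discussion above rather than the final asdf implementation.
class BlockBackedConverter:
    tags = ["asdf://example.com/tags/block-backed-1.0.0"]
    types = ["example.BlockBacked"]

    def reserve_blocks(self, obj, tag):
        # with the serialization context dropped, only the object and its
        # tag are needed to enumerate the block keys this object will use
        return [id(obj)]
```

The tag argument is kept here since, per the discussion above, different tag versions might require different block usage.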
Force-pushed from e2a32b0 to 84f797a
asdf/asdf.py (outdated)

```python
for k in modified_keys:
    if k in self._tree:
        del self._tree[k]
self._tree.update(pre_write_tree)
```
It would be safer, I think, to shallow copy the tree before assigning it to naf._tree, and then reassign modified_keys to deep copies of themselves in that shallow-copied tree. That way the original tree stays unmodified even if there's an error and this code doesn't execute.
Great idea! Added that here: 0c44742
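The suggested copy strategy can be sketched as follows (the function name is illustrative, not the asdf code):

```python
import copy

# Illustrative sketch of the suggested strategy: shallow copy the tree so
# the original mapping is never mutated, then deep copy only the values
# that will be modified during the write.
def copy_tree_for_write(tree, modified_keys):
    new_tree = copy.copy(tree)  # original tree object left untouched
    for key in modified_keys:
        if key in new_tree:
            new_tree[key] = copy.deepcopy(new_tree[key])
    return new_tree
```

Unmodified values stay shared (cheap), modified values are isolated (safe), and the original tree survives even if the write errors out partway through.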
```python
def __hash__(self):
    return self._key

def __eq__(self, other):
```
We're supposed to check the type of "other" and return NotImplemented if we don't know how to compare it, see https://docs.python.org/3/reference/datamodel.html#object.__eq__
Thanks! Added that here: 11159ad
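For illustration, a minimal key class following that contract (a sketch, not the actual asdf implementation):

```python
# Minimal sketch of the __eq__/__hash__ contract: return NotImplemented
# for types we do not know how to compare, so Python can try the other
# operand's __eq__ and, failing that, fall back to identity comparison.
class Key:
    def __init__(self, key):
        self._key = key

    def __hash__(self):
        return self._key

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        return self._key == other._key
```

Returning NotImplemented (rather than False) matters when the other operand knows how to compare itself against a Key: Python will give its reflected __eq__ a chance instead of silently reporting inequality.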
```diff
@@ -76,7 +73,7 @@ def to_tree(cls, node, ctx):
     array = np.array(words, dtype=np.uint32)
     if node._storage == "internal":
-        cls._value_cache[ctx][abs_value] = array
+        cls._value_cache[abs_value] = array
```
I can drop it in the replacement PR for #1475 too
Force-pushed from 72513f4 to 0c44742
Force-pushed from 0c44742 to 6cac202
I don't know what roman_datamodels is complaining about, I think this looks great
Force-pushed from 6cac202 to b917c43
¯\_(ツ)_/¯ I think the roman failure was related to: … which means it should go away with this most recent rebase. I expect the devdeps jobs to fail because of: …
@CagtayFabry Thanks for taking a look! I'm not sure if this PR solves #895, as the data callbacks for writing ASDF blocks are expected to return ndarrays (of dtype 'uint8'). However, #1013 isn't addressed with this PR. There are some issues with the block management that make addressing this difficult (mainly that block settings, data, and state are all stored in the same objects). I've been working on updating the block management in ASDF to allow us to move ndarray to a converter (and drop the legacy type system internally). I think that #1013 should be addressable with the rewrite, as it more cleanly separates block options (like storage settings) from other aspects of block management. In addition to the internal changes, the …
I see, thank you for explaining @braingram |
I'm not quite happy with these and think a more flexible (supporting more than storage/compression that matches arrays) and simpler interface might be possible.
remove unnecessary uri assignment and note about uri
and deep copy key/values that will be modified
Force-pushed from b917c43 to 2c9b8e1
Newly failing weldx CI appears related to a new version of pint: …
This PR adds block storage support to Converters.
Relevant docs are here: https://asdf--1508.org.readthedocs.build/en/1508/asdf/extending/converters.html#block-storage
The key contribution of this PR is allowing Converters to read/write blocks during from_yaml_tree/to_yaml_tree. This is accomplished by (privately) providing the BlockManager to the SerializationContext (which is passed as an argument to Converter methods) and introducing a few new public methods:

- SerializationContext.load_block (EDIT: removed in 92a786a)
- …

The docs update describes the public API, and several tests were added to cover these new methods and features (this is also mostly compatible with the asdf_zarr chunking prototype with one exception, the storage settings, described below).
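To illustrate the shape of the new API, here is a hedged sketch of a converter that stores its payload in an ASDF block via a data callback. The Foo type is hypothetical, and the context method names (find_available_block_index, get_block_data_callback) are taken from the linked docs as I understand them, so treat this as a sketch rather than the exact final API:

```python
import numpy as np

class Foo:
    """Hypothetical user type whose payload lives in an ASDF block."""
    def __init__(self, payload):
        self.payload = payload

# Sketch of a block-backed converter: to_yaml_tree registers a callback
# that returns the block payload as a uint8 ndarray (as required for
# block data callbacks), and from_yaml_tree retrieves a callback to
# lazily read the block back when the data is actually needed.
class FooConverter:
    tags = ["asdf://example.com/tags/foo-1.0.0"]
    types = [Foo]

    def to_yaml_tree(self, obj, tag, ctx):
        index = ctx.find_available_block_index(
            lambda: np.asarray(obj.payload, dtype="uint8")
        )
        return {"block_index": index}

    def from_yaml_tree(self, node, tag, ctx):
        callback = ctx.get_block_data_callback(node["block_index"])
        return Foo(callback())
```

Because the converter hands over a callback instead of the data itself, reading the payload can be deferred until the block is actually written or accessed.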
There was one major change required to get this to work. With this PR, when AsdfFile.write_to is called, a new AsdfFile object is created to be used during the write. This allows the write to modify blocks (in this case overwriting data callbacks used to defer reading data until a block is ready to be written) without affecting blocks created during initial reads of a file. This change required updates to many tests which expected the blocks to change after write_to.

A few minor changes include:
- … (test_write_to_before_update)
- … the ERA001 rule that is supposed to remove commented code; this failed several times, removing some but not all of the commented code, so I just disabled it

There are a few things missing from or related to this PR that should be discussed and likely included as follow-up PRs.
I will open issues for the above if reviews do not lead to related changes or fixes.