Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: revert #1249 #1335

Closed
wants to merge 2 commits into from
Closed

Fix: revert #1249 #1335

wants to merge 2 commits into from

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Mar 2, 2022

This reverts #1249 (and adds the proper impl for v2) which solved a problem that shouldn't happen with Awkward-derived arrays.

Awkward should never produce arrays with nested option types, or index types, so we should not need to handle such cases when visiting layouts. This is an oversight on my part, which this PR will fix.

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 2, 2022

Oh dear, this is still not right - we actually need the existing complex solution because of Union arrays, which can themselves contain options.

@agoose77 agoose77 closed this Mar 2, 2022
@jpivarski
Copy link
Member

I don't think it's necessary to revert #1249. It may be overly cautious, but it doesn't do anything beyond O(1) to do so. I'd just leave it be (since there's so much else that needs to be done...).

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 2, 2022

@jpivarski right, and also we need this to support IndexedOptionArray->UnionArray->IndexedOptionArray (a valid case for where we need to look at the content of an option type at the same time) 😄

@jpivarski
Copy link
Member

we need this to support IndexedOptionArray->UnionArray->IndexedOptionArray

That's definitely something I hadn't considered.

We don't necessarily need to support it in the sense of ensuring that every function can deal with it. We could, instead, expand the set of node-nestings that are considered invalid. The above could become one of the cases simplify_optiontype checks for and squashes.

Arrow, for instance, doesn't consider "option-type of union-type" to be valid, which was a surprise to me because Arrow puts option-type on absolutely every node in their system (missing data is a node attribute, rather than a separate node). The argument is that if you want Option(Union(X, Y, Z)), you can write Union(Option(X), Option(Y), Option(Z)), and that's something that ak.to_arrow does. As long as Indexes can be shared, the latter doesn't use any more memory. Also, this distinction is not visible: whereas it would make a difference for records:

{x: None, y: None, z: None}

is different from

None

it does not make a difference for unions:

None   # happens to be content 2, rather than content 0, but we don't know that

is the same as

None   # for the whole union, all contents together

We could define a rule that turns every

  • Union(X, Y, Option(Z)) (at least one of the UnionArray contents is option-type)
  • Option(Union(X, Y, Z))
  • Option(Union(X, Y, Option(Z)))

into a canonical form. The best optimum of memory and time (so, excluding BitMaskedArray) would be this:

ByteMaskedArray(UnionArray([X, Y, Z]))

where all of the UnionArray's contents are non-option-type. UnionArrays are already indirecting through an index, so if Option(Z) was an IndexedOptionArray, the Z could be left as-is and the index could be composed with the UnionArray index, sliced for tags == 2 (Z is contents item 2 in this example). The -1s of an IndexedOptionArray or the invalid matches of a Bit/ByteMaskedArray can be sliced through array manipulations to become the ByteMaskedArray that wraps the canonical form.

Usually, we do option-type with IndexedOptionArray because it's the most general; you don't have to create fake data for missing values in RecordArrays (as Arrow does). In this case, a ByteMaskedArray makes more sense because the UnionArray's index can already take up the slack that IndexedOptionArray would have been needed for. There's no sense in nesting index indirection within index indirection. (Because array-slicing is function composition.)

That would tend to make the simplify_optiontype more complex (or would the simplify_optiontype and simplify_uniontype have to be combined?), in such a way that it would make sense to replace the Form implementation with using the TypeTracer.

This could be a new project, if you're interested.

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 2, 2022

I'm going to move this to a new issue!

@agoose77 agoose77 deleted the fix-revert-1249 branch July 13, 2022 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants