-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use None
for unknown lengths (1 of 2)
#2168
Conversation
length = unset | ||
for x in inputs: | ||
if isinstance(x, Content): | ||
if length is None: | ||
if length is unset: | ||
length = x.length | ||
elif backend.nplike.known_shape: | ||
assert length == x.length | ||
assert length is not None | ||
assert length is not unset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm uneasy about how we handle unknown lengths here. There's a lot to consider in broadcasting, and I've not spent a lot of time in my mind on it. I'm hopeful that this PR mostly just renames UnknownLength
, and therefore we can review this logic down the road.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad that this is relative to the state in which UnknownLength
still exists, so that it can be a renaming of that token, rather than UnknownLength
→ UnknownScalar
→ None
.
So far it looks good, and we discussed it on Slack. I see there are a few test errors still, but you have my approval.
It's good to see more typing, too.
@jpivarski note that this PR doesn't yet address usage of |
Codecov Report
Additional details and impacted files
|
fbe7108
to
7fa2d99
Compare
_numnull = ak.index.Index64.empty(1, nplike=self._backend.index_nplike) | ||
|
||
assert ( | ||
numnull.nplike is self._backend.index_nplike | ||
_numnull.nplike is self._backend.index_nplike | ||
and self._mask.nplike is self._backend.index_nplike | ||
) | ||
self._handle_error( | ||
self._backend[ | ||
"awkward_ByteMaskedArray_numnull", | ||
numnull.dtype.type, | ||
_numnull.dtype.type, | ||
self._mask.dtype.type, | ||
]( | ||
numnull.data, | ||
_numnull.data, | ||
self._mask.data, | ||
self._mask.length, | ||
self._valid_when, | ||
) | ||
) | ||
numnull = self._backend.index_nplike.as_shape_item(_numnull[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new "pattern". Any kernels that are used to determine a scalar (invariably a length) are prefixed with _
, before conversion to a shape item with as_shape_item
. This ensures we have only integers or None
.
With this PR, the distinction between indices (data) and lengths (metadata) is much stronger. Aligning closer with @jpivarski's original separation of length metadata from "data". They are now explicit groups, ( Merging this PR will raise questions about |
@jpivarski what do you think about the future-breaking change to
|
None
for unknown lengthsNone
for unknown lengths (1 of 2)
We have a policy of not making backward-breaking changes without a deprecation period because it disrupts users and makes a tangle of which versions of interrelated libraries can work together. It is possible to be 100% strict about it, since we have defined what "public" means in a mechanically verifiable way (the underscores), but I'd rather be pragmatic. This change would be hard to put through a deprecation cycle (waiting two minor versions before this PR can be merged, and all of the other work that depends on it), and the case that it affects is obscure: developers using typetracers (that's probably only Coffea and dask-awkward; it's a new interface) who explicitly pass Even if we don't actually see a path to breakage through an interface change, we should still be careful because it could break in a way that we don't foresee. But this case is so obscure that it's a risk I think we can take. Here's a way that an unforseen issue could come in from something like this: if dask-awkward were mirroring the We should call this kind of situation "reasonable risk" and not let it slide to such an extent that we're doing it all the time. In other words, we can only cheat if we think we can get away with it. |
7bb2c00
to
f470e61
Compare
f470e61
to
02f62d3
Compare
TLDR
None
the token for unknown lengths. OnlyTypeTracer
actually produces a shape withNones
.UnknownLength = UnknownScalar(np.int64)
aliaslength=None
for unknowns means thatRecordArray(..., length=None)
needs to change behavior.1nplike
to understand how to manipulate known and unknown lengthsNumpyLike.known_shape
now has concrete meaning: "does this NumpyLike support unknown dimension sizes?"DR
This adds some verbosity to our length handling code in the
Content
methods, but this comes at a benefit in terms of explicitness. I chose to make shape-item-handling an nplike detail, rather than a a cross-nplike standard. This means that nplikeX
should not work with shapes from nplikeY
. Instead, the nplike conversion routines should handle this.The previous approach set
UnknownLength = UnknownScalar(np.int64)
. Benefits includeContent.length
The downsides of this are:
None
for unknown dimensionsThis PR uses
UnknownLength = None
, making unknown sizes more obvious, and further increasing the "safety" aspect by using a non-numeric type.None
is easily serialised, and can be included in forms without serialisation.The use of
None
means that external data needs to be treated carefully if it is used to compute lengths.Footnotes
This PR requires that
length
is unset in order to ignore it. We could useif length is None and backend.nplike.known_shape
to warn users of this existing usage (as you can't haveNone
lengths for known-shape nplikes) ↩I can't actually think of a need to create a new nplike that has one but not the other, but I digress. ↩