-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: support options in ak.merge_union_of_records
#2236
Conversation
self, condition: ArrayLike, x1: ArrayLike, x2: ArrayLike | ||
) -> TypeTracerArray: | ||
condition, x1, x2 = self.broadcast_arrays(condition, x1, x2) | ||
result_dtype = numpy.result_type(x1, x2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: implement result_type
for all nplikes. This coincides with the planned introduction of the metadata
submodule
Codecov Report
Additional details and impacted files
|
outer_option_index[is_this_tag] = backend.index_nplike.where( | ||
is_non_null, outer_option_index[is_this_tag], -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something that might be better as a kernel (and perhaps we already have a kernel that is applicable here).
def invert_record_union( | ||
tags: ArrayLike, index: ArrayLike, contents, *, backend | ||
) -> ak.contents.RecordArray: | ||
index_nplike = backend.index_nplike | ||
# First, create an ordered list containing the union of all fields | ||
seen_fields = set() | ||
all_fields = [] | ||
for content in contents: | ||
# Find new fields | ||
for field in content.fields: | ||
if field not in seen_fields: | ||
seen_fields.add(field) | ||
all_fields.append(field) | ||
|
||
# Build unions for each field | ||
outer_field_contents = [] | ||
for field in all_fields: | ||
field_tags = index_nplike.asarray(tags, copy=True) | ||
field_index = index_nplike.asarray(index, copy=True) | ||
|
||
# Build contents for union representing current field | ||
field_contents = [c.content(field) for c in contents if c.has_field(field)] | ||
|
||
# Find the best location for option type. | ||
# We will potentially have fewer contents in this per-field union | ||
# than the original outer union-of-records, because some recordarrays | ||
# may not have the given field. | ||
tag_for_missing = 0 | ||
for i, content in enumerate(field_contents): | ||
if content.is_option: | ||
tag_for_missing = i | ||
break | ||
|
||
# If at least one recordarray doesn't have this field, we add | ||
# a special option | ||
if len(field_contents) < len(contents): | ||
# Make the tagged content an option, growing by one to ensure we | ||
# have a known `None` value to index into | ||
tagged_content = field_contents[tag_for_missing] | ||
indexedoption_index = backend.index_nplike.arange( | ||
tagged_content.length + 1, dtype=np.int64 | ||
) | ||
indexedoption_index[ | ||
index_nplike.shape_item_as_index(tagged_content.length) | ||
] = -1 | ||
field_contents[ | ||
tag_for_missing | ||
] = ak.contents.IndexedOptionArray.simplified( | ||
ak.index.Index64(indexedoption_index), tagged_content | ||
) | ||
|
||
# Index of None values in tagged content (content with extra None item at end) | ||
index_missing = index_nplike.shape_item_as_index( | ||
field_contents[tag_for_missing].length - 1 | ||
) | ||
elif layout.is_indexed and layout.content.is_record: | ||
record = layout.content | ||
# Transpose index-of-record to record-of-index | ||
return ak.contents.RecordArray( | ||
[ | ||
ak.contents.IndexedArray.simplified( | ||
layout.index, c, parameters=layout._parameters | ||
) | ||
for c in record.contents | ||
], | ||
record.fields, | ||
record.length, | ||
backend=backend, | ||
# Now build contents for union, by looping over outermost index | ||
# Overwrite tags to adjust for new contents length | ||
# and use the tagged content for any missing values | ||
k = 0 | ||
for j, content in enumerate(contents): | ||
tag_is_j = field_tags == j | ||
|
||
if content.has_field(field): | ||
# Rewrite tags to account for missing fields | ||
field_tags[tag_is_j] = k | ||
k += 1 | ||
|
||
else: | ||
# Rewrite tags to point to option content | ||
field_tags[tag_is_j] = tag_for_missing | ||
# Point each value to missing value | ||
field_index[tag_is_j] = index_missing | ||
|
||
outer_field_contents.append( | ||
ak.contents.UnionArray.simplified( | ||
ak.index.Index8(field_tags), | ||
ak.index.Index64(field_index), | ||
field_contents, | ||
) | ||
) | ||
else: | ||
raise ak._errors.wrap_error(TypeError(layout)) | ||
return ak.contents.RecordArray( | ||
outer_field_contents, all_fields, backend=backend | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has not really changed, I only eliminated the index inversion code (this is now handled separately in the new code)
@jpivarski I think this is ready for review. It passes some simple tests, and the logic seems intuitive, but it's obviously complex. I'd be keen to know if you recall any kernels I can use instead of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is complex, but the intentions are well documented in comments. I've read through them.
Regarding nplike.where
vs. a dedicated kernel: very few kernels are shared by more than one operation, and I wouldn't be able to tell offhand if there was one that could be applied to this case.
Since it is possible with standard NumPy operations, that's preferable, apart from the performance hit of doing multiple passes over the data.
In #2235 it was raised that
merge_union_of_records
should support records wrapped in options. This PR considers that case.