Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support options in ak.merge_union_of_records #2236

Merged
merged 12 commits into from
Feb 14, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Feb 13, 2023

In #2235 it was raised that merge_union_of_records should support records wrapped in options. This PR considers that case.

  • Add tests

self, condition: ArrayLike, x1: ArrayLike, x2: ArrayLike
) -> TypeTracerArray:
condition, x1, x2 = self.broadcast_arrays(condition, x1, x2)
result_dtype = numpy.result_type(x1, x2)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: implement result_type for all nplikes. This coincides with the planned introduction of the metadata submodule

@codecov
Copy link

codecov bot commented Feb 13, 2023

Codecov Report

Merging #2236 (f384f76) into main (7d10007) will increase coverage by 0.02%.
The diff coverage is 83.10%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_nplikes/typetracer.py 74.62% <27.27%> (-0.30%) ⬇️
src/awkward/_nplikes/numpylike.py 73.73% <66.66%> (-1.09%) ⬇️
src/awkward/typing.py 83.33% <66.66%> (-5.56%) ⬇️
src/awkward/operations/ak_almost_equal.py 92.72% <75.00%> (-1.62%) ⬇️
...rc/awkward/operations/ak_merge_union_of_records.py 98.03% <98.82%> (+9.32%) ⬆️
src/awkward/_nplikes/array_module.py 90.00% <100.00%> (+0.15%) ⬆️
src/awkward/_nplikes/shape.py 78.68% <0.00%> (+1.63%) ⬆️

@agoose77 agoose77 temporarily deployed to docs-preview February 13, 2023 17:21 — with GitHub Actions Inactive
Comment on lines 190 to 191
outer_option_index[is_this_tag] = backend.index_nplike.where(
is_non_null, outer_option_index[is_this_tag], -1
Copy link
Collaborator Author

@agoose77 agoose77 Feb 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that might be better as a kernel (and perhaps we already have a kernel that is applicable here).

@agoose77 agoose77 temporarily deployed to docs-preview February 13, 2023 18:02 — with GitHub Actions Inactive
Comment on lines +44 to +126
def invert_record_union(
tags: ArrayLike, index: ArrayLike, contents, *, backend
) -> ak.contents.RecordArray:
index_nplike = backend.index_nplike
# First, create an ordered list containing the union of all fields
seen_fields = set()
all_fields = []
for content in contents:
# Find new fields
for field in content.fields:
if field not in seen_fields:
seen_fields.add(field)
all_fields.append(field)

# Build unions for each field
outer_field_contents = []
for field in all_fields:
field_tags = index_nplike.asarray(tags, copy=True)
field_index = index_nplike.asarray(index, copy=True)

# Build contents for union representing current field
field_contents = [c.content(field) for c in contents if c.has_field(field)]

# Find the best location for option type.
# We will potentially have fewer contents in this per-field union
# than the original outer union-of-records, because some recordarrays
# may not have the given field.
tag_for_missing = 0
for i, content in enumerate(field_contents):
if content.is_option:
tag_for_missing = i
break

# If at least one recordarray doesn't have this field, we add
# a special option
if len(field_contents) < len(contents):
# Make the tagged content an option, growing by one to ensure we
# have a known `None` value to index into
tagged_content = field_contents[tag_for_missing]
indexedoption_index = backend.index_nplike.arange(
tagged_content.length + 1, dtype=np.int64
)
indexedoption_index[
index_nplike.shape_item_as_index(tagged_content.length)
] = -1
field_contents[
tag_for_missing
] = ak.contents.IndexedOptionArray.simplified(
ak.index.Index64(indexedoption_index), tagged_content
)

# Index of None values in tagged content (content with extra None item at end)
index_missing = index_nplike.shape_item_as_index(
field_contents[tag_for_missing].length - 1
)
elif layout.is_indexed and layout.content.is_record:
record = layout.content
# Transpose index-of-record to record-of-index
return ak.contents.RecordArray(
[
ak.contents.IndexedArray.simplified(
layout.index, c, parameters=layout._parameters
)
for c in record.contents
],
record.fields,
record.length,
backend=backend,
# Now build contents for union, by looping over outermost index
# Overwrite tags to adjust for new contents length
# and use the tagged content for any missing values
k = 0
for j, content in enumerate(contents):
tag_is_j = field_tags == j

if content.has_field(field):
# Rewrite tags to account for missing fields
field_tags[tag_is_j] = k
k += 1

else:
# Rewrite tags to point to option content
field_tags[tag_is_j] = tag_for_missing
# Point each value to missing value
field_index[tag_is_j] = index_missing

outer_field_contents.append(
ak.contents.UnionArray.simplified(
ak.index.Index8(field_tags),
ak.index.Index64(field_index),
field_contents,
)
)
else:
raise ak._errors.wrap_error(TypeError(layout))
return ak.contents.RecordArray(
outer_field_contents, all_fields, backend=backend
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has not really changed, I only eliminated the index inversion code (this is now handled separately in the new code)

@agoose77 agoose77 marked this pull request as ready for review February 13, 2023 18:04
@agoose77
Copy link
Collaborator Author

@jpivarski I think this is ready for review. It passes some simple tests, and the logic seems intuitive, but it's obviously complex. I'd be keen to know if you recall any kernels I can use instead of np.where (or I can look tomorrow), and if anything looks amiss or can be done more performantly.

@agoose77 agoose77 temporarily deployed to docs-preview February 13, 2023 18:15 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is complex, but the intentions are well documented in comments. I've read through them.

Regarding nplike.where vs. a dedicated kernel: very few kernels are shared by more than one operation, and I wouldn't be able to tell offhand if there was one that could be applied to this case.

Since it is possible with standard NumPy operations, that's preferable, apart from the performance hit of doing multiple passes over the data.

@agoose77 agoose77 temporarily deployed to docs-preview February 14, 2023 08:43 — with GitHub Actions Inactive
@agoose77 agoose77 merged commit 6b49960 into main Feb 14, 2023
@agoose77 agoose77 deleted the agoose77/fix-option-merge-union-of-records branch February 14, 2023 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants