Unify ArrowTensorType tables and Tensor blocks #18867
Looks like a great simplification! I'd defer this to @clarkzinzow to review and accept.
This looks great! I'm glad that it turned out to be a pretty simple port.
column, self._table.column_names))
array = self._table[column]
if array.num_chunks > 1:
    # TODO(ekl) combine fails since we can't concat ArrowTensorType?
Hmm, this should be easy to add; we can look at supporting it.
OK, do you have any pointers? I can give it a try.
Why are these changes needed?
This PR removes the special-case Tensor block format from Datasets, in favor of representing tensors as single-column Arrow tables.
Before:
After:
The NumPy read/write operations have been changed to support Arrow tables with tensor columns.
This both removes a lot of special-case Tensor handling and cleans up the user story for tensor support. In the future, we can add more methods to make working with Tensors simpler on top of this common path.