Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handle Date32 columns in Arrow tables and Polars DataFrames #3377

Merged
merged 8 commits into from
Mar 24, 2024

Conversation

jonmmease
Copy link
Contributor

@jonmmease jonmmease commented Mar 22, 2024

Closes #3280

This includes:

  1. Support sanitizing Date32 columns in pyarrow Tables
  2. For objects that follow the DataFrame interchange protocol, check for direct arrow conversion methods by name ("arrow", "to_arrow", and "to_arrow_table"), and if found use this instead of pyarrow from_dataframe function (which doesn't support Date32 yet).
  3. To make mypy happy, I updated the DataFrameLike protocol to be @runtime_checkable (which adds support for isinstance checks), and switched our if hasattr(data, '__dataframe__'): calls to if isinstance(data, DataFrameLike): calls. This way mypy knows that data is a DataFrameLike inside the branch and is happy for it to be passed to the arrow_table_from_dfi_dataframe function.
  4. Update VegaFusion tests for VegaFusion 1.6.6 and update the minimum VegaFusion version to 1.6.6.

@@ -56,7 +56,7 @@ def _dataset_name(values: Union[dict, list, core.InlineDataset]) -> str:
values = values.to_dict()
if values == [{}]:
return "empty"
values_json = json.dumps(values, sort_keys=True)
values_json = json.dumps(values, sort_keys=True, default=str)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For computing the values hash, I think it's fine to fallback to string representation for types not supported by json.dumps (datetime.date in this case).

@jonmmease jonmmease changed the title fix: Handle Date32 columns in arrow tables fix: Handle Date32 columns in Arrow tables and Polars DataFrames Mar 23, 2024
@@ -241,7 +241,7 @@ def check_pre_transformed_vega_spec(vega_spec):

# Check that the bin transform has been applied
row0 = data_0["values"][0]
assert row0 == {"a": "A", "b": 28, "b_end": 28.0, "b_start": 0.0}
assert row0 == {"a": "A", "b_end": 28.0, "b_start": 0.0}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VegaFusion 1.6.6 strips out the "description" encoding field (which isn't used in the canvas renderer), so it's able to do a better job dropping unused columns.

Copy link
Contributor

@mattijn mattijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonmmease, really nice PR! Literature level code👍.
I added one more commit with a few more isinstance(data, DataFrameLike) over hasattr(data, __dataframe__)

I'm surprised that pandas has no a to_arrow() method or something a like.

Comment on lines +352 to +357
for convert_method_name in ("arrow", "to_arrow", "to_arrow_table"):
convert_method = getattr(dfi_df, convert_method_name, None)
if callable(convert_method):
result = convert_method()
if isinstance(result, pa.Table):
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a joy to read this type of code diff. Really nice approach!

@mattijn
Copy link
Contributor

mattijn commented Mar 24, 2024

All tests pass. Merging! Thanks again @jonmease!

@mattijn mattijn merged commit c7c4149 into main Mar 24, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants