One thing that didn't make it in from the original gist was the `partition_field` keyword. The motivating idea behind that feature was that sometimes there's a natural index column for your data, but `ORDER BY` in BigQuery does not scale well, and `ddf.set_index()` requires a shuffle if the data is not sorted. In the example of a date-partitioned table, pre-indexing the dataframe by that date field would speed up a lot of aggregations by date, which seems like a pretty common use case.
But now that I think about it, there's actually nothing particular to partitioned tables in the original logic: if you know the divisions of the dataframe index (or can compute them), then the same read logic should work regardless of whether the table is partitioned or not:
```
divisions=['AK', 'CA', 'GA', 'IL', 'MD', 'MP', 'NH', 'OK', 'SC', 'VI', 'WY']
ddf.index=Dask Index Structure:
npartitions=10
AK    object
CA       ...
...      ...
VI       ...
WY       ...
Name: state, dtype: object
Dask Name: from-delayed, 30 tasks
len(ddf)=56
total_rows=56
```
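The core of the division-based read logic above could be sketched as turning the known division boundaries into one `WHERE` clause per partition. The helper below (`partition_filters` is a hypothetical name, not part of any package) follows Dask's division convention: with `npartitions + 1` boundaries, partition `i` spans `[divisions[i], divisions[i+1])`, and the final partition is inclusive of its upper bound:

```python
from typing import List, Sequence

def partition_filters(index_col: str, divisions: Sequence[str]) -> List[str]:
    """Turn Dask-style divisions into per-partition WHERE clauses.

    Dask divisions have npartitions + 1 entries: partition i spans
    [divisions[i], divisions[i+1]), and the last partition is
    inclusive of its upper bound.
    """
    filters = []
    for i in range(len(divisions) - 1):
        lo, hi = divisions[i], divisions[i + 1]
        upper_op = "<=" if i == len(divisions) - 2 else "<"
        filters.append(f"{index_col} >= '{lo}' AND {index_col} {upper_op} '{hi}'")
    return filters

divisions = ['AK', 'CA', 'GA', 'IL', 'MD', 'MP', 'NH', 'OK', 'SC', 'VI', 'WY']
for clause in partition_filters("state", divisions):
    print(clause)
# First clause: state >= 'AK' AND state < 'CA'
# Last clause:  state >= 'VI' AND state <= 'WY'
```

Nothing here depends on the table being partitioned, which is the point: each filter defines one delayed read, and the resulting dataframe can be assembled with known divisions and no shuffle.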
I don't think there's any question that this is a useful bit of functionality, but it's not totally clear to me what the API should look like (part of `read_gbq`? its own function?). I can't think of an analogous Dask pattern, but if anyone knows of something similar, that could be a place to borrow ideas from. `pd.read_gbq` has an `index_col` parameter, but in this case we'd probably want to support setting `divisions` or `npartitions` as well.
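For discussion's sake, one possible shape for this would be to add those knobs directly to `read_gbq`. This is purely a hypothetical signature sketch; none of these parameters exist in any released API:

```python
from typing import Optional, Sequence

# Hypothetical sketch only -- parameter names are illustrative.
def read_gbq(
    project_id: str,
    dataset_id: str,
    table_id: str,
    index_col: Optional[str] = None,       # sorted column to use as the index
    divisions: Optional[Sequence] = None,  # known boundaries; skip computing them
    npartitions: Optional[int] = None,     # or let the reader choose boundaries
):
    """Read a BigQuery table into a Dask DataFrame, optionally pre-indexed."""
    raise NotImplementedError  # signature illustration only
```

Mirroring `pd.read_gbq`'s `index_col` keeps the pandas analogy, while `divisions`/`npartitions` cover the two cases of "I know the boundaries" and "compute them for me".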