Support pyarrow.large_* as column type in dataframe upload/ download #1706
Labels
api: bigquery
type: feature request
Is your feature request related to a problem? Please describe.
QueryJob.to_arrow downloads string columns as pyarrow.string. pyarrow.string has a 2GiB limit on the size of the data in the column (not just in a single element) that is guaranteed to work correctly. If query results are bigger, they might not break immediately because the data is usually chunked into smaller pieces, but many dataframe operations (like aggregations or even indexing) on these columns trigger an "ArrowInvalid: offset overflow" error. This is mainly caused by bad decisions in Arrow ([C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays, apache/arrow#33049), but we can try to keep BQ users safe. The performance/memory hit of the large types has usually been small, and 2GiB is very easy to cross.
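A minimal sketch of the failure mode (the sizes below are illustrative, and running the whole snippet takes a few GiB of RAM):

```python
import pyarrow as pa

# pyarrow.string stores offsets as int32, so all of the data behind one array has to
# fit in 2**31 bytes; pyarrow.large_string uses int64 offsets and has no such limit.
chunk = pa.array(["x" * (256 * 1024 * 1024)])   # one ~256 MiB string per chunk
chunked = pa.chunked_array([chunk] * 9)         # ~2.25 GiB total, fine while chunked
# chunked.combine_chunks()                      # raises ArrowInvalid: offset overflow

large = chunked.cast(pa.large_string())         # same data with 64-bit offsets
large.combine_chunks()                          # works
```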
Describe the solution you'd like
1. Accept pyarrow.large_string (and the other large_* types) as the column type when uploading dataframes.
2. Use, or allow opting into, pyarrow.large_string for the string columns returned by QueryJob.to_arrow.
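For the upload direction (1), this is the kind of dataframe the request is about; `pd.ArrowDtype` needs pandas >= 2.0, and whether `load_table_from_dataframe` currently accepts such columns is exactly what is being asked for, not something this sketch asserts:

```python
import pandas as pd
import pyarrow as pa

# A string column backed by Arrow large_string (int64 offsets) instead of string.
df = pd.DataFrame(
    {"body": pd.array(["some", "strings"], dtype=pd.ArrowDtype(pa.large_string()))}
)

# The request is that the client map this column to a STRING field, e.g.
#   client.load_table_from_dataframe(df, "my_dataset.my_table").result()
# (client is a google.cloud.bigquery.Client; the table name is a placeholder.)
```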
Describe alternatives you've considered
For 2, I have converted the string columns to large_string myself immediately after loading, and it has not triggered issues yet, but the Arrow API does not seem to guarantee that this should continue to work.
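A sketch of that workaround; the helper name is mine, and it simply casts every top-level string field to large_string right after the download:

```python
import pyarrow as pa

def use_large_strings(table: pa.Table) -> pa.Table:
    """Cast all top-level string columns of an Arrow table to large_string."""
    target = pa.schema(
        [
            pa.field(f.name, pa.large_string()) if pa.types.is_string(f.type) else f
            for f in table.schema
        ]
    )
    return table.cast(target)

# e.g. arrow_table = use_large_strings(query_job.to_arrow())
```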
Additional context