Support pyarrow.large_* as column type in dataframe upload/ download #1706
Labels
api: bigquery
type: feature request
Is your feature request related to a problem? Please describe.
QueryJob.to_arrow downloads string columns as pyarrow.string. pyarrow.string has a 2GiB limit on the size of the data in the column (not just in a single element) that is guaranteed to work correctly. If query results are bigger, they might not break immediately because the data is usually chunked into smaller pieces, but many dataframe operations (like aggregations or even indexing) on these columns trigger an "ArrowInvalid: offset overflow" error. This is mainly caused by bad decisions in Arrow ([C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays, apache/arrow#33049), but we can try to keep BQ users safe. The performance/memory hit of the large types has usually been small, and 2GiB is very easy to cross.
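A minimal sketch of the failure mode (the sizes below are illustrative, and running the whole snippet takes a few GiB of RAM):

```python
import pyarrow as pa

# pyarrow.string stores offsets as int32, so all of the data behind one array has to
# fit in 2**31 bytes; pyarrow.large_string uses int64 offsets and has no such limit.
chunk = pa.array(["x" * (256 * 1024 * 1024)])   # one ~256 MiB string per chunk
chunked = pa.chunked_array([chunk] * 9)         # ~2.25 GiB total, fine while chunked
# chunked.combine_chunks()                      # raises ArrowInvalid: offset overflow

large = chunked.cast(pa.large_string())         # same data with 64-bit offsets
large.combine_chunks()                          # works
```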
Describe the solution you'd like
1. Accept pyarrow.large_string (and the other large_* types) as the column type when uploading dataframes.
2. Use, or allow opting into, pyarrow.large_string for the string columns returned by QueryJob.to_arrow.
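For the upload direction (1), this is the kind of dataframe the request is about; `pd.ArrowDtype` needs pandas >= 2.0, and whether `load_table_from_dataframe` currently accepts such columns is exactly what is being asked for, not something this sketch asserts:

```python
import pandas as pd
import pyarrow as pa

# A string column backed by Arrow large_string (int64 offsets) instead of string.
df = pd.DataFrame(
    {"body": pd.array(["some", "strings"], dtype=pd.ArrowDtype(pa.large_string()))}
)

# The request is that the client map this column to a STRING field, e.g.
#   client.load_table_from_dataframe(df, "my_dataset.my_table").result()
# (client is a google.cloud.bigquery.Client; the table name is a placeholder.)
```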
Describe alternatives you've considered
For 2, I have converted the string columns to large_string myself immediately after loading, and it has not triggered issues yet, but the Arrow API does not seem to guarantee that this should continue to work.
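A sketch of that workaround; the helper name is mine, and it simply casts every top-level string field to large_string right after the download:

```python
import pyarrow as pa

def use_large_strings(table: pa.Table) -> pa.Table:
    """Cast all top-level string columns of an Arrow table to large_string."""
    target = pa.schema(
        [
            pa.field(f.name, pa.large_string()) if pa.types.is_string(f.type) else f
            for f in table.schema
        ]
    )
    return table.cast(target)

# e.g. arrow_table = use_large_strings(query_job.to_arrow())
```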
Additional context