
BigQuery Python client fails to query JSON field with heterogeneous data types #1980

Open
Hugo-Polloli opened this issue Jul 18, 2024 · 0 comments
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API.)
Hugo-Polloli commented Jul 18, 2024

Environment details

  • OS type and version: Debian 5.10.221-1
  • Python version: 3.11.6
  • pip version: 24.1.2
  • google-cloud-bigquery version: 3.25.0

Steps to reproduce

  1. Create a table like so:

CREATE OR REPLACE TABLE `my_table`
AS
SELECT JSON'[{"foo":"A"},{"foo":5}]' AS DATA

  2. Query it from Python:

from google.cloud.bigquery import Client

client = Client()

df = client.query("SELECT * FROM `my_table`").to_dataframe()

print(df["DATA"][0])

Expected behavior

The following should be printed:

[{'foo':'A'},{'foo':5}]

Stack trace

Traceback (most recent call last):
  File "/home/hugo/test/test.py", line 9, in <module>
    ).to_dataframe()
      ^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 2053, in to_dataframe
    return query_result.to_dataframe(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/table.py", line 2379, in to_dataframe
    record_batch = self.to_arrow(
                   ^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/table.py", line 1946, in to_arrow
    for record_batch in self.to_arrow_iterable(
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/table.py", line 1809, in _to_page_iterable
    yield from result_pages
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 731, in download_arrow_row_iterator
    yield _row_iterator_page_to_arrow(page, column_names, arrow_types)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 704, in _row_iterator_page_to_arrow
    arrays.append(pyarrow.array(page._columns[column_index], type=arrow_type))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 368, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'A' with type str: tried to convert to int64

The error suggests that PyArrow is attempting to convert all values for the "foo" key to int64, based on the presence of the integer value 5. However, it fails when encountering the string value "A" for the same key.

A fix would be to make the type inference during JSON data ingestion handle heterogeneous data types for the same key.

One thing of note: the JSON type is not yet explicitly handled in _pandas_helpers.py. Printing arrow_types inside download_arrow_row_iterator during execution of the code above yields [None], since no mapping exists from the BigQuery JSON type to any pyarrow type.

product-auto-label bot added the api: bigquery label on Jul 18, 2024
Hugo-Polloli changed the title from "Querying a JSON field containing heterogenous datatypes for the same field fails" to "Querying a JSON field containing heterogeneous datatypes for the same field fails" on Jul 18, 2024
Hugo-Polloli changed the title to "BigQuery Python client fails to query JSON field with heterogeneous data types" on Jul 18, 2024