
Cannot append to REQUIRED field when using client.load_table_from_file without providing table schema #1981

Open
henryharbeck opened this issue Jul 20, 2024 · 7 comments

@henryharbeck

Environment details

  • OS type and version: Ubuntu 20.04 (WSL)
  • Python version: 3.11.8
  • pip version: 24.1.2
  • google-cloud-bigquery version: 3.25.0

Steps to reproduce

  1. Create a table that has a single field with mode=REQUIRED
  2. Attempt to append to the table using client.load_table_from_file with a Parquet file written from memory to a BytesIO buffer. The library writing to the buffer either has no option for nullable/required fields (Polars), or the field is left as nullable=True, the default (PyArrow).

Issue details

I am unable to use client.load_table_from_file to append to an existing table with a REQUIRED field without providing the table schema in the LoadJobConfig. The docs say that the schema does not need to be supplied if the table already exists.

This issue is similar to googleapis/google-cloud-python#8093, but relates to load_table_from_file rather than load_table_from_dataframe. It is also somewhat related to googleapis/google-cloud-python#8142 (as explicitly supplying the BigQuery table schema fixes the issue), but again this relates to load_table_from_file rather than load_table_from_dataframe.

As an aside, the fix should definitely not require PyArrow. The current Polars code functions without PyArrow if the table BigQuery schema is provided.

I am filing this as a bug rather than a feature request as the docs for schema in JobConfigurationLoad say

The schema can be omitted if the destination table already exists

which does not hold up in the example below.

Code example

Apologies in advance that the example is a bit long.

It demonstrates that Parquet files written to BytesIO buffers by both Polars and PyArrow cannot be appended to a BigQuery table whose field has mode=REQUIRED.

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

import polars as pl

PROJECT = "<project>"


def create_and_return_table(table_name: str, client: bigquery.Client) -> bigquery.Table:
    schema = [bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED")]
    table = bigquery.Table(f"{PROJECT}.testing.{table_name}", schema=schema)

    client.delete_table(table, not_found_ok=True)
    return client.create_table(table)


def polars_way(table: bigquery.Table, client: bigquery.Client):
    df = pl.DataFrame({"foo": [1, 2, 3]})

    with BytesIO() as stream:
        df.write_parquet(stream)

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )

        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )

    job.result()


def pyarrow_way(table: bigquery.Table, client: bigquery.Client):
    # nullable=True is the default, but make it explicit
    # This issue does not occur if nullable=False, but the docs imply that shouldn't be
    # required
    pyarrow_schema = pa.schema([pa.field("foo", pa.int64(), nullable=True)])
    pyarrow_table = pa.Table.from_pydict({"foo": [1, 2, 3]}, schema=pyarrow_schema)

    with BytesIO() as stream:
        writer = pq.ParquetWriter(stream, pyarrow_schema)
        writer.write(pyarrow_table)
        writer.close()

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )
        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )

    job.result()


def main():
    client = bigquery.Client()
    table = create_and_return_table("test_pl", client)
    polars_way(table, client)
    table = create_and_return_table("test_pa", client)
    pyarrow_way(table, client)


if __name__ == "__main__":
    main()

Stack trace

Both polars_way and pyarrow_way raise the same error. Here they both are.

# polars_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 93, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 80, in main
    polars_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 41, in polars_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pl.
Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid,
message: Provided Schema does not match Table <project>:testing.test_pl. Field foo has changed mode from REQUIRED to NULLABLE

# pyarrow_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 86, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 82, in main
    pyarrow_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 74, in pyarrow_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pa.
Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid,
message: Provided Schema does not match Table <project>:testing.test_pa. Field foo has changed mode from REQUIRED to NULLABLE
product-auto-label bot added the api: bigquery label Jul 20, 2024
@tswast
Contributor

tswast commented Jul 29, 2024

For some additional context, load_table_from_dataframe makes a call to get_table if no schema override is supplied by the user:

table = self.get_table(destination)
to solve a very similar problem (googleapis/google-cloud-python#8142, googleapis/google-cloud-python#8093).
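
For anyone hitting this in the meantime, a minimal sketch of the equivalent workaround at the call site, assuming the client, table, and stream objects from the reproduction above (it costs one extra get_table round trip):

from google.cloud import bigquery

# Fetch the destination table once and reuse its schema, mirroring what
# load_table_from_dataframe already does internally.
existing = client.get_table(table)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema=existing.schema,  # keeps the REQUIRED mode on "foo"
)
job = client.load_table_from_file(
    stream, destination=table, rewind=True, job_config=job_config
)
job.result()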

@henryharbeck
Author

Thanks for the pointer @tswast
Would you accept a PR implementing something similar for load_table_from_file?

@chalmerlowe
Collaborator

Conceptually, I am amenable to accepting such a PR. Thoughts/Questions:

  • It does represent a hit on latency. There is a round trip cost to check the schema. For a large upload, this is likely negligible, but for many very small uploads, repeatedly checking the schema could become prohibitive (same as with the dataframe case).
  • Do you feel comfortable providing unit tests as well? (testing multiple use cases: local schema present/local schema not present... there may be more, but it is early in the morning. We can discuss what makes the most sense.)

@tswast How do you feel about this change? I am leaning toward this being a breaking change. Users will start having checks performed and having schemas set in ways that they are not expecting.

@tswast
Contributor

tswast commented Jul 30, 2024

Potentially, but we need to be very careful, as we don't want to call get_table twice from the load-from-dataframe code path. It might be worth a refactor to remove it from load_table_from_dataframe and move it to load_table_from_file.

A few things we should test for / watch out for:

  • Only do the fetch logic when all of the following are true: the file type is Parquet, no schema is set on the job config, autodetect is not enabled, and no nullable relaxation is set (see the sketch after this list).
  • Test with a file with fewer columns than the original.
  • Test with a file with reordered columns from the original.
  • Test with a file with more columns than the original (and field addition turned on)
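
A rough sketch of that guard, purely illustrative (the helper name _should_fetch_schema is hypothetical, not existing library code):

from google.cloud import bigquery

def _should_fetch_schema(job_config: bigquery.LoadJobConfig) -> bool:
    # Hypothetical helper: only fetch the destination schema when the user
    # has not already specified how the schema should be handled.
    return (
        job_config.source_format == bigquery.SourceFormat.PARQUET  # Parquet loads only
        and job_config.schema is None                               # no user-supplied schema
        and not job_config.autodetect                               # autodetect not requested
        and not (job_config.schema_update_options or [])            # no relaxation/addition options
    )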

As far as extra latency goes, I agree that's a concern. I would much prefer this get addressed in the backend. At first glance that seems feasible: as seen in this issue, the initial schema nullability check appears to be redundant with the per-row validation that checks for NULLs.

@tswast
Contributor

tswast commented Jul 30, 2024

Regarding whether it's a breaking change, that's a concern for me too. I would want to make sure we've validated every possible combination of local data and table schema if we were to accept such a change.

@tswast
Contributor

tswast commented Jul 30, 2024

I would much prefer this get addressed in the backend.

I've filed internal issue 356379648 to investigate this option. If fixed in the backend, we should be able to remove the workaround from the pandas code path as well.

@henryharbeck
Author

Thank you both for the responses, and apologies for taking a while to come back.

I would much prefer this get addressed in the backend.

I've filed internal issue 356379648 to investigate this option. If fixed in the backend, we should be able to remove the workaround from the pandas code path as well.

Yes, that would be my ideal preference as well. As stated in the issue description, the current behaviour when using load_table_from_file could be argued to be a bug given the docs for the schema parameter in JobConfigurationLoad say "The schema can be omitted if the destination table already exists". Yet, providing a schema explicitly is currently the only way to append data into a REQUIRED field using this method.

So, I would agree that an internal fix that does not require fetching the schema would be the ideal solution.

If that does not eventuate, see below for responses to the other comments:

  • It does represent a hit on latency. There is a round trip cost to check the schema.
    • Yes, I definitely acknowledge this. Another option could be for this not to be the default behaviour of load_table_from_file (which would then differ from load_table_from_dataframe); a hypothetical sketch of such an opt-in follows this list.
  • Do you feel comfortable providing unit tests as well?
    • Yes, gladly
  • We need to be very careful, as we don't want to call get_table twice from the load-from-dataframe code path. It might be worth a refactor to remove it from load_table_from_dataframe and move it to load_table_from_file.
    • I agree multiple calls to get_table should be avoided, so a refactor seems reasonable
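
A purely hypothetical sketch of what an opt-in could look like; the fetch_destination_schema parameter is invented for illustration and does not exist in the library today:

job = client.load_table_from_file(
    stream,
    destination=table,
    rewind=True,
    job_config=job_config,
    # Hypothetical opt-in flag (not a real parameter): when True, the client
    # would call get_table(destination) and copy its schema onto job_config
    # before starting the load job.
    # fetch_destination_schema=True,
)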

I'll also mention that should #1979 be accepted and implemented in load_table_from_dataframe, that would remove my personal need for this issue. In that case, the current get_table path would also be applied to Polars DataFrames. But again, a fix in the backend avoiding the need to fetch the schema at all would be great in both cases!

If any updates on the internal issue can be provided here as they arise, that would be greatly appreciated. With that in mind, I will also hold off on a PR for the time being.
