Cannot append to REQUIRED field when using client.load_table_from_file without providing table schema
#1981
Comments
For some additional context: python-bigquery/google/cloud/bigquery/client.py, line 2755 in 4383cfe
Thanks for the pointer @tswast
Conceptually, I am amenable to accepting such a PR. Thoughts/Questions:
@tswast How do you feel about this change? I am leaning toward treating this as a breaking change: users would start having checks performed and schemas set in ways they are not expecting.
Potentially, but we need to be very careful, as we don't want to call get_table twice from the load-from-dataframe code path. It might be worth a refactor: remove the call from load-from-dataframe and move it to load-from-file. A few things we should test for / watch out for:
As far as extra latency goes, I agree that's a concern. I would much prefer this get addressed in the backend. At first glance that seems feasible: as seen in this issue, the initial schema nullability check appears to be redundant with the per-row validation that checks for NULL.
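The concern about calling get_table twice could be handled along the following lines. This is only a sketch under stated assumptions: `schema_for_load` and its `known_table` parameter are hypothetical names, not part of google-cloud-bigquery; only `client.get_table` and `table.schema` are taken from the library. Handling of a destination table that does not exist yet is left out.

```python
from typing import Any, List, Optional, Sequence


def schema_for_load(
    client: Any,
    destination: str,
    explicit_schema: Optional[Sequence[Any]] = None,
    known_table: Optional[Any] = None,
) -> List[Any]:
    """Choose the schema for a load job with at most one get_table call.

    Hypothetical helper for illustration; not a library function.
    """
    # A schema supplied by the caller always wins, so no lookup is needed.
    if explicit_schema:
        return list(explicit_schema)
    # Reuse a table the caller has already fetched (for example in the
    # dataframe path) instead of issuing a second get_table request.
    table = known_table if known_table is not None else client.get_table(destination)
    return list(table.schema)
```

With this shape, the dataframe path could pass the table it already fetched as `known_table`, while the file path would trigger a single lookup only when no schema was supplied.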
As for whether it's a breaking change, that's a concern for me too. I would want to make sure we've validated every possible combination of local data and table schema if we were to accept such a change.
I've filed internal issue 356379648 to investigate this option. If fixed in the backend, we should be able to remove the workaround from the pandas code path as well.
Thank you both for the responses, and apologies for taking a while to come back.
Yes, that would be my ideal preference as well. As stated in the issue description, the current behaviour when using […] So, I would agree that an internal fix that does not require fetching the schema would be the ideal solution. If that does not pan out, see below for my responses to the other comments:
I'll also mention that should #1979 be accepted and implemented into […] If any updates on the internal issue can be provided here as they arise, that would be greatly appreciated. With that in mind, I will also hold off on a PR for the time being.
Environment details
google-cloud-bigquery version: 3.25.0
Steps to reproduce
client.load_table_from_file with a parquet file written from memory to a BytesIO buffer. The library writing to the buffer either does not have an option of nullable/required fields (Polars), or nullable=False is provided to the field (PyArrow).
Issue details
I am unable to use client.load_table_from_file to append to an existing table with a REQUIRED field without providing the table schema in the LoadJobConfig. The docs say that the schema does not need to be supplied if the table already exists.
This issue is similar to googleapis/google-cloud-python#8093, but relates to load_table_from_file rather than load_table_from_dataframe. It is also somewhat related to googleapis/google-cloud-python#8142 (as explicitly supplying the BigQuery table schema fixes the issue), but again this relates to load_table_from_file rather than load_table_from_dataframe.
As an aside, the fix should definitely not require PyArrow. The current Polars code functions without PyArrow if the BigQuery table schema is provided.
I am filing this as a bug rather than a feature request as the docs for schema in JobConfigurationLoad say […]
This does not hold up in the example below.
Code example
Apologies in advance that the example is a bit long.
It demonstrates that Parquet files written to BytesIO buffers from both Polars and PyArrow cannot be loaded into a BigQuery table with a mode=REQUIRED field.
Code example
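As a minimal sketch of the workaround mentioned in the issue details (supplying the existing table's schema explicitly in the LoadJobConfig), something like the following may help. It is not the reporter's original example. The function names and the `make_config` parameter are hypothetical; the calls `client.get_table`, `client.load_table_from_file`, `bigquery.LoadJobConfig`, `SourceFormat.PARQUET` and `WriteDisposition.WRITE_APPEND` are from google-cloud-bigquery.

```python
from typing import Any, BinaryIO, Callable, Sequence


def make_append_config(schema: Sequence[Any]) -> Any:
    """Build a Parquet append config that carries the table's own schema."""
    # Imported lazily so the rest of the sketch can be exercised with a
    # stub client when google-cloud-bigquery is not installed.
    from google.cloud import bigquery

    return bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        schema=list(schema),
    )


def append_parquet_with_table_schema(
    client: Any,
    table_id: str,
    parquet_file: BinaryIO,
    make_config: Callable[[Sequence[Any]], Any] = make_append_config,
) -> Any:
    """Append a Parquet buffer to an existing table (hypothetical helper).

    Fetches the destination schema first so that REQUIRED fields keep
    their mode, which is the workaround described in this issue.
    """
    table = client.get_table(table_id)      # one extra API round trip
    job_config = make_config(table.schema)  # includes the REQUIRED modes
    parquet_file.seek(0)                    # the buffer was just written to
    job = client.load_table_from_file(parquet_file, table_id, job_config=job_config)
    return job.result()                     # blocks until the load finishes
```

The same call should apply whether the buffer was written by Polars or by PyArrow, since only the Parquet bytes and the destination table's schema are involved. The cost is the additional get_table request discussed in the comments.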
Stack trace
Both polars_way and pyarrow_way raise with the error. Here they both are.