to_bigquery ideas - no intermediate storage #3
This is what I have so far. The code below works if the table already exists; however, it fails if I try it with a non-existing table. I think the reason this happens is that initially all the partitions try to create the table, and once one of them has done it, all the others error out. I'm trying to see what to do for the case where the table doesn't exist.

```python
from dask import delayed
import pandas as pd
import pandas_gbq
import dask.dataframe as dd
import dask
from distributed import get_client
from typing import Dict


@delayed
def write_gbq(
    df: pd.DataFrame,
    destination_table,
    project_id,
    table_cond,
):
    pandas_gbq.to_gbq(
        df,
        destination_table=destination_table,
        project_id=project_id,
        if_exists=table_cond,
    )


def to_gbq(
    ddf: dd.DataFrame,
    *,
    destination_table: str,
    project_id: str,
    table_cond: str,
    compute_options: Dict = None,
):
    if compute_options is None:
        compute_options = {}

    partitions = [
        write_gbq(partition, destination_table, project_id, table_cond)
        for partition in ddf.to_delayed()
    ]

    try:
        client = get_client()
    except ValueError:
        # Using single-machine scheduler
        dask.compute(partitions, **compute_options)
    else:
        return client.compute(partitions, **compute_options)
```

Any ideas are welcome... I will put this into a separate branch soon; I'm trying to figure out the best approach since the reading part is not ready/merged yet. cc: @bnaul, @jrbourbeau
It seems we will have to provide some sort of schema to be able to create an empty table, or somehow write one row to create the table and then append the rest of the partitions.
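A minimal sketch of that second idea, reusing the `write_gbq` helper from the snippet above (the function name `to_gbq_sequential_create` is made up for illustration): write the first partition eagerly so the table exists, then append the remaining partitions in parallel.

```python
import dask


def to_gbq_sequential_create(ddf, *, destination_table, project_id, table_cond):
    # Hypothetical variant of to_gbq above: let exactly one task create the
    # table, so the other partitions never race on table creation.
    first, *rest = ddf.to_delayed()

    # Write the first partition eagerly; this creates the table (and its
    # schema) using the user's if_exists setting.
    write_gbq(first, destination_table, project_id, table_cond).compute()

    # The remaining partitions can now safely append in parallel.
    appends = [
        write_gbq(partition, destination_table, project_id, "append")
        for partition in rest
    ]
    dask.compute(appends)
```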
We went down this road (with `bigquery.Client.load_table_from_dataframe` instead of `pandas_gbq`, but same thing basically) but abandoned it for a couple of reasons in favor of a Parquet intermediary.
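For context, a rough sketch of what that per-partition `load_table_from_dataframe` approach looks like (hypothetical names, not the actual abandoned code):

```python
import dask
from dask import delayed
from google.cloud import bigquery


@delayed
def _load_partition(df, table_id):
    # Each partition issues its own load job against the destination table;
    # the table-creation race discussed above happens right here.
    client = bigquery.Client()
    client.load_table_from_dataframe(df, table_id).result()


def to_gbq_via_load_jobs(ddf, table_id):
    dask.compute([_load_partition(part, table_id) for part in ddf.to_delayed()])
```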
Thanks for your response @bnaul. In that case we might want to hold off on implementing the writing, since relying on an intermediate storage step might not be ideal.
I've been trying to implement this by creating an empty table with the proper schema inferred from the dask dataframe, but no luck: the table gets created but with no schema at all. I opened an issue describing the problem on the …
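One way around that would be to build the BigQuery schema explicitly from the dask dataframe's dtypes and create the empty table up front. A minimal sketch, assuming a simplified dtype mapping (not what dask-bigquery actually does):

```python
from google.cloud import bigquery

# Rough pandas/dask dtype -> BigQuery type mapping (assumed and incomplete).
_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "STRING",
}


def create_empty_table(ddf, table_id):
    # Build an explicit schema from the dask dataframe's column dtypes and
    # create the (empty) destination table before any partition is written.
    schema = [
        bigquery.SchemaField(name, _DTYPE_TO_BQ.get(str(dtype), "STRING"))
        for name, dtype in ddf.dtypes.items()
    ]
    bigquery.Client().create_table(bigquery.Table(table_id, schema=schema))
```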
@ncclementi @jrbourbeau there's a new(ish?) "Storage Write API" that's the analog of what we're currently using for reads: https://cloud.google.com/bigquery/docs/write-api#advantages. This bit in particular seems to address my comment above. I don't see anything about dataframe or pyarrow support though, only gRPC... maybe @tswast could clarify whether there's anything in the works upstream that might facilitate using this API here?
The API semantics are a great fit, and I do eventually want to build a pandas DataFrame -> BQ Storage Write API connector. Unfortunately, that's a tough task, as the only supported data format in the backend is Protocol Buffers v2 wire format (proto2). Converting from DataFrame -> proto2 is going to take some work, especially if we want to do it efficiently.
If there isn't a nice way to push data in directly I'd be open to using an ephemeral Parquet dataset.

```python
def to_gbq(df, table, ...):
    # Sketch: stage the dataframe as Parquet on GCS, load it into BigQuery
    # from there, then clean up the temporary files.
    temp_path = "gs://some-temporary-storage"
    try:
        df.to_parquet(temp_path)
        bigquery.Client().load_table_from_uri(temp_path + "/*", table).result()
    finally:
        gcsfs.GCSFileSystem().rm(temp_path, recursive=True)
```

This is error prone though, and potentially in an expensive way. We'd maybe want that function to include a runtime warning? Or maybe this is a bad idea generally, and we should resolve it with best-practices documentation instead.
Unfortunately not. I found a public issue, but the corresponding internal issue is still in the backlog: https://issuetracker.google.com/249245481 ("Arrow format support for BigQuery Storage Write API"). Ephemeral Parquet seems to be the way some of my colleagues over on the AI side of GCP are leaning too. You might be able to use lifecycle management to automatically delete temp files in case the program ends before cleanup can run: https://cloud.google.com/storage/docs/lifecycle-configurations#deletion-example
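For example, a one-day delete rule can be attached to the temporary bucket with the google-cloud-storage client (a sketch; the bucket name here is made up):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-dask-bigquery-tmp")  # hypothetical temp bucket

# Delete any object older than one day, so staged Parquet files get cleaned
# up even if the writing program dies before its own cleanup runs.
bucket.add_lifecycle_delete_rule(age=1)
bucket.patch()
```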
Ah, that does sound like a more robust approach. Thanks for the suggestion.
Yep, that's exactly what we do: we have a temporary bucket with a 24-hour lifespan and we run this function for the upload (there might be some internal helpers in here, and it probably isn't 100% runnable, but you get the gist).
Brett's function looks much better, but in case it helps anyone (e.g. anyone wanting to walk through things in IPython like I just was), I was just messing around and got this example working:

```python
import coiled

c = coiled.Cluster(n_workers=4)
client = c.get_client()

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    {
        "aaa": ["a" + str(i) for i in range(100)],
        "bbb": ["b" + str(i) for i in range(100)],
    }
)
ddf = dd.from_pandas(df, npartitions=8)

creds_info = ...  # redacted! This is a dict that looks like GCP creds JSON

# Stage the dask dataframe as Parquet on GCS.
parquet_location = "gs://david-demo-1-bucket/ab12345.parquet"
ddf.to_parquet(parquet_location, storage_options={"token": creds_info})

# Load the staged Parquet files into BigQuery, letting it autodetect the schema.
from google.cloud import bigquery
from google.oauth2.service_account import Credentials

creds = Credentials.from_service_account_info(info=creds_info)
bq_client = bigquery.Client(credentials=creds)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=True,
)
load_job = bq_client.load_table_from_uri(
    parquet_location + "/*", "test_dataset.ab123456", job_config=job_config
)
load_job.result()
```
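As a follow-up to the cleanup discussion above, the staged Parquet files from that example could be removed once the load job finishes, e.g. with gcsfs (a sketch reusing names from the snippet above):

```python
import gcsfs

# Remove the temporary Parquet dataset now that BigQuery has ingested it;
# a bucket lifecycle rule still acts as a safety net if this never runs.
fs = gcsfs.GCSFileSystem(token=creds_info)
fs.rm(parquet_location, recursive=True)
```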
Currently, the `to_bigquery` presented in the gist uses temporary storage. I think this is not ideal, given that the user will have to create the storage to be able to do this. I was wondering if it would be possible to take a similar approach to what was done for `dask-mongo`, where `write_bgq` would use `pandas.to_gbq()` on the pandas `df` that comes from each partition; the partitions and `write_bigquery` would look something like the sketch below.
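A minimal sketch of that shape (assuming `ddf` is a dask dataframe; this mirrors the code in the first comment above rather than the snippets originally attached to this description):

```python
import pandas as pd
import pandas_gbq
from dask import delayed

# One delayed write task per pandas partition, mirroring the dask-mongo pattern.
partitions = ddf.to_delayed()


@delayed
def write_bigquery(df: pd.DataFrame, destination_table: str, project_id: str):
    # Each task writes its pandas partition with pandas-gbq, appending to
    # the destination table.
    pandas_gbq.to_gbq(
        df,
        destination_table=destination_table,
        project_id=project_id,
        if_exists="append",
    )
```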