[CT-980] [feature] Use dataproc serverless instead of dataproc cluster #248
Comments
Additional ask in this Slack thread: Could we avoid the need for the …? Currently, we write the compiled code to GCS as text (dbt-bigquery/dbt/adapters/bigquery/impl.py, lines 910 to 913 at commit 2c40417).
It seems possible to pass in a pointer to a local file for Dataproc PySpark jobs instead. We'd need to write to a local file and upload that, which has its downsides (different OSes, execution environments, ...).
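For reference, a minimal sketch of the write-to-GCS approach referenced above; the bucket and object names are illustrative assumptions, not the actual values in impl.py:

```python
# A minimal sketch of uploading compiled model code to GCS as text; the bucket
# and path names are illustrative, not dbt-bigquery's actual configuration.
from google.cloud import storage

def upload_compiled_code(bucket_name: str, blob_path: str, compiled_code: str) -> str:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    blob.upload_from_string(compiled_code)  # write the PySpark source as a text object
    return f"gs://{bucket_name}/{blob_path}"  # URI later handed to the Dataproc job
```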
We should support both if possible, but default to serverless. The user would likely need to specify a cluster to use non-serverless.
@ChenyuLInx let's convert this from a spike to what we actually are planning to do.
@ChenyuLInx @lostmygithubaccount Sharing my opinions:
Feel free to disagree; I just want to make sure we have the alignment we need to move forward with this.
@jtcohen6 In dbt-spark I added `submission_method`.
Two questions:
Tagging @lostmygithubaccount also.
Ok, I'm aligned! More explicit is good. I also prefer using `submission_method`.
Yes, I think so. I think that will just require us to pull the same configs off the …
I see: so we'll be asking users to define both:

```yaml
dataproc_cluster_name: my-cluster
submission_method: cluster
```

If the first is missing, it will raise an error. If the second is missing, dbt will keep using the cluster.

Among acceptance criteria for closing this issue, could we also include re-enabling automated tests for dbt Python models on GCP? I'm hopeful that our use of Dataproc Serverless will make this more reliable (if also slow). If we need to open a separate ticket for retry / connection timeout, let's do it.
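A minimal sketch of the resolution logic described above, assuming configs named `submission_method` and `dataproc_cluster_name`; this is a sketch of the agreed behavior, not the actual dbt-bigquery implementation:

```python
# Sketch: explicit "cluster" requires a cluster name; a missing
# submission_method falls back to the cluster if one is named, else serverless.
def resolve_submission_method(config: dict) -> str:
    method = config.get("submission_method")
    cluster = config.get("dataproc_cluster_name")
    if method == "cluster":
        if not cluster:
            raise ValueError(
                "submission_method is 'cluster' but dataproc_cluster_name is not set"
            )
        return "cluster"
    if method == "serverless":
        return "serverless"
    if method is None:
        # No explicit method: keep using the cluster if one is named, else serverless.
        return "cluster" if cluster else "serverless"
    raise ValueError(f"unknown submission_method: {method!r}")
```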
Sounds great!! I think we are all aligned!
Description
Update the Python model submission methods in Dataproc to support both a Dataproc cluster and Dataproc Serverless. When a cluster ID is provided, we will use the Dataproc cluster by default; the user can override this by specifying a separate config in the model's YAML file. When no cluster ID is provided, we will submit using Serverless by default; the user can override this by providing a cluster ID and submission method in the model's config.
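A sketch of what the per-model override could look like in a dbt Python model, assuming the config names discussed above; the model and config values are illustrative:

```python
# Hypothetical per-model override in a dbt Python model, assuming the
# `submission_method` and `dataproc_cluster_name` config names discussed above.
def model(dbt, session):
    dbt.config(
        submission_method="cluster",         # force cluster submission for this model
        dataproc_cluster_name="my-cluster",  # required when using a cluster
    )
    return dbt.ref("my_upstream_model")
```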
Original description
Good context in this Slack thread
Docs: https://cloud.google.com/dataproc-serverless/docs/overview
Dataproc Serverless: users would no longer need `dataproc_cluster_name` + `dataproc_region` in their connection profile.
Risks to investigate:
`conda-pack` (docs). The default image for Dataproc Serverless (docs) seems to include a few popular packages (`koalas`, `numpy`, `pandas`, `regex`, `scikit-learn`, ...), but this is far from everything a user would need.
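For reference, a minimal sketch of what submitting a PySpark batch to Dataproc Serverless could look like with the google-cloud-dataproc client; the project, region, bucket, and container image values are illustrative assumptions, not dbt-bigquery's implementation:

```python
# A minimal sketch of Dataproc Serverless batch submission; all names and
# values below are illustrative assumptions, not dbt-bigquery's implementation.
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"  # assumed values

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        # The compiled model code uploaded to GCS, as discussed above.
        main_python_file_uri="gs://my-bucket/models/my_model.py"
    ),
    # One way to address the third-party package risk above: point the batch
    # at a custom container image with the needed packages preinstalled.
    runtime_config=dataproc_v1.RuntimeConfig(
        container_image="gcr.io/my-project/my-spark-image:latest"  # hypothetical
    ),
)

operation = client.create_batch(
    parent=f"projects/{project}/locations/{region}",
    batch=batch,
)
operation.result()  # block until the batch finishes
```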