[CT-980] [feature] Use dataproc serverless instead of dataproc cluster #248
Comments
Additional ask in this Slack thread: Could we avoid the need for the …? Currently, we write the compiled code to GCS as text (dbt-bigquery/dbt/adapters/bigquery/impl.py, lines 910 to 913 at commit 2c40417).
It seems possible to pass in a pointer to a local file for Dataproc PySpark jobs instead. We'd need to write to a local file and upload that, which has its downsides (different OSes, execution environments, ...).
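For reference, a minimal sketch of the write-to-GCS approach referenced above; the bucket and object names are illustrative assumptions, not the actual values in impl.py:

```python
# A minimal sketch of uploading compiled model code to GCS as text; the bucket
# and path names are illustrative, not dbt-bigquery's actual configuration.
from google.cloud import storage

def upload_compiled_code(bucket_name: str, blob_path: str, compiled_code: str) -> str:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    blob.upload_from_string(compiled_code)  # write the PySpark source as a text object
    return f"gs://{bucket_name}/{blob_path}"  # URI later handed to the Dataproc job
```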
We should support both if possible, but default to serverless. The user would likely need to specify a cluster to use non-serverless.
@ChenyuLInx let's convert this from a spike to what we actually are planning to do.
@ChenyuLInx @lostmygithubaccount Sharing my opinions:
Feel free to disagree; I just want to make sure we have the alignment we need to move forward with this.
@jtcohen6 In dbt-spark I added `submission_method`.
Two questions:
Tagging @lostmygithubaccount also.
Ok, I'm aligned! More explicit is good. I also prefer using `submission_method`.
Yes, I think so. I think that will just require us to pull the same configs off the …
I see: so we'll be asking users to define both:

```yaml
dataproc_cluster_name: my-cluster
submission_method: cluster
```

If the first is missing, it will raise an error. If the second is missing, dbt will keep using the cluster.

Among acceptance criteria for closing this issue, could we also include re-enabling automated tests for dbt Python models on GCP? I'm hopeful that our use of Dataproc Serverless will make this more reliable (if also slow). If we need to open a separate ticket for retry / connection timeout, let's do it.
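A minimal sketch of the resolution logic described above, assuming configs named `submission_method` and `dataproc_cluster_name`; this is a sketch of the agreed behavior, not the actual dbt-bigquery implementation:

```python
# Sketch: explicit "cluster" requires a cluster name; a missing
# submission_method falls back to the cluster if one is named, else serverless.
def resolve_submission_method(config: dict) -> str:
    method = config.get("submission_method")
    cluster = config.get("dataproc_cluster_name")
    if method == "cluster":
        if not cluster:
            raise ValueError(
                "submission_method is 'cluster' but dataproc_cluster_name is not set"
            )
        return "cluster"
    if method == "serverless":
        return "serverless"
    if method is None:
        # No explicit method: keep using the cluster if one is named, else serverless.
        return "cluster" if cluster else "serverless"
    raise ValueError(f"unknown submission_method: {method!r}")
```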
Sounds great!! I think we are all aligned!
Description
Update the Python model submission methods in Dataproc to support both a Dataproc cluster and Dataproc Serverless. When a cluster ID is provided, we will use the Dataproc cluster by default; the user can override this by specifying a separate config in the model's YAML file. When no cluster ID is provided, we will submit using Serverless by default; the user can override this by providing a cluster ID and submission method in the model's config.
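A sketch of what the per-model override could look like in a dbt Python model, assuming the config names discussed above; the model and config values are illustrative:

```python
# Hypothetical per-model override in a dbt Python model, assuming the
# `submission_method` and `dataproc_cluster_name` config names discussed above.
def model(dbt, session):
    dbt.config(
        submission_method="cluster",         # force cluster submission for this model
        dataproc_cluster_name="my-cluster",  # required when using a cluster
    )
    return dbt.ref("my_upstream_model")
```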
Original description
Good context in this Slack thread
Docs: https://cloud.google.com/dataproc-serverless/docs/overview
Dataproc Serverless: users would no longer need `dataproc_cluster_name` + `dataproc_region` in their connection profile.
Risks to investigate:
`conda-pack` (docs). The default image for Dataproc Serverless (docs) seems to include a few popular packages (`koalas`, `numpy`, `pandas`, `regex`, `scikit-learn`, ...), but this is far from everything a user would need.
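For reference, a minimal sketch of what submitting a PySpark batch to Dataproc Serverless could look like with the google-cloud-dataproc client; the project, region, bucket, and container image values are illustrative assumptions, not dbt-bigquery's implementation:

```python
# A minimal sketch of Dataproc Serverless batch submission; all names and
# values below are illustrative assumptions, not dbt-bigquery's implementation.
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"  # assumed values

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        # The compiled model code uploaded to GCS, as discussed above.
        main_python_file_uri="gs://my-bucket/models/my_model.py"
    ),
    # One way to address the third-party package risk above: point the batch
    # at a custom container image with the needed packages preinstalled.
    runtime_config=dataproc_v1.RuntimeConfig(
        container_image="gcr.io/my-project/my-spark-image:latest"  # hypothetical
    ),
)

operation = client.create_batch(
    parent=f"projects/{project}/locations/{region}",
    batch=batch,
)
operation.result()  # block until the batch finishes
```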