
feat: Add better spark support for snowflake offline store #3419

Closed · wants to merge 1 commit

Conversation

@sfc-gh-madkins (Collaborator) commented Dec 28, 2022

Signed-off-by: miles.adkins [email protected]

What this PR does / why we need it:

Adds the ability to return Snowflake offline store query results as a Spark DataFrame.

Which issue(s) this PR fixes:

Fixes #3364

@feast-ci-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sfc-gh-madkins

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sfc-gh-madkins (Collaborator, Author)

/ok-to-test

@sfc-gh-madkins (Collaborator, Author)

@amithadiraju1694 try this out ... you will need a PySpark environment with the Snowflake Spark connector installed.

You will need to pass in the Spark session plus a dict of Snowflake login params ... see the function comments.
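
A minimal usage sketch of what that call might look like (not from the PR itself): `job` is assumed to be the retrieval job returned by `store.get_historical_features(...)`, the option keys are the standard Snowflake Spark connector connection options, and all account values are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feast-snowflake").getOrCreate()

# Standard Snowflake Spark connector options; replace the placeholders
# with your own account details.
sf_params = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# `job` is assumed to be a SnowflakeRetrievalJob; the signature matches
# the method discussed in this thread.
spark_df = job.to_pyspark_df(spark, sf_params)
spark_df.show(5)
```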

@sfc-gh-madkins (Collaborator, Author)

@amithadiraju1694 you probably already tried this code, is my guess ... the reason you were getting that error is that the initial input dataframe is scoped to a different connection and Spark can't find it.

@amithadiraju1694 (Contributor)

> @amithadiraju1694 you probably already tried this code, is my guess ... the reason you were getting that error is that the initial input dataframe is scoped to a different connection and Spark can't find it.

Thanks for this @sfc-gh-madkins. I tried different variations of this along with your original solution, but the execution never stops (Databricks keeps saying "Running Command").

My final attempt looked like this in snowflake.py:

```python
def to_pyspark_df(self, spark_session: SparkSession, sfparam: dict) -> DataFrame:
    """
    Convert Snowflake query results to a PySpark dataframe.

    Args:
        spark_session: Spark session of the current environment.
        sfparam: dict of Snowflake connection options for the Spark connector.

    Returns:
        spark_df: A PySpark dataframe.
    """
    if isinstance(spark_session, SparkSession):
        table_name = "feast_spark_" + uuid.uuid4().hex
        self.to_snowflake(table_name=table_name)
        query = f'SELECT * FROM "{table_name}"'

        spark_df = (
            spark_session.read.format("net.snowflake.spark.snowflake")
            .options(**sfparam)
            .option("query", query)
            .option("autopushdown", "on")
            .load()
        )

        query = f'DROP TABLE "{table_name}"'
        execute_snowflake_statement(self.snowflake_conn, query)

        return spark_df
```

I tried the original solution as well, and it's giving me the same result. I'm wondering whether, in the original solution, lines 486 to 500 of snowflake.py should be run inside the `with ... as query` scope or outside of it. If inside, I'm confused about why they need to be run inside that scope?

@sfc-gh-madkins (Collaborator, Author) commented Dec 31, 2022 via email

@amithadiraju1694 (Contributor) commented Jan 4, 2023

> Are you sure you have the Snowflake Spark connector installed? Do you see the query being issued on the Snowflake side? I was able to test this with success locally. Temporary tables are scoped to a specific Snowflake session.

I was running my code on Databricks, so the Snowflake connector for Spark must already be installed ... Since we're setting `"autopushdown"` to `"on"`, the query should be executing on the Snowflake side, though I'm not able to figure out why it runs forever.

Does `self.to_snowflake(table_name=table_name)` have to be called from within the `with self._query_generator() as query` scope? I removed that bit in my previous attempt, as I assumed the `query` variable was not being used anywhere in the code (we're creating our own query, right?).
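
A minimal standalone sketch of the session-scoping behavior mentioned above, using snowflake-connector-python with placeholder connection parameters: a TEMPORARY table lives only in the session that created it, so a second connection (such as the one the Spark connector opens) cannot see it.

```python
import snowflake.connector

params = dict(
    user="<user>",
    password="<password>",
    account="<account>",
    database="<database>",
    schema="<schema>",
)

conn_a = snowflake.connector.connect(**params)
conn_b = snowflake.connector.connect(**params)

# Temporary tables are scoped to the creating session (conn_a).
conn_a.cursor().execute('CREATE TEMPORARY TABLE "feast_tmp" (id INT)')
conn_a.cursor().execute('SELECT * FROM "feast_tmp"')  # same session: works

try:
    # A different session (e.g. the one the Spark connector opens)
    # cannot see conn_a's temporary table.
    conn_b.cursor().execute('SELECT * FROM "feast_tmp"')
except snowflake.connector.errors.ProgrammingError as err:
    print(err)  # "... does not exist or not authorized"
```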

@amithadiraju1694 (Contributor) commented Jan 4, 2023

> Are you sure you have the Snowflake Spark connector installed? Do you see the query being issued on the Snowflake side? I was able to test this with success locally. Temporary tables are scoped to a specific Snowflake session.

I debugged and found that the program halts at `self.to_snowflake(table_name)`; the temporary table with the given name isn't created at all in the given database.schema for some reason. I'm not sure if this is because of access issues; I'm using dev-instance credentials and should have the required access.

@sfc-gh-madkins (Collaborator, Author) commented Jan 4, 2023 via email

@amithadiraju1694 (Contributor)

> Is there an error on the Snowflake side?

In the `to_snowflake` method, the `temporary` argument was set to `False` by default; changing it to `True` solved the unresponsiveness of the query. But now I see a SQL compilation error: my `'DB.SCHEMA."table_name"'` is not found or not authorized. I faced a similar issue before, for which I made a quick fix, but even the quick fix isn't working now (my schema is not PUBLIC and contains an underscore in its name).
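
A minimal sketch of the identifier-quoting pitfall that can produce this error (the database and schema names are placeholders): each part of a fully qualified Snowflake name must be quoted separately.

```python
# Hypothetical table name for illustration.
table_name = "feast_spark_abc123"

# Works: three separately quoted identifiers.
good = f'SELECT * FROM "MY_DB"."MY_SCHEMA"."{table_name}"'

# Fails with "not found or not authorized": the whole dotted path
# is treated as one single identifier.
bad = f'SELECT * FROM "MY_DB.MY_SCHEMA.{table_name}"'
```

Note also that with `temporary=True` the table is visible only to the Snowflake session that created it, so a query issued through the Spark connector's separate session would likewise report it as not found (see the session-scoping sketch above).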

@sfc-gh-madkins (Collaborator, Author) commented Jan 5, 2023 via email

@sfc-gh-madkins (Collaborator, Author)

@amithadiraju1694 is this new PR going to break your existing code?

@sfc-gh-madkins (Collaborator, Author)

@adchia this might cause a breaking change for a single user, but he has been unresponsive

auto-merge was automatically disabled April 21, 2023 20:26

Pull request was closed

Successfully merging this pull request may close these issues: Support for Snowflake connector with Spark (#3364).

3 participants