get_historical_features() Does Not Have an Option To Return a Distributed DataFrame Like a Spark DF #2504

Closed
adamluba1 opened this issue Apr 6, 2022 · 5 comments · Fixed by #3358
Labels
Community Contribution Needed (We want community to contribute) · kind/feature (New feature or request) · priority/p2

Comments

@adamluba1

When calling get_historical_features().to_df() on a large training dataset in Databricks, I am hitting out-of-memory errors. Since to_df() returns the data as a pandas DataFrame, it cannot use the full capacity of the Databricks cluster and distribute the data across the different nodes the way a Spark DataFrame would.
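For reference, a minimal sketch of the pattern being described (the repo path, entity dataframe, and feature reference are hypothetical placeholders):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # hypothetical repo path

# Hypothetical entity dataframe with join keys and event timestamps
entity_df = pd.read_parquet("entities.parquet")

job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:avg_trips"],  # hypothetical feature reference
)

# to_df() collects the entire result into a single pandas DataFrame on the
# driver node, which is where the out-of-memory errors occur at scale.
training_df = job.to_df()
```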

@adchia
Collaborator

adchia commented Apr 12, 2022

@adchia adchia closed this as completed Apr 12, 2022
@adamluba1
Author

@adchia But this is only available if you are using a Spark offline store, correct? For example, you cannot query a Snowflake table and return it as a Spark DF, correct?

@adchia
Collaborator

adchia commented May 1, 2022

Ah, correct. Right now it only allows returning the result in Arrow batches. Is the intention to be able to materialize through a Spark DF, or to do training, e.g. with MLlib?

In either case, it might make sense for the SparkOfflineStore (this naming doesn't really make too much sense...) to be able to query data warehouses (but maybe run the join within the DWH).
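For context, a minimal sketch of consuming the result as Arrow batches (to_arrow() is the RetrievalJob method; note it still materializes the full table on the retrieving machine, and the chunk size is illustrative):

```python
# Retrieve the result as a pyarrow.Table and process it in record batches
# rather than converting everything to one pandas DataFrame at once.
table = job.to_arrow()  # job from get_historical_features(), as above

for batch in table.to_batches(max_chunksize=100_000):
    partial_df = batch.to_pandas()  # process each chunk incrementally
```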

@adchia adchia reopened this May 1, 2022
@adchia adchia added the kind/feature and priority/p2 labels May 1, 2022
@adamluba1
Author

@adchia The intention is just to return Snowflake data offline as a training dataset, with the result being a distributed Spark DF. Is it possible to use Snowflake as the source with the SparkOfflineStore? I found a way to hack around it by taking the to_sql() string and issuing it as a Spark query using Snowflake's Spark connector, but I'm wondering if there is a cleaner way.
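A rough sketch of that workaround, assuming a Databricks-provided spark session and placeholder Snowflake connection options (to_sql() is the retrieval job's method for getting the generated query):

```python
# Take the SQL Feast generates for the point-in-time join and push it
# down to Snowflake via the Snowflake Spark connector, getting back a
# distributed Spark DataFrame instead of pandas.
sql = job.to_sql()  # job from store.get_historical_features()

sf_options = {  # placeholder connection options
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

spark_df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("query", sql)
    .load()
)
```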

@adchia
Collaborator

adchia commented May 5, 2022

That works as a start, and I don't think it's that hacky. I'd be happy to check that in.

An alternative would be to flush the output of the Snowflake query to S3 and read that, but that also seems pretty hacky.
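For comparison, a rough sketch of that S3 alternative (the export path, storage integration, and open Snowflake cursor are all hypothetical):

```python
# Unload the query result from Snowflake to S3 as Parquet, then read it
# back as a distributed Spark DataFrame.
sql = job.to_sql()  # the Feast-generated query, as in the sketch above
export_path = "s3://my-bucket/feast-export/"  # hypothetical bucket

copy_sql = f"""
COPY INTO '{export_path}'
FROM ({sql})
FILE_FORMAT = (TYPE = PARQUET)
STORAGE_INTEGRATION = my_s3_integration
"""

cursor.execute(copy_sql)  # an open snowflake.connector cursor (assumed)
spark_df = spark.read.parquet(export_path)
```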

@kevjumba kevjumba added the Community Contribution Needed label Aug 3, 2022
@kevjumba kevjumba moved this to Todo in Feast Roadmap Aug 3, 2022
amithadiraju1694 added a commit to amithadiraju1694/feast that referenced this issue Nov 22, 2022
1. Added a feature to offline_store -> snowflake.py to return the results of a Snowflake query as a PySpark DataFrame. This helps Spark-based users distribute data that often doesn't fit in the driver node through pandas output.

2. Also added a relevant error class to notify the user of a missing Spark session, particular to this use case.

Signed-off-by: amithadiraju1694 <[email protected]>
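Based on the commit description, usage would look roughly like this (the method name to_spark_df() and its spark-session argument are assumptions inferred from the linked PR, not confirmed here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# job is the Snowflake retrieval job from store.get_historical_features().
# to_spark_df() (name assumed from the PR) returns the query result as a
# distributed PySpark DataFrame; a missing or invalid session raises the
# new error class mentioned above.
spark_df = job.to_spark_df(spark)
```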
Repository owner moved this from Todo to Done in Feast Roadmap Nov 23, 2022
feast-ci-bot pushed a commit that referenced this issue Nov 23, 2022
kevjumba pushed a commit that referenced this issue Dec 5, 2022
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))