get_historical_features() Does Not Have an Option To Return a Distributed DataFrame Like a Spark DF #2504

Closed
adamluba1 opened this issue Apr 6, 2022 · 5 comments · Fixed by #3358
Labels
Community Contribution Needed (We want community to contribute) · kind/feature (New feature or request) · priority/p2

Comments

@adamluba1

When calling get_historical_features().to_df() on a large training dataset in Databricks, I am hitting out-of-memory errors. Since to_df() returns the data as a pandas DataFrame, it cannot use the full capacity of the Databricks cluster and distribute the data across the different nodes the way a Spark DataFrame would.
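For reference, a minimal sketch of the pattern being described (the repo path, entity dataframe, and feature reference are hypothetical placeholders):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # hypothetical repo path

# Hypothetical entity dataframe with join keys and event timestamps
entity_df = pd.read_parquet("entities.parquet")

job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:avg_trips"],  # hypothetical feature reference
)

# to_df() collects the entire result into a single pandas DataFrame on the
# driver node, which is where the out-of-memory errors occur at scale.
training_df = job.to_df()
```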

@adchia
Collaborator

adchia commented Apr 12, 2022

@adchia adchia closed this as completed Apr 12, 2022
@adamluba1
Author

@adchia But this is only available if you are using a Spark offline store, correct? For example, you cannot query a Snowflake table and return it as a Spark DF, correct?

@adchia
Collaborator

adchia commented May 1, 2022

Ah, correct. Right now it only allows returning the result in Arrow batches. Is the intention to be able to materialize through a Spark DF, or to do training, e.g. with MLlib?

In either case, it might make sense for the SparkOfflineStore (this naming doesn't really make too much sense...) to be able to query data warehouses (but maybe run the join within the DWH).
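For context, a minimal sketch of consuming the result as Arrow batches (to_arrow() is the RetrievalJob method; note it still materializes the full table on the retrieving machine, and the chunk size is illustrative):

```python
# Retrieve the result as a pyarrow.Table and process it in record batches
# rather than converting everything to one pandas DataFrame at once.
table = job.to_arrow()  # job from get_historical_features(), as above

for batch in table.to_batches(max_chunksize=100_000):
    partial_df = batch.to_pandas()  # process each chunk incrementally
```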

@adchia adchia reopened this May 1, 2022
@adchia adchia added the kind/feature and priority/p2 labels May 1, 2022
@adamluba1
Author

@adchia The intention is just to return Snowflake data offline as a training dataset, with the result being a distributed Spark DF. Is it possible to use Snowflake as the source with the SparkOfflineStore? I found a way to hack around it by taking the to_sql() string and issuing it as a Spark query using Snowflake's Spark connector, but I'm wondering if there is a cleaner way.
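A rough sketch of that workaround, assuming a Databricks-provided spark session and placeholder Snowflake connection options (to_sql() is the retrieval job's method for getting the generated query):

```python
# Take the SQL Feast generates for the point-in-time join and push it
# down to Snowflake via the Snowflake Spark connector, getting back a
# distributed Spark DataFrame instead of pandas.
sql = job.to_sql()  # job from store.get_historical_features()

sf_options = {  # placeholder connection options
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

spark_df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("query", sql)
    .load()
)
```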

@adchia
Collaborator

adchia commented May 5, 2022

That works as a start, and I don't think it's that hacky. I'd be happy to check that in.

An alternative would be to flush the output of the Snowflake query to S3 and read that, but that also seems pretty hacky.
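For comparison, a rough sketch of that S3 alternative (the export path, storage integration, and open Snowflake cursor are all hypothetical):

```python
# Unload the query result from Snowflake to S3 as Parquet, then read it
# back as a distributed Spark DataFrame.
sql = job.to_sql()  # the Feast-generated query, as in the sketch above
export_path = "s3://my-bucket/feast-export/"  # hypothetical bucket

copy_sql = f"""
COPY INTO '{export_path}'
FROM ({sql})
FILE_FORMAT = (TYPE = PARQUET)
STORAGE_INTEGRATION = my_s3_integration
"""

cursor.execute(copy_sql)  # an open snowflake.connector cursor (assumed)
spark_df = spark.read.parquet(export_path)
```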

@kevjumba kevjumba added the Community Contribution Needed label Aug 3, 2022
@kevjumba kevjumba moved this to Todo in Feast Roadmap Aug 3, 2022
amithadiraju1694 added a commit to amithadiraju1694/feast that referenced this issue Nov 22, 2022
1. Added a feature to offline_store -> snowflake.py to return the results of a Snowflake query as a PySpark DataFrame. This helps Spark-based users distribute data that often doesn't fit in the driver node through pandas output.

2. Also added a relevant error class to notify the user of a missing Spark session, particular to this use case.

Signed-off-by: amithadiraju1694 <[email protected]>
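Based on the commit description, usage would look roughly like this (the method name to_spark_df() and its spark-session argument are assumptions inferred from the linked PR, not confirmed here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# job is the Snowflake retrieval job from store.get_historical_features().
# to_spark_df() (name assumed from the PR) returns the query result as a
# distributed PySpark DataFrame; a missing or invalid session raises the
# new error class mentioned above.
spark_df = job.to_spark_df(spark)
```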
Repository owner moved this from Todo to Done in Feast Roadmap Nov 23, 2022
feast-ci-bot pushed a commit that referenced this issue Nov 23, 2022
kevjumba pushed a commit that referenced this issue Dec 5, 2022
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))