Get_historical_features() Does Not Have Option To Return Distributed Dataframe Like A Spark DF #2504
I believe this is actually possible. See https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py#L252
@adchia But this is only available if you are using a Spark offline store, correct? For example, you cannot query a Snowflake table and return it as a Spark DF, correct?
Ah, correct. Right now it only allows returning it in Arrow batches. Is the intention to be able to materialize through a Spark DF, or to do training, e.g. with MLlib? In either case, it might make sense for the SparkOfflineStore (this naming doesn't really make too much sense...) to be able to query from data warehouses (but maybe run the join within the DWH).
@adchia The intention is just to return Snowflake data offline as a training dataset, with the results being a distributed Spark DF. Is it possible to use Snowflake as the source with the SparkOfflineStore? I found a way to hack around it by manipulating the to_sql() string and issuing that as a Spark query using Snowflake's Spark connector, but I'm wondering if there is a cleaner way.
That works as a start, and I don't think it's that hacky. Would be happy to check that in. An alternative would be to flush the output of the Snowflake query to S3 and read that, but that also seems pretty hacky.
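A minimal sketch of the workaround described above, assuming a Spark cluster (e.g. Databricks) with the Snowflake Spark connector available: take the SQL generated by the retrieval job via to_sql() and hand it to the connector so the result comes back as a distributed Spark DataFrame. The feature reference and connection options are illustrative placeholders, not values confirmed in this thread.

```python
import pandas as pd
from feast import FeatureStore
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
fs = FeatureStore(repo_path=".")

# Small illustrative entity dataframe; real usage would have many more rows.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2022-05-01", "2022-05-02"]),
    }
)

job = fs.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:avg_daily_trips"],  # hypothetical feature reference
)

# Take the SQL that Feast would run against Snowflake instead of
# materializing a pandas DataFrame on the driver.
snowflake_sql = job.to_sql()

# Let the Snowflake Spark connector execute that SQL so the result stays
# distributed across the cluster. Connection options are placeholders.
spark_df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .option("sfURL", "<account>.snowflakecomputing.com")
    .option("sfUser", "<user>")
    .option("sfPassword", "<password>")
    .option("sfDatabase", "<database>")
    .option("sfSchema", "<schema>")
    .option("sfWarehouse", "<warehouse>")
    .option("query", snowflake_sql)
    .load()
)
```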
1. Added feature to offline_store -> snowflake.py to return the results of a Snowflake query as a PySpark DataFrame. This helps Spark-based users distribute data that often doesn't fit in driver nodes through pandas output. 2. Also added a relevant error class to notify the user of a missing Spark session, particular to this use case. Signed-off-by: amithadiraju1694 <[email protected]>
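A hedged sketch of how the change described in the commit above might be used. The method name to_spark_df and its signature are assumptions based on the commit description ("return results of snowflake query as pyspark data frame") and may differ from the merged API.

```python
import pandas as pd
from feast import FeatureStore
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
fs = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2022-05-01", "2022-05-02"]),
    }
)

job = fs.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:avg_daily_trips"],  # hypothetical feature reference
)

# Assumed method added by this change: hand an existing SparkSession to the
# Snowflake retrieval job so the query results come back as a distributed
# Spark DataFrame rather than being collected into pandas on the driver.
training_df = job.to_spark_df(spark)
training_df.show(5)
```

Presumably, if no Spark session is available, the new error class mentioned in the commit is raised instead of failing obscurely.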
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))
When doing get_historical_features().to_df() on a large training dataset in Databricks, I am hitting out-of-memory errors. Since to_df() returns the data as a pandas DataFrame, it cannot use the full capacity of the Databricks cluster and distribute the data across the different nodes the way a Spark DataFrame would.
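For context, a minimal illustration of the current behavior described above, with a placeholder feature reference: to_df() collects the full result into a single pandas DataFrame on the driver, which is what runs out of memory on large training datasets.

```python
import pandas as pd
from feast import FeatureStore

fs = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2022-05-01", "2022-05-02"]),
    }
)

# The entire result set is materialized as pandas on the driver node;
# nothing is distributed across the Databricks cluster.
training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:avg_daily_trips"],  # hypothetical feature reference
).to_df()
```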