Versioned does not work for spark.SparkDataSet #1801

Closed
Spectren opened this issue Aug 22, 2022 · 5 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed


@Spectren

Spectren commented Aug 22, 2022

Description

Versioning does not work for spark.SparkDataSet. The save succeeds, but immediately afterwards Kedro raises an error saying that the version does not exist, even though it does and can be read manually. I'm new to Kedro, so I might be doing something wrong, but according to the documentation this setup should be correct.

Context

I wanted to save the processed dataset as a new version.

Steps to Reproduce

  1. Add a node that prepares the PySpark dataset and returns it as a spark.SparkDataSet
  2. For the returned dataset, specify a filepath such as /data/base/result
  3. Run the node and get an error

Expected Result

The code will continue to work after saving the dataset version

Actual Result

VersionNotFoundError: Did not find any versions for
SparkDataSet(file_format=parquet,
filepath=/data/inc/.../result, load_args={},
save_args={'mode': overwrite}, version=Version(load=None,
save='2022-08-22T18.30.55.332Z'))
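
For readers following along: the error is raised by the step that looks up existing versions, not by the save itself. The following is a minimal sketch, not Kedro's implementation, of the on-disk layout that versioned datasets are understood to use (`<filepath>/<version>/<filename>`) and of a latest-version lookup done with a local glob. The helper names `versioned_path` and `latest_version` are hypothetical, and the temporary directory only stands in for the real data folder.

```python
import glob
import tempfile
from pathlib import Path, PurePosixPath


def versioned_path(filepath: str, version: str) -> str:
    """Build <filepath>/<version>/<filename> (assumed layout)."""
    base = PurePosixPath(filepath)
    return str(base / version / base.name)


def latest_version(filepath: str) -> str:
    """Return the newest version directory name found by a local glob."""
    base = Path(filepath)
    matches = glob.glob(str(base / "*" / base.name))
    if not matches:
        # An empty glob result is what surfaces as "did not find any versions".
        raise LookupError(f"Did not find any versions for {filepath}")
    # Fixed-width timestamp names sort chronologically as plain strings.
    return max(Path(m).parent.name for m in matches)


# Demo against a throwaway directory standing in for the data folder.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "result"
    for version in ("2022-08-22T18.30.55.332Z", "2022-08-21T09.00.00.000Z"):
        (target / version / "result").mkdir(parents=True)
    newest = latest_version(str(target))

example_path = versioned_path("/data/base/result", "2022-08-22T18.30.55.332Z")
```

If the lookup runs against a location where the glob cannot see the freshly written files, it fails in exactly this way even though the data exists.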

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.18.2
  • Python version used (python -V): 3.7.9
  • Operating system and version: Windows 10 Home

@Spectren changed the title from "VersionNotFoundError when using versioned spark.SparkDataSet" to "Versioned does not work for spark.SparkDataSet" on Aug 24, 2022

@alamastor

I'm also having this issue, in my case when saving to S3. I think it's due to the way SparkDataSet sets its glob_function: for s3:// paths it is left as None, so the dataset globs the local filesystem for the versioned files. I suspect it should use get_protocol_and_path, like pandas.ParquetDataSet does.
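
To illustrate the suspected cause, here is a small standard-library sketch (not Kedro code) showing that a local glob on an `s3://` pattern returns nothing, and how a glob function could be picked by protocol instead. `split_protocol`, `choose_glob`, `fake_s3_glob` and the bucket name are hypothetical stand-ins; a real fix would rely on a filesystem-aware glob rather than the fake one used here.

```python
import glob
from urllib.parse import urlsplit


def split_protocol(filepath: str) -> tuple:
    """Split 's3://bucket/key' into ('s3', 'bucket/key'); plain paths are 'file'."""
    parts = urlsplit(filepath)
    if parts.scheme and "://" in filepath:
        return parts.scheme, parts.netloc + parts.path
    return "file", filepath


def choose_glob(filepath: str, remote_glob):
    """Pick a glob function by protocol instead of always using the local one."""
    protocol, _ = split_protocol(filepath)
    return glob.glob if protocol == "file" else remote_glob


# A local glob on an s3:// pattern finds nothing, whatever the bucket contains.
pattern = "s3://hypothetical-bucket/data/result/*/result"
local_matches = glob.glob(pattern)


def fake_s3_glob(remote_pattern):
    # Stand-in for a filesystem-aware glob (for example one backed by an S3 client).
    return ["hypothetical-bucket/data/result/2022-08-22T18.30.55.332Z/result"]


picked = choose_glob("s3://hypothetical-bucket/data/result", fake_s3_glob)
```

With the local glob, the version lookup sees an empty list and reports that no versions exist; with a protocol-aware choice, the lookup is sent to the remote filesystem.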

@merelcht merelcht added the Issue: Bug Report 🐞 Bug that needs to be fixed label Sep 30, 2022
@merelcht
Member

Thanks for reporting this! We'll take this into our sprint work, but we'd also be happy to accept a PR for this 🙂

@ankatiyar ankatiyar self-assigned this Oct 10, 2022
@ankatiyar
Contributor

Hi @Spectren, I've tried this out, and a versioned SparkDataSet seems to work fine for saving and loading data locally. You might want to check https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html to make sure PySpark is set up correctly.

As for S3, @alamastor, a versioned SparkDataSet also seems to work. This might be related to permission issues with your AWS credentials (see related issue #1768). Kedro shows a VersionNotFoundError when your credentials lack permission to read, write, or list the relevant objects, even if the dataset version exists in the store. We've updated the error message (#1881).
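
One way this symptom can arise, sketched under assumptions and not as a description of Kedro's internals: if a version lookup treats a denied listing the same as an empty one, both cases look like "no versions found". All function names below are hypothetical, and `denied` and `empty` are stand-ins for a real object store.

```python
def find_versions(list_fn, pattern):
    """Return matches; a denied listing is treated as 'nothing found'."""
    try:
        return list_fn(pattern)
    except PermissionError:
        return []


def find_versions_strict(list_fn, pattern):
    """Same lookup, but report a denied listing separately from an empty one."""
    try:
        return list_fn(pattern)
    except PermissionError as exc:
        raise RuntimeError(f"Could not list {pattern}: check credentials") from exc


def denied(pattern):
    # Hypothetical stand-in for a store that rejects the list call.
    raise PermissionError("access denied")


def empty(pattern):
    # Hypothetical stand-in for a store with no versions saved yet.
    return []


pattern = "bucket/data/result/*/result"
# Both stores produce the same empty result with the lenient lookup.
same_symptom = find_versions(denied, pattern) == find_versions(empty, pattern)
```

The strict variant keeps the two cases apart, which is the kind of distinction a clearer error message makes visible to the user.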

Closing this issue but feel free to re-open if this is not resolved. :)

@datajoely datajoely reopened this Jan 12, 2023
@datajoely
Contributor

@jmholzer confirmed this is still an issue on Azure Databricks.

@noklam
Contributor

noklam commented Apr 3, 2023

Closing this in favor of kedro-org/kedro-plugins#117, #2323 and kedro-org/kedro-plugins#114

I am quite confident this should work now; we've added a warning and improved the documentation on using it correctly with Databricks.

Since this issue mixes several different problems (e.g. permission issues with S3, incorrect paths on DBFS), feel free to open a new issue if you still run into problems.

@noklam noklam closed this as completed Apr 3, 2023