pandas.CSVDataSet with remote filepaths cannot be pickled #271

Open
astrojuanlu opened this issue Jul 14, 2023 · 0 comments
Labels
bug (Something isn't working), Hacktoberfest, help wanted (Contribution task, outside help would be appreciated!)

Comments

@astrojuanlu
Member

Description

A pandas.CSVDataSet created with a remote filepath (for example an HTTP(S) URL) cannot be pickled.

cc @jmnunezd

Context

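ParallelRunner runs nodes in separate processes, so the datasets it works with need to be picklable. Because pickling fails, pandas.CSVDataSet with remote filepaths cannot be used with ParallelRunner.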
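For anyone who wants to check a dataset up front before handing it to a multiprocessing-based runner, here is a minimal sketch. `is_picklable` is a hypothetical helper, not part of Kedro or kedro-datasets; it only wraps the same `ForkingPickler.dumps` call used in the reproduction in this issue, and the plain dict and lambda stand in for real dataset objects.

```python
import pickle
from multiprocessing.reduction import ForkingPickler


def is_picklable(obj) -> bool:
    """Return True if ForkingPickler can serialise obj, False otherwise.

    Hypothetical helper for illustration only; not a Kedro API.
    """
    try:
        ForkingPickler.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # These are the exception types pickle commonly raises for
        # objects it cannot serialise.
        return False
    return True


# Stand-ins for real objects: a plain dict pickles fine, a lambda does not.
print(is_picklable({"filepath": "/tmp/data.csv"}))
print(is_picklable(lambda x: x))
```

Calling `is_picklable(ds)` on a dataset instance would presumably report the failure without a traceback, but I have not run it against kedro-datasets itself.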

Steps to Reproduce

>>> from kedro_datasets.pandas import CSVDataSet
>>> ds = CSVDataSet("https://google.com/data.csv")
>>> from multiprocessing.reduction import ForkingPickler
>>> ForkingPickler.dumps(ds)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/juan_cano/.local/share/rtx/installs/python/3.10.11/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function HTTPFileSystem._exists at 0x124f5f1c0>: it's not the same object as fsspec.implementations.http.HTTPFileSystem._exists
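For context on the error message itself: pickle serialises functions by reference (module plus qualified name) and refuses when looking that name up returns a different object from the one being pickled. The sketch below reproduces the same class of error with the standard library only. It is not a diagnosis of what fsspec or kedro-datasets do internally; the names `exists` and `original_exists` are made up for the illustration.

```python
import pickle


def exists(path):
    return True


# Keep a handle on the first function object.
original_exists = exists


def exists(path):  # rebinding the name: the module attribute now points elsewhere
    return False


try:
    # A container holding the old function object, standing in for an
    # object that stores a reference to a function that was later replaced.
    pickle.dumps({"exists": original_exists})
except pickle.PicklingError as err:
    message = str(err)
else:
    message = None

print(message)
```

The printed message should have the same shape as the one in the traceback above, which suggests that the `HTTPFileSystem._exists` object reachable from the dataset is not the one currently found under that name on the class.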

Expected Result

Datasets with remote filepaths behave in the same way as datasets with local filepaths:

>>> ds_ok = CSVDataSet("/tmp/data.csv")
>>> ForkingPickler.dumps(ds_ok)
<memory at 0x124dd13c0>

This could be considered a feature request rather than a bug, but it was surprising that the kind of filepath (local or remote) determines whether the dataset can be pickled.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.11
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): kedro-datasets 1.4.2
  • Python version used (python -V): 3.10.11
  • Operating system and version: macOS Ventura
@astrojuanlu added the bug label on Jul 14, 2023
@merelcht added the help wanted label on Jan 12, 2024