Error when writing to a CloudPath file using Pandas #242

Closed
jlondonobo opened this issue Jul 26, 2022 · 2 comments

Comments

@jlondonobo

Context
I'm getting an error when I try to write files to the cloud using pandas and cloudpathlib.

Code

import pandas as pd
from cloudpathlib import CloudPath

output = CloudPath("s3://bucket/folder/object.csv")

df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": ["a", "b", "c"]
})
df.to_csv(output / "data.csv")

Error

  File "/Users/joselondono/Documents/Others/cloudpathlib/test_write_dataframe.py", line 10, in <module>
    df.to_csv(output.raw / "data.csv")
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/core/generic.py", line 3551, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1180, in to_csv
    csv_formatter.save()
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 241, in save
    with get_handle(
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/common.py", line 663, in get_handle
    check_parent_directory(str(handle))
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/common.py", line 537, in check_parent_directory
    raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: '/var/folders/hf/cnzhkc851mqcmg0f8nc41z6h0000gn/T/tmp67yk4w9b/human-datalake/projects/raw/folder'

My current workaround is to call as_uri() on the path before writing, but this syntax is not very pathlib-like:

import pandas as pd
from cloudpathlib import CloudPath

output = CloudPath("s3://bucket/folder/object.csv")

df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": ["a", "b", "c"]
})
df.to_csv((output / "data.csv").as_uri())

Problem
Prior to writing a file, pandas tries to turn PathLike objects to strings using their __fspath__() method (see stringify_path function). For CloudPath objects, the __fspath__() method is set to return the path of the local cache instead of the cloud URI (this was introduced with 008663f, see discussion in #72).

This behavior makes pandas try to write files as if they were local, instead of uploading them to the cloud. This can lead to two outcomes:

  1. If the folder containing the item does not exist, pandas will raise an error since the local version of the parent path won't exist.
  2. If the folder containing the item does exist, the file will be written to the local cache but won't get uploaded to the cloud.

Both of these outcomes are undesirable for the use case presented above.
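
The first outcome can be reproduced without any cloud access. The sketch below uses a hypothetical stand-in class, FakeCloudPath (not cloudpathlib's real implementation), whose __fspath__() returns a path inside a temporary cache directory, as described above. The class name, the cache layout, and the bucket name are all assumptions made for illustration.

```python
import os
import tempfile


class FakeCloudPath:
    """Hypothetical stand-in for a cloud path; not cloudpathlib's real class."""

    def __init__(self, uri: str, cache_dir: str) -> None:
        self.uri = uri
        self.cache_dir = cache_dir

    def __str__(self) -> str:
        # The string form keeps the cloud URI, including the s3:// prefix.
        return self.uri

    def __fspath__(self) -> str:
        # Assumed cache layout: <cache_dir>/<bucket>/<key>.
        key = self.uri.split("://", 1)[1]
        return os.path.join(self.cache_dir, key)


cache_dir = tempfile.mkdtemp()
path = FakeCloudPath("s3://bucket/folder/data.csv", cache_dir)

# os.fspath() calls __fspath__(), which is the protocol PathLike consumers rely on.
local_target = os.fspath(path)

# The parent folder was never created in the local cache, so a writer that
# checks for it first would refuse to write.
parent_exists = os.path.isdir(os.path.dirname(local_target))

print(local_target)
print(parent_exists)
```

Any library that resolves the path through __fspath__() only ever sees the local cache path, so it has no way to know the target was meant to be a cloud object.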

Discussion

  • Returning the URI instead of the local file was considered during the initial implementation of the __fspath__() method, but was later dropped on the basis of PEP 519.
  • Some considerations have been discussed in a s3Path issue.
  • It would be very useful for pandas users to be able to write objects as if working with pathlib. Do you think it would be possible to find a solution to this?
@jayqi
Member

jayqi commented Jul 26, 2022

Hi @jlondonob, this looks to be a specific case of the general problem captured by

Unfortunately, as described in #128, it's a known limitation that cloud paths do not behave as expected when called with open in any kind of write mode (which is a direct result of the implementation of __fspath__, as you have also figured out). This is likely not sufficiently well-documented and can use more visibility. We've also struggled to figure out the best solution here as recorded in #128.

Thank you for some discussion in your report, and you're welcome to continue contributing to the discussion in #128. I'm going to close this issue for now in order to centralize the discussion there.

I also want to note that the workaround that you found of calling as_uri is something that works to let one write a CSV to S3, but it does so because it actually leverages s3fs/fsspec, which is an entirely different and independent framework that pandas supports for interacting with cloud storage services. By using this workaround, you do use cloudpathlib's ability to specify and manipulate paths with a pathlib-like interface, but the actual I/O happens with a different library.

@remi-braun

remi-braun commented Jan 19, 2023

Hello,

Just to flag it here, the same issue occurs with rasterio (and, I think, with other libraries based on fsspec), especially with VRT data (a file that links to other files).
Another workaround that acts like as_uri is simply to convert the path to a string, so that the s3:// prefix reaches the open method.
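
A minimal sketch of that string-conversion workaround, under stated assumptions: io_target is a hypothetical helper name (it is not part of cloudpathlib, pandas, or rasterio), and FakeCloudPath is a stand-in class whose str() returns the cloud URI while __fspath__() returns a made-up local cache path.

```python
import os
from pathlib import PurePosixPath


class FakeCloudPath:
    """Hypothetical stand-in: str() gives the URI, __fspath__() a cache path."""

    def __init__(self, uri: str) -> None:
        self.uri = uri

    def __str__(self) -> str:
        return self.uri

    def __fspath__(self) -> str:
        # Made-up cache location, for illustration only.
        return "/tmp/cache/" + self.uri.split("://", 1)[1]


def io_target(path) -> str:
    """Return the string to hand to an I/O function (hypothetical helper)."""
    text = str(path)
    if "://" in text:
        # Keep the scheme prefix so a URL-aware backend can handle the write.
        return text
    # Ordinary local paths go through the normal PathLike protocol.
    return os.fspath(path)


cloud = io_target(FakeCloudPath("s3://bucket/folder/data.csv"))
local = io_target(PurePosixPath("/data/folder/data.csv"))
print(cloud)
print(local)
```

As with as_uri, the actual I/O is then done by whichever backend the receiving library uses for s3:// URLs, not by cloudpathlib.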
