  File "/Users/joselondono/Documents/Others/cloudpathlib/test_write_dataframe.py", line 10, in <module>
    df.to_csv(output.raw / "data.csv")
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/core/generic.py", line 3551, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1180, in to_csv
    csv_formatter.save()
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 241, in save
    with get_handle(
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/common.py", line 663, in get_handle
    check_parent_directory(str(handle))
  File "/Users/joselondono/opt/miniconda3/envs/cloudpath/lib/python3.10/site-packages/pandas/io/common.py", line 537, in check_parent_directory
    raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: '/var/folders/hf/cnzhkc851mqcmg0f8nc41z6h0000gn/T/tmp67yk4w9b/human-datalake/projects/raw/folder'
My current solution is to call the as_uri() method when writing objects, but this syntax is not very pathlib-like.
Problem
Prior to writing a file, pandas tries to convert PathLike objects to strings using their __fspath__() method (see the stringify_path function). For CloudPath objects, the __fspath__() method is set to return the path of the local cache instead of the cloud URI (this was introduced with 008663f, see discussion in #72).
This behavior makes pandas try to write files as if they were local, instead of uploading them to the cloud. This can lead to two outcomes:
If the folder containing the item does not exist, pandas will raise an error since the local version of the parent path won't exist.
If the folder containing the item does exist, the file will be written to the local cache but won't get uploaded to the cloud.
Both of these outcomes are undesirable for the use case presented above.
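The mechanism described above can be reproduced without any cloud access. The following is a minimal sketch using only the standard library: FakeCloudPath and parent_exists are hypothetical stand-ins (not cloudpathlib or pandas code) that only imitate the behaviour described in this issue, namely an __fspath__() that returns a local cache path and a writer that checks the parent directory of that path.

```python
import os
import tempfile
from pathlib import Path


class FakeCloudPath:
    """Hypothetical stand-in for a cloud path; not cloudpathlib's real class."""

    def __init__(self, uri, cache_root):
        self.uri = uri
        self.cache_root = Path(cache_root)

    def __str__(self):
        # The string form keeps the cloud URI.
        return self.uri

    def __fspath__(self):
        # Assumption: mimic the reported behaviour by returning a path
        # inside the local cache rather than the cloud URI.
        bucket_and_key = self.uri.split("://", 1)[1]
        return str(self.cache_root / bucket_and_key)


def parent_exists(path_like):
    """Roughly the check a writer performs after calling os.fspath()."""
    return Path(os.fspath(path_like)).parent.is_dir()


cache = tempfile.mkdtemp()  # empty temporary directory acting as the cache
target = FakeCloudPath("s3://bucket/raw/folder/data.csv", cache)

local_view = os.fspath(target)  # what a PathLike-aware writer would see
print(local_view)               # a path under the temporary cache
print(parent_exists(target))    # False: the cached parent folder was never created
```

If the cached parent folder did exist, a writer would succeed, but only the local cache file would change; nothing in this flow uploads to the cloud.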
Discussion
Returning the URI instead of the local file was considered during the initial implementation of the __fspath__() method, but was later dropped on the basis of PEP 519.
Some considerations have been discussed in a s3Path issue.
It would be very useful for pandas users to be able to write objects as if working with pathlib. Do you think it would be possible to find a solution to this?
Unfortunately, as described in #128, it's a known limitation that cloud paths do not behave as expected when passed to open in any write mode (a direct result of the implementation of __fspath__, as you have also figured out). This is likely not sufficiently well documented and could use more visibility. We've also struggled to figure out the best solution here, as recorded in #128.
Thank you for some discussion in your report, and you're welcome to continue contributing to the discussion in #128. I'm going to close this issue for now in order to centralize the discussion there.
I also want to note that the workaround that you found of calling as_uri is something that works to let one write a CSV to S3, but it does so because it actually leverages s3fs/fsspec, which is an entirely different and independent framework that pandas supports for interacting with cloud storage services. By using this workaround, you do use cloudpathlib's ability to specify and manipulate paths with a pathlib-like interface, but the actual I/O happens with a different library.
Just to flag it here, there is the same issue with rasterio (and other libs based on fsspec, I think), especially with VRT data (a file that links to other files).
Another workaround that acts like as_uri is to convert the path to a string, so that the s3:// prefix is what gets passed to the open method.
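Both workarounds mentioned in this thread (as_uri() and converting to a string) come down to handing the writer the URI rather than the result of __fspath__(). Here is a small sketch of that idea; FakeCloudPath and writer_target are hypothetical names introduced for illustration only, and the "://" test is a simplifying assumption about what counts as a cloud path.

```python
import os


class FakeCloudPath:
    """Hypothetical stand-in; only mimics the behaviours discussed here."""

    def __init__(self, uri, cache_path):
        self._uri = uri
        self._cache_path = cache_path

    def as_uri(self):
        return self._uri

    def __str__(self):
        return self._uri

    def __fspath__(self):
        # Assumption: returns the local cache location, as reported above.
        return self._cache_path


def writer_target(path):
    """Pick the string to hand to a writer.

    Cloud-like paths (string form contains a URI scheme) are passed as
    their URI; anything else goes through os.fspath() as usual.
    """
    text = str(path)
    if "://" in text:
        return text
    return os.fspath(path)
```

With real objects this would presumably be used as df.to_csv(writer_target(output.raw / "data.csv")). As noted earlier in the thread, the actual I/O would then be handled by s3fs/fsspec rather than by cloudpathlib.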
Context
I'm getting an error when I try to write files to the cloud using pandas and cloudpathlib.
Code
Error