Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datasets): Take mode argument into account when saving dataset #805

Merged
merged 26 commits into from
Aug 22, 2024

Conversation

merelcht
Copy link
Member

@merelcht merelcht commented Aug 7, 2024

Description

Closes #336 and #513

Development notes

First attempt:
Moved the hard-coded mode argument to the DEFAULT_SAVE_ARGS on top. A caveat here is that the mode must be a "bytes" mode, so it should be wb, ab etc, instead of just w, a...

I could either add some code to check if the mode provided ends in b and add it otherwise, or I can update the docstring to mention this.

This didn't work in the end, because other datasets use save args and fs args separately.

Second attempt:
Moved the hard-coded mode argument to the DEFAULT_FS_ARGS on top. Using fs_args to propagate the defaults and read user provided values. This should work for more datasets, I've tried with pandas.CSVDataset and pandas.JSONDataset but want to wait with changing this in more instances until we have consensus on the approach.

It feels like I'm reverting part of https://github.com/McK-Private/private-kedro/pull/1118 so would like to hear what others think.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@merelcht merelcht self-assigned this Aug 7, 2024
@merelcht merelcht requested review from noklam and lrcouto August 7, 2024 14:50
@noklam
Copy link
Contributor

noklam commented Aug 7, 2024

Is there an append+binary mode in Python? The hardcoded wb is definitely a problem but I think the real problem is we use buffer but this does not seem to support append.

If this work we don't need a new dataset in our docs here: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators

@merelcht
Copy link
Member Author

merelcht commented Aug 7, 2024

Is there an append+binary mode in Python? The hardcoded wb is definitely a problem but I think the real problem is we use buffer but this does not seem to support append.

If this work we don't need a new dataset in our docs here: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators

"ab" seems to work

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @merelcht, the approach looks good to me.

I would go with the docstring update, so all allowed mode options are reflected in the documentation. From our user research interview, that's the first place users look. I don't think we should replace w with wb and hide working with binary file objects.

@noklam
Copy link
Contributor

noklam commented Aug 16, 2024

Few comments:

See this snippet below, because we are writing to buffer (not sure what's the benefit). The mode argument are not being used at all.

with open("tmp_wb", mode="wb") as f:
    buffer = BytesIO()
    df.to_csv(buffer, index=False, mode="w") # "w" can be anything here and doesn't get used base on some testing.
    f.write(buffer.getvalue())

So the problem is, this create some bad UX because very few people know how to use fsspec specific arguments. To use pandas.CSVDataset with different mode, they will need to define dataset as follow:

my_dataset:
  type: pandas.CSVDataset
  load_args:
    fs_args:
      open_load_args:
        mode: wb

We usually have parity between kedro's datasets versus the underlying mapping, in this case the mode argument does not do anything. i.e. csv_data.load(load_args = load_args) should be equivalent to pd.read_csv(load_args).
This is super awkward to write and we probably don't have docs about this. I'll much prefer if we can have this instead:

my_dataset:
  type: pandas.CSVDataset
  
  load_args:
    mode: wb # Maybe we can manually map this into fsspec args, it solves the append problem but not 'w'.
#    fs_args:
#      open_load_args:
#        mode: wb

@merelcht
Copy link
Member Author

@noklam I totally agree with you that it's bad UX! I'm not even that sure what the purpose is for users of fs_args.. Inside the dataset implementation before my changes, you can see that save_args are sent to the dataset specific API, but for the fs part the mode is hardcoded.

data.to_json(path_or_buf=buf, **self._save_args)

with self._fs.open(save_path, mode="wb") as fs_file:

The problem I ran into is that the same mode can't be used as save_args and as fs_args and there's another issue when storage_options are initialised. I have tried to find the reason for the bytes conversion, but all I can find is that it was introduced in https://github.com/McK-Private/private-kedro/pull/1118. Maybe the true fix is getting rid of all the byte conversion stuff so save/load args can be passed on to the fs operations as well.

@merelcht
Copy link
Member Author

@noklam
Copy link
Contributor

noklam commented Aug 16, 2024

I totally agree with you that it's bad UX! I'm not even that sure what the purpose is for users of fs_args.. Inside the dataset implementation before my changes, you can see that save_args are sent to the dataset specific API, but for the fs part the mode is hardcoded.

There are some use cases, but you have to dig in fsspec docs itself. For example, using fs_args for PartitionedDataset can allow users to pass in a regex to search for a specific pattern.

See: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.find
and https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-4.1.0/api/kedro_datasets.partitions.PartitionedDataset.html#:~:text=load_args%20(Optional%5Bdict%5Bstr%2C%20Any%5D%5D)%20%E2%80%93%20Keyword%20arguments%20to%20be%20passed%20into%20find()%20method%20of%20the%20filesystem%20implementation.

@merelcht
Copy link
Member Author

merelcht commented Aug 20, 2024

Upon further investigation and trying another solution, I've actually come to the conclusion that in order to make this the most consistent across various datasets, the best solution is to move the hard-coded mode argument to DEFAULT_FS_ARGS on top and use fs_args to propagate the defaults and read user provided values.

While in pandas.CSVDataset and pandas.JSONDataset it seemed that fs_args and save_args can be used interchangeably, this is not the case for all datasets. See for example, pandas.FeatherDataset, mode is not an accepted value for save_args (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html) passed to dataframe.to_feather(...) and so it's required that any mode arguments come through fs_args.

I still think @noklam has a great point in UX not being very good here, because this solution requires a catalog entry like:

my_dataset:
  type: pandas.CSVDataset
  fs_args:
      open_save_args:
        mode: wb

To be fair, this is "expected" and explained in our docs: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding
But for the time being I suggest we tackle that UX separately.

Where possible I will update datasets to not use byte conversion, so at least it's not necessary to pass "wb"/"ab" etc.. to mode.

Signed-off-by: Merel Theisen <[email protected]>
@merelcht merelcht marked this pull request as ready for review August 21, 2024 08:23
@astrojuanlu
Copy link
Member

https://github.com/McK-Private/private-kedro/pull/1118 try to delegate to pandas as much as possible, which I think is a good idea.

I'm glad you dug this up @noklam because it's exactly the same I'm proposing for Polars... #625 but looks like the bottleneck is our versioning.

It's not a discussion for this PR but

At this point we should start by documenting what the "social contract" of Kedro datasets is.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good idea to move those default up and making it explicit. As I understand this is mostly a refactoring change, the default are not changed for any datasets.

This is good enough to address the original issue, the follow up question I have is, is our doc clear enough about this? fsspec is only used for versioned dataset, and it is an implementation detail. Most user wouldn't know immediately but at least it is now a bit more visible in docs.

See: https://kedro--805.org.readthedocs.build/projects/kedro-datasets/en/805/api/kedro_datasets.pandas.CSVDataset.html

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Aug 22, 2024

While in pandas.CSVDataset and pandas.JSONDataset it seemed that fs_args and save_args can be used interchangeably, this is not the case for all datasets.

I 100% agree with the point about the UX. Should we add some clarifications and examples in the docs until we address the UX problem? For users, it can be not very clear what happens if:

my_dataset:
  type: pandas.CSVDataset
  fs_args:
      open_save_args:
      mode: wb
  save_args:
    mode: a

Also, it might not be clear what the expected way to append (or apply some other setting than default) is - set just fs_args/save_args or both.

Edit: we can add some examples/clarifications here: https://docs.kedro.org/en/stable/data/data_catalog.html#load-and-save-arguments or here: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding

@merelcht
Copy link
Member Author

@ElenaKhaustova and @noklam, thanks for the reviews! I agree that some doc updates are needed as well. I will create a PR on the kedro repo to do this.

@merelcht merelcht enabled auto-merge (squash) August 22, 2024 13:06
@merelcht merelcht merged commit c67fa9e into main Aug 22, 2024
14 checks passed
@merelcht merelcht deleted the fix/handle-mode-arg-correctly branch August 22, 2024 13:16
merelcht added a commit to galenseilis/kedro-plugins that referenced this pull request Aug 27, 2024
…edro-org#805)

* Take mode argument into account when saving dataset
* Move mode to default save args
* Revert changes to add mode to save args and add as fs arg default instead
* Separate fs save and load args again
* Add tests for coverage
* Use fs_args to pass mode for all pandas based datasets
* Make other datasets use fs_args for handling mode as well
* Refactor and make all datasets consistent
* Update release notes

---------

Signed-off-by: Merel Theisen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mode: "a" in pandas.CSVDataset still overwrites the file
4 participants