Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust file system storage class and storage put functions #73

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

convexquad
Copy link
Collaborator

@convexquad convexquad commented Oct 5, 2024

Currently, it is only possible to write Wicker datasets that have column bytes files to S3. Let's make a very light refactoring to make it just a little bit easier to test writing Wicker datasets (with column bytes files) to local filesystems by changing just a couple of the functions to be non-S3 specific.

  • In the refactoring, we will add put_file and put_object functions to the storage interface and update the S3DataStorage class so that the S3-specific put_file_s3 and put_object_s3 functions just call put_file and put_object.
  • Let's also rename an internal class for testing (so that it is more clear that it is just for tests).

In addition, let's remove one overly complex things about the current FileSystemDataStorage class that works with local filesystems and that might damage local filesystem performance when used together with GCSFuse or mountpoint-S3.

@convexquad convexquad self-assigned this Oct 5, 2024
@@ -13,7 +13,7 @@
from wicker.core.persistance import BasicPersistor
from wicker.core.storage import S3PathFactory
from wicker.schema.schema import DatasetSchema
Copy link
Collaborator Author

@convexquad convexquad Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wicker.testing.storage.LocalStorageClass class is used to define mock S3 functions in order to support unit tests. It certainly wasn't named very well! Let's give it a very clear name like TestS3LocalDataStorage instead (and I am open to other names). Since this is an internal test class, it is ok to just change its name without worrying about backwards compatibility.

@@ -47,6 +47,43 @@ def test_fetch_file(self) -> None:
test_string = open_dst_file.readline()
assert test_string == expected_string

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are adding two new put_file and put_object functions to the FileSystemDataStorage class for local filesystems, let's add unit tests for them.

storage: S3DataStorage = S3DataStorage(),
s3_path_factory: S3PathFactory = S3PathFactory(),
storage: AbstractDataStorage = S3DataStorage(),
s3_path_factory: WickerPathFactory = S3PathFactory(), # S3-specific naming kept for backwards compatibility.
Copy link
Collaborator Author

@convexquad convexquad Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above ^^^^ it is too bad that in the ColumnBytesFileWriter class __init__ function that the argument s3_path_factory was given an S3-specific name.

I can't think of a good way to make a backwards-compatible change (to a name like path_factory, so I just left it as s3_path_factory). Let me know if you have another (backwards-compatible) suggestion!

@@ -44,6 +45,28 @@ def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int) -
"""
pass

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below is the main refactoring in this PR, i.e. add put_file and put_object to the AbstractDataStorage interface.

@@ -52,36 +75,41 @@ def __init__(self) -> None:
"""Constructor for a file system storage instance."""
pass

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below is the main functionality change to the FileSystemDataStorage class for local file systems. We don't want to add any automatic retry / backoff / delay stuff for local filesystems since for GCP this is directly built into GCSFuse itself.

@@ -238,32 +274,11 @@ class WickerPathFactory:
Our bucket should look like::

<dataset-root>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding one tab to the right to make this comment below easier to read!

@@ -448,32 +484,11 @@ class S3PathFactory(WickerPathFactory):
Our bucket should look like::

s3://<dataset-root>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding one tab to the right to make this comment below easier to read!

def put_file_s3(self, local_path: str, s3_path: str) -> None:
full_tmp_path = self._get_local_path(s3_path)
def put_file(self, local_path: str, target_path: str) -> None:
full_tmp_path = self._get_local_path(target_path)
os.makedirs(os.path.dirname(full_tmp_path), exist_ok=True)
shutil.copy2(local_path, full_tmp_path)


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment at the very top of this PR about why we are changing the name of the LocalDataStorage class used for mocking S3 storage to a more descriptive name like TestS3LocalDataStorage.

@@ -83,18 +83,12 @@ def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int =
return target_path

# Override.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fetch_partial_file_s3 function below is not used anywhere and there are no other references to anything like this, so removing it.

except Exception as e:
logging.error(f"Failed to download/move object for file path: {input_path}")
raise e

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, fetch_file used to call this download_with_retries function above. Instead, let's remove the download_with_retries function that has the unnecessary @retry decorator and just change fetch_file to call the shutil.copyfile(input_path, local_path) function directly instead.

"""
os.makedirs(Path(target_path).parent, exist_ok=True)
with open(target_path, "wb") as binary_file:
binary_file.write(object_bytes)


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below in the S3DataStorage class, we'll make a backwards compatible change so that put_file_s3 and put_object_s3 functions are changed to call put_file and put_object instead (which do the exactly the same thing they did before).

@convexquad
Copy link
Collaborator Author

@zhenyu let me get a review for this PR that does two refactoring changes:

  1. Change confusing LocalDataStorage test class name.
  2. Add put_file and put_object functions to AbstractDataStorage interface and add these functions to S3DataStorage class (and have the S3-specific versions of these functions call the new functions).

Actually, both of these refactoring changes are optional - I don't absolutely need them. But, I am thinking that they would be helpful to you and anyone else using Wicker. But it would be ok with me if you didn't want to do one or both of them.

I also made a change to the FileSystemDataStorage class to remove the download_with_retries function and this is actually the one important change I want to make in order to remove the @retry annotation that we do not actually need for GCSFuse or mountpoint-S3.

@convexquad convexquad added the enhancement New feature or request label Oct 8, 2024
@convexquad convexquad requested a review from zhenyu October 8, 2024 00:48
@@ -31,7 +31,7 @@ def test_write_empty(self) -> None:
s3_path_factory=path_factory,
):
pass
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below let's update the unit tests to use the non-S3 specific function names that have been added to the AbstractDataStorage class in this PR.

@@ -37,7 +37,7 @@ def test_fetch_file(self) -> None:
# create local file store
local_datastore = FileSystemDataStorage()
# save file to destination
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below was actually a bug in the behavior of the test - now fixed.

@@ -44,6 +45,28 @@ def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int) -
"""
pass

@abstractmethod
def put_file(self, input_path: str, target_path: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persist_file?

pass

@abstractmethod
def put_object(self, object_bytes: bytes, target_path: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persist_content?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants