-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust file system storage class and storage put functions #73
base: main
Are you sure you want to change the base?
Conversation
@@ -13,7 +13,7 @@ | |||
from wicker.core.persistance import BasicPersistor | |||
from wicker.core.storage import S3PathFactory | |||
from wicker.schema.schema import DatasetSchema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wicker.testing.storage.LocalStorageClass
class is used to define mock S3 functions in order to support unit tests. It certainly wasn't named very well! Let's give it a very clear name like TestS3LocalDataStorage
instead (and I am open to other names). Since this is an internal test class, it is ok to just change its name without worrying about backwards compatibility.
@@ -47,6 +47,43 @@ def test_fetch_file(self) -> None: | |||
test_string = open_dst_file.readline() | |||
assert test_string == expected_string | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are adding two new put_file
and put_object
functions to the FileSystemDataStorage
class for local filesystems, let's add unit tests for them.
storage: S3DataStorage = S3DataStorage(), | ||
s3_path_factory: S3PathFactory = S3PathFactory(), | ||
storage: AbstractDataStorage = S3DataStorage(), | ||
s3_path_factory: WickerPathFactory = S3PathFactory(), # S3-specific naming kept for backwards compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above ^^^^ it is too bad that in the ColumnBytesFileWriter
class __init__
function that the argument s3_path_factory
was given an S3-specific name.
I can't think of a good way to make a backwards-compatible change (to a name like path_factory
, so I just left it as s3_path_factory
). Let me know if you have another (backwards-compatible) suggestion!
@@ -44,6 +45,28 @@ def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int) - | |||
""" | |||
pass | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below is the main refactoring in this PR, i.e. add put_file
and put_object
to the AbstractDataStorage
interface.
@@ -52,36 +75,41 @@ def __init__(self) -> None: | |||
"""Constructor for a file system storage instance.""" | |||
pass | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below is the main functionality change to the FileSystemDataStorage
class for local file systems. We don't want to add any automatic retry / backoff / delay stuff for local filesystems since for GCP this is directly built into GCSFuse itself.
@@ -238,32 +274,11 @@ class WickerPathFactory: | |||
Our bucket should look like:: | |||
|
|||
<dataset-root> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just adding one tab to the right to make this comment below easier to read!
@@ -448,32 +484,11 @@ class S3PathFactory(WickerPathFactory): | |||
Our bucket should look like:: | |||
|
|||
s3://<dataset-root> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just adding one tab to the right to make this comment below easier to read!
def put_file_s3(self, local_path: str, s3_path: str) -> None: | ||
full_tmp_path = self._get_local_path(s3_path) | ||
def put_file(self, local_path: str, target_path: str) -> None: | ||
full_tmp_path = self._get_local_path(target_path) | ||
os.makedirs(os.path.dirname(full_tmp_path), exist_ok=True) | ||
shutil.copy2(local_path, full_tmp_path) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment at the very top of this PR about why we are changing the name of the LocalDataStorage
class used for mocking S3 storage to a more descriptive name like TestS3LocalDataStorage
.
@@ -83,18 +83,12 @@ def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = | |||
return target_path | |||
|
|||
# Override. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fetch_partial_file_s3
function below is not used anywhere and there are no other references to anything like this, so removing it.
except Exception as e: | ||
logging.error(f"Failed to download/move object for file path: {input_path}") | ||
raise e | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, fetch_file
used to call this download_with_retries
function above. Instead, let's remove the download_with_retries
function that has the unnecessary @retry
decorator and just change fetch_file
to call the shutil.copyfile(input_path, local_path)
function directly instead.
""" | ||
os.makedirs(Path(target_path).parent, exist_ok=True) | ||
with open(target_path, "wb") as binary_file: | ||
binary_file.write(object_bytes) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below in the S3DataStorage
class, we'll make a backwards compatible change so that put_file_s3
and put_object_s3
functions are changed to call put_file
and put_object
instead (which do the exactly the same thing they did before).
@zhenyu let me get a review for this PR that does two refactoring changes:
Actually, both of these refactoring changes are optional - I don't absolutely need them. But, I am thinking that they would be helpful to you and anyone else using Wicker. But it would be ok with me if you didn't want to do one or both of them. I also made a change to the |
@@ -31,7 +31,7 @@ def test_write_empty(self) -> None: | |||
s3_path_factory=path_factory, | |||
): | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below let's update the unit tests to use the non-S3 specific function names that have been added to the AbstractDataStorage
class in this PR.
@@ -37,7 +37,7 @@ def test_fetch_file(self) -> None: | |||
# create local file store | |||
local_datastore = FileSystemDataStorage() | |||
# save file to destination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below was actually a bug in the behavior of the test - now fixed.
@@ -44,6 +45,28 @@ def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int) - | |||
""" | |||
pass | |||
|
|||
@abstractmethod | |||
def put_file(self, input_path: str, target_path: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
persist_file?
pass | ||
|
||
@abstractmethod | ||
def put_object(self, object_bytes: bytes, target_path: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
persist_content?
Currently, it is only possible to write Wicker datasets that have column bytes files to S3. Let's make a very light refactoring to make it just a little bit easier to test writing Wicker datasets (with column bytes files) to local filesystems by changing just a couple of the functions to be non-S3 specific.
put_file
andput_object
functions to the storage interface and update theS3DataStorage
class so that the S3-specificput_file_s3
andput_object_s3
functions just callput_file
andput_object
.In addition, let's remove one overly complex things about the current
FileSystemDataStorage
class that works with local filesystems and that might damage local filesystem performance when used together with GCSFuse or mountpoint-S3.