-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FileIO, InputFile, and OutputFile abstract base classes #3691
Conversation
Updated this PR to only include the abstract base classes FileIO, InputFile, and OutputFile. The S3FileIO implementation can be opened in a follow-up PR. |
As an example of what an implementation of these base classes would look like, I put together an S3FileIO implementation using smart_open to create seekable file-like objects ( Implementation,
|
Thanks for pointing out the entry_point mechanism that fsspec uses. I have to take a closer look at it but I really like the idea of the user simply plugging in a custom implementation while we maintain "known implementations" in the main library.
I'm wondering if the FileIO implementations need to be storage-specific. For example, pyarrow, boto, and smartopen all could be used as an implementation for various cloud storage solutions. Instead of having something like |
Agreed, I was pointing out that for common connection types additonal dependencies can be avoided for a large number of systems if pyarrow is assumed as dependency. |
I think we will definitely have a mode where pyarrow is used. Certainly if you're reading or writing data, you'd probably want pyarrow. But it isn't unreasonable to have a service that only does metadata interaction and that doesn't need all of pyarrow. It could be done entirely with Python libraries like fastavro, smartopen, and the Iceberg core library. |
Yeah, wasn't sure if pyarrow was going to be considered optional or not. |
"""Checks whether the file exists""" | ||
|
||
@abstractmethod | ||
def open(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there not a return type that we require here? What about IOBase
? @emkornfield do you have a suggestion for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think IOBase is probably heavy weight. Using protocols seems like the right thing here)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to require https://pypi.org/project/typing-extensions/ for python <= 3.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea! @rdblue does this protocol capture the required methods?
class InputStream(Protocol):
def read(self, n: int) -> bytes:
...
def readable(self) -> bool:
...
def close(self) -> None:
...
def seek(offset: int, whence: int) -> None:
...
def tell(self) -> int:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this some more I think IOBase is a right option here. It would guarantee interop with python standard library. Also if you look at it's design I think it lends credence to my comment below about potentially having one file type which can be inspected to determine if it is readable and writeable.
python/src/iceberg/io/base.py
Outdated
"""Get an OutputFile instance to write bytes to the file at the given location""" | ||
|
||
@abstractmethod | ||
def delete(self, location: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: location could also be an InputFile
or an OutputFile
, in which case this would delete that location. I'm not sure if there's an easy way to express that in type annotations, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the typehint to include InputFile
and OutputFile
(using typing.Union
) and also updated the LocalFileIO.delete
method defined in the tests to handle these.
relevant commit: 51ea645
"""Returns an InputFile for the location of this output file""" | ||
|
||
@abstractmethod | ||
def create(self, overwrite: bool = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, is there an IO type that this should return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not, we could create one that has the __enter__
and __exit__
methods that make with
automatically close the file?
python/tests/io/test_base.py
Outdated
"""An InputFile implementation for local files (for test use only)""" | ||
|
||
def __init__(self, location: str): | ||
if not location.startswith("file://"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be file:///
or file:/
. The first has an authority section (after //
) but it is empty. The second variation leaves out authority and just has a path. Either way, the path is a full path starting from /
. Also, one case that is not allowed is a URI like file://one/two/three/a.parquet
because the authority is one
and no authority is allowed for local FS URIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to using urllib.parse.urlparse
which is commonly used in other packages. This returns a ParseResult
where you can check ParseResult.scheme
, ParseResult.path
, etc. and I set that to a property called parsed_location
.
In addition to checking that the scheme is file
, I also added a check that there's no ParseResult.netloc
for a LocalInputFile
or LocalOutputFile
, which is the authority section.
relevant commit: fcb7dc4
python/tests/io/test_base.py
Outdated
super().__init__(location=location.split("file://")[1]) | ||
|
||
def __len__(self): | ||
return len(self._file_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be os.path.getsize(self.location)
? I don't see any other reference to self._file_obj
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right thanks! I updated it to use the parsed uri added in another commit so it's os.path.getsize(self.parsed_location.path)
now. I also added validation of len
in the tests.
relevant commit: 7b625cf
python/tests/io/test_base.py
Outdated
|
||
def create(self, overwrite: bool = False) -> None: | ||
if not overwrite and self.exists(): | ||
raise FileExistsError(f"{self.location} already exists") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking for existence directly, I think this should use mode wbx
when not overwriting, which will fail if the file already exists. That ensures that the check is atomic. With the check here, there is a race condition between two writers that are in this method. Both check that the file doesn't exist and succeed, but then both try to create the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, looks like 'xb' is the right mode description so I updated this and it looks much cleaner!:
def create(self, overwrite: bool = False) -> None:
return open(self.parsed_location.path, "wb" if overwrite else "xb")
relevant commit: 25836bf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you get the correct FileExistsError
from open
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! I validate that here in one of the tests.
python/tests/io/test_base.py
Outdated
output_file_location = os.path.join(tmpdirname, "foo.txt") | ||
|
||
# Instantiate an output file | ||
output_file = CustomOutputFile(location=f"file://{output_file_location}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a test that location
is the original location that was passed in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some tests to validate the location and also validate that a ValueError
is raised when an authority is provided in the uri (for the LocalInputFile
and LocalOutputFile
implementations in the test file).
The test for validating the location is parameterized so it's easy to add to. For example in the future we could simply add (S3FileIO, "s3://foo/bar/baz.parquet", "s3", "", "/bar/baz.parquet")
to the list.
relevant commit: cc490a5
…tFile and OutputFile
…putFile initialization
…size and add tests
Looks great. Thanks for updating this, @samredai! I'm going to go ahead and merge this. I think there's still an open question about what type to return from the |
"""Get an InputFile instance to read bytes from the file at the given location""" | ||
|
||
@abstractmethod | ||
def new_output(self, location: str) -> OutputFile: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to distinguish between input and output files? it seems like for the most part the APIs are very similar? It seems if a file is going to be only readable or writeable having the implementatin throw not-implemented might be a better choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gives the flexibility to the implementation. You can always implement both base classes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this might be more a philosophical question. There are two ways to achieve the flexibility:
- Provide a single class and have users only implement the methods they want (you can document a set of methods that should always be implemented together). Giving users run-time errors when not implemented.
- Separate the functionality into two different interfaces and require all methods be implemented.
My sense is that #2 is more of a java design pattern. I think (but I'm no expert) option #1 is more pythonic/dynamically typed language pattern.
|
||
@abstractmethod | ||
def create(self, overwrite: bool = False): | ||
"""This method should return a file-like object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file-like object is a confusing given that these classes are also called File. Is it supposed supposed to only support write() methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should add that I understand it because I am familiar with the python idiom of "file-like" and maybe most users of these class will be since, because after all this is python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I specify the required methods which I believe are write
, close
, flush
, and tell
? Maybe also mention that close
should flush
. Or better yet just add a protocol here too with those methods and specify that protocol in the docstring for create
?
"""This method should return an object that matches the OutputStream protocol
...
"""
OutputStream protocol
from typing import Protocol
class OutputStream(Protocol):
def write(self, b: bytes) -> None:
...
def close(self) -> None:
...
def flush(self) -> None:
...
def tell(self) -> int:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think documenting via type returned here makes sense here if we want to go with protocol.
""" | ||
|
||
|
||
class FileIO(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string
|
||
@abstractmethod | ||
def open(self): | ||
"""This method should return an instance of an seekable input stream.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, be consistent ending doc-strings with periods or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also specify what should happen if the file doesn't exist.
"""Checks whether the file exists""" | ||
|
||
@abstractmethod | ||
def open(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to require https://pypi.org/project/typing-extensions/ for python <= 3.7
"""Get an InputFile instance to read bytes from the file at the given location""" | ||
|
||
@abstractmethod | ||
def new_output(self, location: str) -> OutputFile: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this might be more a philosophical question. There are two ways to achieve the flexibility:
- Provide a single class and have users only implement the methods they want (you can document a set of methods that should always be implemented together). Giving users run-time errors when not implemented.
- Separate the functionality into two different interfaces and require all methods be implemented.
My sense is that #2 is more of a java design pattern. I think (but I'm no expert) option #1 is more pythonic/dynamically typed language pattern.
|
||
@abstractmethod | ||
def create(self, overwrite: bool = False): | ||
"""This method should return a file-like object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should add that I understand it because I am familiar with the python idiom of "file-like" and maybe most users of these class will be since, because after all this is python.
In some of the original design discussions, the idea was that we would stick with the NEP 29 deprecation policy so we wouldn't be supporting python versions <= 3.7, @jun-he does that sound right? |
UPDATE: This has been updated to only include the abstract base classes
FileIO
,InputFile
, andOutputFile
. TheS3FileIO
implementation can be opened in a follow-up PR.This brings over the
FileIO
abstraction and includes anS3FileIO
implementation. ImplementingFileIO
requires overriding the__enter__()
and__exit__()
methods where the__enter__()
method sets a byte stream toself.byte_stream
.There's been a few discussions lately around file io and hopefully, this PR helps continue that. I think we should aim to maintain the file io abstraction for all file io operations (metadata files, manifest lists, manifest files, and data files) and allow the flexibility to plug in either an implementation that's packaged with the library or a custom implementation of FileIO that a user brings. An example of how we can do this can be found in PR #3677 in the from_file() method.
This still leaves an open question on how we manage dependencies for all of the implementations. For example, if a user does not plan on using
S3FileIO
or has their own s3 file io implementation that does not depend on boto3, it should not be forced as a hard dependency.