
Add FileIO, InputFile, and OutputFile abstract base classes #3691

Merged: 11 commits into apache:master on Jan 24, 2022
Conversation

@samredai (Collaborator) commented Dec 8, 2021

UPDATE: This has been updated to only include the abstract base classes FileIO, InputFile, and OutputFile. The S3FileIO implementation can be opened in a follow-up PR.


This brings over the FileIO abstraction and includes an S3FileIO implementation. Implementing FileIO requires overriding the __enter__() and __exit__() methods, where __enter__() assigns a byte stream to self.byte_stream.

There have been a few discussions lately around file IO, and hopefully this PR helps continue them. I think we should aim to maintain the FileIO abstraction for all file IO operations (metadata files, manifest lists, manifest files, and data files) and allow the flexibility to plug in either an implementation that's packaged with the library or a custom FileIO implementation that a user brings. An example of how we can do this can be found in the from_file() method in PR #3677.

This still leaves an open question about how we manage dependencies for all of the implementations. For example, if a user does not plan on using S3FileIO, or has their own S3 FileIO implementation that does not depend on boto3, then boto3 should not be forced as a hard dependency.
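
For reference, here is a minimal sketch of what the three abstract base classes might look like (the docstrings follow the excerpts quoted in the review threads below; the exact signatures in this PR may differ):

from abc import ABC, abstractmethod


class InputFile(ABC):
    """A base class for reading bytes from a file at a given location"""

    def __init__(self, location: str):
        self._location = location

    @property
    def location(self) -> str:
        """The fully-qualified location of the input file"""
        return self._location

    @abstractmethod
    def exists(self) -> bool:
        """Checks whether the file exists"""

    @abstractmethod
    def open(self):
        """Returns a seekable input stream for reading the file"""


class OutputFile(ABC):
    """A base class for writing bytes to a file at a given location"""

    def __init__(self, location: str):
        self._location = location

    @property
    def location(self) -> str:
        """The fully-qualified location of the output file"""
        return self._location

    @abstractmethod
    def exists(self) -> bool:
        """Checks whether the file exists"""

    @abstractmethod
    def to_input_file(self) -> InputFile:
        """Returns an InputFile for the location of this output file"""

    @abstractmethod
    def create(self, overwrite: bool = False):
        """Returns an output stream; should fail if the file exists and overwrite is False"""


class FileIO(ABC):
    """A base class for reading, writing, and deleting files at a given location"""

    @abstractmethod
    def new_input(self, location: str) -> InputFile:
        """Get an InputFile instance to read bytes from the file at the given location"""

    @abstractmethod
    def new_output(self, location: str) -> OutputFile:
        """Get an OutputFile instance to write bytes to the file at the given location"""

    @abstractmethod
    def delete(self, location: str) -> None:
        """Delete the file at the given location"""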

@samredai changed the title from "Add FileIO abstraction and S3FileIO implementation" to "Add FileIO, InputFile, and OutputFile abstract base classes" on Dec 13, 2021
@samredai (Collaborator, Author):

Updated this PR to only include the abstract base classes FileIO, InputFile, and OutputFile. The S3FileIO implementation can be opened in a follow-up PR.

@samredai marked this pull request as ready for review on December 14, 2021 03:07
@samredai (Collaborator, Author) commented Jan 4, 2022

As an example of what an implementation of these base classes would look like, I put together an S3FileIO implementation using smart_open to create seekable file-like objects (smart_open.s3.Reader instances) and validated that these can be fed directly into pyarrow. I also validated that smart_open.s3.MultipartWriter instances work as the where argument to pyarrow's write_table method.

Implementation, s3.py

from iceberg.io.base import FileIO, InputFile, OutputFile
from smart_open import open, parse_uri
import boto3

class S3InputFile(InputFile):

    def __len__(self) -> int:
        return 0
    
    @property
    def exists(self) -> bool:
        try:
            with open(self.location, 'rb') as f:
                pass
        except OSError:
            return False
        return True
    
    def __enter__(self):
        self._stream = open(self.location, 'rb', transport_params={"defer_seek": True})
        return self._stream
    
    def __exit__(self, exc_type, exc_value, exc_traceback):
        self._stream.close()
        return


class S3OutputFile(OutputFile):

    def __init__(self, location: str, overwrite: bool = False):
        super().__init__(location=location)
        self._overwrite = overwrite

    def __len__(self) -> int:
        return 0
    
    @property
    def location(self) -> str:
        """The fully-qualified location of the output file"""
        return self._location

    @property
    def exists(self) -> bool:
        try:
            with open(self.location, 'rb') as f:
                pass
        except OSError:
            return False
        return True
    
    def to_input_file(self) -> S3InputFile:
        return S3InputFile(self.location)

    def __enter__(self):
        if not self._overwrite and self.exists:
            raise FileExistsError(
                f"{self.location} already exists. To overwrite, "
                "set overwrite=True when initializing the S3OutputFile."
            )
            
        self._stream = open(self.location, 'wb')
        return self._stream
    
    def __exit__(self, exc_type, exc_value, exc_traceback):
        self._stream.close()
        return

class S3FileIO(FileIO):
    def new_input(self, location: str):
        return S3InputFile(location=location)

    def new_output(self, location: str, overwrite: bool = False):
        return S3OutputFile(location=location, overwrite=overwrite)

    def delete(self, location: str):
        uri = parse_uri(location)
        s3 = boto3.resource('s3')
        s3.Object(uri.bucket_id, uri.key_id).delete()
        return

example.py

from pyarrow import parquet as pq
from s3 import S3FileIO

f1 = "s3://samstestbucket3412/userdata1.parquet"
f2 = "s3://samstestbucket3412/userdata2.parquet"
file_io = S3FileIO()

# Read f1 (a parquet file)
with file_io.new_input(f1) as f:
    table = pq.read_table(f)

# see output below
print("####################\n")
print(type(table), end="\n\n")
print(table, end="\n\n")
print("####################\n")

# Delete f2 if it exists
file_io.delete(f2)

# Write the pyarrow table to f2
with file_io.new_output(f2) as f:
    pq.write_table(table, f)

# Read the newly written table back in
with file_io.new_input(f2) as f:
    table2 = pq.read_table(f)

# see output below
print("####################\n")
print(type(table), end="\n\n")
print(table, end="\n\n")
print("####################\n")


# Try to write to f2 again without overwrite=True
# with file_io.new_output(f2) as f:
#     pq.write_table(table, f)  # Raises a FileExistsError

# Writing f2 again with overwrite=True
with file_io.new_output(f2, overwrite=True) as f:
    pq.write_table(table2, f)

# Delete f2
file_io.delete(f2)

# Write f2 without setting overwrite since it's been deleted
with file_io.new_output(f2) as f:
    pq.write_table(table2, f)

output:

####################

<class 'pyarrow.lib.Table'>

pyarrow.Table
registration_dttm: timestamp[ns]
id: int32
first_name: string
last_name: string
email: string
gender: string
ip_address: string
cc: string
country: string
birthdate: string
salary: double
title: string
comments: string

####################

####################

<class 'pyarrow.lib.Table'>

pyarrow.Table
registration_dttm: timestamp[ns]
id: int32
first_name: string
last_name: string
email: string
gender: string
ip_address: string
cc: string
country: string
birthdate: string
salary: double
title: string
comments: string

####################

@samredai (Collaborator, Author):

> I think, as a generic answer, looking at what fsspec has done, having these as separate packages that the user can install in their environment probably makes sense.

Thanks for pointing out the entry_point mechanism that fsspec uses. I have to take a closer look at it but I really like the idea of the user simply plugging in a custom implementation while we maintain "known implementations" in the main library.

> Specifically for S3, if pyarrow is a hard dependency for Parquet reading, providing reference implementations based off of its file systems (it comes prepackaged with S3) could make sense.

I'm wondering if the FileIO implementations need to be storage-specific. For example, pyarrow, boto, and smartopen could all be used as implementations for various cloud storage solutions. Instead of having something like a PyarrowS3FileIO to differentiate it from, say, a BotoS3FileIO, we could instead have a PyarrowFileIO that can be an entry point to any of the storage IO options provided by pyarrow. I don't think this has any implications for this PR in particular, so I'll work on updating it ASAP with the suggestions and we can tackle these other questions in follow-up discussions.
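
As a rough sketch of the fsspec-style entry_point idea (not part of this PR; the group name "iceberg.file_io" below is purely hypothetical), a third-party package could register its FileIO class and the library could discover it at runtime:

from importlib.metadata import entry_points

def load_file_io(name: str):
    """Load a registered FileIO implementation by name (hypothetical entry point group)."""
    # Python 3.10+ supports the group keyword; older versions return a dict keyed by group.
    for ep in entry_points(group="iceberg.file_io"):
        if ep.name == name:
            return ep.load()()  # import and instantiate the registered FileIO class
    raise ValueError(f"No FileIO implementation registered under {name!r}")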

@emkornfield (Contributor):

> I'm wondering if the FileIO implementations need to be storage-specific. For example, pyarrow, boto, and smartopen could all be used as implementations for various cloud storage solutions. Instead of having something like a PyarrowS3FileIO to differentiate it from, say, a BotoS3FileIO, we could instead have a PyarrowFileIO that can be an entry point to any of the storage IO options provided by pyarrow. I don't think this has any implications for this PR in particular, so I'll work on updating it ASAP with the suggestions and we can tackle these other questions in follow-up discussions.

Agreed. I was pointing out that for common connection types, additional dependencies can be avoided for a large number of systems if pyarrow is assumed as a dependency.

@rdblue (Contributor) commented Jan 20, 2022

I think we will definitely have a mode where pyarrow is used. Certainly if you're reading or writing data, you'd probably want pyarrow. But it isn't unreasonable to have a service that only does metadata interaction and that doesn't need all of pyarrow. It could be done entirely with Python libraries like fastavro, smartopen, and the Iceberg core library.

@emkornfield (Contributor):

> I think we will definitely have a mode where pyarrow is used. Certainly if you're reading or writing data, you'd probably want pyarrow.

Yeah, wasn't sure if pyarrow was going to be considered optional or not.

"""Checks whether the file exists"""

@abstractmethod
def open(self):
Contributor:

Is there not a return type that we require here? What about IOBase? @emkornfield do you have a suggestion for this?

Contributor:

I think IOBase is probably heavyweight. Using protocols seems like the right thing here?

Contributor:

This seems to require https://pypi.org/project/typing-extensions/ for python <= 3.7

@samredai (Collaborator, Author):

I like that idea! @rdblue does this protocol capture the required methods?

from typing import Protocol


class InputStream(Protocol):
    def read(self, n: int) -> bytes:
        ...

    def readable(self) -> bool:
        ...

    def close(self) -> None:
        ...

    def seek(self, offset: int, whence: int = 0) -> int:
        ...

    def tell(self) -> int:
        ...

Contributor:

Thinking about this some more, I think IOBase is the right option here. It would guarantee interop with the Python standard library. Also, if you look at its design, I think it lends credence to my comment below about potentially having one file type which can be inspected to determine whether it is readable and writable.
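
For illustration, a minimal sketch of inspecting a single IOBase-typed handle (this is plain standard-library behavior, not code from this PR):

import io

def describe(stream: io.IOBase) -> str:
    """Report what a file handle supports; works for any IOBase subclass."""
    return f"readable={stream.readable()} writable={stream.writable()} seekable={stream.seekable()}"

with open("/tmp/example.txt", "wb") as f:
    print(describe(f))  # readable=False writable=True seekable=True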

"""Get an OutputFile instance to write bytes to the file at the given location"""

@abstractmethod
def delete(self, location: str) -> None:
Contributor:

Minor: location could also be an InputFile or an OutputFile, in which case this would delete that location. I'm not sure if there's an easy way to express that in type annotations, though.

@samredai (Collaborator, Author) commented Jan 23, 2022:

Updated the typehint to include InputFile and OutputFile (using typing.Union) and also updated the LocalFileIO.delete method defined in the tests to handle these.

relevant commit: 51ea645
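
A minimal sketch of the updated signature (the exact docstring in base.py may differ):

from abc import ABC, abstractmethod
from typing import Union


class FileIO(ABC):
    @abstractmethod
    def delete(self, location: Union[str, "InputFile", "OutputFile"]) -> None:
        """Delete the file at the given location, or at the location of the given InputFile/OutputFile"""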

"""Returns an InputFile for the location of this output file"""

@abstractmethod
def create(self, overwrite: bool = False):
Contributor:

Same here, is there an IO type that this should return?

Contributor:

If not, we could create one that has the __enter__ and __exit__ methods so that a with block automatically closes the file?

"""An InputFile implementation for local files (for test use only)"""

def __init__(self, location: str):
    if not location.startswith("file://"):
Contributor:

This could be file:/// or file:/. The first has an authority section (after //) but it is empty. The second variation leaves out authority and just has a path. Either way, the path is a full path starting from /. Also, one case that is not allowed is a URI like file://one/two/three/a.parquet because the authority is one and no authority is allowed for local FS URIs.

@samredai (Collaborator, Author) commented Jan 23, 2022:

I switched to using urllib.parse.urlparse which is commonly used in other packages. This returns a ParseResult where you can check ParseResult.scheme, ParseResult.path, etc. and I set that to a property called parsed_location.

In addition to checking that the scheme is file, I also added a check that there's no ParseResult.netloc for a LocalInputFile or LocalOutputFile, which is the authority section.

relevant commit: fcb7dc4
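
For reference, a small sketch of the checks described above using urllib.parse.urlparse (variable names are illustrative):

from urllib.parse import urlparse

parsed_location = urlparse("file:///tmp/foo.parquet")
print(parsed_location.scheme)  # 'file'
print(parsed_location.netloc)  # '' (no authority)
print(parsed_location.path)    # '/tmp/foo.parquet'

if parsed_location.scheme != "file":
    raise ValueError("Locations for local files must use the file scheme")
if parsed_location.netloc:
    raise ValueError("A URI authority is not allowed for local filesystem URIs")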

    super().__init__(location=location.split("file://")[1])

def __len__(self):
    return len(self._file_obj)
@rdblue (Contributor) commented Jan 23, 2022:

Should this be os.path.getsize(self.location)? I don't see any other reference to self._file_obj.

@samredai (Collaborator, Author):

That's right, thanks! I updated it to use the parsed URI added in another commit, so it's os.path.getsize(self.parsed_location.path) now. I also added validation of len in the tests.

relevant commit: 7b625cf
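
A simplified sketch of the resulting __len__ (a stand-in, not the exact test helper from this PR):

import os
from urllib.parse import urlparse


class LocalInputFile:
    """Simplified stand-in showing the size lookup via the parsed location"""

    def __init__(self, location: str):
        self.parsed_location = urlparse(location)

    def __len__(self) -> int:
        return os.path.getsize(self.parsed_location.path)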


def create(self, overwrite: bool = False) -> None:
    if not overwrite and self.exists():
        raise FileExistsError(f"{self.location} already exists")
Contributor:

Instead of checking for existence directly, I think this should use mode wbx when not overwriting, which will fail if the file already exists. That ensures that the check is atomic. With the check here, there is a race condition between two writers that are in this method. Both check that the file doesn't exist and succeed, but then both try to create the file.

@samredai (Collaborator, Author):

Awesome, it looks like 'xb' is the right mode, so I updated this and it looks much cleaner now:

    def create(self, overwrite: bool = False) -> None:
        return open(self.parsed_location.path, "wb" if overwrite else "xb")

relevant commit: 25836bf

Contributor:

Do you get the correct FileExistsError from open?

@samredai (Collaborator, Author):

Yep! I validate that here in one of the tests.
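
As a quick illustration of the exclusive-create mode (standard Python behavior, not code from this PR):

path = "/tmp/already_exists.txt"
open(path, "wb").close()  # create the file first

try:
    open(path, "xb")  # 'x' requests exclusive creation
except FileExistsError as e:
    print(e)  # [Errno 17] File exists: '/tmp/already_exists.txt'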

output_file_location = os.path.join(tmpdirname, "foo.txt")

# Instantiate an output file
output_file = CustomOutputFile(location=f"file://{output_file_location}")
Contributor:

Do you have a test that location is the original location that was passed in?

@samredai (Collaborator, Author) commented Jan 23, 2022:

I added some tests to validate the location and also validate that a ValueError is raised when an authority is provided in the uri (for the LocalInputFile and LocalOutputFile implementations in the test file).

The test for validating the location is parameterized so it's easy to add to. For example in the future we could simply add (S3FileIO, "s3://foo/bar/baz.parquet", "s3", "", "/bar/baz.parquet") to the list.

relevant commit: cc490a5
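
A hedged sketch of what such a parameterized test could look like (the actual test names, fixtures, and expected tuples in this PR may differ; LocalFileIO refers to the test-only implementation mentioned above):

import pytest

@pytest.mark.parametrize(
    "CustomFileIO, location, scheme, netloc, path",
    [
        (LocalFileIO, "file:///foo/bar/baz.parquet", "file", "", "/foo/bar/baz.parquet"),
        # Future example from the comment above:
        # (S3FileIO, "s3://foo/bar/baz.parquet", "s3", "", "/bar/baz.parquet"),
    ],
)
def test_custom_file_location(CustomFileIO, location, scheme, netloc, path):
    input_file = CustomFileIO().new_input(location)
    assert input_file.location == location
    assert input_file.parsed_location.scheme == scheme
    assert input_file.parsed_location.netloc == netloc
    assert input_file.parsed_location.path == path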

@rdblue (Contributor) commented Jan 24, 2022

Looks great. Thanks for updating this, @samredai!

I'm going to go ahead and merge this. I think there's still an open question about what type to return from the open and create methods (and how to support with) but the overall structure looks great and we can solve that problem later.

@rdblue merged commit 095754c into apache:master on Jan 24, 2022
"""Get an InputFile instance to read bytes from the file at the given location"""

@abstractmethod
def new_output(self, location: str) -> OutputFile:
Contributor:

Is there a reason to distinguish between input and output files? It seems like for the most part the APIs are very similar. If a file is going to be only readable or writable, wouldn't having the implementation raise a not-implemented error be a better choice?

Contributor:

This gives the flexibility to the implementation. You can always implement both base classes, right?

Contributor:

I guess this might be more a philosophical question. There are two ways to achieve the flexibility:

  1. Provide a single class and have users implement only the methods they want (you can document a set of methods that should always be implemented together), giving users run-time errors when unimplemented methods are called.
  2. Separate the functionality into two different interfaces and require all methods be implemented.

My sense is that #2 is more of a Java design pattern. I think (but I'm no expert) that #1 is the more Pythonic, dynamically-typed-language pattern.
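
A toy illustration of option 1 (names are illustrative, not from this PR): a single file class where unsupported operations raise at runtime.

class File:
    """A single file abstraction; subclasses override only the operations they support"""

    def open(self):
        raise NotImplementedError("This file does not support reading")

    def create(self, overwrite: bool = False):
        raise NotImplementedError("This file does not support writing")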


@abstractmethod
def create(self, overwrite: bool = False):
    """This method should return a file-like object.
Contributor:

"File-like object" is confusing given that these classes are also called File. Is it supposed to only support write() methods?

Contributor:

I should add that I understand it because I am familiar with the Python idiom of "file-like", and maybe most users of these classes will be too, since after all this is Python.

@samredai (Collaborator, Author):

Should I specify the required methods, which I believe are write, close, flush, and tell? Maybe also mention that close should flush. Or better yet, just add a protocol here too with those methods and specify that protocol in the docstring for create?

"""This method should return an object that matches the OutputStream protocol
...
"""

OutputStream protocol

from typing import Protocol

class OutputStream(Protocol):
    def write(self, b: bytes) -> None:
        ...

    def close(self) -> None:
        ...

    def flush(self) -> None:
        ...

    def tell(self) -> int:
        ...

Contributor:

Yeah, I think documenting via the returned type makes sense here if we want to go with a protocol.
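
For example, the abstract method could then be annotated with the protocol (a sketch; the final signature may differ):

from abc import ABC, abstractmethod


class OutputFile(ABC):
    @abstractmethod
    def create(self, overwrite: bool = False) -> "OutputStream":
        """This method should return an object that matches the OutputStream protocol"""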

"""


class FileIO(ABC):
Contributor:

Nit: add a docstring here.


@abstractmethod
def open(self):
    """This method should return an instance of a seekable input stream."""
Contributor:

Nit: be consistent about ending docstrings with periods or not.

Contributor:

Also specify what should happen if the file doesn't exist.

"""Checks whether the file exists"""

@abstractmethod
def open(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to require https://pypi.org/project/typing-extensions/ for python <= 3.7

"""Get an InputFile instance to read bytes from the file at the given location"""

@abstractmethod
def new_output(self, location: str) -> OutputFile:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this might be more a philosophical question. There are two ways to achieve the flexibility:

  1. Provide a single class and have users only implement the methods they want (you can document a set of methods that should always be implemented together). Giving users run-time errors when not implemented.
  2. Separate the functionality into two different interfaces and require all methods be implemented.

My sense is that #2 is more of a java design pattern. I think (but I'm no expert) option #1 is more pythonic/dynamically typed language pattern.


@abstractmethod
def create(self, overwrite: bool = False):
"""This method should return a file-like object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add that I understand it because I am familiar with the python idiom of "file-like" and maybe most users of these class will be since, because after all this is python.

@samredai (Collaborator, Author):

> This seems to require https://pypi.org/project/typing-extensions/ for python <= 3.7

In some of the original design discussions, the idea was that we would stick with the NEP 29 deprecation policy, so we wouldn't be supporting Python versions <= 3.7. @jun-he, does that sound right?
