-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Support iceberg base catalog in python library (#3245) #4706
Python: Support iceberg base catalog in python library (#3245) #4706
Conversation
…Table and PartitionSpec to make linter happy. (apache#3245)
…Table and PartitionSpec to make linter happy. (apache#3245)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start!
…on typing for all Catalog APIs (apache#3245)
…tions to create or replace a table. Removed the optional namespace argument from catalog.list_namespaces() (apache#3245)
…ass to remove 'attrs' dependency. Move in-memory catalog completely in test sources for testing base catalog only. (apache#3245)
Sorry for the mess above ^. Had to do a rebase from master to pull in recent PRs to add |
python/tests/catalog/test_base.py
Outdated
|
||
# Then | ||
assert table | ||
assert table is not given_table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this just asserting that the table isn't the same instance? Is that useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue In the earlier commits I was asserting if the table name has changed, but since I had to rollback all changes to the Table
interface in this PR I resorted to checking just that the table is changing in some form post renaming. I will eventually be changing this to assert table name change once we add attributes back to Table
in #3227
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. I worked around it for the time being. Since the Table
class isn't frozen yet I added an attribute identifier
post creation in the InMemoryCatalog
and asserting on identifier
renaming for now. This obviously would fail, and rightly so, once we make Table
immutable/frozen.
I'm ready to commit this once the error message format is fixed. |
@rdblue Are we good to merge? |
class InMemoryCatalog(Catalog): | ||
"""An in-memory catalog implementation for testing purposes.""" | ||
|
||
__tables: Dict[Identifier, Table] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit / question: I know that the usage of double underscore somewhat mangles function names to make them "private". Does that apply to class fields as well?
I've always preferred single underscore for this reason, but I'm not a huge pythonista and this is a test class. So more just for my own knowledge as well as to better understand the current "best practices".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this too. In some ways I think the double underscore here signals "These properties don't normally exist in a non-testing Catalog implementation" but I think it'd be clearer to just directly describe that in the docstring and use property names without any underscores. You're right in that the double underscore would mangle the names, but only for a child class, something like Foo(InMemoryCatalog)
, which isn't done anywhere here. Python would then make the property name _InMemoryCatalog__tables
for instances of Foo
.
I came to the same conclusion as you, this is a test class so it doesn't matter as much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a single underscore is more appropriate here. It is an internal variable of the InMemoryCatalog
.
If we want to set this as a ClassVar, then we also have to annotation for it:
__tables: Dict[Identifier, Table] | |
_tables: ClassVar[Dict[Identifier, Table]] |
But I'm not sure if we want to do that since we would like to instantiate a fresh InMemoryCatalog for the different tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few last remarks, looks great in general!
properties(Properties): Catalog properties | ||
""" | ||
|
||
def __init__(self, name: str, properties: Properties): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll continue my quest for data classes here as well. Instead of creating the getters/setters, we could also let the dataclasses do this:
@dataclass
class Catalog(ABC):
name: str = field()
name: Properties = field(default_factory=dict)
This will automatically generate the getters and setters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to go full blown pythonic in the future, we could also validate these using Pydantic validators: https://pydantic-docs.helpmanual.io/usage/validators/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using @dataclass
is a good idea, but we can move on here and refactor it later.
class InMemoryCatalog(Catalog): | ||
"""An in-memory catalog implementation for testing purposes.""" | ||
|
||
__tables: Dict[Identifier, Table] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a single underscore is more appropriate here. It is an internal variable of the InMemoryCatalog
.
If we want to set this as a ClassVar, then we also have to annotation for it:
__tables: Dict[Identifier, Table] | |
_tables: ClassVar[Dict[Identifier, Table]] |
But I'm not sure if we want to do that since we would like to instantiate a fresh InMemoryCatalog for the different tests.
if namespace in self.__namespaces: | ||
raise AlreadyExistsError(f"Namespace already exists: {namespace}") | ||
else: | ||
self.__namespaces[namespace] = properties if properties else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Less is more:
self.__namespaces[namespace] = properties if properties else {} | |
self.__namespaces[namespace] = properties or {} |
""" | ||
|
||
@staticmethod | ||
def identifier_to_tuple(identifier: Union[str, Identifier]) -> Identifier: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming things is one of the hardest things in Computer Science, but I find this one a bit awkward since:
Identifier = Tuple[str, ...]
So it is actually already a tuple. How about:
def identifier_to_tuple(identifier: Union[str, Identifier]) -> Identifier: | |
def to_identifier(identifier: Union[str, Identifier]) -> Identifier: |
TEST_TABLE_LOCATION = "protocol://some/location" | ||
TEST_TABLE_PARTITION_SPEC = PartitionSpec() | ||
TEST_TABLE_PROPERTIES = {"key1": "value1", "key2": "value2"} | ||
NO_SUCH_TABLE_ERROR = "Table does not exist: \\('com', 'organization', 'department', 'my_table'\\)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should use string names, not tuples. Using a tuple leaks the internal details to people using the library or something built on this library, and this is needlessly confusing for them.
|
||
@pytest.fixture | ||
def catalog() -> InMemoryCatalog: | ||
return InMemoryCatalog("test.in.memory.catalog", {"test.key": "test.value"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test.key
isn't used anywhere, nor is test.in.memory.catalog
. Can you add tests for properties and name?
There are still a few things to fix, but I think we can clean up things like moving to |
READY FOR REVIEW. The current PR has changes for the base Catalog with table and namespace APIs. It also has a concrete In-Memory Catalog implementation, that is Dictionary based, for testing purposes. It also helps us validate our Catalog interface and see if it holds firm in action.
The PR also includes placeholders for Table and PartitionSpec, and minutely touches issue #3227 i.e., support table in iceberg python library, and #3228 i.e. support partition spec in iceberg python library, as the catalog management is dependent on those classes.