Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pre-existing session support for dynamodb catalog #104

Merged
merged 5 commits into from
Jan 18, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions mkdocs/docs/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -195,6 +195,19 @@ catalog:
table-name: iceberg
```

If you prefer to pass the credentials explicitly to the client instead of relying on environment variables,

```yaml
catalog:
default:
type: dynamodb
table-name: iceberg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed that we're mixing case here 😭

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah.. i mean the underscores are pretty typical aws syntax but iceberg has always used hyphens afaik

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have an inconsistency since the FileIO configurations use hyphens for s3 credentials:

| s3.access-key-id | admin | Configure the static secret access key used to access the FileIO. |
| s3.secret-access-key | password | Configure the static session token used to access the FileIO. |

Also, what about adding a dynamo. prefix to these credential keys?

dynamodb.access-key-id
dynamodb.secret-access-key
...

This could help clarify that these keys are used exclusively by the catalog, not the FileIO:

Just thinking ahead, if we're in favor of renaming these configurations, perhaps it's something we could roll out in a separate PR, especially considering updates to the GlueCatalog configurations that would follow.

I'd love to hear your thoughts on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HonahX Definitely agreed with dynamo as a prefix.

As for hyphens vs underscores, AWS is really consistent about using underscores. I'm of the opinion that the AWS-based credentials should support both underscores and hyphens, will prefer hyphens if present, but fall back to the underscore usages if necessary. Documentation should only present the hyphenated case as an option. I believe that this strategy would lead to the least number of "head banging" debug sessions. However, I think a reasonable case could be made to remove underscore support instead of supporting it as a fallback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waifairer Yes, I totally agree. Let's do that in a separate PR.

aws_access_key_id: <ACCESS_KEY_ID>
aws_secret_access_key: <SECRET_ACCESS_KEY>
aws_session_token: <SESSION_TOKEN>
region_name: <REGION_NAME>
```

# Concurrency

PyIceberg uses multiple threads to parallelize operations. The number of workers can be configured by supplying a `max-workers` entry in the configuration file, or by setting the `PYICEBERG_MAX_WORKERS` environment variable. The default value depends on the system hardware and Python version. See [the Python documentation](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) for more details.
4 changes: 3 additions & 1 deletion mkdocs/docs/contributing.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,10 +30,12 @@ For the development, Poetry is used for packing and dependency management. You c
pip install poetry
```

If you have an older version of pip and virtualenv you need to update these:
Make sure you're using an up-to-date environment from venv

```bash
pip install --upgrade virtualenv pip
python -m venv ./venv
source ./venv/bin/activate
```

To get started, you can run `make install`, which installs Poetry and all the dependencies of the Iceberg library. This also installs the development dependencies. If you don't want to install the development dependencies, you need to install using `poetry install --no-dev`.
Expand Down
12 changes: 10 additions & 2 deletions pyiceberg/catalog/dynamodb.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,15 @@
class DynamoDbCatalog(Catalog):
def __init__(self, name: str, **properties: str):
super().__init__(name, **properties)
self.dynamodb = boto3.client(DYNAMODB_CLIENT)
session = boto3.Session(
profile_name=properties.get("profile_name"),
region_name=properties.get("region_name"),
botocore_session=properties.get("botocore_session"),
aws_access_key_id=properties.get("aws_access_key_id"),
aws_secret_access_key=properties.get("aws_secret_access_key"),
aws_session_token=properties.get("aws_session_token"),
)
self.dynamodb = session.client(DYNAMODB_CLIENT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a unit test for this update? Like this one in GlueCatalog's test:

@mock_glue
def test_passing_profile_name() -> None:
session_properties: Dict[str, Any] = {
"aws_access_key_id": "abc",
"aws_secret_access_key": "def",
"aws_session_token": "ghi",
"region_name": "eu-central-1",
"profile_name": "sandbox",
"botocore_session": None,
}
test_properties = {"type": "glue"}
test_properties.update(session_properties)
with mock.patch("boto3.Session") as mock_session:
test_catalog = GlueCatalog("glue", **test_properties)
mock_session.assert_called_with(**session_properties)
assert test_catalog.glue is mock_session().client()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @HonahX , thanks for the suggestion. I added the unit test as you requested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test!

self.dynamodb_table_name = self.properties.get(DYNAMODB_TABLE_NAME, DYNAMODB_TABLE_NAME_DEFAULT)
self._ensure_catalog_table_exists_or_create()

Expand Down Expand Up @@ -110,7 +118,7 @@ def _dynamodb_table_exists(self) -> bool:
return False
except self.dynamodb.exceptions.InternalServerError as e:
raise GenericDynamoDbError(e.message) from e

print(response["Table"]["TableStatus"])
Copy link
Contributor

@HonahX HonahX Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you please remove this print?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a nit at all, thanks for the catch

if response["Table"]["TableStatus"] != ACTIVE:
raise GenericDynamoDbError(f"DynamoDB table for catalog {self.dynamodb_table_name} is not {ACTIVE}")
else:
Expand Down
28 changes: 26 additions & 2 deletions tests/catalog/test_dynamodb.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,17 +14,19 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
from typing import List
from typing import Any, Dict, List

import pytest
from moto import mock_dynamodb
from unittest import mock

from pyiceberg.catalog import METADATA_LOCATION, TABLE_TYPE
from pyiceberg.catalog.dynamodb import (
DYNAMODB_COL_CREATED_AT,
DYNAMODB_COL_IDENTIFIER,
DYNAMODB_COL_NAMESPACE,
DYNAMODB_TABLE_NAME_DEFAULT,
ACTIVE,
DynamoDbCatalog,
_add_property_prefix,
)
Expand All @@ -46,12 +48,13 @@ def test_create_dynamodb_catalog_with_table_name(_dynamodb, _bucket_initialize:
DynamoDbCatalog("test_ddb_catalog")
response = _dynamodb.describe_table(TableName=DYNAMODB_TABLE_NAME_DEFAULT)
assert response["Table"]["TableName"] == DYNAMODB_TABLE_NAME_DEFAULT
assert response["Table"]["TableStatus"] == ACTIVE

custom_table_name = "custom_table_name"
DynamoDbCatalog("test_ddb_catalog", **{"table-name": custom_table_name})
response = _dynamodb.describe_table(TableName=custom_table_name)
assert response["Table"]["TableName"] == custom_table_name

assert response["Table"]["TableStatus"] == ACTIVE

@mock_dynamodb
def test_create_table_with_database_location(
Expand Down Expand Up @@ -466,3 +469,24 @@ def test_update_namespace_properties_overlap_update_removal(
test_catalog.update_namespace_properties(database_name, removals, updates)
# should not modify the properties
assert test_catalog.load_namespace_properties(database_name) == test_properties

def test_passing_provided_profile() -> None:
catalog_name = "test_ddb_catalog"
session_props = {
"aws_access_key_id": "abc",
"aws_secret_access_key": "def",
"aws_session_token": "ghi",
"region_name": "eu-central-1",
"botocore_session": None,
"profile_name": None
}
props = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}
props.update(session_props)
with mock.patch('boto3.Session', return_value=mock.Mock()) as mock_session:
mock_client = mock.Mock()
mock_session.return_value.client.return_value = mock_client
mock_client.describe_table.return_value = {'Table': {'TableStatus': 'ACTIVE'}}
test_catalog = DynamoDbCatalog(catalog_name, **props)
assert test_catalog.dynamodb is mock_client
mock_session.assert_called_with(**session_props)
assert test_catalog.dynamodb is mock_session().client()