Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added support to dynamically registering/adding partitions #327

Merged

Conversation

firewall413
Copy link

Added support to add new partitions (/locations) to existing or new Glue tables:

Previous PR only registered the Glue table and merely specified its partition columns.

What if, for each external s3 write, we want to register a new partition in Glue (2024/1/01; 2024/01/02; ...etc).

…specifying the partition columns in the glue table but for each external s3 write,we want to register those partitions and their s3 locations in glue too)
Copy link
Collaborator

@jwills jwills left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is really exciting, thank you so much for adding this; I'd like some type info and code deduplication so we can unit test this stuff; I'm assuming you're using this for work?

@@ -132,12 +133,64 @@ def _convert_columns(column_list: Sequence[Column]) -> Sequence["ColumnTypeDef"]
return column_types


def _create_table(client: "GlueClient", database: str, table_def: "TableInputTypeDef") -> None:
def _create_table(
client: "GlueClient", database: str, table_def: "TableInputTypeDef", partition_columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's throw some type info on the partition_columns to be consistent here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Decided to stick with List[Dict[str, str]] as I use the partition_columns variable for both creating & updating partitions, I would've preferred: List[ColumnTypeDef] but that doesn't line up with what PartitionInputTypeDef requires. Is that ok?


partition_location = table_def["StorageDescriptor"]["Location"]
for name, value in zip(partition_names, partition_values):
partition_location += f"/{name}={value}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an aesthetic preference for something like:

partition_location = table_def["StorageDescriptor"]["Location"]
partition_components = [partition_location]
for c in partition_columns:
    partition_components.append("=".join((c["Name"], c["Value"]))
partition_location = "/".join(partition_components)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



def _update_table(
client: "GlueClient", database: str, table_def: "TableInputTypeDef", partition_columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type info for partition_columns as above


partition_location = table_def["StorageDescriptor"]["Location"]
for name, value in zip(partition_names, partition_values):
partition_location += f"/{name}={value}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatever the logic ends up being for this partition path construction, let's move it to a method so we can unit test it and not repeat it in create/update table

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

partition_input["StorageDescriptor"] = table_def["StorageDescriptor"]
partition_input["StorageDescriptor"]["Location"] = partition_location

return partition_input, partition_values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, does this still work if partition_columns is empty? I figured it would throw an error since partition_values wouldn't be defined.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking if partition_columns != [] before calling this function as it shouldn't be called with empty partition_colums. Do you prefer that check inside the _parse_partition_columns function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just meant that a default None for partition_values should be on the same line as the default value for partition_input

Copy link
Collaborator

@jwills jwills left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much better, thank you very much

@jwills jwills merged commit c69bed8 into duckdb:master Feb 21, 2024
29 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants