Core: Add REST API for committing changes against multiple tables #7569
Conversation
Force-pushed from ee59679 to 143bf62
core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java: resolved (outdated) review comment
Force-pushed from fa5dcc7 to 353277d
open-api/rest-catalog-open-api.yaml (outdated diff)

    @@ -1756,6 +1871,30 @@ components:
            items:
              $ref: '#/components/schemas/TableUpdate'

        CommitTransactionTableRequest:
CommitTableRequest does already exist in this file unfortunately, hence naming it CommitTransactionTableRequest
I think we could refactor to work around this. Could we just use a list of CommitTableRequest?
the existing CommitTableRequest doesn't have an identifier field unfortunately
Force-pushed from 353277d to 8c17db8
core/src/main/java/org/apache/iceberg/catalog/BaseSessionCatalog.java: resolved (outdated) review comment
open-api/rest-catalog-open-api.yaml (outdated diff)

          $ref: '#/components/responses/UnauthorizedResponse'
        403:
          $ref: '#/components/responses/ForbiddenResponse'
        404:
Does this make sense with multiple tables? The route exists. Maybe this is a bad request?
The only case where I thought it might make sense to throw a 404 is when the server gets changes for a table that doesn't exist (anymore). Alternatively we could handle this case via a general CommitFailedException, but CommitFailedException is retryable and I don't think we'd want to retry in this particular case.
I'm undecided. On one hand, that would already work and we wouldn't need to find a response code for "concurrent delete" that avoids the retry. On the other, the route doesn't represent that table.
I think I'm inclined to go with what you have here right now. It makes sense and we can always deprecate its use later.
Force-pushed from b3fcfd3 to 6a7f07a
    public interface TableCommit {
      TableIdentifier identifier();

      TableMetadata base();
As I noted above, I think it would be better to use requirements and updates instead of mixing the updates with a TableMetadata.
I think there is a drawback to that approach, though. Any catalog that commits using metadata location rather than requirements and updates would be harder to update for multi-table transactions because those require the base metadata and the new metadata.
Overall, I think I would still prefer the requirements and updates. There are a couple of reasons:
- Implementations of TableOperations almost always refresh immediately before attempting a commit anyway. Loading table metadata should not be a performance problem.
- Some operations don't refresh table metadata before attempting to commit because they do not retry. For those cases, we have to address that they are not normally retried and use generic retry logic. That's what requirements/updates already do.
For a concrete example of point 2 above, consider UpdateSchema, which does not retry. If the table changes concurrently, then the schema update fails. That's because the operation doesn't validate that it can still apply the schema changes. However, the REST protocol does provide a way to validate that a schema update can still be applied, by sending an assert-last-assigned-field-id requirement.
If we were to perform a schema update using a multi-table commit method, that method doesn't know what update is happening and will always retry. To make that retry safe, we should use the requirements.
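To make that shape concrete, here is a minimal sketch of a requirements-and-updates style TableCommit, assuming a placeholder UpdateRequirement type; at the time of this discussion the requirement classes still live in the REST layer, so all names here are illustrative rather than what this PR ships.

```java
import java.util.List;
import org.apache.iceberg.MetadataUpdate;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical sketch: carry validation requirements alongside the updates so a
// multi-table commit can be retried safely, instead of passing base TableMetadata.
interface TableCommitSketch {
  TableIdentifier identifier();

  // validations such as assert-last-assigned-field-id, generated from the base metadata
  List<UpdateRequirement> requirements();

  List<MetadataUpdate> updates();

  // placeholder for whatever the requirement type looks like once it is pulled
  // out of the REST layer
  interface UpdateRequirement {
    void validate(TableMetadata base);
  }
}
```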
I completely agree with you on just using requirements and updates in TableCommit due to the points you mentioned. The only difficulty is that requirements are currently at the REST layer and we'd have to refactor them out to make them generally available/usable. I didn't want to overcomplicate this PR with such a refactoring and decided to keep the base metadata in the class for now, which we use to build the requirements from.
Looks like this is the primary blocker, so I recommend we get started on the refactor and get some of the other pieces of this PR (like REST updates and request objects) in.
core/src/main/java/org/apache/iceberg/catalog/BaseSessionCatalog.java: resolved (outdated) review comment
     * not do any other conflict detection. Therefore, it does not guarantee true transactional
     * atomicity, which is left to the implementation details of a REST server.
     */
    public static void commitTransaction(Catalog catalog, CommitTransactionRequest request) {
I think that this method should check whether the underlying catalog implements commitTransaction. If it does, then it should delegate to the catalog. If not, it should throw an exception.
The CatalogHandlers act as a reference implementation for the REST protocol, so I think it would be problematic if we included a method that took an atomic multi-table commit and implemented it as a sequence of individual commits.
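To illustrate the delegation idea, here is a minimal sketch, assuming a hypothetical SupportsCommitTransaction capability interface (not something this PR defines) that a catalog could implement:

```java
// Minimal sketch of the suggested check: delegate to catalogs that can perform an
// atomic multi-table commit, and refuse otherwise rather than committing the tables
// one by one, which would not be atomic.
public static void commitTransaction(Catalog catalog, CommitTransactionRequest request) {
  if (!(catalog instanceof SupportsCommitTransaction)) { // hypothetical interface
    throw new UnsupportedOperationException(
        "Catalog does not support atomic multi-table commits: " + catalog.name());
  }

  // how the table changes are extracted from the request is left open in this sketch
  ((SupportsCommitTransaction) catalog).commitTransaction(request);
}
```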
      }

      @Value.Immutable
      interface CommitTableRequest {
It seems confusing to have a CommitTableRequest and an UpdateTableRequest that are basically the same thing but with or without the identifier. Then there is also the naming problem in the spec. Those issues are a bit of a red flag that we need to address duplication.
I think there are two options. First, we could reuse the UpdateTableRequest directly and pull the identifier to a higher level, like this:
    CommitTransactionRequest:
      type: array
      items:
        $ref: '#/components/schemas/TransactionCommit'

    TransactionCommit:
      type: object
      required:
        - identifier
        - commit
      properties:
        identifier:
          $ref: '#/components/schemas/TableIdentifier'
        request:
          $ref: '#/components/schemas/CommitTableRequest'

    CommitTableRequest:
      type: object
      required:
        - requirements
        - updates
      properties:
        requirements:
          type: array
          items:
            $ref: '#/components/schemas/TableRequirement'
        updates:
          type: array
          items:
            $ref: '#/components/schemas/TableUpdate'
That's okay, but still fairly awkward. The next option is to reuse the existing schema directly and just add an optional identifier field:
    CommitTransactionRequest:
      type: array
      items:
        description: Each table commit request must provide an `identifier`
        $ref: '#/components/schemas/CommitTableRequest'

    CommitTableRequest:
      type: object
      required:
        - requirements
        - updates
      properties:
        identifier:
          description: Table identifier to update; must be present for CommitTransactionRequest
          $ref: '#/components/schemas/TableIdentifier'
        requirements:
          type: array
          items:
            $ref: '#/components/schemas/TableRequirement'
        updates:
          type: array
          items:
            $ref: '#/components/schemas/TableUpdate'
I prefer the second option. It's not a problem to add an optional field and ensure it is set when the object is used in a transaction. We'd also want to check that the identifier is either not set or is set to the same table in a normal table update.
    for (TableCommit commit : commits) {
      UpdateTableRequest.Builder updateTableBuilder = UpdateTableRequest.builderFor(commit.base());
      commit.changes().forEach(updateTableBuilder::update);
      UpdateTableRequest updateTableRequest = updateTableBuilder.build();
Looks like we may need to refactor how we produce requirements. This would not work for non-REST catalogs.
yes I agree, we'd have to pull them out of the REST layer and make them work for non-REST catalogs
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: 4 resolved (outdated) review comments
          (BaseTransaction.TransactionTable) transaction.table();

      // this performs validations and makes temporary commits that are in-memory
      commit(txTable.operations(), updateTableRequest);
I can see why you'd want to use a transaction to get the bulk of the work done before committing, but I don't think this actually works because you're committing the transactions sequentially below.
I think to test we need to pass the commitTransaction call through and implement it either for an InMemoryCatalog or JDBC.
Great progress, @nastra! I think we can get this in fairly quickly. Thanks for getting this ready!
Force-pushed from ae56b38 to 794744c
core/src/main/java/org/apache/iceberg/rest/requests/UpdateTableRequest.java: 3 resolved (outdated) review comments
@nastra, can you move the REST spec updates and request objects (with tests) to a different PR? I think those are about ready to go and this is currently blocked by producing requirements and refactoring that code. That way we can get some of this in rather than waiting.
Force-pushed from 794744c to d7abbd5, then to 90e88db, then to f3c9e0c
        this.requirements = requirements;
        this.updates = updates;
      }

    -  UpdateTableRequest(
    +  public UpdateTableRequest(
this is required to be public so that it can be used in RESTSessionCatalog.
It might also be worth adding a Builder to UpdateTableRequest so that we don't have to make it public here. In that case we have two options:
- write a manual builder
- generate a builder from the constructor definition (I've opened Core: Generate Builder for UpdateTableRequest #7838 to show how this would look)
Maybe we should have a create method like TableCommit? Or one that accepts a TableCommit and returns the correct request? That would be simpler than a builder.
sure that would also work. I'll make the respective changes in the next PR
added in #7867
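A rough sketch of that factory-method idea, assuming TableCommit exposes its requirements and updates; the real change landed in #7867, so the signature and accessor names here are illustrative only.

```java
// Hypothetical factory: build an UpdateTableRequest straight from a TableCommit so the
// constructor does not need to be made public for RESTSessionCatalog.
public static UpdateTableRequest create(TableCommit commit) {
  // accessor names on TableCommit are assumptions for this sketch
  return new UpdateTableRequest(commit.requirements(), commit.updates());
}
```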
Force-pushed from f3c9e0c to 22ec8a8
      List<MetadataUpdate> updates();

      static TableCommit create(TableIdentifier identifier, TableMetadata base, TableMetadata updated) {
Javadoc?
added in #7867
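For illustration, a Javadoc for that factory could read roughly as follows; the wording is a sketch only, the actual Javadoc was added in #7867.

```java
/**
 * Creates a TableCommit describing the changes to apply to a single table
 * as part of a multi-table commit.
 *
 * @param identifier identifier of the table being changed
 * @param base table metadata the changes were computed against
 * @param updated table metadata with the changes applied
 * @return a TableCommit for the given table
 */
static TableCommit create(TableIdentifier identifier, TableMetadata base, TableMetadata updated) {
  // ...
}
```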
Nothing major that would block this, but I made a couple of comments. Thanks, @nastra! Good to have transactions moving forward.
The main purpose of this PR is to extract the REST-specific changes out of #6948