-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: check location for conflict before creating table #8194
Conversation
882c1ce
to
f9218f9
Compare
FYI @szehon-ho @RussellSpitzer @aokolnychyi appreciate the reviews |
if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) { | ||
boolean alreadyExists = ops.io().newInputFile(baseLocation).exists(); | ||
if (alreadyExists) { | ||
throw new AlreadyExistsException("Table location already in use: %s", baseLocation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite accurate since we are just checking whether the directory is there, which is different than the location being in use. While I think this may be good to check, we've seen most errors here being users setting table locations as sub directories as other tables.
I think maybe we should check that the directory/location is empty? That's probably a not too aggressive check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @RussellSpitzer for your feedback. I totally agree that checking for directory or location is empty would be awesome, but based on the methods available in FileIO. I dont find anything useful for that purpose. The closest one I get is SupportsPrefixOperations::listPrefix which is subclass of fileIO.
Do you think it make sense for us to do conditional check based on io?
boolean alreadyExists =
(ops.io() instanceof SupportsPrefixOperations) ?
!Iterables.isEmpty(
((SupportsPrefixOperations) ops.io()).listPrefix(baseLocation)
)
: ops.io().newInputFile(baseLocation).exists();
@@ -365,4 +365,7 @@ private TableProperties() {} | |||
|
|||
public static final String UPSERT_ENABLED = "write.upsert.enabled"; | |||
public static final boolean UPSERT_ENABLED_DEFAULT = false; | |||
|
|||
public static final String UNIQUE_LOCATION = "location.unique"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we have to be careful here, because as I mentioned above, we are not actually checking for a unique location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought through this and mostly two cases came to my mind. We may think with this route
- No database creation should be allowed under an existing database path. It will help a major problem of people creating even databases under existing db path.
- No table creation should be allowed under an existing table path.
- Database hash and table hashes are also included.
Table Location Minimum: <bucket_uri>/<database_hash>/<database_name>/<table_name>
Case 1
We have the following information with us which is an existing Table and it's location. Once a table got created in past that defintely has a valid path do we really needs to check fileIO or a simple string comparison/regex match is enough?
Say a table's location is s3://somerandompath/my_database/my_table
I feel instead of looking into fileIO why not we leverage our own metadata? We have various ways of creating iceberg table just via database.tableName, with location etc. This DB path is always a constant path by practice. If someone is trying to create a table under the same location with same name we can just throw the exception that s3://somerandompath/my_database/my_table exists just by looking it's database reference, which should be one level up and only one level under a database path should be a permissible table path. The uniqueness not necessarily you need from storage file location but from our metadata information.
CREATE TABLE my_database.my_table
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...
OR
CREATE TABLE IF NOT EXISTS my_database.my_table (
id integer,
......
)
USING ICEBERG
LOCATION 's3://somerandompath/my_database/my_table'
TBLPROPERTIES (
'type' 'hive',.......
)
Case 2
Rename table : When we rename a table we don't move files it is a metadata operation. The base path remains same but the table name gets updated in metadata. So in this case there is no impact. For unique location we can still look up to the metadata and get all unique paths under db reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @anigos thank you explaining your thoughts. I want to share some of mine below
Once a table got created in past that defintely has a valid path do we really needs to check fileIO or a simple string comparison/regex match is enough?
As issue #7238 mentioned, there's a lot of suggestions to avoid conflict but not all are equally practical to handle real world complexity
- existing table whose location does not follow
<bucket_uri>/<database_hash>/<database_name>/<table_name>
pattern, the directory based location is a convention for hive tables but not for iceberg. - FileIO runtime resolve provides us the ability to check if the given path is already in use/non-empty at runtime where table location based comparison/regex match rely on good intention of data owner who followed convention. So if io is available at the time of table creation, I think we can take that to our advantage.
- table rename with old location is the fine as I also include some of them in unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got you @dramaticlly ! :)
This part looks ok but one question for @RussellSpitzer too as you said Russell we need to check if the directory is empty as well but assume someone just have cerated a table and that has nothing yet but it will get incoming records? I mean existing location isn't that enough by this check? There could be timely situation where I have created a table and from next week that table will get data and actually that location is in use but currently it is empty?
if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) {
boolean alreadyExists = ops.io().newInputFile(baseLocation).exists();
if (alreadyExists) {
throw new AlreadyExists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be catalog dependent, but commonly when table was created, the iceberg metadata gets written into its folder.
table db.tbl
gets created usually means catalog will create following structure on file system
.
└── prefix (0)
└── db.db (1)
└── table (2)
├── data (3)
└── metadata (4)
└── 00000-fad97b4a-ffea-4707-b9cc-d083017bf482.metadata.json
So currently if we check directory is empty help guard against the case where create tables at some existing table location. I believe with this PR, create a new table at location 0,1,2,4 will throw exception as exception since location is already in use but create at 3 will succeed if original table was created empty.
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
d13fb0f
to
099c3c4
Compare
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
close #7238 for create table check on location uniqueness.
Instead of putting it as catalog property, I ended up using table property as it allow for catalog enforcement as well as per table overwrite.