Core: check location for conflict before creating table #8194

dramaticlly · 2023-07-31T20:03:53Z

Closes #7238 by adding a location-uniqueness check to table creation.

Instead of making this a catalog property, I ended up using a table property, since it allows catalog-level enforcement as well as per-table override.

dramaticlly · 2023-08-01T00:05:08Z

FYI @szehon-ho @RussellSpitzer @aokolnychyi appreciate the reviews

RussellSpitzer · 2023-08-02T21:45:12Z

core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java

+      if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) {
+        boolean alreadyExists = ops.io().newInputFile(baseLocation).exists();
+        if (alreadyExists) {
+          throw new AlreadyExistsException("Table location already in use: %s", baseLocation);


Not quite accurate, since we are just checking whether the directory is there, which is different from the location being in use. While I think this may be good to check, most of the errors we've seen here come from users setting table locations as subdirectories of other tables.

I think maybe we should check that the directory/location is empty? That's probably not too aggressive a check?

Thank you @RussellSpitzer for your feedback. I totally agree that checking whether the directory or location is empty would be awesome, but based on the methods available in FileIO, I don't find anything useful for that purpose. The closest I can get is SupportsPrefixOperations::listPrefix, where SupportsPrefixOperations is a sub-interface of FileIO.

Do you think it makes sense for us to do a conditional check based on the io?

boolean alreadyExists =
    (ops.io() instanceof SupportsPrefixOperations)
        ? !Iterables.isEmpty(((SupportsPrefixOperations) ops.io()).listPrefix(baseLocation))
        : ops.io().newInputFile(baseLocation).exists();
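
As a rough illustration of the conditional check proposed above, here is a minimal self-contained sketch. The two interfaces are simplified, hypothetical stand-ins for FileIO and SupportsPrefixOperations (they are not the Iceberg API), and `locationInUse` and `inMemory` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LocationCheckSketch {
  /** Hypothetical stand-in for a FileIO that can only test whether a path exists. */
  interface SimpleFileIO {
    boolean exists(String location);
  }

  /** Hypothetical stand-in for a FileIO that can also list files under a prefix. */
  interface PrefixListingFileIO extends SimpleFileIO {
    Iterable<String> listPrefix(String prefix);
  }

  /** Prefer the prefix listing when the IO supports it; otherwise fall back to exists(). */
  static boolean locationInUse(SimpleFileIO io, String baseLocation) {
    if (io instanceof PrefixListingFileIO) {
      return ((PrefixListingFileIO) io).listPrefix(baseLocation).iterator().hasNext();
    }
    return io.exists(baseLocation);
  }

  /** In-memory IO for illustration only: "files" is just a list of full paths. */
  static PrefixListingFileIO inMemory(List<String> files) {
    return new PrefixListingFileIO() {
      @Override
      public boolean exists(String location) {
        return files.contains(location);
      }

      @Override
      public Iterable<String> listPrefix(String prefix) {
        List<String> matches = new ArrayList<>();
        for (String file : files) {
          if (file.startsWith(prefix)) {
            matches.add(file);
          }
        }
        return matches;
      }
    };
  }

  public static void main(String[] args) {
    PrefixListingFileIO io =
        inMemory(Arrays.asList("s3://bucket/db/tbl/metadata/v1.metadata.json"));
    System.out.println(locationInUse(io, "s3://bucket/db/tbl"));   // true
    System.out.println(locationInUse(io, "s3://bucket/db/other")); // false
  }
}
```

The fallback branch only answers whether something exists at exactly that path, which is the weaker check flagged in the review comment above.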

RussellSpitzer · 2023-08-02T21:45:47Z

core/src/main/java/org/apache/iceberg/TableProperties.java

@@ -365,4 +365,7 @@ private TableProperties() {}

  public static final String UPSERT_ENABLED = "write.upsert.enabled";
  public static final boolean UPSERT_ENABLED_DEFAULT = false;
+
+  public static final String UNIQUE_LOCATION = "location.unique";


I do think we have to be careful here, because as I mentioned above, we are not actually checking for a unique location.
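
For reference, a small sketch of how the guard in the diff above reads the flag. Only the key `location.unique` comes from the diff; the class and method names are made up for illustration. Because `Boolean.parseBoolean(null)` returns false, an unset property leaves the check disabled.

```java
import java.util.HashMap;
import java.util.Map;

final class UniqueLocationFlagSketch {
  // Property key taken from the TableProperties diff above.
  static final String UNIQUE_LOCATION = "location.unique";

  /** Mirrors the guard in the diff: missing or non-"true" values disable the check. */
  static boolean checkEnabled(Map<String, String> tableProperties) {
    return Boolean.parseBoolean(tableProperties.get(UNIQUE_LOCATION));
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(checkEnabled(props)); // false (property not set)

    props.put(UNIQUE_LOCATION, "true");
    System.out.println(checkEnabled(props)); // true
  }
}
```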

I have thought this through, and two main cases come to mind. Along this route, we might enforce the following:

No database creation should be allowed under an existing database path. This would address a major problem of people creating databases under an existing database path.

No table creation should be allowed under an existing table path.

Database hashes and table hashes are also included.

Minimum table location: <bucket_uri>/<database_hash>/<database_name>/<table_name>

Case 1

We have the following information available: an existing table and its location. Once a table has been created, it definitely has a valid path, so do we really need to check FileIO, or is a simple string comparison/regex match enough?

Say a table's location is s3://somerandompath/my_database/my_table

I feel that instead of looking into FileIO, we could leverage our own metadata. There are various ways of creating an Iceberg table: just via database.tableName, with an explicit location, etc. By practice, the database path is always constant. If someone tries to create a table under the same location with the same name, we can throw an exception saying that s3://somerandompath/my_database/my_table exists just by looking at its database reference, which should be one level up; only one level under a database path should be a permissible table path. The uniqueness does not necessarily need to come from the storage file location; it can come from our metadata.

CREATE TABLE my_database.my_table USING iceberg PARTITIONED BY (part) TBLPROPERTIES ('key'='value') AS SELECT ...

OR

CREATE TABLE IF NOT EXISTS my_database.my_table ( id integer, ...... ) USING ICEBERG LOCATION 's3://somerandompath/my_database/my_table' TBLPROPERTIES ( 'type' 'hive',....... )

Case 2

Rename table: when we rename a table we don't move files; it is a metadata operation. The base path remains the same, but the table name gets updated in the metadata, so in this case there is no impact. For unique locations we can still look at the metadata and get all unique paths under the database reference.
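
To make the metadata-only comparison from Case 1 concrete, here is a minimal sketch. It assumes the catalog can enumerate the locations of existing tables, which this thread does not establish; `findConflict` and `normalize` are hypothetical helpers, not Iceberg APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class LocationConflictSketch {
  /** Strip trailing slashes and add exactly one, so "my_table" does not match "my_table2". */
  static String normalize(String location) {
    return location.replaceAll("/+$", "") + "/";
  }

  /**
   * Returns the first existing location that equals the candidate, contains it,
   * or is contained by it. Pure string comparison; no storage access.
   */
  static Optional<String> findConflict(String candidate, List<String> existingLocations) {
    String c = normalize(candidate);
    for (String existing : existingLocations) {
      String e = normalize(existing);
      if (c.startsWith(e) || e.startsWith(c)) {
        return Optional.of(existing);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> existing = Arrays.asList("s3://somerandompath/my_database/my_table");

    // Same location and a subdirectory of an existing table both conflict.
    System.out.println(
        findConflict("s3://somerandompath/my_database/my_table", existing).isPresent()); // true
    System.out.println(
        findConflict("s3://somerandompath/my_database/my_table/sub", existing).isPresent()); // true

    // A sibling with a similar name does not.
    System.out.println(
        findConflict("s3://somerandompath/my_database/my_table2", existing).isPresent()); // false
  }
}
```

A comparison like this only sees locations the catalog already knows about, so it cannot detect files written to a path outside the catalog's knowledge.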

Hey @anigos, thank you for explaining your thoughts. I want to share some of mine below.

Once a table has been created, it definitely has a valid path, so do we really need to check FileIO, or is a simple string comparison/regex match enough?

As issue #7238 mentions, there are many suggestions for avoiding conflicts, but not all are equally practical for handling real-world complexity:

There are existing tables whose location does not follow the <bucket_uri>/<database_hash>/<database_name>/<table_name> pattern. The directory-based location is a convention for Hive tables but not for Iceberg.

Resolving through FileIO lets us check at runtime whether the given path is already in use/non-empty, whereas a table-location comparison/regex match relies on the good intentions of data owners who follow the convention. So if the io is available at table creation time, I think we can use that to our advantage.

Renaming a table while keeping the old location is fine; I also cover some of these cases in the unit tests.

Got you @dramaticlly ! :)

This part looks OK, but one question for @RussellSpitzer too. Russell, you said we need to check whether the directory is empty as well, but what if someone has just created a table that has nothing in it yet but will receive incoming records? Isn't checking that the location exists enough here? There could be a timing situation where I have created a table that will start getting data next week: the location is actually in use, but it is currently empty.

if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) {
  boolean alreadyExists = ops.io().newInputFile(baseLocation).exists();
  if (alreadyExists) {
    throw new AlreadyExists

I think it might be catalog dependent, but commonly, when a table is created, the Iceberg metadata gets written into its folder.

Creating table db.tbl usually means the catalog will create the following structure on the file system:

.
└── prefix (0)
    └── db.db (1)
        └── table (2)
            ├── data (3)
            └── metadata (4)
                └── 00000-fad97b4a-ffea-4707-b9cc-d083017bf482.metadata.json

So checking that the directory is empty helps guard against creating a table at an existing table's location. I believe that with this PR, creating a new table at location 0, 1, 2, or 4 will throw an exception since the location is already in use, but creating one at 3 will succeed if the original table was created empty.
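
The numbered locations above can be reproduced on a local filesystem as a stand-in for the table's FileIO. This is only a sketch: `containsFiles` is a hypothetical helper that mimics a recursive prefix listing with `Files.walk`, and the metadata file name is a placeholder. Object stores generally have no notion of an empty directory, so behavior there may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class EmptyLocationDemo {
  /** True if any regular file exists at or below the location (recursive). */
  static boolean containsFiles(Path location) {
    if (!Files.exists(location)) {
      return false;
    }
    try (Stream<Path> paths = Files.walk(location)) {
      return paths.anyMatch(Files::isRegularFile);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds prefix/db.db/table/{data,metadata} with one metadata file and no data files. */
  static Path createEmptyTableLayout() {
    try {
      Path prefix = Files.createTempDirectory("prefix");
      Path table = prefix.resolve("db.db").resolve("table");
      Files.createDirectories(table.resolve("data"));
      Path metadata = Files.createDirectories(table.resolve("metadata"));
      Files.createFile(metadata.resolve("00000-example.metadata.json"));
      return prefix;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path prefix = createEmptyTableLayout();
    Path db = prefix.resolve("db.db");
    Path table = db.resolve("table");

    System.out.println("0 prefix:   " + containsFiles(prefix));                    // true
    System.out.println("1 db.db:    " + containsFiles(db));                        // true
    System.out.println("2 table:    " + containsFiles(table));                     // true
    System.out.println("3 data:     " + containsFiles(table.resolve("data")));     // false
    System.out.println("4 metadata: " + containsFiles(table.resolve("metadata"))); // true
  }
}
```

Only location 3 reports no files, which matches the case described above where creating a table under the data directory of an empty table would not be rejected.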

github-actions · 2024-09-10T00:14:44Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-09-18T00:14:28Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot added core hive labels Jul 31, 2023

dramaticlly force-pushed the createTableUniqueLocation branch 2 times, most recently from 882c1ce to f9218f9 Compare July 31, 2023 20:59

dramaticlly marked this pull request as ready for review August 1, 2023 00:00

RussellSpitzer reviewed Aug 2, 2023

dramaticlly changed the title ~~Core: check table location for uniqueness before creation~~ Core: check table location for conflict before creation Aug 3, 2023

dramaticlly changed the title ~~Core: check table location for conflict before creation~~ Core: check location for conflict before creating table Aug 3, 2023

szehon-ho reviewed Aug 9, 2023

szehon-ho mentioned this pull request Aug 9, 2023

Core, Hive, Nessie: Use ResolvingFileIO as default instead of HadoopFileIO #8272

Open

Steve Zhang added 3 commits August 18, 2023 13:42

Core: check table location for uniqueness before creation

7862ee3

Check table prefix empty for location conflict

bd4c2b7

Add a filesystem fallback to see if given location is in use

099c3c4

dramaticlly force-pushed the createTableUniqueLocation branch from d13fb0f to 099c3c4 Compare August 18, 2023 22:00

github-actions bot added the stale label Sep 10, 2024

github-actions bot closed this Sep 18, 2024
