Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: check location for conflict before creating table #8194

Closed

Conversation

dramaticlly
Copy link
Contributor

close #7238 for create table check on location uniqueness.

Instead of putting it as catalog property, I ended up using table property as it allow for catalog enforcement as well as per table overwrite.

@dramaticlly dramaticlly force-pushed the createTableUniqueLocation branch 2 times, most recently from 882c1ce to f9218f9 Compare July 31, 2023 20:59
@dramaticlly dramaticlly marked this pull request as ready for review August 1, 2023 00:00
@dramaticlly
Copy link
Contributor Author

FYI @szehon-ho @RussellSpitzer @aokolnychyi appreciate the reviews

if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) {
boolean alreadyExists = ops.io().newInputFile(baseLocation).exists();
if (alreadyExists) {
throw new AlreadyExistsException("Table location already in use: %s", baseLocation);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite accurate since we are just checking whether the directory is there, which is different than the location being in use. While I think this may be good to check, we've seen most errors here being users setting table locations as sub directories as other tables.

I think maybe we should check that the directory/location is empty? That's probably a not too aggressive check?

Copy link
Contributor Author

@dramaticlly dramaticlly Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @RussellSpitzer for your feedback. I totally agree that checking for directory or location is empty would be awesome, but based on the methods available in FileIO. I dont find anything useful for that purpose. The closest one I get is SupportsPrefixOperations::listPrefix which is subclass of fileIO.

Do you think it make sense for us to do conditional check based on io?

boolean alreadyExists =
  (ops.io() instanceof SupportsPrefixOperations) ?
      !Iterables.isEmpty(
          ((SupportsPrefixOperations) ops.io()).listPrefix(baseLocation)
      )
      : ops.io().newInputFile(baseLocation).exists();

@@ -365,4 +365,7 @@ private TableProperties() {}

public static final String UPSERT_ENABLED = "write.upsert.enabled";
public static final boolean UPSERT_ENABLED_DEFAULT = false;

public static final String UNIQUE_LOCATION = "location.unique";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we have to be careful here, because as I mentioned above, we are not actually checking for a unique location.

Copy link

@anigos anigos Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought through this and mostly two cases came to my mind. We may think with this route

  1. No database creation should be allowed under an existing database path. It will help a major problem of people creating even databases under existing db path.
  2. No table creation should be allowed under an existing table path.
  3. Database hash and table hashes are also included.

Table Location Minimum: <bucket_uri>/<database_hash>/<database_name>/<table_name>

Case 1

We have the following information with us which is an existing Table and it's location. Once a table got created in past that defintely has a valid path do we really needs to check fileIO or a simple string comparison/regex match is enough?

Say a table's location is s3://somerandompath/my_database/my_table

I feel instead of looking into fileIO why not we leverage our own metadata? We have various ways of creating iceberg table just via database.tableName, with location etc. This DB path is always a constant path by practice. If someone is trying to create a table under the same location with same name we can just throw the exception that s3://somerandompath/my_database/my_table exists just by looking it's database reference, which should be one level up and only one level under a database path should be a permissible table path. The uniqueness not necessarily you need from storage file location but from our metadata information.

CREATE TABLE my_database.my_table
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...

OR

CREATE TABLE IF NOT EXISTS my_database.my_table (
         id integer,
        ......
       )
       USING ICEBERG 
       LOCATION 's3://somerandompath/my_database/my_table'
       TBLPROPERTIES (
         'type' 'hive',.......
       )

Case 2

Rename table : When we rename a table we don't move files it is a metadata operation. The base path remains same but the table name gets updated in metadata. So in this case there is no impact. For unique location we can still look up to the metadata and get all unique paths under db reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @anigos thank you explaining your thoughts. I want to share some of mine below

Once a table got created in past that defintely has a valid path do we really needs to check fileIO or a simple string comparison/regex match is enough?

As issue #7238 mentioned, there's a lot of suggestions to avoid conflict but not all are equally practical to handle real world complexity

  • existing table whose location does not follow <bucket_uri>/<database_hash>/<database_name>/<table_name> pattern, the directory based location is a convention for hive tables but not for iceberg.
  • FileIO runtime resolve provides us the ability to check if the given path is already in use/non-empty at runtime where table location based comparison/regex match rely on good intention of data owner who followed convention. So if io is available at the time of table creation, I think we can take that to our advantage.
  • table rename with old location is the fine as I also include some of them in unit tests

Copy link

@anigos anigos Aug 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you @dramaticlly ! :)

This part looks ok but one question for @RussellSpitzer too as you said Russell we need to check if the directory is empty as well but assume someone just have cerated a table and that has nothing yet but it will get incoming records? I mean existing location isn't that enough by this check? There could be timely situation where I have created a table and from next week that table will get data and actually that location is in use but currently it is empty?

if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) {
        boolean alreadyExists = ops.io().newInputFile(baseLocation).exists();
        if (alreadyExists) {
          throw new AlreadyExists

Copy link
Contributor Author

@dramaticlly dramaticlly Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be catalog dependent, but commonly when table was created, the iceberg metadata gets written into its folder.

table db.tbl gets created usually means catalog will create following structure on file system

.
└── prefix (0)
    └── db.db (1)
        └── table (2)
            ├── data (3)
            └── metadata (4)
                └── 00000-fad97b4a-ffea-4707-b9cc-d083017bf482.metadata.json 

So currently if we check directory is empty help guard against the case where create tables at some existing table location. I believe with this PR, create a new table at location 0,1,2,4 will throw exception as exception since location is already in use but create at 3 will succeed if original table was created empty.

@dramaticlly dramaticlly changed the title Core: check table location for uniqueness before creation Core: check table location for conflict before creation Aug 3, 2023
@dramaticlly dramaticlly changed the title Core: check table location for conflict before creation Core: check location for conflict before creating table Aug 3, 2023
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 10, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add checks in create/alter table if table location conflicts
4 participants