-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define behavior of gc.enabled and location ownership #4159
Comments
I agree with the approach of defining gc.enabled behavior more clearly and adding owned locations. If a file is in an owned location, then the table should always be able to delete it (though it may not choose to based on catalog policy). |
I agree with that but I'd phrase it slightly differently. It see it more like a consequence. I'd say That being said, if |
For expire snapshots, I think it's possible that we do not delete any manifest or data files in the case that those files are still referenced by later snapshots. If we don't allow expiring snapshots when gc.enabled is false, we would prevent users from performing the metadata update to remove snapshots which may be limiting. Should we differentiate between performing the metadata operation to expire snapshots and performing the following deletion of unreachable files due to the expiration? Tbh I can't really think of a solid use case where someone would want this differentiation for expiring snapshots, but just bringing it up while we're having this discussion.Please let me know if I misunderstood something! |
Thanks for raising this thread! I was also planning to raise the same discussion but had some thoughts to finalize. I have been thinking about this question quite a lot recently, here are my current thoughts: 1. gc.enabledI think as of today almost all the Iceberg users that tune this parameter are interpreting I think there are only 2 cases we need to consider:
For 1, we should follow the definition of 2. Remove metadata files by configThe current behavior in
Could you provide some use cases where this is useful in addition to the recovery as a Hive table? 3. Table location ownershipAt a first glance, defining table prefix location is the best way to proceed forward, but when I think more, I start to realize that it defining ownership for a set of location prefixes in a table is not really needed for the use cases we want to achieve. Here are the 2 big use cases I considered so far: remove orphan filesThe fact that remove orphan files needs root location definition seems to be a circular argument. We exposes the action I think the correct way to run remove orphan files is to do it for the entire warehouse. I talked about my idea a bit in https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645203709220099. Most storage services have the ability to provide the full listing of files and it can be delivered much more efficiently than a ListFiles API, e.g. S3 inventory list. And Iceberg provides an efficient way to query Iceberg files metadata through system table. That means we can perform an efficient distributed join and find out the orphan files of the entire warehouse. I think that’s all what the data warehouse admin needs if we provide the feature. This is basically talking about the table access permissionAnother use case of table root location definition I can think of is for table access control. Admin might configure user access based on the table locations in storage. However, using file path access control to achieve that is just a storage-specific implementation which does not necessarily need to be true to support Iceberg table access management. For example, in S3 people can tag all the files it writes, and control access of that using S3 bucket policy. This allows multiple Iceberg tables to store files under the same bucket, but access control is still intact. Therefore from security perspective, table root path based file ownership should just be a type of ownership mode. The problem of declaring table location ownershipFrom S3 object storage mode usage reports from customers, most people still want to store data files in the same shared root bucket to minimize throttling as long as the above 2 issues I describe can be solved. If an exclusive root location has to be declared for each table, then files of the table has to share certain part of the S3 prefix, and the throttling issue comes back, where a table not accessed for a year is guaranteed to be in a cold S3 partition and get throttled heavily when new traffic comes. Maybe S3 can in the future solve the cold partition issue with some keep-warm feature, but it will come with a pricing cost for sure, and my biggest concern is that people will start to build tools that only work for tables with declared root locations (we already see this in Trino), which creates difference in table behavior at spec level that optimizes for a certain storage layout. I don't know what's the future of storage, maybe a new product will come in the market with a completely different access pattern. Currently Iceberg's minimum storage feature requirement is very resilient to integrate with any storage, adding the concept of prefix ownership is basically preferring a file-system like storage. There are also some areas that I have not finalized my thoughts yet, such as when data is replicated, should the replicated locations, quick access layer locations, archival locations also be declared as owned in Iceberg, and how should we provide tools to continuously track down those aspects. This feels to me like too much storage implementation details to handle from the table spec layer. I think it would be better to delegate to FileIOs to tackle those issues, as that is the actual vendor integration layer provided by Iceberg, where people choose which storage to use and what fits their bill, and storage frameworks and products can compete with the same set of rules. At least all the features I described for S3 are planned to be added, and I think there are GCS and HDFS equivalents that people can implement if feeling the need to reach feature parity. One valuable thing to add in the Iceberg spec is the list (or set?) of all the table locations used. I think that could be used by a specific storage to do whatever is needed based on the information, such as removing all data in all directory. But in general we should be cautious about saying that all the locations are owned by a table. |
Sorry, was distracted. Let me catch up. |
That's generally true but there is a valid use case for keeping data files: you may use SNAPSHOT command in Spark that would create an Iceberg table pointing to non-Iceberg files. Those referenced data files may belong to other prod Hive tables. We allow to create Iceberg metadata to play with that data but we can't remove the imported data files. It may corrupt original tables.
I hear your point and I think both use cases are valid. You are right, if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is by getting a list of all files under that prefix and querying systems tables for all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would be also quite common. I can be convinced otherwise. I'd be interested to hear more from other folks too. |
As @aokolnychyi suggested in 3056, we use
I think this is necessary. We should unify the built-in catalog behavior of the drop table [purge]. And maybe need to define the interface to support some parallel operations (by leveraging distributed engine, such as spark/flink/more). |
Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality? |
This is an interesting idea I also considered at one point, having the option to provide the whole listing for S3-based storage bucket, filter out for all locations owned by known tables, and run remove orphans on those locations? If you know ahead of time your 'bucket' is majority files iceberg tables, might be less expensive overall (though this sounds like a single monster job if we are bottlenecked at the speed of physically deleting the files). Another consideration is that system like S3 inventory will be stale in order of hours or days. Though it still seems to go back to the the main question, how to define the "owned locations" of a table? (Sorry, please let me know if there is discussion on it already I missed). That would be great to have, one thing I struggled in the past is "alter table location", if the previous locations can somehow be saved in a growing list it could allow to run more complete orphan removal jobs. |
drop or purge thinks should be the same mean for the path-based tables. In traditional database, the drop will move the table into the recycle bin and could restore the table with some command. However, purge will clean all data and cannot restore it. Maybe we should refer to such a design? |
Thanks for this thread.
This makes sense but we should define what this list means. There are some questions that come to my mind.
Incidentally, most of these questions also came up in the dev discussion on relative paths/DR strategy. A single location prefix support will be added as part of Supporting multiple roots is mentioned in Coming back to the problem, assuming we have such a list of owned table locations, how would we make sure all the files in these locations are indeed owned by Iceberg table? In other words, can we safely delete |
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' |
The question about location ownership and file removal comes up in a lot of discussions. See here for an example.
Right now, our interpretation of
gc.enabled
is not consistent.HadoopTables
ignores it while purging files.CatalogUtil
keeps data files ifgc.enabled
is true while purging files and removes all metadata files.DeleteReachableFiles
and other actions throw an exception ifgc.enabled
is true.I think we should make the behavior consistent. My initial thoughts are below.
gc.enabled
to prohibit dangerous actions. For example,DeleteOrphanFiles
should throw an exception if garbage collection is disabled. Once we know what locations owned by the table, we can reconsider that check.Here is a list of places that may physically remove files.
Expire snapshots
It should not be possible to expire snapshots if
gc.enabled
is false.Delete orphan files
For now, we should continue to throw an exception if
gc.enabled
is false. Once we know what prefixes are owned by the table, we can allow removal of orphan files in locations that are owned by the table.Delete reachable files
It shouldn't be possible to delete data and delete files if garbage collection is disabled. However, we may consider allowing removal of metadata when
gc.enabled
is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.Drop and purge tables
I think it should match the removal of reachable files and be consistent in all APIs. Once we know locations owned by the table, we may drop them too.
cc @jackye1995 @rdblue @pvary @RussellSpitzer @flyrain @szehon-ho @danielcweeks @karuppayya @SinghAsDev
The text was updated successfully, but these errors were encountered: