
write iceberg tables on filesystem destination #1996

Open · rudolfix opened this issue Oct 28, 2024 · 1 comment

rudolfix (Collaborator) commented Oct 28, 2024

Background

We aim to support backend-less and server-less writing of iceberg tables. We'd like to do that in a similar way to delta tables: make the `iceberg` table_format recognized by the filesystem destination. From the user's PoV this means (see the usage sketch after the list below):

  • writing and reading iceberg tables without a query engine as a separate backend
  • maintaining and evolving the schema without a catalog as a separate backend
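
A minimal sketch of what this could look like from the user side, modeled on the existing delta support. The `table_format="iceberg"` value is what this issue proposes; the resource, pipeline, and dataset names are made up:

```python
import dlt

# Hypothetical user-facing usage, modeled on table_format="delta":
# the resource declares the table format and the filesystem destination
# writes an iceberg table instead of plain parquet files.
@dlt.resource(table_format="iceberg")
def events():
    yield [{"id": 1, "name": "login"}, {"id": 2, "name": "logout"}]

pipeline = dlt.pipeline(
    pipeline_name="iceberg_demo",
    destination="filesystem",
    dataset_name="raw",
)
pipeline.run(events())
```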

We want to use pyiceberg. This limits the write dispositions to append and replace (until upsert is implemented in pyiceberg). We also won't support vacuum, compact, or z-order ops on the tables.
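
For reference, this is roughly what the two supported write dispositions map to in pyiceberg. The catalog name, warehouse path, and table identifier are illustrative, not part of any agreed layout:

```python
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# illustrative only: local sqlite catalog and local warehouse
catalog = SqlCatalog(
    "technical",
    uri="sqlite:////tmp/catalog.db",
    warehouse="file:///tmp/warehouse",
)
catalog.create_namespace("raw")

schema = Schema(
    NestedField(1, "id", LongType(), required=False),
    NestedField(2, "name", StringType(), required=False),
)
table = catalog.create_table("raw.events", schema=schema)

data = pa.table({"id": [1, 2], "name": ["login", "logout"]})
table.append(data)     # maps to the "append" write disposition
table.overwrite(data)  # maps to the "replace" write disposition
```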

Tasks

  • we maintain a "technical" catalog: one SQLite file per table, stored together with the data
  • to write a table, we lock the SQLite file with TransactionalFile, pull it locally, use it with pyiceberg, and then write it back (see the first sketch after this list)
  • use pyiceberg to append to and replace tables, create partitions, do schema evolution, etc.
  • support all buckets via fsspec
  • like for delta, expose the pyiceberg table object for a given table: read-only (catalog without a lock) and r/w with a lock on the catalog (maybe via a context manager). This will allow people to e.g. delete or rebuild partitions on a table.
  • support the filesystem sql_client to create views on ICEBERG tables via duckdb (see the duckdb sketch below)
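
A sketch of the lock-pull-modify-push flow from the second task. TransactionalFile is dlt's filesystem lock primitive; its exact API is not pinned down here, so the lock calls are shown as comments, and all paths are made up:

```python
import fsspec
from pyiceberg.catalog.sql import SqlCatalog

REMOTE_CATALOG = "s3://bucket/raw/events/catalog.db"  # hypothetical layout
LOCAL_CATALOG = "/tmp/events_catalog.db"

fs = fsspec.filesystem("s3")  # requires s3fs

# 1. lock the remote catalog file, e.g. with dlt's TransactionalFile
#    (exact call is an assumption, shown for illustration only)

# 2. pull the SQLite catalog locally
fs.get(REMOTE_CATALOG, LOCAL_CATALOG)

# 3. open it with pyiceberg and mutate the table
catalog = SqlCatalog("technical", uri=f"sqlite:///{LOCAL_CATALOG}")
table = catalog.load_table("raw.events")
# ... append / overwrite / evolve schema here ...

# 4. write the catalog back, then release the lock
fs.put(LOCAL_CATALOG, REMOTE_CATALOG)
```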
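The duckdb side of the last task could look roughly like this, using duckdb's iceberg extension. The metadata file path is hypothetical, and real s3 access would also need the httpfs extension and credentials:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# create a view over the table's current metadata file (path is made up)
con.execute("""
    CREATE VIEW events AS
    SELECT * FROM iceberg_scan('s3://bucket/raw/events/metadata/v2.metadata.json')
""")
print(con.execute("SELECT count(*) FROM events").fetchone())
```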
jorritsandbrink (Collaborator) commented:

@rudolfix

  1. perhaps we can use an in-memory SQLite database instead of persisting the file to disk (sketch below)
    • if I understand correctly, at its core the catalog only maps table names to table metadata (which lives on the filesystem), so we can populate the in-memory SQLite database with this mapping based on dlt metadata
  2. perhaps Iceberg's optimistic concurrency makes locking unnecessary
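
A sketch of that in-memory variant, assuming pyiceberg's SqlCatalog and its register_table method. The metadata location would come from dlt's own metadata and is made up here:

```python
from pyiceberg.catalog.sql import SqlCatalog

# ephemeral catalog: nothing persisted, rebuilt on every access
catalog = SqlCatalog("ephemeral", uri="sqlite:///:memory:")
catalog.create_namespace("raw")

# point the catalog at table metadata that already lives on the
# filesystem; this location would be taken from dlt metadata
table = catalog.register_table(
    "raw.events",
    metadata_location="s3://bucket/raw/events/metadata/v2.metadata.json",
)
table.scan().to_arrow()  # read without any persisted catalog file
```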

Labels: none · Projects: Status: Planned · 2 participants