Questions around Iceberg-rust #450

ChristianCasazza · 2024-07-09T16:24:07Z

Hello, I had some questions around Iceberg-rust regarding data interactions with S3, authn, and authz.

How does connecting an Iceberg catalog with a specific S3 bucket work? I understand the structure on S3 with dividing a table into parquet data files and avro metadata files, but I am not sure how the relationship between this file organization and a deployed catalog works, and how to configure that exactly.
Where does Pyiceberg fit into Iceberg-rust? Would it be possible to deploy Iceberg-rust on the server side, and interact with the rest catalog through Pyiceberg? I like python as a nice interface for data consumers to interact with a catalog, and for basic management of tables.
What are the write table options with an Iceberg rust? As of now, is it only possible with a distributed engine like Spark or Trino? What would be the bottlenecks to duckdb, polars, or Ibis+backend writes? The vast majority of my datasets are less than 50Gb currently, and most workloads a fraction of that. I would like to use Iceberg for its superior data management vs files, but initially for use cases that can mostly be done on a single node and don't really need the power of distributed engines.
How does authentication and authorization work with the current Iceberg-rust? The access control system I described above works for AWS S3 and sharing files. Any pointers about where I could learn to integrate IAM permissions into a catalog and tables? It seems the creators of https://github.com/hansetag/iceberg-catalog are in the middle of implementing some of these exact features. I would love to contribute on these features and implement for my use case. It seems the way it works where non-AWS credentials are vended to consumers, and the catalog uses AWS credentials to sign S3 requests for the users, but I am not sure. I am also not sure how this implementation compares with the open-sourced implementation released by Databricks.
Where exactly does OpenDAL fit into the Iceberg-rust catalog? Would OpenDAL help standardize accessing data from the catalog? The custom metadata Tracking issues of user metadata support opendal#4842 feature could also be useful for connecting tables to different authz commands.

Xuanwo · 2024-07-10T06:27:20Z

How does connecting an Iceberg catalog with a specific S3 bucket work?

Catalog will manage the bucket by their own. For example, to start a demo rest catalog, your can use:

iceberg-rust/crates/catalog/rest/testdata/rest_catalog/docker-compose.yaml

Lines 21 to 37 in 3a947fa

    
           rest: 
        
             image: tabulario/iceberg-rest:0.10.0 
        
             environment: 
        
               - AWS_ACCESS_KEY_ID=admin 
        
               - AWS_SECRET_ACCESS_KEY=password 
        
               - AWS_REGION=us-east-1 
        
               - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog 
        
               - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory 
        
               - CATALOG_WAREHOUSE=s3://icebergdata/demo 
        
               - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO 
        
               - CATALOG_S3_ENDPOINT=http://minio:9000 
        
             depends_on: 
        
               - minio 
        
             links: 
        
               - minio:icebergdata.minio 
        
             expose: 
        
               - 8181

Where does Pyiceberg fit into Iceberg-rust? Would it be possible to deploy Iceberg-rust on the server side, and interact with the rest catalog through Pyiceberg?

Both pyiceberg and iceberg-rust implement the Iceberg specification. You can write an Iceberg table with iceberg-rust and then read it using pyiceberg.

What are the write table options with an iceberg rust?

iceberg-rust has not yet fully implemented support for DataFusion. Therefore, users will need to utilize the Rust APIs we provide to manipulate tables.

How does authentication and authorization work with the current Iceberg-rust?

Iceberg-rust does not directly handle authentication or authorization. Typically, these functions are managed by catalog implementations. The community is moving to deprecate the token endpoint and adopt the OAuth2 specification (not decided yet).

I am also not sure how this implementation compares with the open-sourced implementation released by Databricks.

There should be no difference in iceberg-rust as long as it adheres to the iceberg-rest catalog specifications.

Where exactly does OpenDAL fit into the Iceberg-rust catalog?

OpenDAL serves as the core file IO for iceberg-rust and does not directly interact with the catalog. However, it's possible to implement an iceberg catalog based on opendal.

liurenjie1024 · 2024-07-10T13:37:59Z

How does connecting an Iceberg catalog with a specific S3 bucket work? I understand the structure on S3 with dividing a table into parquet data files and avro metadata files, but I am not sure how the relationship between this file organization and a deployed catalog works, and how to configure that exactly.

It depends on what catalog you are using. For hms/glue catalog, which could be classified as client side catalog, you need to setup hive metastore or glue server, and pass warehouse configuration to catalog builder. For rest catalog, it's the rest catalog server's responsibility to manage the location.

Where does Pyiceberg fit into Iceberg-rust? Would it be possible to deploy Iceberg-rust on the server side, and interact with the rest catalog through Pyiceberg? I like python as a nice interface for data consumers to interact with a catalog, and for basic management of tables.

Currently there is no relationship between these two libraries, and they are just iceberg implementation in different languages. iceberg-rust is a library, so you can use it in a server, but you need to write server code by yourself. Since pyiceberg and iceberg-rust both implement iceberg spec, so you can in theory use iceberg-rust to write data into iceberg table, and use pyiceberg to read them, and vice verse.

What are the write table options with an Iceberg rust? As of now, is it only possible with a distributed engine like Spark or Trino? What would be the bottlenecks to duckdb, polars, or Ibis+backend writes? The vast majority of my datasets are less than 50Gb currently, and most workloads a fraction of that. I would like to use Iceberg for its superior data management vs files, but initially for use cases that can mostly be done on a single node and don't really need the power of distributed engines.

Currently iceberg-rust has not implemented writing to table yet. The community focuses on reading support in recent releases.

@Xuanwo 's answer about other parts are great, and I don't have much to add.

Xuanwo added the question Further information is requested label Jul 10, 2024

ChristianCasazza closed this as completed Jul 16, 2024

apache locked and limited conversation to collaborators Jul 17, 2024

liurenjie1024 converted this issue into discussion #468 Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Questions around Iceberg-rust #450

Questions around Iceberg-rust #450

ChristianCasazza commented Jul 9, 2024

Xuanwo commented Jul 10, 2024

liurenjie1024 commented Jul 10, 2024 •

edited

Loading

This issue was moved to a discussion.

This issue was moved to a discussion.

Questions around Iceberg-rust #450

Questions around Iceberg-rust #450

Comments

ChristianCasazza commented Jul 9, 2024

Xuanwo commented Jul 10, 2024

liurenjie1024 commented Jul 10, 2024 • edited Loading

This issue was moved to a discussion.

liurenjie1024 commented Jul 10, 2024 •

edited

Loading