
Commit f40053f

Merge pull request #15 from szarnyasg/refer-to-docs

Move documentation to the DuckDB documentation and add reference

samansmink committed Sep 25, 2023
2 parents: 51ba956 + 223d6fc

Showing 1 changed file (README.md) with 19 additions and 23 deletions.
…this extension will be updated and usable from (nightly) DuckDB releases.

This repository contains a DuckDB extension that adds support for [Apache Iceberg](https://iceberg.apache.org/). In its current state, the extension offers some basic features that allow listing snapshots and reading specific snapshots of an Iceberg table.

## Documentation

See the [Iceberg page in the DuckDB documentation](https://duckdb.org/docs/extensions/iceberg).

## Developer guide

### Dependencies

This extension has several dependencies. Currently, the main way to install them is through vcpkg. To install vcpkg, check out the docs [here](https://vcpkg.io/en/getting-started.html). Note that this extension contains a custom vcpkg port that overrides the existing `avro-cpp` port of vcpkg. The reason for this is that other versions of avro-cpp have an issue that causes problems with the Avro files produced by the Spark Iceberg extension.
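
As a rough sketch, the usual vcpkg bootstrap looks like the following; the clone location and the `VCPKG_TOOLCHAIN_PATH` variable are assumptions for illustration, not something this README prescribes:

```shell
# Minimal sketch of the standard vcpkg bootstrap. The install location and
# the VCPKG_TOOLCHAIN_PATH variable below are assumptions, not taken from
# this README.
git clone https://github.com/microsoft/vcpkg
./vcpkg/bootstrap-vcpkg.sh
# vcpkg ships its CMake toolchain file at this path inside the checkout:
export VCPKG_TOOLCHAIN_PATH="$PWD/vcpkg/scripts/buildsystems/vcpkg.cmake"
```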

### Test data generation

To generate test data, the script in `scripts/test_data_generator` is used to have Spark generate some test data. It is based on PySpark 3.4, which you can install through pip.
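
For example, one way to get a matching PySpark; the exact version pin is an assumption based on the 3.4 requirement above:

```shell
# Install a PySpark 3.4.x release via pip; the pin is an assumption
# derived from the version mentioned above.
python3 -m pip install 'pyspark==3.4.*'
```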

### Building the extension

To build the extension with vcpkg, run:

```shell
# ... (build command hidden in the collapsed diff context)
```

This will build both the separate loadable extension and a duckdb binary with the extension pre-loaded:

```shell
./build/release/extension/iceberg/iceberg.duckdb_extension
```
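
If you instead want to load the built extension into a separately installed duckdb binary, something like the following should work; the `-unsigned` flag for locally built, unsigned extensions is general DuckDB behavior, not something stated in this README:

```shell
# Sketch: load the locally built (unsigned) extension into a duckdb shell;
# -unsigned permits loading extensions that DuckDB has not signed.
echo "LOAD 'build/release/extension/iceberg/iceberg.duckdb_extension';" | duckdb -unsigned
```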

### Running iceberg queries

The easiest way is to start the duckdb binary produced by the build step: `./build/release/duckdb`. Then, for example:
```SQL
> SELECT count(*) FROM ICEBERG_SCAN('data/iceberg/lineitem_iceberg', ALLOW_MOVED_PATHS=TRUE);
51793
```
Note that for testing, the `ALLOW_MOVED_PATHS` option is available. This option performs some extra path resolution, which makes it possible to scan Iceberg tables that have been moved; this is used during testing.

```SQL
> SELECT * FROM ICEBERG_SNAPSHOTS('data/iceberg/lineitem_iceberg', ALLOW_MOVED_PATHS=TRUE);
1 3776207205136740581 2023-02-15 15:07:54.504 0 lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro
2 7635660646343998149 2023-02-15 15:08:14.73 0 lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro
```
For more examples, check the tests in the `test` directory.

### Running tests

#### Generating test data

To generate the test data, run:
```shell
make data
```
Note that the script requires python3, pyspark and the duckdb Python package to be installed. Assuming python3 is already installed, running `python3 -m pip install duckdb pyspark` should do the trick.

#### Running unit tests

```shell
make test
```

#### Running the local S3 test server

Running the S3 test cases requires the minio test server to be running and populated via `scripts/upload_iceberg_to_s3_test_server.sh`.
Note that this requires having run `make data` beforehand, as well as having the AWS CLI and docker compose installed.
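
Put together, the sequence looks roughly like this; the docker compose invocation is an assumption about the repository's setup, only the script path comes from this README:

```shell
# Hypothetical end-to-end sequence for the S3 tests; the docker compose
# step is an assumption, the script path is taken from this README.
make data                                      # generate the local test data first
docker compose up -d                           # start the minio test server
./scripts/upload_iceberg_to_s3_test_server.sh  # populate it with the test data
```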

## Acknowledgements

This extension was initially developed as part of a customer project for [RelationalAI](https://relational.ai/), who have agreed to open source the extension. We would like to thank RelationalAI for their support and for their commitment to open source, which enables us to share this extension with the community.
