[DOCS] Add documentation for read_iceberg (#1769)
1. Add docstring
2. Add API docs
3. Add a new section in `user_guide/integrations` for `data_catalogs`

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
jaychia and Jay Chia authored Jan 9, 2024
1 parent fd662c1 commit a994c7b
Showing 6 changed files with 119 additions and 23 deletions.
24 changes: 24 additions & 0 deletions daft/io/_iceberg.py
@@ -70,6 +70,30 @@ def read_iceberg(
    pyiceberg_table: "PyIcebergTable",
    io_config: Optional["IOConfig"] = None,
) -> DataFrame:
    """Create a DataFrame from an Iceberg table

    Example:
        >>> import pyiceberg.table
        >>>
        >>> pyiceberg_table = pyiceberg.table.Table(...)
        >>> df = daft.read_iceberg(pyiceberg_table)
        >>>
        >>> # Filters on this dataframe can now be pushed into
        >>> # the read operation from Iceberg
        >>> df = df.where(df["foo"] > 5)
        >>> df.show()

    .. NOTE::
        This function requires the use of `PyIceberg <https://py.iceberg.apache.org/>`_, which is Apache Iceberg's
        official project for Python.

    Args:
        pyiceberg_table: Iceberg table created using the PyIceberg library
        io_config: A custom IOConfig to use when accessing Iceberg object storage data. Defaults to None.

    Returns:
        DataFrame: a DataFrame with the schema converted from the specified Iceberg table
    """
    from daft.iceberg.iceberg_scan import IcebergScanOperator

    io_config = (
58 changes: 35 additions & 23 deletions docs/source/api_docs/creation.rst
@@ -20,71 +20,83 @@ Python Objects
     from_pylist
     from_pydict
 
-Arrow
-~~~~~
+Files
+-----
+
+.. _df-io-files:
+
+Parquet
+~~~~~~~
+
+.. _daft-read-parquet:
+
+.. autosummary::
+    :nosignatures:
+    :toctree: doc_gen/io_functions
+
+    read_parquet
+
+CSV
+~~~
 
 .. autosummary::
     :nosignatures:
     :toctree: doc_gen/io_functions
 
-    from_arrow
+    read_csv
 
-Pandas
-~~~~~~
+JSON
+~~~~
 
 .. autosummary::
     :nosignatures:
     :toctree: doc_gen/io_functions
 
-    from_pandas
+    read_json
 
-File Paths
-~~~~~~~~~~
+Data Catalogs
+-------------
+
+Apache Iceberg
+^^^^^^^^^^^^^^
 
 .. autosummary::
     :nosignatures:
     :toctree: doc_gen/io_functions
 
-    from_glob_path
+    read_iceberg
 
-Files
------
-
-.. _df-io-files:
-
-Parquet
-~~~~~~~
-
-.. _daft-read-parquet:
-
-.. autosummary::
-    :nosignatures:
-    :toctree: doc_gen/io_functions
-
-    read_parquet
+Arrow
+~~~~~
+
+.. autosummary::
+    :nosignatures:
+    :toctree: doc_gen/io_functions
+
+    from_arrow
 
-CSV
-~~~
+Pandas
+~~~~~~
 
 .. autosummary::
     :nosignatures:
     :toctree: doc_gen/io_functions
 
-    read_csv
+    from_pandas
 
-JSON
-~~~~
+File Paths
+~~~~~~~~~~
 
 .. autosummary::
     :nosignatures:
     :toctree: doc_gen/io_functions
 
-    read_json
+    from_glob_path
 
 Integrations
 ------------
5 changes: 5 additions & 0 deletions docs/source/user_guide/basic_concepts/read-and-write.rst
@@ -34,6 +34,11 @@ Daft supports file paths to a single file, a directory of files, and wildcards.
To learn more about each of these constructors, as well as the options that they support, consult the API documentation on :ref:`creating DataFrames from files <df-io-files>`.

From Data Catalogs
^^^^^^^^^^^^^^^^^^

If you use catalogs such as Apache Iceberg or Hive, you may wish to consult our user guide on integrations with Data Catalogs: :doc:`Daft integration with Data Catalogs <../integrations/data_catalogs>`.

From File Paths
^^^^^^^^^^^^^^^

6 changes: 6 additions & 0 deletions docs/source/user_guide/index.rst
@@ -9,6 +9,7 @@ Daft User Guide
    basic_concepts
    daft_in_depth
    poweruser
    integrations
    tutorials

Welcome to **Daft**!
@@ -61,6 +62,11 @@ Core Daft concepts all Daft users will find useful to understand deeply.

Become a true Daft Poweruser! This section explores advanced topics to help you configure Daft for specific application environments, improve reliability, and optimize for performance.

:doc:`Integrations <integrations>`
**********************************

Learn how to use Daft's integrations with other technologies such as Ray Datasets or Apache Iceberg.

:doc:`Tutorials <tutorials>`
****************************

6 changes: 6 additions & 0 deletions docs/source/user_guide/integrations.rst
@@ -0,0 +1,6 @@
Integrations
============

.. toctree::

    integrations/data_catalogs
43 changes: 43 additions & 0 deletions docs/source/user_guide/integrations/data_catalogs.rst
@@ -0,0 +1,43 @@
Data Catalogs
=============

**Data Catalogs** are services that provide access to **Tables** of data. **Tables** are powerful abstractions for large datasets in storage, providing many benefits over naively storing data as just a bunch of CSV/Parquet files.

There are many different **Table Formats** employed by Data Catalogs. These table formats differ in implementation and capabilities, but often provide advantages such as:

1. **Schema:** what data do these files contain?
2. **Partitioning Specification:** how is the data organized?
3. **Statistics/Metadata:** how many rows does each file contain, and what are the min/max values of each file's columns?
4. **ACID compliance:** updates to the table are atomic

.. NOTE::
    The names of Table Formats and their Data Catalogs are often used interchangeably.

    For example, "Apache Iceberg" often refers to both the Data Catalog and its Table Format.

    You can retrieve an **Apache Iceberg Table** from an **Apache Iceberg REST Data Catalog**.

    However, some Data Catalogs allow for many different underlying Table Formats. For example, you can request either an **Apache Iceberg Table** or a **Hive Table** from an **AWS Glue Data Catalog**.

Why use Data Catalogs?
----------------------

Daft can effectively leverage the statistics and metadata provided by these Data Catalogs' Tables to dramatically speed up queries.

This is accomplished by techniques such as:

1. **Partition pruning:** skip files whose partition values cannot match a query's filter predicates (see the sketch after this list)
2. **Schema retrieval:** convert the schema provided by the Data Catalog into a Daft schema instead of sampling a schema from the data
3. **Metadata execution:** utilize metadata such as row counts to read the bare minimum amount of data necessary from storage
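
For example, here is a minimal sketch of partition pruning, assuming an Iceberg table partitioned on a string column ``region`` and already loaded as ``pyiceberg_table`` via PyIceberg (the column name and value are hypothetical placeholders):

.. code:: python

    import daft

    # `pyiceberg_table` is a PyIceberg Table object, assumed here to be
    # partitioned on its `region` column
    df = daft.read_iceberg(pyiceberg_table)

    # This predicate can be pushed into the Iceberg scan: files whose `region`
    # partition value cannot satisfy it are pruned and never read from storage
    df = df.where(df["region"] == "us-west-2")
    df.show()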

Data Catalog Integrations
-------------------------

Apache Iceberg
^^^^^^^^^^^^^^

Apache Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.

To read from the Apache Iceberg table format, use the :func:`daft.read_iceberg` function.

We integrate closely with `PyIceberg <https://py.iceberg.apache.org/>`_ (the official Python implementation of Apache Iceberg) and support reading Daft dataframes from PyIceberg's Table objects.
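
As a minimal end-to-end sketch (the catalog name ``my_catalog`` and the table identifier ``my_namespace.my_table`` below are hypothetical placeholders, assuming a catalog already configured for PyIceberg):

.. code:: python

    import daft
    from pyiceberg.catalog import load_catalog

    # Load a table through PyIceberg (catalog and table names are placeholders)
    catalog = load_catalog("my_catalog")
    pyiceberg_table = catalog.load_table("my_namespace.my_table")

    # Create a Daft DataFrame from the PyIceberg Table; Daft picks up the
    # table's schema from the catalog instead of sampling it from the data
    df = daft.read_iceberg(pyiceberg_table)
    df.show()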
