[DOCS] Add documentation for read_iceberg #1769

Merged: 2 commits, Jan 9, 2024
24 changes: 24 additions & 0 deletions daft/io/_iceberg.py
@@ -70,6 +70,30 @@ def read_iceberg(
pyiceberg_table: "PyIcebergTable",
io_config: Optional["IOConfig"] = None,
) -> DataFrame:
"""Create a DataFrame from an Iceberg table

Example:
>>> import daft
>>> from pyiceberg.table import Table
>>>
>>> pyiceberg_table = Table(...)
>>> df = daft.read_iceberg(pyiceberg_table)
>>>
>>> # Filters on this dataframe can now be pushed into
>>> # the read operation from Iceberg
>>> df = df.where(df["foo"] > 5)
>>> df.show()

.. NOTE::
This function requires the use of `PyIceberg <https://py.iceberg.apache.org/>`_, which is Apache Iceberg's
official project for Python.

Args:
pyiceberg_table: Iceberg table created using the PyIceberg library
io_config: A custom IOConfig to use when accessing Iceberg object storage data. Defaults to None.

Returns:
DataFrame: a DataFrame with the schema converted from the specified Iceberg table
"""
from daft.iceberg.iceberg_scan import IcebergScanOperator

io_config = (
58 changes: 35 additions & 23 deletions docs/source/api_docs/creation.rst
@@ -20,71 +20,83 @@ Python Objects
from_pylist
from_pydict

Arrow
~~~~~
Files
-----

.. _df-io-files:

Parquet
~~~~~~~

.. _daft-read-parquet:

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_parquet

CSV
~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

from_arrow
read_csv

Pandas
~~~~~~
JSON
~~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

from_pandas
read_json

File Paths
~~~~~~~~~~
Data Catalogs
-------------

Apache Iceberg
^^^^^^^^^^^^^^

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

from_glob_path
read_iceberg

Files
-----

.. _df-io-files:
Arrow
~~~~~

Parquet
~~~~~~~
.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

.. _daft-read-parquet:

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_parquet
from_arrow

CSV
~~~
Pandas
~~~~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_csv
from_pandas

JSON
~~~~
File Paths
~~~~~~~~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_json
from_glob_path

Integrations
------------
5 changes: 5 additions & 0 deletions docs/source/user_guide/basic_concepts/read-and-write.rst
@@ -34,6 +34,11 @@ Daft supports file paths to a single file, a directory of files, and wildcards.
To learn more about each of these constructors, as well as the options that they support, consult the API documentation on :ref:`creating DataFrames from files <df-io-files>`.

From Data Catalogs
^^^^^^^^^^^^^^^^^^

If you use catalogs such as Apache Iceberg or Hive, you may wish to consult our user guide on integrations with Data Catalogs: :doc:`Daft integration with Data Catalogs <../integrations/data_catalogs>`.

From File Paths
^^^^^^^^^^^^^^^

6 changes: 6 additions & 0 deletions docs/source/user_guide/index.rst
@@ -9,6 +9,7 @@ Daft User Guide
basic_concepts
daft_in_depth
poweruser
integrations
tutorials

Welcome to **Daft**!
@@ -61,6 +62,11 @@ Core Daft concepts all Daft users will find useful to understand deeply.

Become a true Daft Poweruser! This section explores advanced topics to help you configure Daft for specific application environments, improve reliability and optimize for performance.

:doc:`Integrations <integrations>`
**********************************

Learn how to use Daft's integrations with other technologies such as Ray Datasets or Apache Iceberg.

:doc:`Tutorials <tutorials>`
****************************

6 changes: 6 additions & 0 deletions docs/source/user_guide/integrations.rst
@@ -0,0 +1,6 @@
Integrations
============

.. toctree::

integrations/data_catalogs
43 changes: 43 additions & 0 deletions docs/source/user_guide/integrations/data_catalogs.rst
@@ -0,0 +1,43 @@
Data Catalogs
=============

**Data Catalogs** are services that provide access to **Tables** of data. **Tables** are powerful abstractions for large datasets in storage, providing many benefits over naively storing data as just a bunch of CSV/Parquet files.

There are many different **Table Formats** that are employed by Data Catalogs. These table formats differ in implementation and capabilities, but often provide advantages such as:

1. **Schema:** what data do these files contain?
2. **Partitioning Specification:** how is the data organized?
3. **Statistics/Metadata:** how many rows does each file contain, and what are the min/max values of each file's columns?
4. **ACID compliance:** updates to the table are atomic

.. NOTE::
The names of Table Formats and their Data Catalogs are often used interchangeably.

For example, "Apache Iceberg" often refers to both the Data Catalog and its Table Format.

You can retrieve an **Apache Iceberg Table** from an **Apache Iceberg REST Data Catalog**.

However, some Data Catalogs allow for many different underlying Table Formats. For example, you can request either an **Apache Iceberg Table** or a **Hive Table** from an **AWS Glue Data Catalog**.

Why use Data Catalogs?
----------------------

Daft can effectively leverage the statistics and metadata provided by these Data Catalogs' Tables to dramatically speed up queries.

This is accomplished by techniques such as the following, illustrated in the sketch after this list:

1. **Partition pruning:** ignore files whose partition values don't match filter predicates
2. **Schema retrieval:** convert the schema provided by the data catalog into a Daft schema instead of sampling a schema from the data
3. **Metadata execution:** utilize metadata such as row counts to read the bare minimum amount of data necessary from storage
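
For example, a filter on a partition column can be pushed into the Iceberg scan rather than applied after the data is loaded. The snippet below is a minimal sketch, assuming a PyIceberg catalog is available; the catalog name ``my_catalog``, the table identifier ``ns.events``, and the partition column ``date`` are hypothetical placeholders.

.. code:: python

    import daft
    from pyiceberg.catalog import load_catalog

    # Load a table through a PyIceberg catalog (hypothetical names)
    pyiceberg_table = load_catalog("my_catalog").load_table("ns.events")

    # Daft converts the catalog-provided schema instead of sampling the files
    df = daft.read_iceberg(pyiceberg_table)

    # A filter on a partition column lets Daft skip files whose
    # partition values cannot match the predicate
    df = df.where(df["date"] == "2024-01-09")

    # Inspect the optimized plan to see what was pushed into the scan
    df.explain()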

Data Catalog Integrations
-------------------------

Apache Iceberg
^^^^^^^^^^^^^^

Apache Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.

To read from the Apache Iceberg table format, use the :func:`daft.read_iceberg` function.

We integrate closely with `PyIceberg <https://py.iceberg.apache.org/>`_ (the official Python implementation for Apache Iceberg) and allow the reading of Daft dataframes from PyIceberg's Table objects.
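
A minimal sketch of that flow, assuming a PyIceberg catalog is configured; the catalog name ``my_catalog``, the table identifier ``ns.my_table``, and the S3 region are hypothetical placeholders:

.. code:: python

    import daft
    from daft.io import IOConfig, S3Config
    from pyiceberg.catalog import load_catalog

    # Load the table through PyIceberg (hypothetical catalog/table names)
    pyiceberg_table = load_catalog("my_catalog").load_table("ns.my_table")

    # Optionally control how Daft accesses the table's files in object storage
    io_config = IOConfig(s3=S3Config(region_name="us-west-2"))

    df = daft.read_iceberg(pyiceberg_table, io_config=io_config)
    df.show()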