data pond: expose readable datasets as dataframes and arrow tables #1507

sh-rp · 2024-06-21T13:20:05Z

Description

As an alternative to the ibis integration, we are testing whether we can build our own data reader, without too much effort, that works across all destinations.

Ticket for followup work after this PR is here: https://github.com/orgs/dlt-hub/projects/9/views/1?pane=issue&itemId=80696433

TODO

Build dataset and relation interfaces (see @rudolfix comment below)
Extend DBApiCursorImpl to support arrow tables (some native cursors support arrow)
Ensure all native cursors that have native support for arrow and pandas forward this to DBApiCursorImpl
Expose prepopulated duckdb instance from filesystem somehow? Possibly via fs_client interface
Figure out default chunk sizes and a nice interface (some cursors / databases figure out their own chunk size such as snowflake, others only return chunks in vector sizes such as duckdb)
Ensure up-to-date docstrings on all new interfaces and methods
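The chunk-size item above can be illustrated with plain DB-API calls. This is a minimal sketch, not dlt code: `iter_fetch` is a hypothetical helper name, and sqlite3 only stands in for a destination cursor. It relies solely on `cursor.fetchmany`, which DB-API 2.0 cursors provide.

```python
import sqlite3
from typing import Any, Iterator, List, Tuple


def iter_fetch(cursor: Any, chunk_size: int) -> Iterator[List[Tuple[Any, ...]]]:
    """Yield lists of at most `chunk_size` rows from a DB-API cursor.

    Hypothetical helper for illustration; not part of dlt.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        yield rows


# sqlite3 stands in for any destination that exposes a DB-API cursor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, f"item_{i}") for i in range(5)]
)

cursor = conn.execute("SELECT id, name FROM items ORDER BY id")
chunks = list(iter_fetch(cursor, chunk_size=2))
conn.close()
```

Cursors that pick their own chunk size, or that only return fixed vector sizes, would need a different strategy; the sketch only covers the case where the caller chooses the size.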

netlify · 2024-06-21T13:20:21Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`fb9a445`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/6704fad0fe6dab0008567bbd

…ystem dataframe implementation

# Conflicts: # dlt/common/destination/reference.py # dlt/destinations/sql_client.py # dlt/pipeline/pipeline.py

rudolfix

@sh-rp pls see my comments

dlt/destinations/impl/filesystem/filesystem.py

composable_pipeline_1.py

jorritsandbrink · 2024-07-29T14:31:15Z

dlt/common/destination/reference.py

+    """Add support for accessing data as arrow tables or pandas dataframes"""
+
+    @abstractmethod
+    def iter_df(


I wouldn't explicitly expose Pandas dataframes. I would only expose Arrow data structures, because the user can call to_pandas on those.

@jorritsandbrink there are some destinations that have native pandas support and I think it would be cool to be able to expose those directly for the user

jorritsandbrink · 2024-07-29T14:36:02Z

dlt/common/destination/reference.py

+    ) -> Generator[DataFrame, None, None]: ...
+
+    @abstractmethod
+    def iter_arrow(


It's probably a good idea to support pyarrow.RecordBatchReader alongside pyarrow.Table for larger-than-memory data.

Or, if possible, expose a pyarrow.Dataset, from which the user can create either a pyarrow.RecordBatchReader or pyarrow.Table.

Note: using Dataset possibly allows us to "mount" any external table as a cursor that reads data, so we do not need to download the full table locally. we may be able to, e.g., mount snowflake or bigquery tables as duckdb tables

dlt/destinations/sql_client.py

# Conflicts: # dlt/destinations/impl/filesystem/filesystem.py

remove resource based dataset accessor

…b dbapi cursor in filesystem

dlt/destinations/impl/filesystem/sql_client.py

rudolfix · 2024-10-01T15:34:49Z

dlt/destinations/impl/filesystem/sql_client.py

+            # set up connection and dataset
+            self._existing_views: List[str] = []  # remember which views already where created
+
+            self.autocreate_required_views = False


why do we need this flag, and why is it set to False here? is querying INFORMATION_SCHEMA a problem? IMO we should fix execute_query below

there is a bug in sqlglot that throws an internal exception when parsing the information schema query, and I think for now this approach is better than catching all exceptions and ignoring them in the sqlglot parsing.

sqlglot was complaining that it cannot parse parametrized queries, which is fair... this was also a good opportunity to fix things so I'm skipping parsing in this case now
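The outcome of this thread (skip parsing for parametrized queries, ignore tables from other schemas, remember which views exist) can be sketched as below. This is an illustration under stated assumptions, not the code from the PR: `referenced_tables` and `views_to_create` are hypothetical names, a naive regex stands in for sqlglot, and unqualified table names are assumed to belong to the dataset.

```python
import re
from typing import List, Optional, Sequence, Set, Tuple

# Naive stand-in for a real SQL parser such as sqlglot: finds `schema.table`
# or bare `table` names after FROM / JOIN. Good enough for this sketch only.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+(?:(\w+)\.)?(\w+)", re.IGNORECASE)


def referenced_tables(query: str) -> List[Tuple[Optional[str], str]]:
    """Return (schema, table) pairs; schema is None for unqualified names."""
    return [(schema or None, table) for schema, table in _TABLE_REF.findall(query)]


def views_to_create(
    query: str,
    args: Sequence[object],
    dataset_name: str,
    existing_views: Set[str],
) -> List[str]:
    """Return table names that still need a view before `query` can run."""
    if args:
        # Parametrized query: skip parsing entirely.
        return []
    needed: List[str] = []
    for schema, table in referenced_tables(query):
        if schema is not None and schema != dataset_name:
            continue  # table from a foreign schema, e.g. information_schema
        if table not in existing_views and table not in needed:
            needed.append(table)
    return needed


# "items" already has a view, so only "orders" is reported.
missing = views_to_create(
    "SELECT * FROM my_data.items JOIN my_data.orders ON 1 = 1",
    (),
    "my_data",
    {"items"},
)
```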

dlt/destinations/impl/filesystem/sql_client.py

tests/load/test_read_interfaces.py

rudolfix · 2024-10-01T16:40:19Z

tests/load/test_read_interfaces.py


        # now we can use the filesystemsql client to create the needed views
+        fs_client: Any = pipeline.destination_client()


the interface is already there: it is called update_stored_schema on the job client. in this case filesystem is a staging destination for duckdb, and you create all tables as views, pass the credentials etc. maybe we need a convenience method that creates a dataset instance out of such a structure (so we take not just the destination but also staging as input to the dataset).

the nice thing is that all this duckdb related code that creates views and does permission handover could go to duckdb client.

#1692

this is a good followup ticket

dlt/pipeline/pipeline.py

# Conflicts: # docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md

rudolfix

code is good now but we are missing a few tests specific to sql_client for filesystem:

make sure you can "open_connection" several times on sql_client
pass an external connection, e.g. to a persisted duckdb, try to create a few views, then use them from the existing connection (or after reopening the persistent db)
test that we skip creating views for tables from "foreign" schemas (not the dataset), e.g. by querying a known table with a mismatched schema

ideally we'd move the checks in test_read_interfaces.py that are done only for filesystem to the specific tests. we can do it later as well

…t a few more things

sh-rp · 2024-10-07T11:56:56Z

@rudolfix I think all is addressed in the additional commit I just made. I'm not quite sure what you mean by testing open_connection several times. I am now testing this by opening the sql_client context on the filesystem a few times in a row, but there is no test for parallel access. if that is what you need lmk.
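The "open the context a few times in a row" check described above could look roughly like this. It is a sketch with a hypothetical `ViewSqlClient` built on sqlite3, not the filesystem sql_client from the PR, and it assumes that cached view names must be reset whenever a fresh in-memory connection is opened.

```python
import sqlite3
from typing import List, Optional


class ViewSqlClient:
    """Hypothetical stand-in for a view-creating sql_client; not dlt code."""

    def __init__(self) -> None:
        self._conn: Optional[sqlite3.Connection] = None
        self.existing_views: List[str] = []

    def open_connection(self) -> sqlite3.Connection:
        # Each in-memory database starts empty, so cached view names are reset.
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE raw_items (id INTEGER)")
        self._conn.execute("INSERT INTO raw_items VALUES (1), (2)")
        self.existing_views = []
        return self._conn

    def close_connection(self) -> None:
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def __enter__(self) -> "ViewSqlClient":
        self.open_connection()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close_connection()

    def count_rows(self, view_name: str) -> int:
        assert self._conn is not None, "connection is not open"
        if view_name not in self.existing_views:
            # Create the view lazily, the first time it is queried.
            self._conn.execute(f"CREATE VIEW {view_name} AS SELECT * FROM raw_items")
            self.existing_views.append(view_name)
        return self._conn.execute(f"SELECT count(*) FROM {view_name}").fetchone()[0]


def test_open_connection_several_times() -> None:
    client = ViewSqlClient()
    for _ in range(3):
        with client:
            assert client.count_rows("items") == 2
            # The second call reuses the view instead of recreating it.
            assert client.count_rows("items") == 2
            assert client.existing_views == ["items"]


test_open_connection_several_times()
```

Parallel access, which the comment notes is untested, is outside the scope of this sketch.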

rudolfix

LGTM!

sh-rp added 6 commits June 19, 2024 12:39

add simple ibis helper

af6a40e

start working on dataframe reading interface

3a69ece

a bit more work

4324650

first simple implementation

7c960df

small change

86b89ac

more work on dataset

5a8ea54

sh-rp linked an issue Jun 21, 2024 that may be closed by this pull request

access data after load load as dataframes with ibis #1095

Closed

sh-rp added 3 commits June 24, 2024 17:34

some work on filesystem destination

36e94af

add support for parquet files and compression on jsonl files in files…

20bf9ce

…ystem dataframe implementation

Merge branch 'devel' into exp/1095-expose-readable-datasets

6dce626

# Conflicts: # dlt/common/destination/reference.py # dlt/destinations/sql_client.py # dlt/pipeline/pipeline.py

sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 0e7b165 to 6dce626 Compare July 17, 2024 15:10

sh-rp added 3 commits July 17, 2024 17:23

fix test after devel merge

a0ff55f

add nice composable pipeline example

c297e96

small updates to demo

d020403

rudolfix reviewed Jul 28, 2024

dlt/destinations/impl/filesystem/filesystem.py (outdated)

dlt/destinations/impl/filesystem/filesystem.py (outdated)

composable_pipeline_1.py (outdated)

jorritsandbrink reviewed Jul 29, 2024

dlt/destinations/sql_client.py (outdated)

sh-rp changed the title ~~[experiment] expose readable datasets as dataframes and arrow tables~~ expose readable datasets as dataframes and arrow tables Aug 6, 2024

sh-rp added 6 commits August 6, 2024 14:30

Merge branch 'devel' into exp/1095-expose-readable-datasets

5c3db47

# Conflicts: # dlt/destinations/impl/filesystem/filesystem.py

enable tests for all bucket providers

79ef7dd

remove resource based dataset accessor

fix tests

ff40079

create views in duckdb filesystem accessor

ac415b9

move to relations based interface

c92a527

add generic duckdb interface to filesystem

13ec73b

sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 9538229 to 13ec73b Compare August 6, 2024 14:18

sh-rp added 3 commits August 6, 2024 17:27

move code for accessing frames and tables to the cursor and use duckd…

46e0226

…b dbapi cursor in filesystem

add native db api cursor fetching to exposed dataset

7cf69a7

some small changes

6ffe302

sh-rp added 3 commits October 1, 2024 15:22

fix delta tests

3e96a6c

make default secret name derived from bucket url

355f5b6

try fix azure tests again

9002f02

rudolfix requested changes Oct 1, 2024

sh-rp and others added 7 commits October 2, 2024 11:04

fix df access tests

c3050d4

PR fixes

bbc0525

Merge branch 'devel' into exp/1095-expose-readable-datasets

ef148c3

Merge branch 'devel' into exp/1095-expose-readable-datasets

a99e987

# Conflicts: # docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md

correct internal table access

eaf1cd8

allow datasets without schema

6bb7117

skips parametrized queries, skips tables from non-dataset schemas

6648b86

rudolfix requested changes Oct 6, 2024

move filesystem specific sql_client tests to correct location and tes…

89a9861

…t a few more things

fix sql client tests

631d50b

sh-rp force-pushed the exp/1095-expose-readable-datasets branch from c41ecca to 631d50b Compare October 7, 2024 13:36

rudolfix previously approved these changes Oct 7, 2024

make secret name when dropping optional

8e2e37c

sh-rp dismissed rudolfix’s stale review via 8e2e37c October 7, 2024 14:12

sh-rp added 2 commits October 7, 2024 17:48

fix gs test

dc383fc

remove moved filesystem tests from test_read_interfaces

41926ae

sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 9885c5a to 41926ae Compare October 7, 2024 18:32

sh-rp added 2 commits October 7, 2024 23:41

fix sql client tests again... :)

9b8437a

clear duckdb secrets

5d14045

rudolfix previously approved these changes Oct 8, 2024

disable secrets deleting for delta tests

fb9a445

sh-rp dismissed rudolfix’s stale review via fb9a445 October 8, 2024 09:26

sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 8658fa8 to fb9a445 Compare October 8, 2024 09:26

sh-rp merged commit 4ee65a8 into devel Oct 8, 2024
61 checks passed

sh-rp deleted the exp/1095-expose-readable-datasets branch October 8, 2024 12:31

jorritsandbrink Jul 29, 2024

sh-rp Sep 19, 2024

jorritsandbrink Jul 29, 2024 (edited)

rudolfix Oct 1, 2024

rudolfix Oct 1, 2024

sh-rp Oct 2, 2024

rudolfix Oct 6, 2024

rudolfix Oct 1, 2024
