
Add load_as methods for pyarrow dataset and table #240

Draft · wants to merge 8 commits into main
Conversation


@chitralverma commented on Dec 23, 2022

Adds separate implementations of load_as_pyarrow_table and load_as_pyarrow_dataset that allow users to read Delta Sharing tables as a pyarrow table or dataset, respectively. A usage sketch follows the task list below.

  • Add basic implementation
  • Fix lint
  • Refactor common code
  • Verify performance with and without limit
  • Add tests - converter
  • Add tests - reader
  • Add tests - delta_sharing
  • Add examples
  • Fix review comments

closes #238
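
For illustration, a usage sketch of the two new entry points, assuming they mirror the module-level load_as_pandas signature (the profile file and table coordinates below are placeholders):

    import delta_sharing

    # Placeholder table URL: "<profile-file>#<share>.<schema>.<table>"
    table_url = "config.share#my_share.my_schema.my_table"

    # Lazy: files are only scanned when the dataset is consumed.
    ds = delta_sharing.load_as_pyarrow_dataset(table_url)

    # Eager: materializes the shared table as an in-memory pyarrow Table.
    tbl = delta_sharing.load_as_pyarrow_table(table_url)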

@chitralverma (Author):

@goodwillpunning @linzhou-db From the build logs I can see that PYARROW_VERSION has been pinned to 4.x somewhere in the environment variables. This version of pyarrow came out in May 2021, and since then there have been six major version releases.

It seems there are some API inconsistencies in the pinned version 4.x that are causing the build failure on GitHub, although the test cases pass locally. I also verified with versions 5.x to 10.x and was not able to reproduce the issue. Can you please unpin or upgrade PYARROW_VERSION?

@linzhou-db (Collaborator):

Thanks @chitralverma, will take a look once back in Jan.
cc @zsxwing

@linzhou-db self-requested a review on January 11, 2023, 18:45
try:
    import re

    decimal_pattern = re.compile(r"(\([^\)]+\))")
Collaborator:

nit: add a comment with examples of what this pattern can and cannot handle?
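
For illustration, a minimal sketch of what the pattern does and does not match, assuming it is applied to Delta type strings:

    import re

    decimal_pattern = re.compile(r"(\([^\)]+\))")

    # Captures the parenthesized precision/scale suffix of a decimal type string.
    decimal_pattern.search("decimal(10,2)").group(1)  # -> "(10,2)"

    # No match when there is no parenthesized suffix.
    decimal_pattern.search("decimal")  # -> None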

    and struct_field["type"]["type"] == "struct"
    for struct_field in element_type["fields"]
):
    raise TypeError("Nested StructType cannot be converted to PyArrow type.")
Collaborator:

"Double Nested cannot ..."?

    isinstance(struct_field["type"], dict) and struct_field["type"]["type"] == "struct"
    for struct_field in f_type["fields"]
):
    raise TypeError("Nested StructType cannot be converted to PyArrow type.")
Collaborator:

double nested?
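
For context, a sketch of the kind of schema dict that the checks above reject: a struct nested inside a struct field (field names here are hypothetical, not from the PR):

    doubly_nested_schema = {
        "type": "struct",
        "fields": [
            {
                "name": "outer",  # hypothetical field name
                "type": {
                    "type": "struct",
                    "fields": [
                        # a struct inside a struct field triggers the TypeError
                        {"name": "inner", "type": {"type": "struct", "fields": []}},
                    ],
                },
            },
        ],
    }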

def test_pyarrow_schema_base():
    base_schema_dict = {
        "type": "struct",
        "fields": [
Collaborator:

cover all types in this test?
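
A sketch of what "cover all types" could look like, using the Delta protocol's primitive type names (field names are illustrative, not from the PR):

    all_types_fields = [
        {"name": f"col_{i}", "type": t, "nullable": True, "metadata": {}}
        for i, t in enumerate([
            "boolean", "byte", "short", "integer", "long",
            "float", "double", "string", "binary",
            "date", "timestamp", "decimal(10,2)",
        ])
    ]
    base_schema_dict = {"type": "struct", "fields": all_types_fields}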

@@ -71,19 +79,112 @@ def limit(self, limit: Optional[int]) -> "DeltaSharingReader":
            timestamp=self._timestamp
        )

    def to_pandas(self) -> pd.DataFrame:
    def _get_response(self) -> ListFilesInTableResponse:
        response = self._rest_client.list_files_in_table(
Collaborator:

you can directly return self._rest_client.list...?
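
The reviewer's suggestion, as a sketch (the keyword arguments are assumed from the surrounding diff, not verified against the final code):

    def _get_response(self) -> ListFilesInTableResponse:
        # Return the client call directly instead of binding it to a local first.
        return self._rest_client.list_files_in_table(
            self._table,
            predicateHints=self._predicate_hints,
            limitHint=self._limit,
            version=self._version,
            timestamp=self._timestamp,
        )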

tbl: PyArrowTable = ds.head(left, **pyarrow_tbl_options)
pa_tables.append(tbl)
left -= tbl.num_rows
assert (
Collaborator:

Is this a hard limit? And does it require exactly `limit` rows to be returned?

@chitralverma (Author), Jan 11, 2023:

Yes, it returns exactly the number of rows asked for, but it does this file by file, as I saw that in practice this is faster than just calling .head() on the combined pyarrow table.

So this kind of mimics the implementation that's done in _to_pandas().
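
A sketch of the file-by-file limit strategy described above (the helper and its names are hypothetical, not the PR's code):

    import pyarrow as pa
    import pyarrow.dataset as pa_ds

    def head_across_files(file_paths, limit):
        pa_tables = []
        left = limit
        for path in file_paths:
            if left <= 0:
                break
            ds = pa_ds.dataset(path, format="parquet")
            tbl = ds.head(left)  # take at most `left` rows from this file
            pa_tables.append(tbl)
            left -= tbl.num_rows
        return pa.concat_tables(pa_tables)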

Collaborator:

Right. I wonder whether we even have to fail if we got a few more rows?

same_scheme = False
break

assert same_scheme, "All files did not follow the same URL scheme."
Collaborator:

nit: "All files should follow the same URL scheme" ?

Collaborator:

Question: what's an example of this failure? And is it possible to add a test case to cover it?

@chitralverma (Author):

I don't think the delta server can return files from different places for the same table, but I added this check just in case delta sharing turns into a cross-cloud service in the future (some data in S3 and some data in GCS).

Another reason for adding this was that we don't have to initialize an fsspec filesystem for each path if they all follow the same scheme.
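
A sketch of that idea (the helper is hypothetical, not the PR's code): derive the single URL scheme shared by all file URLs and initialize one fsspec filesystem for it.

    from urllib.parse import urlparse
    import fsspec

    def filesystem_for(file_urls):
        schemes = {urlparse(url).scheme for url in file_urls}
        assert len(schemes) == 1, "All files should follow the same URL scheme"
        return fsspec.filesystem(schemes.pop())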

Collaborator:

I don't foresee delta sharing supporting a single table from multiple clouds any time soon.
Plus, there is no test coverage of this code; could we rather turn this into a TODO?

assert ds.count_rows() == 0


def test_to_pyarrow_table_non_partitioned(tmp_path):
Collaborator:

What's the difference between this test and test_to_pyarrow_dataset?

@chitralverma (Author), Jan 11, 2023:

That one is for the pyarrow dataset (lazy, faster) and this one is for the pyarrow table (eager).

Internally, the pyarrow table implementation in this PR relies on the dataset implementation.
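
A sketch of that relation (the path is a placeholder): the eager table path can be built on top of the lazy dataset path.

    import pyarrow.dataset as pa_ds

    ds = pa_ds.dataset("/path/to/downloaded/files", format="parquet")  # lazy scan
    tbl = ds.to_table()  # eager: reads everything into an in-memory pyarrow Table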

@@ -1,7 +1,7 @@
# Dependencies. When you update don't forget to update setup.py.
pandas
pyarrow>=4.0.0
Collaborator:

question: why are these removed?

@chitralverma (Author):

Removed temporarily to see if this is what is causing the build to fail. I will add it back before the PR is completely ready.

Please see my original comment regarding this.

@linzhou-db (Collaborator):

Also, what's your thought on loading CDF in pyarrow? Is it something not needed for now?

@chitralverma (Author):

> Also, what's your thought on loading CDF in pyarrow? Is it something not needed for now?

I would prefer to raise a separate PR for CDF to keep things simple and concise; this one is just for the data.

@ion-elgreco

@chitralverma @linzhou-db can we revive this PR?

Successfully merging this pull request may close these issues: Support for load_as_pyarrow_dataset or load_as_pyarrow_table (#238).