
Add load_as methods for pyarrow dataset and table #240

Draft · wants to merge 8 commits into main
Conversation


@chitralverma commented on Dec 23, 2022

Adds separate implementations of load_as_pyarrow_table and load_as_pyarrow_dataset that allow users to read Delta Sharing tables as a pyarrow table or dataset, respectively. A usage sketch follows the task list below.

  • Add basic implementation
  • Fix lint
  • Refactor common code
  • Verify performance with and without limit
  • Add tests - converter
  • Add tests - reader
  • Add tests - delta_sharing
  • Add examples
  • Fix review comments

closes #238
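
For illustration, a usage sketch of the two new entry points, assuming they mirror the module-level load_as_pandas signature (the profile file and table coordinates below are placeholders):

    import delta_sharing

    # Placeholder table URL: "<profile-file>#<share>.<schema>.<table>"
    table_url = "config.share#my_share.my_schema.my_table"

    # Lazy: files are only scanned when the dataset is consumed.
    ds = delta_sharing.load_as_pyarrow_dataset(table_url)

    # Eager: materializes the shared table as an in-memory pyarrow Table.
    tbl = delta_sharing.load_as_pyarrow_table(table_url)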

@chitralverma (Author):

@goodwillpunning @linzhou-db From the build logs I can see that PYARROW_VERSION has been pinned to 4.x somewhere in the environment variables. This version of pyarrow came out in May 2021, and since then there have been six major version releases.

It seems there are some API inconsistencies in the pinned version 4.x that are causing the build failure on GitHub, although the test cases pass locally. I also verified with versions 5.x to 10.x and was not able to reproduce the issue. Can you please unpin or upgrade PYARROW_VERSION?

@linzhou-db (Collaborator):

Thanks @chitralverma, will take a look once back in Jan.
cc @zsxwing

@linzhou-db self-requested a review on January 11, 2023, 18:45
try:
    import re

    decimal_pattern = re.compile(r"(\([^\)]+\))")
Collaborator:

nit: add a comment with examples of what this pattern can and cannot handle?
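
For illustration, a minimal sketch of what the pattern does and does not match, assuming it is applied to Delta type strings:

    import re

    decimal_pattern = re.compile(r"(\([^\)]+\))")

    # Captures the parenthesized precision/scale suffix of a decimal type string.
    decimal_pattern.search("decimal(10,2)").group(1)  # -> "(10,2)"

    # No match when there is no parenthesized suffix.
    decimal_pattern.search("decimal")  # -> None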

    and struct_field["type"]["type"] == "struct"
    for struct_field in element_type["fields"]
):
    raise TypeError("Nested StructType cannot be converted to PyArrow type.")
Collaborator:

"Double Nested cannot ..."?

    isinstance(struct_field["type"], dict) and struct_field["type"]["type"] == "struct"
    for struct_field in f_type["fields"]
):
    raise TypeError("Nested StructType cannot be converted to PyArrow type.")
Collaborator:

double nested?
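
For context, a sketch of the kind of schema dict that the checks above reject: a struct nested inside a struct field (field names here are hypothetical, not from the PR):

    doubly_nested_schema = {
        "type": "struct",
        "fields": [
            {
                "name": "outer",  # hypothetical field name
                "type": {
                    "type": "struct",
                    "fields": [
                        # a struct inside a struct field triggers the TypeError
                        {"name": "inner", "type": {"type": "struct", "fields": []}},
                    ],
                },
            },
        ],
    }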

def test_pyarrow_schema_base():
    base_schema_dict = {
        "type": "struct",
        "fields": [
Collaborator:

cover all types in this test?
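
A sketch of what "cover all types" could look like, using the Delta protocol's primitive type names (field names are illustrative, not from the PR):

    all_types_fields = [
        {"name": f"col_{i}", "type": t, "nullable": True, "metadata": {}}
        for i, t in enumerate([
            "boolean", "byte", "short", "integer", "long",
            "float", "double", "string", "binary",
            "date", "timestamp", "decimal(10,2)",
        ])
    ]
    base_schema_dict = {"type": "struct", "fields": all_types_fields}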

@@ -71,19 +79,112 @@ def limit(self, limit: Optional[int]) -> "DeltaSharingReader":
            timestamp=self._timestamp
        )

    def to_pandas(self) -> pd.DataFrame:
    def _get_response(self) -> ListFilesInTableResponse:
        response = self._rest_client.list_files_in_table(
Collaborator:

you can directly return self._rest_client.list...?
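
The reviewer's suggestion, as a sketch (the keyword arguments are assumed from the surrounding diff, not verified against the final code):

    def _get_response(self) -> ListFilesInTableResponse:
        # Return the client call directly instead of binding it to a local first.
        return self._rest_client.list_files_in_table(
            self._table,
            predicateHints=self._predicate_hints,
            limitHint=self._limit,
            version=self._version,
            timestamp=self._timestamp,
        )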

tbl: PyArrowTable = ds.head(left, **pyarrow_tbl_options)
pa_tables.append(tbl)
left -= tbl.num_rows
assert (
Collaborator:

Is this a hard limit? And does it require exactly `limit` rows to be returned?

@chitralverma (Author), Jan 11, 2023:

Yes, it returns exactly the number of rows asked for, but it does this file by file, as I saw that in practice this is faster than just calling .head() on the combined pyarrow table.

So this kind of mimics the implementation that's done in _to_pandas().
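
A sketch of the file-by-file limit strategy described above (the helper and its names are hypothetical, not the PR's code):

    import pyarrow as pa
    import pyarrow.dataset as pa_ds

    def head_across_files(file_paths, limit):
        pa_tables = []
        left = limit
        for path in file_paths:
            if left <= 0:
                break
            ds = pa_ds.dataset(path, format="parquet")
            tbl = ds.head(left)  # take at most `left` rows from this file
            pa_tables.append(tbl)
            left -= tbl.num_rows
        return pa.concat_tables(pa_tables)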

Collaborator:

Right. I wonder whether we even have to fail if we got a few more rows?

same_scheme = False
break

assert same_scheme, "All files did not follow the same URL scheme."
Collaborator:

nit: "All files should follow the same URL scheme" ?

Collaborator:

Question: what's an example of this failure? And is it possible to add a test case to cover it?

@chitralverma (Author):

I don't think the delta server can return files from different places for the same table, but I added this check just in case delta sharing turns into a cross-cloud service in the future (some data in S3 and some data in GCS).

Another reason for adding this was that we don't have to initialize an fsspec filesystem for each path if they all follow the same scheme.
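
A sketch of that idea (the helper is hypothetical, not the PR's code): derive the single URL scheme shared by all file URLs and initialize one fsspec filesystem for it.

    from urllib.parse import urlparse
    import fsspec

    def filesystem_for(file_urls):
        schemes = {urlparse(url).scheme for url in file_urls}
        assert len(schemes) == 1, "All files should follow the same URL scheme"
        return fsspec.filesystem(schemes.pop())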

Collaborator:

I don't foresee delta sharing supporting a single table from multiple clouds any time soon.
Plus, there is no test coverage of this code; could we rather turn this into a TODO?

assert ds.count_rows() == 0


def test_to_pyarrow_table_non_partitioned(tmp_path):
Collaborator:

What's the difference between this test and test_to_pyarrow_dataset?

@chitralverma (Author), Jan 11, 2023:

That one is for the pyarrow dataset (lazy, faster) and this one is for the pyarrow table (eager).

Internally, the pyarrow table implementation in this PR relies on the dataset implementation.
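
A sketch of that relation (the path is a placeholder): the eager table path can be built on top of the lazy dataset path.

    import pyarrow.dataset as pa_ds

    ds = pa_ds.dataset("/path/to/downloaded/files", format="parquet")  # lazy scan
    tbl = ds.to_table()  # eager: reads everything into an in-memory pyarrow Table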

@@ -1,7 +1,7 @@
# Dependencies. When you update don't forget to update setup.py.
pandas
pyarrow>=4.0.0
Collaborator:

question: why are these removed?

@chitralverma (Author):

Removed temporarily to see if this is what is causing the build to fail. I will add it back before the PR is completely ready.

Please see my original comment regarding this.

@linzhou-db (Collaborator):

Also, what's your thought on loading CDF in pyarrow? Is it something not needed for now?

@chitralverma (Author):

> Also, what's your thought on loading CDF in pyarrow? Is it something not needed for now?

I would prefer to raise a separate PR for CDF to keep things simple and concise; this one is just for the data.

@ion-elgreco

@chitralverma @linzhou-db can we revive this PR?

Successfully merging this pull request may close these issues: Support for load_as_pyarrow_dataset or load_as_pyarrow_table (#238).