Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Allow sql alchemy connection factory as input to read_sql #2071

Merged
merged 16 commits into from
Apr 11, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Apr 2, 2024

Closes #2072

Support sql alchemy connection factory as input (same as pandas)

Sql alchemy connection is nice because it gives info on dialect, driver, url, which will fit in nicely for our partitioning + predicate pushdowns.

@github-actions github-actions bot added the enhancement New feature or request label Apr 2, 2024
Copy link

codecov bot commented Apr 2, 2024

Codecov Report

Attention: Patch coverage is 22.80702% with 88 lines in your changes are missing coverage. Please review.

Project coverage is 84.97%. Comparing base (9ccdc48) to head (64cc1ba).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2071      +/-   ##
==========================================
- Coverage   85.27%   84.97%   -0.30%     
==========================================
  Files          68       68              
  Lines        7258     7293      +35     
==========================================
+ Hits         6189     6197       +8     
- Misses       1069     1096      +27     
Files Coverage Δ
daft/io/_sql.py 52.38% <60.00%> (-0.57%) ⬇️
daft/table/table_io.py 88.61% <33.33%> (ø)
daft/sql/sql_scan.py 30.08% <11.76%> (+0.08%) ⬆️
daft/sql/sql_connection.py 22.47% <22.47%> (ø)

@colin-ho colin-ho marked this pull request as ready for review April 2, 2024 17:15
@colin-ho colin-ho requested a review from samster25 April 2, 2024 17:16
@jaychia jaychia self-requested a review April 8, 2024 23:21
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, but I think we might want to wrap self.conn in our own ConnectionFactory abstraction so that we can avoid lots of "matching" with isinstance(conn, str) in the rest of the code. Lmk your thoughts?

daft/io/_sql.py Show resolved Hide resolved
daft/sql/sql_scan.py Outdated Show resolved Hide resolved
daft/sql/sql_scan.py Outdated Show resolved Hide resolved
daft/sql/sql_scan.py Show resolved Hide resolved
@colin-ho
Copy link
Contributor Author

colin-ho commented Apr 9, 2024

Nice, but I think we might want to wrap self.conn in our own ConnectionFactory abstraction so that we can avoid lots of "matching" with isinstance(conn, str) in the rest of the code. Lmk your thoughts?

Consolidated the "matching" in a SQLConnection object, which handles any functionality that deals with either url or connection factories, such as retrieving dialect, executing sql, etc.

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just some questions/nits

daft/sql/sql_connection.py Outdated Show resolved Hide resolved
daft/sql/sql_connection.py Outdated Show resolved Hide resolved
daft/sql/sql_connection.py Outdated Show resolved Hide resolved
try:
return self._execute_sql_query(sql)
except RuntimeError as e:
if limit is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dang, what is the use-case for retrying without a limit? Sounds pretty expensive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some dbs don't support limit: https://stackoverflow.com/questions/2832013/can-you-name-a-single-popular-database-that-doesnt-support-limit-statement

however this shouldn't be necessary once we use a sql generation library to help build our queries.

daft/sql/sql_scan.py Outdated Show resolved Hide resolved
daft/sql/sql_scan.py Outdated Show resolved Hide resolved

import daft
from daft.context import set_execution_config
from tests.conftest import assert_df_equals
from tests.integration.sql.conftest import TEST_TABLE_NAME


@pytest.fixture(scope="session", params=["url", "conn"])
def db_conn(request, test_db):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to fully parametrize every test with url vs conn? Or can we just have one dedicated test to ensure that passing in a conn instead of a URL works as expected?

Just a little concerned that it might bloat our tests, without really giving us much more coverage since we're just passing in a SQL statement in either case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I guess we don't, since the sqlalchemy connection path is also tested via the Trino connections as it's not supported by ConnectorX. Will remove this parametrization and add a dedicated test

@colin-ho colin-ho merged commit 3e73d74 into main Apr 11, 2024
30 of 31 checks passed
@colin-ho colin-ho deleted the colin/read_sql_refactor branch April 11, 2024 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow connection factory in read_sql
2 participants