
Fetching next batch of results using BigQuery's dbapi #3840

Closed · mxmzdlv opened this issue Aug 18, 2017 · 11 comments

Labels: api: bigquery (Issues related to the BigQuery API) · priority: p1 (Important issue which blocks shipping the next release; will be fixed prior to next release) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

mxmzdlv commented Aug 18, 2017

Currently, the maximum number of rows that you can get when iterating using fetchmany, fetchall or fetchone is limited by arraysize.

I see that those methods use the iterator returned by _query_data.fetch_data() with max_results set to arraysize:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/36304f7448677cff2f89558ef87c374e904a0492/bigquery/google/cloud/bigquery/dbapi/cursor.py#L171-L172

However, shouldn't it automatically fetch the next page of results once we have finished iterating over the current batch? Otherwise it is impossible to retrieve the full results of a query without setting arraysize to an arbitrarily large number.

Here is what I've tried:

cursor.execute('SELECT * FROM dataset.table')

while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break

    for row in rows:
        print(row)

I also tried fetching rows one at a time:

cursor.execute('SELECT * FROM dataset.table')

row = cursor.fetchone()
while row:
    print(row)
    row = cursor.fetchone()

I am using google-cloud-bigquery 0.26.0

tswast added the api: bigquery and type: question labels on Aug 18, 2017

tswast commented Aug 18, 2017

I'm not sure what you are asking. Under the covers the fetch_data method returns an iterator, so it would be possible to prefetch results, but why would you want to?

The DB-API specification says that arraysize must default to 1, so that's what the BigQuery DB-API does to stay compliant. As with any DB-API implementation, you must set the arraysize manually if you want a larger batch size.
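
For reference, here is roughly what that looks like, reusing the placeholder dataset.table query from the snippets above. This is only a sketch: it assumes the module-level connect() helper accepts a Client and that default application credentials are available.

from google.cloud import bigquery
from google.cloud.bigquery import dbapi

# Sketch only: assumes default credentials and the placeholder table name
# `dataset.table` from the earlier snippets.
connection = dbapi.connect(bigquery.Client())
cursor = connection.cursor()

cursor.arraysize = 5000  # batch size for each underlying page fetch
cursor.execute('SELECT * FROM dataset.table')

for row in cursor.fetchall():
    print(row)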


mxmzdlv commented Aug 18, 2017

Unless I am missing something, there is currently no way to fetch all rows for a query? The only workaround is to set arraysize to some arbitrarily large number, because we don't know beforehand how many rows the query result will contain.


tswast commented Aug 18, 2017

What's wrong with using fetchall?

But yes, for best performance you will need to increase the arraysize parameter before calling fetchall.


tswast commented Aug 18, 2017

fetchall should not be limited by arraysize; arraysize is just the batch size it uses when fetching results.
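
In other words, the intended behavior matches the batching loop from the first comment. A rough helper to make the distinction explicit (the function name here is just illustrative, not part of the library):

def fetch_all_rows(cursor, batch_size=5000):
    # Drain a DB-API cursor in batches. batch_size (like arraysize) only
    # controls how many rows each round trip asks for, not the total
    # number of rows returned.
    rows = []
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)
    return rows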


mxmzdlv commented Aug 18, 2017

This is what I thought as well, but I've tried fetchall and it just returns 1 row, or however many rows I set arraysize to.

tswast added the type: bug label and removed the type: question label on Aug 18, 2017

tswast commented Aug 18, 2017

Ah, okay. I'll investigate further. Thanks for your patience.


tswast commented Aug 18, 2017

I'm able to reproduce the issue with this system test.

def test_dbapi_fetchall(self):
    query = 'SELECT * FROM UNNEST([(1, 2), (3, 4), (5, 6)])'

    for arraysize in range(1, 5):
        Config.CURSOR.execute(query)
        self.assertEqual(Config.CURSOR.rowcount, 3, "expected 3 rows")
        Config.CURSOR.arraysize = arraysize
        rows = Config.CURSOR.fetchall()
        self.assertEqual(rows, [(1, 2), (3, 4), (5, 6)])

When arraysize is 1 (the default), only one result is returned when three are expected.

The unit tests are passing, but they mock out the core iterator class used, so I'm thinking the issue lies there or in how it is used.


tswast commented Aug 19, 2017

I think the problem is that HTTPIterator is looking for nextPageToken but the getQueryResults response populates pageToken.
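
Roughly, the mismatch looks like this (field names as described above; the row payloads below are just stand-ins):

# Shape the iterator expects by default: continuation token under
# 'nextPageToken'.
expected_response = {
    'rows': [{'f': [{'v': '1'}]}],
    'nextPageToken': 'TOKEN',
}

# Shape jobs.getQueryResults actually returns: the token is under
# 'pageToken', so the iterator finds no token and stops after page one.
actual_response = {
    'rows': [{'f': [{'v': '1'}]}],
    'pageToken': 'TOKEN',
}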

tswast added the priority: p1 label on Aug 19, 2017

tswast commented Aug 19, 2017

I'm able to reproduce this issue (only the first page is fetched) in the core API as well, not just the DB-API, by modifying the system test to request 1000 rows with a page size of 100.

def test_large_query_w_public_data(self):
    PUBLIC = 'bigquery-public-data'
    DATASET_NAME = 'samples'
    TABLE_NAME = 'natality'
    LIMIT = 1000
    SQL = 'SELECT * from `{}.{}.{}` LIMIT {}'.format(
        PUBLIC, DATASET_NAME, TABLE_NAME, LIMIT)

    query = Config.CLIENT.run_sync_query(SQL)
    query.use_legacy_sql = False
    query.run()

    iterator = query.fetch_data(max_results=100)
    rows = list(iterator)
    self.assertEqual(len(rows), LIMIT)

Working to fix in this branch: https://github.com/tswast/google-cloud-python/tree/bq-dbapi/bigquery


mxmzdlv commented Aug 19, 2017

Thank you for such a quick fix!


tswast commented Aug 21, 2017

Fix merged. It should be in the next release (I don't know when that will be).
