
feat: add wrapper for reading table data using Storage API #431

Open · wants to merge 32 commits into main
Conversation

@alvarowolfx (Contributor) commented Mar 25, 2024

Add support for easily reading tables using the BigQuery Storage API instead of the BigQuery API. This will provide increased performance and reduced memory usage for most use cases, and will let users keep the same interface they are used to from our main library, or fetch data directly via a new veneer on the BigQuery Storage Read API.
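To make the intent concrete, here is a rough usage sketch; the ReadClient name, the getRows method, and the identifiers below are illustrative assumptions rather than the actual API surface added by this PR:

```ts
// Hypothetical usage sketch only: ReadClient, getRows, and the identifiers
// below are illustrative and not necessarily the API added by this PR.
import {ReadClient} from '@google-cloud/bigquery-storage';

async function readTableRows(): Promise<void> {
  const client = new ReadClient();
  // Read rows straight over the Storage Read API while keeping a
  // row-oriented interface similar to the main BigQuery client.
  const [rows] = await client.getRows({
    projectId: 'my-project', // placeholder
    datasetId: 'my_dataset', // placeholder
    tableId: 'my_table', // placeholder
  });
  for (const row of rows) {
    console.log(row);
  }
}
```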

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquerystorage Issues related to the googleapis/nodejs-bigquery-storage API. labels Mar 25, 2024
@alvarowolfx (Contributor, Author) commented:

Some early results:

  1. SELECT repository_url as url, repository_owner as owner, repository_forks as forks FROM `bigquery-public-data.samples.github_timeline` where repository_url is not null LIMIT 300000
  • fetchGetQueryResults: 31.135s 🔴
  • fetchStorageAPI: 20.033s ⬆️ 36% faster
  2. SELECT repository_url as url, repository_owner as owner, repository_forks as forks FROM `bigquery-public-data.samples.github_timeline` where repository_url is not null LIMIT 1000000
  • fetchGetQueryResults: 1:32.622 (m:ss.mmm) 🔴
  • fetchStorageAPI: 1:07.363 (m:ss.mmm) ⬆️ 27% faster
  3. SELECT name, number, state FROM `bigquery-public-data.usa_names.usa_1910_current`
  • fetchGetQueryResults: 5:00.514 (m:ss.mmm) 🔴
  • fetchStorageAPI: 3:20.987 (m:ss.mmm) ⬆️ 33% faster

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Jul 17, 2024
@alvarowolfx alvarowolfx marked this pull request as ready for review July 24, 2024 20:26
@alvarowolfx alvarowolfx requested a review from a team as a code owner July 24, 2024 20:26
@shollyman (Contributor) left a comment

Minor feedback thus far:

  • It might be good to pull out the logger refactor into its own PR; I was surprised to see managedwriter in scope for changes.
  • This appears to be only the direct table read bit. Are you landing query functionality in a followup, or is this PR going to keep growing?
  • Consider testing using a CTAS that leverages generate_array or its cousins; that should allow you much larger test data without having to stream it all in from the test env.

@alvarowolfx (Contributor, Author) replied:

@shollyman

  1. Good point, I'll split that.
  2. The query functionality lives in the main bq client repo: feat: use Storage Read API for faster data fetching nodejs-bigquery#1368
  3. I'll look into using that for the integration tests (a rough sketch of that kind of setup follows below).
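For reference, a rough sketch of that kind of CTAS setup, assuming the @google-cloud/bigquery client is available in the test environment; the dataset and table names are placeholders:

```ts
// Sketch: build a large test table with a CTAS over GENERATE_ARRAY instead of
// streaming test rows in from the test environment. Names are placeholders.
import {BigQuery} from '@google-cloud/bigquery';

async function createLargeTestTable(): Promise<void> {
  const bigquery = new BigQuery();
  const sql = `
    CREATE OR REPLACE TABLE \`my_dataset.large_read_test\` AS
    SELECT
      n AS row_num,
      CONCAT('name_', CAST(n AS STRING)) AS name,
      MOD(n, 50) AS state_id
    FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS n
  `;
  // query() runs the DDL statement and waits for the job to finish.
  await bigquery.query(sql);
}
```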

gcf-merge-on-green bot pushed a commit that referenced this pull request Aug 15, 2024
Towards #431, separating logger changes.
@shollyman (Contributor) left a comment

Approving for now, but please get other reviewer comments to a reasonable resolution before submitting.

@alvarowolfx (Contributor, Author) left a comment

@leahecole answered some other comments here that I missed. I'll try to address them later this week.

@alvarowolfx (Contributor, Author) commented on a diff:

Sorry, missed this one:

  • When using BQ Storage Read, if the query is ordered, the service side only returns one stream, so results can be fetched in the correct order.
  • We can add tests for that; I added them on the Go side but missed them here.
  • Do you have a good way of testing this? Just calling stream.pause()? If you have an example of that, it would be nice.

Comment on lines +100 to +102
{name: 'Ada Lovelace', row_num: 1},
{name: 'Alan Turing', row_num: 2},
{name: 'Bell', row_num: 3},
@alvarowolfx (Contributor, Author) replied:

Yeah, I'll add a test with more rows and use generation functions like @shollyman recommended.

system-test/reader_client_test.ts (resolved)
A contributor replied:

Got it. Re: the third bullet point, something like this test I am working on in gax. In that test I have two streams piped together (I'm using pipeline): attemptStream is the stream that makes the request to the server, and userStream is the consuming stream.

Some things to note:

  • I don't know if you'd need the error checking I have here. I'm specifically writing a retry test, so the .on('error') stuff may be irrelevant; I'm also firing errors from the showcase server at very specific points in the data stream (mainly after the 85th item).
  • This test has listeners for data on both the attemptStream and userStream; each of those appends the data content to an array. At the end, the two arrays should be the same length. I think a test of yours could use pipeline to connect streams and check that the lengths agree at the end.
  • If you experiment with this test and can't find a way to ever make it fail or be confident it's doing anything, don't sweat it. It may be more important in the retries case, and when I originally wrote this comment I was very much in the weeds with that CBT stuff.
  • The pausing happens in the userStream, but not in the attemptStream. This forces data to build up in the buffer. You could also change that userStream.on('data') bit to only pause when results2 is a certain length, but I think pausing every time is a more surefire way to force that buffer to fill.

let results: string[] = [];
let results2: string[] = [];

// attemptStream makes the request to the server; collect everything it emits.
attemptStream.on('data', (data: {content: string}) => {
  results.push(data.content);
});
attemptStream.on('end', () => {
  assert.strictEqual(results.length, 100);
});
attemptStream.on('error', (e: GoogleError) => {
  assert.strictEqual(e.code, 13);
});

// userStream is the consuming stream; pausing on every chunk forces data to
// build up in the buffer between the piped streams.
userStream.on('data', (data: {content: string}) => {
  results2.push(data.content);
  userStream.pause();
  setTimeout(() => {
    userStream.resume();
  }, 100);
});
userStream.on('end', () => {
  assert.strictEqual(results.length, 100);
  assert.strictEqual(results2.length, 100);
});
userStream.on('error', (e: GoogleError) => {
  // assert.strictEqual(results2.length, 85)
  assert.strictEqual(results.length, 85);
  assert.strictEqual(e.code, 14);
});

package.json (resolved)
@alvarowolfx (Contributor, Author) left a comment

Added some more comments, addressed some minor issues, and pushed a new test using CTAS and generate_array to test it with more data. cc @shollyman @leahecole

Comment on lines +92 to +95
  return stream
    .pipe(new ArrowRawTransform())
    .pipe(new ArrowRecordReaderTransform(info!))
    .pipe(new ArrowRecordBatchTransform()) as ResourceStream<RecordBatch>;
@alvarowolfx (Contributor, Author) commented:

Tried to use pipeline, but it is meant to be used when we have a destination. In this case we are just applying a bunch of transforms, and we don't know the destination beforehand.

The error that I got:

  TypeError [ERR_INVALID_ARG_TYPE]: The "streams[stream.length - 1]" property must be of type function. Received an instance of ArrowRecordBatchTransform
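For context on the error: Node's callback-style stream.pipeline expects the chain to end in a destination plus a completion callback, whereas pipe() simply composes the transforms and returns the last readable for the caller to consume later. A minimal sketch of the difference, using hypothetical stand-in transforms:

```ts
import {pipeline, Readable, Transform, Writable} from 'stream';

// Hypothetical object-mode transforms standing in for the Arrow transforms.
const makeParse = () =>
  new Transform({
    objectMode: true,
    transform(chunk, _enc, cb) {
      cb(null, chunk);
    },
  });
const makeToRows = () =>
  new Transform({
    objectMode: true,
    transform(chunk, _enc, cb) {
      cb(null, chunk);
    },
  });

// pipe() composes the transforms and hands the resulting readable back to the
// caller; no destination is needed yet, the consumer decides that later.
function composeWithPipe(source: Readable): Readable {
  return source.pipe(makeParse()).pipe(makeToRows());
}

// Callback-style pipeline() runs a whole flow into a destination and requires
// a final callback; passing only transforms (no callback) throws
// ERR_INVALID_ARG_TYPE: "streams[stream.length - 1]" must be of type function.
function runWithPipeline(source: Readable, sink: Writable): void {
  pipeline(source, makeParse(), makeToRows(), sink, err => {
    if (err) {
      console.error('pipeline failed:', err);
    }
  });
}

// Example usage of the pipe-based composition:
const rows = composeWithPipe(Readable.from([{a: 1}, {a: 2}]));
rows.on('data', r => console.log(r));
```

pipeline() also tears the whole chain down on error, which is why it fits a run-to-completion flow into a sink better than a composition that is handed back to the caller.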

Comment on lines +99 to +101
  return stream.pipe(
    new ArrowRecordBatchTableRowTransform()
  ) as ResourceStream<TableRow>;
@alvarowolfx (Contributor, Author) commented:

Errors are handled by the consumer of the stream, and when it is used internally like here, we handle the errors ourselves.
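For context on that split of responsibilities: pipe() does not forward 'error' events between piped streams on its own, so the internal code has to handle or forward them, while the consumer attaches handlers on the stream it receives. A minimal consumer-side sketch; consumeRows and rowStream are hypothetical names:

```ts
import {Readable} from 'stream';

// Consumer-side handling for the row stream handed back by the reader;
// consumeRows and rowStream are hypothetical names for illustration.
function consumeRows(rowStream: Readable): void {
  rowStream
    .on('data', row => {
      console.log(row); // one decoded table row
    })
    .on('error', err => {
      // pipe() does not forward errors between piped stages by itself,
      // so the consumer handles whatever surfaces on the stream it was given.
      console.error('read failed:', err);
    })
    .on('end', () => {
      console.log('all rows read');
    });
}
```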

(Between Sep 6 and Sep 20, 2024, @alvarowolfx and the gcf-owl-bot repeatedly added and removed the owlbot:run and automerge labels.)
Labels: api: bigquerystorage, size: xl
3 participants