
feat: add wrapper for reading table data using Storage API #431

Open · wants to merge 32 commits into main
Conversation

@alvarowolfx (Contributor) commented Mar 25, 2024

Add support for easily reading tables using the BigQuery Storage API instead of the BigQuery API. This will provide increased performance and reduced memory usage for most use cases, and will let users keep the same interface they are used to from our main library, or fetch data directly via a new veneer on the BigQuery Storage Read API.
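To make the intent concrete, here is a rough usage sketch; the ReadClient name, the getRows method, and the identifiers below are illustrative assumptions rather than the actual API surface added by this PR:

```ts
// Hypothetical usage sketch only: ReadClient, getRows, and the identifiers
// below are illustrative and not necessarily the API added by this PR.
import {ReadClient} from '@google-cloud/bigquery-storage';

async function readTableRows(): Promise<void> {
  const client = new ReadClient();
  // Read rows straight over the Storage Read API while keeping a
  // row-oriented interface similar to the main BigQuery client.
  const [rows] = await client.getRows({
    projectId: 'my-project', // placeholder
    datasetId: 'my_dataset', // placeholder
    tableId: 'my_table', // placeholder
  });
  for (const row of rows) {
    console.log(row);
  }
}
```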

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquerystorage Issues related to the googleapis/nodejs-bigquery-storage API. labels Mar 25, 2024
@alvarowolfx (Contributor, Author) commented:

Some early results:

  1. SELECT repository_url as url, repository_owner as owner, repository_forks as forks FROM `bigquery-public-data.samples.github_timeline` where repository_url is not null LIMIT 300000
  • fetchGetQueryResults: 31.135s 🔴
  • fetchStorageAPI: 20.033s ⬆️ 36% faster
  2. SELECT repository_url as url, repository_owner as owner, repository_forks as forks FROM `bigquery-public-data.samples.github_timeline` where repository_url is not null LIMIT 1000000
  • fetchGetQueryResults: 1:32.622 (m:ss.mmm) 🔴
  • fetchStorageAPI: 1:07.363 (m:ss.mmm) ⬆️ 27% faster
  3. SELECT name, number, state FROM `bigquery-public-data.usa_names.usa_1910_current`
  • fetchGetQueryResults: 5:00.514 (m:ss.mmm) 🔴
  • fetchStorageAPI: 3:20.987 (m:ss.mmm) ⬆️ 33% faster

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Jul 17, 2024
@alvarowolfx alvarowolfx marked this pull request as ready for review July 24, 2024 20:26
@alvarowolfx alvarowolfx requested a review from a team as a code owner July 24, 2024 20:26
@shollyman (Contributor) left a comment

Minor feedback thus far:

  • It might be good to pull out the logger refactor into its own PR; I was surprised to see managedwriter in scope for changes.
  • This appears to be only the direct table read bit. Are you landing query functionality in a followup, or is this PR going to keep growing?
  • Consider testing using a CTAS that leverages generate_array or its cousins; that should allow you much larger test data without having to stream it all in from the test env.

@alvarowolfx (Contributor, Author) replied:

@shollyman

  1. Good point, I'll split that.
  2. The query functionality lives in the main bq client repo: feat: use Storage Read API for faster data fetching nodejs-bigquery#1368
  3. I'll look into using that for the integration tests (a rough sketch of that kind of setup follows below).
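For reference, a rough sketch of that kind of CTAS setup, assuming the @google-cloud/bigquery client is available in the test environment; the dataset and table names are placeholders:

```ts
// Sketch: build a large test table with a CTAS over GENERATE_ARRAY instead of
// streaming test rows in from the test environment. Names are placeholders.
import {BigQuery} from '@google-cloud/bigquery';

async function createLargeTestTable(): Promise<void> {
  const bigquery = new BigQuery();
  const sql = `
    CREATE OR REPLACE TABLE \`my_dataset.large_read_test\` AS
    SELECT
      n AS row_num,
      CONCAT('name_', CAST(n AS STRING)) AS name,
      MOD(n, 50) AS state_id
    FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS n
  `;
  // query() runs the DDL statement and waits for the job to finish.
  await bigquery.query(sql);
}
```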

gcf-merge-on-green bot pushed a commit that referenced this pull request Aug 15, 2024
Towards #431, separating logger changes.
@shollyman (Contributor) left a comment

Approving for now, but please get other reviewer comments to a reasonable resolution before submitting.

@alvarowolfx (Contributor, Author) left a comment

@leahecole answered some other comments here that I missed. I'll try to address them later this week.

@alvarowolfx (Contributor, Author) commented on a diff:

Sorry, missed this one:

  • When using BQ Storage Read, if the query is ordered, the service side only returns one stream, so results can be fetched in the correct order.
  • We can add tests for that; I added them on the Go side but missed them here.
  • Do you have a good way of testing this? Just calling stream.pause()? If you have an example of that, it would be nice.

Comment on lines +100 to +102
{name: 'Ada Lovelace', row_num: 1},
{name: 'Alan Turing', row_num: 2},
{name: 'Bell', row_num: 3},
@alvarowolfx (Contributor, Author) replied:

Yeah, I'll add a test with more rows and use generation functions like @shollyman recommended.

system-test/reader_client_test.ts (resolved)
A contributor replied:

Got it. Re: the third bullet point, something like this test I am working on in gax. In that test I have two streams piped together (I'm using pipeline): attemptStream is the stream that makes the request to the server, and userStream is the consuming stream.

Some things to note:

  • I don't know if you'd need the error checking I have here. I'm specifically writing a retry test, so the .on('error') stuff may be irrelevant; I'm also firing errors from the showcase server at very specific points in the data stream (mainly after the 85th item).
  • This test has listeners for data on both the attemptStream and userStream; each of those appends the data content to an array. At the end, the two arrays should be the same length. I think a test of yours could use pipeline to connect streams and check that the lengths agree at the end.
  • If you experiment with this test and can't find a way to ever make it fail or be confident it's doing anything, don't sweat it. It may be more important in the retries case, and when I originally wrote this comment I was very much in the weeds with that CBT stuff.
  • The pausing happens in the userStream, but not in the attemptStream. This forces data to build up in the buffer. You could also change that userStream.on('data') bit to only pause when results2 is a certain length, but I think pausing every time is a more surefire way to force that buffer to fill.

let results: string[] = [];
let results2: string[] = [];

// attemptStream makes the request to the server; collect everything it emits.
attemptStream.on('data', (data: {content: string}) => {
  results.push(data.content);
});
attemptStream.on('end', () => {
  assert.strictEqual(results.length, 100);
});
attemptStream.on('error', (e: GoogleError) => {
  assert.strictEqual(e.code, 13);
});

// userStream is the consuming stream; pausing on every chunk forces data to
// build up in the buffer between the piped streams.
userStream.on('data', (data: {content: string}) => {
  results2.push(data.content);
  userStream.pause();
  setTimeout(() => {
    userStream.resume();
  }, 100);
});
userStream.on('end', () => {
  assert.strictEqual(results.length, 100);
  assert.strictEqual(results2.length, 100);
});
userStream.on('error', (e: GoogleError) => {
  // assert.strictEqual(results2.length, 85)
  assert.strictEqual(results.length, 85);
  assert.strictEqual(e.code, 14);
});

package.json (resolved)
@alvarowolfx (Contributor, Author) left a comment

Added some more comments, addressed some minor issues, and pushed a new test using CTAS and generate_array to test it with more data. cc @shollyman @leahecole

Comment on lines +92 to +95
  return stream
    .pipe(new ArrowRawTransform())
    .pipe(new ArrowRecordReaderTransform(info!))
    .pipe(new ArrowRecordBatchTransform()) as ResourceStream<RecordBatch>;
@alvarowolfx (Contributor, Author) commented:

Tried to use pipeline, but it is meant to be used when we have a destination. In this case we are just applying a bunch of transforms, and we don't know the destination beforehand.

The error that I got:

  TypeError [ERR_INVALID_ARG_TYPE]: The "streams[stream.length - 1]" property must be of type function. Received an instance of ArrowRecordBatchTransform
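For context on the error: Node's callback-style stream.pipeline expects the chain to end in a destination plus a completion callback, whereas pipe() simply composes the transforms and returns the last readable for the caller to consume later. A minimal sketch of the difference, using hypothetical stand-in transforms:

```ts
import {pipeline, Readable, Transform, Writable} from 'stream';

// Hypothetical object-mode transforms standing in for the Arrow transforms.
const makeParse = () =>
  new Transform({
    objectMode: true,
    transform(chunk, _enc, cb) {
      cb(null, chunk);
    },
  });
const makeToRows = () =>
  new Transform({
    objectMode: true,
    transform(chunk, _enc, cb) {
      cb(null, chunk);
    },
  });

// pipe() composes the transforms and hands the resulting readable back to the
// caller; no destination is needed yet, the consumer decides that later.
function composeWithPipe(source: Readable): Readable {
  return source.pipe(makeParse()).pipe(makeToRows());
}

// Callback-style pipeline() runs a whole flow into a destination and requires
// a final callback; passing only transforms (no callback) throws
// ERR_INVALID_ARG_TYPE: "streams[stream.length - 1]" must be of type function.
function runWithPipeline(source: Readable, sink: Writable): void {
  pipeline(source, makeParse(), makeToRows(), sink, err => {
    if (err) {
      console.error('pipeline failed:', err);
    }
  });
}

// Example usage of the pipe-based composition:
const rows = composeWithPipe(Readable.from([{a: 1}, {a: 2}]));
rows.on('data', r => console.log(r));
```

pipeline() also tears the whole chain down on error, which is why it fits a run-to-completion flow into a sink better than a composition that is handed back to the caller.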

Comment on lines +99 to +101
  return stream.pipe(
    new ArrowRecordBatchTableRowTransform()
  ) as ResourceStream<TableRow>;
@alvarowolfx (Contributor, Author) commented:

Errors are handled by the consumer of the stream, and when it is used internally like here, we handle the errors ourselves.
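For context on that split of responsibilities: pipe() does not forward 'error' events between piped streams on its own, so the internal code has to handle or forward them, while the consumer attaches handlers on the stream it receives. A minimal consumer-side sketch; consumeRows and rowStream are hypothetical names:

```ts
import {Readable} from 'stream';

// Consumer-side handling for the row stream handed back by the reader;
// consumeRows and rowStream are hypothetical names for illustration.
function consumeRows(rowStream: Readable): void {
  rowStream
    .on('data', row => {
      console.log(row); // one decoded table row
    })
    .on('error', err => {
      // pipe() does not forward errors between piped stages by itself,
      // so the consumer handles whatever surfaces on the stream it was given.
      console.error('read failed:', err);
    })
    .on('end', () => {
      console.log('all rows read');
    });
}
```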

(Between Sep 6 and Sep 20, 2024, @alvarowolfx and the gcf-owl-bot repeatedly added and removed the owlbot:run and automerge labels.)
Labels: api: bigquerystorage, size: xl
3 participants