createQueryStream loads a big set of data into memory before streaming #1073

Closed
koenvanzuijlen opened this issue Jan 27, 2022 · 6 comments

@koenvanzuijlen

The bigQuery.createQueryStream method seems to load an entire set of data into memory before the stream actually starts piping data to the downstream streams.

Environment details

  • OS: MacOS 12.1
  • Node.js version: 14.18.1
  • npm version: 6.14.15
  • @google-cloud/bigquery version: 5.10.0

Steps to reproduce

Using the test script below I can see that over 300 MB of data is loaded into memory before the stream starts piping to the next streams. And I am only selecting one column, so that amounts to a lot of records.

If I log each entry in the transform stream, the rows also seem to come in batches: the stream pauses for a while and then suddenly starts piping again. This makes me think a whole page is loaded into memory internally and then piped to the readable stream, but that might not be the issue.

const { BigQuery } = require("@google-cloud/bigquery");
const { Transform } = require("stream");
const fs = require("fs");

const bigQuery = new BigQuery();

// Capture a heap baseline before any query results arrive.
console.dir("BEFORE");
console.dir(process.memoryUsage());
const { heapUsed } = process.memoryUsage();

const stream = bigQuery
  .dataset("dataset")
  .createQueryStream("SELECT email FROM table");

let checked = false;
const tr = new Transform({
  objectMode: true,
  transform: (chunk, enc, next) => {
    // Log memory usage once, when the first row reaches this transform.
    if (!checked) {
      console.dir("START PIPING");
      console.dir(process.memoryUsage());
      console.dir(
        "DIFFERENCE = " +
          (process.memoryUsage().heapUsed - heapUsed) / (1024 * 1024) +
          " MB"
      );
      checked = true;
    }
    next(null, JSON.stringify(chunk) + "\n");
  },
});

const write = fs.createWriteStream("/dev/null");

tr.pipe(write);
stream.pipe(tr);
koenvanzuijlen added the priority: p2 and type: bug labels on Jan 27, 2022
product-auto-label bot added the api: bigquery label on Jan 27, 2022
@steffnay
Contributor

@koenvanzuijlen This is the case because createQueryStream() is a streamified version of the query() method, which retrieves pages of results using job.getQueryResults().
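
To make that concrete, here is a minimal sketch of the paging pattern being described — an illustration of the memory behaviour, not the library's actual internals. It assumes only the documented manual-pagination contract of job.getQueryResults() ([rows, nextQuery], where nextQuery carries the next pageToken):

// Simplified sketch, not the real library code: each page returned by
// job.getQueryResults() is fully materialized in memory before its rows
// are emitted one by one on the object stream.
async function emitQueryRows(job, push) {
  let pageToken;
  do {
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      pageToken,
    });
    // At this point the whole page is held in memory at once.
    for (const row of rows) push(row);
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}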

@koenvanzuijlen
Author

@steffnay Is there an option available to stream query results directly without loading a lot of results into memory first?

@steffnay
Contributor

steffnay commented May 1, 2022

No, unfortunately, we only have the option of streamifying the job.getQueryResults method.

@rohit-gohri

Is there a way to limit the size of this chunk of data? It is causing an OOM exception for us on some tables. Can we use maxResults for that, or will that limit the total results?

steffnay removed their assignment on Jul 8, 2022
@loferris
Contributor

The actual createQueryStream method doesn't expose such a parameter, as it packages both query creation and the use of the streamify API. One option would be to end the stream after a certain amount of data has been emitted and make separate calls to that method until all data has been read. (See an example). For more flexibility, you can also write the logic by hand using the various query and results methods together with streaming, as sketched below. Feel free to open a follow-up issue if a sample would be helpful! As for incorporating more configurability for streamified queries into the library itself, that will depend on user requests.
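
For reference, here is a rough sketch of that "by hand" approach — a sketch under assumptions, not an official sample. It relies only on the documented createQueryJob() and job.getQueryResults() manual-pagination options (autoPaginate, maxResults, pageToken); the pageSize default and the output file handling are placeholders:

const { BigQuery } = require("@google-cloud/bigquery");
const fs = require("fs");
const { once } = require("events");

async function exportQuery(sql, outPath, pageSize = 10000) {
  const bigquery = new BigQuery();
  const [job] = await bigquery.createQueryJob({ query: sql });
  const out = fs.createWriteStream(outPath);

  let pageToken;
  do {
    // Fetch one bounded page at a time instead of auto-paginating,
    // so at most pageSize rows are held in memory per iteration.
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      maxResults: pageSize,
      pageToken,
    });
    for (const row of rows) {
      // Respect backpressure from the destination stream.
      if (!out.write(JSON.stringify(row) + "\n")) await once(out, "drain");
    }
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);

  out.end();
  await once(out, "finish");
}

Calling, for example, exportQuery("SELECT email FROM dataset.table", "/dev/null") keeps roughly one page of rows in memory at a time; tuning pageSize trades API round trips against peak memory.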

@ricardograca-scratch

Also having the same problem. Loading all the results in memory isn't exactly what one would expect from streaming.
