createQueryStream loads a big set of data into memory before streaming #1073

Closed
koenvanzuijlen opened this issue Jan 27, 2022 · 6 comments

@koenvanzuijlen

The bigQuery.createQueryStream method seems to load an entire set of data into memory before the stream actually starts piping data to the downstream streams.

Environment details

  • OS: MacOS 12.1
  • Node.js version: 14.18.1
  • npm version: 6.14.15
  • @google-cloud/bigquery version: 5.10.0

Steps to reproduce

Using the test script below I can see that over 300 MB of data is loaded into memory before the stream starts piping to the next streams. And I am only selecting one column, so that amounts to a lot of records.

If I log each entry in the transform stream, the rows also seem to come in batches: the stream pauses for a while and then suddenly starts piping again. This makes me think a whole page is loaded into memory internally and then piped to the readable stream, but that might not be the issue.

const { BigQuery } = require("@google-cloud/bigquery");
const { Transform } = require("stream");
const fs = require("fs");

const bigQuery = new BigQuery();

// Capture a heap baseline before any query results arrive.
console.dir("BEFORE");
console.dir(process.memoryUsage());
const { heapUsed } = process.memoryUsage();

const stream = bigQuery
  .dataset("dataset")
  .createQueryStream("SELECT email FROM table");

let checked = false;
const tr = new Transform({
  objectMode: true,
  transform: (chunk, enc, next) => {
    // Log memory usage once, when the first row reaches this transform.
    if (!checked) {
      console.dir("START PIPING");
      console.dir(process.memoryUsage());
      console.dir(
        "DIFFERENCE = " +
          (process.memoryUsage().heapUsed - heapUsed) / (1024 * 1024) +
          " MB"
      );
      checked = true;
    }
    next(null, JSON.stringify(chunk) + "\n");
  },
});

const write = fs.createWriteStream("/dev/null");

tr.pipe(write);
stream.pipe(tr);
koenvanzuijlen added the priority: p2 and type: bug labels on Jan 27, 2022
product-auto-label bot added the api: bigquery label on Jan 27, 2022
@steffnay
Contributor

@koenvanzuijlen This is the case because createQueryStream() is a streamified version of the query() method, which retrieves pages of results using job.getQueryResults().
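
To make that concrete, here is a minimal sketch of the paging pattern being described — an illustration of the memory behaviour, not the library's actual internals. It assumes only the documented manual-pagination contract of job.getQueryResults() ([rows, nextQuery], where nextQuery carries the next pageToken):

// Simplified sketch, not the real library code: each page returned by
// job.getQueryResults() is fully materialized in memory before its rows
// are emitted one by one on the object stream.
async function emitQueryRows(job, push) {
  let pageToken;
  do {
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      pageToken,
    });
    // At this point the whole page is held in memory at once.
    for (const row of rows) push(row);
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}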

@koenvanzuijlen
Author

@steffnay Is there an option available to stream query results directly without loading a lot of results into memory first?

@steffnay
Contributor

steffnay commented May 1, 2022

No, unfortunately, we only have the option of streamifying the job.getQueryResults method.

@rohit-gohri

Is there a way to limit the size of this chunk of data? It is causing an OOM exception for us on some tables. Can we use maxResults for that, or will that limit the total results?

steffnay removed their assignment on Jul 8, 2022
@loferris
Contributor

The actual createQueryStream method doesn't expose such a parameter, as it packages both query creation and the use of the streamify API. One option would be to end the stream after a certain amount of data has been emitted and make separate calls to that method until all data has been read. (See an example). For more flexibility, you can also write the logic by hand using the various query and results methods together with streaming, as sketched below. Feel free to open a follow-up issue if a sample would be helpful! As for incorporating more configurability for streamified queries into the library itself, that will depend on user requests.
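
For reference, here is a rough sketch of that "by hand" approach — a sketch under assumptions, not an official sample. It relies only on the documented createQueryJob() and job.getQueryResults() manual-pagination options (autoPaginate, maxResults, pageToken); the pageSize default and the output file handling are placeholders:

const { BigQuery } = require("@google-cloud/bigquery");
const fs = require("fs");
const { once } = require("events");

async function exportQuery(sql, outPath, pageSize = 10000) {
  const bigquery = new BigQuery();
  const [job] = await bigquery.createQueryJob({ query: sql });
  const out = fs.createWriteStream(outPath);

  let pageToken;
  do {
    // Fetch one bounded page at a time instead of auto-paginating,
    // so at most pageSize rows are held in memory per iteration.
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      maxResults: pageSize,
      pageToken,
    });
    for (const row of rows) {
      // Respect backpressure from the destination stream.
      if (!out.write(JSON.stringify(row) + "\n")) await once(out, "drain");
    }
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);

  out.end();
  await once(out, "finish");
}

Calling, for example, exportQuery("SELECT email FROM dataset.table", "/dev/null") keeps roughly one page of rows in memory at a time; tuning pageSize trades API round trips against peak memory.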

@ricardograca-scratch

Also having the same problem. Loading all the results in memory isn't exactly what one would expect from streaming.
