bigquery table.insert fails for identical rows #1041

Closed
j-5-s opened this issue Jan 2, 2016 · 31 comments
@j-5-s

j-5-s commented Jan 2, 2016

It seems to fail sometimes when you insert identical rows in parallel. I think this is because the insertId is generated from an md5 hash of the stringified row. Would you be open to a pull request that lets you pass in your own insert id in an optional metadata object? It could fall back to how it operates now.

https://github.com/GoogleCloudPlatform/gcloud-node/blob/fbf1ecc36c45c48aeee5469e9729a529d94ee610/lib/bigquery/table.js#L904
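For reference, a minimal sketch of the failure mode (this mirrors the md5-of-stringified-row default in the linked table.js; the event objects are made up): two distinct events with identical contents end up sharing an insertId, so BigQuery treats the second request as a duplicate.

var crypto = require('crypto');

// Mirrors the linked table.js logic: hash the stringified row.
function defaultInsertId(row) {
  return crypto.createHash('md5').update(JSON.stringify(row)).digest('hex');
}

var eventA = { user: 'abc', action: 'click' };
var eventB = { user: 'abc', action: 'click' }; // a separate, legitimate event

console.log(defaultInsertId(eventA) === defaultInsertId(eventB)); // true, so one row may be dropped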

@stephenplusplus
Contributor

Thanks for offering! I think you just want to skip the insertId then, since its purpose is to prevent duplicate data from being inserted. If that sounds right, we could just add an options object like:

var opts = {
  allowDuplicates: true // (default: false)
};

table.insert(rows, [opts], callback);

When it's true, we just wouldn't set the insertId.
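A rough sketch of how that might look internally (the option, variable names, and sample data here are assumptions, not the final API):

var crypto = require('crypto');

var options = { allowDuplicates: true }; // proposed flag (hypothetical)
var rows = [{ user: 'abc', action: 'click' }];

var encodedRows = rows.map(function (row) {
  var entry = { json: row };
  if (!options.allowDuplicates) {
    // Keep today's behaviour: derive the insertId from the row contents.
    entry.insertId = crypto.createHash('md5').update(JSON.stringify(row)).digest('hex');
  }
  return entry;
});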

@jgeewax @fhoffa is this a sane plan?

stephenplusplus added the api: bigquery label on Jan 2, 2016
@stephenplusplus
Contributor

Sent a PR: #1045

@stephenplusplus
Contributor

@JamesCharlesworth while trying to test this out for #1045, I can't actually reproduce the error. Can you show me some code I can use that should cause the error?

@palamccc

palamccc commented Jan 5, 2016

https://github.com/GoogleCloudPlatform/gcloud-node/blob/c7b49cb8fec15d6d159db6d7b8832e8ac4e50d0e/lib/bigquery/table.js#L902-L910
As of now, md5(JSON.stringify(row)) is used as the insertId. This is not robust: JSON.stringify is not guaranteed to give the same result for equivalent objects, and md5 can produce the same hash for two different strings.
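A quick illustration of the stringify point: JSON.stringify follows property insertion order, so two objects holding the same fields can serialize to different strings (and therefore different md5 hashes and different insertIds).

var a = { user: 'abc', action: 'click' };

var b = {};
b.action = 'click';
b.user = 'abc';

console.log(JSON.stringify(a)); // '{"user":"abc","action":"click"}'
console.log(JSON.stringify(b)); // '{"action":"click","user":"abc"}'
// Same logical row, different strings, so different insertIds.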

I think insertId generation should be left to the user. If the user is copying data from another database, they may prefer to use an existing database id as the insertId. As mentioned in the Google API docs, the insertId should be an opt-in feature. Also, insertId de-duplication is not guaranteed, so generating an insertId automatically may lead to inconsistent behaviour.

@stephenplusplus
Contributor

I think insertId generation should be left to the user.

👍 Would you (or @JamesCharlesworth) be open to sending a PR that realizes your vision for how that should be implemented?

@palamccc

I looked into the code and tried to remove the insertId feature, but it would break existing applications.

Instead, we could add a new 'insertAll' API and make it transparent: the user builds and passes the request body as per the spec. That way, the user can use the 'insertId', 'templateSuffix', 'ignoreUnknownValues', and 'skipInvalidRows' features. In the current implementation of 'insert', the user can't use any of these.
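For reference, the request body a caller would build for the upstream tabledata.insertAll endpoint looks roughly like this (per the REST API spec; all values below are placeholders):

var requestBody = {
  kind: 'bigquery#tableDataInsertAllRequest',
  skipInvalidRows: false,
  ignoreUnknownValues: false,
  templateSuffix: '_20160105',
  rows: [
    {
      // Caller-chosen de-duplication key, e.g. an existing db id.
      insertId: 'existing-db-id-123',
      json: { user: 'abc', action: 'click' }
    }
  ]
};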

I'd prefer gcloud-node to be just a utility library for calling API endpoints; it shouldn't add any magic to the call. I just checked the Python API: it's very clean and simply matches the API spec.

@stephenplusplus
Contributor

I'd prefer gcloud-node to be just a utility library for calling API endpoints

That's more like https://github.com/google/google-api-nodejs-client, which is the Node version of the Python library you linked to. We purposely design an easier-to-use API around the raw, upstream JSON API.

And that comes with challenges. It's not always clear where our API needs to allow 100% upstream transparency and when lesser coverage is okay. Each call is different from the next. That's why we chose to default insertId in the first place: when sensible, we want each method to be completely "ready-to-go" for the common use case.

I still agree insertId can have its quirks, so it's probably best to avoid defaulting it to a value. The questions I think we have are:

  1. Do we need to allow insertId, templateSuffix, and ignoreUnknownValues to be set at all? The way our library is now is the way it's always been: defaulting insertId and not allowing the other properties to be set at all. This issue was the first to bring it up, and it can be solved by just removing insertId.
  2. If we do allow setting these properties, how can we do it?
    2a. A new insertAll method
    2b. ??

If we can find a good answer to 2, I'm all for supporting it. I would prefer we blend support into the original insert method somehow, e.g.

var row = {
  json: {/*...*/},
  insertId: '...'
};

table.insert([row, row], { raw: true }, function() {});

@callmehiphop thoughts?

@palamccc

I was not aware of https://github.com/google/google-api-nodejs-client. You could probably link to that library in the README.md; the current repo name, https://github.com/GoogleCloudPlatform/gcloud-node, looks like the official library. I landed here from Google search results and assumed this was the official library. Thanks.

@stephenplusplus
Contributor

In fact, we do!

[screenshot of the README section linking to google-api-nodejs-client]

Sorry for the confusion.

@palamccc

Thanks, I missed it. If possible, add it to this page as well: https://cloud.google.com/nodejs/

@stephenplusplus
Contributor

Related issue: #556. I'm not sure who runs the docs pages, but I'll tag @jgeewax to make sure someone sees your feedback.

@callmehiphop
Contributor

@stephenplusplus IMO introducing options for this method seems like the way to go. I like the raw option. Maybe we could also include an option for generating the insertIds when raw is not flagged.

@callmehiphop
Contributor

Was this resolved by #1068?

@stephenplusplus
Contributor

Sure was!

@j-5-s
Author

j-5-s commented Jan 26, 2016

Awesome, thank you guys. Really appreciate this!

@stephenplusplus
Contributor

Re-opening based on feedback from @jgeewax here

Can we re-open the discussion about using a UUID-1 (or similar) to generate insert ID values when none are provided (and which are only used when we automatically retry a failed request) ?

@jgeewax I think it was this comment that left me thinking it was best to stop defaulting something that the API docs said wasn't a good idea:

As mentioned in the Google API docs, the insertId should be an opt-in feature. Also, insertId de-duplication is not guaranteed, so generating an insertId automatically may lead to inconsistent behaviour. (source)


  • Insert rows r
  • Request times out (maybe network cable was yanked, maybe GCP had an error)
  • It's impossible to know for sure whether this batch of rows was added
  • Retry inserting rows r

I believe you're suggesting that all of this is behind the scenes, as opposed to the user initiating the retry. To review our insert methods real quick:

  • table.insert(r): Makes an all-in-one API request. If the request fails, I don't think the API saves any of the data, as the network transmission wouldn't have been properly completed. Retrying is done automatically if there is a failure.
  • table.createWriteStream(). This won't retry at all. Last I tried, retrying writable streams is impossible.

If the underlying goal is safe-retry-able insert operations, would resumable uploads be a better solution? We currently don't support it, but we could (please re-open that issue if we need that). However, this wouldn't work for table.createWriteStream() as mentioned above, and only with insert.

@stephenplusplus
Contributor

stephenplusplus commented May 7, 2017

This part of my last post isn't right:

However, this wouldn't work for table.createWriteStream() as mentioned above, and only with insert.

I forgot I made the GCS resumable uploads API work with file.createWriteStream(), which uses gcs-resumable-upload under the hood. The reason we don't already do this in BigQuery is that I figured the GCS upload -> BigQuery import route was a safer solution when you're doing larger data inserts. I should be able to figure this out with the BigQuery insert API as well, if we decide we want to support this.

@jgeewax
Contributor

jgeewax commented May 7, 2017

I'm focusing specifically on the table.insert(data) (no interest at all with the streaming inserts).

Imagine you have an app and you want to log events for later analysis to BQ (e.g. at the end of each Uber trip, they might make a request to BQ saying "this trip happened: "). If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ).

I think the auto-generation that we had (hashing the data) was misplaced in that it could convey to people that BQ is capable of de-duplicating data on the way in when it really is not.

The insert ID is not for de-duplicating data itself, but instead for de-duplicating requests. In other words, it's there to make sure that requests which returned an ambiguous result (e.g., the data may have made it into BQ, but you can't be sure) can be safely retried without causing duplicate data to be added.

Using a UUID-4 (random values) means that these types of errors (e.g., "Something might have gone wrong, but it's also possible that everything is fine...") can be corrected with a retry ("Just try it again with the same randomly generated value, no problem at all, worst case is we disregard the request because we already got it").

This also means that client-side errors (e.g., the process crashed before we know if the data was added) are not solved at all -- users are still responsible for that, which I think is the right thing to do.

@stephenplusplus
Contributor

If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ).

But would this happen when we're only making one HTTP request? I believe the request goes through with transfer-encoding: chunked. Is the upstream API parsing each chunk as it arrives, inserting rows as soon as it sees them, or does it wait until all data is received before storing any data? I have been assuming the API sees an incomplete transmission as a failure, so it wouldn't save any of it. So our retrying would be safe, because nothing from the failed attempts could have been stored.

@stephenplusplus
Contributor

If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ).

I think my answer is in here, but I'm trying to figure it out:

  1. "my-app" calls table.insert()
  2. Network cable pops out, callback not executed to communicate the result of the request
  3. my-app goes "dang, network dropped out, I need to retry [step 1]"
  4. data saves twice

Is that what we're trying to fix?

@jgeewax
Contributor

jgeewax commented May 8, 2017

Yes. In that scenario, if we don't have an insert ID, the retry is not safe and may end up with data saving twice.

It's important to make clear that we're not trying to use the insert ID for data de-duplication but for request de-duplication. That is, this should be a per-request ID, not a "per-data" ID. Does that make sense?

@stephenplusplus
Contributor

Alright, so we are solving for:

  • We send a request
  • BigQuery inserts the data
  • Network drops out between ending the POST transmission and receiving the response (this is what I've been stuck on. This isn't a streaming request, it's an all-in-one. Wouldn't this be a very, very narrow slice of time?)
  • Network wakes up
  • Our retry logic only got a network hangup/timeout response, so it makes the request again

@jgeewax
Contributor

jgeewax commented May 10, 2017

This isn't a streaming request, it's an all-in-one.

Correct.

Wouldn't this be a very, very narrow slice of time?

Yes. Imagine sending 1k QPS and yanking the cable. Or any other network blip. At least some of those requests are going to be in this situation.

@stephenplusplus
Contributor

Okay, thanks for walking me through it. I'll assign a uuid to non-insertId'd rows of data.
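A minimal sketch of that change (assuming the uuid package; the helper name and row shape here are illustrative, not the final implementation):

var uuid = require('uuid');

function encodeRows(rows) {
  return rows.map(function (row) {
    // Respect a caller-supplied { insertId, json } pair; otherwise assign a
    // random v4 id so an automatic retry of the same request is idempotent.
    if (row.insertId && row.json) {
      return row;
    }
    return {
      insertId: uuid.v4(),
      json: row
    };
  });
}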

stephenplusplus self-assigned this on May 10, 2017
@jgeewax
Contributor

jgeewax commented May 10, 2017 via email

@stephenplusplus
Contributor

Know anyone?

@stephenplusplus
Contributor

@swcloud do you know anyone from the BigQuery team who can help review this suggestion? Thanks!

@danoscarmike
Contributor

@callmehiphop might this be addressed by the bigquery work you're currently looking at?

@callmehiphop
Contributor

@danoscarmike AFAIK there is not a requirement that covers this specific issue; however, if we could get some feedback from a BigQuery team member, I'd be happy to sneak it in.

@danoscarmike
Contributor

@jba could you run this by the best POC on BigQuery? Thanks!

@jba

jba commented Aug 9, 2017

@danoscarmike There's so much going on in this issue that I'm not sure what "this" is. But I can tell you:

  • BQ insert IDs are per-row.
  • The Go client lets the user provide their own in the "standard" case.
  • There are also cases (like uploading Go values directly, which for convenience we allow and convert into rows) where there is no surface for allowing the user to provide an insert ID, and we do not attempt to construct one. This can result in duplicate rows. We document this clearly.
  • In their recent set of requests, BQ team did not mention anything about client handling of insert IDs.

If that still leaves open questions for BQ team, let me know what they are and I'll relay.
