GCS connector sends a high volume of ListObjects requests, triggering GCS 429 throttling error #151
I'm seeing an issue that may or may not be related, via GCP Dataproc. Specifically, I have a Spark job running on Dataproc that reads and writes an RDD from a GCS bucket containing implicit directories, and starting on Feb 15, certain interactions with the bucket appear to generate an enormous number of list requests. I don't fully understand the commit referenced by this issue, but from what I gather, it could be what is causing my spike in requests.
None of my Spark job code changed, and there haven't been any notable changes in the size or nature of the data that we're passing in to the Spark job. As mentioned above, this started happening spontaneously on Feb 15.
Most probably this is a regression introduced in GCS connector 1.9.12 during performance optimizations and is not related to the linked commit. We will have a fix shortly and will release a new GCS connector with it.
@medb Thanks for the update. From the information I provided, do you suspect that this issue is also the root cause of the scenario I described? Or do you think what I'm seeing is unrelated?
@medb Also, do I need to do anything to upgrade to 1.9.14 (or is it 1.9.15?)? Or will the latest version be used automatically behind the scenes by Dataproc/Spark? Thanks!
I think that this is a related issue. The problem with the list request is that it is paginated (1,000 objects per request), so if you have a lot of objects in the same directory you can see a higher ratio of list requests to other requests. I think we will have a GCS connector 1.9.15 build in an hour or so, so it would be great if you could validate the fix using the connectors init action. We will update Dataproc with the new GCS connector version automatically, but it will take around a week for the new Dataproc release to roll out, and you will need to recreate your cluster with the latest image.
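The pagination arithmetic described above can be sketched as follows. `list_requests_needed` is a hypothetical helper (not part of the connector), and 1,000 is the page size mentioned in the comment:

```python
import math

def list_requests_needed(num_objects: int, page_size: int = 1000) -> int:
    """Number of paginated list calls needed to enumerate a prefix.

    The list API returns at most `page_size` objects per response, so a
    directory holding many objects requires proportionally many list
    requests -- which is why list traffic can dominate other requests.
    """
    if num_objects <= 0:
        # One request is still made to discover that the listing is empty.
        return 1
    return math.ceil(num_objects / page_size)

print(list_requests_needed(2500))  # 3 pages for 2,500 objects
```

So a directory with 250,000 objects already costs 250 list requests per full enumeration, before any retries.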
Great, thanks for the info @medb. Is there anything we can do in the short term to mitigate the issue prior to the new Dataproc release?
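As a general short-term mitigation for HTTP 429 throttling (this is the standard client-side pattern, not something specific to the connector fix), requests can be retried with jittered exponential backoff. A minimal sketch using a hypothetical `with_backoff` helper:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0,
                 is_throttled=lambda e: "429" in str(e)):
    """Retry fn with jittered exponential backoff on throttling errors.

    Hypothetical helper for illustration only: retries only when the
    error looks like an HTTP 429, and re-raises everything else.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_retries or not is_throttled(e):
                raise
            # Sleep base * 2^attempt plus up to 100% jitter, capped at 32s.
            delay = min(base_delay * (2 ** attempt), 32.0) * (1 + random.random())
            time.sleep(delay)

calls = {"n": 0}

def flaky_list():
    # Simulate a list call that is throttled twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP 429 Too Many Requests")
    return ["obj-1", "obj-2"]

print(with_backoff(flaky_list, base_delay=0.01))  # ['obj-1', 'obj-2']
```

Backoff only smooths out the request rate; it does not reduce the total number of list requests the way the connector fix does.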
We are also seeing this issue; we moved back to an earlier version as a workaround.
GCS connector 1.9.15, which fixes this issue, was just released.
@medb We've reverted to an earlier Dataproc image version. Thanks for the quick turnaround on this issue!
Still seeing this issue on the latest Dataproc image 1.4, which uses GCS connector v1.9.16.
@jaketf from your description it seems that this is a different issue. Could you file a new GitHub issue for this, adding more details about the current behavior (how many clusters with how many nodes are running simultaneously? what is the observed behavior, e.g. list request QPS?) and the expected/desired behavior?
@medb sure. |
…S (HTTP 429 response). Fixes GoogleCloudDataproc#151
While enabling implicit directories can help list all file objects, it has the side effect of sending too many ListObjects requests to GCS. When this happens, GCS returns an HTTP 429 throttling error.
GcsFuse mentions this caveat here:
https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#implicit-directories
Can we please roll back the change?
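The "implicit directories" the comment above refers to can be illustrated with a small sketch (a hypothetical helper, not connector code): GCS has no real directories, so an object named `a/b/c.txt` merely implies the prefixes `a/` and `a/b/`, and discovering those implied prefixes is what drives the extra ListObjects traffic.

```python
def implicit_directories(object_names):
    """Infer the 'directories' implied by flat GCS object names.

    There are no directory objects in the bucket; each '/'-separated
    prefix of an object name acts as an implicit directory.
    """
    dirs = set()
    for name in object_names:
        parts = name.split("/")[:-1]  # drop the leaf object name
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]) + "/")
    return sorted(dirs)

print(implicit_directories(["a/b/c.txt", "a/d.txt"]))  # ['a/', 'a/b/']
```

A connector that materializes these prefixes has to probe or list each one, so deep or wide hierarchies multiply the number of list requests.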