
add a clear warning in the documentation that optimize is not concurrently safe in Cloudflare R2 #1348

Closed
djouallah opened this issue May 9, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@djouallah

Environment

0.9

Binding: Python

Environment:

  • Cloud provider: Cloudflare R2
  • OS: Linux
  • Other:

Bug

Optimize is not a safe operation when it is run concurrently with another Delta writer. It would be nice to make this clear in the documentation. It is not a big problem, since a user can always suspend the other writer and do maintenance in that window, but it should be documented.

@djouallah djouallah added the bug Something isn't working label May 9, 2023
@roeap
Collaborator

roeap commented May 9, 2023

Did you experience a specific bug, or can you describe the specific scenario that worries you?

In principle we are using the new commit function, which includes conflict resolution, so it should be safe to use concurrently. Optimize in particular is not too critical, since it does not change data; it may however fail for concurrent deletes. This should be caught by the conflict resolution, though.

@djouallah
Author

I ran this code with Cloudflare R2 and got duplicated data:

from deltalake import DeltaTable
import os

delta_path = 's3://xxxx/scada'

# Cloudflare R2 is accessed through its S3-compatible API; credentials and the
# endpoint come from environment variables.
storage_options = {
    "Region": "us-east-1",
    "AWS_ACCESS_KEY_ID": os.environ.get("aws_access_key_id_secret"),
    "AWS_SECRET_ACCESS_KEY": os.environ.get("aws_secret_access_key_secret"),
    "AWS_ENDPOINT_URL": os.environ.get("endpoint_url_secret"),
    # disables the safe-rename requirement, so concurrent commits are not protected
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

def compaction(request):
    dt = DeltaTable(delta_path, storage_options=storage_options)
    dt.optimize()
    return 'done'

I got duplicated data in the Delta table.

1 similar comment

@roeap
Collaborator

roeap commented May 9, 2023

Well... safety is specifically disabled :). Without safe rename, we cannot even properly detect that a version was previously created.

"AWS_S3_ALLOW_UNSAFE_RENAME": "true"

@djouallah
Author

Yes, but it will not work otherwise?

@djouallah djouallah changed the title add a clear warning in the documentation that optimize is not concurrently safe add a clear warning in the documentation that optimize is not concurrently safe in Cloudflare R2 May 9, 2023
@roeap
Collaborator

roeap commented May 9, 2023

Yes, but it will not work otherwise?

How do you mean? To have concurrent write support, we (and the Delta protocol itself places requirements on the file system) need a safe rename mechanism in the object store. If that is not available, all guarantees kind of break down.

If that is met, at least in principle things should work. The repo even contains formal proofs for certain aspects of the implementation.

@djouallah
Author

Sorry for not being clear. I think it is the same issue here: if I don't add the option "AWS_S3_ALLOW_UNSAFE_RENAME": "true", neither optimize nor writing to Delta will work at all.

@wjones127
Collaborator

Yes, this is what should happen:

  • "AWS_S3_ALLOW_UNSAFE_RENAME": "true": you will get duplicate data, since you turned off support for safe concurrent writes. You should only set this when you aren't writing concurrently.
  • "AWS_S3_ALLOW_UNSAFE_RENAME": "false" (or unset): neither write nor optimize will work, unless you set up a locking client configuration.

@wjones127
Collaborator

I do hear you, though, that with optimize implemented, we need to have clear warnings in the documentation about what "AWS_S3_ALLOW_UNSAFE_RENAME": "true" really means. Plus, I don't think we have great instructions for setting up a locking client.

@djouallah
Author

@wjones127 By the way, thank you very much for your work; since 0.9, the Delta Lake Python package has become very useful for me.

@roeap
Collaborator

roeap commented May 10, 2023

While we should improve the docs, atomic rename may soon be available for R2 storage (apache/arrow-rs#4194). We will have to do some extra work on the delta-rs side so that R2 does not require the lock client, but at least the basics are there...

Should we update this issue or create a new one to track these changes?
