GCSSynchronizeBucketsOperator does not retrieve crc32c hash for CMEK objects #34980

Closed
2 tasks done
dmedora opened this issue Oct 16, 2023 · 1 comment · Fixed by #38191
Labels
kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

dmedora commented Oct 16, 2023

Apache Airflow version

2.7.2

What happened

GCSSynchronizeBucketsOperator eventually calls the _prepare_sync_plan function in the GCS hook module. This function retrieves the objects in each bucket with the list_blobs method. However, the Cloud Storage Objects.List API currently does not return the crc32c for CMEK-protected objects. Per the GCP public docs: "The CRC32C checksum and MD5 hash of objects encrypted with customer-managed encryption keys are not returned when listing objects with the JSON API."

As a result, the crc32c value of a CMEK-protected object is always None in the listing, so the crc32c comparison that drives synchronization gives incorrect results.

What you think should happen instead

The hook should handle this by making an Objects.Get call to retrieve the crc32c for CMEK-protected objects.
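A minimal sketch of that fallback, assuming the google-cloud-storage client that the hook already uses. The helper names `crc32c_with_fallback` and `checksums_by_name` are hypothetical and are not taken from the Airflow code base; `Client.bucket`, `Client.list_blobs`, `Bucket.get_blob` and `Blob.crc32c` are existing client-library calls. It is not necessarily how the eventual fix is implemented.

```python
from typing import Any, Dict, Optional


def crc32c_with_fallback(bucket: Any, blob: Any) -> Optional[str]:
    """Return the blob's crc32c, re-fetching its metadata if the listing omitted it."""
    if blob.crc32c is not None:
        return blob.crc32c
    # The listing left the checksum empty (as happens for CMEK-protected
    # objects), so issue a per-object metadata request (Objects.Get).
    fetched = bucket.get_blob(blob.name)
    return fetched.crc32c if fetched is not None else None


def checksums_by_name(
    client: Any, bucket_name: str, prefix: Optional[str] = None
) -> Dict[str, Optional[str]]:
    """Map object name to crc32c for every object under the given prefix."""
    bucket = client.bucket(bucket_name)
    return {
        blob.name: crc32c_with_fallback(bucket, blob)
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    }
```

With a real client this could be called as `checksums_by_name(storage.Client(), "my-bucket")`, where the bucket name is a placeholder. The extra request is only made for objects whose checksum is missing from the listing, so buckets without CMEK objects incur no additional API calls.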

How to reproduce

  1. Create a GCP Cloud Key Management Service (KMS) key.
  2. Create two Cloud Storage buckets with a default bucket CMEK key.
  3. Upload an object with the same name but different contents to each bucket.
  4. Run the GCSSynchronizeBucketsOperator with one bucket as the source, the other as the destination, and allow_overwrite=True. Because an object with the same name exists in both buckets, their crc32c values are compared. Both values are None, so the objects are treated as identical and the source object does not overwrite the destination object.
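The outcome of step 4 can be illustrated with a simplified model of the comparison. This is a sketch, not the hook's actual code: the function names are hypothetical, and the checksums are passed in as plain dictionaries mapping object name to crc32c, with None standing for a checksum missing from the listing.

```python
from typing import Dict, Optional, Set

Checksums = Dict[str, Optional[str]]


def naive_overwrite_plan(source: Checksums, destination: Checksums) -> Set[str]:
    """Overwrite only when the two checksums differ (None == None counts as equal)."""
    return {
        name
        for name in source.keys() & destination.keys()
        if source[name] != destination[name]
    }


def cautious_overwrite_plan(source: Checksums, destination: Checksums) -> Set[str]:
    """Also overwrite when either checksum is unknown, since equality cannot be proven."""
    plan = set()
    for name in source.keys() & destination.keys():
        src, dst = source[name], destination[name]
        if src is None or dst is None or src != dst:
            plan.add(name)
    return plan


# Two CMEK-protected objects with different contents: both listings report None.
source = {"file.txt": None}
destination = {"file.txt": None}

print(naive_overwrite_plan(source, destination))     # set()
print(cautious_overwrite_plan(source, destination))  # {'file.txt'}
```

The cautious variant is only one possible policy: it avoids silently skipping changed objects at the cost of rewriting unchanged ones. Fetching the real checksum per object, as proposed above, avoids both problems.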

Operating System

Linux / Cloud Composer

Versions of Apache Airflow Providers

No response

Deployment

Google Cloud Composer

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

dmedora added the area:core, kind:bug, and needs-triage labels Oct 16, 2023
boring-cyborg bot commented Oct 16, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
