What happened
The GCSSynchronizeBucketsOperator eventually calls the _prepare_sync_plan function in the GCS hook (hooks/gcs.py). This function retrieves the objects in both buckets using the list_blobs method. However, at present, the Cloud Storage Objects.List API does not return the crc32c for CMEK-protected objects. As per the GCP public docs, "The CRC32C checksum and MD5 hash of objects encrypted with customer-managed encryption keys are not returned when listing objects with the JSON API."
As a result, if an object is CMEK-protected, its crc32c value is always None, leading to incorrect synchronization (crc32c comparison).
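To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the actual Airflow code: ListedBlob and naive_needs_overwrite are hypothetical names that only mimic a checksum comparison on objects whose crc32c came back empty from a list call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListedBlob:
    """Hypothetical stand-in for an object returned by a list call."""
    name: str
    crc32c: Optional[str]  # None when the list response omits the checksum


def naive_needs_overwrite(source: ListedBlob, destination: ListedBlob) -> bool:
    # Simplified comparison: the objects count as different only when
    # their checksums differ. None != None is False in Python.
    return source.crc32c != destination.crc32c


# Two CMEK-protected objects with the same name but different contents:
# both checksums are missing, so the comparison reports "no change".
src = ListedBlob(name="data.csv", crc32c=None)
dst = ListedBlob(name="data.csv", crc32c=None)
print(naive_needs_overwrite(src, dst))  # False
```

With both checksums missing, the comparison cannot tell the objects apart, so the destination is never overwritten.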
What you think should happen instead
This should be handled by making an Objects.Get call to retrieve the crc32c for CMEK'd objects.
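A rough sketch of what such a fallback might look like is below. It is an illustration under assumptions, not a proposed patch: resolve_crc32c and needs_overwrite are hypothetical helpers, and the bucket argument is only assumed to expose a get_blob(name) method that performs an Objects.Get and returns an object with a crc32c attribute (as Bucket.get_blob does in the google-cloud-storage client). Treating an unresolved checksum as "different" is also an assumption of this sketch.

```python
from typing import Optional


def resolve_crc32c(bucket, blob) -> Optional[str]:
    """Return the blob's crc32c, fetching the object individually if needed.

    `bucket` is assumed to provide get_blob(name), which issues an
    Objects.Get request; `blob` is assumed to have .name and .crc32c.
    """
    if blob.crc32c is not None:
        return blob.crc32c
    # The list response had no checksum (e.g. a CMEK-protected object),
    # so fall back to a per-object metadata fetch.
    fetched = bucket.get_blob(blob.name)
    return fetched.crc32c if fetched is not None else None


def needs_overwrite(src_bucket, src_blob, dst_bucket, dst_blob) -> bool:
    src_crc = resolve_crc32c(src_bucket, src_blob)
    dst_crc = resolve_crc32c(dst_bucket, dst_blob)
    if src_crc is None or dst_crc is None:
        # Equality cannot be proven; this sketch chooses to overwrite.
        return True
    return src_crc != dst_crc
```

The extra request is only made for objects whose listed checksum is missing, so buckets without CMEK would not pay for additional API calls.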
How to reproduce
Create a GCP Cloud Key Management Service (KMS) key.
Create two Cloud Storage buckets with a default bucket CMEK key.
Upload an object with the same name but different contents to each bucket.
Run the GCSSynchronizeBucketsOperator with one bucket as source and one as destination, and allow_overwrite=True. Since a file is found with the same name in each, the crc32c will be compared. Since they are both None, they are seen as equal, and the source object does not overwrite the destination one.
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Apache Airflow version
2.7.2
Operating System
Linux / Cloud Composer
Versions of Apache Airflow Providers
No response
Deployment
Google Cloud Composer
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct