QOL-8612 make non-current S3 objects private #75
Conversation
- this makes it easier to wipe the 'all objects' cache when uploading a new object
Looking good. I'd like a couple of improvements: optional flags so others can use this while keeping all resources public if they want, as well as the ability for the system to clean up dead resource files in S3 that are no longer linked, which in turn speeds up the system.
is_public_read = self.is_key_public(upload['Key'])
upload_key = upload['Key']
log.debug("Setting visibility for key [%s], current object is [%s]", upload_key, current_key)
if upload_key == current_key:
Can we put this behind a config flag?
ckanext.s3filestore.hideOldResourceObject = true/false #default value: true
If true, any object that doesn't match the database key is set to private.
If false, keep the original functionality (everything gets the same visibility as the active file).
Another option for a 'cleaned up' S3 is:
ckanext.s3filestore.deleteNonLinkedResource = true/false #default value: false
ckanext.s3filestore.deleteNonLinkedResourceAge = <number of days old> #default value: 90
If a file is found that doesn't match the database and it is older than deleteNonLinkedResourceAge, then delete the file so it is treated like it was overwritten, with a possible backup window. (Age might be hard to do, as I don't think you can record ACL changes, and tag key/values are immutable unless the object is 'refreshed'/re-uploaded.)
For our purposes we would have it set to true. As we use S3 versioning, we have 90+ days to recover the file if requested.
This keeps our blob storage size down and only lists what is still active.
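A minimal sketch of how the proposed hideOldResourceObject flag might drive the ACL decision, using the option name from the comment above; the helper name, its arguments and the way the active ACL is passed in are illustrative only, not the extension's actual code:

```python
from ckan.plugins.toolkit import asbool, config


def target_acl_for_key(upload_key, current_key, current_acl):
    """Illustrative helper: pick the ACL for one stored object.

    If the flag is false, every object keeps the ACL of the active resource
    (the original behaviour); if true (the default), only the object matching
    the current database key keeps it and older uploads become private.
    """
    hide_old = asbool(config.get(
        'ckanext.s3filestore.hideOldResourceObject', True))
    if upload_key == current_key or not hide_old:
        return current_acl
    return 'private'
```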
Seems you can't 'modify' objects to add a marker, so deleteNonLinkedResourceAge can't be done; if the flag is true it will just purge dead data.
I've added a config value to control the ACLs of non-current files. Default is 'private'.
I'm unclear, though, on why deleting old objects would require modifying tags. Couldn't we just rely on the built-in "Last Modified" metadata?
OK, that's true. If it's fresh it is kept for x days and then deleted, which gives it more of a grace period.
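A hedged sketch of age-based cleanup keyed off the object's built-in LastModified timestamp rather than tags; the function name and parameters are hypothetical, and the 90-day default simply mirrors the deleteNonLinkedResourceAge suggestion above:

```python
import datetime

import boto3


def purge_if_stale(bucket_name, upload_key, max_age_days=90):
    """Delete a non-current object only if it was uploaded more than
    max_age_days ago, judged by S3's own LastModified metadata."""
    s3 = boto3.client('s3')
    head = s3.head_object(Bucket=bucket_name, Key=upload_key)
    age = datetime.datetime.now(datetime.timezone.utc) - head['LastModified']
    if age.days >= max_age_days:
        # With bucket versioning enabled this only adds a delete marker,
        # so the file can still be recovered within the retention window.
        s3.delete_object(Bucket=bucket_name, Key=upload_key)
        return True
    return False
```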
ckanext/s3filestore/uploader.py
Outdated
Bucket=self.bucket_name, Key=upload_key, ACL=acl)
self._cache_delete(upload_key)
self._cache_put(upload_key + VISIBILITY_CACHE_PATH, acl)
self._cache_put(current_key + VISIBILITY_CACHE_PATH + '/all', target_acl)
Is it possible to set a longer window for the /all and /visibility keys? 1 hour may not be long enough for the performance improvements, and since these items are so small, I think 30 days should be OK. Similarly for public files: they should be left in the cache far longer than private ones before expiring.
e.g. update the signature of _cache_put to
def _cache_put(self, key, value, ttl=self.signed_url_cache_window):
so the default comes from the cache window, but for ACL targets the TTL can be 30 days for maximum speed and fewer S3 'hits'.
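Since a self attribute can't appear as a default argument, a sketch of that suggestion would use None as the sentinel; the redis_conn and acl_cache_window attributes and the Redis set call here are assumptions rather than the extension's actual code:

```python
def _cache_put(self, key, value, ttl=None):
    """Cache a value, defaulting the expiry to the signed URL window
    unless a longer TTL (e.g. for ACL entries) is injected by the caller."""
    if ttl is None:
        ttl = self.signed_url_cache_window
    # Illustrative Redis write; the real storage call may differ.
    self.redis_conn.set(key, value, ex=ttl)


# ACL targets could then pass a much longer window, e.g. 30 days:
# self._cache_put(upload_key + VISIBILITY_CACHE_PATH, acl, ttl=self.acl_cache_window)
```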
Hmm...since the cache is automatically cleared when a resource is uploaded, I suppose it is reasonable to have a fairly long expiry on it if that hasn't happened.
- ACLs can have a very long expiry, since they should only change when a resource or dataset is changed, which will override the cache settings anyway.
ckanext/s3filestore/uploader.py
Outdated
@@ -128,14 +129,21 @@ def _cache_get(self, key):
cache_value = six.ensure_text(cache_value)
return cache_value

def _cache_put(self, key, value):
''' Set a value in the cache, if enabled, with an appropriate expiry.
def _cache_put_url(self, key, value):
Unless we will be doing other things (like appending /acl or /visibility or /all etc.), let's keep it simple and have this just take a ttl parameter, defaulting to self.signed_url_cache_window if None is passed in, so the TTL can be injected when doing ACL stuff.
Well, there's already a logical split based on whether signed URL caching is enabled, and I've just added another to automatically distinguish signed from unsigned URLs.
- unsigned URLs don't expire so they can be cached much longer
- when updating object visibility, if deletion age is configured and non-current object is older than configured age, then delete instead of updating
README.rst
Outdated
ckanext.s3filestore.acl_cache_window = 2592000

# If set, then prior objects uploaded for a resource may be deleted
# after the specified number of days. If less than zero, nothing
Please update this to clarify that the age is measured from when the resource was uploaded, not from when the delete function was called.
i.e. if it was uploaded over 90 days ago, it will be purged when the setting is 90; if it was uploaded yesterday, it will be kept for another 89 days. After that it is up to the S3 bucket configuration how long deleted items are kept.
I'm not sure which part of the documentation is unclear about that.
ckanext/s3filestore/uploader.py
Outdated
@@ -308,7 +327,7 @@ def get_signed_url_to_key(self, key, extra_params={}):
if is_public_read:
url = url.split('?')[0] + '?ETag=' + metadata['ETag']

self._cache_put(key, url)
self._cache_put_url(key, url)
Can you look at refactoring this method, def get_signed_url_to_key(self, key, extra_params={}), to check the cache at the start, before doing the S3 head calls etc.?
If the URL is in the cache, return it; that saves us doing so many checks when it is still the same.
We also saw that if the Redis cache put fails, it stops the URL being returned to the customer. Can you wrap the Redis put in a try, log the error and continue, as it's a nice-to-have, not a show stopper for getting the URL?
Also do similar on the Redis get: if it throws an exception, return None, the same as if no cached key was found.
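A hedged sketch of that shape, reusing the _cache_get and _cache_put_url names visible in the diff; _generate_signed_url is a hypothetical stand-in for the existing head_object/presign logic, and log is assumed to be the module logger already used in uploader.py:

```python
def get_signed_url_to_key(self, key, extra_params={}):
    # 1. Check the cache first and return early, skipping the S3 head calls.
    try:
        cached_url = self._cache_get(key)
    except Exception:
        log.exception("Failed to read cached URL for [%s]; falling back to S3", key)
        cached_url = None
    if cached_url:
        return cached_url

    # Placeholder for the existing head_object / generate_presigned_url logic.
    url = self._generate_signed_url(key, extra_params)

    # 2. A cache write failure should log and continue, never block the URL.
    try:
        self._cache_put_url(key, url)
    except Exception:
        log.exception("Failed to cache URL for [%s]; continuing", key)
    return url
```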
…d-gov-au/ckanext-s3filestore into QOL-8612-manage-s3-visibility
…estore into QOL-8612-manage-s3-visibility
- return without contacting S3 API if we have a cached URL
- add error handling so a Redis failure doesn't break the whole system
- make Flake8 happy
Older objects, with different filenames, will still be accessible via the orig_download URL, but will not be directly visible and cacheable.