QOL-8612 make non-current S3 objects private #75

Merged: 30 commits into develop, Feb 7, 2022

Conversation

ThrawnCA
Copy link
Contributor

@ThrawnCA ThrawnCA commented Feb 4, 2022

Older objects, with different filenames, will still be accessible via the orig_download URL, but will not be directly visible and cacheable.

@duttonw (Member) left a comment
Looking good. I'd like some improvements: optional flags, so others can use this and still keep all resources public if they want,

as well as the ability for the system to clean up dead resource files in S3 that are no longer linked, which in turn speeds up the system.

is_public_read = self.is_key_public(upload['Key'])
upload_key = upload['Key']
log.debug("Setting visibility for key [%s], current object is [%s]", upload_key, current_key)
if upload_key == current_key:
@duttonw (Member) commented Feb 4, 2022

Can we put this behind a config flag?

ckanext.s3filestore.hideOldResourceObject = true/false  # default value: true

If true and the key is not the same as the DB key, set the object private; if false, keep the original behaviour (everything is treated the same as the active file).

Another option, for a 'cleaned up' S3, is:

ckanext.s3filestore.deleteNonLinkedResource = true/false  # default value: false
ckanext.s3filestore.deleteNonLinkedResourceAge = <number of days old>  # default value: 90

If a file is found that does not match the database and is older than deleteNonLinkedResourceAge, delete the file so it is treated as if it had been overwritten, with a possible backup window. (Age might be hard to do, as I don't think you can record ACL changes, and tag key/values are immutable unless the object is 'refreshed'/re-uploaded.)

For our purposes we would have this set to true. As we use S3 versioning, we have 90+ days to recover the file if requested.

This keeps our blob storage size down and only lists what is still active.
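
A minimal sketch of how such a flag might gate the visibility update (the function name and the hide_old parameter are illustrative assumptions, not the merged implementation):

# Illustrative sketch only: the function name and hide_old flag are assumptions.
def set_resource_object_acls(s3_client, bucket_name, current_key, uploads,
                             target_acl='public-read', hide_old=True):
    '''Apply ACLs to every stored object belonging to a resource.

    hide_old mirrors the proposed ckanext.s3filestore.hideOldResourceObject
    flag: when True (the default), non-current objects are made private;
    when False, they keep the same ACL as the active file.
    '''
    for upload in uploads:  # e.g. the 'Contents' list from list_objects_v2
        upload_key = upload['Key']
        if upload_key == current_key or not hide_old:
            acl = target_acl
        else:
            acl = 'private'
        s3_client.put_object_acl(Bucket=bucket_name, Key=upload_key, ACL=acl)

# Example usage:
# import boto3
# s3 = boto3.client('s3')
# listing = s3.list_objects_v2(Bucket='my-bucket', Prefix='resources/')
# set_resource_object_acls(s3, 'my-bucket', current_key,
#                          listing.get('Contents', []))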

@duttonw (Member) commented

It seems you can't 'modify' objects to add a marker, so deleteNonLinkedResourceAge can't be done; if the flag is true, it will just purge dead data.

@ThrawnCA (Contributor, Author) commented

I've added a config value to control the ACLs of non-current files. Default is 'private'.

I'm unclear, though, on why deleting old objects would require modifying tags. Couldn't we just rely on the built-in "Last Modified" metadata?
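
For what it's worth, a rough sketch of an age check that relies only on that built-in metadata (the threshold handling and helper name are assumptions, not the merged code):

import datetime

def is_older_than(upload, max_days):
    '''Return True if a listed S3 object was last modified more than max_days ago.

    'upload' is one entry from a list_objects_v2 'Contents' list; its
    LastModified value is a timezone-aware datetime, so no extra tags or
    ACL-change tracking are needed.
    '''
    if max_days < 0:
        return False  # assume a negative setting means "never delete"
    age = datetime.datetime.now(datetime.timezone.utc) - upload['LastModified']
    return age.days > max_days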

@duttonw (Member) commented

OK, that's true. If it's fresh, it is kept for x days and then deleted, which gives it more of a grace period.

Bucket=self.bucket_name, Key=upload_key, ACL=acl)
self._cache_delete(upload_key)
self._cache_put(upload_key + VISIBILITY_CACHE_PATH, acl)
self._cache_put(current_key + VISIBILITY_CACHE_PATH + '/all', target_acl)
@duttonw (Member) commented

Is it possible to set a longer window for the /all and /visibility keys? One hour may not be long enough to get the performance improvements, and since these items are so small, I think 30 days should be OK. Similarly for public files: they should stay in the cache far longer than private ones before expiring.

@duttonw (Member) commented

e.g. update the signature of _cache_put to
def _cache_put(self, key, value, ttl=None):
with ttl falling back to self.signed_url_cache_window inside the method (a default argument can't reference self), so the default still comes from the cache window, but ACL targets can use 30 days for maximum speed and fewer S3 'hits'.

@ThrawnCA (Contributor, Author) commented

Hmm...since the cache is automatically cleared when a resource is uploaded, I suppose it is reasonable to have a fairly long expiry on it if that hasn't happened.
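
A sketch of that shape, assuming a redis-py client on the uploader and an acl_cache_window attribute alongside the existing signed_url_cache_window (every name other than the latter and VISIBILITY_CACHE_PATH is an assumption):

def _cache_put(self, key, value, ttl=None):
    '''Set a value in the cache, if enabled, with the given expiry in seconds.

    A default argument can't reference self, so ttl=None falls back to the
    signed-URL window; ACL entries can pass a much longer window (e.g. a
    30-day acl_cache_window) to reduce S3 round trips.
    '''
    if not self.cache_enabled:  # assumed flag for "URL caching enabled"
        return
    if ttl is None:
        ttl = self.signed_url_cache_window
    self.redis.set(key, value, ex=ttl)  # redis-py: ex is the expiry in seconds

# e.g. caching an ACL with the longer window:
# self._cache_put(upload_key + VISIBILITY_CACHE_PATH, acl, ttl=self.acl_cache_window)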

@@ -128,14 +129,21 @@ def _cache_get(self, key):
cache_value = six.ensure_text(cache_value)
return cache_value

def _cache_put(self, key, value):
''' Set a value in the cache, if enabled, with an appropriate expiry.
def _cache_put_url(self, key, value):
@duttonw (Member) commented

Unless we will be doing other things (like appending /acl, /visibility, /all, etc.), let's keep it simple: have this just take a ttl parameter and default to self.signed_url_cache_window if None is passed in, so that the TTL can be injected when doing ACL stuff.

@ThrawnCA (Contributor, Author) commented

Well, there's already a logical split based on whether signed URL caching is enabled, and I've just added another to automatically distinguish signed from unsigned URLs.

- unsigned URLs don't expire, so they can be cached much longer
- when updating object visibility, if a deletion age is configured and a non-current object is older than that age, delete it instead of updating its ACL (sketched below)
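
Roughly, that new branch looks something like this (reusing the hypothetical is_older_than helper sketched earlier; the variable names are placeholders, not the merged code):

# delete_age would come from configuration (documented in the README diff
# below); any negative value disables deletion entirely.
if upload_key != current_key and delete_age >= 0 and is_older_than(upload, delete_age):
    # Old enough and no longer current: remove the object instead of re-ACLing it.
    s3_client.delete_object(Bucket=bucket_name, Key=upload_key)
else:
    s3_client.put_object_acl(Bucket=bucket_name, Key=upload_key, ACL=acl)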
README.rst Outdated
ckanext.s3filestore.acl_cache_window = 2592000

# If set, then prior objects uploaded for a resource may be deleted
# after the specified number of days. If less than zero, nothing
@duttonw (Member) commented

Please update this to make clear that the age is measured from when the resource was uploaded, not from when the delete function was called. i.e. if this is set to 90 and the object was uploaded over 90 days ago, it will be purged; if it was uploaded yesterday, it will be kept for another 89 days. After that, it is up to the S3 bucket configuration how long deleted items are kept.

@ThrawnCA (Contributor, Author) commented

I'm not sure which part of the documentation is unclear about that.

@@ -308,7 +327,7 @@ def get_signed_url_to_key(self, key, extra_params={}):
if is_public_read:
url = url.split('?')[0] + '?ETag=' + metadata['ETag']

self._cache_put(key, url)
self._cache_put_url(key, url)
@duttonw (Member) commented Feb 7, 2022

Can you look at refactoring this method, def get_signed_url_to_key(self, key, extra_params={}), to check the cache at the start, before doing the S3 head calls etc.? If the URL is in the cache, return it; that saves us doing so many checks when nothing has changed.

We also saw that if the Redis cache_put fails, it stops the URL being returned to the customer. Can you wrap the Redis put in a try, log the error, and continue? It's a nice-to-have, not a show-stopper for getting the URL.

Also do similar on the Redis get: if it throws an exception, return None, the same as when no cached key is found.
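
Something along these lines is what's being asked for, as a sketch (the defensive wrappers and the presign placeholder are illustrative, not the merged code):

import logging

log = logging.getLogger(__name__)

def _cache_get_safe(self, key):
    '''Return the cached value, or None if Redis is unavailable or errors.'''
    try:
        return self._cache_get(key)
    except Exception:
        log.exception("Failed to read [%s] from cache; treating as a miss", key)
        return None

def _cache_put_url_safe(self, key, url):
    '''Cache the URL on a best-effort basis; never block returning it.'''
    try:
        self._cache_put_url(key, url)
    except Exception:
        log.exception("Failed to cache URL for [%s]; continuing anyway", key)

def get_signed_url_to_key(self, key, extra_params={}):
    # Check the cache first so a hit skips the S3 head/ACL calls entirely.
    cached_url = self._cache_get_safe(key)
    if cached_url:
        return cached_url
    # ... existing head_object / ACL / presigning logic goes here ...
    url = self._generate_presigned_url(key, extra_params)  # placeholder for that logic
    self._cache_put_url_safe(key, url)
    return url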

@duttonw merged commit 70dbb38 into develop on Feb 7, 2022