
Perform S3 directory deletion with batch requests #13974

Merged
merged 6 commits into from
Dec 9, 2022

Conversation

findinpath
Contributor

@findinpath findinpath commented Sep 2, 2022

Description

Speed up the deletion of an S3 "directory" (a path prefix that corresponds to multiple S3 objects) by using batch deletion requests.

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive, Delta, Iceberg (Lakehouse connectors)

How would you describe this change to a non-technical end user or system administrator?

Speed up the deletion of an S3 "directory" (a path prefix that corresponds to multiple S3 objects) by using batch deletion requests.

Related issues, pull requests, and links

Fixes #13017

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Delta, Hive, Iceberg
* Ensure the rename/delete effectiveness for S3 directories which appear shallowly like objects

@pettyjamesm
Member

I'm not sure this logic totally makes sense. An S3 object with Content-Type: "application/octet-stream", zero bytes, and a key that doesn't end in / (e.g. s3://bucket/key) seems like it should be interpreted as an "empty file", not a directory. I could see the argument for interpreting an S3 object s3://bucket/key/ as a directory, or s3://bucket/key as a "directory" even if no object with that key exists, so long as a "child object" exists (e.g. some object with key s3://bucket/key/object). But when an S3 object actually exists without the trailing slash, I think we have to interpret it as a "file" and not a directory unless it's appropriately marked with the right content type.

Now with that said, it's important to remember that S3 is not a file system but an object store, so it doesn't really have directories. So there might be a better option for handling delete(Path path, boolean recursive) when recursive is true. In that case, instead of doing this, which recursively generates S3 listing, getObjectMetadata, and deleteObject calls:

for (FileStatus file : listStatus(path)) {
  delete(file.getPath(), true);
}

You could instead do something like (simplified to ignore details):

Iterator<S3ObjectSummary> listings = S3Objects.withPrefix(s3, bucketName, keyFromPath(path) + "/").iterator();
Iterator<String> keys = Iterators.transform(listings, S3ObjectSummary::getKey);
Iterator<List<String>> keyBatches = Iterators.partition(keys, 1000);
while (keyBatches.hasNext()) {
  String[] keysInBatch = keyBatches.next().toArray(String[]::new);
  // TODO: handle MultiObjectDeleteException in case some deletes fail
  s3.deleteObjects(new DeleteObjectsRequest(bucketName).withKeys(keysInBatch));
}

This will be much more efficient, faster, and will completely remove all S3 objects whose keys actually start with the prefix "s3://<bucket>/<path>/".
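The batching step above can be sketched without the AWS SDK at all. This is a minimal, self-contained illustration (hypothetical class and key names); the 1,000-key batch size matches the documented maximum for a single S3 DeleteObjects request, and the actual `s3.deleteObjects` call is only indicated in a comment:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDeleteSketch {
    static final int MAX_KEYS_PER_BATCH = 1000; // S3 DeleteObjects accepts at most 1000 keys per request

    // Split the full key list into consecutive batches of at most batchSize keys
    static List<List<String>> partition(List<String> keys, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            batches.add(keys.subList(i, Math.min(i + batchSize, keys.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 2500 hypothetical object keys under the "directory" prefix
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("path/object-" + i);
        }
        List<List<String>> batches = partition(keys, MAX_KEYS_PER_BATCH);
        // Each batch would become one request, e.g. (AWS SDK v1, not executed here):
        // s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(batch.toArray(new String[0])));
        System.out.println(batches.size()); // prints 3 (1000 + 1000 + 500)
    }
}
```

The point of the partitioning is that 2,500 objects cost 3 DeleteObjects calls instead of 2,500 individual deleteObject calls.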

@findinpath findinpath changed the title Ensure the rename/delete effectiveness for S3 directories which appear shallowly like objects Perform S3 directory deletion with batch requests Sep 5, 2022
@findinpath
Contributor Author

@pettyjamesm I have addressed your comment.

As you mentioned offline there is definitely room for improvement in io.trino.plugin.hive.s3.TrinoS3FileSystem#rename . I would suggest addressing the refactoring of this method in a separate PR.

Member

@pettyjamesm pettyjamesm left a comment


Added review comments. I think we probably want to reorganize the logic to check recursive explicitly instead of handling it by falling through, since that's now a much more significant factor in the implementation logic. Recursive deletes for "implied directories" (i.e. the key does not exist but is a prefix of keys that do) and for "actual directories" (the key exists, but has content type DIRECTORY_MEDIA_TYPE) can proceed, but "empty objects" must be considered files and should not allow recursive deletes.
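The three cases above can be summarized as a small decision function. This is a sketch of the proposed rule only, with hypothetical flag names standing in for the actual lookups (HeadObject, content-type check, prefix listing), not the real TrinoS3FileSystem code:

```java
public class S3PathClassifier {
    enum Kind { IMPLIED_DIRECTORY, ACTUAL_DIRECTORY, FILE, NOT_FOUND }

    // keyExists: an object exists at exactly this key (no trailing slash)
    // hasDirectoryContentType: that object is marked with DIRECTORY_MEDIA_TYPE
    // hasChildren: at least one object exists under key + "/"
    static Kind classify(boolean keyExists, boolean hasDirectoryContentType, boolean hasChildren) {
        if (keyExists) {
            // An existing object counts as a directory only when explicitly marked;
            // a zero-byte object without the marker is still an (empty) file.
            return hasDirectoryContentType ? Kind.ACTUAL_DIRECTORY : Kind.FILE;
        }
        return hasChildren ? Kind.IMPLIED_DIRECTORY : Kind.NOT_FOUND;
    }

    public static void main(String[] args) {
        // Recursive deletes may proceed for the first two kinds, never for FILE
        System.out.println(classify(false, false, true));  // IMPLIED_DIRECTORY
        System.out.println(classify(true, true, false));   // ACTUAL_DIRECTORY
        System.out.println(classify(true, false, false));  // FILE (even when zero bytes)
    }
}
```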

}

// check if this path is a directory
Iterator<LocatedFileStatus> iterator = listPath(path, OptionalInt.of(1), ListingMode.SHALLOW_ALL);
Member


I think this check will succeed for non-directory objects, since listing a prefix for any key that exists will return the object at that key.

Contributor Author


Do note that listPath uses path + "/" as the prefix for the listing, not path.
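A tiny illustration of why the trailing slash matters here (hypothetical helper, not the actual listPath code): a listing with prefix "key/" matches only keys strictly under the "directory" and never an object stored at exactly "key":

```java
public class PrefixListingSketch {
    // Mimics what prefix matching does in an S3 ListObjects call with prefix = directoryKey + "/"
    static boolean matchedByDirectoryListing(String objectKey, String directoryKey) {
        return objectKey.startsWith(directoryKey + "/");
    }

    public static void main(String[] args) {
        System.out.println(matchedByDirectoryListing("key", "key"));        // false: the object at the bare key itself
        System.out.println(matchedByDirectoryListing("key/object", "key")); // true: a child object under the prefix
    }
}
```

So a non-empty listing under path + "/" indicates children exist, without being confused by an object at the bare key.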

@findinpath
Contributor Author

I think we probably want to reorganize the logic to check recursive explicitly

I agree. The current state of the code for the delete method is rather hard to follow. I'll do a subsequent refactoring.

@pettyjamesm I've added a set of tests against AWS S3 object storage in order to be able to test the assumptions from the code.

@electrum
Member

electrum commented Sep 7, 2022

We’re planning to replace TrinoS3FileSystem with a native TrinoFileSystem implementation, which has an API designed for object stores. Iceberg and Delta Lake already use it. Maybe instead of improving code that we’ll throw away, we should build the new implementation now?

@findinpath
Contributor Author

@electrum I understand the direction of replacing TrinoS3FileSystem with a native TrinoFileSystem implementation.
However, my change is narrowly scoped (it concerns only the delete method), while starting the endeavour of building a native TrinoFileSystem implementation would take much more time.
Even if the bugfix is short- to medium-lived, I still see a benefit in landing this new approach.

@findepi
Member

findepi commented Nov 30, 2022

/test-with-secrets sha=12265ea06d6ac9e006bb56973cb23cc54a8e9d2d

@findepi
Member

findepi commented Dec 1, 2022

(rebased to fix test-with-secrets job)

@findepi
Member

findepi commented Dec 1, 2022

/test-with-secrets sha=07e689f46a2a40fe6e618b02a0eae4a2ff492b76

@findepi
Member

findepi commented Dec 1, 2022

@github-actions

github-actions bot commented Dec 1, 2022

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3591234020

Delete all the objects found under the specified path
prefix in batches in order to improve the performance of
the operation.

This operation is customised for the needs of Trino and
intentionally does not add unnecessary complexity for
dealing with the various quirks of S3-compatible object
storage systems that are not relevant to Trino's use cases.
@findepi
Member

findepi commented Dec 5, 2022

@findinpath what were the changes in the last two force-pushes (#13974 (comment), #13974 (comment))?

@findinpath
Contributor Author

I have changed the setup() and tearDown() methods from TestTrinoS3FileSystemMinio so that there are no resources allocated in the constructor.

@findepi
Member

findepi commented Dec 5, 2022

/test-with-secrets sha=ba2d0b67d0890174483733fe48012373c7478ead

@github-actions

github-actions bot commented Dec 5, 2022

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3623643550

@findepi
Member

findepi commented Dec 9, 2022

/test-with-secrets sha=62a44129acd1d4db0e8630062e3c01901bac0882

@findinpath
Contributor Author

CI hit #13199

@github-actions

github-actions bot commented Dec 9, 2022

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3657194620

@findepi
Member

findepi commented Dec 9, 2022

@nineinchnick do you understand the Pull Request Labeler / pt with secrets (pull_request_target) failure (https://github.com/trinodb/trino/pull/13974/checks?check_run_id=9998437275)?

Development

Successfully merging this pull request may close these issues.

Delta Lake connector doesn't remove content of MANAGED_TABLE Glue tables created by Databricks
8 participants