
S3 sync recursively with per-object metadata #2045

Open

jcmcken opened this issue Jun 29, 2016 · 8 comments
Labels
feature-request (A feature should be added or improved.) · p3 (This is a minor priority issue) · s3sync

Comments

jcmcken commented Jun 29, 2016

I'm looking to take advantage of the aws s3 sync command, but to provide per-object metadata (i.e. metadata that can change per object) rather than the global metadata set with --metadata.

Right now, I basically have a few options:

  • Write my own syncing procedures. This would seem duplicative of the work/testing that has gone into the sync command.
  • Include metadata in separate objects, e.g. /path/to/object and /path/to/object.meta. This isn't the greatest: it means I have to pay for extra objects and also manage the metadata within my application.
  • Upload on a per-object basis. This isn't going to be performant for my use case.

What would be nice is if I could somehow indicate to the CLI that I want to map each object to a set of metadata, and then upload each object with that metadata. A couple of solutions come to mind:

  • A giant JSON file mapping each key name to a metadata hash e.g.:
{
  "path/to/object1": {"key1": "value1", "key2": "value2},
  ...etc...
}
$ aws s3 sync /some/dir s3://somebucket --metadata-mapping /path/to/meta/mapping.json
  • Some convention for writing metadata locally into a separate file per intended object, and having the sync command read the metadata for each object prior to uploading. For example, I could have a local directory:
$ ls /path/to/local/files
file1
file1.meta
$ cat file1.meta
{
  "key1": "value1",
  "key2": "value2"
}
$ aws s3 sync /path/to/local/files s3://somebucket --object-metadata '$filename.meta'

(So when this is run, the $filename.meta files would just be read for metadata, and would not be transferred)

  • A callback that takes the local filename as a parameter and spits out the metadata, e.g.
$ ls /path/to/local/files
file1
$ lookup-metadata.py /path/to/local/files/file1
{
  "key1": "value1"
}
$ aws s3 sync /path/to/local/files s3://somebucket --metadata-callback lookup-metadata.py
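
For illustration only, here is a minimal sketch (in Python, reusing the hypothetical lookup-metadata.py name from above) of what such a callback script could look like: it takes a local path, looks the metadata up somewhere, and prints a JSON object on stdout for the caller to read.

#!/usr/bin/env python
# Purely illustrative sketch of the hypothetical lookup-metadata.py
# callback described above; no such CLI option exists today. It takes a
# local path, looks up metadata (stand-in logic below), and prints a
# JSON object on stdout.
import json
import os
import sys

def lookup_metadata(path):
    # Stand-in lookup: use a sidecar "<file>.meta" if one exists,
    # otherwise fall back to a trivial computed value.
    sidecar = path + ".meta"
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            return json.load(f)
    return {"basename": os.path.basename(path)}

if __name__ == "__main__":
    print(json.dumps(lookup_metadata(sys.argv[1])))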

Alternatively, what would be really great is if the syncing functionality were available independently of the CLI from within Python (without requiring me to figure out the internals of how to properly initialize the CLI environment, etc.), so that I could subclass and customize the process. I started going down this route somewhat, but am worried that this API is not for public consumption and would break in the future.
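
Until something like that exists, the closest workaround outside the CLI appears to be plain boto3: upload_file accepts per-call ExtraArgs, including a Metadata map, and handles multipart internally, so a short script can attach different metadata to each object, although it reproduces none of sync's changed-file detection. A rough sketch follows; the bucket name, source directory, and get_metadata helper are placeholders.

# Rough sketch only, not the CLI's sync: per-object metadata with plain
# boto3. The bucket, source directory, and get_metadata() are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "somebucket"
SRC = "/path/to/local/files"
s3 = boto3.client("s3")

def get_metadata(path):
    # Placeholder: fetch this object's metadata from wherever it lives.
    return {"key1": "value1"}

def upload(path):
    key = os.path.relpath(path, SRC)
    # ExtraArgs={"Metadata": ...} sets the user metadata for this object.
    s3.upload_file(path, BUCKET, key,
                   ExtraArgs={"Metadata": get_metadata(path)})

paths = [os.path.join(root, name)
         for root, _, names in os.walk(SRC)
         for name in names
         if not name.endswith(".meta")]  # skip any sidecar metadata files
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload, paths))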

Any thoughts?

JordonPhillips (Member) commented

I'm -1 on adding that. I don't think providing that kind of mapping is a very good experience. At that point you're effectively setting everything manually anyway, so it would take just as much time to perform all those requests.

As far as using our code, we don't guarantee we won't break internals. However, it is MIT licensed so feel free to vendor or copy it.

JordonPhillips added the closing-soon label (This issue will automatically close in 4 days unless further comments are made.) on Jun 29, 2016

jcmcken commented Jun 29, 2016

In my use case, the metadata is precomputed against the objects I'm trying to store and placed in a storage backend (details not important, but e.g. MongoDB). All I'm doing is retrieving the data from that backend and storing it with the objects. If I do this object-by-object, then I need to recreate threaded uploads, multipart handling, sync strategies, etc. -- all of the things that the sync command normally does for me -- and then hook in my own logic to make sure the correct metadata is stored with each object. If the CLI supported a mapping or callback, I would just need to translate the data into the correct format (which I can stage ahead of time) and then run the sync.

jamesls (Member) commented Jul 6, 2016

+1 for me. I think this is a reasonable request. Out of all the proposed solutions, I like the metadata JSON file the best. I'm inclined to mark this as a feature request.

@jcmcken One other thing worth considering is the work @kyleknap's been doing for s3transfer. It's still under active development so I wouldn't recommend it for general use just yet, but the idea is to create a good python API for the functionality that's currently exposed in the AWS CLI.
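
For anyone who wants to experiment with that route, the rough shape (treating the names as approximate, since the s3transfer API is still in flux) is a TransferManager wrapping a boto3 client, where each upload call takes its own extra_args; that per-call argument is where a per-object Metadata map would go. In the sketch below, the metadata_for helper and file list are placeholders.

# Rough sketch against the in-progress s3transfer package mentioned above;
# treat the exact names as approximate. The key point is that each upload
# call takes its own extra_args, which is where per-object Metadata goes.
import boto3
from s3transfer.manager import TransferManager

def metadata_for(path):
    # Placeholder: return whatever metadata belongs to this object.
    return {"key1": "value1"}

client = boto3.client("s3")
manager = TransferManager(client)
files = {"/path/to/local/files/file1": "file1"}  # local path -> S3 key

futures = [
    manager.upload(path, "somebucket", key,
                   extra_args={"Metadata": metadata_for(path)})
    for path, key in files.items()
]
for future in futures:
    future.result()  # block until each transfer finishes
manager.shutdown()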

jamesls removed the closing-soon label on Jul 6, 2016
rmharrison commented

@jamesls Since s3transfer is still very much in active development, do you have a recommendation for syncing with per file metadata?

node-s3-client is the most promising library I've come across, but the project seems to be having problems with the underlying AWS SDK, see andrewrk/node-s3-client#129

ASayre (Contributor) commented Feb 6, 2018

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

ASayre closed this as completed on Feb 6, 2018
jamesls (Member) commented Apr 6, 2018

Based on community feedback, we have decided to return feature requests to GitHub issues.


pgriess commented Sep 30, 2021

I'd like this to exist and am willing to spend some time building it.

What is the best way to proceed here? I can jump right to submitting a PR for the single metadata JSON file, but would it be helpful to discuss design / implementation strategy first? I've never committed to this repo before, so if there are any pointers to related code / suggested supporting infrastructure, I'm all ears.

tim-finnigan (Contributor) commented

Hi @pgriess, thanks for your willingness to contribute. If you want to create a PR, I recommend reading the contributing guide here: https://github.com/aws/aws-cli/blob/master/CONTRIBUTING.md

You can expand on your proposed implementation here or in a PR. I think looking through these s3 sync customizations is a good place to start: https://github.com/aws/aws-cli/tree/develop/awscli/customizations/s3/syncstrategy
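
For orientation, the strategies in that directory subclass a common BaseSync and mainly answer "should this file be synced?"; attaching per-object metadata to the upload itself would most likely need changes beyond the sync strategy. A rough sketch of the shape, with the interface treated as approximate (check base.py in that directory for the real signatures):

# Rough, illustrative sketch of an s3 sync strategy based on the
# awscli.customizations.s3.syncstrategy package linked above. Treat the
# interface as approximate. Note that sync strategies only decide whether
# a file gets copied; wiring per-object metadata into the upload itself
# would require changes in the transfer/request-building code as well.
from awscli.customizations.s3.syncstrategy.base import BaseSync

class MetadataAwareSync(BaseSync):
    def determine_should_sync(self, src_file, dest_file):
        # Illustrative comparison only: re-upload whenever sizes differ.
        return src_file.size != dest_file.size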

tim-finnigan added the p3 (This is a minor priority issue) label on Nov 4, 2022