
S3 sync recursively with per-object metadata #2045

Open

jcmcken opened this issue Jun 29, 2016 · 8 comments
Labels
feature-request (A feature should be added or improved.) · p3 (This is a minor priority issue) · s3sync

Comments

jcmcken commented Jun 29, 2016

I'm looking to take advantage of the aws s3 sync command, but to provide per-object metadata (i.e. metadata that can change per object) rather than the global metadata set with --metadata.

Right now, I basically have a few options:

  • Write my own syncing procedures. This would seem duplicative of the work/testing that has gone into the sync command.
  • Include metadata in separate objects, e.g. /path/to/object and /path/to/object.meta. This isn't the greatest: it means I have to pay for extra objects and also manage the metadata within my application.
  • Upload on a per-object basis. This isn't going to be performant for my use case.

What would be nice is if I could somehow indicate to the CLI that I want to map each object to a set of metadata, and then upload each object with that metadata. A couple of solutions come to mind:

  • A giant JSON file mapping each key name to a metadata hash e.g.:
{
  "path/to/object1": {"key1": "value1", "key2": "value2},
  ...etc...
}
$ aws s3 sync /some/dir s3://somebucket --metadata-mapping /path/to/meta/mapping.json
  • Some convention for writing metadata locally into a separate file per intended object, and having the sync command read the metadata for each object prior to uploading. For example, I could have a local directory:
$ ls /path/to/local/files
file1
file1.meta
$ cat file1.meta
{
  "key1": "value1",
  "key2": "value2"
}
$ aws s3 sync /path/to/local/files s3://somebucket --object-metadata '$filename.meta'

(So when this is run, the $filename.meta files would just be read for metadata, and would not be transferred)

  • A callback that takes the local filename as a parameter and spits out the metadata, e.g.
$ ls /path/to/local/files
file1
$ lookup-metadata.py /path/to/local/files/file1
{
  "key1": "value1"
}
$ aws s3 sync /path/to/local/files s3://somebucket --metadata-callback lookup-metadata.py
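
For illustration only, here is a minimal sketch (in Python, reusing the hypothetical lookup-metadata.py name from above) of what such a callback script could look like: it takes a local path, looks the metadata up somewhere, and prints a JSON object on stdout for the caller to read.

#!/usr/bin/env python
# Purely illustrative sketch of the hypothetical lookup-metadata.py
# callback described above; no such CLI option exists today. It takes a
# local path, looks up metadata (stand-in logic below), and prints a
# JSON object on stdout.
import json
import os
import sys

def lookup_metadata(path):
    # Stand-in lookup: use a sidecar "<file>.meta" if one exists,
    # otherwise fall back to a trivial computed value.
    sidecar = path + ".meta"
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            return json.load(f)
    return {"basename": os.path.basename(path)}

if __name__ == "__main__":
    print(json.dumps(lookup_metadata(sys.argv[1])))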

Alternatively, what would be really great is if the syncing functionality were available independently of the CLI from within Python (without requiring me to figure out the internals of how to properly initialize the CLI environment, etc.), so that I could subclass and customize the process. I started going down this route somewhat, but am worried that this API is not for public consumption and would break in the future.
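
Until something like that exists, the closest workaround outside the CLI appears to be plain boto3: upload_file accepts per-call ExtraArgs, including a Metadata map, and handles multipart internally, so a short script can attach different metadata to each object, although it reproduces none of sync's changed-file detection. A rough sketch follows; the bucket name, source directory, and get_metadata helper are placeholders.

# Rough sketch only, not the CLI's sync: per-object metadata with plain
# boto3. The bucket, source directory, and get_metadata() are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "somebucket"
SRC = "/path/to/local/files"
s3 = boto3.client("s3")

def get_metadata(path):
    # Placeholder: fetch this object's metadata from wherever it lives.
    return {"key1": "value1"}

def upload(path):
    key = os.path.relpath(path, SRC)
    # ExtraArgs={"Metadata": ...} sets the user metadata for this object.
    s3.upload_file(path, BUCKET, key,
                   ExtraArgs={"Metadata": get_metadata(path)})

paths = [os.path.join(root, name)
         for root, _, names in os.walk(SRC)
         for name in names
         if not name.endswith(".meta")]  # skip any sidecar metadata files
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload, paths))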

Any thoughts?

JordonPhillips (Member) commented

I'm -1 on adding that. I don't think providing that kind of mapping is a very good experience. At that point you're effectively setting everything manually anyway, so it would take just as much time to perform all those requests.

As far as using our code, we don't guarantee we won't break internals. However, it is MIT licensed so feel free to vendor or copy it.

JordonPhillips added the closing-soon label (This issue will automatically close in 4 days unless further comments are made.) on Jun 29, 2016

jcmcken commented Jun 29, 2016

In my use case, the metadata is precomputed against the objects I'm trying to store and placed in a storage backend (details not important, but e.g. MongoDB). All I'm doing is retrieving the data from that backend and storing it with the objects. If I do this object-by-object, then I need to recreate threaded uploads, multipart handling, sync strategies, etc. -- all of the things that the sync command normally does for me -- and then hook in my own logic to make sure the correct metadata is stored with each object. If the CLI supported a mapping or callback, I would just need to translate the data into the correct format (which I can stage ahead of time) and then run the sync.

jamesls (Member) commented Jul 6, 2016

+1 for me. I think this is a reasonable request. Out of all the proposed solutions, I like the metadata JSON file the best. I'm inclined to mark this as a feature request.

@jcmcken One other thing worth considering is the work @kyleknap's been doing for s3transfer. It's still under active development so I wouldn't recommend it for general use just yet, but the idea is to create a good python API for the functionality that's currently exposed in the AWS CLI.
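
For anyone who wants to experiment with that route, the rough shape (treating the names as approximate, since the s3transfer API is still in flux) is a TransferManager wrapping a boto3 client, where each upload call takes its own extra_args; that per-call argument is where a per-object Metadata map would go. In the sketch below, the metadata_for helper and file list are placeholders.

# Rough sketch against the in-progress s3transfer package mentioned above;
# treat the exact names as approximate. The key point is that each upload
# call takes its own extra_args, which is where per-object Metadata goes.
import boto3
from s3transfer.manager import TransferManager

def metadata_for(path):
    # Placeholder: return whatever metadata belongs to this object.
    return {"key1": "value1"}

client = boto3.client("s3")
manager = TransferManager(client)
files = {"/path/to/local/files/file1": "file1"}  # local path -> S3 key

futures = [
    manager.upload(path, "somebucket", key,
                   extra_args={"Metadata": metadata_for(path)})
    for path, key in files.items()
]
for future in futures:
    future.result()  # block until each transfer finishes
manager.shutdown()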

jamesls removed the closing-soon label on Jul 6, 2016
rmharrison commented

@jamesls Since s3transfer is still very much in active development, do you have a recommendation for syncing with per file metadata?

node-s3-client is the most promising library I've come across, but the project seems to be having problems with the underlying AWS SDK, see andrewrk/node-s3-client#129

ASayre (Contributor) commented Feb 6, 2018

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

ASayre closed this as completed on Feb 6, 2018
jamesls (Member) commented Apr 6, 2018

Based on community feedback, we have decided to return feature requests to GitHub issues.


pgriess commented Sep 30, 2021

I'd like this to exist and am willing to spend some time building it.

What is the best way to proceed here? I can jump right to submitting a PR for the single metadata JSON file, but would it be helpful to discuss design / implementation strategy first? I've never committed to this repo before, so if there are any pointers to related code / suggested supporting infrastructure, I'm all ears.

tim-finnigan (Contributor) commented

Hi @pgriess, thanks for your willingness to contribute. If you want to create a PR, I recommend reading the contributing guide here: https://github.com/aws/aws-cli/blob/master/CONTRIBUTING.md

You can expand on your proposed implementation here or in a PR. I think looking through these s3 sync customizations is a good place to start: https://github.com/aws/aws-cli/tree/develop/awscli/customizations/s3/syncstrategy
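
For orientation, the strategies in that directory subclass a common BaseSync and mainly answer "should this file be synced?"; attaching per-object metadata to the upload itself would most likely need changes beyond the sync strategy. A rough sketch of the shape, with the interface treated as approximate (check base.py in that directory for the real signatures):

# Rough, illustrative sketch of an s3 sync strategy based on the
# awscli.customizations.s3.syncstrategy package linked above. Treat the
# interface as approximate. Note that sync strategies only decide whether
# a file gets copied; wiring per-object metadata into the upload itself
# would require changes in the transfer/request-building code as well.
from awscli.customizations.s3.syncstrategy.base import BaseSync

class MetadataAwareSync(BaseSync):
    def determine_should_sync(self, src_file, dest_file):
        # Illustrative comparison only: re-upload whenever sizes differ.
        return src_file.size != dest_file.size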

tim-finnigan added the p3 (This is a minor priority issue) label on Nov 4, 2022