Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3: support for directories #2619

Merged
merged 19 commits into from Oct 25, 2019
Merged

s3: support for directories #2619

merged 19 commits into from Oct 25, 2019

Conversation

ghost
Copy link

@ghost ghost commented Oct 17, 2019

Fix: #1654

Setup:

#!/usr/bin/env bash

set -x

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

moto_server s3 &> /dev/null &

# File structure:
#       dvc-temp
#       ├── data
#       │  ├── alice
#       │  ├── alpha
#       │  └── subdir
#       │     ├── 1
#       │     ├── 2
#       │     └── 3
#       ├── empty_dir
#       ├── empty_file
#       └── foo

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/cache/")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/empty_dir/")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/empty_file", Body=b"")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/foo", Body=b"foo")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/data/alice", Body=b"alice")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/data/alpha", Body=b"alpha")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/data/subdir/1", Body=b"1")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/data/subdir/2", Body=b"2")
s3.put_object(Bucket="dvc-temp", Key="fix-1654/data/subdir/3", Body=b"3")
'

repository=$(mktemp -d)
cd $repository
dvc init --no-scm
dvc remote add s3 s3://dvc-temp/fix-1654
dvc remote modify s3 endpointurl http://localhost:5000

dvc remote add cache remote://s3/cache
dvc config cache.s3 cache

dvc run -d remote://s3/data 'echo something'

Notes:

  • head_object doesn't work with directories, only empty ones
  • list_objects doesn't work with empty directories
  • aws s3 cp doesn't work with directories, you need to use --recursive
  • dvc run -d remote:// doesn't look up if cache is setup (as -o), but directories uses cache

TODO:

  • Change walk implementation to walk_files
  • Write unit tests for new implemented S3 methods
  • Address review comments

dvc/remote/s3.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there is no tests here.

dvc/remote/s3.py Outdated Show resolved Hide resolved
dvc/remote/s3.py Outdated Show resolved Hide resolved
dvc/remote/s3.py Outdated Show resolved Hide resolved
@ghost ghost changed the title [WIP] s3: support for directories s3: support for directories Oct 23, 2019
dvc/remote/local.py Outdated Show resolved Hide resolved
dvc/remote/local.py Outdated Show resolved Hide resolved
dvc/remote/s3.py Outdated Show resolved Hide resolved
dvc/remote/ssh/__init__.py Outdated Show resolved Hide resolved
tests/unit/remote/test_s3.py Outdated Show resolved Hide resolved


def test_isdir(s3, remote):
assert remote.isdir(remote.path_info / "data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could totally use parametrize here instead of copypasting. 🙂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Forgot about parametrize 👌

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, parametrize is 1 and a half second slower 😓

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrOutis this is because all the fixtures are rebuilt every time. Use for loop inside single test instead. Moreover, I would reuse the same list for fixture and tests as far as that is possible.

tests/unit/remote/test_s3.py Outdated Show resolved Hide resolved
tests/unit/remote/test_s3.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Oct 24, 2019

@MrOutis Tests are still failing on py2 due to unicode error. Just a heads up 🙂

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@efiop efiop merged commit 27b63e2 into iterative:master Oct 25, 2019
@ghost ghost deleted the fix-1654 branch October 25, 2019 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants