Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive allows the same asset path to be both a directory and a file #1109

Closed
jwodder opened this issue May 31, 2022 · 12 comments · Fixed by #1368
Closed

Archive allows the same asset path to be both a directory and a file #1109

jwodder opened this issue May 31, 2022 · 12 comments · Fixed by #1368
Assignees
Labels
bug Something isn't working released This issue/pull request has been released.

Comments

@jwodder
Copy link
Member

jwodder commented May 31, 2022

When uploading assets to a Dandiset (as well as when changing an asset's path), the archive does not prohibit having one asset with path foo and another with path foo/bar, regardless of what order the assets are created in. This is a problem, as it means that "foo" is both a directory and a file.

MVCE for the uploading case:

import os
from pathlib import Path
from shutil import rmtree
from tempfile import TemporaryDirectory

from dandi.consts import dandiset_metadata_file
from dandi.dandiapi import DandiAPIClient
from dandi.files import dandi_file

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    d = client.create_dandiset(
        "Dandiset with Mismatched File Types",
        {
            "schemaKey": "Dandiset",
            "name": "Test Dandiset",
            "description": "A test Dandiset",
            "contributor": [
                {
                    "schemaKey": "Person",
                    "name": "Wodder, John",
                    "roleName": ["dcite:Author", "dcite:ContactPerson"],
                }
            ],
            "license": ["spdx:CC0-1.0"],
        },
    )
    dandiset_id = d.identifier
    print("DANDISET ID:", dandiset_id)
    with TemporaryDirectory() as tmpdir:
        dspath = Path(tmpdir)
        (dspath / dandiset_metadata_file).write_text(f"identifier: '{dandiset_id}'\n")
        (dspath / "subdir").mkdir()

        p = dspath / "subdir" / "file.txt"
        p.write_text("This is test text.\n")
        df = dandi_file(p, dandiset_path=dspath)
        print("Uploading subdir/file.txt ...")
        df.upload(d, {"description": "A file"})

        p.unlink()
        p = dspath / "subdir" / "file.txt" / "subfile.dat"
        p.parent.mkdir()
        p.write_text("This is inner text.\n")
        df = dandi_file(p, dandiset_path=dspath)
        print("Uploading subdir/file.txt/subfile.dat ...")
        df.upload(d, {"description": "A file"})

        rmtree(dspath / "subdir")
        p = dspath / "subdir"
        p.write_text("This is outer text.\n")
        df = dandi_file(p, dandiset_path=dspath)
        print("Uploading subdir ...")
        df.upload(d, {"description": "A file"})

        print("Assets in Dandiset:")
        for asset in d.get_assets():
            print(f" - {asset.path}")

    print("Deleting Dandiset ...")
    d.delete()
@yarikoptic
Copy link
Member

yikes, nice find! Do you know if we already have such in the main deployment of the archive @jwodder ?

I guess we are doomed to add a check/error to prevent such cases asap.

@jwodder
Copy link
Member Author

jwodder commented May 31, 2022

@yarikoptic If there were any of these conflicts in the production archive, I believe it would've caused the backup script to error by now.

@yarikoptic
Copy link
Member

good point. Although backup script is not "on cron" ATM, since we are still polishing the rough angles with the addition of all the zarrs, so conflicts might creep in, although unlikely given that people typically use dandi-cli to upload, so unlikely (albeit possible) to cause such conflicts while having real file system files layout.

@jjnesbitt
Copy link
Member

This is because of how S3 stores files (paths only). I'm not sure the best way to deal with this, as it might cause a performance hit if we wanted to try to completely prevent this. Maybe we could roll this into the validation of a dandiset / version?

How did you come across this issue? If people are uploading from their file system, this shouldn't occur.

@jwodder
Copy link
Member Author

jwodder commented Jun 6, 2022

@AlmightyYakob I came across this while trying to write a unit test for an asset rename method; I was expecting the API to error on renaming an asset path to a folder or a subpath of an asset.

@jjnesbitt
Copy link
Member

@AlmightyYakob I came across this while trying to write a unit test for an asset rename method; I was expecting the API to error on renaming an asset path to a folder or a subpath of an asset.

Ahh okay, that might be more realistic to prevent directly,

@waxlamp
Copy link
Member

waxlamp commented Jul 6, 2022

@AlmightyYakob, what would be the approach here--check for the proposed asset name being a prefix of any asset in the system? If there's an efficient DB method for that type of search, then perhaps that fix would be effective without damaging performance too much.

@jjnesbitt
Copy link
Member

@AlmightyYakob, what would be the approach here--check for the proposed asset name being a prefix of any asset in the system? If there's an efficient DB method for that type of search, then perhaps that fix would be effective without damaging performance too much.

This check would apply on asset creation or update. I'm not sure exactly what it'd look like, but yes it'd be some kind of check against the path of the proposed asset, and any asset in the specific draft version being uploaded to (so we wouldn't need to check against every asset).

@waxlamp
Copy link
Member

waxlamp commented Jul 13, 2022

We'd need to run a once-over to find any existing problems too, right?

@waxlamp waxlamp added the bug Something isn't working label Jul 13, 2022
@jwodder
Copy link
Member Author

jwodder commented Nov 10, 2022

@AlmightyYakob The tests that I mentioned above have started passing; has this issue been resolved?

@jjnesbitt
Copy link
Member

@AlmightyYakob The tests that I mentioned above have started passing; has this issue been resolved?

Technically yes, since there's now a database constraint that directly prevents this from occuring. However, it's not handled nicely (would return a 500 internal error). I'm going to make a PR soon to handle this correctly.

@dandibot
Copy link
Member

🚀 Issue was released in v0.3.2 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working released This issue/pull request has been released.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants