Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move default location of data to the user home folder #652

Merged
merged 40 commits into from
Jul 21, 2023

Conversation

flefebv
Copy link
Contributor

@flefebv flefebv commented Jul 18, 2023

Closes #641

  • Add a get_data_home function that returns the root data directory path, and create it if it does not already exist. By default, it is set to ~/skrub_data.

  • Modify the get_data_dir function that returns the data directory where a given loader stores data. The output directory is a subdirectory of the root data directory.

  • Implement tests for these two functions.

Copy link
Contributor

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're still in the early stages of the project, I think it's a good thing to use the more modern tools like pathlib instead of os.path.

skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
@mock.patch("os.path.dirname")
def test_get_data_dir(mock_os_path_dirname):

def test_get_data_dir():
Copy link
Member

@glemaitre glemaitre Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_get_data_dir():
@pytest.mark.parametrize("data_home_type", ["string", "path"])
@pytest.mark.parametrize("subdirectory_name", [None, "tests"])
def test_get_data_dir_existing_folder(data_home_type, subdirectory_name):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can parametrize with subdirectory_name and it will avoid repeating code below

Comment on lines 12 to 13
with tempfile.TemporaryDirectory() as tmpdirname:
tmpdirpath = Path(tmpdirname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with tempfile.TemporaryDirectory() as tmpdirname:
tmpdirpath = Path(tmpdirname)
with tempfile.TemporaryDirectory() as data_home:
directory = data_home if data_home_type == "string" else Path(data_home)

from skrub.datasets._utils import get_data_dir
with tempfile.TemporaryDirectory() as tmpdirname:
tmpdirpath = Path(tmpdirname)
assert get_data_dir(data_home=tmpdirpath) == tmpdirpath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you can adapt the test with the variable names

skrub/datasets/_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!
Here's a review for the implementation, I still have to review the tests :)

Edit: sorry for the interlacing reviews, might be annoying to filter 😅

skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few more comments!
I'll commit the type hints right away, it's a minor issue.

benchmarks/bench_fuzzy_join_sparse_vs_dense.py Outdated Show resolved Hide resolved
benchmarks/utils/join.py Outdated Show resolved Hide resolved
benchmarks/utils/join.py Outdated Show resolved Hide resolved
benchmarks/utils/join.py Outdated Show resolved Hide resolved
benchmarks/bench_fuzzy_join_sparse_vs_dense.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/tests/test_fetching.py Outdated Show resolved Hide resolved
skrub/datasets/tests/test_utils.py Outdated Show resolved Hide resolved
skrub/datasets/tests/test_utils.py Outdated Show resolved Hide resolved
@GaelVaroquaux
Copy link
Member

The suggestions of @glemaitre are relevent and they are a bit too big for me to do right now.

Also, please add the new function to the API reference (file: doc/api.rst)

skrub/datasets/_fetching.py Outdated Show resolved Hide resolved
skrub/datasets/_fetching.py Outdated Show resolved Hide resolved
skrub/datasets/_fetching.py Outdated Show resolved Hide resolved
skrub/datasets/_fetching.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
@GaelVaroquaux
Copy link
Member

I'm trying to finish this off. I've merged with main (resolving conflicts). Will implement the changes

@GaelVaroquaux
Copy link
Member

OK, I have refactored the tests. Unless someone sees something problematic, I will merge soon (hopefully the CI will be green)

@GaelVaroquaux
Copy link
Member

This was a real bug! Thanks to @glemaitre

@GaelVaroquaux GaelVaroquaux merged commit dd35027 into skrub-data:main Jul 21, 2023
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change the directory where fetcher are caching downloaded file
5 participants