Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for drivers #42

Merged
merged 33 commits into from
Aug 9, 2022
Merged

Add tests for drivers #42

merged 33 commits into from
Aug 9, 2022

Conversation

winfried-ripken
Copy link
Contributor

@winfried-ripken winfried-ripken commented Jun 20, 2022

Description

The drivers should be tested fully in isolation. The goal is to mock all data if applicable.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external contributors only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

@@ -12,7 +12,7 @@ myst_parser # For importing md into rst files.
pyspark>=3.2.0
flaky # for retry flaky tests
nbmake
notebook>=6.4.10
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix vulnerability

@@ -58,7 +60,7 @@ def _download_image(url: str) -> np.ndarray:
url: location of the image
"""
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
resp = urllib.request.urlopen(req)
resp = urllib.request.urlopen(req, timeout=1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add timeout here to unblock the loading

@@ -32,14 +33,6 @@ def __init__(

self.compression = compression

@property
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

sample["image"] = load_image(sample["image_url"])
if parse_label:
if parse_label and "label_url" in sample:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if label is really there to avoid fetching None

requirements.txt Outdated Show resolved Hide resolved
@winfried-ripken winfried-ripken mentioned this pull request Jun 27, 2022
12 tasks
@winfried-ripken winfried-ripken marked this pull request as ready for review June 27, 2022 10:27
Copy link
Contributor

@AlpAribal AlpAribal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! I am dropping some intermediate feedback for some of the drivers. Will continue the review soon.

In general, you can check the assertions in tests for all samples rather than the first one.

Currently, we are only testing with local paths. If makes sense to you too, we can define a separate ticket for adding GCP paths via a fixture.

requirements.txt Outdated Show resolved Hide resolved
src/squirrel_datasets_core/datasets/bdd100k/driver.py Outdated Show resolved Hide resolved
src/squirrel_datasets_core/datasets/bdd100k/driver.py Outdated Show resolved Hide resolved
test/test_datasets/test_bdd100k.py Outdated Show resolved Hide resolved
test/test_datasets/test_bdd100k.py Outdated Show resolved Hide resolved
test/test_datasets/mock_utils.py Outdated Show resolved Hide resolved
test/test_datasets/mock_utils.py Outdated Show resolved Hide resolved
test/test_datasets/test_monthly_german_tweets.py Outdated Show resolved Hide resolved
test/test_datasets/test_monthly_german_tweets.py Outdated Show resolved Hide resolved
@winfried-ripken
Copy link
Contributor Author

/gcbrun

Copy link
Contributor

@AlpAribal AlpAribal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job with the mocks, overall they are very easy to understand and cover all driver functionalities, thank you!

I am leaving some more minor comments, now I have checked the whole PR.

test/test_datasets/test_allenai.py Outdated Show resolved Hide resolved
save_path = mock_allenai_data(N, tmp_path)

config = defaultdict(dict)
config["zu"]["train"] = [save_path]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add two other languages here (perhaps with different number of samples and different splits) and then test:

  • switching between languages and checking for the expected number of samples
  • switching between splits and checking for the expected number of samples
  • using lang=None before & after selecting specific languages. I guess with None, we should be getting all samples. I think there might be a bug where None does not work as intended if we select a specific language before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good catch! Indeed there was a bug

test/test_datasets/test_allenai.py Outdated Show resolved Hide resolved
test/test_datasets/test_camvid.py Outdated Show resolved Hide resolved
test/test_datasets/test_cc100.py Outdated Show resolved Hide resolved
test/test_datasets/test_conceptual_captions.py Outdated Show resolved Hide resolved
test/test_datasets/test_conceptual_captions.py Outdated Show resolved Hide resolved
test/test_datasets/test_conceptual_captions.py Outdated Show resolved Hide resolved
test/test_datasets/test_imagenet.py Outdated Show resolved Hide resolved
test/test_datasets/test_imagenet.py Outdated Show resolved Hide resolved
@winfried-ripken
Copy link
Contributor Author

/gcbrun

1 similar comment
@winfried-ripken
Copy link
Contributor Author

/gcbrun

@winfried-ripken
Copy link
Contributor Author

Thanks a lot @AlpAribal for the great feedback! I have integrated your suggestions and fixed a few more bugs

@winfried-ripken
Copy link
Contributor Author

Looks nice! I am dropping some intermediate feedback for some of the drivers. Will continue the review soon.

In general, you can check the assertions in tests for all samples rather than the first one.

Currently, we are only testing with local paths. If makes sense to you too, we can define a separate ticket for adding GCP paths via a fixture.

I don't know if testing with GCP paths is really necessary - as we rely on the fsspec module to handle paths I think we would rather test the functionality of that module if it can handle gcp and local paths equally, which would be out of scope, would you agree with that or am I missing something?

Copy link
Contributor

@AlpAribal AlpAribal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Do you want to make the bump to py39 in this PR or in another?

requirements.txt Outdated
@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with python 3.8
# This file is autogenerated by pip-compile with python 3.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To switch to python3.9, we also need to update the dockerfile base image and re-generate the hashes in this file (i.e. run pip-compile with --no-reuse-hashes once)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint!
Let's do this in another PR then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your very detailed feedback!

test/test_datasets/test_allenai.py Outdated Show resolved Hide resolved
test/test_datasets/test_allenai.py Outdated Show resolved Hide resolved
test/test_datasets/test_conceptual_captions.py Outdated Show resolved Hide resolved
@AlpAribal
Copy link
Contributor

I don't know if testing with GCP paths is really necessary - as we rely on the fsspec module to handle paths I think we would rather test the functionality of that module if it can handle gcp and local paths equally, which would be out of scope, would you agree with that or am I missing something?

You are right, I agree. I was thinking that we might be introducing code that only works for local paths (e.g. using pathlib to append to paths etc.) but this does not seem to be the case.

Copy link
Contributor

@AlpAribal AlpAribal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great job, I really like and appreciate the amount of detail that you put into mocking the drivers' data 🎉

@winfried-ripken winfried-ripken merged commit d71ace2 into main Aug 9, 2022
@winfried-ripken winfried-ripken deleted the winnie-add-tests branch August 9, 2022 13:14
@github-actions github-actions bot locked and limited conversation to collaborators Aug 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants