-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tests for drivers #42
Conversation
requirements.dev.in
Outdated
@@ -12,7 +12,7 @@ myst_parser # For importing md into rst files. | |||
pyspark>=3.2.0 | |||
flaky # for retry flaky tests | |||
nbmake | |||
notebook>=6.4.10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix vulnerability
@@ -58,7 +60,7 @@ def _download_image(url: str) -> np.ndarray: | |||
url: location of the image | |||
""" | |||
req = Request(url, headers={"User-Agent": "Mozilla/5.0"}) | |||
resp = urllib.request.urlopen(req) | |||
resp = urllib.request.urlopen(req, timeout=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add timeout here to unblock the loading
@@ -32,14 +33,6 @@ def __init__( | |||
|
|||
self.compression = compression | |||
|
|||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
sample["image"] = load_image(sample["image_url"]) | ||
if parse_label: | ||
if parse_label and "label_url" in sample: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if label is really there to avoid fetching None
73716fa
to
932cfc9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice! I am dropping some intermediate feedback for some of the drivers. Will continue the review soon.
In general, you can check the assertions in tests for all samples rather than the first one.
Currently, we are only testing with local paths. If makes sense to you too, we can define a separate ticket for adding GCP paths via a fixture.
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good job with the mocks, overall they are very easy to understand and cover all driver functionalities, thank you!
I am leaving some more minor comments, now I have checked the whole PR.
test/test_datasets/test_allenai.py
Outdated
save_path = mock_allenai_data(N, tmp_path) | ||
|
||
config = defaultdict(dict) | ||
config["zu"]["train"] = [save_path] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add two other languages here (perhaps with different number of samples and different splits) and then test:
- switching between languages and checking for the expected number of samples
- switching between splits and checking for the expected number of samples
- using
lang=None
before & after selecting specific languages. I guess with None, we should be getting all samples. I think there might be a bug where None does not work as intended if we select a specific language before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good catch! Indeed there was a bug
fc8fc5a
to
be503cf
Compare
/gcbrun |
1 similar comment
/gcbrun |
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
Thanks a lot @AlpAribal for the great feedback! I have integrated your suggestions and fixed a few more bugs |
I don't know if testing with GCP paths is really necessary - as we rely on the fsspec module to handle paths I think we would rather test the functionality of that module if it can handle gcp and local paths equally, which would be out of scope, would you agree with that or am I missing something? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Do you want to make the bump to py39 in this PR or in another?
requirements.txt
Outdated
@@ -1,5 +1,5 @@ | |||
# | |||
# This file is autogenerated by pip-compile with python 3.8 | |||
# This file is autogenerated by pip-compile with python 3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To switch to python3.9, we also need to update the dockerfile base image and re-generate the hashes in this file (i.e. run pip-compile with --no-reuse-hashes
once)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hint!
Let's do this in another PR then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your very detailed feedback!
src/squirrel_datasets_core/datasets/allenai_c4/allenai_c4_multilingual.py
Outdated
Show resolved
Hide resolved
You are right, I agree. I was thinking that we might be introducing code that only works for local paths (e.g. using pathlib to append to paths etc.) but this does not seem to be the case. |
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
Co-authored-by: Alp Arıbal <[email protected]>
…ilingual.py Co-authored-by: Alp Arıbal <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great job, I really like and appreciate the amount of detail that you put into mocking the drivers' data 🎉
Description
The drivers should be tested fully in isolation. The goal is to mock all data if applicable.
Type of change
Checklist: