-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset Wrapper API implementation and tests #223
Conversation
…_dataset() and get_raw_dataset(); changed reading the dataset file by name to by file path
…hat on the ayang branch
…wishes to get them only for test/train datasets. wrote tests for dataset wrapper class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on this PR! The API is nice and clean, and the tests do a good job of covering the functionality.
I'm requesting a couple of updates before merging. Top-line:
- please move
dataset_wrapper.py
to thepresc
dir - all core code should be in there. - make sure to run Black formatting prior to submitting. This should happen automatically if you set up
pre-commit
as described in the README. When you rungit commit
, it will automatically run Black, and if Black made changes to the files, the commit will fail and you will need togit add
those changes.
There are a couple of other minor things I've noted in inline comments.
datasets/dataset_wrapper.py
Outdated
self._dataset = pd.read_csv(dataset_file) | ||
|
||
# set X,y | ||
self.X,self.y = self._dataset.iloc[:,:-1], self._dataset.iloc[:,-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should pass in the label column name as a string to the function, and use that to identify the X and y parts. That way we don't need to make assumptions about column positions. We will want to use this class to handle all datasets interacting with PRESC, so we should stay general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a label
parameter for this function and added a label setter function as well.
datasets/dataset_wrapper.py
Outdated
def get_raw_dataset(self) -> object: | ||
return self._dataset | ||
|
||
def get_label(self, subset:str = None) -> object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add a docstring here and below to document the subset
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a docstring in the get_label()
function.
tests/test_dataset_wrapper.py
Outdated
|
||
from datasets.dataset_wrapper import DatasetWrapper | ||
|
||
wine_data_path = "../datasets/winequality.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoding the path here is not a great idea as it makes assumptions about which working dir the tests are run from.
In fact I noticed this because I tried running the tests from the root dir :).
A way to get around that is to use
from pathlib import Path
Path(__file__)
to get the path of this module (test_dataset_wrapper.py
), and set paths relative to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took your suggestion and now everything is in reference to Path! Thank you for the catch.
tests/test_dataset_wrapper.py
Outdated
# test invalid initailization of DatasetWrapper object | ||
def test_fie_does_not_exist(): | ||
with pytest.raises(FileNotFoundError): | ||
DatasetWrapper("datasets/winequalit.csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and the next test assumes a different working dir than wine_path_data
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I changed this to using the reference path as well.
tests/test_dataset_wrapper.py
Outdated
|
||
def test_fie_format_incorrect(): | ||
with pytest.raises(IOError): | ||
DatasetWrapper("datasets/README.md") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is testing the failure when you pass in a file that is not a CSV?
For me, this is actually not failing, because Pandas is trying to be smart and is parsing the table we have in that README. If you're not getting this behaviour, please make sure you've activated the conda environment before running the tests.
On the other hand, I tried running this with a different file and it's giving a Pandas ParserError
rather than IOError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I passed in a python script just to make this test fail. Not sure if we need to handle files like markdown as errors? I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the setup.py script is fine. Anything that is not CSV should raise the error. What I meant was the markdown should have raised the exception as well, and I think that's what you intended, but it didn't because of a Pandas feature trying to recognize that the markdown in fact contains a table.
Also moved the wrapper class to the presc folder. |
Looks great! Thanks for making the fixes. |
In response to issue #208. Implemented methods that allow access to raw dataset, split train and test by a ratio, getters for train/test dataset and features/label for raw dataset or train/test dataset.
Implemented tests for the implementation. Test coverage is 100%.