Dataset Wrapper API implementation and tests #223

Merged
merged 13 commits into mozilla:master from amanda-yang/dataset-wrapper
Oct 15, 2020

Conversation

AYYYang
Collaborator

@AYYYang AYYYang commented Oct 10, 2020

In response to issue #208. Implemented methods that give access to the raw dataset, split it into train and test sets by a ratio, and return the train/test datasets as well as the features/labels of either the raw dataset or the train/test subsets.

Added tests covering these methods. Test coverage is 100%.
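
For readers without the diff at hand, here is a rough sketch of what a ratio-based train/test split can look like with pandas. It is not the code from this PR: the function name `split_by_ratio`, the `test_ratio` and `seed` parameters, and the random sampling strategy are all assumptions made for illustration, and it assumes the DataFrame index is unique.

```python
import pandas as pd


def split_by_ratio(df: pd.DataFrame, test_ratio: float = 0.3, seed: int = 0):
    """Return (train, test), with roughly test_ratio of the rows in test.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    # Sample the test rows at random, then keep the remaining rows for training.
    test = df.sample(frac=test_ratio, random_state=seed)
    train = df.drop(index=test.index)
    return train, test
```

Fixing the seed keeps the split reproducible between test runs, which makes assertions on the subset sizes stable.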

Contributor

@dzeber dzeber left a comment

Great work on this PR! The API is nice and clean, and the tests do a good job of covering the functionality.

I'm requesting a couple of updates before merging. Top-line:

  • please move dataset_wrapper.py to the presc dir - all core code should be in there.
  • make sure to run Black formatting prior to submitting. This should happen automatically if you set up pre-commit as described in the README. When you run git commit, it will automatically run Black, and if Black made changes to the files, the commit will fail and you will need to git add those changes.

There are a couple of other minor things I've noted in inline comments.

self._dataset = pd.read_csv(dataset_file)

# set X,y
self.X,self.y = self._dataset.iloc[:,:-1], self._dataset.iloc[:,-1]
Contributor

I think we should pass in the label column name as a string to the function, and use that to identify the X and y parts. That way we don't need to make assumptions about column positions. We will want to use this class to handle all datasets interacting with PRESC, so we should stay general.
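
To make the suggestion concrete, a minimal sketch of selecting X and y by label name rather than by position might look like the following. The helper name `split_features_label` is hypothetical and is not part of the PR; it only assumes the dataset is a pandas DataFrame.

```python
import pandas as pd


def split_features_label(df: pd.DataFrame, label: str):
    """Return (X, y): y is the column named `label`, X is every other column.

    Hypothetical helper for illustration only.
    """
    if label not in df.columns:
        raise KeyError(f"label column {label!r} not found in dataset")
    # Selecting by name works wherever the label column sits in the file.
    X = df.drop(columns=[label])
    y = df[label]
    return X, y
```

With this approach a dataset whose label is the first column, or somewhere in the middle, is handled the same way as one whose label is last.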

Collaborator Author

Added a label parameter for this function and added a label setter function as well.

def get_raw_dataset(self) -> object:
    return self._dataset

def get_label(self, subset:str = None) -> object:
Contributor

Would be good to add a docstring here and below to document the subset parameter.
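
As an illustration of the kind of docstring meant here, something along these lines would do. The class below is a hypothetical stand-in, not the PR's `DatasetWrapper`, and the accepted `subset` values ("train", "test", or None for the raw dataset) are an assumption based on the PR description.

```python
import pandas as pd


class LabelAccessSketch:
    """Hypothetical stand-in holding only what get_label needs."""

    def __init__(self, dataset, train, test, label):
        # Map each accepted subset value to the DataFrame it refers to.
        self._subsets = {None: dataset, "train": train, "test": test}
        self._label = label

    def get_label(self, subset=None):
        """Return the label column of the dataset.

        Parameters
        ----------
        subset : str, optional
            Which part of the data to read from: "train", "test", or
            None (the default) for the full raw dataset.

        Returns
        -------
        pandas.Series
            The label values for the requested subset.
        """
        if subset not in self._subsets:
            raise ValueError(f"unknown subset: {subset!r}")
        return self._subsets[subset][self._label]
```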

Collaborator Author

Added a docstring in the get_label() function.


from datasets.dataset_wrapper import DatasetWrapper

wine_data_path = "../datasets/winequality.csv"
Contributor

Hardcoding the path here is not a great idea as it makes assumptions about which working dir the tests are run from.

In fact I noticed this because I tried running the tests from the root dir :).

A way to get around that is to use

from pathlib import Path
Path(__file__)

to get the path of this module (test_dataset_wrapper.py), and set paths relative to that.
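
Spelled out a bit more, the idea could be sketched like this. The helper name `dataset_path` is made up, and the layout (a `datasets` directory sitting next to the directory that holds the test module) is an assumption inferred from the `"../datasets/winequality.csv"` path above.

```python
from pathlib import Path


def dataset_path(test_module_file, filename="winequality.csv"):
    """Build the path to a dataset file relative to the test module's location."""
    # Directory containing the test module, independent of the current working dir.
    tests_dir = Path(test_module_file).resolve().parent
    # Assumed layout: the datasets directory is a sibling of the tests directory.
    return tests_dir.parent / "datasets" / filename


# In test_dataset_wrapper.py this would be called as:
#     wine_data_path = dataset_path(__file__)
```

Because the path is anchored to the module file, the tests find the data whether they are run from the repo root or from the tests directory.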

Collaborator Author

Took your suggestion and now everything is in reference to Path! Thank you for the catch.

# test invalid initialization of DatasetWrapper object
def test_fie_does_not_exist():
    with pytest.raises(FileNotFoundError):
        DatasetWrapper("datasets/winequalit.csv")
Contributor

I think this and the next test assume a different working dir than wine_data_path above?

Collaborator Author

Right. I changed this to using the reference path as well.


def test_fie_format_incorrect():
    with pytest.raises(IOError):
        DatasetWrapper("datasets/README.md")
Contributor

I think this is testing the failure when you pass in a file that is not a CSV?

For me, this is actually not failing, because Pandas is trying to be smart and is parsing the table we have in that README. If you're not getting this behaviour, please make sure you've activated the conda environment before running the tests.

On the other hand, I tried running this with a different file and it's giving a Pandas ParserError rather than IOError.

Collaborator Author

I passed in a Python script just to make this test fail. Not sure if we need to handle files like Markdown as errors?

Contributor

Using the setup.py script is fine. Anything that is not CSV should raise the error. What I meant was the markdown should have raised the exception as well, and I think that's what you intended, but it didn't because of a Pandas feature trying to recognize that the markdown in fact contains a table.
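
One possible way to reconcile the ParserError seen earlier in this thread with a test that expects IOError is to translate the exception at load time. This is only a sketch of that option, not what the PR does: the function name `read_csv_or_ioerror` is hypothetical, and it does nothing about files that pandas manages to parse anyway (such as the README table) or about other failures like decoding errors.

```python
import io

import pandas as pd


def read_csv_or_ioerror(source):
    """Read a CSV, reporting pandas parse failures as IOError."""
    try:
        return pd.read_csv(source)
    except pd.errors.ParserError as err:
        # Assumption: callers and tests prefer a single exception type.
        raise IOError(f"could not parse input as CSV: {err}") from err


# A ragged row (four fields under a two-column header) makes pandas
# raise ParserError, which the wrapper above re-raises as IOError.
malformed = io.StringIO("a,b\n1,2\n3,4,5,6\n")
try:
    read_csv_or_ioerror(malformed)
except IOError as err:
    print("rejected:", err)
```

Since FileNotFoundError is already a subclass of OSError (of which IOError is an alias), a missing file would still surface as FileNotFoundError and satisfy both tests.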

@AYYYang
Collaborator Author

AYYYang commented Oct 13, 2020

Also moved the wrapper class to the presc folder. pre-commit is now installed and running before I commit. Thank you for the comments David!

@dzeber
Contributor

dzeber commented Oct 15, 2020

Looks great! Thanks for making the fixes.

@dzeber dzeber merged commit 111b920 into mozilla:master Oct 15, 2020
@AYYYang AYYYang deleted the amanda-yang/dataset-wrapper branch October 15, 2020 09:43