Dataset Wrapper API implementation and tests #223

AYYYang · 2020-10-10T11:33:16Z

In response to issue #208. Implemented methods to access the raw dataset and to split it into train and test sets by a ratio, along with getters for the train/test datasets and for the features and label of either the raw dataset or the train/test datasets.

Added tests for the wrapper class. Test coverage is 100%.
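To illustrate the ratio-based split described above, here is a minimal sketch using only pandas. The function name `split_by_ratio`, its parameters, and the sampling approach are assumptions made for illustration; the PR does not show how the wrapper performs the split.

```python
import pandas as pd


def split_by_ratio(df: pd.DataFrame, test_ratio: float, seed: int = 0):
    """Split df into (train, test), with about test_ratio of the rows in test.

    Hypothetical helper, not the wrapper's actual method.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    # Sample the test rows at random; a fixed seed keeps the split reproducible.
    test = df.sample(frac=test_ratio, random_state=seed)
    # Every row that was not sampled goes to train (assumes a unique index).
    train = df.drop(test.index)
    return train, test


demo = pd.DataFrame({"x": range(10), "label": [0, 1] * 5})
train, test = split_by_ratio(demo, test_ratio=0.3)
```

The two parts are disjoint and together cover every row of the input.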


dzeber

Great work on this PR! The API is nice and clean, and the tests do a good job of covering the functionality.

I'm requesting a couple of updates before merging. Top-line:

- Please move dataset_wrapper.py to the presc dir; all core code should be in there.
- Make sure to run Black formatting prior to submitting. This should happen automatically if you set up pre-commit as described in the README: when you run git commit, it runs Black, and if Black changed any files, the commit fails and you need to git add those changes and commit again.

There are a couple of other minor things I've noted in inline comments.

dzeber · 2020-10-12T22:06:50Z

datasets/dataset_wrapper.py

+            self._dataset = pd.read_csv(dataset_file)
+
+            # set X,y
+            self.X,self.y = self._dataset.iloc[:,:-1], self._dataset.iloc[:,-1]


I think we should pass in the label column name as a string to the function, and use that to identify the X and y parts. That way we don't need to make assumptions about column positions. We will want to use this class to handle all datasets interacting with PRESC, so we should stay general.

Added a label parameter for this function and added a label setter function as well.
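A minimal sketch of the label-by-name approach suggested above, assuming a pandas DataFrame as input. The helper name `split_features_label` and the `label_col` parameter are hypothetical and are not taken from the PR.

```python
import pandas as pd


def split_features_label(df: pd.DataFrame, label_col: str):
    """Return (X, y), where y is the named label column and X is the rest.

    Hypothetical helper; it makes no assumption about column positions.
    """
    if label_col not in df.columns:
        raise KeyError(f"label column {label_col!r} not found in dataset")
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return X, y


# The label sits in the middle here, so a positional iloc split would get it wrong.
frame = pd.DataFrame({"a": [1, 2], "quality": [5, 6], "b": [3, 4]})
X, y = split_features_label(frame, "quality")
```

Looking the column up by name keeps the split correct regardless of column order.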

dzeber · 2020-10-12T22:09:04Z

datasets/dataset_wrapper.py

+    def get_raw_dataset(self) -> object:
+        return self._dataset
+
+    def get_label(self, subset:str = None) -> object:


Would be good to add a docstring here and below to document the subset parameter.

Added a docstring in the get_label() function.
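For reference, a docstring along these lines could look like the sketch below. The class `LabelGetterSketch` is a hypothetical stand-in for the wrapper, and the accepted values `"train"` and `"test"` are assumptions; the PR only says that `subset` selects the train or test dataset.

```python
from typing import Optional

import pandas as pd


class LabelGetterSketch:
    """Hypothetical stand-in for the wrapper, holding raw/train/test frames."""

    def __init__(self, raw: pd.DataFrame, train: pd.DataFrame,
                 test: pd.DataFrame, label: str):
        self._frames = {None: raw, "train": train, "test": test}
        self._label = label

    def get_label(self, subset: Optional[str] = None) -> pd.Series:
        """Return the label column.

        Parameters
        ----------
        subset : str, optional
            Which part of the data to read the label from: ``"train"`` or
            ``"test"``. If ``None`` (the default), the raw dataset is used.

        Raises
        ------
        ValueError
            If ``subset`` is not ``None``, ``"train"`` or ``"test"``.
        """
        if subset not in self._frames:
            raise ValueError(f"unknown subset: {subset!r}")
        return self._frames[subset][self._label]


raw = pd.DataFrame({"x": [1, 2, 3], "quality": [5, 6, 7]})
sketch = LabelGetterSketch(raw, raw.iloc[:2], raw.iloc[2:], label="quality")
```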

dzeber · 2020-10-12T22:15:49Z

tests/test_dataset_wrapper.py

+
+from datasets.dataset_wrapper import DatasetWrapper
+
+wine_data_path = "../datasets/winequality.csv"


Hardcoding the path here is not a great idea as it makes assumptions about which working dir the tests are run from.

In fact I noticed this because I tried running the tests from the root dir :).

A way to get around that is to use

    from pathlib import Path
    Path(__file__)

to get the path of this module (test_dataset_wrapper.py), and set paths relative to that.

Took your suggestion and now everything is in reference to Path! Thank you for the catch.
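A minimal sketch of the `Path(__file__)` approach, assuming the `tests/` and `datasets/` directories sit side by side under the repo root as the paths in this PR suggest. The helper name `path_relative_to_module` is hypothetical.

```python
from pathlib import Path


def path_relative_to_module(module_file: str, *parts: str) -> Path:
    """Build a path relative to the directory that contains module_file."""
    # resolve() makes the result independent of the current working directory
    # and collapses any ".." components.
    module_dir = Path(module_file).resolve().parent
    return module_dir.joinpath(*parts).resolve()


# In a test module this would be called with __file__, for example:
#   wine_data_path = path_relative_to_module(__file__, "..", "datasets", "winequality.csv")
example = path_relative_to_module(
    "repo/tests/test_dataset_wrapper.py", "..", "datasets", "winequality.csv"
)
```

Because the path is anchored at the test module rather than the working directory, the tests behave the same whether they are run from the root dir or from `tests/`.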

dzeber · 2020-10-12T22:17:12Z

tests/test_dataset_wrapper.py

+# test invalid initailization of DatasetWrapper object
+def test_fie_does_not_exist():
+    with pytest.raises(FileNotFoundError):
+        DatasetWrapper("datasets/winequalit.csv")


I think this and the next test assume a different working dir than wine_data_path above?

Right. I changed this to using the reference path as well.

dzeber · 2020-10-12T22:33:40Z

tests/test_dataset_wrapper.py

+
+def test_fie_format_incorrect():
+    with pytest.raises(IOError):
+        DatasetWrapper("datasets/README.md")


I think this is testing the failure when you pass in a file that is not a CSV?

For me, this is actually not failing, because Pandas is trying to be smart and is parsing the table we have in that README. If you're not getting this behaviour, please make sure you've activated the conda environment before running the tests.

On the other hand, I tried running this with a different file and it's giving a Pandas ParserError rather than IOError.

I passed in a python script just to make this test fail. Not sure if we need to handle files like markdown as errors?

Using the setup.py script is fine. Anything that is not CSV should raise the error. What I meant was the markdown should have raised the exception as well, and I think that's what you intended, but it didn't because of a Pandas feature trying to recognize that the markdown in fact contains a table.
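The exception-type point above can be checked with a small sketch. It assumes the default pandas C parser and uses made-up, script-like text in place of a real file; the helper name `csv_error_type` is hypothetical.

```python
import io

import pandas as pd
from pandas.errors import ParserError


def csv_error_type(text: str):
    """Return the exception class raised when parsing text as CSV, or None."""
    try:
        pd.read_csv(io.StringIO(text))
    except Exception as exc:
        return type(exc)
    return None


# Script-like text: the third line has more comma-separated fields than the
# lines before it, which the tokenizer rejects.
not_csv = "import os\nimport sys\nprint(os.sep, sys.argv, sys.path)\n"
valid_csv = "a,b\n1,2\n3,4\n"

bad_type = csv_error_type(not_csv)
good_type = csv_error_type(valid_csv)
```

`ParserError` derives from `ValueError`, not from `IOError`/`OSError`, so a test that expects `IOError` would not catch it unless the wrapper translates the exception itself.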

AYYYang · 2020-10-13T10:28:29Z

Also moved the wrapper class to the presc folder. pre-commit is now installed and running before I commit. Thank you for the comments David!

dzeber · 2020-10-15T00:46:35Z

Looks great! Thanks for making the fixes.

AYYYang added 10 commits September 26, 2020 02:20

added new dataset and ML workflow

b6a1bf3

added kick_starter.csv for datasets, and removed the folder containing the notebook

2f9aad7

adding some files back

aec2bc5

started dataset API

b82446b

Frist try on implementing the dataset wrapper API

cb8455f

type casting

3fd8a49

Remove workspac file

3d6d9a3

removed the transformer attribute and functions, imlemented get_train_dataset() and get_raw_dataset(); changed reading the dataset file by name to by file path

52c7517

cleaned up this branch to remove ML workflow realted stuff, will do that on the ayang branch

9d12988

added subset parameter for feature and label getters for if the user wishes to get them only for test/train datasets. wrote tests for dataset wrapper class

209a602

dzeber suggested changes Oct 12, 2020


AYYYang added 2 commits October 13, 2020 16:53

moved dataset_wrapper to presc/ folder; testing pre-commit

6933d55

addressed the change requests by David

7c77830

added try except for set_labei() and corresponding test case

d159aa0

dzeber approved these changes Oct 15, 2020


dzeber merged commit 111b920 into mozilla:master Oct 15, 2020

AYYYang deleted the amanda-yang/dataset-wrapper branch October 15, 2020 09:43
