Data Preparer and PreparerSteps base class #98

cchoquette · 2019-07-17T16:51:41Z

Adding DataPreparer with base classes for its steps.

… DataPreparer; 3. DataCleaner functionality

adithyabsk

Here is a partial code review.

adithyabsk · 2019-07-19T18:42:38Z

foreshadow/cleaners/internals/__init__.py

@@ -0,0 +1,37 @@
+"""Internal cleaners for handling the cleaning and shaping of data."""


I would keep all concrete transformers in concrete, so there is one place to look for them.

foreshadow/tests/test_core/test_base.py

foreshadow/metrics/internals.py

foreshadow/core/base.py

adithyabsk · 2019-07-22T13:57:19Z

foreshadow/cleaners/internals/__init__.py

@@ -27,7 +27,7 @@ def _get_classes():
        c[1]
        for m in modules
        for c in inspect.getmembers(m)
-        if inspect.isclass(c[1])
+        if inspect.isclass(c[1]) and c[1].__name__.find("Base") == -1


Just a note to change this once we merge the serializer mixin changes in, or to move this function to utilities.
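If the function does move to utilities, one possible shape is sketched below. This is only an illustration: the name get_concrete_classes and the exclude_substring parameter are hypothetical and not part of foreshadow. It keeps the same name-based filter as the diff, but uses a membership test instead of find(...) == -1.

```python
import inspect


def get_concrete_classes(modules, exclude_substring="Base"):
    """Collect classes from modules, skipping names that contain a marker.

    Hypothetical helper: mirrors the name-based filter in the diff above.
    """
    return [
        obj
        for module in modules
        # getmembers with a predicate yields (name, object) pairs for classes only
        for _, obj in inspect.getmembers(module, inspect.isclass)
        if exclude_substring not in obj.__name__
    ]
```

A name-based filter is fragile (any class with "Base" in its name is dropped), which is presumably why this is flagged for a later change.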

adithyabsk

Added some comments.

adithyabsk · 2019-07-24T16:12:13Z

foreshadow/cleaners/internals/drop.py

@@ -0,0 +1,39 @@
+# import re


Is this supposed to be committed?

adithyabsk · 2019-07-24T16:13:25Z

foreshadow/exceptions.py

+class InvalidDataFrame(Exception):
+    """Raised when a tranformer outputs an invalid DataFrame.
+
+    An example of when this might occur is a DataFrame different list lengths.


DataFrame of different list lengths

adithyabsk · 2019-07-24T16:15:20Z

foreshadow/tests/test_transformers/test_transformers.py

@@ -413,6 +413,8 @@ def test_smarttransformer_function_override(smart_child):
    std.fit(df[["crim"]])
    std_data = std.transform(df[["crim"]])

+    assert std_data.columns[0] == 'crim_impute_0'
+
    # TODO, remove when SmartTransformer is no longer wrapped


As of right now, everything needs to be wrapped in order to be serializable; otherwise, transformers need to explicitly inherit from ConcreteSerializable.

adithyabsk · 2019-07-24T16:15:44Z

foreshadow/transformers/core/parallelprocessor.py

@@ -213,7 +213,7 @@ def fit(self, X, y=None, **fit_params):
                trans,
                _slice_cols(X, cols),
                y,
-                **{**fit_params, **_inject_df(trans, X)}
+                **fit_params,


I assume this didn't break any tests?

Can we add this as an issue so we can track it, with this line linked in a comment?

adithyabsk · 2019-07-24T16:26:03Z

foreshadow/transformers/core/pipeline.py

+
+
+class TransformersPipeline(Pipeline):
+    def _fit(self, X, y=None, **fit_params):


Can you point out the differences with the Pipeline parent _fit in comments?

I would look into this to see if this approach might work: https://gist.github.com/adam-p/63aaae093a71a844150e

Look into turning on linting for internal methods (flake8-docstrings)

adithyabsk · 2019-07-24T20:19:54Z

foreshadow/cleaners/data_cleaner.py


        Args:
-            **kwargs: placeholder.
+            row_of_feature: one row of one column


Add return_tuple

adithyabsk · 2019-07-24T20:20:35Z

foreshadow/cleaners/internals/datetimes.py

+
+    """
+    delimiters = "[-/]"
+    regex = "^.*(([\d]{4})%s([\d]{2})%s([\d]{2})).*$" % (


I would use Python format strings.
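A minimal sketch of what that could look like with str.format. The diff line is truncated, so this assumes both %s placeholders are filled with delimiters. Note that the regex quantifier braces have to be doubled so that str.format leaves them alone.

```python
import re

delimiters = "[-/]"

# {{4}} and {{2}} are escaped braces: str.format turns them into {4} and {2}.
# {sep} is the only real replacement field (assumed to be `delimiters` twice).
regex = r"^.*(([\d]{{4}}){sep}([\d]{{2}}){sep}([\d]{{2}})).*$".format(
    sep=delimiters
)

match = re.search(regex, "shipped on 2019-07-24")
year, month, day = match.group(2), match.group(3), match.group(4)
```

The named field also makes it obvious that the same delimiter class is used in both positions.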

adithyabsk · 2019-07-24T20:21:02Z

foreshadow/cleaners/internals/datetimes.py

+    text = str(t)
+    res = re.search(regex, text)
+    if res is not None:
+        res = sum([len(range(reg[0], reg[1])) for reg in res.regs[1:2]])


I might add a comment as to what is going on here.
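For reference, a commented sketch of what that line appears to compute. The pattern is reconstructed from the truncated diff above (assuming both placeholders are the delimiter class), and matched_date_length is a hypothetical name used only for this illustration.

```python
import re

# Assumed expansion of the pattern in the diff above.
REGEX = r"^.*(([\d]{4})[-/]([\d]{2})[-/]([\d]{2})).*$"


def matched_date_length(text):
    """Return how many characters the outer date group matched, or None."""
    res = re.search(REGEX, str(text))
    if res is None:
        return None
    # res.regs holds one (start, end) span per group. Index 0 is the whole
    # match and index 1 is the outer date group, so regs[1:2] is a one-element
    # slice and the sum is just the length of that single span.
    return sum(len(range(start, end)) for start, end in res.regs[1:2])
```

The same value can be read directly from the span of group 1, as end minus start from res.span(1), which may be easier to follow than the sum over a one-element slice.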

adithyabsk · 2019-07-24T20:21:54Z

foreshadow/cleaners/internals/datetimes.py

+
+
+def _split_to_new_cols(t, return_search=False):
+    """Clean text if it is in a YYYYMDD format and split to three columns.


I know this is a proof of concept, but can we use the datetime library, which will parse dates and let us extract the entities we need?
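A rough sketch of that idea using only the standard library. One caveat: datetime.strptime needs an explicit format, so it does not parse arbitrary dates on its own. The candidate format list and the name split_date below are assumptions for illustration, not part of this PR.

```python
from datetime import datetime

# Assumed formats, matching the "-" and "/" delimiters used in the regex.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")


def split_date(text):
    """Return (year, month, day) if text matches a known format, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(str(text).strip(), fmt)
        except ValueError:
            # Wrong format or an impossible date (e.g. month 13): try the next.
            continue
        return parsed.year, parsed.month, parsed.day
    return None
```

Unlike the regex, this rejects impossible dates such as 2019-13-40, at the cost of only accepting strings that are entirely a date.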

adithyabsk · 2019-07-24T20:22:47Z

foreshadow/cleaners/internals/financial_cleaner.py

+
+
+def financial_transform(text, return_search=False):
+    """Clean text if it is a financial text.


I might steal what is already implemented in Financial in smart.

The regex in there is really robust.

cchoquette added 2 commits July 15, 2019 13:31

Extended 1.DataPreparer functionality; 2. Base Class for each step of…

2647236

… DataPreparer; 3. DataCleaner functionality

Making unprivate imports

5c2033e

cchoquette requested a review from adithyabsk July 17, 2019 16:51

cchoquette added 3 commits July 17, 2019 13:06

Fixing __init__'s

58fdb8b

Adding financial_cleaner example

80bd55a

Committing prior to merging in development.

75decdf

adithyabsk suggested changes Jul 19, 2019

cchoquette added 4 commits July 21, 2019 23:04

New PreparerStep base class. Minimal tests and minimal DataCleaner im…

cd13449

…plementation testing its usage. Merge remote-tracking branch 'remotes/origin/development' into data_cleaner # Conflicts: # foreshadow/exceptions.py # foreshadow/transformers/base.py # pytest.ini

Cleaned up PreparerStep base class code and added thorough doc strings.

dd910b2

Fixing pytest.ini

ed30660

DataCleaner working without full tests less creating new columns

c7d14d0

adithyabsk reviewed Jul 22, 2019

cchoquette added 8 commits July 22, 2019 20:09

Finished DataPreparer and DataCleaner without tests/DropTransform ful…

d64a524

…ly implemented.

Adding test and fixing DataCleaner pipeline creation to include step …

5f78dd9

…names

merge in data_preparer

a67d197

Merge branch 'data_preparer' into data_cleaner

2e67a71

Almost fully working solution.

d4bfeae

Merge in latest development.

DataCleaner working for a DataFrame with 1 column.

9b5018d

Working PreparerSteps and DataCleaner, with a couple tests to show.

ef73579

Making smart still fit aggregate transformer, but return self.

b2e409a

adithyabsk suggested changes Jul 24, 2019

cchoquette added 4 commits July 24, 2019 16:31

Changed column naming scheme and added column_sharer across the project.

9576014

Fully working solution with DropTransform. Minimal Tests.

85633dc

Flaked.

a8f85ab

isorted.

de2b0ff

cchoquette marked this pull request as ready for review July 25, 2019 21:45

Removed DropMixin

0971f4c

cchoquette merged commit b0cda87 into data_preparer Jul 25, 2019

adithyabsk deleted the data_cleaner branch August 6, 2019 19:28
