Skip to content
This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

Data Preparer and PreparerSteps base class #98

Merged
merged 22 commits into from
Jul 25, 2019
Merged

Conversation

cchoquette
Copy link
Contributor

Adding DataPreparer with baes for its steps.

Copy link
Contributor

@adithyabsk adithyabsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a partial code review.

@@ -0,0 +1,37 @@
"""Internal cleaners for handling the cleaning and shaping of data."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep all concrete transformers in concrete so there is one place to look for concrete transformers.

foreshadow/tests/test_core/test_base.py Outdated Show resolved Hide resolved
foreshadow/metrics/internals.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
foreshadow/core/base.py Outdated Show resolved Hide resolved
…plementation testing its usage.

Merge remote-tracking branch 'remotes/origin/development' into data_cleaner

# Conflicts:
#	foreshadow/exceptions.py
#	foreshadow/transformers/base.py
#	pytest.ini
@@ -27,7 +27,7 @@ def _get_classes():
c[1]
for m in modules
for c in inspect.getmembers(m)
if inspect.isclass(c[1])
if inspect.isclass(c[1]) and c[1].__name__.find("Base") == -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note to change this once we merge serializer mixing changes in or move this function to utilities.

Copy link
Contributor

@adithyabsk adithyabsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments.

@@ -0,0 +1,39 @@
# import re
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be committed?

class InvalidDataFrame(Exception):
"""Raised when a tranformer outputs an invalid DataFrame.

An example of when this might occur is a DataFrame different list lengths.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame of different list lengths

@@ -413,6 +413,8 @@ def test_smarttransformer_function_override(smart_child):
std.fit(df[["crim"]])
std_data = std.transform(df[["crim"]])

assert std_data.columns[0] == 'crim_impute_0'

# TODO, remove when SmartTransformer is no longer wrapped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of right now everything will need to be wrapped in order to be serializable, otherwise they need to explicitly inherit from ConcreteSerializable.

@@ -213,7 +213,7 @@ def fit(self, X, y=None, **fit_params):
trans,
_slice_cols(X, cols),
y,
**{**fit_params, **_inject_df(trans, X)}
**fit_params,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this didn't break any tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this as an issue so we can track it with this Line linked in a comment.



class TransformersPipeline(Pipeline):
def _fit(self, X, y=None, **fit_params):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point out the differences with the Pipeline parent _fit in comments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would look into this to see if this approach might work: https://gist.github.com/adam-p/63aaae093a71a844150e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look into turning on linting for internal methods (flake8-docstrings)


Args:
**kwargs: placeholder.
row_of_feature: one row of one column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add return_tuple


"""
delimiters = "[-/]"
regex = "^.*(([\d]{4})%s([\d]{2})%s([\d]{2})).*$" % (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use python format strings

text = str(t)
res = re.search(regex, text)
if res is not None:
res = sum([len(range(reg[0], reg[1])) for reg in res.regs[1:2]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might add a comment as to what is going on here.



def _split_to_new_cols(t, return_search=False):
"""Clean text if it is in a YYYYMDD format and split to three columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a proof of concept but can we use the datetime library that will parse any date and will allow us to extract the entities we need?



def financial_transform(text, return_search=False):
"""Clean text if it is a financial text.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might steal what is already implemented in Financial in smart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex in there is really robust.

@cchoquette cchoquette marked this pull request as ready for review July 25, 2019 21:45
@cchoquette cchoquette merged commit b0cda87 into data_preparer Jul 25, 2019
@adithyabsk adithyabsk deleted the data_cleaner branch August 6, 2019 19:28
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants