-
Notifications
You must be signed in to change notification settings - Fork 2
Data Preparer and PreparerSteps base class #98
Conversation
… DataPreparer; 3. DataCleaner functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a partial code review.
@@ -0,0 +1,37 @@ | |||
"""Internal cleaners for handling the cleaning and shaping of data.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep all concrete transformers in concrete so there is one place to look for concrete transformers.
…plementation testing its usage. Merge remote-tracking branch 'remotes/origin/development' into data_cleaner # Conflicts: # foreshadow/exceptions.py # foreshadow/transformers/base.py # pytest.ini
@@ -27,7 +27,7 @@ def _get_classes(): | |||
c[1] | |||
for m in modules | |||
for c in inspect.getmembers(m) | |||
if inspect.isclass(c[1]) | |||
if inspect.isclass(c[1]) and c[1].__name__.find("Base") == -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note to change this once we merge serializer mixing changes in or move this function to utilities.
Merge in latest development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments.
@@ -0,0 +1,39 @@ | |||
# import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be committed?
class InvalidDataFrame(Exception): | ||
"""Raised when a tranformer outputs an invalid DataFrame. | ||
|
||
An example of when this might occur is a DataFrame different list lengths. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame of different list lengths
@@ -413,6 +413,8 @@ def test_smarttransformer_function_override(smart_child): | |||
std.fit(df[["crim"]]) | |||
std_data = std.transform(df[["crim"]]) | |||
|
|||
assert std_data.columns[0] == 'crim_impute_0' | |||
|
|||
# TODO, remove when SmartTransformer is no longer wrapped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of right now everything will need to be wrapped in order to be serializable, otherwise they need to explicitly inherit from ConcreteSerializable.
@@ -213,7 +213,7 @@ def fit(self, X, y=None, **fit_params): | |||
trans, | |||
_slice_cols(X, cols), | |||
y, | |||
**{**fit_params, **_inject_df(trans, X)} | |||
**fit_params, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this didn't break any tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this as an issue so we can track it with this Line linked in a comment.
|
||
|
||
class TransformersPipeline(Pipeline): | ||
def _fit(self, X, y=None, **fit_params): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point out the differences with the Pipeline parent _fit in comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would look into this to see if this approach might work: https://gist.github.com/adam-p/63aaae093a71a844150e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look into turning on linting for internal methods (flake8-docstrings)
|
||
Args: | ||
**kwargs: placeholder. | ||
row_of_feature: one row of one column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add return_tuple
|
||
""" | ||
delimiters = "[-/]" | ||
regex = "^.*(([\d]{4})%s([\d]{2})%s([\d]{2})).*$" % ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use python format strings
text = str(t) | ||
res = re.search(regex, text) | ||
if res is not None: | ||
res = sum([len(range(reg[0], reg[1])) for reg in res.regs[1:2]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might add a comment as to what is going on here.
|
||
|
||
def _split_to_new_cols(t, return_search=False): | ||
"""Clean text if it is in a YYYYMDD format and split to three columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a proof of concept but can we use the datetime library that will parse any date and will allow us to extract the entities we need?
|
||
|
||
def financial_transform(text, return_search=False): | ||
"""Clean text if it is a financial text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might steal what is already implemented in Financial in smart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex in there is really robust.
Adding DataPreparer with baes for its steps.