This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

Updating PreparerStep, creating PreparerMapping. CR changes to DataCleaner. #111

Merged
merged 57 commits into from
Aug 2, 2019
Changes from 39 commits
Commits
711ccea
Outline for DataPreparer object as defined pipeline object for Foresh…
cchoquette Jul 11, 2019
2647236
Extended 1.DataPreparer functionality; 2. Base Class for each step of…
cchoquette Jul 15, 2019
7413c7f
Fixing pytest.ini file for command line usage.
cchoquette Jul 16, 2019
d59e4bb
Removing modeler from data_preparer
cchoquette Jul 16, 2019
2625e87
Switching _none_to_dict to single kwarg
cchoquette Jul 16, 2019
5c2033e
Making unprivate imports
cchoquette Jul 17, 2019
58fdb8b
Fixing __init__'s
cchoquette Jul 17, 2019
80bd55a
Adding financial_cleaner example
cchoquette Jul 18, 2019
75decdf
Committing prior to merging in development.
cchoquette Jul 19, 2019
cd13449
New PreparerStep base class. Minimal tests and minimal DataCleaner im…
cchoquette Jul 22, 2019
dd910b2
Cleaned up PreparerStep base class code and added thorough doc strings.
cchoquette Jul 22, 2019
ed30660
Fixing pytest.ini
cchoquette Jul 22, 2019
c7d14d0
DataCleaner working without full tests less creating new columns
cchoquette Jul 22, 2019
d64a524
Finished DataPreparer and DataCleaner without tests/DropTransform ful…
cchoquette Jul 23, 2019
5f78dd9
Adding test and fixing DataCleaner pipeline creation to include step …
cchoquette Jul 23, 2019
ef97754
Merge remote-tracking branch 'remotes/origin/development' into data_p…
cchoquette Jul 23, 2019
a67d197
merge in data_preparer
cchoquette Jul 23, 2019
2e67a71
Merge branch 'data_preparer' into data_cleaner
cchoquette Jul 23, 2019
a3e81d4
Merge branch 'development' into data_preparer
cchoquette Jul 23, 2019
d4bfeae
Almost fully working solution.
cchoquette Jul 23, 2019
9b5018d
DataCleaner working for a DataFrame with 1 column.
cchoquette Jul 23, 2019
ef73579
Working PreparerSteps and DataCleaner, with a couple tests to show.
cchoquette Jul 24, 2019
b2e409a
Making smart still fit aggregate transformer, but return self.
cchoquette Jul 24, 2019
9576014
Changed column naming scheme and added column_sharer across the project.
cchoquette Jul 24, 2019
85633dc
Fully working solution with DropTransform. Minimal Tests.
cchoquette Jul 25, 2019
a8f85ab
Flaked.
cchoquette Jul 25, 2019
de2b0ff
isorted.
cchoquette Jul 25, 2019
0971f4c
Removed DropMixin
cchoquette Jul 25, 2019
b0cda87
Data Preparer and PreparerSteps base class and Data Cleaner(#98)
cchoquette Jul 25, 2019
02cbf32
DataPreparer, Baseclass for PreparerSteps, DataCleaner.
cchoquette Jul 25, 2019
f6a9660
Newsfragment.
cchoquette Jul 25, 2019
b42bbf3
Creating DynamicPipeline
cchoquette Jul 29, 2019
d25fd7c
Creating DynamicPipeline with exec for licensing. Fixing Numerical in…
cchoquette Jul 29, 2019
3597d53
Merge branch 'development' into data_preparer
cchoquette Jul 29, 2019
02f87ad
Code Review changes
cchoquette Jul 29, 2019
631a396
DataCleaner changes
cchoquette Jul 29, 2019
8ebe7b8
Switching to internal PreparerMapping
cchoquette Jul 30, 2019
be33eff
Flaked.
cchoquette Jul 30, 2019
272c302
Fixing documentation, adding PreparerMapping
cchoquette Jul 30, 2019
126854a
Code Review changes.
cchoquette Jul 31, 2019
4046bcd
Code Review changes. Some Test refactoring, import refactoring.
cchoquette Jul 31, 2019
d4ab362
Merge branch 'development' into data_preparer and some minor import c…
cchoquette Aug 1, 2019
629d940
Project restructure.
cchoquette Aug 1, 2019
b644abc
Project restructure.
cchoquette Aug 1, 2019
5027969
Partial project restructure.
cchoquette Aug 1, 2019
84b9516
Partial project restructure.
cchoquette Aug 1, 2019
a8550b3
Final Project restructure:
cchoquette Aug 2, 2019
7dd156c
foreshadow.concrete import rollup complete.
cchoquette Aug 2, 2019
9bca84b
Partial flake
cchoquette Aug 2, 2019
016e7a0
Flake complete.
cchoquette Aug 2, 2019
3bddf65
Fixing setup.cfg
cchoquette Aug 2, 2019
81b6fbf
Code review changes.
cchoquette Aug 2, 2019
3133b1f
Flaked.
cchoquette Aug 2, 2019
a96f841
More flake.
cchoquette Aug 2, 2019
7c27697
Skipping 5 important tests for next sprint.
cchoquette Aug 2, 2019
34ebff0
skipping integration.
cchoquette Aug 2, 2019
0afce1e
skipping integration.
cchoquette Aug 2, 2019
112 changes: 95 additions & 17 deletions foreshadow/cleaners/data_cleaner.py
Expand Up @@ -4,6 +4,7 @@
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

from foreshadow.core import logging
from foreshadow.core.preparerstep import PreparerStep
from foreshadow.exceptions import InvalidDataFrame
from foreshadow.metrics.internals import avg_col_regex, regex_rows
Expand All @@ -19,15 +20,14 @@
class DataCleaner(PreparerStep):
"""Determine and perform best data cleaning step."""

def __init__(self, *args, **kwargs):
def __init__(self, **kwargs):
"""Define the single step for DataCleaner, using SmartCleaner.

Args:
*args: args to PreparerStep constructor.
**kwargs: kwargs to PreparerStep constructor.

"""
super().__init__(*args, use_single_pipeline=True, **kwargs)
super().__init__(use_single_pipeline=True, **kwargs)

def get_mapping(self, X):
"""Return the mapping of transformations for the DataCleaner step.
Expand All @@ -47,15 +47,37 @@ def get_mapping(self, X):
]
for c in X
],
X=X,
cols=X.columns,
)

def __repr__(self):
"""Return string representation of this object with parent params.

Returns:
See above.

"""
r = super().__repr__()
preparer_params = self._preparer_params()
preparer_params = {p: getattr(self, p, None) for p in preparer_params}
preparer_print = ", ".join(
["{}={}".format(k, v) for k, v in preparer_params.items()]
)
return r[:-1] + preparer_print + ")"


class SmartCleaner(SmartTransformer):
"""Intelligently decide which cleaning function should be applied."""

def __init__(self, **kwargs):
super().__init__(**kwargs)
def __init__(
self, # manually adding as otherwise get_params won't see it.
check_wrapped=False,
**kwargs
):
self.single_input = True # all transformers under this only accept
# 1 column. This is how DynamicPipeline knows this.
# get_params then set_params, it may be in kwargs already
super().__init__(check_wrapped=check_wrapped, **kwargs)

def pick_transformer(self, X, y=None, **fit_params):
"""Get best transformer for a given column.
Expand All @@ -79,22 +101,44 @@ def pick_transformer(self, X, y=None, **fit_params):
best_score = 0
best_cleaner = None
for cleaner, name in cleaners:
cleaner = cleaner(column_sharer=self.column_sharer, name=name)
cleaner = cleaner(column_sharer=self.column_sharer)
score = cleaner.metric_score(X)
if score > best_score:
best_score = score
best_cleaner = cleaner

if best_cleaner is None:
return NoTransform(column_sharer=self.column_sharer)
return best_cleaner

def resolve(self, X, *args, **kwargs):
"""Resolve the underlying concrete transformer.

Sets self.column_sharer with the domain tag.

Args:
X: input DataFrame
*args: args to super
**kwargs: kwargs to super

Returns:
Return from super.

"""
ret = super().resolve(X, *args, **kwargs)
if self.column_sharer is not None:
self.column_sharer[
"domain", X.columns[0]
] = self.transformer.__class__.__name__
else:
logging.debug("column_sharer was None")
return ret


class SmartFlatten(SmartTransformer):
"""Smartly determine how to flatten an input DataFrame."""

def __init__(self, **kwargs):
super().__init__(**kwargs)
def __init__(self, check_wrapped=True, **kwargs):
super().__init__(check_wrapped=check_wrapped, **kwargs)

def pick_transformer(self, X, y=None, **fit_params):
"""Get best transformer for a given column.
Expand Down Expand Up @@ -122,16 +166,38 @@ def pick_transformer(self, X, y=None, **fit_params):
best_score = 0
best_flattener = None
for flattener, name in flatteners:
flattener = flattener(column_sharer=self.column_sharer, name=name)
flattener = flattener(column_sharer=self.column_sharer)
score = flattener.metric_score(X)
if score > best_score:
best_score = score
best_flattener = flattener

if best_flattener is None:
return NoTransform(column_sharer=self.column_sharer)
return best_flattener

def resolve(self, X, *args, **kwargs):
"""Resolve the underlying concrete transformer.

Sets self.column_sharer with the domain tag.

Args:
X: input DataFrame
*args: args to super
**kwargs: kwargs to super

Returns:
Return from super.

"""
ret = super().resolve(X, *args, **kwargs)
if self.column_sharer is not None:
self.column_sharer[
"domain", X.columns[0]
] = self.transformer.__class__.__name__
else:
logging.debug("column_sharer was None")
return ret
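
The comment "manually adding as otherwise get_params won't see it" on `check_wrapped` matches how scikit-learn's `BaseEstimator.get_params` works: it reads parameter names from the `__init__` signature and skips `**kwargs`. The sketch below is a simplified stand-in for that discovery step using only `inspect`; `visible_params`, `OnlyKwargs`, and `ExplicitParam` are hypothetical names for illustration and are not part of foreshadow or scikit-learn.

```python
import inspect


def visible_params(cls):
    """Return the __init__ parameter names that signature inspection can see.

    Simplified illustration only: named parameters are discovered,
    the **kwargs catch-all is skipped.
    """
    sig = inspect.signature(cls.__init__)
    return sorted(
        name
        for name, p in sig.parameters.items()
        if name != "self" and p.kind is not inspect.Parameter.VAR_KEYWORD
    )


class OnlyKwargs:
    """check_wrapped is hidden inside **kwargs."""

    def __init__(self, **kwargs):
        self.check_wrapped = kwargs.get("check_wrapped", False)


class ExplicitParam:
    """check_wrapped is a named keyword argument."""

    def __init__(self, check_wrapped=False, **kwargs):
        self.check_wrapped = check_wrapped


print(visible_params(OnlyKwargs))
print(visible_params(ExplicitParam))
```

Under this assumption, only the second class exposes `check_wrapped` to a get_params/set_params round trip, which is presumably why the parameter was spelled out in both `SmartCleaner` and `SmartFlatten`.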


class BaseCleaner(BaseEstimator, TransformerMixin):
"""Base class for any Cleaner Transformer."""
Expand All @@ -142,6 +208,7 @@ def __init__(
output_columns=None,
confidence_computation=None,
default=lambda x: x,
column_sharer=None,
):
"""Construct any cleaner/flattener.

Expand All @@ -157,6 +224,8 @@ def __init__(
subclass's metric computation. This implies an OVR model.
default: Function that returns the default value for a row if
the transformation failed. Accepts the row as input.
column_sharer: An instance of 'foreshadow.core.ColumnSharer' to
be used to share information between PreparerSteps.

Raises:
ValueError: If not a list, int, or None specifying expected
Expand All @@ -170,6 +239,7 @@ def __init__(
self.output_columns = output_columns
self.transformations = transformations
self.confidence_computation = {regex_rows: 0.8, avg_col_regex: 0.2}
self.column_sharer = column_sharer
if confidence_computation is not None:
self.confidence_computation = confidence_computation

Expand Down Expand Up @@ -289,25 +359,33 @@ def transform(self, X, y=None):
# access single column as series and apply the list of
# transformations to each row in the series.
if any(
[isinstance(out[i], (list, tuple)) for i in range(out.shape[0])]
[
isinstance(out.iloc[i], (list, tuple))
for i in range(out.shape[0])
]
): # out are lists == new columns
if not all(
[len(out[0]) == len(out[i]) for i in range(len(out[0]))]
[
len(out.iloc[0]) == len(out.iloc[i])
for i in range(len(out.iloc[0]))
]
):
raise InvalidDataFrame(
"length of lists: {}, returned not of same value.".format(
[out[i] for i in range(len(out[0]))]
[out.iloc[i] for i in range(len(out[0]))]
)
)
columns = self.output_columns
if columns is None:
columns = [X.columns[0] + str(c) for c in range(len(out[0]))]
columns = [
X.columns[0] + str(c) for c in range(len(out.iloc[0]))
]
# by default, pandas would have given a unique integer to
# each column, instead, we keep the previous column name and
# add that integer.
X = pd.DataFrame([*out.values], columns=columns)
elif any(
[isinstance(out[i], (dict)) for i in range(out.shape[0])]
[isinstance(out.iloc[i], (dict)) for i in range(out.shape[0])]
): # out are dicts == named new columns
all_keys = dict()
for row in out:
Expand Down
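
The switch from `out[i]` to `out.iloc[i]` in `transform` matters whenever the Series index is not the default 0..n-1 range: on an integer-indexed Series, `out[i]` is a label lookup, while `out.iloc[i]` is positional. A minimal sketch, assuming a Series of split date parts whose index starts at 10 (the data is made up for illustration):

```python
import pandas as pd

# A Series whose integer index does not start at 0, e.g. after rows were filtered.
out = pd.Series([["2019", "08", "02"], ["2018", "01", "15"]], index=[10, 11])

# Positional access: always the first row, whatever its label is.
first_by_position = out.iloc[0]

# Label access: there is no label 0 in this index, so the lookup fails.
try:
    out[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(first_by_position)
print(label_lookup_failed)
```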
2 changes: 1 addition & 1 deletion foreshadow/cleaners/internals/__init__.py
Expand Up @@ -33,5 +33,5 @@ def _get_classes():
return classes


classes = _get_modules(_get_classes(), globals(), __name__)
classes = _get_modules(_get_classes(), globals(), __name__, wrap=False)
__all__ = classes
13 changes: 7 additions & 6 deletions foreshadow/cleaners/internals/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,15 +16,14 @@ def _split_to_new_cols(t):

"""
delimiters = "[-/]"
regex = r"^.*(([\d]{4})%s([\d]{2})%s([\d]{2})).*$" % (
delimiters,
delimiters,
regex = r"^.*(([\d]{{4}}){delim}([\d]{{2}}){delim}([\d]{{2}})).*$".format(
delim=delimiters
)
text = str(t)
res = re.search(regex, text)
if res is not None:
res = sum([len(range(reg[0], reg[1])) for reg in res.regs[1:2]])
texts = [re.sub(regex, r"\%d" % i, text) for i in range(2, 5)]
texts = [re.sub(regex, r"\{}".format(i), text) for i in range(2, 5)]
else:
texts = t
res = 0
Expand All @@ -38,7 +37,7 @@ class YYYYMMDDDateCleaner(BaseCleaner):

"""

def __init__(self):
def __init__(self, column_sharer=None):
transformations = [_split_to_new_cols]

def make_list_of_three(x):
Expand All @@ -55,4 +54,6 @@ def make_list_of_three(x):
return [x, "", ""]

default = make_list_of_three
super().__init__(transformations, default=default)
super().__init__(
transformations, default=default, column_sharer=column_sharer
)
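
The move from `%` interpolation to `str.format` in `_split_to_new_cols` requires doubling the regex quantifier braces, because `str.format` treats single braces as replacement fields. The sketch below checks that both spellings build the same pattern and shows what the `r"\{}".format(i)` back-references extract; the sample text is an assumption chosen for illustration.

```python
import re

delimiters = "[-/]"

# Old style: %-interpolation leaves the quantifier braces untouched.
old_regex = r"^.*(([\d]{4})%s([\d]{2})%s([\d]{2})).*$" % (delimiters, delimiters)

# New style: literal braces are doubled so str.format does not consume them.
new_regex = r"^.*(([\d]{{4}}){delim}([\d]{{2}}){delim}([\d]{{2}})).*$".format(
    delim=delimiters
)

text = "2019-08-02"
# r"\{}".format(i) builds the back-references \2, \3 and \4,
# which select the year, month and day groups.
parts = [re.sub(new_regex, r"\{}".format(i), text) for i in range(2, 5)]

print(old_regex == new_regex)
print(parts)
```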
4 changes: 2 additions & 2 deletions foreshadow/cleaners/internals/drop.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,9 @@ class DropCleaner(BaseCleaner):

"""

def __init__(self):
def __init__(self, column_sharer=None):
transformations = [drop_transform]
super().__init__(transformations)
super().__init__(transformations, column_sharer=column_sharer)

def transform(self, X, y=None):
"""Clean string columns.
Expand Down
4 changes: 2 additions & 2 deletions foreshadow/cleaners/internals/financial_cleaner.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,6 @@ class DollarFinancialCleaner(BaseCleaner):

"""

def __init__(self):
def __init__(self, column_sharer=None):
transformations = [financial_transform]
super().__init__(transformations)
super().__init__(transformations, column_sharer=column_sharer)
6 changes: 3 additions & 3 deletions foreshadow/cleaners/internals/json_flattener.py
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ def json_flatten(text):
try:
ret = json.loads(text)
matched = len(text)
except json.JSONDecodeError:
except (json.JSONDecodeError, TypeError):
pass # didn't match.
return ret, matched

Expand All @@ -59,6 +59,6 @@ class StandardJsonFlattener(BaseCleaner):

"""

def __init__(self):
def __init__(self, column_sharer=None):
transformations = [json_flatten]
super().__init__(transformations)
super().__init__(transformations, column_sharer=column_sharer)
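
Catching `TypeError` alongside `json.JSONDecodeError` in `json_flatten` covers cells that are not strings at all: `json.loads` raises `TypeError` for inputs such as `None` or a float, and `JSONDecodeError` only for malformed text. A minimal sketch of that behaviour follows; `try_json` and its `None, 0` defaults are hypothetical, since the diff does not show how `ret` and `matched` are initialised.

```python
import json


def try_json(text):
    """Parse text as JSON; return (parsed value, matched length)."""
    ret, matched = None, 0  # assumed defaults, not taken from the diff
    try:
        ret = json.loads(text)
        matched = len(text)
    except (json.JSONDecodeError, TypeError):
        pass  # malformed JSON, or an input that is not str/bytes
    return ret, matched


print(try_json('{"a": 1}'))     # valid JSON string
print(try_json("not json"))     # raises JSONDecodeError internally
print(try_json(None))           # raises TypeError internally
print(try_json(float("nan")))   # raises TypeError internally
```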
2 changes: 1 addition & 1 deletion foreshadow/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ def get_config(base):


def reset_config():
"""Reset internal configuration
"""Reset internal configuration.

Note:
This is useful in an IDLE setting when the configuration file might
Expand Down
18 changes: 18 additions & 0 deletions foreshadow/core/logging.py
Original file line number Diff line number Diff line change
Expand Up @@ -278,6 +278,24 @@ def info(*args, **kwargs):
return log(*args, **kwargs)


def warning(*args, **kwargs):
"""Log warning message.

Manually overriding so that this method is explicitly a part of this
module.

Args:
*args: To logging.warning
**kwargs: To logging.warning

Returns:
logging.warning

"""
log = _wrap_log(_log, "warning")
return log(*args, **kwargs)


def log_and_gui(level, msg, gui_details, gui_schema, *args, **kwargs):
"""Log msg to gui at specific level and write gui_details under gui_schema.

Expand Down