-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Data Labeler: ColumnNameModel Build #626
New Data Labeler: ColumnNameModel Build #626
Conversation
@@ -31,8 +32,8 @@ def __init__(self, label_mapping=None, parameters=None): | |||
# parameter initialization | |||
if not parameters: | |||
parameters = {} | |||
parameters.setdefault('negative_dataframe', ) | |||
parameters.setdefault('positive_dataframe', ) | |||
parameters.setdefault('false_positive_df', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol thx
scores = self._model( | ||
list_of_column_names, | ||
check_values_dict, | ||
self._make_lower_case(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this something we are going to parameterize? why not just do str.lower
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question -- thought the same myself. The parameter for the process.cdist
function must be callable
so that is a big requirement. str.lower()
isn't callable since that returns a str
object type.
Documentation to the process.cdist
function here
pass | ||
|
||
def _make_lower_case(str, **kwargs): | ||
return str.lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be str.lower()
or str.lower
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out[1]: <function str.lower()>
should return a str
object type... so think we actually do want to call the lower()
as a function call
@@ -134,6 +134,7 @@ def get_class(cls, class_name): | |||
# Import possible internal models | |||
from .character_level_cnn_model import CharacterLevelCnnModel # NOQA | |||
from .regex_model import RegexModel # NOQA | |||
from .column_name_model import ColumnNameModel # NOQA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding the model to be loadable as a class
@@ -0,0 +1,225 @@ | |||
"""Contains class for regex data labeling model.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entirely new model based regex model type
mock_model_parameters = { | ||
'true_positive_dict': [ | ||
{ | ||
'attribute': 'ssn', | ||
'label': 'ssn' | ||
}, | ||
{ | ||
'attribute': 'suffix', | ||
'label': 'name' | ||
}, | ||
{ | ||
'attribute': 'my_home_address', | ||
'label': 'address' | ||
}, | ||
], | ||
'false_positive_dict': [ | ||
{ | ||
'attribute': 'contract_number', | ||
'label': 'ssn', | ||
}, | ||
{ | ||
'attribute': 'role', | ||
'label': 'name', | ||
}, | ||
{ | ||
'attribute': 'send_address', | ||
'label': 'address', | ||
}, | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format fixing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing in next commit
@@ -0,0 +1,292 @@ | |||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test suite for each method in the new ColumnNameModel
class
"conf": 100.0 | ||
} | ||
} | ||
model_output = model.predict(data=["ssn", "role_name", "wallet_address"], show_confidences=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a test for false as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the below is a FALSE
scenario
with self.assertLogs(
"DataProfiler.labelers.column_name_model", level="INFO"
) as logs:
model_output = model.predict(data=["ssn", "role_name", "wallet_address"])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh i missed it, sorry since it wasn't declared i missed it.
not isinstance(value, list) or not isinstance(value[0], dict) | ||
): | ||
errors.append( | ||
"""`{}` must be a list of dictionaries each with the following |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't check for attribute
and label
key in list of dicts. Does it need to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing -- good catch
false_positive_dict = self._parameters["false_positive_dict"] | ||
if false_positive_dict: | ||
data = self._compare_negative( | ||
data, false_positive_dict, negative_threshold=50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this negative_threshold
value arbitrary? Should it be a constant at the top of the file or in a config somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch -- nah this should be a param and not hard coded
output = self._compare_positive( | ||
data, | ||
self._parameters["true_positive_dict"], | ||
positive_threshold=85, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question as above regarding positive_threshold
@@ -0,0 +1,258 @@ | |||
"""Contains class for column name data labeling model.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whole new model -- based on regex_model.py to some extent
"""Reset weights function.""" | ||
pass | ||
|
||
@require_module(["rapidfuzz"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think this is going to spout a graph profiling error as oppose to a labeler error
No description provided.