New Data Labeler: ColumnNameModel Build #626

Merged
merged 17 commits into from
Sep 14, 2022
Conversation

taylorfturner
Contributor

No description provided.

@taylorfturner added the "Work In Progress" (Solution is being developed) and "New Feature" (A feature addition not currently in the library) labels on Sep 9, 2022
@taylorfturner self-assigned this on Sep 9, 2022
@JGSweets changed the title from "New Data Labeler: ColumnNameModel Build" to "[WIP] New Data Labeler: ColumnNameModel Build" on Sep 9, 2022
@@ -31,8 +32,8 @@ def __init__(self, label_mapping=None, parameters=None):
        # parameter initialization
        if not parameters:
            parameters = {}
        parameters.setdefault('negative_dataframe', )
        parameters.setdefault('positive_dataframe', )
        parameters.setdefault('false_positive_df', None)
Contributor

nice naming

Contributor Author

lol thx

scores = self._model(
    list_of_column_names,
    check_values_dict,
    self._make_lower_case(),
Contributor

is this something we are going to parameterize? why not just do str.lower?

Contributor Author

Good question -- I thought the same myself. The argument passed to the process.cdist function must be callable, so that is a hard requirement. The result of calling str.lower() isn't callable, since the call returns a str object.

Documentation for the process.cdist function is here

pass

def _make_lower_case(str, **kwargs):
    return str.lower()
Contributor

should it be str.lower() or str.lower?

Contributor Author

Out[1]: <function str.lower()>

It should return a str object, so I think we actually do want to call lower() here as a function call.
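The distinction raised in this thread can be checked in plain Python. The sketch below is not code from the PR: `apply_processor` is a hypothetical helper standing in for any API that expects a callable preprocessing step. It shows that the function object `str.lower` is callable, while the result of calling `lower()` on a string is a plain `str` and is not.

```python
def apply_processor(names, processor):
    """Apply a callable preprocessing step to every column name.

    Hypothetical helper, used only to illustrate the callable requirement.
    """
    if not callable(processor):
        raise TypeError(
            "processor must be callable, got " + type(processor).__name__
        )
    return [processor(name) for name in names]


# str.lower is the function object itself, so it can be passed as a processor.
is_function_callable = callable(str.lower)

# "SSN".lower() is the *result* of a call: a str, which cannot be called.
is_result_callable = callable("SSN".lower())

# Passing the function object lets the helper call it once per name.
lowered = apply_processor(["SSN", "Role_Name"], str.lower)
```

Under this reading, passing `str.lower` hands over the function, whereas passing the value of a `lower()` call hands over a string that cannot be invoked later.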

@taylorfturner removed the "Work In Progress" (Solution is being developed) label on Sep 13, 2022
@@ -134,6 +134,7 @@ def get_class(cls, class_name):
        # Import possible internal models
        from .character_level_cnn_model import CharacterLevelCnnModel  # NOQA
        from .regex_model import RegexModel  # NOQA
        from .column_name_model import ColumnNameModel  # NOQA
Contributor Author

adding the model to be loadable as a class
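For readers unfamiliar with why an otherwise unused import makes a model "loadable", here is a generic sketch of a name-based class registry. It is an assumption-laden illustration, not DataProfiler's actual mechanism: `BaseModelSketch` and `ColumnNameModelSketch` are hypothetical names, and `__init_subclass__` is used here simply because it is the shortest way to show the idea that a subclass only becomes discoverable once its module has been imported.

```python
class BaseModelSketch:
    """Hypothetical base class that records every subclass by name."""

    _registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Registration happens when the subclass body is executed,
        # i.e. when the module defining it is imported.
        BaseModelSketch._registry[cls.__name__.lower()] = cls

    @classmethod
    def get_class(cls, class_name):
        """Return the registered subclass for class_name, or None."""
        return cls._registry.get(class_name.lower())


class ColumnNameModelSketch(BaseModelSketch):
    """Stand-in for a model defined in its own module."""


found = BaseModelSketch.get_class("ColumnNameModelSketch")
not_found = BaseModelSketch.get_class("UnknownModel")
```

In a layout like this, a lookup by name can only succeed after the defining module has been imported, which is one plausible reason for an import that exists purely for its side effect.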

@@ -0,0 +1,225 @@
"""Contains class for regex data labeling model."""
Contributor Author

entirely new model, based on the regex model type

Comment on lines 17 to 46
mock_model_parameters = {
    'true_positive_dict': [
        {
            'attribute': 'ssn',
            'label': 'ssn'
        },
        {
            'attribute': 'suffix',
            'label': 'name'
        },
        {
            'attribute': 'my_home_address',
            'label': 'address'
        },
    ],
    'false_positive_dict': [
        {
            'attribute': 'contract_number',
            'label': 'ssn',
        },
        {
            'attribute': 'role',
            'label': 'name',
        },
        {
            'attribute': 'send_address',
            'label': 'address',
        },
    ]
}
Contributor

format fixing

Contributor Author

fixing in next commit

@@ -0,0 +1,292 @@
import json
Contributor Author

test suite for each method in the new ColumnNameModel class

"conf": 100.0
}
}
model_output = model.predict(data=["ssn", "role_name", "wallet_address"], show_confidences=True)
Contributor

do we have a test for false as well?

Contributor Author

yes, the below is a FALSE scenario

with self.assertLogs(
    "DataProfiler.labelers.column_name_model", level="INFO"
) as logs:
    model_output = model.predict(data=["ssn", "role_name", "wallet_address"])

Contributor

ahh, sorry, I missed it since show_confidences wasn't declared explicitly.

@taylorfturner changed the title from "[WIP] New Data Labeler: ColumnNameModel Build" to "New Data Labeler: ColumnNameModel Build" on Sep 13, 2022
@JGSweets enabled auto-merge (squash) on September 14, 2022 at 14:10
    not isinstance(value, list) or not isinstance(value[0], dict)
):
    errors.append(
        """`{}` must be a list of dictionaries each with the following
Contributor

This doesn't check for the attribute and label keys in the list of dicts. Does it need to?

Contributor Author

fixing -- good catch
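One way the key check raised above could look is sketched here. This is not the fix that landed in the PR: `validate_entries` and `REQUIRED_KEYS` are hypothetical names, and the error wording is invented for illustration. The sketch checks every entry rather than only the first, so it also behaves sensibly on an empty list, where indexing `value[0]` would fail.

```python
REQUIRED_KEYS = {"attribute", "label"}  # keys taken from the mock parameters


def validate_entries(param_name, value):
    """Return a list of error strings for a parameter value.

    Hypothetical helper: the value must be a list of dicts, and each dict
    must contain every key in REQUIRED_KEYS.
    """
    errors = []
    if not isinstance(value, list) or not all(
        isinstance(entry, dict) for entry in value
    ):
        errors.append(f"`{param_name}` must be a list of dictionaries")
        return errors

    for index, entry in enumerate(value):
        # Report the missing keys per entry so the message is actionable.
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            errors.append(
                f"`{param_name}[{index}]` is missing key(s): {', '.join(missing)}"
            )
    return errors


ok_errors = validate_entries(
    "true_positive_dict", [{"attribute": "ssn", "label": "ssn"}]
)
bad_errors = validate_entries("true_positive_dict", [{"attribute": "ssn"}])
```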

false_positive_dict = self._parameters["false_positive_dict"]
if false_positive_dict:
    data = self._compare_negative(
        data, false_positive_dict, negative_threshold=50
Contributor

is this negative_threshold value arbitrary? Should it be a constant at the top of the file or in a config somewhere?

Contributor Author

good catch -- nah this should be a param and not hard coded

output = self._compare_positive(
    data,
    self._parameters["true_positive_dict"],
    positive_threshold=85,
Contributor

same question as above regarding positive_threshold
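A rough sketch of what moving both thresholds into the parameters might look like follows. Everything here is an assumption made for illustration: the key names `positive_threshold` and `negative_threshold`, the `label_columns` function, and the filtering logic are hypothetical, and `difflib.SequenceMatcher` is swapped in for the rapidfuzz scorer so the example runs with the standard library only. Only the default values 85 and 50 come from the diff above.

```python
from difflib import SequenceMatcher

# Hypothetical parameter names; defaults mirror the hard-coded values in the diff.
DEFAULT_PARAMETERS = {"positive_threshold": 85, "negative_threshold": 50}


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for a fuzzy scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label_columns(column_names, true_positives, false_positives, parameters=None):
    """Label column names using thresholds read from parameters."""
    params = dict(DEFAULT_PARAMETERS)
    params.update(parameters or {})

    # Negative pass: drop names that resemble a known false positive.
    kept = [
        name
        for name in column_names
        if all(
            similarity(name, fp["attribute"]) < params["negative_threshold"]
            for fp in false_positives
        )
    ]

    # Positive pass: keep the best match if it clears the positive threshold.
    output = {}
    for name in kept:
        best = max(
            true_positives,
            key=lambda tp: similarity(name, tp["attribute"]),
            default=None,
        )
        if best is None:
            continue
        score = similarity(name, best["attribute"])
        if score >= params["positive_threshold"]:
            output[name] = {"pred": best["label"], "conf": score}
    return output


true_positives = [{"attribute": "ssn", "label": "ssn"}]
false_positives = [{"attribute": "contract_number", "label": "ssn"}]
result = label_columns(["ssn", "contract_number"], true_positives, false_positives)
```

With the thresholds read from a parameter dict, a caller can tighten or loosen matching per model instance, and tests can override either value without patching the method body.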

@@ -0,0 +1,258 @@
"""Contains class for column name data labeling model."""
Contributor Author

whole new model -- based on regex_model.py to some extent

    """Reset weights function."""
    pass

@require_module(["rapidfuzz"])
Contributor

do we have a test for this?

Contributor Author

ah good call!

Contributor

Also, I think this is going to raise a graph profiling error as opposed to a labeler error
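As a rough idea of what a test for the missing-dependency path could exercise, the sketch below defines a stand-in decorator and checks both outcomes. It is not DataProfiler's `require_module`: `require_modules_sketch` is a hypothetical name, its behavior (raising `ImportError` with a labeler-specific message) is an assumption, and the module names are placeholders chosen so the example runs anywhere.

```python
import functools
import importlib.util


def require_modules_sketch(names):
    """Hypothetical stand-in: raise ImportError if any named module is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None for a top-level module that is not installed.
            missing = [n for n in names if importlib.util.find_spec(n) is None]
            if missing:
                raise ImportError(
                    "Labeler requires missing module(s): " + ", ".join(missing)
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_modules_sketch(["json"])  # always available in the standard library
def predict_with_dependency():
    return "ok"


@require_modules_sketch(["module_that_does_not_exist_xyz"])  # placeholder name
def predict_without_dependency():
    return "unreachable"
```

A test along these lines would assert that the decorated call succeeds when the dependency is present and that the error message names the labeler dependency when it is absent, which is the mismatch the last comment is concerned about.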

@JGSweets merged commit 63fd1fb into capitalone:main on Sep 14, 2022
@taylorfturner deleted the new_model/initial_build branch on October 4, 2022 at 14:42
Labels
New Feature A feature addition not currently in the library
4 participants