New Data Labeler: ColumnNameModel Build #626

taylorfturner · 2022-09-09T17:45:47Z

No description provided.

JGSweets · 2022-09-09T19:10:16Z

dataprofiler/labelers/column_name_model.py

@@ -31,8 +32,8 @@ def __init__(self, label_mapping=None, parameters=None):
        # parameter initialization
        if not parameters:
            parameters = {}
-        parameters.setdefault('negative_dataframe', )
-        parameters.setdefault('positive_dataframe', )
+        parameters.setdefault('false_positive_df', None)


nice naming

JGSweets · 2022-09-09T20:11:33Z

dataprofiler/labelers/column_name_model.py

+        scores = self._model(
+                list_of_column_names,
+                check_values_dict,
+                self._make_lower_case(),


is this something we are going to parameterize? why not just do str.lower?

Good question, I wondered the same myself. The parameter passed to the process.cdist function must be callable, so that is the main requirement. The result of str.lower() isn't callable, since the call returns a str object.

Documentation to the process.cdist function here
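
To make the callable requirement concrete, here is a minimal sketch. It does not use rapidfuzz: `score_matrix` is a hypothetical stand-in built on the standard library's `difflib`, and it only assumes that the real scorer applies a callable to each string before comparing. The point is the difference between passing the function object `str.lower` and passing the result of a call such as `"SSN".lower()`.

```python
from difflib import SequenceMatcher


def score_matrix(queries, choices, processor=None):
    """Hypothetical stand-in for a pairwise scorer that takes a processor."""
    if processor is not None and not callable(processor):
        raise TypeError("processor must be callable")
    # Fall back to the identity function when no processor is given.
    prep = processor if processor is not None else (lambda s: s)
    return [
        [100.0 * SequenceMatcher(None, prep(q), prep(c)).ratio() for c in choices]
        for q in queries
    ]


# Pass the function object itself; the scorer calls it on every string.
scores = score_matrix(["SSN"], ["ssn", "name"], processor=str.lower)

# Passing the result of a call hands over a str, which is not callable.
try:
    score_matrix(["SSN"], ["ssn"], processor="SSN".lower())
except TypeError as err:
    rejected = str(err)
```

Under this reading, a thin wrapper such as `_make_lower_case` is only needed if the scorer passes extra keyword arguments that `str.lower` would not accept.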

JGSweets · 2022-09-09T20:11:54Z

dataprofiler/labelers/column_name_model.py

+        pass
+
+    def _make_lower_case(str, **kwargs):
+        return str.lower()


should it be str.lower() or str.lower?

Out[1]: <function str.lower()>

It should return a str object, so I think we actually do want to call lower() here.

taylorfturner · 2022-09-13T20:11:15Z

dataprofiler/labelers/base_model.py

@@ -134,6 +134,7 @@ def get_class(cls, class_name):
        # Import possible internal models
        from .character_level_cnn_model import CharacterLevelCnnModel  # NOQA
        from .regex_model import RegexModel  # NOQA
+        from .column_name_model import ColumnNameModel  # NOQA


adding the model to be loadable as a class
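
Why an import makes a model "loadable as a class" can be illustrated with a small sketch. It assumes a name-keyed registry filled in when each subclass is defined; `BaseModelSketch` and `ColumnNameModelSketch` are hypothetical names, and this is not the library's actual `get_class` implementation.

```python
class BaseModelSketch:
    """Hypothetical base class that records its subclasses by name."""

    _registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Registration happens when the subclass body is executed, which
        # is why the defining module has to be imported before a lookup.
        BaseModelSketch._registry[cls.__name__.lower()] = cls

    @classmethod
    def get_class(cls, class_name):
        """Return the registered class for class_name, or None."""
        return cls._registry.get(class_name.lower())


class ColumnNameModelSketch(BaseModelSketch):
    """Defining (importing) the class is what registers it."""


found = BaseModelSketch.get_class("ColumnNameModelSketch")
```

In this sketch a class that is never imported is never registered, which is the gap the added import line in the diff closes.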

taylorfturner · 2022-09-13T20:12:10Z

dataprofiler/labelers/column_name_model.py

@@ -0,0 +1,225 @@
+"""Contains class for regex data labeling model."""


entirely new model type, based on the regex model

JGSweets · 2022-09-13T20:12:15Z

dataprofiler/tests/labelers/test_column_name_model.py

+mock_model_parameters = {
+            'true_positive_dict': [
+                {
+					'attribute': 'ssn',
+					'label': 'ssn'                
+                },
+                {
+                	'attribute': 'suffix',
+                	'label': 'name'
+                },
+                {
+                	'attribute': 'my_home_address',
+                	'label': 'address'
+                },
+            ],
+            'false_positive_dict': [
+                {
+                    'attribute': 'contract_number',
+                    'label': 'ssn',
+                },
+                { 
+                    'attribute': 'role',
+                    'label': 'name',
+                },
+                {
+                    'attribute': 'send_address',
+                    'label': 'address',
+                },
+            ]
+        }


format fixing

fixing in next commit

taylorfturner · 2022-09-13T20:13:30Z

dataprofiler/tests/labelers/test_column_name_model.py

@@ -0,0 +1,292 @@
+import json


test suite for each method in the new ColumnNameModel class

JGSweets · 2022-09-13T20:14:02Z

dataprofiler/tests/labelers/test_column_name_model.py

+                    "conf": 100.0
+                }
+            }
+        model_output = model.predict(data=["ssn", "role_name", "wallet_address"], show_confidences=True)


do we have a test for false as well?

yes, the below is a FALSE scenario

    with self.assertLogs(
        "DataProfiler.labelers.column_name_model", level="INFO"
    ) as logs:
        model_output = model.predict(data=["ssn", "role_name", "wallet_address"])

Ahh, I missed it, sorry. It wasn't declared explicitly, so I overlooked it.

micdavis · 2022-09-14T14:03:30Z

dataprofiler/labelers/column_name_model.py

+                not isinstance(value, list) or not isinstance(value[0], dict)
+            ):
+                errors.append(
+                    """`{}` must be a list of dictionaries each with the following


This doesn't check for the attribute and label keys in each dict of the list. Does it need to?

fixing -- good catch
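
One possible shape for the stricter check is sketched below. `validate_entries` is a hypothetical helper, not the code that landed in the PR; the required keys `attribute` and `label` are taken from the test parameters quoted earlier in this review. It also guards the empty-list case, since the visible fragment indexes `value[0]`, which raises IndexError on an empty list.

```python
REQUIRED_KEYS = frozenset({"attribute", "label"})


def validate_entries(name, value):
    """Return a list of error strings for one parameter (hypothetical helper)."""
    if not isinstance(value, list) or not value:
        return [f"`{name}` must be a non-empty list of dictionaries"]

    errors = []
    for index, entry in enumerate(value):
        if not isinstance(entry, dict):
            errors.append(f"`{name}`[{index}] must be a dictionary")
            continue
        # Check every entry, not only the first one.
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            errors.append(f"`{name}`[{index}] is missing keys: {', '.join(missing)}")
    return errors


ok = validate_entries("true_positive_dict", [{"attribute": "ssn", "label": "ssn"}])
bad = validate_entries("true_positive_dict", [{"attribute": "ssn"}, "suffix"])
```

Collecting messages instead of raising on the first problem matches the `errors.append(...)` pattern visible in the diff.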

micdavis · 2022-09-14T14:11:24Z

dataprofiler/labelers/column_name_model.py

+        false_positive_dict = self._parameters["false_positive_dict"]
+        if false_positive_dict:
+            data = self._compare_negative(
+                data, false_positive_dict, negative_threshold=50


is this negative_threshold value arbitrary? Should it be a constant at the top of the file or in a config somewhere?

good catch; no, this should be a param rather than hard-coded

micdavis · 2022-09-14T14:12:02Z

dataprofiler/labelers/column_name_model.py

+        output = self._compare_positive(
+            data,
+            self._parameters["true_positive_dict"],
+            positive_threshold=85,


same question as above regarding positive_threshold
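
A minimal sketch of moving both thresholds into the parameters follows. The defaults 50 and 85 come from the diff; the key names, the `resolve_thresholds` helper, and the assumed 0 to 100 score range are illustrative assumptions, not the names used in the merged code.

```python
# Defaults mirror the values that were hard-coded in the diff.
DEFAULT_THRESHOLDS = {"negative_threshold": 50, "positive_threshold": 85}


def resolve_thresholds(parameters=None):
    """Merge user-supplied thresholds over the defaults (hypothetical helper)."""
    resolved = dict(DEFAULT_THRESHOLDS)
    for key in DEFAULT_THRESHOLDS:
        if parameters and key in parameters:
            value = parameters[key]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # Assumption: similarity scores lie on a 0-100 scale.
            if not is_number or not 0 <= value <= 100:
                raise ValueError(f"`{key}` must be a number between 0 and 100")
            resolved[key] = value
    return resolved


thresholds = resolve_thresholds({"positive_threshold": 90})
```

With this layout the call sites would read the values from the resolved parameters instead of passing the literals 50 and 85.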

taylorfturner · 2022-09-14T14:36:49Z

dataprofiler/labelers/column_name_model.py

@@ -0,0 +1,258 @@
+"""Contains class for column name data labeling model."""


whole new model -- based on regex_model.py to some extent

JGSweets · 2022-09-14T14:53:18Z

dataprofiler/labelers/column_name_model.py

+        """Reset weights function."""
+        pass
+
+    @require_module(["rapidfuzz"])


do we have a test for this?

ah good call!

Also, I think this is going to spout a graph profiling error as opposed to a labeler error
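
A test for the dependency guard could look like the sketch below. `require_module_sketch` is a hypothetical stand-in written for this example, not the library's `require_module`. It assumes the guard raises ImportError and names the guarded function in the message, which is one way to make the error specific to the labeler.

```python
import functools
import importlib.util
import unittest


def require_module_sketch(names):
    """Hypothetical guard: fail with a clear message if a module is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            missing = [n for n in names if importlib.util.find_spec(n) is None]
            if missing:
                raise ImportError(
                    f"`{func.__name__}` requires: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_module_sketch(["a_module_that_does_not_exist_xyz"])
def predict_sketch():
    return "unreachable"


class TestImportGuard(unittest.TestCase):
    def test_missing_dependency_raises(self):
        # The message should point at the guarded function.
        with self.assertRaisesRegex(ImportError, "predict_sketch"):
            predict_sketch()
```

If the real guard warns instead of raising, the same test shape would apply with `assertWarns` in place of `assertRaisesRegex`.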

taylorfturner added the "Work In Progress" (Solution is being developed) and "New Feature" (A feature addition not currently in the library) labels Sep 9, 2022

taylorfturner self-assigned this Sep 9, 2022

taylorfturner requested review from JGSweets, ksneab7, micdavis and tyfarnan as code owners September 9, 2022 17:45

JGSweets changed the title ~~New Data Labeler: ColumnNameModel Build~~ [WIP] New Data Labeler: ColumnNameModel Build Sep 9, 2022

JGSweets reviewed Sep 9, 2022

taylorfturner added 7 commits September 13, 2022 16:07

new model for DP labeler library

8e07b27

add data labeler params

ff39b3f

add _validate_parameters

30ca099

update new model code

2c6ae44

add rapidfuzz ML dependency

5e707ed

get_class update

b47cdda

clean up

1cef721

taylorfturner removed the "Work In Progress" (Solution is being developed) label Sep 13, 2022

taylorfturner commented Sep 13, 2022

JGSweets reviewed Sep 13, 2022

taking out regex refs

be0fd7a

taylorfturner commented Sep 13, 2022

JGSweets reviewed Sep 13, 2022

taylorfturner added 2 commits September 13, 2022 16:16

clean up

2455675

remove json file

b6a3f28

JGSweets reviewed Sep 13, 2022

taylorfturner added 4 commits September 13, 2022 16:28

pre-commit run fix

69c565b

clean up white space in testing

38a0456

raise import warning for install req

df50a7b

clean up

55ed522

taylorfturner changed the title ~~[WIP] New Data Labeler: ColumnNameModel Build~~ New Data Labeler: ColumnNameModel Build Sep 13, 2022

JGSweets reviewed Sep 13, 2022

JGSweets reviewed Sep 13, 2022

taylorfturner added 2 commits September 13, 2022 17:02

import clean up

4f0225b

import message comment

96305de

JGSweets enabled auto-merge (squash) September 14, 2022 14:10

micdavis reviewed Sep 14, 2022

fix hardcoded vars

8cfefc5

taylorfturner commented Sep 14, 2022

micdavis approved these changes Sep 14, 2022

JGSweets reviewed Sep 14, 2022

tyfarnan approved these changes Sep 14, 2022

JGSweets merged commit 63fd1fb into capitalone:main Sep 14, 2022

taylorfturner deleted the new_model/initial_build branch October 4, 2022 14:42
