Add profiler option for column level invalid values #704

Merged: 8 commits into capitalone:main on Nov 4, 2022


@tonywu315 tonywu315 commented Nov 1, 2022

This PR adds a profiler option for setting column-level null values in addition to the global ones. Here is an example of how to use it:

import dataprofiler as dp
import pandas as pd
import re

profiler_options = dp.ProfilerOptions()

NO_FLAG = 0
profiler_options.set(
    {
        "*.null_values": {
            "": NO_FLAG,
            "nan": re.IGNORECASE,
            "none": re.IGNORECASE,
            "null": re.IGNORECASE,
            "  *": NO_FLAG,
            "--*": NO_FLAG,
            "__*": NO_FLAG,
            "9" * 7: NO_FLAG,
        },
        "*.column_null_values": {
            0: {"1": NO_FLAG},
            1: {"3": NO_FLAG},
        },
        "*.null_replication_metrics.is_enabled": True,
        "data_labeler.is_enabled": False,
        "multiprocess.is_enabled": False,
    }
)

df = pd.DataFrame([[1, 1], [9999999, 2], [3, 3]])
profiler = dp.Profiler(df, options=profiler_options)
report = profiler.report()

In addition to the global null value 9999999, column 0 has the null value 1 and column 1 has the null value 3.
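
The outcome described above can be reproduced with a minimal standalone sketch. It does not call DataProfiler's internals: `effective_null_values` and `null_rows` are hypothetical helpers, and the sketch assumes that each key is a regular expression that must match the whole cell (as a string), with the mapped value passed as the regex flags.

```python
import re

NO_FLAG = 0

# Subset of the global patterns from the example above.
global_nulls = {"": NO_FLAG, "nan": re.IGNORECASE, "9" * 7: NO_FLAG}
column_nulls = {0: {"1": NO_FLAG}, 1: {"3": NO_FLAG}}


def effective_null_values(col_idx):
    """Hypothetical helper: global patterns plus this column's patterns."""
    merged = global_nulls.copy()  # copy so the global dict is never mutated
    merged.update(column_nulls.get(col_idx, {}))
    return merged


def null_rows(values, null_values):
    """Return row indices whose string form fully matches any null pattern."""
    return [
        i
        for i, value in enumerate(values)
        if any(
            re.fullmatch(pattern, str(value), flags=flags)
            for pattern, flags in null_values.items()
        )
    ]


rows = [[1, 1], [9999999, 2], [3, 3]]
col0 = [row[0] for row in rows]
col1 = [row[1] for row in rows]

col0_nulls = null_rows(col0, effective_null_values(0))
col1_nulls = null_rows(col1, effective_null_values(1))
```

Under these assumptions, rows 0 and 1 of column 0 count as null (the column-level `1` and the global `9999999`), while only row 2 of column 1 does (the column-level `3`).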

@tonywu315 tonywu315 changed the title Add option for column level invalid values [WIP] Add option for column level invalid values Nov 1, 2022
@tonywu315 tonywu315 changed the title [WIP] Add option for column level invalid values Add profiler option for column level invalid values Nov 1, 2022
@taylorfturner taylorfturner added the Bug (Something isn't working) and Medium Priority (Significant improvement or bug / feature reducing overall performance) labels Nov 2, 2022
@taylorfturner taylorfturner added the New Feature (A feature addition not currently in the library) label and removed the Bug (Something isn't working) label Nov 2, 2022
@JGSweets JGSweets enabled auto-merge (squash) November 3, 2022 00:19
@@ -2512,6 +2522,11 @@ def tqdm(level: Set[int]) -> Generator[int, None, None]:
min_true_samples = self._profile[prof_idx]._min_true_samples
try:
null_values = self._profile[prof_idx]._null_values

This is a huge add and mostly LGTM. One big issue here though: because a dict is mutable, this will change self._null_values with the update. If we instead copy it to a variable first, that would avoid the issue. Great job though!
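
The aliasing problem raised in this comment can be illustrated in isolation. This is a minimal sketch, not the PR's code: `ColumnProfile` is a hypothetical stand-in for a per-column profile that stores its null patterns in `_null_values`.

```python
NO_FLAG = 0


class ColumnProfile:
    """Hypothetical stand-in for a per-column profile holding null patterns."""

    def __init__(self, null_values):
        self._null_values = null_values


column_specific = {"1": NO_FLAG}

# Aliasing: both names refer to the same dict, so update() leaks into the profile.
profile = ColumnProfile({"nan": NO_FLAG})
aliased = profile._null_values
aliased.update(column_specific)
leaked = "1" in profile._null_values

# Copy first: the update stays local to the new dict.
profile = ColumnProfile({"nan": NO_FLAG})
local = profile._null_values.copy()
local.update(column_specific)
isolated = "1" not in profile._null_values
```

A shallow copy is enough here because the values are regex flags (integers), which are immutable.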

Just need to fix in the locations where we update

added .copy()

auto-merge was automatically disabled November 4, 2022 14:26

Head branch was pushed to by a user without write access

@@ -100,10 +103,13 @@ def __init__(
         }
         if options:
             if options.null_values is not None:
-                self._null_values = options.null_values
+                self._null_values = options.null_values.copy()

added copy
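
Why the copy in `__init__` matters can be sketched as follows. The `Options` and `Profiler` classes below are simplified, hypothetical stand-ins rather than DataProfiler's actual classes; they only mirror the pattern shown in the diff above.

```python
import re


class Options:
    """Hypothetical options holder, not DataProfiler's real options class."""

    def __init__(self, null_values=None):
        self.null_values = null_values


class Profiler:
    """Hypothetical profiler that snapshots the null patterns at construction."""

    DEFAULT_NULL_VALUES = {"": 0, "nan": re.IGNORECASE}

    def __init__(self, options=None):
        self._null_values = dict(self.DEFAULT_NULL_VALUES)
        if options:
            if options.null_values is not None:
                # Copy so later edits to the options do not reach this profiler.
                self._null_values = options.null_values.copy()


opts = Options({"null": re.IGNORECASE})
first = Profiler(opts)

# Mutating the shared options afterwards only affects profilers built later.
opts.null_values["--*"] = 0
second = Profiler(opts)
```

With the copy in place, `first` keeps only the `null` pattern while `second` also sees `--*`; without it, both profilers would share one dict with the options object.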

@@ -2594,7 +2615,11 @@ def tqdm(level: Set[int]) -> Generator[int, None, None]:
 prof_idx = col_idx_to_prof_idx[col_idx]
 if min_true_samples is None:
     min_true_samples = self._profile[prof_idx]._min_true_samples
-null_values = self._profile[prof_idx]._null_values
+
+null_values = self._profile[prof_idx]._null_values.copy()

here

@@ -2576,7 +2591,13 @@ def tqdm(level: Set[int]) -> Generator[int, None, None]:
 prof_idx = col_idx_to_prof_idx[col_idx]
 if min_true_samples is None:
     min_true_samples = self._profile[prof_idx]._min_true_samples
-null_values = self._profile[prof_idx]._null_values
+
+null_values = self._profile[prof_idx]._null_values.copy()

here

@@ -2536,7 +2546,12 @@ def tqdm(level: Set[int]) -> Generator[int, None, None]:
 if min_true_samples is None:
     min_true_samples = self._profile[prof_idx]._min_true_samples
 try:
-    null_values = self._profile[prof_idx]._null_values
+    null_values: Dict = self._profile[prof_idx]._null_values.copy()

here

@taylorfturner taylorfturner enabled auto-merge (squash) November 4, 2022 14:30
@taylorfturner taylorfturner merged commit 387d788 into capitalone:main Nov 4, 2022
@tonywu315 tonywu315 deleted the column_invalid_values branch November 4, 2022 15:19
Labels
Medium Priority (Significant improvement or bug / feature reducing overall performance), New Feature (A feature addition not currently in the library)
4 participants