
[ADD] Minority Coalescer #242

Closed
wants to merge 10 commits

Conversation

franchuterivera (Contributor)

addresses #239

@ravinkohli (Contributor) left a comment


Hey, thanks for the PR. I have suggested minor changes, and I think after they are resolved, this PR can be merged.

@codecov bot commented Jun 10, 2021

Codecov Report

Merging #242 (afd4f00) into development (999f3c3) will decrease coverage by 52.34%.
The diff coverage is 31.63%.


@@               Coverage Diff                @@
##           development     #242       +/-   ##
================================================
- Coverage        81.62%   29.28%   -52.35%     
================================================
  Files              150      154        +4     
  Lines             8625     8794      +169     
  Branches          1325     1351       +26     
================================================
- Hits              7040     2575     -4465     
- Misses            1108     6219     +5111     
+ Partials           477        0      -477     
| Impacted Files | Coverage Δ |
|---|---|
| autoPyTorch/api/base_task.py | 7.24% <0.00%> (-77.40%) ⬇️ |
| autoPyTorch/utils/implementations.py | 21.97% <17.64%> (-53.64%) ⬇️ |
| ...essing/tabular_preprocessing/coalescer/__init__.py | 24.61% <24.61%> (ø) |
| .../tabular_preprocessing/coalescer/base_coalescer.py | 50.00% <50.00%> (ø) |
| ...ing/tabular_preprocessing/coalescer/NoCoalescer.py | 56.25% <56.25%> (ø) |
| ...bular_preprocessing/coalescer/MinorityCoalescer.py | 58.33% <58.33%> (ø) |
| autoPyTorch/pipeline/tabular_classification.py | 31.74% <100.00%> (-44.26%) ⬇️ |
| autoPyTorch/pipeline/tabular_regression.py | 37.37% <100.00%> (-45.28%) ⬇️ |
| autoPyTorch/optimizer/utils.py | 0.00% <0.00%> (-100.00%) ⬇️ |
| autoPyTorch/api/tabular_regression.py | 0.00% <0.00%> (-96.88%) ⬇️ |

... and 140 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 999f3c3...afd4f00.

@ravinkohli (Contributor) left a comment


Hey, thanks for the PR. This is a great feature to have. I have left just a single comment; other than that, I think we can merge this PR.

@nabenabe0928 (Contributor) left a comment


Thanks for the PR.
This helps to prevent unexpectedly low performance.
However, some parts of the code look fragile, so let's fix those issues first.

autoPyTorch/api/base_task.py (resolved)
autoPyTorch/utils/implementations.py (outdated, resolved)
autoPyTorch/utils/implementations.py (outdated, resolved)

if self.minimum_fraction is None:
    return X

@nabenabe0928 (Contributor) commented Jun 20, 2021


The code is a bit complicated, so I rewrote it.

from typing import Union

import numpy as np
from scipy import sparse


def suggestion(X: Union[sparse.csr_matrix, np.ndarray]) -> Union[sparse.csr_matrix, np.ndarray]:
    n_features = X.shape[1]
    is_sparse = sparse.issparse(X)

    for col in range(n_features):
        if is_sparse:
            indptr_start = X.indptr[col]
            indptr_end = X.indptr[col + 1]
            col_data = X.data[indptr_start:indptr_end]
        else:
            col_data = X[:, col]

        # Coalesce (set to -2) every value not in the keep-set.
        # Note: np.isin does not unpack a Python set, hence list(...).
        mask = ~np.isin(col_data, list(do_not_coalesce_))
        col_data[mask] = -2

    return X
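
For reference, original in the checks below is the implementation from the PR (quoted in full later in this thread); a minimal standalone sketch under the same assumption of a module-level do_not_coalesce_ set in place of the per-column self.do_not_coalesce_[column]:

    def original(X: Union[sparse.csr_matrix, np.ndarray]) -> Union[sparse.csr_matrix, np.ndarray]:
        # Per-column, per-value loop, mirroring the PR's transform method
        for column in range(X.shape[1]):
            if sparse.issparse(X):
                indptr_start = X.indptr[column]
                indptr_end = X.indptr[column + 1]
                for unique_value in np.unique(X.data[indptr_start:indptr_end]):
                    if unique_value not in do_not_coalesce_:
                        X.data[indptr_start:indptr_end][
                            X.data[indptr_start:indptr_end] == unique_value] = -2
            else:
                to_coalesce = [v for v in np.unique(X[:, column])
                               if v not in do_not_coalesce_]
                X[np.isin(X[:, column], to_coalesce), column] = -2
        return X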

Just in case, I also checked that the two behave identically:

size = (400000, 100)
n_vals = 10
do_not_coalesce_ = {3, 5}

X = np.random.randint(n_vals, size=size)
X_copy = np.copy(X)

n_wrongs = (original(X) != suggestion(X_copy)).sum()
n_wrongs
>>> 0  # It works identically

Plus, I also benchmarked the speed:

size = (400000, 100)
n_vals = 10
do_not_coalesce_ = {3, 5}

X = np.random.randint(n_vals, size=size)
%timeit original(X)
>>> 1.24 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

X = np.random.randint(n_vals, size=size)
%timeit suggestion(X)
>>> 549 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

hist = np.histogram(Y[:, col], bins=np.arange(-2, 7))
np.testing.assert_array_almost_equal(hist[0], [10, 0, 0, 0, 0, 30, 30, 30])
# Assert no copies were made
assert id(X) == id(Y)
Contributor:


If you would like this method to share the same address before and after the transformation, we should not return the array as a result: any later change to X would also change Y unexpectedly.
It is not safe.

Contributor (author):


As the name of the method (transform) indicates, it is expected that the input argument will be transformed.

Normally the input can be an array with thousands of elements; creating a copy on every transform is not feasible, so we do expect to transform it in place.

Contributor:


Yes, a copy takes memory, so I would like it to be a destructive (in-place) method.
Since we potentially change the original (the array before the transformation), those changes are reflected in the transformed version as well.
In other words, we do not have to return the transformed version.
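
A minimal illustration of the aliasing being discussed; transform here is a hypothetical stand-in for any in-place method that returns its input:

    import numpy as np

    def transform(X: np.ndarray) -> np.ndarray:
        X[X == 9] = -2  # mutate in place (stand-in for coalescing)
        return X        # returning X is purely for call-chaining convenience

    X = np.array([0, 9, 3])
    Y = transform(X)
    assert Y is X      # no copy: X and Y alias the same buffer
    X[2] = -2          # any later change to X ...
    assert Y[2] == -2  # ... is visible through Y as well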

autoPyTorch/utils/implementations.py (outdated, resolved)
"""
def __init__(self, minimum_fraction: float, random_state: Optional[Union[np.random.RandomState, int]] = None):
super().__init__()
self.minimum_fraction = minimum_fraction
Contributor:


minimum_fraction -> min_fraction (convention)

Contributor (author):


I don't think we have a convention for this, and in this case the complete word is clearer. If possible, I would like to keep it.

Contributor:


PyTorch uses min rather than minimum; when it uses minimum, it is only for the element-wise minimum.
But if you would like to stick with the full word, that is also fine.


class NoCoalescer(BaseCoalescer):
    """
    Don't perform NoCoalescer on categorical features
Contributor:


Why?

Contributor (author):


This is a NoCoalescer class; it allows the BO model to enable/disable coalescing.

The choice object selects between MinorityCoalescer and NoCoalescer depending on which gives better performance.
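
A minimal sketch of what such a pass-through component amounts to (hypothetical code, assuming only the BaseCoalescer interface shown in this PR):

    class NoCoalescer(BaseCoalescer):
        """Leave categorical features unchanged: the 'coalescing disabled' branch."""

        def fit(self, X, y=None):
            # nothing to learn
            return self

        def transform(self, X):
            # identity transform: no categories are coalesced
            return X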

Contributor:


I mean, I did not get whether you mean "Do not perform NoCoalescer" or "Do not perform coalescing".

autoPyTorch/utils/implementations.py (outdated, resolved)
    if fraction >= self.minimum_fraction:
        do_not_coalesce[-1].add(unique_value)

self.do_not_coalesce_ = do_not_coalesce
Contributor:


Can we stick to self._<private var name> rather than self.<private var name>_?

Contributor (author):


Yes, I have made the change. This is also something we have not discussed before, as it has no special meaning in Python; it is purely a matter of convention. Maybe you can sync up with Ravin and have it added to the wiki.

Contributor:


OK, I will do it, but mostly I just follow PEP 8 rules.
By the way, is the <var name>_ naming from another language?
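
For what it's worth, the trailing underscore is the scikit-learn convention for attributes that are set during fit, as opposed to constructor parameters; a quick illustration:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit([[0.0], [2.0]])
    print(scaler.mean_)  # [1.] -- the trailing underscore marks a fitted attribute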

Comment on lines 118 to 136
for column in range(X.shape[1]):
    if sparse.issparse(X):
        indptr_start = X.indptr[column]
        indptr_end = X.indptr[column + 1]
        unique = np.unique(X.data[indptr_start:indptr_end])
        for unique_value in unique:
            if unique_value not in self.do_not_coalesce_[column]:
                indptr_start = X.indptr[column]
                indptr_end = X.indptr[column + 1]
                X.data[indptr_start:indptr_end][
                    X.data[indptr_start:indptr_end] == unique_value] = -2
    else:
        unique = np.unique(X[:, column])
        unique_values = [unique_value for unique_value in unique
                         if unique_value not in self.do_not_coalesce_[column]]
        mask = np.isin(X[:, column], unique_values)
        # The imputer uses -1 for unknown categories
        # Then -2 means coalesced categories
        X[mask, column] = -2
Contributor:


Suggested change:
n_features = X.shape[1]
is_sparse = sparse.issparse(X)

for col in range(n_features):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]

    # The imputer uses -1 for unknown categories
    # Then -2 means coalesced categories
    mask = ~np.isin(col_data, list(self.do_not_coalesce_[col]))
    col_data[mask] = -2

Contributor:


Here is #suggestion01.
Let me know if you have any questions.
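
One pitfall worth flagging in these snippets: np.isin does not unpack a Python set (the set is treated as a single object and nothing matches), so keep-sets have to be converted to a list or array first, as in the list(...) calls above:

    import numpy as np

    a = np.array([1, 2, 3])
    print(np.isin(a, {1, 2}))        # [False False False] -- set treated as one object
    print(np.isin(a, list({1, 2})))  # [ True  True False]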

@nabenabe0928 (Contributor) left a comment


I think the fit and transform methods in the MinorityCoalescer class can be merged once refactored.
Please take a look at the refactored versions of both methods.

Just in case, I also put the refactored version of the merged method below.

        n_instances, n_features = X.shape
        is_sparse = sparse.issparse(X)

        for col in range(n_features):
            if is_sparse:
                indptr_start = X.indptr[col]
                indptr_end = X.indptr[col + 1]
                col_data = X.data[indptr_start:indptr_end]
            else:
                col_data = X[:, col]

            unique_vals, counts = np.unique(col_data, return_counts=True)
            frac = counts / n_instances
            # Keep as an array: np.isin does not unpack Python sets.
            # Strict < matches fit's "keep when frac >= minimum_fraction".
            coalesce = unique_vals[frac < min_frac]

            # The imputer uses -1 for unknown categories
            # Then -2 means coalesced categories
            mask = np.isin(col_data, coalesce)
            col_data[mask] = -2
Comment on lines +91 to +107
for column in range(X.shape[1]):
    do_not_coalesce.append(set())

    if sparse.issparse(X):
        indptr_start = X.indptr[column]
        indptr_end = X.indptr[column + 1]
        unique, counts = np.unique(
            X.data[indptr_start:indptr_end], return_counts=True)
        colsize = indptr_end - indptr_start
    else:
        unique, counts = np.unique(X[:, column], return_counts=True)
        colsize = X.shape[0]

    for unique_value, count in zip(unique, counts):
        fraction = float(count) / colsize
        if fraction >= self.minimum_fraction:
            do_not_coalesce[-1].add(unique_value)
@nabenabe0928 (Contributor) commented Jun 21, 2021


I found some issues and refactored the code.
Issues:

  1. colsize changes depending on whether the feature matrix is sparse or not (indptr_end - indptr_start is not correct; see the sketch below)
  2. the fraction also changes because of 1 (in particular, almost no cut-off happens for sparse matrices)
  3. an unneeded for loop (rewritten with numpy to eliminate it, which I did)

This function can also be written nicely.
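
To make issue 1 concrete: in a CSC matrix (where indptr does index columns), only explicitly stored entries lie between indptr[col] and indptr[col + 1], so indptr_end - indptr_start counts nonzeros rather than rows, and the fractions come out inflated. A small sketch:

    import numpy as np
    from scipy import sparse

    X = sparse.csc_matrix(np.array([[0], [0], [0], [5]]))  # 4 rows, 1 column
    n_stored = X.indptr[1] - X.indptr[0]  # 1: only the explicit 5 is stored
    # fraction of the value 5 over stored entries: 1/1 = 1.00 (never coalesced)
    # fraction of the value 5 over all rows:       1/4 = 0.25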

Suggested change:
n_instances, n_features = X.shape
do_not_coalesce = [set() for _ in range(n_features)]
is_sparse = sparse.issparse(X)

for col, no_coal in enumerate(do_not_coalesce):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]

    unique_vals, counts = np.unique(col_data, return_counts=True)
    # Fractions over all rows, not just the stored (nonzero) entries
    frac = counts / n_instances
    no_coal.update(unique_vals[frac >= min_frac])

Contributor (author):


Yes, but we do not want to merge fit and transform, as we are following the scikit-learn API. You can read more about it here.
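
For context, this is the split the scikit-learn estimator API prescribes: fit learns state from the training data and transform only applies it, so merging the two would recompute (and leak) statistics from whatever data is being transformed. A schematic sketch with a hypothetical class name:

    from sklearn.base import BaseEstimator, TransformerMixin

    class SketchCoalescer(BaseEstimator, TransformerMixin):
        def __init__(self, minimum_fraction: float = 0.01):
            self.minimum_fraction = minimum_fraction  # constructor parameters only

        def fit(self, X, y=None):
            # learn per-column keep-sets from the *training* data
            self._do_not_coalesce = ...
            return self

        def transform(self, X):
            # apply the learned keep-sets to train *or* test data
            return X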

@nabenabe0928 (Contributor) commented Jun 24, 2021


OK, then I will respond to the corresponding lines separately; I will tag them #suggestion01 and #suggestion02.
Actually, this suggestion is #suggestion01 (the change for getting do_not_coalesce).

@nabenabe0928 (Contributor)

I added some comments, mostly small ones as responses, and two large suggestions.
The two suggestions return exactly the same results as the current code, but use numpy more effectively.

@ravinkohli ravinkohli marked this pull request as draft November 11, 2021 20:26
@ravinkohli (Contributor)

We have merged this with #376.

@ravinkohli ravinkohli closed this Feb 10, 2022