kmedoid-discretizer

Adaptive Kmedoid discretizer for numerical feature engineering.

Description

kmedoid-discretizer (Adaptive Kmedoid discretizer) discretizes numerical features into n_bins bins using the K-medoids clustering algorithm. It is compatible with scikit-learn and is an alternative to sklearn's KBinsDiscretizer. This implementation offers:

  • A custom number of bins for each numerical feature; K-medoids is run separately on each column.
  • Dynamic adaptation of the number of bins whenever it is too high (more precisely, when two centroids are assigned to the same data point).
  • Multiple backends to speed up the K-medoids computation: serial, multiprocessing, and ray.
  • Support mainly for pandas DataFrames and NumPy arrays.
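The per-column and adaptive behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `kmedoids_1d` and `fit_adaptive` are hypothetical helpers, the initialisation (evenly spaced over the sorted data) and the shrink-by-one rule are choices made for this sketch, and "two centroids assigned to the same data point" is read here as two medoids landing on the same value.

```python
def kmedoids_1d(values, n_bins, max_iter=100):
    """Tiny 1-D k-medoids (alternating assign/update); returns sorted medoids."""
    data = sorted(values)
    if n_bins == 1:
        medoids = [data[len(data) // 2]]
    else:
        # Assumption: start from medoids spread evenly over the sorted data.
        medoids = [
            data[round(i * (len(data) - 1) / (n_bins - 1))] for i in range(n_bins)
        ]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest medoid (ties -> lower index).
        clusters = [[] for _ in medoids]
        for v in data:
            nearest = min(range(len(medoids)), key=lambda j: abs(v - medoids[j]))
            clusters[nearest].append(v)
        # Update step: the new medoid is the member with the smallest total distance.
        new_medoids = [
            min(c, key=lambda m: sum(abs(m - v) for v in c)) if c else medoids[j]
            for j, c in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return sorted(medoids)


def fit_adaptive(values, n_bins):
    """Lower n_bins one at a time until all medoids sit on distinct values."""
    while n_bins > 1:
        medoids = kmedoids_1d(values, n_bins)
        if len(set(medoids)) == len(medoids):
            return medoids
        n_bins -= 1
    return kmedoids_1d(values, 1)


# One K-medoids run per column, each with its own requested number of bins.
columns = {"age": [1, 2, 2, 3], "flag": [0, 0, 1, 1, 1]}
requested_bins = {"age": 2, "flag": 4}
fitted = {name: fit_adaptive(col, requested_bins[name]) for name, col in columns.items()}
print(fitted)  # {'age': [2, 3], 'flag': [0, 1]}
```

The `flag` column has only two distinct values, so the four requested bins collapse to two in this sketch.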

Install

pip install git+ssh://git@github.com/Vic-ai/kmedoid-discretizer.git

To play with the code and run it locally without pip, clone the repository:

git clone git@github.com:Vic-ai/kmedoid-discretizer.git

Usage

Basic Usage

Here is the data for the basic use case:

import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Fake training set
X = pd.DataFrame.from_dict({"feature": [1, 2, 2, 3]})
# Fake testing set
X_test = pd.DataFrame.from_dict({"feature": [0, 2, 5]})

Ordinal encoding

discretizer = KmedoidDiscretizer(2)
# Discretize X into 2 bins => 1 and 2 go in bin 0, and 3 goes in bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)
# Discretize X_test into 2 bins => 0 and 2 go in bin 0, and 5 goes in bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
   feature
0        0
1        0
2        0
3        1
   feature
0        0
1        0
2        1
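The bins printed above are consistent with assigning each value to its nearest medoid. A minimal sketch of that idea follows, assuming (this is not confirmed by the source) that the fitted medoids for the training column are 2 and 3; `assign_bins` is a hypothetical helper, not part of the kmedoid-discretizer API.

```python
def assign_bins(values, medoids):
    """Return, for each value, the index of its closest medoid (ties -> lower bin)."""
    return [
        min(range(len(medoids)), key=lambda j: abs(v - medoids[j])) for v in values
    ]


medoids = [2, 3]  # assumed medoids for the training column [1, 2, 2, 3]

train_bins = assign_bins([1, 2, 2, 3], medoids)
test_bins = assign_bins([0, 2, 5], medoids)
print(train_bins)  # [0, 0, 0, 1]
print(test_bins)   # [0, 0, 1]
```

Under this assumption, test values outside the training range (0 and 5) simply fall into the first and last bin.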

Onehot encoding

discretizer = KmedoidDiscretizer(2, encode="onehot-dense")
# Discretize X into 2 bins => 1 and 2 go in bin 0, and 3 goes in bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)
# Discretize X_test into 2 bins => 0 and 2 go in bin 0, and 5 goes in bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  1.0  0.0
3      3  0.0  1.0
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  0.0  1.0
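The dense one-hot columns above correspond to the ordinal bin codes expanded into one indicator column per bin. A small sketch of that expansion in plain Python; `one_hot_dense` is a hypothetical helper written for illustration, and the library may build its output differently.

```python
def one_hot_dense(codes, n_bins):
    """Expand ordinal bin codes into dense rows with a single 1.0 per row."""
    return [[1.0 if code == b else 0.0 for b in range(n_bins)] for code in codes]


train_rows = one_hot_dense([0, 0, 0, 1], n_bins=2)
test_rows = one_hot_dense([0, 0, 1], n_bins=2)
print(train_rows)  # [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(test_rows)   # [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```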

Advanced Usage: Titanic (Sklearn Pipeline)

Libraries

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from kmedoid_discretizer.discretizer import KmedoidDiscretizer
from kmedoid_discretizer.utils.utils_external import PandasSimpleImputer

np.random.seed(0)

Titanic Dataset

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

cat_features = ["pclass", "sex"]
num_features = ["age", "fare", "sibsp", "parch"]  # The ones we will discretize

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

Training Pipeline

# Numerical Transformer Pipeline
numeric_transformer = Pipeline(
    steps=[
        ("imputer", PandasSimpleImputer(strategy="median")),
        ("discretizer", KmedoidDiscretizer(
                            n_bins=[8, 5, 7, 7],
                            encode="onehot-dense",
                            backend="serial",
                            verbose=True,
                            seed=0,
                        )),
    ]
)

# Categorical Transformer Pipeline
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder()),
    ]
)

# The Combination of Numerical and Categorical
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

# Overall Pipeline preprocessor + classifier
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

clf.fit(X_train, y_train)
print("Train score: %.3f" % clf.score(X_train, y_train))
print("Test score: %.3f" % clf.score(X_test, y_test))
Train score: 0.802
Test score: 0.809

Contributors

Marvin Martin

Daniel Nowak

License

MIT License Vic.ai 2023