Correlation speedup #123
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,9 +2,9 @@ | |
Collection of helper methods which can be used as to interface metrics. | ||
""" | ||
|
||
import multiprocessing as mp | ||
from typing import Any, Callable, List, Mapping, Optional, Tuple, Type, Union | ||
from typing import Any, Callable, List, Mapping, Tuple, Type, Union | ||
|
||
import numpy as np | ||
import pandas as pd | ||
|
||
from .. import utils | ||
|
@@ -118,8 +118,6 @@ def correlation_matrix( | |
num_num_metric: Callable[[pd.Series, pd.Series], float] = pearson, | ||
cat_num_metric: Callable[[pd.Series, pd.Series], float] = kruskal_wallis, | ||
cat_cat_metric: Callable[[pd.Series, pd.Series], float] = cramers_v, | ||
columns_x: Optional[List[str]] = None, | ||
columns_y: Optional[List[str]] = None, | ||
) -> pd.DataFrame: | ||
"""This function creates a correlation matrix out of a dataframe, using a correlation metric for each | ||
possible type of pair of series (i.e. numerical-numerical, categorical-numerical, categorical-categorical). | ||
|
@@ -135,60 +133,62 @@ def correlation_matrix( | |
cat_cat_metric (Callable[[pd.Series, pd.Series], float], optional): | ||
The correlation metric used for categorical-categorical series pairs. Defaults to corrected Cramer's V | ||
statistic. | ||
columns_x (Optional[List[str]]): | ||
The column names that determine the rows of the matrix. | ||
columns_y (Optional[List[str]]): | ||
The column names that determine the columns of the matrix. | ||
|
||
Returns: | ||
pd.DataFrame: | ||
The correlation matrix to be used in heatmap generation. | ||
""" | ||
|
||
if columns_x is None: | ||
columns_x = df.columns | ||
df = df.copy() | ||
|
||
if columns_y is None: | ||
columns_y = df.columns | ||
distr_types = [utils.infer_distr_type(df[col]) for col in df.columns] | ||
|
||
pool = mp.Pool(mp.cpu_count()) | ||
for col in df.columns: | ||
df[col] = utils.infer_dtype(df[col]) | ||
|
||
series_list = [ | ||
pd.Series( | ||
pool.starmap( | ||
_correlation_matrix_helper, | ||
[(df[col_x], df[col_y], num_num_metric, cat_num_metric, cat_cat_metric) for col_x in columns_x], | ||
), | ||
index=columns_x, | ||
name=col_y, | ||
) | ||
for col_y in columns_y | ||
] | ||
if df[col].dtype.kind == "O": | ||
df[col] = pd.Series(pd.factorize(df[col], na_sentinel=-1)[0]).replace(-1, np.nan) | ||
|
||
df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)})) | ||
|
||
Review comment: The helper cannot tell which column it has been handed, and therefore which precomputed distribution type applies, so we append each column's index in the data frame to that column as a final row. The helper then uses that last value to look up the precomputed distribution type and drops the row before computing the metric. There might be a better way of doing this. Reply: I think another way to do this would be to revert back to using |
||
pool.close() | ||
def corr(a: np.ndarray, b: np.ndarray): | ||
return _correlation_matrix_helper( | ||
a, | ||
b, | ||
distr_types=distr_types, | ||
num_num_metric=num_num_metric, | ||
cat_num_metric=cat_num_metric, | ||
cat_cat_metric=cat_cat_metric, | ||
) | ||
|
||
return pd.concat(series_list, axis=1, keys=[series.name for series in series_list]) | ||
return df.corr(method=corr) | ||
|
||
|
||
def _correlation_matrix_helper( | ||
sr_a: pd.Series, | ||
sr_b: pd.Series, | ||
a: np.ndarray, | ||
b: np.ndarray, | ||
distr_types: List[utils.DistrType], | ||
num_num_metric: Callable[[pd.Series, pd.Series], float] = pearson, | ||
cat_num_metric: Callable[[pd.Series, pd.Series], float] = kruskal_wallis, | ||
cat_cat_metric: Callable[[pd.Series, pd.Series], float] = cramers_v, | ||
) -> float: | ||
|
||
a_type = utils.infer_distr_type(sr_a) | ||
b_type = utils.infer_distr_type(sr_b) | ||
a_type = distr_types[int(a[-1])] | ||
b_type = distr_types[int(b[-1])] | ||
|
||
sr_a = pd.Series(a[:-1]) | ||
sr_b = pd.Series(b[:-1]) | ||
|
||
df = pd.DataFrame({"a": sr_a, "b": sr_b}).dropna().reset_index() | ||
|
||
Review comment: The two columns need to be joined so that any row with a null in either column is dropped before the correlation metric is applied. |
||
if a_type.is_continuous() and b_type.is_continuous(): | ||
return num_num_metric(sr_a, sr_b) | ||
return num_num_metric(df["a"], df["b"]) | ||
|
||
elif b_type.is_continuous(): | ||
return cat_num_metric(sr_a, sr_b) | ||
return cat_num_metric(df["a"], df["b"]) | ||
|
||
elif a_type.is_continuous(): | ||
return cat_num_metric(sr_b, sr_a) | ||
return cat_num_metric(df["b"], df["a"]) | ||
|
||
else: | ||
return cat_cat_metric(sr_a, sr_b) | ||
return cat_cat_metric(df["a"], df["b"]) |
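
The two review comments on this file (tagging each column with its index as a final row, and dropping null rows pairwise) can be illustrated with a minimal sketch. This is not the code from the pull request: `tagged_corr`, `kinds`, `metrics`, and the `"num"`/`"cat"` labels are hypothetical stand-ins for the precomputed distribution types and the metric callables, and `pd.concat` is used in place of `df.append`, which is not available in recent pandas releases.

```python
import numpy as np
import pandas as pd


def tagged_corr(df, kinds, metrics):
    """Pairwise matrix over a numeric frame, picking a metric per pair of kinds.

    `kinds[i]` is a precomputed label for column i and `metrics` maps a
    (kind_a, kind_b) tuple to a function of two Series. Both are
    hypothetical stand-ins for the types and metrics used in the diff.
    """
    # Append each column's position as a final row so that the callback
    # can recover which columns it was handed.
    tags = pd.DataFrame({col: [float(i)] for i, col in enumerate(df.columns)})
    tagged = pd.concat([df, tags], ignore_index=True)

    def corr(a, b):
        # The last element of each array is the column position tag.
        kind_a = kinds[int(a[-1])]
        kind_b = kinds[int(b[-1])]
        # Strip the tag, then drop rows where either value is null.
        pair = pd.DataFrame({"a": a[:-1], "b": b[:-1]}).dropna()
        return metrics[(kind_a, kind_b)](pair["a"], pair["b"])

    return tagged.corr(method=corr)


def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])


def placeholder(a, b):
    # Stand-in for a categorical metric; returns a fixed value.
    return 0.5


df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.5],
        "z": [0.0, 1.0, 0.0, np.nan],  # already-encoded categorical column
    }
)
kinds = ["num", "num", "cat"]
metrics = {
    ("num", "num"): pearson,
    ("num", "cat"): placeholder,
    ("cat", "num"): placeholder,
    ("cat", "cat"): placeholder,
}
matrix = tagged_corr(df, kinds, metrics)
```

The tag has to be numeric and has to stay in the last position, because the callback receives plain arrays with no column labels attached.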
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
""" | ||
Plot correlation heatmaps for datasets. | ||
""" | ||
|
||
from typing import Callable | ||
|
||
import matplotlib.pyplot as plt | ||
import pandas as pd | ||
import seaborn as sns | ||
|
||
from ..metrics import correlation, unified | ||
|
||
|
||
def heatmap( | ||
df: pd.DataFrame, | ||
num_num_metric: Callable[[pd.Series, pd.Series], float] = correlation.pearson, | ||
cat_num_metric: Callable[[pd.Series, pd.Series], float] = correlation.kruskal_wallis, | ||
cat_cat_metric: Callable[[pd.Series, pd.Series], float] = correlation.cramers_v, | ||
**kwargs | ||
|
||
): | ||
"""This function creates a correlation heatmap out of a dataframe, using user provided or default correlation | ||
metrics for all possible types of pairs of series (i.e. numerical-numerical, categorical-numerical, | ||
categorical-categorical). | ||
|
||
Args: | ||
df (pd.DataFrame): | ||
The dataframe used for computing correlations and producing a heatmap. | ||
num_num_metric (Callable[[pd.Series, pd.Series], float], optional): | ||
The correlation metric used for numerical-numerical series pairs. Defaults to Pearson's correlation | ||
coefficient. | ||
cat_num_metric (Callable[[pd.Series, pd.Series], float], optional): | ||
The correlation metric used for categorical-numerical series pairs. Defaults to Kruskal-Wallis' H Test. | ||
cat_cat_metric (Callable[[pd.Series, pd.Series], float], optional): | ||
The correlation metric used for categorical-categorical series pairs. Defaults to corrected Cramer's V | ||
statistic. | ||
kwargs: | ||
Key word arguments for sns.heatmap. | ||
""" | ||
|
||
corr_matrix = unified.correlation_matrix(df, num_num_metric, cat_num_metric, cat_cat_metric) | ||
|
||
if "cmap" not in kwargs: | ||
kwargs["cmap"] = sns.cubehelix_palette(start=0.2, rot=-0.2, dark=0.3, as_cmap=True) | ||
|
||
if "linewidth" not in kwargs: | ||
kwargs["linewidth"] = 0.5 | ||
|
||
sns.heatmap(corr_matrix, vmin=0, vmax=1, square=True, **kwargs) | ||
plt.tight_layout() |
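
The keyword handling in `heatmap` applies a default only when the caller has not supplied that key. A minimal sketch of the same pattern using `dict.setdefault` follows; `with_heatmap_defaults` is a hypothetical helper and the string `"default-palette"` is a placeholder for the seaborn colormap object built in the diff.

```python
def with_heatmap_defaults(**kwargs):
    """Fill in plotting defaults without overriding caller-supplied values."""
    # setdefault only writes the key when it is absent, which matches the
    # `if "cmap" not in kwargs` checks in the diff.
    kwargs.setdefault("cmap", "default-palette")  # placeholder value
    kwargs.setdefault("linewidth", 0.5)
    return kwargs


defaults_only = with_heatmap_defaults()
overridden = with_heatmap_defaults(linewidth=2.0, annot=True)
```

Here `overridden` keeps the caller's `linewidth` and `annot` values and only gains the missing `cmap` entry.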
Review comment: The correlation matrix is generated using `df.corr()`. Since `df.corr()` only works on numerical data, we need to encode all the columns. The issue with this is that we use the `infer_distr_type()` function to decide which metric would be suitable, which works differently on the encoded numerical data. The only way to resolve this issue is to infer types beforehand (which is probably more efficient). The problem then becomes about making a binary function `(a, b) -> float` that knows the types of `a` and `b` beforehand.
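
The ordering described above (infer types first, then encode) can be sketched as follows. This is an illustration under stated assumptions, not the pull request's code: `encode_for_corr` is a hypothetical helper, and the object-dtype check is a crude stand-in for `infer_distr_type()`. It relies on `pd.factorize` marking missing values with `-1` by default.

```python
import numpy as np
import pandas as pd


def encode_for_corr(df):
    """Return (encoded frame, kinds), with kinds inferred before encoding."""
    # Infer each column's kind from the raw data, before the encoding
    # below makes every column look numeric.
    kinds = ["cat" if df[col].dtype.kind == "O" else "num" for col in df.columns]

    encoded = df.copy()
    for col, kind in zip(df.columns, kinds):
        if kind == "cat":
            # factorize marks missing values with -1; turn those back into NaN.
            codes, _ = pd.factorize(encoded[col])
            values = codes.astype(float)
            values[codes < 0] = np.nan
            # Reuse the original index so the assignment cannot misalign rows.
            encoded[col] = pd.Series(values, index=encoded.index)
    return encoded, kinds


raw = pd.DataFrame(
    {"colour": ["red", "blue", None, "red"], "size": [1.0, 2.0, 3.0, 4.0]},
    index=[10, 11, 12, 13],
)
encoded, kinds = encode_for_corr(raw)
```

Because `kinds` is computed from the raw frame, the encoded frame can be handed to `df.corr()` while the metric choice still reflects the original column types.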
Review comment: Just to add, using `df.corr()` provides a major performance improvement.