-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correlation speedup #123
base: main
Are you sure you want to change the base?
Correlation speedup #123
Conversation
if df[col].dtype.kind == "O": | ||
df[col] = pd.factorize(df[col])[0] | ||
|
||
df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)})) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is - it's impossible to know which column corresponds to which distribution type in the helper, so we append the column's index in the data frame to it (as the final row). Then in the helper, we use that row to index the precomputed distribution types and drop that row. There might be a better way of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another way to do this would be to revert back to using utils.infer_distribution_type
but also adding a functools.lru_cache
to the utils.infer_distribution_type
function so that we avoid repetitive calculations.
columns_x (Optional[List[str]]): | ||
The column names that determine the rows of the matrix. | ||
columns_y (Optional[List[str]]): | ||
The column names that determine the columns of the matrix. | ||
|
||
Returns: | ||
pd.DataFrame: | ||
The correlation matrix to be used in heatmap generation. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correlation matrix is generated using df.corr()
. Since df.corr()
only works on numerical data, we need to encode all the columns. The issue with this is that we use the infer_distr_type()
function to decide which metric would be suitable, which works differently on the encoded numerical data. The only way to resolve this issue is to infer types beforehand (which is probably more efficient). The problem then becomes about making a binary function (a, b) -> float
that knows the types of a
and b
beforehand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add, using df.corr()
provides a major performance improvement.
sr_a = pd.Series(a[:-1]) | ||
sr_b = pd.Series(b[:-1]) | ||
|
||
df = pd.DataFrame({"a": sr_a, "b": sr_b}).dropna().reset_index() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns need to be joined so any rows with nulls are dropped before the correlation metric is applied.
@@ -133,3 +144,38 @@ def test_cn_unequal_series_corr(): | |||
sr_b = pd.Series([100, 200, 99, 101, 201, 199, 299, 300, 301, 500, 501, 505, 10, 12, 1001, 1050]) | |||
|
|||
assert distance_cn_correlation(sr_a, sr_b) > 0.7 | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stress test to make sure correlation matrix generates properly without the encoding step.
if df[col].dtype.kind == "O": | ||
df[col] = pd.factorize(df[col])[0] | ||
|
||
df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)})) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another way to do this would be to revert back to using utils.infer_distribution_type
but also adding a functools.lru_cache
to the utils.infer_distribution_type
function so that we avoid repetitive calculations.
…io/fairlens into correlation-speedup
Codecov Report
@@ Coverage Diff @@
## main #123 +/- ##
==========================================
+ Coverage 74.16% 79.16% +4.99%
==========================================
Files 15 15
Lines 871 864 -7
Branches 186 184 -2
==========================================
+ Hits 646 684 +38
+ Misses 181 135 -46
- Partials 44 45 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Kudos, SonarCloud Quality Gate passed! |
fairlens.unified.correlation_matrix
fairlens.plot.heatmap.two_column_heatmap
is nowfairlens.plot.correlation.heatmap