Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correlation speedup #123

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

Correlation speedup #123

wants to merge 15 commits into from

Conversation

Hilly12
Copy link
Contributor

@Hilly12 Hilly12 commented Sep 9, 2021

  • Vectorise fairlens.unified.correlation_matrix
  • Add stress tests for correlation matrix generation
  • Concatenate columns before dropping nulls in the correlation matrix helper
  • fairlens.plot.heatmap.two_column_heatmap is now fairlens.plot.correlation.heatmap

if df[col].dtype.kind == "O":
df[col] = pd.factorize(df[col])[0]

df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)}))

Copy link
Contributor Author

@Hilly12 Hilly12 Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is - it's impossible to know which column corresponds to which distribution type in the helper, so we append the column's index in the data frame to it (as the final row). Then in the helper, we use that row to index the precomputed distribution types and drop that row. There might be a better way of doing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another way to do this would be to revert back to using utils.infer_distribution_type but also adding a functools.lru_cache to the utils.infer_distribution_type function so that we avoid repetitive calculations.

@Hilly12 Hilly12 marked this pull request as ready for review September 13, 2021 13:39
columns_x (Optional[List[str]]):
The column names that determine the rows of the matrix.
columns_y (Optional[List[str]]):
The column names that determine the columns of the matrix.

Returns:
pd.DataFrame:
The correlation matrix to be used in heatmap generation.
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correlation matrix is generated using df.corr(). Since df.corr() only works on numerical data, we need to encode all the columns. The issue with this is that we use the infer_distr_type() function to decide which metric would be suitable, which works differently on the encoded numerical data. The only way to resolve this issue is to infer types beforehand (which is probably more efficient). The problem then becomes about making a binary function (a, b) -> float that knows the types of a and b beforehand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add, using df.corr() provides a major performance improvement.

sr_a = pd.Series(a[:-1])
sr_b = pd.Series(b[:-1])

df = pd.DataFrame({"a": sr_a, "b": sr_b}).dropna().reset_index()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns need to be joined so any rows with nulls are dropped before the correlation metric is applied.

@@ -133,3 +144,38 @@ def test_cn_unequal_series_corr():
sr_b = pd.Series([100, 200, 99, 101, 201, 199, 299, 300, 301, 500, 501, 505, 10, 12, 1001, 1050])

assert distance_cn_correlation(sr_a, sr_b) > 0.7


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stress test to make sure correlation matrix generates properly without the encoding step.

src/fairlens/plot/correlation.py Outdated Show resolved Hide resolved
if df[col].dtype.kind == "O":
df[col] = pd.factorize(df[col])[0]

df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)}))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another way to do this would be to revert back to using utils.infer_distribution_type but also adding a functools.lru_cache to the utils.infer_distribution_type function so that we avoid repetitive calculations.

@simonhkswan simonhkswan added the type:enhancement New feature or request label Sep 15, 2021
@codecov
Copy link

codecov bot commented Sep 16, 2021

Codecov Report

Merging #123 (12769bc) into main (432b120) will increase coverage by 4.99%.
The diff coverage is 89.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #123      +/-   ##
==========================================
+ Coverage   74.16%   79.16%   +4.99%     
==========================================
  Files          15       15              
  Lines         871      864       -7     
  Branches      186      184       -2     
==========================================
+ Hits          646      684      +38     
+ Misses        181      135      -46     
- Partials       44       45       +1     
Flag Coverage Δ
unittests 79.16% <89.39%> (+4.99%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/fairlens/metrics/__init__.py 100.00% <ø> (ø)
src/fairlens/sensitive/correlation.py 82.22% <44.44%> (-2.23%) ⬇️
src/fairlens/metrics/correlation.py 60.86% <90.47%> (+12.48%) ⬆️
src/fairlens/metrics/unified.py 83.33% <100.00%> (+36.39%) ⬆️
src/fairlens/plot/__init__.py 100.00% <100.00%> (ø)
src/fairlens/plot/correlation.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 432b120...12769bc. Read the comment docs.

@sonarcloud
Copy link

sonarcloud bot commented Sep 17, 2021

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

92.2% 92.2% Coverage
0.0% 0.0% Duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants