add support for sparse X (#86)
* add support for csr matrix

* Appease mypy about notebook.

* Test against csr in test_cross_fit_estimator.py

* Test against csr in test_utils.py

* Adapt S-Learner to work with csr matrix.

* Test against csr matrix in test_learner and test_metalearner.

* fix notebook metadata

* reduce sparse problem size for docs

* final touches

---------

Co-authored-by: kklein </>
Co-authored-by: kklein <[email protected]>
apoorvalal and kklein authored Aug 28, 2024
1 parent 14fef77 commit 0d1958c
Showing 19 changed files with 419 additions and 74 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,14 @@
Changelog
=========

0.11.0 (2024-09-xx)
-------------------

**New features**

* Add support for using ``scipy.sparse.csr_matrix`` as a data structure for the covariates ``X``.


0.10.0 (2024-08-13)
-------------------

23 changes: 12 additions & 11 deletions docs/examples/example_estimating_ates.ipynb
@@ -150,7 +150,13 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"metalearners_dr = DRLearner(\n",
@@ -558,21 +564,16 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.11.7"
},
"mystnb": {
"execution_timeout": 120
}
},
"nbformat": 4,
272 changes: 272 additions & 0 deletions docs/examples/example_sparse_inputs.ipynb
@@ -0,0 +1,272 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(example-sparse)=\n",
"\n",
" Example: Using Sparse Covariate Matrices\n",
"=============================\n",
"\n",
"Motivation\n",
"----------\n",
"\n",
"In many applications, we want to adjust for categorical covariates with many levels. As a natural pre-processing step, this may involve one-hot-encoding the covariates, which can lead to a high-dimensional covariate matrix, which is typically very sparse. Many scikit-style learners accept (scipy's) sparse matrices as input, which allows us to use them for treatment effect estimation as well. \n",
"\n",
"Example\n",
"-------"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time, psutil, os, gc\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scipy as sp\n",
"\n",
"from sklearn.dummy import DummyRegressor\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"from lightgbm import LGBMRegressor, LGBMClassifier\n",
"from metalearners import DRLearner\n",
"\n",
"# This is required for when nbconvert converts the cell-magic to regular function calls.\n",
"from IPython import get_ipython"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_memory_usage():\n",
" process = psutil.Process(os.getpid())\n",
" return process.memory_info().rss / 1024 / 1024 # in MB\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Causal Inference\n",
"\n",
"### DRLearner\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We generate some data where X comprises of 100 categorical variables with 1000 possible levels. Naively one-hot-encoding this data produces a very large matrix with many zeroes, which is an ideal application of `scipy.sparse.csr_matrix`. We then use the `DRLearner` to estimate the treatment effect. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_causal_data(\n",
" n_samples=100_000,\n",
" n_categories=500,\n",
" n_features=100,\n",
" tau_magnitude=1.0,\n",
"):\n",
" ######################################################################\n",
" # Generate covariate matrix X\n",
" X = np.random.randint(0, n_categories, size=(n_samples, n_features))\n",
" ######################################################################\n",
" # Generate potential outcome y0\n",
" y0 = np.zeros(n_samples)\n",
" # Select a few features for main effects\n",
" main_effect_features = np.random.choice(n_features, 3, replace=False)\n",
" # Create main effects - fully dense\n",
" for i in main_effect_features:\n",
" category_effects = np.random.normal(0, 4, n_categories)\n",
" y0 += category_effects[X[:, i]]\n",
" # Select a couple of feature pairs for interaction effects\n",
" interaction_pairs = [\n",
" (i, j) for i in range(n_features) for j in range(i + 1, n_features)\n",
" ]\n",
" selected_interactions = np.random.choice(len(interaction_pairs), 2, replace=False)\n",
" # Create interaction effects\n",
" for idx in selected_interactions:\n",
" i, j = interaction_pairs[idx]\n",
" interaction_effect = np.random.choice(\n",
" [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
" )\n",
" y0 += interaction_effect[X[:, i], X[:, j]]\n",
" # Normalize y0\n",
" y0 = (y0 - np.mean(y0)) / np.std(y0)\n",
" y0 += np.random.normal(0, 0.1, n_samples)\n",
" ######################################################################\n",
" # Generate treatment assignment W\n",
" propensity_score = np.zeros(n_samples)\n",
" for i in main_effect_features:\n",
" category_effects = np.random.normal(0, 4, n_categories)\n",
" propensity_score += category_effects[X[:, i]]\n",
" # same interactions enter pscore\n",
" # Create interaction effects\n",
" for idx in selected_interactions:\n",
" i, j = interaction_pairs[idx]\n",
" interaction_effect = np.random.choice(\n",
" [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
" )\n",
" propensity_score += interaction_effect[X[:, i], X[:, j]]\n",
" # Convert to probabilities using logistic function\n",
" propensity_score = sp.special.expit(propensity_score)\n",
" # Generate binary treatment\n",
" W = np.random.binomial(1, propensity_score)\n",
" ######################################################################\n",
" # Generate treatment effect\n",
" tau = tau_magnitude * np.ones(n_samples)\n",
" # Generate final outcome\n",
" Y = y0 + W * tau\n",
" return X, W, Y, tau, propensity_score\n",
"\n",
"\n",
"X, W, Y, tau, propensity_score = generate_causal_data(\n",
" n_samples=1000, tau_magnitude=1.0\n",
")\n",
"Xdf = pd.DataFrame(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sparse and dense X matrices\n",
"e1 = OneHotEncoder(\n",
" sparse_output=True\n",
") # onehot encoder generates sparse output automatically\n",
"\n",
"X_csr = e1.fit_transform(X)\n",
"X_np = pd.get_dummies(\n",
" Xdf, columns=Xdf.columns\n",
").values # dense onehot encoding with pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"\\nSparse data memory: {X_csr.data.nbytes / 1024 / 1024:.2f}MB\")\n",
"print(f\"Dense data memory: {X_np.nbytes / 1024 / 1024:.2f}MB\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the memory footprint of the sparse matrix is considerably smaller than the dense matrix. \n"
]
},
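{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, illustrative sanity check, we can compute the density of the one-hot encoded matrix directly: every row holds exactly one non-zero entry per original categorical feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative: fraction of stored non-zero entries relative to the full matrix size.\n",
"density = X_csr.nnz / (X_csr.shape[0] * X_csr.shape[1])\n",
"print(f\"Density of the one-hot encoded matrix: {density:.4%}\")"
]
},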
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def fit_drlearner_wrapper(X, name):\n",
" start_memory = get_memory_usage()\n",
" start_time = time.time()\n",
" metalearners_dr = DRLearner(\n",
" nuisance_model_factory=LGBMRegressor,\n",
" treatment_model_factory=DummyRegressor,\n",
" propensity_model_factory=LGBMClassifier,\n",
" is_classification=False,\n",
" n_variants=2,\n",
" nuisance_model_params={\"verbose\": -1},\n",
" propensity_model_params={\"verbose\": -1},\n",
" )\n",
"\n",
" metalearners_dr.fit_all_nuisance(\n",
" X=X,\n",
" y=Y,\n",
" w=W,\n",
" )\n",
" metalearners_est = metalearners_dr.average_treatment_effect(\n",
" X=X,\n",
" y=Y,\n",
" w=W,\n",
" is_oos=False,\n",
" )\n",
" end_time = time.time()\n",
" end_memory = get_memory_usage()\n",
" runtime = end_time - start_time\n",
" memory_used = end_memory - start_memory\n",
" print(f\"{name} data - Runtime: {runtime:.2f}s, Memory used: {memory_used:.2f}MB\")\n",
" print(metalearners_est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`scipy.sparse.csr_matrix` input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_drlearner_wrapper(X_csr, \"Sparse\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`np.ndarray` input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_drlearner_wrapper(X_np, \"Dense\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"mystnb": {
"execution_timeout": 120
}
},
"nbformat": 4,
"nbformat_minor": 2
}
1 change: 1 addition & 0 deletions docs/examples/index.rst
@@ -16,3 +16,4 @@ Examples
Estimating CATEs for survival analysis <example_survival.ipynb>
What if I know the propensity score? <example_propensity.ipynb>
Converting a MetaLearner to ONNX <example_onnx.ipynb>
Using Sparse Covariate Matrices <example_sparse_inputs.ipynb>
3 changes: 2 additions & 1 deletion metalearners/_typing.py
@@ -6,6 +6,7 @@

import numpy as np
import pandas as pd
import scipy.sparse as sps

PredictMethod = Literal["predict", "predict_proba"]

@@ -21,7 +22,7 @@

# ruff is not happy about the usage of Union.
Vector = Union[pd.Series, np.ndarray] # noqa
- Matrix = Union[pd.DataFrame, np.ndarray]  # noqa
+ Matrix = Union[pd.DataFrame, np.ndarray, sps.csr_matrix]  # noqa


class _ScikitModel(Protocol):
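Note on the widened ``Matrix`` alias: any argument annotated as ``Matrix`` now also admits a ``scipy.sparse.csr_matrix``. A minimal sketch of what this enables (``n_columns`` is a hypothetical helper for illustration, not part of this commit):

import numpy as np
import pandas as pd
import scipy.sparse as sps

from metalearners._typing import Matrix


def n_columns(X: Matrix) -> int:
    # All three Matrix variants expose a .shape attribute.
    return X.shape[1]


assert n_columns(np.zeros((2, 3))) == 3
assert n_columns(pd.DataFrame(np.zeros((2, 3)))) == 3
assert n_columns(sps.csr_matrix(np.zeros((2, 3)))) == 3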
7 changes: 7 additions & 0 deletions metalearners/_utils.py
@@ -9,6 +9,7 @@

import numpy as np
import pandas as pd
import scipy
from sklearn.base import check_array, check_X_y, is_classifier, is_regressor
from sklearn.ensemble import (
HistGradientBoostingClassifier,
@@ -24,6 +25,12 @@
default_rng = np.random.default_rng()


def safe_len(X: Matrix) -> int:
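    """Return the number of rows of a matrix, since sparse matrices do not support len."""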
    if scipy.sparse.issparse(X):
        return X.shape[0]
    return len(X)


def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
"""Subselect certain rows from a matrix."""
if isinstance(rows, pd.Series):
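Why ``safe_len`` is needed: calling ``len`` on a scipy sparse matrix raises a ``TypeError`` rather than returning the row count, so callers use the helper instead. A small sketch, assuming scipy's documented ``__len__`` behavior:

import numpy as np
import scipy.sparse as sps

from metalearners._utils import safe_len

X_dense = np.eye(3)
X_sparse = sps.csr_matrix(X_dense)

# len(X_sparse) raises TypeError ("sparse matrix length is ambiguous"),
# so safe_len falls back to X.shape[0] for sparse input.
assert safe_len(X_sparse) == 3
assert safe_len(X_dense) == 3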
