add support for sparse X (#86)
* add support for csr matrix

* Appease mypy about notebook.

* Test against csr in test_cross_fit_estimator.py

* Test against csr in test_utils.py

* Adapt S-Learner to work with csr matrix.

* Test against csr matrix in test_learner and test_metalearner.

* fix notebook metadata

* reduce sparse problem size for docs

* final touches

---------

Co-authored-by: kklein </>
Co-authored-by: kklein <[email protected]>
apoorvalal and kklein authored Aug 28, 2024
1 parent 14fef77 commit 0d1958c
Showing 19 changed files with 419 additions and 74 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,14 @@
Changelog
=========

0.11.0 (2024-09-xx)
-------------------

**New features**

* Add support for using ``scipy.sparse.csr_matrix`` as a data structure for the covariates ``X``.


0.10.0 (2024-08-13)
-------------------

23 changes: 12 additions & 11 deletions docs/examples/example_estimating_ates.ipynb
@@ -150,7 +150,13 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"metalearners_dr = DRLearner(\n",
@@ -558,21 +564,16 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.11.7"
},
"mystnb": {
"execution_timeout": 120
}
},
"nbformat": 4,
272 changes: 272 additions & 0 deletions docs/examples/example_sparse_inputs.ipynb
@@ -0,0 +1,272 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(example-sparse)=\n",
"\n",
" Example: Using Sparse Covariate Matrices\n",
"=============================\n",
"\n",
"Motivation\n",
"----------\n",
"\n",
"In many applications, we want to adjust for categorical covariates with many levels. As a natural pre-processing step, this may involve one-hot-encoding the covariates, which can lead to a high-dimensional covariate matrix, which is typically very sparse. Many scikit-style learners accept (scipy's) sparse matrices as input, which allows us to use them for treatment effect estimation as well. \n",
"\n",
"Example\n",
"-------"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time, psutil, os, gc\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scipy as sp\n",
"\n",
"from sklearn.dummy import DummyRegressor\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"from lightgbm import LGBMRegressor, LGBMClassifier\n",
"from metalearners import DRLearner\n",
"\n",
"# This is required for when nbconvert converts the cell-magic to regular function calls.\n",
"from IPython import get_ipython"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_memory_usage():\n",
" process = psutil.Process(os.getpid())\n",
" return process.memory_info().rss / 1024 / 1024 # in MB\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Causal Inference\n",
"\n",
"### DRLearner\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We generate some data where X comprises of 100 categorical variables with 1000 possible levels. Naively one-hot-encoding this data produces a very large matrix with many zeroes, which is an ideal application of `scipy.sparse.csr_matrix`. We then use the `DRLearner` to estimate the treatment effect. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_causal_data(\n",
" n_samples=100_000,\n",
" n_categories=500,\n",
" n_features=100,\n",
" tau_magnitude=1.0,\n",
"):\n",
" ######################################################################\n",
" # Generate covariate matrix X\n",
" X = np.random.randint(0, n_categories, size=(n_samples, n_features))\n",
" ######################################################################\n",
" # Generate potential outcome y0\n",
" y0 = np.zeros(n_samples)\n",
" # Select a few features for main effects\n",
" main_effect_features = np.random.choice(n_features, 3, replace=False)\n",
" # Create main effects - fully dense\n",
" for i in main_effect_features:\n",
" category_effects = np.random.normal(0, 4, n_categories)\n",
" y0 += category_effects[X[:, i]]\n",
" # Select a couple of feature pairs for interaction effects\n",
" interaction_pairs = [\n",
" (i, j) for i in range(n_features) for j in range(i + 1, n_features)\n",
" ]\n",
" selected_interactions = np.random.choice(len(interaction_pairs), 2, replace=False)\n",
" # Create interaction effects\n",
" for idx in selected_interactions:\n",
" i, j = interaction_pairs[idx]\n",
" interaction_effect = np.random.choice(\n",
" [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
" )\n",
" y0 += interaction_effect[X[:, i], X[:, j]]\n",
" # Normalize y0\n",
" y0 = (y0 - np.mean(y0)) / np.std(y0)\n",
" y0 += np.random.normal(0, 0.1, n_samples)\n",
" ######################################################################\n",
" # Generate treatment assignment W\n",
" propensity_score = np.zeros(n_samples)\n",
" for i in main_effect_features:\n",
" category_effects = np.random.normal(0, 4, n_categories)\n",
" propensity_score += category_effects[X[:, i]]\n",
" # same interactions enter pscore\n",
" # Create interaction effects\n",
" for idx in selected_interactions:\n",
" i, j = interaction_pairs[idx]\n",
" interaction_effect = np.random.choice(\n",
" [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
" )\n",
" propensity_score += interaction_effect[X[:, i], X[:, j]]\n",
" # Convert to probabilities using logistic function\n",
" propensity_score = sp.special.expit(propensity_score)\n",
" # Generate binary treatment\n",
" W = np.random.binomial(1, propensity_score)\n",
" ######################################################################\n",
" # Generate treatment effect\n",
" tau = tau_magnitude * np.ones(n_samples)\n",
" # Generate final outcome\n",
" Y = y0 + W * tau\n",
" return X, W, Y, tau, propensity_score\n",
"\n",
"\n",
"X, W, Y, tau, propensity_score = generate_causal_data(\n",
" n_samples=1000, tau_magnitude=1.0\n",
")\n",
"Xdf = pd.DataFrame(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sparse and dense X matrices\n",
"e1 = OneHotEncoder(\n",
" sparse_output=True\n",
") # onehot encoder generates sparse output automatically\n",
"\n",
"X_csr = e1.fit_transform(X)\n",
"X_np = pd.get_dummies(\n",
" Xdf, columns=Xdf.columns\n",
").values # dense onehot encoding with pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"\\nSparse data memory: {X_csr.data.nbytes / 1024 / 1024:.2f}MB\")\n",
"print(f\"Dense data memory: {X_np.nbytes / 1024 / 1024:.2f}MB\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the memory footprint of the sparse matrix is considerably smaller than the dense matrix. \n"
]
},
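{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, illustrative sanity check, we can compute the density of the one-hot encoded matrix directly: every row holds exactly one non-zero entry per original categorical feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative: fraction of stored non-zero entries relative to the full matrix size.\n",
"density = X_csr.nnz / (X_csr.shape[0] * X_csr.shape[1])\n",
"print(f\"Density of the one-hot encoded matrix: {density:.4%}\")"
]
},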
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def fit_drlearner_wrapper(X, name):\n",
" start_memory = get_memory_usage()\n",
" start_time = time.time()\n",
" metalearners_dr = DRLearner(\n",
" nuisance_model_factory=LGBMRegressor,\n",
" treatment_model_factory=DummyRegressor,\n",
" propensity_model_factory=LGBMClassifier,\n",
" is_classification=False,\n",
" n_variants=2,\n",
" nuisance_model_params={\"verbose\": -1},\n",
" propensity_model_params={\"verbose\": -1},\n",
" )\n",
"\n",
" metalearners_dr.fit_all_nuisance(\n",
" X=X,\n",
" y=Y,\n",
" w=W,\n",
" )\n",
" metalearners_est = metalearners_dr.average_treatment_effect(\n",
" X=X,\n",
" y=Y,\n",
" w=W,\n",
" is_oos=False,\n",
" )\n",
" end_time = time.time()\n",
" end_memory = get_memory_usage()\n",
" runtime = end_time - start_time\n",
" memory_used = end_memory - start_memory\n",
" print(f\"{name} data - Runtime: {runtime:.2f}s, Memory used: {memory_used:.2f}MB\")\n",
" print(metalearners_est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`scipy.sparse.csr_matrix` input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_drlearner_wrapper(X_csr, \"Sparse\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`np.ndarray` input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_drlearner_wrapper(X_np, \"Dense\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"mystnb": {
"execution_timeout": 120
}
},
"nbformat": 4,
"nbformat_minor": 2
}
1 change: 1 addition & 0 deletions docs/examples/index.rst
@@ -16,3 +16,4 @@ Examples
Estimating CATEs for survival analysis <example_survival.ipynb>
What if I know the propensity score? <example_propensity.ipynb>
Converting a MetaLearner to ONNX <example_onnx.ipynb>
Using Sparse Covariate Matrices <example_sparse_inputs.ipynb>
3 changes: 2 additions & 1 deletion metalearners/_typing.py
@@ -6,6 +6,7 @@

import numpy as np
import pandas as pd
import scipy.sparse as sps

PredictMethod = Literal["predict", "predict_proba"]

@@ -21,7 +22,7 @@

# ruff is not happy about the usage of Union.
Vector = Union[pd.Series, np.ndarray] # noqa
- Matrix = Union[pd.DataFrame, np.ndarray]  # noqa
+ Matrix = Union[pd.DataFrame, np.ndarray, sps.csr_matrix]  # noqa


class _ScikitModel(Protocol):
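Note on the widened ``Matrix`` alias: any argument annotated as ``Matrix`` now also admits a ``scipy.sparse.csr_matrix``. A minimal sketch of what this enables (``n_columns`` is a hypothetical helper for illustration, not part of this commit):

import numpy as np
import pandas as pd
import scipy.sparse as sps

from metalearners._typing import Matrix


def n_columns(X: Matrix) -> int:
    # All three Matrix variants expose a .shape attribute.
    return X.shape[1]


assert n_columns(np.zeros((2, 3))) == 3
assert n_columns(pd.DataFrame(np.zeros((2, 3)))) == 3
assert n_columns(sps.csr_matrix(np.zeros((2, 3)))) == 3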
7 changes: 7 additions & 0 deletions metalearners/_utils.py
@@ -9,6 +9,7 @@

import numpy as np
import pandas as pd
import scipy
from sklearn.base import check_array, check_X_y, is_classifier, is_regressor
from sklearn.ensemble import (
HistGradientBoostingClassifier,
@@ -24,6 +25,12 @@
default_rng = np.random.default_rng()


def safe_len(X: Matrix) -> int:
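    """Return the number of rows of a matrix, since sparse matrices do not support len."""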
    if scipy.sparse.issparse(X):
        return X.shape[0]
    return len(X)


def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
"""Subselect certain rows from a matrix."""
if isinstance(rows, pd.Series):
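Why ``safe_len`` is needed: calling ``len`` on a scipy sparse matrix raises a ``TypeError`` rather than returning the row count, so callers use the helper instead. A small sketch, assuming scipy's documented ``__len__`` behavior:

import numpy as np
import scipy.sparse as sps

from metalearners._utils import safe_len

X_dense = np.eye(3)
X_sparse = sps.csr_matrix(X_dense)

# len(X_sparse) raises TypeError ("sparse matrix length is ambiguous"),
# so safe_len falls back to X.shape[0] for sparse input.
assert safe_len(X_sparse) == 3
assert safe_len(X_dense) == 3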
