
Sparse matrix features for DMLIV #508

Open
delimited0 opened this issue Aug 7, 2021 · 4 comments

Comments

@delimited0

I have been trying to use DMLIV for a problem where we want to estimate fixed effects for many features, about 75,000 in our case. We have about 5.5 million observations. I have to use sparse matrices for the features to avoid running out of memory (256 GB). What is the proper way to use DMLIV with sparse matrix features?

To illustrate what I have tried, the data looks something like

import numpy as np
import pandas as pd
import scipy.sparse

n = 5493141
p = 75000
y = np.random.poisson(size=(n, 1))
t = np.random.binomial(1, .001, size=(n, 1))
z = np.random.binomial(1, .001, size=(n, 1))
z_spmat = scipy.sparse.csr_matrix(z)

density = 0.00000001
size = int(n * p * density)

# random (row, col) coordinates for the nonzero entries
rows = np.random.randint(0, n, size=size)
cols = np.random.randint(0, p, size=size)
data = np.random.rand(size)

x_spmat = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(n, p))
x_df = pd.DataFrame.sparse.from_spmatrix(x_spmat)

I am using SGD classifiers/regressors for the first-stage models, which I am able to fit to my data outside of econml:

from sklearn.linear_model import SGDClassifier, SGDRegressor

occ_sgd3 = DMLIV(
    model_Y_X=SGDRegressor(),
    model_T_X=SGDClassifier(loss='log'),
    model_T_XZ=SGDClassifier(loss='log'),
    model_final=SGDRegressor(),
)

When I pass in the features x as a sparse matrix:

occ_sgd3.fit(Y=y, T=t, Z=z, X=x_spmat)

I get the error IndexError: tuple index out of range.

When I pass them in as a pandas data frame with sparse columns

occ_sgd3.fit(Y=y, T=t, Z=z, X=x_df)

I get ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s).

After digging around a little, I tried making z a sparse matrix as well, since it seems to get concatenated with x during fitting:

occ_sgd3.fit(Y=y, T=t, Z=z_spmat, X=x_df)

This gives ValueError: Found input variables with inconsistent numbers of samples: [2, 5493141].

Finally, I tried to avoid cross-fitting by setting cv=1. This seemed to get further and actually estimated the first-stage models, but eventually I get the error TypeError: 'coo_matrix' object is not subscriptable. It looks like there is code in utilities.py that handles sparse matrix input, so I suspect my input formats are not quite right.

@vsyrgkanis
Collaborator

Unfortunately we have yet to implement support for sparse matrices in our estimators. It is an important feature.

@sergeyf

sergeyf commented Apr 1, 2022

Hello. What would the effort be for an external PR to add sparse support to the models with the word Sparse in their name? Is it deep surgery or something more superficial? I work at a lab and we're considering helping, but I'm not sure how much insider knowledge and understanding would be necessary.

Thanks for your hard work on this package.

@kbattocchi
Collaborator

@sergeyf Thanks for your interest. It would definitely not be completely trivial, because the way fitting is done is broken up across several different files, all of which would be affected:

  • _OrthoLearner handles splitting the inputs into subsamples for cross-fitting
  • _RLearner has a little bit of reshaping logic
  • The individual models have wrapper classes that do a lot of the work (e.g. _FirstStageWrapper and _FinalStageWrapper)
  • There are also mixins like TreatmentExpansionMixin that transform input arrays

So there are lots of places where changes might need to be made. In many cases, though, I think the changes themselves should be relatively straightforward, since most of this is merely indexing into arrays or reshaping or retiling them; simply debugging the execution of a test case using small sparse inputs should catch most of the issues.
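To make the "mostly indexing and reshaping" point concrete: CSR matrices already support the integer-array row-subsetting used for cross-fitting, while calls like np.concatenate/np.hstack are the ones that need a sparse-aware branch. A sketch of that kind of dispatch (the helper name here is hypothetical, not an econml function):

```python
import numpy as np
import scipy.sparse

def hstack_dense_or_sparse(arrays):
    """Hypothetical helper: stack columns, staying sparse if any input is sparse."""
    if any(scipy.sparse.issparse(a) for a in arrays):
        return scipy.sparse.hstack(arrays, format="csr")
    return np.hstack(arrays)

x = scipy.sparse.random(10, 4, density=0.5, format="csr", random_state=0)
z = np.ones((10, 1))

# Row-subsetting for cross-fitting works the same way on CSR and ndarray:
idx = np.array([0, 2, 4])
fold = x[idx]                      # CSR supports integer-array row indexing
both = hstack_dense_or_sparse([x, z])
print(fold.shape, both.shape)      # (3, 4) (10, 5)
```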

One question is what, if anything, to do about final models. The way these algorithms work is that we take inputs Y, T, X, (and sometimes W and/or Z) and estimate some nuisance models, then use the outputs of the predictions of those nuisance models on out-of-sample data to estimate a final model. The user supplies nuisance models, and if those models support sparsity then we shouldn't need to do anything special at all as long as the places I've mentioned above propagate and manipulate the sparse inputs appropriately. However, we ourselves generally supply the final models (based on StatsModelsLinearRegression for LinearDML, or MultiOutputDebiasedLasso for SparseLinearDML, for example), so changing those estimation methods to support sparse inputs could also be worthwhile, but might be more involved depending on whether the libraries we're relying on for linear algebra, etc. themselves support sparse inputs or not.
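On the final-model question: SciPy does ship sparse least squares, so an OLS-style final stage could stay sparse in principle. A sketch using scipy.sparse.linalg.lsqr (this is not what StatsModelsLinearRegression does, and inference/standard errors would need separate work):

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.default_rng(0)
a = scipy.sparse.random(500, 20, density=0.1, format="csr", random_state=0)
coef_true = rng.normal(size=20)
b = a @ coef_true

# Iterative solver; never materializes a dense copy of `a`.
coef_sparse = scipy.sparse.linalg.lsqr(a, b, atol=1e-10, btol=1e-10)[0]
coef_dense = np.linalg.lstsq(a.toarray(), b, rcond=None)[0]
print(np.allclose(coef_sparse, coef_dense, atol=1e-5))
```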

Also, note that the estimators with Sparse in their names are not related to sparsity in the input data - instead, they are assuming that the treatment effect is a sparse function of the features (that is, that only a small number of the interactions between the features and the treatments affect the outcomes). So it could make just as much sense to apply LinearDML to sparse data as it does to apply SparseLinearDML, depending on what you believe about the nature of the treatment effect.

@sergeyf

sergeyf commented Apr 3, 2022

Thanks for the detailed answer. It looks like it'll be a medium amount of work for someone with your intimate knowledge of the repository, and a large amount of work for someone without.
