
Sparse matrix features for DMLIV #508

Open
delimited0 opened this issue Aug 7, 2021 · 4 comments

Comments

@delimited0

I have been trying to use DMLIV for a problem where we want to estimate fixed effects for many features, about 75,000 in our case. We have about 5.5 million observations. I have to use sparse matrices for the features to avoid running out of memory (256 GB). What is the proper way to use DMLIV with sparse matrix features?

To illustrate what I have tried, the data looks something like

import numpy as np
import pandas as pd
import scipy.sparse

n = 5493141
p = 75000
y = np.random.poisson(size=(n, 1))
t = np.random.binomial(1, .001, size=(n, 1))
z = np.random.binomial(1, .001, size=(n, 1))
z_spmat = scipy.sparse.csr_matrix(z)

density = 0.00000001
size = int(n * p * density)

# random (row, col) coordinates for the nonzero entries
rows = np.random.randint(0, n, size=size)
cols = np.random.randint(0, p, size=size)
data = np.random.rand(size)

x_spmat = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(n, p))
x_df = pd.DataFrame.sparse.from_spmatrix(x_spmat)

I am using SGD classifiers/regressors for the first-stage models, which I am able to fit to my data outside of econml:

from sklearn.linear_model import SGDClassifier, SGDRegressor

occ_sgd3 = DMLIV(
    model_Y_X=SGDRegressor(),
    model_T_X=SGDClassifier(loss='log'),
    model_T_XZ=SGDClassifier(loss='log'),
    model_final=SGDRegressor(),
)

When I pass in the features x as a sparse matrix:

occ_sgd3.fit(Y=y, T=t, Z=z, X=x_spmat)

I get the error IndexError: tuple index out of range.

When I pass them in as a pandas data frame with sparse columns

occ_sgd3.fit(Y=y, T=t, Z=z, X=x_df)

I get ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s).

After digging around a little, I tried making z a sparse matrix as well, since it seems to get concatenated with x during fitting:

occ_sgd3.fit(Y=y, T=t, Z=z_spmat, X=x_df)

This gives ValueError: Found input variables with inconsistent numbers of samples: [2, 5493141].

Finally, I tried to avoid cross-fitting by setting cv=1. This seemed to get further and actually estimated the first-stage models, but eventually I get the error TypeError: 'coo_matrix' object is not subscriptable. It looks like there is code in utilities.py that handles sparse matrix input, so I suspect my input formats are not quite right.

@vsyrgkanis
Collaborator

Unfortunately we have yet to implement support for sparse matrices in our estimators. It is an important feature.

@sergeyf

sergeyf commented Apr 1, 2022

Hello. What would the effort be for an external PR to add sparse support to the models with the word Sparse in their name? Is it deep surgery or something more superficial? I work at a lab and we're considering helping, but I'm not sure how much insider knowledge and understanding would be necessary.

Thanks for your hard work on this package.

@kbattocchi
Collaborator

@sergeyf Thanks for your interest. It would definitely not be completely trivial, because the way fitting is done is broken up across several different files, all of which would be affected:

  • _OrthoLearner handles splitting the inputs into subsamples for cross-fitting
  • _RLearner has a little bit of reshaping logic
  • The individual models have wrapper classes that do a lot of the work (e.g. _FirstStageWrapper and _FinalStageWrapper)
  • There are also mixins like TreatmentExpansionMixin that transform input arrays

So there are lots of places where changes might need to be made. In many cases, though, I think the changes themselves should be relatively straightforward, since most of this is merely indexing into arrays or reshaping or retiling them; simply debugging the execution of a test case using small sparse inputs should catch most of the issues.
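To make the "mostly indexing and reshaping" point concrete: CSR matrices already support the integer-array row-subsetting used for cross-fitting, while calls like np.concatenate/np.hstack are the ones that need a sparse-aware branch. A sketch of that kind of dispatch (the helper name here is hypothetical, not an econml function):

```python
import numpy as np
import scipy.sparse

def hstack_dense_or_sparse(arrays):
    """Hypothetical helper: stack columns, staying sparse if any input is sparse."""
    if any(scipy.sparse.issparse(a) for a in arrays):
        return scipy.sparse.hstack(arrays, format="csr")
    return np.hstack(arrays)

x = scipy.sparse.random(10, 4, density=0.5, format="csr", random_state=0)
z = np.ones((10, 1))

# Row-subsetting for cross-fitting works the same way on CSR and ndarray:
idx = np.array([0, 2, 4])
fold = x[idx]                      # CSR supports integer-array row indexing
both = hstack_dense_or_sparse([x, z])
print(fold.shape, both.shape)      # (3, 4) (10, 5)
```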

One question is what, if anything, to do about final models. The way these algorithms work is that we take inputs Y, T, X, (and sometimes W and/or Z) and estimate some nuisance models, then use the outputs of the predictions of those nuisance models on out-of-sample data to estimate a final model. The user supplies nuisance models, and if those models support sparsity then we shouldn't need to do anything special at all as long as the places I've mentioned above propagate and manipulate the sparse inputs appropriately. However, we ourselves generally supply the final models (based on StatsModelsLinearRegression for LinearDML, or MultiOutputDebiasedLasso for SparseLinearDML, for example), so changing those estimation methods to support sparse inputs could also be worthwhile, but might be more involved depending on whether the libraries we're relying on for linear algebra, etc. themselves support sparse inputs or not.
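On the final-model question: SciPy does ship sparse least squares, so an OLS-style final stage could stay sparse in principle. A sketch using scipy.sparse.linalg.lsqr (this is not what StatsModelsLinearRegression does, and inference/standard errors would need separate work):

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.default_rng(0)
a = scipy.sparse.random(500, 20, density=0.1, format="csr", random_state=0)
coef_true = rng.normal(size=20)
b = a @ coef_true

# Iterative solver; never materializes a dense copy of `a`.
coef_sparse = scipy.sparse.linalg.lsqr(a, b, atol=1e-10, btol=1e-10)[0]
coef_dense = np.linalg.lstsq(a.toarray(), b, rcond=None)[0]
print(np.allclose(coef_sparse, coef_dense, atol=1e-5))
```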

Also, note that the estimators with Sparse in their names are not related to sparsity in the input data - instead, they are assuming that the treatment effect is a sparse function of the features (that is, that only a small number of the interactions between the features and the treatments affect the outcomes). So it could make just as much sense to apply LinearDML to sparse data as it does to apply SparseLinearDML, depending on what you believe about the nature of the treatment effect.

@sergeyf

sergeyf commented Apr 3, 2022

Thanks for the detailed answer. It looks like it'll be a medium amount of work for someone with your intimate knowledge of the repository, and a large amount of work for someone without.
