Sparse matrix features for DMLIV #508
Unfortunately, we have yet to implement support for sparse matrices in our estimators. It is an important feature.
Hello. What would be the effort for an external PR to contribute sparse support? And are the estimators with the word Sparse in their names related to this? Thanks for the hard work on this package.
@sergeyf Thanks for your interest. It would definitely not be completely trivial, because the way fitting is done is broken up across several different files, all of which would be affected.
So there are lots of places where changes might need to be made, although in many cases I think the changes themselves should be relatively straightforward, since most of this is merely indexing into arrays or reshaping or retiling them, and simply debugging the execution of a test case with small sparse inputs should catch most of the issues.

One question is what, if anything, to do about final models. The way these algorithms work is that we take inputs Y, T, X (and sometimes W and/or Z) and estimate some nuisance models, then use the out-of-sample predictions of those nuisance models to estimate a final model. The user supplies the nuisance models, and if those models support sparsity then we shouldn't need to do anything special at all, as long as the places I've mentioned above propagate and manipulate the sparse inputs appropriately. However, we ourselves generally supply the final models, so sparse support there would need to be considered separately.

Also, note that the estimators with Sparse in their names are not related to sparsity in the input data; instead, they assume that the treatment effect is a sparse function of the features (that is, that only a small number of the interactions between the features and the treatments affect the outcomes). So it could make just as much sense to apply LinearDML to sparse data as to apply SparseLinearDML, depending on what you believe about the nature of the treatment effect.
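As a sanity check on the point about nuisance models: the first-stage/final-stage pattern can be sketched outside econml entirely, since sklearn's SGD estimators already accept scipy CSR input. Everything below (the shapes, the residualization-style final stage) is an illustrative toy, not econml's actual implementation:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, d = 2000, 100

# Sparse stand-in for a tall design matrix like the 5.5M x 75k one above.
X = sparse.random(n, d, density=0.05, format="csr", random_state=0)
signal = np.asarray(X[:, :10].sum(axis=1)).ravel()
T = signal + rng.normal(size=n)            # treatment depends on X
Y = 1.5 * T + signal + rng.normal(size=n)  # true effect of T is 1.5

# Stage 1: cross-fitted nuisance predictions; SGDRegressor consumes CSR directly.
t_hat = cross_val_predict(SGDRegressor(max_iter=50, tol=None), X, T, cv=2)
y_hat = cross_val_predict(SGDRegressor(max_iter=50, tol=None), X, Y, cv=2)

# Stage 2: regress the Y residuals on the T residuals to recover the effect.
final = LinearRegression().fit((T - t_hat).reshape(-1, 1), Y - y_hat)
theta_est = final.coef_[0]
```

The sparse matrix is never densified anywhere in this pipeline, which is the property the econml internals would need to preserve.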
Thanks for the detailed answer. Looks like it'll be a medium amount of work for someone with your intimate knowledge of the repository, and a large amount of work for someone without.
I have been trying to use DMLIV for a problem where we want to estimate fixed effects of many features, in our case about 75000 of them. We have about 5.5 million observations. I have to use sparse matrices for the features to avoid running out of memory (256GB). What is the proper way to use DMLIV with sparse matrix features?
To illustrate what I have tried, the data looks something like
I am using SGD classifiers/regressors for first stage models, which I am able to fit to my data outside of econml:
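The original snippet is not shown, but a fit along these lines (all names, shapes, and densities here are hypothetical stand-ins for the real data) does work on sparse input:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier, SGDRegressor

rng = np.random.default_rng(0)
n, d = 10_000, 500                  # stand-ins for ~5.5M rows x ~75k features
x = sparse.random(n, d, density=0.01, format="csr", random_state=0)
z = rng.integers(0, 2, size=n)      # hypothetical binary instrument
t = rng.integers(0, 2, size=n)      # hypothetical binary treatment
y = rng.normal(size=n)              # outcome

# Both SGD estimators accept CSR matrices directly, so nothing is densified.
model_t = SGDClassifier(max_iter=10, tol=None, random_state=0).fit(x, t)
model_y = SGDRegressor(max_iter=10, tol=None, random_state=0).fit(x, y)
```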
When I pass in the features `x` as a sparse matrix, I get the error `IndexError: tuple index out of range`.
When I pass them in as a pandas data frame with sparse columns, I get `ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)`.
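One workaround worth trying on the pandas route: a DataFrame whose columns are all sparse can be converted to a scipy CSR matrix up front via the `sparse` accessor, sidestepping whatever the internal concatenation does with sparse frames. A small sketch (the frame here is a hypothetical stand-in for the real features):

```python
import pandas as pd
from scipy import sparse

# A tiny frame with sparse columns; fill_value=0 so that zeros are
# actually compressed out (the default fill for floats is NaN).
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0),
    "b": pd.arrays.SparseArray([2.0, 0.0, 0.0, 0.0], fill_value=0),
})

# Convert to a scipy CSR matrix before handing it to any estimator;
# CSR is the format sklearn-style estimators handle most reliably.
x_csr = df.sparse.to_coo().tocsr()
```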
After digging around a little I tried making `z` a sparse matrix, since it seems to get concatenated with `x` when fitting; trying that gives `ValueError: Found input variables with inconsistent numbers of samples: [2, 5493141]`.
Finally, I tried to avoid cross fitting by setting `cv=1`; I seemed to get further and actually estimated the first stage models, but eventually got the error `TypeError: 'coo_matrix' object is not subscriptable`. It looks like there's code in utilities.py that handles sparse matrix input, so I suspect my input formats are not quite right.