Feature selection:
NaNs inside the featurized dataframe currently cause an error when feature selection is performed on it (the NMI computation does not work with NaNs). The solution is to preprocess the data exactly as is currently done for model fitting. This solves the NaN issue and also brings the data closer to what the model actually uses. For now, the preprocessing is only performed when NaNs are present, but this could become the default behaviour in the future.
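As a rough illustration of the "preprocess only when NaNs are present" logic, here is a minimal sketch. The helper name `preprocess_if_nans` is hypothetical, and the mean imputation plus min-max scaling is only an assumed stand-in for the preprocessing used at model-fitting time; the real pipeline may differ.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def preprocess_if_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: preprocess the features only if NaNs are present."""
    if not df.isna().any().any():
        # No NaNs: leave the featurized data untouched.
        return df

    # Columns that are entirely NaN cannot be imputed, so drop them first.
    df = df.dropna(axis=1, how="all")

    # Assumption: stand-in for the model's preprocessing
    # (mean imputation followed by min-max scaling).
    pipeline = Pipeline(
        [
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", MinMaxScaler()),
        ]
    )
    values = pipeline.fit_transform(df)
    return pd.DataFrame(values, columns=df.columns, index=df.index)
```

The returned dataframe keeps the original index and column names, so it can be passed to the feature selection step in place of the raw features.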
Big datasets. Computing the NMI on big datasets is slow. A simple solution is to compute the NMI on a random sample of the data. As shown in my master's thesis, the NMI converges with fewer than 10,000 data points for most matminer features.
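The sampling idea can be sketched as follows. This is an illustration only: the function name `sampled_mutual_info` and its parameters are hypothetical, and scikit-learn's `mutual_info_regression` (which is not normalized) is used as a stand-in for the NMI computation in this PR.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def sampled_mutual_info(
    features: pd.DataFrame,
    target: pd.Series,
    n_samples: int = 10_000,
    random_state: int = 0,
) -> pd.Series:
    """Hypothetical helper: score features on a random subset of the rows."""
    if len(features) > n_samples:
        # Draw rows without replacement and keep the target aligned.
        # Assumes the index is unique and shared by features and target.
        features = features.sample(n=n_samples, random_state=random_state)
        target = target.loc[features.index]

    # Stand-in scorer: unnormalized mutual information between each
    # feature and the target, not the NMI used in the actual code.
    scores = mutual_info_regression(
        features.to_numpy(), target.to_numpy(), random_state=random_state
    )
    return pd.Series(scores, index=features.columns)
```

Datasets with at most `n_samples` rows are scored in full, so the default of 10,000 only changes the behaviour for large datasets.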