feature selection updates #58

ppdebreuck · 2021-08-10T11:51:21Z

Feature selection:

NaNs in the featurized dataframe currently cause an error when performing feature selection on it (the NMI computation does not work with NaNs). The solution is to preprocess the data exactly as is currently done for model fitting. This solves the NaN issue and also brings the data closer to what the model actually uses. For now, the preprocessing is only performed when NaNs are present, but this could become the default behaviour in the future.
Big datasets. Computing the NMI on big datasets is slow. A simple solution is to compute the NMI on a sample of the data. As shown in my master's thesis, the NMI converges with fewer than 10,000 data points for most matminer features.
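The sampling idea above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper name `subsample_for_scoring`, its parameters, and the default of 10,000 rows are assumptions made for the example, and it assumes the dataframe index is unique so that features and targets stay aligned.

```python
import pandas as pd


def subsample_for_scoring(df, targets, n_samples=10_000, random_state=0):
    """Return at most n_samples aligned rows of features and targets.

    Hypothetical helper: assumes `df` and `targets` share a unique index.
    """
    if len(df) <= n_samples:
        # Small dataset: score on everything, nothing to sample.
        return df, targets
    # Draw rows without replacement, then pick the matching target rows.
    sampled = df.sample(n=n_samples, random_state=random_state)
    return sampled, targets.loc[sampled.index]
```

The NMI (or any other dependence score) would then be computed on the returned sample instead of on the full dataframe.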

- Scale only if nans are present
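As a rough sketch of the "only if NaNs are present" rule: the function below returns the dataframe untouched when it is NaN-free, and otherwise scales and imputes it. The function name, the min-max scaling to [0, 1], and the fill value of -1 are assumptions chosen for illustration; the actual preprocessing is whatever the model fitting step already applies.

```python
import pandas as pd


def preprocess_if_nans(df, fill_value=-1.0):
    """Scale and impute only when the dataframe contains NaNs.

    Hypothetical helper; scaling range and fill value are assumptions.
    """
    if not df.isna().to_numpy().any():
        # Keep past behaviour: no NaNs, so the data is left as-is.
        return df
    lo, hi = df.min(), df.max()  # column-wise, NaNs are skipped
    span = (hi - lo).replace(0, 1.0)  # avoid division by zero on constant columns
    scaled = (df - lo) / span
    # Impute after scaling so the fill value sits outside the [0, 1] range.
    return scaled.fillna(fill_value)
```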

ppdebreuck marked this pull request as ready for review November 9, 2021 09:15

ppdebreuck force-pushed the nans_in_feat_selection branch from be53d28 to 195447b Compare November 9, 2021 12:58

ppdebreuck added 4 commits November 9, 2021 14:11

preprocess data for feature selection

b162487

preprocess only if larger than n_samples

ccc3cc3

sample fix

7f49896

preprocessing keep past behaviour:

37f2f10

- Scale only if nans are present

ppdebreuck force-pushed the nans_in_feat_selection branch from 195447b to 37f2f10 Compare November 9, 2021 13:14

ppdebreuck merged commit f699775 into master Nov 9, 2021
