Refactored featurization into new module with presets #15

Merged
ppdebreuck merged 1 commit into master from ml-evs/presets
Oct 5, 2020

Conversation

ml-evs
Collaborator

@ml-evs ml-evs commented Oct 1, 2020

This PR moves all featurization out of the preprocessing module and into a new featurization module, which includes a class MODFeaturizer and a preset DeBreuck2020Featurizer that replicates the existing featurization process.

An optional featurizer argument is added to MODData to specify the preset, either by passing a featurizer object or by passing a string that is looked up in the featurizers.presets.FEATURIZER_PRESETS dictionary, e.g. DeBreuck2020. The base class contains the generic parts of the featurize_composition(...), featurize_structure(...) and featurize_site(...) functions, and the preset overrides them to perform post-processing.
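For readers unfamiliar with this pattern, here is a minimal, standard-library-only sketch of how a string-or-object argument can be resolved against a preset dictionary, and how a preset can override a generic base-class step to add post-processing. The class bodies, the `resolve_featurizer` helper and the column-renaming rule are illustrative assumptions, not the code in this PR.

```python
class MODFeaturizer:
    """Stand-in base class holding the generic part of a featurization step."""

    def featurize_composition(self, rows):
        # Generic step: the real class would apply matminer featurizers here.
        # In this sketch the rows (dicts of column -> value) are just copied.
        return [dict(row) for row in rows]


class DeBreuck2020Featurizer(MODFeaturizer):
    """Stand-in preset that overrides the generic step to post-process."""

    def featurize_composition(self, rows):
        rows = super().featurize_composition(rows)
        # Hypothetical post-processing: normalise column names.
        return [{k.replace(" ", "_"): v for k, v in row.items()} for row in rows]


# Maps preset names to classes, mirroring the role of FEATURIZER_PRESETS.
FEATURIZER_PRESETS = {"DeBreuck2020": DeBreuck2020Featurizer}


def resolve_featurizer(featurizer="DeBreuck2020"):
    """Hypothetical helper: accept a featurizer object or a preset name."""
    if isinstance(featurizer, MODFeaturizer):
        return featurizer
    if isinstance(featurizer, str):
        if featurizer not in FEATURIZER_PRESETS:
            raise ValueError(f"Unknown featurizer preset: {featurizer!r}")
        return FEATURIZER_PRESETS[featurizer]()
    raise TypeError("featurizer must be a MODFeaturizer or a preset name")


preset = resolve_featurizer("DeBreuck2020")
features = preset.featurize_composition([{"mean atomic mass": 20.0}])
```

Resolving the argument in one place keeps the string lookup and the object pass-through on the same code path, so the rest of the class only ever sees a featurizer instance.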

Most of the work in the DeBreuck2020Featurizer preset went into ensuring backwards compatibility of column names and similar details. New featurizers should only need to provide the list of BaseFeaturizer objects (assuming the column names work nicely).
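As a rough illustration of "just providing the list", the sketch below defines a new preset purely by listing featurizer objects on the class. The `composition_featurizers` attribute name and the two toy featurizers are assumptions made for this example; the toys only mimic the `feature_labels()` / `featurize()` shape of matminer's BaseFeaturizer and are not real matminer classes.

```python
class CountAtoms:
    """Toy stand-in for a BaseFeaturizer: total number of atoms."""

    def feature_labels(self):
        return ["n_atoms"]

    def featurize(self, composition):
        return [sum(composition.values())]


class CountElements:
    """Toy stand-in for a BaseFeaturizer: number of distinct elements."""

    def feature_labels(self):
        return ["n_elements"]

    def featurize(self, composition):
        return [len(composition)]


class MODFeaturizer:
    """Stand-in base class: applies every listed featurizer to every entry."""

    composition_featurizers = ()  # hypothetical attribute name

    def featurize_composition(self, compositions):
        rows = []
        for comp in compositions:
            row = {}
            for featurizer in self.composition_featurizers:
                labels = featurizer.feature_labels()
                row.update(zip(labels, featurizer.featurize(comp)))
            rows.append(row)
        return rows


class MyPreset(MODFeaturizer):
    # A new preset only needs to supply the list of featurizers.
    composition_featurizers = (CountAtoms(), CountElements())


rows = MyPreset().featurize_composition([{"Si": 1, "O": 2}])
```

No method overrides are needed here because the column names produced by the featurizers are used as-is; a preset would only override the method when post-processing is required.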

Still to consider (maybe outside the scope of this PR):

  • How should we handle fitting the featurizers to the dataframe? Currently the preset assumes that the entire data set has been passed, so all featurizers are fit at that point. The private methods for applying the featurizers have a bool argument for deciding whether to refit the featurizers.
  • What other presets should we include? I'm writing my own preset for the OMDB dataset. Perhaps there should also be an AllMatminer preset, which uses every feature, and a QuickMatminer preset, which uses "fast" features only?
  • Should the MODFeaturizer class also contain methods for tracking metadata of the featurizers that were used?
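
To make the fitting question in the first bullet concrete, here is a small sketch of a helper with a boolean refit switch. The `apply_featurizer` function, its `refit` parameter name and the `MeanCentre` toy featurizer are hypothetical; they only illustrate why refitting on new data changes the features.

```python
class MeanCentre:
    """Toy fittable featurizer: subtracts the mean seen during fit()."""

    def __init__(self):
        self.mean_ = None

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def featurize(self, value):
        if self.mean_ is None:
            raise RuntimeError("featurizer has not been fitted")
        return value - self.mean_


def apply_featurizer(featurizer, values, refit=True):
    """Hypothetical helper: optionally refit before featurizing."""
    if refit:
        featurizer.fit(values)
    return [featurizer.featurize(v) for v in values]


featurizer = MeanCentre()
# First call fits on the full data set, as the preset currently assumes.
train_features = apply_featurizer(featurizer, [1.0, 2.0, 3.0])
# Later data reuses the stored fit instead of refitting on a single sample.
new_features = apply_featurizer(featurizer, [4.0], refit=False)
```

With `refit=True` the second call would centre the new sample on itself and return `0.0`, which is the kind of silent inconsistency the flag is meant to avoid.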

- Optional featurizer argument added to MODData
- DeBreuck2020Featurizer preset, which is the default, contains
  everything needed to recreate the paper
- Abstract MODFeaturizer class can be inherited from by new presets
@ppdebreuck ppdebreuck merged commit 666e247 into master Oct 5, 2020
@ml-evs ml-evs deleted the ml-evs/presets branch November 16, 2020 21:18
2 participants