Add smart weight: a better way to balance classification probabilities #40
What problem does this address?
Without smart weight, your classification (and prediction) is strongly biased when you have very limited data. You see a word once and it becomes a "100% guarantee" that the word belongs to a specific category.
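To make that bias concrete, here is a minimal plain-Ruby sketch (with made-up counts, not the gem's actual code) of what naive frequency scoring does with a word that has been seen exactly once:

```ruby
# Hypothetical data: "refund" was categorized once, as :spam, never as :ham.
counts = { spam: { "refund" => 1 }, ham: { "hello" => 3 } }

word = "refund"
per_category  = counts.transform_values { |c| c.fetch(word, 0) }
total         = per_category.values.sum.to_f
probabilities = per_category.transform_values { |n| n / total }
# => { spam: 1.0, ham: 0.0 } -- one observation becomes a "100% guarantee"
```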
How does this PR address the issue?
It introduces a new option for Groupie: smart weight, which can be enabled during initialization (`Groupie.new(smart_weight: true)`) or after the fact (`groupie.smart_weight = true`).

Once this setting is enabled, each classification will always add a `default_weight` value to each word's count prior to applying the score strategy, whether we've seen the word before or not.

The effect is instant: each new (never classified) word is now considered equally likely to be in any of your categories, because each category has the same `default_weight` and they balance out. By classifying more words, you simply add more weight to one or more of the categories, tipping the balance away from the default equilibrium. This means you need more data to more strongly shift the balance between categories, which is usually a good thing.
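The balancing effect described above can be sketched in plain Ruby. This is a simplified illustration with made-up counts, not the gem's actual implementation:

```ruby
# Hypothetical word counts per category -- not Groupie's internal structure.
counts = {
  spam: { "buy" => 4, "now" => 2 },
  ham:  { "hello" => 3 }
}

# default_weight as described in the PR: unique words / total words categorized.
unique_words   = counts.values.flat_map(&:keys).uniq.size # 3
total_count    = counts.values.sum { |c| c.values.sum }   # 9
default_weight = unique_words.to_f / total_count          # 1/3

# Score one word per category, adding default_weight to every count first.
def scores(counts, word, default_weight)
  raw = counts.transform_values { |c| c.fetch(word, 0) + default_weight }
  total = raw.values.sum
  raw.transform_values { |weight| weight / total }
end

scores(counts, "unseen", default_weight) # => { spam: 0.5, ham: 0.5 }
scores(counts, "buy", default_weight)    # spam dominates, but is no longer 1.0
```

A never-seen word now splits evenly across categories, and even a strongly skewed word stays below a 100% score.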
The `default_weight` value is dynamic: it's the number of unique words divided by the total count of words categorized. This means it scales along with the size of your dataset and thus balances out words with low frequency, where it's hard to tell if they are a true signal or just a fluke.

Breaking changes
In order to perform these calculations in a smart (i.e. fast) way, I've ended up refactoring a bunch of the internals. Two important changes are that `Groupie::Group`s now have a link back to their `Groupie` instance, and that both classes keep extra data to track unique words and total word counts. An unfortunate side effect is that old YAML-serialized data won't be drop-in compatible, because it lacks some of this data.

I think that loading old data and adding all words to a new instance of Groupie will probably work. Since this is an old gem that's mostly for my own usage (and I'm not serializing any data yet, so this is not a problem for me), I'm content to leave this as a warning and an exercise for whomever needs it.
History & details
The initial implementation. It proves the concept, but is rough and inefficient. I started with the Readme, prototyping the API I wanted, then specs, then the code. Rubocop was not happy about the added complexity, so cleanup was required.
Rubocop style update to make it not perform huge indentations.
I realized that in order to make things better, I could not let Groups be fully independent classes. Newly added words needed to be tracked centrally in `Groupie` itself. It felt useful to start by moving the serialization test to Groupie, before I started making changes to Group.

Track local data in Group, so we don't have to do this in `default_weight`. This eliminates having to loop over each unique word in each category. A side effect of tracking more internal data this way is that it breaks compatibility with old serialized data.

More refactoring passes that build on top of the previous change: have Groupie track all unique words it knows about, so we only have to count that Set rather than first building the Set in order to count it.
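The gist of that refactor is replacing a rebuild-on-every-query Set with one that is kept up to date on every add. A rough plain-Ruby illustration, not the gem's actual internals:

```ruby
require "set"

# Before: every default_weight calculation rebuilt the unique-word Set
# by looping over each word in each category.
def unique_count_rebuild(groups)
  groups.values.flatten.to_set.size
end

# After: one central Set is updated on every add, so counting unique
# words is just reading the Set's size.
groups = Hash.new { |hash, key| hash[key] = [] }
unique_words = Set.new

add = lambda do |category, word|
  groups[category] << word
  unique_words << word
end

add.call(:spam, "buy")
add.call(:spam, "now")
add.call(:spam, "buy")
add.call(:ham, "hello")

unique_words.size            # incremental count: 3
unique_count_rebuild(groups) # same answer, but recomputed on every call
```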
Next is cleaning up the implementation of `default_weight`, because the new tools allow it.

Lastly, I'm addressing Rubocop's (correct) assessment that `Groupie#classify` was horribly complex. Re-implementing it led to a cleaner approach using the tools we have.
Pushing the code to Github to test on older Ruby versions exposed an interface inconsistency in the built-in YAML library, so I've added Psych as a dev dependency to work around this. I'm testing YAML as a courtesy, not as a fully embraced feature (yet).
As the force push in the history of this PR shows, I've performed a few interactive rebase fixups to correct some things in earlier commits.