Add smart weight: a better way to balance classification probabilities #40

Merged: 8 commits merged into stable from add-smart-weight on Feb 16, 2022

@Narnach Narnach commented Feb 16, 2022

What problem does this address?

Without smart weight, classification (and prediction) is strongly biased when you have very limited data. You see a word once and it becomes a "100% guarantee" that the word belongs to a specific category.

How does this PR address the issue?

It introduces a new option for Groupie: smart weight, which can be enabled during initialization (Groupie.new(smart_weight: true)) or after the fact (groupie.smart_weight = true).

Once this setting is enabled, each classification will always add a new default_weight value to each word's count prior to applying the score strategy, whether we've seen the word before or not.

The effect is instant: each new (never classified) word is now considered to have an equal probability to be in one of your categories due to each category having a default_weight that balances out. By classifying more words, you simply add more weight to one or more of the categories, tipping the balance away from the default equilibrium.

This means you need more data to more strongly shift the balance between categories, which is usually a good thing.
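
To make the effect concrete, here is a toy illustration. It is not Groupie's code: `category_shares` and the `0.5` weight are made up for this example, and it only assumes that a fixed extra count is added to every category before normalizing.

```ruby
# Toy illustration only; not Groupie's implementation.
# counts: how often a single word was seen in each category.
# default_weight: extra count added to every category before normalizing.
def category_shares(counts, default_weight: 0.0)
  weighted = counts.transform_values { |count| count + default_weight }
  total = weighted.values.sum
  return weighted.transform_values { 0.0 } if total.zero?

  weighted.transform_values { |count| count / total.to_f }
end

seen_once = { spam: 1, ham: 0 }
never_seen = { spam: 0, ham: 0 }

category_shares(seen_once)                       # spam 1.0, ham 0.0
category_shares(seen_once, default_weight: 0.5)  # spam 0.75, ham 0.25
category_shares(never_seen, default_weight: 0.5) # spam 0.5, ham 0.5
```

A single sighting no longer yields a 100% score, and an unseen word starts out evenly split.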

The default_weight value is dynamic. It's the number of unique words divided by the total count of words categorized. This means it scales along with the size of your dataset and thus balances out words with low frequency: it's hard to tell if they are a true signal or just a fluke.
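
As a rough sketch, the formula as worded above could be computed like this. The stand-alone `default_weight(groups)` helper and the nested-hash layout are assumptions for illustration; the real method lives on Groupie and uses cached totals.

```ruby
require "set"

# Hypothetical stand-alone helper; not Groupie's implementation.
# groups maps a category name to a hash of word => count.
def default_weight(groups)
  unique_words = groups.values.flat_map(&:keys).to_set
  total_count = groups.values.sum { |counts| counts.values.sum }
  return 0.0 if total_count.zero?

  # Unique words divided by total counted words, as described above.
  unique_words.size / total_count.to_f
end

groups = {
  spam: { "viagra" => 3, "free" => 2 },
  ham: { "meeting" => 4, "free" => 1 }
}

default_weight(groups) # 3 unique words / 10 counted words
```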

Breaking changes

In order to perform these calculations in a smart (i.e. fast) way, I've ended up refactoring a bunch of the internals. Two important changes are that Groupie::Group instances now have a link back to their Groupie instance and that both classes keep extra data to track unique words and total word counts. An unfortunate side effect is that old YAML-serialized data won't be drop-in compatible because it lacks some of this data.

I think that loading old data and adding all words to a new instance of Groupie will probably work. Due to this being an old gem that's mostly for my own usage (and I'm not serializing any data yet, so this is not a problem for me), I'm content to leave this as a warning and an exercise for whoever needs it.

History & details

The initial implementation. It proves the concept, but is rough and inefficient. I started with the Readme, prototyping the API I wanted, then specs, then the code. Rubocop was not happy about the added complexity, so cleanup was required.

  • Feat: add smart weight support (c80b50c)

Rubocop style update to stop it from producing huge indentations.

  • Style: apply a few layout rules regarding indentation (a471f58)

I realized that in order to make things better, I could not let Groups be fully independent classes. Newly added words needed to be tracked centrally in Groupie itself. It felt useful to start with moving the serialization test to Groupie, before I started making changes to Group.

  • Test: moved serialization spec from Groupie::Group to Groupie (c8a02ad)

Track local data in Group, so we don't have to do this in default_weight. This eliminates having to loop over each unique word in each category. A side effect of tracking more internal data this way is that it breaks functionality of old serialized data.

  • Change: Groups track their total word count (a8ee866)

More refactoring passes that build on top of the previous change: have Groupie track all unique words it knows about, so we only have to count that Set rather than first building the Set in order to count it.

  • Refactor: Groupie tracks known words to speedup default_weight (cfcfcf6)

Next is cleaning up the implementation of default_weight because the new tools allow this.

  • Refactor: simplified Groupie#default_weight (2f720a0)
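
The three refactoring steps above can be sketched in miniature as follows. `TinyClassifier` and `TinyGroup` are hypothetical stand-ins, not the actual Groupie classes; they only illustrate the idea of groups caching their totals and reporting words to a shared Set on the parent.

```ruby
require "set"

# Hypothetical miniature of the bookkeeping described above; not Groupie's code.
class TinyClassifier
  attr_reader :known_words

  def initialize
    @groups = {}
    @known_words = Set.new
  end

  # Fetch or create a group that holds a link back to this classifier.
  def [](name)
    @groups[name] ||= TinyGroup.new(self)
  end

  # Sum the cached per-group totals instead of walking every word.
  def total_word_count
    @groups.each_value.sum(&:total_word_count)
  end

  # With both caches in place, the weight is a single division.
  def default_weight
    return 0.0 if total_word_count.zero?

    known_words.size / total_word_count.to_f
  end
end

class TinyGroup
  attr_reader :total_word_count

  def initialize(parent)
    @parent = parent
    @word_counts = Hash.new(0)
    @total_word_count = 0
  end

  # Update the local counts and report each word to the parent.
  def add(*words)
    words.flatten.each do |word|
      @word_counts[word] += 1
      @total_word_count += 1
      @parent.known_words << word
    end
    self
  end
end

classifier = TinyClassifier.new
classifier[:spam].add("free", "free", "offer")
classifier[:ham].add("free", "meeting")

classifier.known_words.size # 3 unique words
classifier.total_word_count # 5 counted words
```

Data restored without these caches would report stale totals, which is the serialization caveat mentioned under breaking changes.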

Lastly I'm addressing Rubocop's (correct) assessment that Groupie#classify was horribly complex. Re-implementing it led to a cleaner approach using the tools we have.

  • Refactor: Groupie#classify internals are simpler (b4ed9a1)
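
For a sense of what a `Hash#transform_values`-based approach can look like, here is a simplified sketch. It is not the actual `Groupie#classify`: the method signatures, the nested-hash input, and the `:sum`/`:sqrt` strategy names are assumptions made for this example.

```ruby
# Hypothetical stand-in for the count strategy step; strategy names are assumptions.
def apply_count_strategy(count, strategy)
  case strategy
  when :sum then count
  when :sqrt then Math.sqrt(count)
  else raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

# groups maps a category name to a hash of word => count.
def classify(word, groups, strategy: :sum, default_weight: 0.0)
  # First pass: one score per group, default weight added before the strategy.
  scores = groups.transform_values do |counts|
    apply_count_strategy(counts.fetch(word, 0) + default_weight, strategy)
  end
  total = scores.values.sum
  return {} if total.zero?

  # Second pass: normalize the scores so they sum to 1.
  scores.transform_values { |score| score / total.to_f }
end

groups = { spam: { "free" => 9 }, ham: { "free" => 1 } }

classify("free", groups)                  # spam 0.9, ham 0.1
classify("free", groups, strategy: :sqrt) # spam 0.75, ham 0.25
classify("unknown", groups)               # empty hash
```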

Pushing the code to GitHub to test on older Ruby versions exposed an interface inconsistency in the built-in YAML library, so I've added Psych as a dev dependency to work around this. I'm testing YAML as a courtesy, not as a fully embraced feature (yet).

  • Specfix: switch to Psych for YAML test (09f5f74)
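
For comparison, a different workaround than the one chosen here is to detect at runtime which loader is available. The sketch below is an assumption-laden alternative, not what this PR does: `load_trusted_yaml` and `Point` are made-up names, and it should only ever be fed YAML from a trusted source.

```ruby
require "yaml"

# Hypothetical helper: pick whichever loader can restore arbitrary objects
# on the current Ruby/Psych version. Only use with trusted input.
def load_trusted_yaml(yaml)
  if YAML.respond_to?(:unsafe_load)
    YAML.unsafe_load(yaml)
  else
    # Older Psych versions lack unsafe_load; their load restores objects.
    YAML.load(yaml)
  end
end

Point = Struct.new(:x, :y)
restored = load_trusted_yaml(YAML.dump(Point.new(1, 2)))
restored == Point.new(1, 2)
```

Depending on the Psych gem directly, as done in this PR, avoids the branch entirely.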

As the force push in the history of this PR shows, I've performed a few interactive rebase fixups to correct some things in earlier commits.

What's smart weight? A default weight/count that each word gets, even if
we've never seen it before. The main advantage is that new words now get
50/50 equal weighting between groups, and we need evidence to disturb
this balance. More evidence means a stronger weighting towards a group.

All you need to do to enable this is set `smart_weight` to true, which
can be done during initialization (`Groupie.new smart_weight: true`) or
via the `groupie.smart_weight = true` setter.

I'm using these layout rules in another project and just ran into a
situation where they were needed. They basically prevent multi-line
arguments from being indented really far. Oh, and newlines are now
consistently \n.

It's more useful to test that the entire Groupie can be serialized than
just one Group.

This is a refactoring with side effects, so it's a change.

Tracking the total word count in each Group means we don't have to count
it manually in Groupie#default_weight. A side effect is that old
serialized data won't have this cached data, so its default weight under
smart weight will not be correct. Adding old data to a new instance
should fix this.

This eliminates another piece of waste from Groupie#default_weight by
not having to iterate all Groups to compile a list of known unique
words.

Now that we have access to unique words and total word counts in Groups,
we can rewrite the method to simply sum what we have, without explicit
iteration and manual summing.

The net result is code that is much more straightforward to read.

This is another old method that gained new features over time, which
complicated it. The double iteration and calls to apply_count_strategy
bothered me. Using Hash#transform_values makes it cleaner.

On Ruby 2.6 and 2.7 the bundled version of YAML has an inconsistent
interface: YAML.unsafe_load, YAML.safe_load and YAML.load are not
consistently present. So embrace the gem version of the underlying
library (Psych) and simply use the latest version.
@Narnach Narnach merged commit 8728e98 into stable Feb 16, 2022
@Narnach Narnach deleted the add-smart-weight branch February 16, 2022 14:34