Add smart weight: a better way to balance classification probabilities #40
What problem does this address?
Without smart weight, your classification (and prediction) is strongly biased when you have very limited data. You see a word once and it becomes a "100% guarantee" that the word belongs to a specific category.
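To make that bias concrete, here is a minimal plain-Ruby sketch (with made-up counts, not the gem's actual code) of what naive frequency scoring does with a word that has been seen exactly once:

```ruby
# Hypothetical data: "refund" was categorized once, as :spam, never as :ham.
counts = { spam: { "refund" => 1 }, ham: { "hello" => 3 } }

word = "refund"
per_category  = counts.transform_values { |c| c.fetch(word, 0) }
total         = per_category.values.sum.to_f
probabilities = per_category.transform_values { |n| n / total }
# => { spam: 1.0, ham: 0.0 } -- one observation becomes a "100% guarantee"
```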
How does this PR address the issue?
It introduces a new option for Groupie: smart weight, which can be enabled during initialization (`Groupie.new(smart_weight: true)`) or after the fact (`groupie.smart_weight = true`).

Once this setting is enabled, each classification will always add a `default_weight` value to each word's count prior to applying the score strategy, whether we've seen the word before or not.

The effect is instant: each new (never classified) word is now considered equally likely to be in any of your categories, because each category has the same `default_weight` and they balance out. By classifying more words, you simply add more weight to one or more of the categories, tipping the balance away from the default equilibrium. This means you need more data to more strongly shift the balance between categories, which is usually a good thing.
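The balancing effect described above can be sketched in plain Ruby. This is a simplified illustration with made-up counts, not the gem's actual implementation:

```ruby
# Hypothetical word counts per category -- not Groupie's internal structure.
counts = {
  spam: { "buy" => 4, "now" => 2 },
  ham:  { "hello" => 3 }
}

# default_weight as described in the PR: unique words / total words categorized.
unique_words   = counts.values.flat_map(&:keys).uniq.size # 3
total_count    = counts.values.sum { |c| c.values.sum }   # 9
default_weight = unique_words.to_f / total_count          # 1/3

# Score one word per category, adding default_weight to every count first.
def scores(counts, word, default_weight)
  raw = counts.transform_values { |c| c.fetch(word, 0) + default_weight }
  total = raw.values.sum
  raw.transform_values { |weight| weight / total }
end

scores(counts, "unseen", default_weight) # => { spam: 0.5, ham: 0.5 }
scores(counts, "buy", default_weight)    # spam dominates, but is no longer 1.0
```

A never-seen word now splits evenly across categories, and even a strongly skewed word stays below a 100% score.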
The `default_weight` value is dynamic: it's the number of unique words divided by the total count of words categorized. This means it scales along with the size of your dataset and thus balances out words with low frequency, where it's hard to tell if they are a true signal or just a fluke.

Breaking changes
In order to perform these calculations in a smart (i.e. fast) way, I've ended up refactoring a bunch of the internals. Two important changes are that `Groupie::Group`s now have a link back to their `Groupie` instance, and that both classes keep extra data to track unique words and total word counts. An unfortunate side effect is that old YAML-serialized data won't be drop-in compatible, because it lacks some of this data.

I think that loading old data and adding all words to a new instance of Groupie will probably work. Since this is an old gem that's mostly for my own usage (and I'm not serializing any data yet, so this is not a problem for me), I'm content to leave this as a warning and an exercise for whomever needs it.
History & details
The initial implementation. It proves the concept, but is rough and inefficient. I started with the Readme, prototyping the API I wanted, then specs, then the code. Rubocop was not happy about the added complexity, so cleanup was required.
Rubocop style update to make it not perform huge indentations.
I realized that in order to make things better, I could not let Groups be fully independent classes. Newly added words needed to be tracked centrally in `Groupie` itself. It felt useful to start by moving the serialization test to Groupie, before I started making changes to Group.

Track local data in Group, so we don't have to do this in `default_weight`. This eliminates having to loop over each unique word in each category. A side effect of tracking more internal data this way is that it breaks compatibility with old serialized data.

More refactoring passes that build on top of the previous change: have Groupie track all unique words it knows about, so we only have to count that Set rather than first building the Set in order to count it.
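The gist of that refactor is replacing a rebuild-on-every-query Set with one that is kept up to date on every add. A rough plain-Ruby illustration, not the gem's actual internals:

```ruby
require "set"

# Before: every default_weight calculation rebuilt the unique-word Set
# by looping over each word in each category.
def unique_count_rebuild(groups)
  groups.values.flatten.to_set.size
end

# After: one central Set is updated on every add, so counting unique
# words is just reading the Set's size.
groups = Hash.new { |hash, key| hash[key] = [] }
unique_words = Set.new

add = lambda do |category, word|
  groups[category] << word
  unique_words << word
end

add.call(:spam, "buy")
add.call(:spam, "now")
add.call(:spam, "buy")
add.call(:ham, "hello")

unique_words.size            # incremental count: 3
unique_count_rebuild(groups) # same answer, but recomputed on every call
```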
Next is cleaning up the implementation of `default_weight`, because the new tools allow it.

Lastly, I'm addressing Rubocop's (correct) assessment that `Groupie#classify` was horribly complex. Re-implementing it led to a cleaner approach using the tools we have.
Pushing the code to Github to test on older Ruby versions exposed an interface inconsistency in the built-in YAML library, so I've added Psych as a dev dependency to work around this. I'm testing YAML as a courtesy, not as a fully embraced feature (yet).
As the force push in the history of this PR shows, I've performed a few interactive rebase fixups to correct some things in earlier commits.