Feat: add smart weight support
What's smart weight? A default weight/count that each word gets, even if
we've never seen it before. The main advantage is that new words now get
equal weighting across groups (50/50 with two groups), and it takes
evidence to disturb this balance. More evidence means a stronger
weighting towards a group.

All you need to do to enable this is set `smart_weight` to true, which
can be done during initialization (`Groupie.new smart_weight: true`) or
via the `groupie.smart_weight = true` setter.
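
To make the "evidence disturbs the balance" idea concrete, here is a minimal standalone sketch. It does not use the gem: `weighted_shares` is a hypothetical helper, and the default weight of 1.0 is an assumed value chosen for illustration.

```ruby
# Hypothetical helper (not part of Groupie): each group's share of a word
# after adding the same default weight to every group's raw count.
def weighted_shares(counts, default_weight)
  total = counts.values.sum { |count| count + default_weight }
  return {} if total.zero?

  counts.transform_values { |count| (count + default_weight) / total.to_f }
end

# An unseen word has a raw count of 0 in every group, so the shares are equal.
unseen = weighted_shares({ spam: 0, ham: 0 }, 1.0)

# One sighting in :spam tilts the balance; nine sightings tilt it much further.
weak = weighted_shares({ spam: 1, ham: 0 }, 1.0)
strong = weighted_shares({ spam: 9, ham: 0 }, 1.0)

puts unseen.inspect
puts weak.inspect
puts strong.inspect
```

With these assumed inputs the unseen word splits evenly, one sighting gives `:spam` a 2/3 share, and nine sightings give it 10/11.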
Narnach committed Feb 16, 2022
1 parent b06f923 commit c80b50c
Showing 4 changed files with 113 additions and 4 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -3,6 +3,7 @@
The next release will be 0.5.0 due to a breaking change: removal of code that was deprecated in 0.4.0.

- Breaking: remove `String#tokenize` core extension; please use `Groupie.tokenize(string)` instead
+- Feat: add support for smart default weights, reducing the effect of low data on predictions
- Deps: add Ruby 3.1 to list of tested & supported gems
- Chore: require multi-factor authentication to publish gem updates
- Chore: add Security.md to advertise a security policy
19 changes: 18 additions & 1 deletion README.md
@@ -92,6 +92,21 @@ groupie.classify_text(test_tokens, :unique)
test_tokens - (test_tokens & groupie.unique_words)
# => ["please", "to", "reset", "awesome"]
# If you'd be classifying email, you can assume that common email headers will get ignored this way.
+
+# If you're just starting out, your incomplete data could lead to dramatic misrepresentations of the data.
+# To balance against this, you can enable smart weight:
+groupie.smart_weight = true
+# You could also set it during initialization via Groupie.new(smart_weight: true)
+# What's so useful about it? It adds a default weight to _all_ words, even the ones you haven't
+# seen yet, which counter-acts the data you have. This shines in low data situations,
+# reducing the impact of the few words you have seen before.
+groupie.default_weight
+# => 1.2285714285714286
+# Classifying the same text as before should consider all words, and add this default weight to all words
+# It basically gives all groups the likelihood of "claiming" a word,
+# unless there is strong data to suggest otherwise.
+groupie.classify_text(test_tokens)
+# => {:spam=>0.5241046831955923, :ham=>0.4758953168044077}
```

Persistence can be naively done by using YAML:
@@ -112,7 +127,9 @@ For I'm still experimenting with Groupie in [Infinity Feed](https://www.infinity

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment. Rubocop is available via `bin/rubocop` with some friendly default settings.

-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+To install this gem onto your local machine, run `bundle exec rake install`.
+
+To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org). For obvious reasons, only the project maintainer can do this.

## Contributing

38 changes: 35 additions & 3 deletions lib/groupie.rb
@@ -2,14 +2,19 @@

 require_relative 'groupie/version'
 require_relative 'groupie/group'
+require 'set'

 # Groupie is a text grouper and classifier, using naive Bayesian filtering.
 class Groupie
   # Wrap all errors we raise in this so our own errors are recognizable.
   class Error < StandardError; end

-  def initialize
+  attr_accessor :smart_weight
+
+  # @param [true, false] smart_weight (false) Whether smart weight is enabled or not.
+  def initialize(smart_weight: false)
     @groups = {}
+    @smart_weight = smart_weight
   end

   # Turn a String (or anything else that responds to #to_s) into an Array of String tokens.
@@ -58,13 +63,14 @@ def classify_text(words, strategy = :sum)
   # @raise [Groupie::Error] Raise when an invalid strategy is provided
   def classify(entry, strategy = :sum)
     results = {}
+    default_weight = self.default_weight
     total_count = @groups.values.inject(0) do |sum, group|
-      sum + apply_count_strategy(group.count(entry), strategy)
+      sum + apply_count_strategy(default_weight + group.count(entry), strategy)
     end
     return results if total_count.zero?

     @groups.each do |name, group|
-      count = apply_count_strategy(group.count(entry), strategy)
+      count = apply_count_strategy(default_weight + group.count(entry), strategy)
       results[name] = count.positive? ? count.to_f / total_count : 0.0
     end

@@ -91,6 +97,32 @@ def unique_words
     total_count.keys
   end

+  # Default weight is used when +smart_weight+ is enabled.
+  # Each word's count is increased by the +default_weight+ value,
+  # which is the average frequency of each unique word we know about.
+  #
+  # Example: if we have indexed 1000 total words, of which 500 were unique,
+  # the default_weight would be 1000/500=2.0
+  #
+  # @return [Float] The default weight for all words
+  def default_weight
+    return 0.0 unless smart_weight
+
+    # Find all unique words and the total count of all words
+    total_words = 0
+    unique_words = Set.new
+    @groups.each_value do |group|
+      group.word_counts.each do |word, count|
+        unique_words << word
+        total_words += count
+      end
+    end
+    total_unique_words = unique_words.count
+    return 0.0 unless total_unique_words.positive?
+
+    total_words / total_unique_words.to_f
+  end
+
   private

   # Calculate grouped scores
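
The `default_weight` calculation above can be tried in isolation. The following is a rough sketch that replaces Groupie's group objects with plain hashes: `average_word_frequency` and its input shape (group name mapped to word counts) are assumptions made for illustration, not part of the gem.

```ruby
require 'set'

# Hypothetical stand-in for Groupie's groups: group name => { word => count }.
# Returns total word occurrences divided by the number of unique words.
def average_word_frequency(word_counts_by_group)
  total_words = 0
  unique_words = Set.new
  word_counts_by_group.each_value do |word_counts|
    word_counts.each do |word, count|
      unique_words << word
      total_words += count
    end
  end
  return 0.0 if unique_words.empty?

  total_words / unique_words.size.to_f
end

# 3 occurrences over 2 unique words.
single_group = average_word_frequency({ one: { 'test' => 2, 'groupie' => 1 } })

# 'test' appears in both groups but counts as one unique word:
# 4 occurrences over 2 unique words.
two_groups = average_word_frequency({ one: { 'test' => 2 }, two: { 'test' => 1, 'two' => 1 } })

puts single_group
puts two_groups
```

Using a `Set` means a word shared between groups is counted once as unique while all of its occurrences still add to the total, which is why overlapping vocabularies push the weight up.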
59 changes: 59 additions & 0 deletions spec/groupie_spec.rb
@@ -220,4 +220,63 @@
       Groupie.tokenize('"first last"').should == %w[first last]
     end
   end
+
+  describe 'when smart_weight is enabled' do
+    let(:groupie) { Groupie.new smart_weight: true }
+
+    describe '#default_weight' do
+      it 'returns 0.0 by default' do
+        expect(groupie.default_weight).to eq(0.0)
+      end
+
+      it 'returns the average frequency of unique words' do
+        groupie[:one].add %w[test test groupie]
+        # 3 words / 2 unique words = 1.5
+        expect(groupie.default_weight).to eq(1.5)
+      end
+
+      it 'combines the results from all groups' do
+        groupie[:one].add %w[test test]
+        groupie[:two].add %w[test two]
+        # 4 total occurrences divided by 2 unique words
+        expect(groupie.default_weight).to eq(2.0)
+      end
+    end
+
+    describe '#classify' do
+      it 'gives new words equal weighting across groups' do
+        groupie[:one].add %w[one]
+        groupie[:two].add %w[two]
+        expect(groupie.classify('new')).to eq({ one: 0.5, two: 0.5 })
+      end
+
+      it 'adds default_weight to all word counts prior to applying the strategy' do
+        groupie[:one].add %w[one]
+        groupie[:two].add %w[two]
+        # sum strategy adds the word count to the default weight (1) and divides by the total weight (3)
+        expect(groupie.classify('one')).to eq({ one: 2 / 3.0, two: 1 / 3.0 })
+      end
+    end
+
+    describe '#classify_text' do
+      it 'gives new words equal weighting across groups' do
+        groupie[:one].add %w[one]
+        groupie[:two].add %w[two]
+        # Classify text with one word is basically a less efficient classify()
+        expect(groupie.classify_text(%w[new])).to eq({ one: 0.5, two: 0.5 })
+      end
+
+      it 'adds default_weight to all word counts prior to applying the strategy' do
+        groupie[:one].add %w[one]
+        groupie[:two].add %w[two]
+        # Sum strategy for each word, and then we average the results for each group
+        # - new should get 1/2 in each group
+        # - one should get 2/3 in group one, and 1/3 in group two
+        expect(groupie.classify_text(%w[new one])).to eq({
+          one: ((2 / 3.0) + (1 / 2.0)) / 2.0,
+          two: ((1 / 3.0) + (1 / 2.0)) / 2.0
+        })
+      end
+    end
+  end
 end
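
The arithmetic in the last `#classify_text` example can be checked by hand with a small sketch. It follows the spec comments (sum strategy per word, then an average per group) and does not call the gem; `word_shares`, `text_shares`, the plain-hash group layout, and the skipping of words with no weight are all assumptions.

```ruby
# Hypothetical stand-ins (not Groupie's API): groups map a name to word counts.
# Share of one word per group, after adding the default weight to each count.
def word_shares(groups, word, default_weight)
  weighted = groups.transform_values { |counts| counts.fetch(word, 0) + default_weight }
  total = weighted.values.sum
  return {} if total.zero?

  weighted.transform_values { |weight| weight / total.to_f }
end

# Average the per-word shares for each group, as the spec comment describes.
# Assumption: words that carry no weight in any group are skipped.
def text_shares(groups, words, default_weight)
  per_word = words.map { |word| word_shares(groups, word, default_weight) }.reject(&:empty?)
  return {} if per_word.empty?

  groups.keys.to_h do |name|
    [name, per_word.sum { |shares| shares[name] } / per_word.size]
  end
end

groups = { one: { 'one' => 1 }, two: { 'two' => 1 } }
default_weight = 1.0 # 2 indexed words / 2 unique words

result = text_shares(groups, %w[new one], default_weight)
puts result.inspect
```

Here `new` contributes 1/2 to each group and `one` contributes 2/3 and 1/3, so the averages match the values the spec expects.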
