Add smart weight: a better way to balance classification probabilities #40

Merged 8 commits on Feb 16, 2022

Commits on Feb 16, 2022

  1. Feat: add smart weight support

    What's smart weight? A default weight/count that each word gets, even if
    we've never seen it before. The main advantage is that new words now
    start out equally weighted between groups (50/50 with two groups), and
    we need evidence to disturb this balance. More evidence means a stronger
    weighting towards a group.
    
    All you need to do to enable this is set `smart_weight` to true, which
    can be done during initialization (`Groupie.new smart_weight: true`) or
    via the `groupie.smart_weight = true` setter.
    Narnach committed Feb 16, 2022
    c80b50c
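To make the smart weight idea concrete, here is a minimal sketch of the behaviour the commit message describes. `ToyClassifier` and its methods are hypothetical names invented for this illustration, not Groupie's code, and the fixed pseudo-count of 1 per group is an assumption (the later commits suggest Groupie derives its default weight from the data instead).

```ruby
# Hypothetical toy classifier for illustration; NOT Groupie's implementation.
class ToyClassifier
  # Assumption: a fixed pseudo-count of 1 per group for every word.
  DEFAULT_WEIGHT = 1.0

  attr_accessor :smart_weight

  def initialize(smart_weight: false)
    @smart_weight = smart_weight
    @groups = {} # group name => { word => count }
  end

  def add(group, *words)
    counts = (@groups[group] ||= Hash.new(0))
    words.each { |word| counts[word] += 1 }
  end

  # Returns { group name => share of the total weight } for one word.
  def classify(word)
    bonus = smart_weight ? DEFAULT_WEIGHT : 0.0
    weights = @groups.to_h { |name, counts| [name, counts.fetch(word, 0) + bonus] }
    total = weights.values.sum
    return {} if total.zero?

    weights.to_h { |name, weight| [name, weight / total] }
  end
end

classifier = ToyClassifier.new(smart_weight: true)
classifier.add(:spam, "buy", "buy", "buy")
classifier.add(:ham, "hello")

unseen = classifier.classify("brandnew") # no evidence: equal shares per group
seen = classifier.classify("buy")        # evidence pulls the result towards :spam
```

In this sketch an unseen word is split evenly between the two groups, while each additional observation of "buy" in `:spam` moves its share further from the even split. Without the default weight, the unseen word would produce no result at all.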
  2. Style: apply a few layout rules regarding indentation

    I'm using these in another place and just ran into a situation where the
    rules were needed. They basically prevent multi-line arguments from being
    indented really far. Oh, and newlines are now consistently \n.
    Narnach committed Feb 16, 2022
    a471f58
  3. Test: moved serialization spec from Groupie::Group to Groupie

    It's more useful to test that the entire Groupie can be serialized than
    just one Group.
    Narnach committed Feb 16, 2022
    c8a02ad
  4. Change: Groups track their total word count

    This is a refactoring with side effects, so it's a change.
    
    Tracking this means we don't have to count it manually in
    Groupie#default_weight. A side effect is that old serialized data won't
    have this cached total, so the default weight used by smart weight will
    not be correct for it. Adding old data to a new instance should fix this.
    Narnach committed Feb 16, 2022
    a8ee866
  5. Refactor: Groupie tracks known words to speed up default_weight

    This eliminates another piece of waste from Groupie#default_weight by
    not having to iterate all Groups to compile a list of known unique
    words.
    Narnach committed Feb 16, 2022
    cfcfcf6
  6. Refactor: simplified Groupie#default_weight

    Now that we have access to unique words and total word counts in Groups,
    we can rewrite the method to simply sum what we have, without explicit
    iteration and manual summing.
    
    The net result is code that is much more straightforward to read.
    Narnach committed Feb 16, 2022
    2f720a0
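The bookkeeping described in commits 4 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `ToyGroup` and `ToyCollection` are hypothetical names, and the formula (total word count divided by the number of unique words) is a guess based on the commit messages, which only say that the method sums cached totals and uses the set of known words.

```ruby
require "set"

# Hypothetical group that caches its total word count on every write.
class ToyGroup
  attr_reader :total_word_count

  def initialize
    @word_counts = Hash.new(0)
    @total_word_count = 0
  end

  def add(word)
    @word_counts[word] += 1
    @total_word_count += 1 # keep the cached total in sync with the counts
  end
end

# Hypothetical collection that tracks every unique word it has seen.
class ToyCollection
  def initialize
    @groups = {}
    @known_words = Set.new
  end

  def add(group_name, *words)
    group = (@groups[group_name] ||= ToyGroup.new)
    words.each do |word|
      group.add(word)
      @known_words << word
    end
  end

  # Assumed formula: average number of occurrences per unique word.
  # No iteration over individual word counts is needed.
  def default_weight
    return 0.0 if @known_words.empty?

    @groups.values.sum(&:total_word_count) / @known_words.size.to_f
  end
end

collection = ToyCollection.new
collection.add(:spam, "buy", "buy", "now")
collection.add(:ham, "hello", "now")
weight = collection.default_weight # 5 words in total, 3 unique words
```

The sketch also shows why old serialized data is a problem: an object restored without the cached total or the known-words set would report a wrong default weight until its data is added again through `add`.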
  7. Refactor: Groupie#classify internals are simpler

    This is another old method that gained new features over time which
    complicated it. The double iteration and calls to apply_count_strategy
    bothered me. Using Hash#transform_values makes it cleaner.
    Narnach committed Feb 16, 2022
    b4ed9a1
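As a rough illustration of why Hash#transform_values helps here, the sketch below applies a count strategy and then normalizes the result in two passes that each keep the group keys intact. `classify_counts` and its `strategy:` parameter are hypothetical stand-ins; the real Groupie#classify and apply_count_strategy may differ.

```ruby
# Hypothetical helper, not Groupie's code: turns raw per-group counts for a
# word into per-group shares that sum to 1.
def classify_counts(raw_counts, strategy: ->(count) { count })
  # First pass: apply the count strategy to every group's count.
  weighted = raw_counts.transform_values { |count| strategy.call(count) }
  total = weighted.values.sum
  return {} if total.zero?

  # Second pass: normalize each weight against the total.
  weighted.transform_values { |weight| weight.to_f / total }
end

linear = classify_counts({ spam: 3, ham: 1 })
damped = classify_counts({ spam: 9, ham: 1 }, strategy: ->(count) { Math.sqrt(count) })
```

Because `transform_values` returns a new hash with the same keys, there is no need to build the result hash by hand or to iterate over the groups twice with explicit bookkeeping.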
  8. Specfix: switch to Psych for YAML test

    On Ruby 2.6 and 2.7 the bundled version of YAML has an inconsistent
    interface: YAML.unsafe_load, YAML.safe_load and YAML.load are not
    consistently present. So embrace the gem version of the underlying
    library (Psych) and simply use the latest version.
    Narnach committed Feb 16, 2022
    09f5f74
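For context, the inconsistency can be sketched with a small feature check. This is not what the commit does (it depends on the Psych gem so the interface is the same everywhere); it only illustrates the workaround that would otherwise be needed. `Tally` and `load_trusted_yaml` are hypothetical names for this example.

```ruby
require "yaml"

# Hypothetical class standing in for an object we want to round-trip.
class Tally
  attr_reader :counts

  def initialize(counts = {})
    @counts = counts
  end
end

# Loads YAML that we produced ourselves and therefore trust.
# Newer Psych versions provide unsafe_load for restoring arbitrary objects;
# older versions only offer load, which restores them by default.
def load_trusted_yaml(text)
  if YAML.respond_to?(:unsafe_load)
    YAML.unsafe_load(text)
  else
    YAML.load(text)
  end
end

original = Tally.new("spam" => 3, "ham" => 1)
restored = load_trusted_yaml(YAML.dump(original))
```

Either loader can instantiate arbitrary classes, so it should only ever be given YAML from a trusted source such as the spec's own dump.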