Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add enjim dataset(s) #9

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

lloorree
Copy link

@lloorree lloorree commented Feb 12, 2023

To run/set up:
- download sqlite from https://www.sqlite.org/download.html and install
- install/update dependencies however you're managing the .toml requirements locally
- update catbox paths in config.yaml
- run normally

To sanity-check the output (basically checks for bizarrely-formatted garbage):

grep -E "<[^* '\n3A-Z=<\.]+>|\[[^A-Z* 0-9r\.]{0,10}\]|&[^ \n]{1,10};|:[^ \n0-9]{1,10}:" rev-020c8d0-args-d93e21a.jsonl -c

@0x000011b 0x000011b self-requested a review February 19, 2023 15:51
@0x000011b 0x000011b added the enhancement New feature or request label Feb 19, 2023
@0x000011b 0x000011b linked an issue Feb 19, 2023 that may be closed by this pull request
Copy link
Collaborator

@0x000011b 0x000011b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely top notch work, thank you! Just a couple minor points to address.

from toolbox.core.models import Episode, Turn
from toolbox.datasets.enjim import EnjimDataset, EnjimAgent, setup_sqlite
from toolbox.modules import BaseModule
from toolbox.modules.registry import ModuleRegistry
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you were going to implement the registry pattern for the modules but backed away from doing that in this PR?

This import causes a crash since the file doesn't exist, but removing it and the reference below fixes it since it's not used elsewhere.

else self.summarize(spoken, self.settings['max_scenario_chars'], None, None),
speaker=speaker, human_speaker=speaker != bot_name) for speaker, spoken in
posts]
participant_personas = {ag.name: self.summarize_char(ag, self.settings['max_persona_chars'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor issues with the personas:

  • Empty personas are generated sometimes, resulting in processed text that looks like:
    A's Persona: 
    B's Persona: (actual persona text here)
    Scenario: (proper scenario here)
    ... 
    
    Which is a little wasteful token-wise and might be confusing to the model when training.
  • They seem to be unparsed query results instead of cleaned text, so instead of generating
     A's Persona: A is like this and that
    
    They look like:
    A's Persona: ("A is like this and that", )
    
    This is also happening on world_scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement data handling for RP forum dumps
2 participants