Add enjim dataset(s) #9

lloorree · 2023-02-12T02:33:43Z

To run/set up:
- download sqlite from https://www.sqlite.org/download.html and install
- install/update dependencies however you're managing the .toml requirements locally
- update catbox paths in config.yaml
- run normally

To sanity-check the output (basically checks for bizarrely-formatted garbage):

grep -E "<[^* '\n3A-Z=<\.]+>|\[[^A-Z* 0-9r\.]{0,10}\]|&[^ \n]{1,10};|:[^ \n0-9]{1,10}:" rev-020c8d0-args-d93e21a.jsonl -c
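For anyone without grep handy, the same check can be sketched in Python. This is a rough translation rather than part of the PR: the function name `count_suspect_lines` and the sample lines are made up for illustration, and the pattern assumes POSIX bracket-expression semantics, where a backslash inside `[...]` is a literal character (so grep's `\n` excludes `\` and `n`, not a newline).

```python
import re
from typing import Iterable

# Python translation of the four grep -E alternatives. Inside a POSIX bracket
# expression a backslash is literal, so grep's "\n" excludes the two characters
# "\" and "n"; "\\n" reproduces that behaviour here.
SUSPECT = re.compile(
    r"<[^* '\\n3A-Z=<.]+>"         # leftover HTML-like tags
    r"|\[[^A-Z* 0-9r\\.]{0,10}\]"  # leftover BBCode-like tags
    r"|&[^ \\n]{1,10};"            # HTML entities such as &amp;
    r"|:[^ \\n0-9]{1,10}:"         # emoji shortcodes such as :smile:
)


def count_suspect_lines(lines: Iterable[str]) -> int:
    """Count lines containing at least one match, like `grep -c`."""
    return sum(1 for line in lines if SUSPECT.search(line))


# Made-up sample lines; the last one is clean.
sample = [
    "hello <div> world",
    "[b]bold[/b]",
    "Tom &amp; Jerry",
    "so happy :smile:",
    "plain text at 12:30:45",
]
print(count_suspect_lines(sample))
```

To run it against real output, pass an open file handle for the generated .jsonl instead of `sample`. A count of zero suggests no obvious markup leaked through.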

0x000011b

Absolutely top notch work, thank you! Just a couple minor points to address.

0x000011b · 2023-02-19T15:45:57Z

toolbox/modules/enjim_pdm.py

+from toolbox.core.models import Episode, Turn
+from toolbox.datasets.enjim import EnjimDataset, EnjimAgent, setup_sqlite
+from toolbox.modules import BaseModule
+from toolbox.modules.registry import ModuleRegistry


I assume you were going to implement the registry pattern for the modules but backed away from doing that in this PR?

The ModuleRegistry import crashes because toolbox/modules/registry.py doesn't exist. Removing the import and the reference to it below fixes the crash, since ModuleRegistry isn't used anywhere else.
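For context, a registry pattern along these lines might look like the sketch below. Everything in it is hypothetical: the real `toolbox.modules.registry` file does not exist in this PR, `BaseModule` is a stand-in for the toolbox class, and the `register`/`get` methods and the `EnjimPDM` name are assumptions made for illustration.

```python
from typing import Callable, Dict, Type


class BaseModule:
    """Stand-in for toolbox.modules.BaseModule (hypothetical, for illustration)."""


class ModuleRegistry:
    """Hypothetical registry mapping module names to module classes."""

    _modules: Dict[str, Type[BaseModule]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[BaseModule]], Type[BaseModule]]:
        """Return a class decorator that records the class under `name`."""

        def decorator(module_cls: Type[BaseModule]) -> Type[BaseModule]:
            cls._modules[name] = module_cls
            return module_cls

        return decorator

    @classmethod
    def get(cls, name: str) -> Type[BaseModule]:
        """Look up a previously registered module class by name."""
        return cls._modules[name]


@ModuleRegistry.register("enjim_pdm")
class EnjimPDM(BaseModule):
    """Placeholder module class; the name is an assumption."""


print(ModuleRegistry.get("enjim_pdm").__name__)
```

With something like this in place, modules could be looked up by name from the config instead of being imported one by one. Until then, dropping the import is the simpler fix.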

0x000011b · 2023-02-19T16:22:54Z

toolbox/modules/enjim_pdm.py

+                     else self.summarize(spoken, self.settings['max_scenario_chars'], None, None),
+                          speaker=speaker, human_speaker=speaker != bot_name) for speaker, spoken in
+                     posts]
+            participant_personas = {ag.name: self.summarize_char(ag, self.settings['max_persona_chars'],


A couple minor issues with the personas:

Empty personas are sometimes generated, resulting in processed text that looks like:
A's Persona: B's Persona: (actual persona text here) Scenario: (proper scenario here) ...
This wastes tokens and might confuse the model during training.

The persona text appears to be the raw query result rather than the cleaned string, so instead of generating
A's Persona: A is like this and that
the output looks like:
A's Persona: ("A is like this and that", )
The same thing happens with world_scenario.
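Both issues could be handled roughly as in the sketch below, assuming the personas come from `sqlite3` queries whose `fetchone()` result (a one-element tuple) is being formatted directly. The table, column, and helper names (`characters`, `description`, `first_column`, `render_personas`) are made up for illustration and are not the PR's actual code.

```python
import sqlite3
from typing import Dict, Optional


def first_column(row: Optional[tuple]) -> str:
    """Return the first column of a sqlite3 row as stripped text ('' if missing)."""
    if row is None or row[0] is None:
        return ""
    return str(row[0]).strip()


def render_personas(personas: Dict[str, str]) -> str:
    """Emit one "X's Persona: ..." line per non-empty persona."""
    return "\n".join(
        f"{name}'s Persona: {text}" for name, text in personas.items() if text
    )


# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [("A", "A is like this and that"), ("B", None)],
)

personas = {}
for name in ("A", "B"):
    row = conn.execute(
        "SELECT description FROM characters WHERE name = ?", (name,)
    ).fetchone()
    personas[name] = first_column(row)  # unwrap the 1-tuple instead of str(row)

print(render_personas(personas))
```

The same unwrapping would apply to whatever query feeds world_scenario.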

lloorree added 3 commits February 9, 2023 22:00

in progress work on PygmalionAI#4 adding enjim datasets

020c8d0

prospective/semi-final draft for PygmalionAI#4

202a18d

minor regex fixes/tweaks for bbcode

9f88afe

lloorree mentioned this pull request Feb 12, 2023

Implement data handling for RP forum dumps #4

Closed

one last missed symbol code

1c45c5a

0x000011b force-pushed the master branch from b8fe49b to fea73ae Compare February 17, 2023 20:35

0x000011b self-requested a review February 19, 2023 15:51

0x000011b assigned lloorree Feb 19, 2023

0x000011b added the enhancement New feature or request label Feb 19, 2023

0x000011b linked an issue Feb 19, 2023 that may be closed by this pull request

Implement data handling for RP forum dumps #4

Closed

0x000011b requested changes Feb 19, 2023
