
Automated generation of questions and answers #13

Open
nojvek opened this issue Oct 9, 2023 · 0 comments
nojvek commented Oct 9, 2023

One of the biggest issues with GSM-8K, or any other dataset openly available on the internet, is that larger models trained on a crawl of the entire internet are likely to have the solutions in their training set.

https://twitter.com/suchenzang/status/1701615029211238904

https://arxiv.org/abs/2309.08632 ("Pretraining on the Test Set Is All You Need").

One way to address this would be for this repo to dynamically generate a test set in which the numbers, names, and formatting differ every time a new set is generated. The deeper semantic logic would stay the same, but the model could no longer succeed by memorizing answers, even if the raw version of the test dataset is openly available on the internet for crawling.
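
To make the idea concrete, here is a rough sketch of what such a generator could look like. Everything in it is hypothetical and not part of this repo: the names `generate_problem` and `generate_test_set`, the name and item lists, the two templates, and the number ranges are all made up for illustration. The point is only that the answer is computed from the same random values that fill in the question text, so every generated variant stays correct.

```python
import random

# Hypothetical pools of surface details; a real generator would use far larger ones.
NAMES = ["Ava", "Noah", "Mia", "Liam", "Zoe"]
ITEMS = ["apples", "pencils", "marbles"]

# Two phrasings of the same underlying problem: start + bought - given.
TEMPLATES = [
    "{name} has {start} {item}. {name} buys {bought} more and then gives away "
    "{given}. How many {item} does {name} have now?",
    "Starting with {start} {item}, {name} buys {bought} more and gives "
    "{given} away. How many {item} are left?",
]


def generate_problem(rng):
    """Build one question/answer pair from a random.Random instance."""
    start = rng.randint(10, 50)
    bought = rng.randint(1, 20)
    given = rng.randint(1, start)  # never gives away more than the starting amount
    question = rng.choice(TEMPLATES).format(
        name=rng.choice(NAMES),
        item=rng.choice(ITEMS),
        start=start,
        bought=bought,
        given=given,
    )
    # The answer is derived from the sampled values, not stored alongside the text.
    return {"question": question, "answer": start + bought - given}


def generate_test_set(n, seed):
    """Generate n problems; the same seed reproduces the same set."""
    rng = random.Random(seed)
    return [generate_problem(rng) for _ in range(n)]


if __name__ == "__main__":
    for problem in generate_test_set(3, seed=2023):
        print(problem["question"], "->", problem["answer"])
```

Using a fresh seed for each evaluation run gives a new set each time, while publishing the seed keeps any single run reproducible.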
