Load HF datasets from olmo_data #646

Merged (11 commits), Jul 10, 2024


@2015aroras 2015aroras (Collaborator) commented Jul 8, 2024

This PR changes how HF datasets are loaded in our downstream evals: they are now loaded from the olmo_data package. This removes the need for network calls or special caching mechanisms. Broadly speaking, this was @dirkgr's idea.
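As a rough illustration of the idea (a sketch only, not the code in this PR): a dataset split that was saved to disk ahead of time can be found under a bundled data directory and loaded without any network call. The directory layout (`hf_datasets/<path>/<name>/<split>`), the `"none"` placeholder for a missing config name, and both function names are assumptions made for this example. `datasets.load_from_disk` is the Hugging Face `datasets` call for reading a dataset previously written with `save_to_disk`.

```python
from pathlib import Path
from typing import Optional


def packaged_dataset_dir(root: Path, path: str, name: Optional[str], split: str) -> Path:
    """Resolve the directory of a bundled dataset split.

    Hypothetical layout: <root>/hf_datasets/<path>/<name or "none">/<split>.
    """
    dataset_dir = root / "hf_datasets" / path / (name or "none") / split
    if not dataset_dir.is_dir():
        raise NotADirectoryError(f"No bundled dataset found at {dataset_dir}")
    return dataset_dir


def load_packaged_dataset(root: Path, path: str, name: Optional[str], split: str):
    """Load a bundled split from local disk instead of downloading it from the Hub."""
    # Imported lazily so that path resolution works without `datasets` installed.
    import datasets

    return datasets.load_from_disk(str(packaged_dataset_dir(root, path, name, split)))
```

Assuming the splits were written with `save_to_disk` when the package was built, a call such as `load_packaged_dataset(data_root, "piqa", None, "validation")` would read them straight from local files, where `data_root` and the dataset name are placeholders.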

The PR looks big and ugly, but it's mostly busy work (like adding all the dataset files). The commits of interest (if any), in my opinion, are 84dbf43 and d112d4c.

Tested on LUMI on 1 node with the same checkpoint. Loading took 2 mins compared to 20 mins without this change, with the same downstream evals.

@2015aroras 2015aroras changed the title from "Shanea/hf datasets from package" to "Load HF datasets from olmo_data" Jul 8, 2024
@2015aroras 2015aroras marked this pull request as ready for review July 9, 2024 23:16
@2015aroras 2015aroras requested review from epwalsh and dirkgr July 9, 2024 23:16
@epwalsh epwalsh (Member) left a comment

LGTM!

@2015aroras 2015aroras merged commit a101b31 into main Jul 10, 2024
12 checks passed
@2015aroras 2015aroras deleted the shanea/hf-datasets-from-package branch July 10, 2024 21:51
2 participants