I'm working on sorting out which model ladder runs are comparable to one another. A prerequisite for this is versioning our data mixes. We should lock in a name for the version of Dolma 1.7 that uses
preprocessed/tulu_flan/v1-decontaminated-60M-shots_all-upweight_1-dialog_false-sep_rulebased
instead of
preprocessed/tulu_flan/v2-decontaminated-60M-shots_all-upweight_1-dialog_false-sep_newline/
(introduced to the model ladder in this PR). I think we should not keep calling this dolma17, as we currently do, unless there is a plan to update the HF-hosted version of Dolma 1.7 to include this change. At the very least, I'd like two different named_data_mixes for Dolma 1.7, one for each of these flans, so that the exact dataset a run uses can be identified from the data mix name alone, without checking out the code used to train the run just to see which overloaded version of a named mix it is.

The implementation here is a hot fix to differentiate the flans in dolma17 for the model ladder. Later we can clean up the data mix definition system more thoroughly, but right now we just need to make sure that new runs do not have a mislabeled data mix.
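
To make the idea concrete, here is a rough sketch of what "one name per flan" could look like. This is only an illustration: the mix names (`dolma17-flan-v1-rulebased`, `dolma17-flan-v2-newline`), the `NAMED_FLAN_BY_MIX` table, and the `flan_path_for_mix` helper are all hypothetical and are not the names or structures used in this change. Only the two flan paths come from the description above.

```python
# Hypothetical sketch: mix names and registry layout are assumptions,
# not the repository's actual definitions.
FLAN_RULEBASED = (
    "preprocessed/tulu_flan/"
    "v1-decontaminated-60M-shots_all-upweight_1-dialog_false-sep_rulebased"
)
FLAN_NEWLINE = (
    "preprocessed/tulu_flan/"
    "v2-decontaminated-60M-shots_all-upweight_1-dialog_false-sep_newline/"
)

# One distinct mix name per flan variant, so a name is never overloaded.
NAMED_FLAN_BY_MIX = {
    "dolma17-flan-v1-rulebased": FLAN_RULEBASED,
    "dolma17-flan-v2-newline": FLAN_NEWLINE,
}


def flan_path_for_mix(mix_name: str) -> str:
    """Return the flan path for a mix name, rejecting unknown or ambiguous names."""
    try:
        return NAMED_FLAN_BY_MIX[mix_name]
    except KeyError:
        known = ", ".join(sorted(NAMED_FLAN_BY_MIX))
        raise ValueError(
            f"Unknown data mix {mix_name!r}; expected one of: {known}"
        ) from None
```

With a table like this, a bare `dolma17` would fail loudly instead of silently resolving to whichever flan the checked-out code happens to use, which is the mislabeling this hot fix is meant to prevent.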