I was wondering whether you could provide the index_mapping files generated by GPT2Dataset. From the construction of GPT2Dataset here, I can see there are three npy index files.
Could you provide a copy of these files so that I don't need to regenerate them?
I am making this request because I want to study the influence of the original training data chunk by chunk. I have prepared the pythia-dedup dataset, but I failed to build the environment. After reading the code of GPT2Dataset, I found that with these index files I can reproduce the original training data of Pythia.
I noticed that you provide batch_viewer.py to check the unshuffled data, but it seems that this data still differs from the actual training data fed to the model during training.
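For context, here is a minimal sketch of how I understand the three index files to be used, based on my reading of the Megatron-style GPT2Dataset: shuffle_idx maps a training-step sample index to an unshuffled sample index, sample_idx maps that to (document index, token offset) pairs, and doc_idx maps to the actual documents. The function name and argument layout below are my own assumptions, not the repository's API, and I use tiny synthetic arrays in place of the real .npy files:

```python
import numpy as np

def reconstruct_sample(idx, shuffle_idx, sample_idx, doc_idx, docs):
    """Sketch of (my reading of) GPT2Dataset.__getitem__:
    map a training-step sample index through the three index
    arrays to a token sequence. Names here are assumptions."""
    idx = shuffle_idx[idx]              # shuffled -> unshuffled sample index
    doc_f, off_f = sample_idx[idx]      # first document and starting offset
    doc_l, off_l = sample_idx[idx + 1]  # last document and ending offset
    if doc_f == doc_l:
        # sample lies entirely within one document
        return docs[doc_idx[doc_f]][off_f:off_l + 1]
    # sample spans multiple documents: stitch the pieces together
    pieces = [docs[doc_idx[doc_f]][off_f:]]
    for i in range(doc_f + 1, doc_l):
        pieces.append(docs[doc_idx[i]])
    pieces.append(docs[doc_idx[doc_l]][:off_l + 1])
    return np.concatenate(pieces)

# Synthetic stand-ins for the real .npy index files:
docs = [np.array([0, 1, 2, 3]), np.array([4, 5, 6])]
doc_idx = np.array([0, 1])
sample_idx = np.array([[0, 0], [0, 3], [1, 2]])  # (doc, offset) boundaries
shuffle_idx = np.array([1, 0])

print(reconstruct_sample(0, shuffle_idx, sample_idx, doc_idx, docs))
```

If this reading is right, then given the real shuffle_idx/sample_idx/doc_idx files one could replay the exact shuffled training order without rebuilding the environment, which is why I am asking for the files themselves.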
Thanks
ziqi-zhang changed the title from "Provide the index_mapping npy files for ease of reproducing training data" to "Provide the shuffled index_mapping npy files for ease of reproducing training data" on Mar 14, 2024.