Rank and files #126

Merged
merged 4 commits into artidoro:main
Jun 6, 2023

Conversation

@tobi (Contributor) commented Jun 3, 2023

This does two things:

  • Allow dataset_format to be specified as input-output. I also considered calling it custom, but basically it means "don't do anything". This feels like the correct default when a local dataset is passed in: the file can be formatted however one wishes, and the docs say to make it input/output, which is a sensible default.
  • We also look at LOCAL_RANK to set the device_map, forcing everything onto a single device per process. This allows torchrun to do its work. (Admittedly, I'm a noob on this stuff, but it does seem to work well enough.)
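The LOCAL_RANK handling in the second bullet can be sketched roughly as follows. This is a minimal sketch with an assumed helper name (resolve_device_map is not from the PR): under torchrun, each worker process receives a LOCAL_RANK environment variable, and pinning the whole model to that device keeps each process on its own GPU instead of sharding one copy across all of them.

```python
import os

def resolve_device_map():
    """Pick a device_map for model loading.

    Under torchrun, each worker gets a LOCAL_RANK environment variable;
    mapping the empty-string key "" places every module on that one GPU.
    Without it, fall back to "auto" and let accelerate spread the layers.
    """
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None:
        # torchrun sets LOCAL_RANK per worker; pin everything to that device
        return {"": int(local_rank)}
    # single-process case: automatic placement
    return "auto"
```

A device_map of {"": n} is the standard accelerate convention for putting the entire model on device n.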

qlora.py Outdated
Comment on lines 466 to 474

- full_dataset = Dataset.from_json(filename=dataset_name, format='jsonlines')
+ full_dataset = Dataset.from_pandas(pd.read_json(dataset_name, lines=True))
Contributor

This change would load the full dataset into memory. Is this intentional?

Owner

Typically, NLP datasets are small enough to fit in memory, so this should be fine in most cases. However, I am unaware of the benefits of using Pandas vs HF Datasets for loading and have not benchmarked the two libraries. Could you provide some more details? Otherwise, I lean towards using the HF Datasets method.

Contributor Author

I think using the datasets library directly would be better; the previous code didn't work, though, so I fixed it the way I knew how. You are right that it would be better to just correct the syntax.

@lhoestq (Contributor) commented Jun 5, 2023

HF Datasets converts the data to an Arrow file and memory-maps it from disk. This gives high speed while keeping RAM usage to a minimum. It's also useful in distributed setups because the memory-mapped file acts as shared memory across processes: there is no need to copy the data to the different processes.

Contributor Author

I'd switch it, but I'm traveling for the rest of the week.

@artidoro (Owner) commented Jun 5, 2023

Thank you @tobi! I agree the default format should be input-output, and the device map specification looks good to me. It will be useful to have this working with torchrun!

qlora.py Outdated
Co-authored-by: Quentin Lhoest <[email protected]>
@artidoro (Owner) commented Jun 6, 2023

Thank you for your contributions!

@artidoro artidoro merged commit 4ea02e7 into artidoro:main Jun 6, 2023
LagPixelLOL pushed a commit to LagPixelLOL/qlora that referenced this pull request Feb 8, 2024