Rank and files #126

Merged
merged 4 commits into artidoro:main
Jun 6, 2023

Conversation

@tobi (Contributor) commented Jun 3, 2023

This does two things:

  • Allow dataset_format to be specified as input-output. I also considered calling it custom, but basically it means "don't do anything". This feels like the correct default when a local dataset is passed in: the file can be formatted however one wishes, and the docs say to make it input/output, which is a sensible default.
  • We also look at LOCAL_RANK to set the device_map, forcing everything onto a single device per process. This allows torchrun to do its work. (Admittedly, I'm a noob on this stuff, but it does seem to work well enough.)
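The LOCAL_RANK handling in the second bullet can be sketched roughly as follows. This is a minimal sketch with an assumed helper name (resolve_device_map is not from the PR): under torchrun, each worker process receives a LOCAL_RANK environment variable, and pinning the whole model to that device keeps each process on its own GPU instead of sharding one copy across all of them.

```python
import os

def resolve_device_map():
    """Pick a device_map for model loading.

    Under torchrun, each worker gets a LOCAL_RANK environment variable;
    mapping the empty-string key "" places every module on that one GPU.
    Without it, fall back to "auto" and let accelerate spread the layers.
    """
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None:
        # torchrun sets LOCAL_RANK per worker; pin everything to that device
        return {"": int(local_rank)}
    # single-process case: automatic placement
    return "auto"
```

A device_map of {"": n} is the standard accelerate convention for putting the entire model on device n.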

qlora.py Outdated
Comment on lines 466 to 474

- full_dataset = Dataset.from_json(filename=dataset_name, format='jsonlines')
+ full_dataset = Dataset.from_pandas(pd.read_json(dataset_name, lines=True))
Contributor

This change would load the full dataset into memory. Is this intentional?

Owner

Typically, NLP datasets are small enough to fit in memory, so this should be fine in most cases. However, I am unaware of the benefits of using Pandas vs HF Datasets for loading and have not benchmarked the two libraries. Could you provide some more details? Otherwise, I lean towards using the HF Datasets method.

Contributor Author

I think using the datasets library directly would be better; the previous code didn't work, though, so I fixed it the way I knew how. You are right that it would be better to just correct the syntax.

@lhoestq (Contributor) commented Jun 5, 2023

HF Datasets converts the data to an Arrow file and memory-maps it from disk. This gives high speed while keeping RAM usage to a minimum. It's also useful in distributed setups because the memory-mapped file acts as shared memory across processes: there is no need to copy the data to the different processes.

Contributor Author

I'd switch it, but I'm traveling for the rest of the week.

@artidoro (Owner) commented Jun 5, 2023

Thank you @tobi! I agree the default format should be input-output, and the device map specification looks good to me. It will be useful to have this working with torchrun!

qlora.py Outdated
Co-authored-by: Quentin Lhoest <[email protected]>
@artidoro (Owner) commented Jun 6, 2023

Thank you for your contributions!

@artidoro artidoro merged commit 4ea02e7 into artidoro:main Jun 6, 2023
LagPixelLOL pushed a commit to LagPixelLOL/qlora that referenced this pull request Feb 8, 2024