How to fine-tune using a PyTorch dataset instead of HF's dataset #958

Open
xugy16 opened this issue Aug 25, 2024 · 2 comments

xugy16 commented Aug 25, 2024

How can I use a PyTorch Dataset to fine-tune Llama 3.1?

When I try to use a PyTorch Dataset, I keep getting the following error from the collator:

File ~/anaconda3/envs/llama/lib/python3.10/site-packages/transformers/data/data_collator.py:589, in
...
# labels = [feature[label_name] for feature in features] if label_name in features[0].keys() else None
# reconvert list[None] to None if necessary
# this might occur when we pass {..., "labels": None}

AttributeError: 'str' object has no attribute 'keys'

The reason I need a PyTorch Dataset is that I want to add noise to the input text (data augmentation), so the samples are generated dynamically, as below.

def __getitem__(self, idx):
    # only add noise to the input text
    # tmp = self.data[idx]
    true_qry = self.data['true_qry'][idx]
    if random.random() < self.noise_prob:
        sample_edit_distance = random.randint(1, self.max_edit_distance)
        input_qry = self.add_noise(true_qry, sample_edit_distance)
    else:
        input_qry = true_qry
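
My understanding is that DataCollatorForSeq2Seq calls .keys() on each item it receives, so __getitem__ would have to return a dict of tokenized features rather than a plain string. A rough sketch of what that might look like (self.tokenizer and self.max_seq_length are attributes my class does not currently have, so this is an assumption, not my actual code):

def __getitem__(self, idx):
    true_qry = self.data['true_qry'][idx]
    if random.random() < self.noise_prob:
        input_qry = self.add_noise(true_qry, random.randint(1, self.max_edit_distance))
    else:
        input_qry = true_qry
    # format the pair as ChatML and tokenize it here, so the collator
    # receives a dict of features instead of a raw string
    text = (f"<|im_start|>user\n{input_qry}<|im_end|>\n"
            f"<|im_start|>assistant\n{true_qry}<|im_end|>\n")
    enc = self.tokenizer(text, truncation=True, max_length=self.max_seq_length)
    enc["labels"] = enc["input_ids"].copy()   # DataCollatorForSeq2Seq pads labels too
    return enc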

I then follow the fine-tuning script and use the ChatML template:

<|im_start|>user
iobwin<|im_end|>
<|im_start|>assistant
ibowin<|im_end|>
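
The markers above are written by hand; assuming the tokenizer has been configured with a ChatML chat template, the same string could presumably be produced with apply_chat_template:

messages = [
    {"role": "user", "content": "iobwin"},
    {"role": "assistant", "content": "ibowin"},
]
# renders the messages with the tokenizer's configured chat template;
# this only yields <|im_start|>/<|im_end|> markers if that template is ChatML
text = tokenizer.apply_chat_template(messages, tokenize=False)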

The trainer is set up as below:

def my_formatting_func(example):
    return example

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    # dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=train_args,
    formatting_func=my_formatting_func,  # added this line
)
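
As far as I can tell, SFTTrainer expects formatting_func to return the training text (or a list of texts) rather than the raw example, so returning example unchanged may be part of the problem. A sketch of what it would return, assuming each example carried a "text" field:

def my_formatting_func(example):
    # return the string to train on, not the whole example dict
    return example["text"]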
@danielhanchen
Contributor

I think HF's datasets has like a converter - unsure though - maybe huggingface/datasets#4983?
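
Roughly, Dataset.from_generator might work for wrapping a torch-style dataset - untested sketch, with pt_dataset standing in for your dataset object:

import random
from datasets import Dataset

# materialize the torch-style data as an HF dataset with a "text" column
def gen():
    for true_qry in pt_dataset.data['true_qry']:   # pt_dataset: your dataset above
        if random.random() < pt_dataset.noise_prob:
            input_qry = pt_dataset.add_noise(true_qry, random.randint(1, pt_dataset.max_edit_distance))
        else:
            input_qry = true_qry
        yield {
            "text": f"<|im_start|>user\n{input_qry}<|im_end|>\n"
                    f"<|im_start|>assistant\n{true_qry}<|im_end|>\n"
        }

hf_dataset = Dataset.from_generator(gen)

Note that from_generator materializes the data once, so the noise would be sampled a single time rather than re-sampled every epoch; with a "text" column like this, the custom collator probably isn't needed either.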

xugy16 (Author) commented Aug 25, 2024

@danielhanchen Really appreciate your reply. Supposing we do not do the conversion, is it possible to just fine-tune Llama 3.1 with SFTTrainer using: 1) a PyTorch dataset with data augmentation; 2) the ChatML format?

I tried several methods, but it seems that SFTTrainer does not tokenize my ChatML input and throws the "'str' object has no attribute 'keys'" error.
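
If it is possible, a rough, untested sketch of how the trainer might be wired (the dataset's __getitem__ returning tokenized dicts as sketched above, DataCollatorForSeq2Seq kept only for padding, and no formatting_func):

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # torch Dataset yielding tokenized feature dicts
    max_seq_length=max_seq_length,
    packing=False,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=train_args,
    # no formatting_func: the dataset already returns model-ready features
)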
