How to fine-tune using a PyTorch dataset instead of HF's dataset #958

Open
xugy16 opened this issue Aug 25, 2024 · 2 comments

xugy16 commented Aug 25, 2024

How can I use a PyTorch Dataset to fine-tune Llama 3.1?

When I try to use a PyTorch Dataset, I keep getting the following error from the collator:

File ~/anaconda3/envs/llama/lib/python3.10/site-packages/transformers/data/data_collator.py:589, in
...
# labels = [feature[label_name] for feature in features] if label_name in features[0].keys() else None
# reconvert list[None] to None if necessary
# this might occur when we pass {..., "labels": None}

AttributeError: 'str' object has no attribute 'keys'

The reason I need a PyTorch Dataset is that I want to add noise to the input text (data augmentation), so the samples are generated dynamically, as below.

def __getitem__(self, idx):
    # only add noise to the input text
    # tmp = self.data[idx]
    true_qry = self.data['true_qry'][idx]
    if random.random() < self.noise_prob:
        sample_edit_distance = random.randint(1, self.max_edit_distance)
        input_qry = self.add_noise(true_qry, sample_edit_distance)
    else:
        input_qry = true_qry
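
My understanding is that DataCollatorForSeq2Seq calls .keys() on each item it receives, so __getitem__ would have to return a dict of tokenized features rather than a plain string. A rough sketch of what that might look like (self.tokenizer and self.max_seq_length are attributes my class does not currently have, so this is an assumption, not my actual code):

def __getitem__(self, idx):
    true_qry = self.data['true_qry'][idx]
    if random.random() < self.noise_prob:
        input_qry = self.add_noise(true_qry, random.randint(1, self.max_edit_distance))
    else:
        input_qry = true_qry
    # format the pair as ChatML and tokenize it here, so the collator
    # receives a dict of features instead of a raw string
    text = (f"<|im_start|>user\n{input_qry}<|im_end|>\n"
            f"<|im_start|>assistant\n{true_qry}<|im_end|>\n")
    enc = self.tokenizer(text, truncation=True, max_length=self.max_seq_length)
    enc["labels"] = enc["input_ids"].copy()   # DataCollatorForSeq2Seq pads labels too
    return enc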

I then follow the fine-tuning script and use the ChatML template:

<|im_start|>user
iobwin<|im_end|>
<|im_start|>assistant
ibowin<|im_end|>
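
The markers above are written by hand; assuming the tokenizer has been configured with a ChatML chat template, the same string could presumably be produced with apply_chat_template:

messages = [
    {"role": "user", "content": "iobwin"},
    {"role": "assistant", "content": "ibowin"},
]
# renders the messages with the tokenizer's configured chat template;
# this only yields <|im_start|>/<|im_end|> markers if that template is ChatML
text = tokenizer.apply_chat_template(messages, tokenize=False)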

The trainer is set up as below:

def my_formatting_func(example):
    return example

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    # dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=train_args,
    formatting_func=my_formatting_func,  # added this line
)
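
As far as I can tell, SFTTrainer expects formatting_func to return the training text (or a list of texts) rather than the raw example, so returning example unchanged may be part of the problem. A sketch of what it would return, assuming each example carried a "text" field:

def my_formatting_func(example):
    # return the string to train on, not the whole example dict
    return example["text"]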
@danielhanchen
Contributor

I think HF's datasets has like a converter - unsure though - maybe huggingface/datasets#4983?
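
Roughly, Dataset.from_generator might work for wrapping a torch-style dataset - untested sketch, with pt_dataset standing in for your dataset object:

import random
from datasets import Dataset

# materialize the torch-style data as an HF dataset with a "text" column
def gen():
    for true_qry in pt_dataset.data['true_qry']:   # pt_dataset: your dataset above
        if random.random() < pt_dataset.noise_prob:
            input_qry = pt_dataset.add_noise(true_qry, random.randint(1, pt_dataset.max_edit_distance))
        else:
            input_qry = true_qry
        yield {
            "text": f"<|im_start|>user\n{input_qry}<|im_end|>\n"
                    f"<|im_start|>assistant\n{true_qry}<|im_end|>\n"
        }

hf_dataset = Dataset.from_generator(gen)

Note that from_generator materializes the data once, so the noise would be sampled a single time rather than re-sampled every epoch; with a "text" column like this, the custom collator probably isn't needed either.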

xugy16 (Author) commented Aug 25, 2024

@danielhanchen Really appreciate your reply. Supposing we do not do the conversion, is it possible to just fine-tune Llama 3.1 with SFTTrainer using: 1) a PyTorch dataset with data augmentation; 2) the ChatML format?

I tried several methods, but it seems that SFTTrainer does not tokenize my ChatML input and throws the "'str' object has no attribute 'keys'" error.
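
If it is possible, a rough, untested sketch of how the trainer might be wired (the dataset's __getitem__ returning tokenized dicts as sketched above, DataCollatorForSeq2Seq kept only for padding, and no formatting_func):

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # torch Dataset yielding tokenized feature dicts
    max_seq_length=max_seq_length,
    packing=False,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=train_args,
    # no formatting_func: the dataset already returns model-ready features
)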
