How can I train a BART FiD model on custom data with gold retrieved passages? #3872
Comments
Hi there, the experiments in the linked paper can be replicated somewhat with the BlenderBot 2 model architecture. Specifically, if we take the command for training FiD RAG and do the following, you should be able to use gold retrieved passages.
So, if you emit examples containing your gold passages (e.g., under `gold_docs`, `gold_sentences`, and `gold_doc_titles` keys, matching the flags below), you would add the following flags to the FiD RAG command:
Your final command would look like this:

```
parlai train_model \
--rag-retriever-type dpr --query-model bert_from_parlai_rag \
--dpr-model-file zoo:hallucination/bart_rag_token/model \
--generation-model bart --init-opt arch/bart_large \
--batchsize 16 --fp16 True --gradient-clip 0.1 --label-truncate 128 \
--log-every-n-secs 30 --lr-scheduler reduceonplateau --lr-scheduler-patience 1 \
--model-parallel True --optimizer adam --text-truncate 512 --truncate 512 \
--learningrate 1e-05 --validation-metric-mode min --validation-every-n-epochs 0.25 \
--validation-max-exs 1000 --validation-metric ppl --validation-patience 5 \
# new args
--gold-document-key gold_docs --gold-sentence-key gold_sentences --gold-document-titles-key gold_doc_titles \
--model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
--task my_custom_task \
--insert-gold-docs true \
--splitted-chunk-length 1000 \
--retriever-debug-index compressed --knowledge-access-method search_only --n-docs 3
```
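For illustration, a single example emitted by such a task might look like the sketch below (Python; the `gold_*` keys match the flags above, and all field values are hypothetical):

```python
# Minimal sketch of one training example carrying gold retrieved passages.
# The gold_* keys match the --gold-*-key flags above; the values are made up.
example = {
    'text': 'Who wrote The Old Man and the Sea?',
    'labels': ['It was written by Ernest Hemingway.'],
    'gold_docs': [
        'Ernest Hemingway was an American novelist and short-story writer. '
        'He wrote The Old Man and the Sea in 1951.',
    ],
    'gold_sentences': ['He wrote The Old Man and the Sea in 1951.'],
    'gold_doc_titles': ['Ernest Hemingway'],
    'episode_done': True,
}
```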
@klshuster thank you so much for your prompt response! :) … Again, thanks!
Also, should I create a new task, or am I fine with using an existing teacher?
@klshuster Sorry, I ended up creating my own task, since I was not sure how I should format my data, which has a list of gold documents (rather than a single string), in the ParlAI Dialog Format. Below is my script for `agents.py`: …
Yes, that's exactly it - setting those flags will allow the training code to access those fields in the examples you are emitting.
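For anyone following along, here is a minimal sketch of such a task (assumptions: the class names, data, and file layout are hypothetical; only the `gold_*` field names come from the flags discussed above, and dict-style `setup_data` yields are assumed to be supported by your ParlAI version):

```python
# agents.py -- minimal sketch of a custom ParlAI task that emits gold passages.
from parlai.core.teachers import DialogTeacher


class GoldDocsTeacher(DialogTeacher):
    def __init__(self, opt, shared=None):
        # DialogTeacher requires a 'datafile'; here we derive it from the datatype.
        opt['datafile'] = opt['datatype'].split(':')[0]
        super().__init__(opt, shared)

    def setup_data(self, datafile):
        # Replace this toy list with real data loading (e.g., from a JSON file).
        examples = [
            {
                'text': 'Who wrote The Old Man and the Sea?',
                'labels': ['Ernest Hemingway.'],
                'gold_docs': ['Ernest Hemingway was an American novelist ...'],
                'gold_sentences': ['Hemingway wrote The Old Man and the Sea.'],
                'gold_doc_titles': ['Ernest Hemingway'],
            }
        ]
        for ex in examples:
            yield ex, True  # True marks the start of a new episode


class DefaultTeacher(GoldDocsTeacher):
    pass
```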
@klshuster Thanks. I got an error after the bart_large model was downloaded. Other libraries seem to be needed, but I am not sure which versions of them I should install, like `fairseq`.
Installing the recent fairseq library fixed the issue. Thanks.
@klshuster Hi, I realized I am not able to find out where the model is being saved. It is still being trained, but I can't find any checkpoint in the output directory. Also, I would really appreciate it if you could provide the generation command as well; I could not find any example on the project page. Not interactive, but bulk generation from a file. Thanks again.
Hi there - could you share your command and the beginning of your training logs? For generation, you can try running `parlai eval_model`.
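As a rough sketch of such a bulk-generation command (the paths, task name, and batch size here are assumptions, not values from this thread):

```
parlai eval_model \
    --model-file /path/to/trained/model \
    --task my_custom_task --datatype test \
    --batchsize 8 --skip-generation false \
    --world-logs /path/to/generations.jsonl
```

The `--world-logs` flag writes each example, including the model's replies, to a file, which gives you file-based bulk generation rather than an interactive session.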
Thanks @klshuster. I used the same command you shared with me (just changed the task to my own). So here is the train log; I could not find the save path in it. That training job successfully finished, but since I could not find where it was saved (I am running on clusters), I started a new job and set `--model-file`.
Ahh yea, if the `--model-file` is not specified, the model is not saved anywhere. I'll go ahead and update those README commands to include a model file.
Hi, I'm very impressed by your work @klshuster! I have a few questions about your project 'SEA' and the Internet-Augmented Generation paper. Just like @fabrahman, I tried to train my own FiD-BART model, but from scratch.

I am confused about how those two models are trained. For the 'WizInt Search engine FiD-Gold' model, I believed it prepends the WizInt dataset's concatenated 'selected-sentences' to the dialog history for training. And for the 'WizInt Search Engine FiD' model, I believed it prepends each of the first passages of the WizInt dataset's 'retrieved-docs'. Unfortunately, I can't see any TeacherAgents that utilize WizInt's 'retrieved-docs' for training the Search Engine FiD model.

About WizardDialogGoldKnowledgeTeacher: at first I thought it might be the right teacher for training the FiD-Gold model, but I concluded that if we have just one passage of knowledge in context, we don't have to use FiD for generation (i.e., it is just like training a normal transformer, like Transformer (gold knowledge) in Tables 2 and 3). So which is right? How can I utilize the Wizard of Internet dataset to train the Search engine FiD model (and the FiD-Gold model)?

And lastly, as I mentioned above, I can't see any added arguments to utilize the dataset's 'retrieved-docs' or 'selected-docs' columns in SearchQuerySearchEngineFidAgent. So if I want to use those columns in ParlAI, should I follow your first reply in this issue (i.e., use BlenderBot2FidAgent and add --gold-document-key ... to my command line)?

I would really appreciate it if you could point out where my understanding of this project is wrong and help me understand it better. Thanks.
Hi @Bannng, let me clarify a few things for you:
In neither case are the documents directly prepended to the context via the teacher; rather, the FiD model takes care of this via internal handling of retrieval. Indeed, we did not open-source the search engine fid-gold model. However, as I mentioned above, this can be done with the `BlenderBot2FidAgent` and the gold-document flags from my first reply.
@klshuster Thank you for your help. Currently, I have my index file (document embeddings) as a numpy memory map saved on disk. As far as I understand, in your case it seems to be a FAISS index.
Follow these instructions for generating the index with your own knowledge source. Note that if your index is not too big, you can use the exact indexer (rather than the compressed one). Once those are generated, you can specify `--path-to-index` and `--path-to-dpr-passages` to point to your own index and passages.
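Concretely, the retrieval flags would look roughly like the following (a sketch; the paths are hypothetical):

```
--rag-retriever-type dpr \
--path-to-index /path/to/my_index \
--path-to-dpr-passages /path/to/my_passages.tsv \
--indexer-type exact
```

Switching to `--indexer-type compressed` trades some retrieval accuracy for a much smaller memory footprint, which matters for very large indexes.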
Thank you for your help, @klshuster! Your reply has been a great help for me in understanding the project. If what I understand is right, …
but for …
Thanks a lot!
This is indeed correct.
We did not actually use these fields in the experiments per se, but the values within the fields were used to train the gold knowledge baselines; those baselines were just standard Transformer models (they were trained with the gold knowledge placed directly in the context).
@klshuster I suppose that when I train the FiD model using the command you shared earlier, then at inference time the model should do regular decoding and not rag-token-style decoding, right? The reason I was not certain is that we set `--dpr-model-file zoo:hallucination/bart_rag_token/model`.
Yeah, the `--dpr-model-file` is only used to initialize the retriever; a FiD model still does regular decoding.
Thanks a lot @klshuster!!! It really helped me!
Hello @klshuster, I am sorry that I keep asking so many questions; I really appreciate your kind help so far. I was wondering if you could confirm that the following command is correct for training a FiD model (BART generator) using our own dataset, index, and knowledge corpus (retrieval passages): …
Also, I had two questions regarding options in args.py: …
Thanks again.
Yes, that command looks perfect; however, make sure to set a `--model-file` so that the model actually gets saved.
@klshuster Perfect! Thank you so much for the clarification!
@klshuster I realized that when I generated the dense embeddings, I set … I am not sure if that is why I am getting the following error when loading my index file: … Is there any easy way to fix this issue without having to recreate the index file?
I fixed this issue! It was also because the passage IDs started from 0, while they should have started from 1. Thanks.
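For reference, the DPR-style passages file is a tab-separated file with `id`, `text`, and `title` columns, where the integer ids begin at 1; the rows below are made-up examples of that layout:

```
id	text	title
1	Ernest Hemingway was an American novelist ...	Ernest Hemingway
2	The Old Man and the Sea is a 1952 novella ...	The Old Man and the Sea
```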
Hi @klshuster, … Without these arguments the command still works, but I think it won't use any grounding docs. I was able to run this command, but I'm not sure if it's the correct way of doing this? Thank you for your help.
…
@klshuster Thanks a lot. This is absolutely helpful. Regarding 3: this is a trained FiD model (…).
Train command: …
Eval command (assuming I can remove some of the unnecessary args in the future): …
@klshuster If I understand correctly, we can add … I tried adding this argument and I realized … Also, in the metrics, … Thanks.
It was simply not implemented for …
Hi @klshuster, …
Hi @ELotfi, yes this is indeed possible - if you specify …
Hey! Specifically, when I execute the following command:

```
parlai tm --rag-retriever-type dpr --query-model bert_from_parlai_rag \
    --dpr-model-file zoo:hallucination/bart_rag_token/model \
    --generation-model bart --init-opt arch/bart_large --batchsize 2 \
    --gradient-clip 0.1 --label-truncate 128 --log-every-n-secs 30 \
    --lr-scheduler reduceonplateau --lr-scheduler-patience 1 --optimizer adam \
    --text-truncate 512 --truncate 512 --learningrate 1e-05 \
    --validation-metric-mode min --validation-every-n-epochs 0.5 \
    --validation-max-exs 1000 --validation-metric ppl --validation-patience 5 \
    --gold-document-key gold_documents --gold-sentence-key gold_sentences \
    --gold-document-titles-key gold_doc_titles \
    --model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
    --insert-gold-docs true --splitted-chunk-length 1000 \
    --retriever-debug-index compressed --knowledge-access-method search_only \
    --n-docs 2 --task <my_task> --debug --loglevel debug
```

training seems to work fine, but upon changing to multiprocessing training (`parlai multiprocessing_train`), I get:

```
Traceback (most recent call last):
  File "/dccstor/knewedge/yuvalk/ParlAI/parlai/scripts/multiprocessing_train.py", line 45, in multiprocess_train
    return single_train.TrainLoop(opt).train()
  File "/dccstor/knewedge/yuvalk/ParlAI/parlai/scripts/train_model.py", line 950, in train
    for _train_log in self.train_steps():
  File "/dccstor/knewedge/yuvalk/ParlAI/parlai/scripts/train_model.py", line 857, in train_steps
    world.parley()
  File "/dccstor/knewedge/yuvalk/ParlAI/parlai/core/worlds.py", line 880, in parley
    obs = self.batch_observe(other_index, batch_act, agent_idx)
  File "/dccstor/knewedge/yuvalk/ParlAI/parlai/core/worlds.py", line 824, in batch_observe
    observation = agents[index].observe(observation)
  File "/dccstor/knewedge/yuvalk/ParlAI/projects/blenderbot2/agents/blenderbot2.py", line 441, in observe
    observation = super().observe(observation)
  File "/dccstor/knewedge/yuvalk/ParlAI/parlai/agents/rag/rag.py", line 284, in observe
    self._set_query_vec(observation)
  File "/dccstor/knewedge/yuvalk/ParlAI/projects/blenderbot2/agents/blenderbot2.py", line 510, in _set_query_vec
    observation['query_vec'] = self.model.tokenize_query(query_str)
  File "/dccstor/knewedge/yuvalk/anaconda3/envs/parlai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in __getattr__
    type(self).__name__, name))
AttributeError: 'DistributedDataParallel' object has no attribute 'tokenize_query'
```

How do you recommend fine-tuning FiD with gold retrieved documents? I will appreciate any tip/advice :)
Indeed, BB2 requires around 8 GPUs for training the 2.7B model (only 4 for training the 400M model). I can put up a fix for your specific error, though we have not extensively tested BB2 with multiprocessing.
@klshuster I wonder how we can use …
We don't currently have the …
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
Hello,
Thanks for the great effort.
I am new to ParlAI. I am interested in training a BART FiD model on my custom data using gold retrieved passages instead of a DPR-style retriever.
I understand how to add a new dataset from here.
And on the project page here, I see that the second-to-last command is for training FiD RAG. Is there a way to modify the `RagModel` or `FidModel` class to pass gold passages? I saw this recent paper, where they have experiments using retrieved gold knowledge. I would appreciate it if you could point me in the right direction.