Finetune

In this example, we show how to finetune the reranker with your data.

1. Installation

  • with pip
pip install -U FlagEmbedding[finetune]
  • from source
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .[finetune]

For development, install as editable:

pip install -e .[finetune]

2. Data format

Train data should be a JSON Lines (.jsonl) file, where each line is a dict in the following format:

{"query": str, "pos": List[str], "neg":List[str], "pos_scores": List[int], "neg_scores": List[int], "prompt": str}

query is the query text, pos is a list of positive texts, and neg is a list of negative texts. pos_scores is a list of scores between the query and each text in pos, and neg_scores is a list of scores between the query and each text in neg; both can be omitted if you do not use knowledge distillation. prompt is the prompt used for the input; the input has the following format: query [sep] passage [sep] prompt. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
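
For illustration, here is a hypothetical example of a single training line (all values, including the prompt, are invented for illustration; the snippet simply appends one JSON line to a file):

import json

# Hypothetical training example; pos_scores/neg_scores are only needed for knowledge distillation.
example = {
    "query": "what is the capital of France?",
    "pos": ["Paris is the capital and most populous city of France."],
    "neg": ["Berlin is the capital of Germany.", "Lyon is a large city in France."],
    "pos_scores": [0.99],
    "neg_scores": [0.01, 0.35],
    "prompt": "Given a query A and a passage B, determine whether the passage contains an answer to the query."
}

with open("toy_finetune_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")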

See example_data for more detailed files.

Hard Negatives

Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives with the following command (a simplified sketch of the mining logic is shown after the argument list below):

git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
python hn_mine.py \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--input_file toy_finetune_data.jsonl \
--output_file toy_finetune_data_minedHN.jsonl \
--range_for_sampling 2-200 \
--negative_number 15 \
--use_gpu_for_searching 
  • input_file: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
  • output_file: path to save JSON data with mined hard negatives for finetuning
  • negative_number: the number of sampled negatives
  • range_for_sampling: the rank range to sample negatives from. For example, 2-200 means sampling negative_number negatives from the top-2 to top-200 retrieved documents. You can set a larger range to reduce the difficulty of the negatives (e.g., set it to 60-300 to sample negatives from the top-60 to top-300 passages)
  • candidate_pool: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all neg texts in input_file. The format of this file is the same as the pretraining data. If a candidate_pool is provided, this script will retrieve negatives from this file
  • use_gpu_for_searching: whether to use faiss-gpu to retrieve negatives.
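
For intuition, here is a simplified Python sketch of the mining logic, not the actual hn_mine.py implementation. It assumes the default setting (no candidate_pool), pools the pos/neg texts from the input file as candidates, and requires faiss to be installed:

import json
import random

import faiss
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-en-v1.5", use_fp16=True)

with open("toy_finetune_data.jsonl") as f:
    data = [json.loads(line) for line in f]
corpus = sorted({p for d in data for p in d["pos"] + d["neg"]})

# Build a dense index over the candidate pool.
emb = model.encode(corpus).astype("float32")
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# Retrieve top-ranked candidates for every query.
q_emb = model.encode_queries([d["query"] for d in data]).astype("float32")
k = min(200, len(corpus))
_, topk = index.search(q_emb, k)

start, end, negative_number = 2, 200, 15  # mirrors --range_for_sampling 2-200 and --negative_number 15
for d, ids in zip(data, topk):
    candidates = [corpus[i] for i in ids[start:end] if corpus[i] not in d["pos"]]
    d["neg"] = random.sample(candidates, min(negative_number, len(candidates)))

with open("toy_finetune_data_minedHN.jsonl", "w") as f:
    for d in data:
        f.write(json.dumps(d, ensure_ascii=False) + "\n")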

Teacher Scores

Teacher scores can be used for model distillation. You can obtain the scores using the following command (a simplified Python sketch of the scoring step follows the argument list below):

git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
python add_reranker_score.py \
--input_file toy_finetune_data_minedHN.jsonl \
--output_file toy_finetune_data_score.jsonl \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
  • input_file: path to the JSON data with mined hard negatives for finetuning
  • output_file: path to save JSON data with scores for finetuning
  • use_fp16: Whether to use fp16 for inference. Default: True
  • devices: Devices to use for inference. Default: None, multiple values allowed
  • trust_remote_code: Trust remote code. Default: False
  • reranker_name_or_path: The reranker name or path. Default: None
  • reranker_model_class: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
  • reranker_peft_path: The reranker peft path. Default: None
  • use_bf16: Whether to use bf16 for inference. Default: False
  • query_instruction_for_rerank: Instruction for query. Default: None
  • query_instruction_format_for_rerank: Format for query instruction. Default: "{}{}"
  • passage_instruction_for_rerank: Instruction for passage. Default: None
  • passage_instruction_format_for_rerank: Format for passage instruction. Default: "{}{}"
  • cache_dir: Cache directory for models. Default: None
  • reranker_batch_size: Batch size for inference. Default: 3000
  • reranker_query_max_length: Max length for reranking queries. Default: None
  • reranker_max_length: Max length for reranking. Default: 512
  • normalize: Whether to normalize the reranking scores. Default: False
  • prompt: The prompt for the reranker. Default: None
  • cutoff_layers: The output layers of layerwise/lightweight reranker. Default: None
  • compress_ratio: The compress ratio of lightweight reranker. Default: 1
  • compress_layers: The compress layers of lightweight reranker. Default: None, multiple values allowed
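
For intuition, here is a simplified Python sketch of what the scoring step does, using FlagReranker directly (the add_reranker_score.py script offers many more options):

import json

from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def score_pairs(query, passages):
    scores = reranker.compute_score([[query, p] for p in passages])
    if not isinstance(scores, list):  # a single pair may come back as a scalar
        scores = [scores]
    return [float(s) for s in scores]

with open("toy_finetune_data_minedHN.jsonl") as fin, open("toy_finetune_data_score.jsonl", "w") as fout:
    for line in fin:
        d = json.loads(line)
        d["pos_scores"] = score_pairs(d["query"], d["pos"])
        d["neg_scores"] = score_pairs(d["query"], d["neg"])
        fout.write(json.dumps(d, ensure_ascii=False) + "\n")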

3. Train

Detailed examples of various fine-tuning setups can be found in the bash files located in the corresponding folders. Here, we simply provide the training methods for the standard model, bge-reranker-v2-gemma, and bge-reranker-v2-minicpm-layerwise.

Here are some important arguments:

  • model_name_or_path: The model checkpoint for initialization.
  • config_name: Pretrained config name or path if not the same as model_name. Default: None
  • tokenizer_name: Pretrained tokenizer name or path if not the same as model_name. Default: None
  • cache_dir: Where do you want to store the pre-trained models downloaded from s3. Default: None
  • trust_remote_code: Trust remote code. Default: False
  • model_type: Type of finetuning, one of ['encoder', 'decoder']. Default: 'encoder'
  • token: The token to use when accessing the model. Default: Value from environment variable HF_TOKEN or None if not set
  • train_data: One or more paths to training data. query: str, pos: List[str], neg: List[str] are required in the training data. Default: None
  • cache_path: Where do you want to store the cached data. Default: None
  • train_group_size: Default: 8
  • query_max_len: The maximum total input sequence length for the query after tokenization. Sequences longer than this will be truncated. Default: 32
  • passage_max_len: The maximum total input sequence length for the passage after tokenization. Sequences longer than this will be truncated. Default: 128
  • max_len: The maximum total input sequence length after tokenization. Sequences longer than this will be truncated. Default: 512
  • pad_to_multiple_of: If set, will pad the sequence to be a multiple of the provided value. Default: None
  • max_example_num_per_dataset: The max number of examples for each dataset. Default: 100000000
  • query_instruction_for_rerank: Instruction for query. Default: None
  • query_instruction_format: Format for query instruction. Default: "{}{}" (see the small example after this list)
  • knowledge_distillation: Use knowledge distillation when pos_scores: List[float] and neg_scores: List[float] are in features of training data. Default: False
  • passage_instruction_for_rerank: Instruction for passage. Default: None
  • passage_instruction_format: Format for passage instruction. Default: "{}{}"
  • shuffle_ratio: The ratio of shuffling the text. Default: 0.0
  • sep_token: The separator token for LLM reranker to discriminate between query and passage. Default: '\n'
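
To make the instruction arguments concrete, here is a small illustration of how we assume the format strings and sep_token are combined. The exact concatenation inside the library may differ slightly; the values are invented, and the overall shape follows the query [sep] passage [sep] prompt description above:

query_instruction_for_rerank = "A: "
query_instruction_format = "{}{}"      # first slot: instruction, second slot: text (assumption)
passage_instruction_for_rerank = "B: "
passage_instruction_format = "{}{}"
sep_token = "\n"

query = "what is panda?"
passage = "The giant panda is a bear species endemic to China."
prompt = "Given a query A and a passage B, determine whether the passage contains an answer to the query."

formatted_query = query_instruction_format.format(query_instruction_for_rerank, query)          # "A: what is panda?"
formatted_passage = passage_instruction_format.format(passage_instruction_for_rerank, passage)  # "B: The giant panda ..."
model_input = sep_token.join([formatted_query, formatted_passage, prompt])
print(model_input)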

(1) standard model

torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.reranker.encoder_only.base \
    --model_name_or_path BAAI/bge-reranker-v2-m3 \
    --cache_dir ./cache/model \
    --train_data ./example_data/normal/examples.jsonl \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation False \
    --output_dir ./test_encoder_only_base_bge-reranker-base \
    --overwrite_output_dir \
    --learning_rate 6e-5 \
    --fp16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --weight_decay 0.01 \
    --deepspeed ../ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000
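
After training finishes, you can do a quick sanity check of the finetuned model with FlagReranker (a sketch; the path is the --output_dir used above):

from FlagEmbedding import FlagReranker

# Load the finetuned checkpoint saved under --output_dir.
reranker = FlagReranker("./test_encoder_only_base_bge-reranker-base", use_fp16=True)
score = reranker.compute_score(["what is panda?", "The giant panda is a bear species endemic to China."])
print(score)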

(2) bge-reranker-v2-gemma

torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.reranker.decoder_only.base \
    --model_name_or_path BAAI/bge-reranker-v2-gemma \
    --use_lora True \
    --lora_rank 32 \
    --lora_alpha 64 \
    --use_flash_attn True \
    --target_modules q_proj k_proj v_proj o_proj \
    --save_merged_lora_model True \
    --model_type decoder \
    --cache_dir ./cache/model \
    --train_data ./example_data/prompt_based/examples.jsonl \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation False \
    --query_instruction_for_rerank 'A: ' \
    --query_instruction_format '{}{}' \
    --passage_instruction_for_rerank 'B: ' \
    --passage_instruction_format '{}{}' \
    --output_dir ./test_decoder_only_base_bge-reranker-v2-gemma \
    --overwrite_output_dir \
    --learning_rate 2e-4 \
    --bf16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --weight_decay 0.01 \
    --deepspeed ../ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000

Here are some new arguments:

  • use_lora: If passed, will use LoRA (low-rank adaptation, a parameter-efficient training method) to train the model.
  • lora_rank: The rank of LoRA.
  • lora_alpha: The alpha parameter of LoRA.
  • lora_dropout: The dropout rate of the LoRA modules.
  • target_modules: The target modules to apply LoRA to.
  • modules_to_save: List of modules that should be saved in the final checkpoint.
  • use_flash_attn: If passed, will use flash attention to train the model.
  • from_peft: (no description provided)
  • raw_peft: (no description provided)
  • save_merged_lora_model: If passed, will merge the LoRA modules and save the entire model.
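
Once training completes and the LoRA weights are merged (save_merged_lora_model), the model can be used for scoring with FlagLLMReranker. This is a sketch; the path is the --output_dir used above, and the exact location of the merged checkpoint inside it may differ:

from FlagEmbedding import FlagLLMReranker

# Load the finetuned (merged) checkpoint; adjust the path if the merged model is saved in a subfolder.
reranker = FlagLLMReranker("./test_decoder_only_base_bge-reranker-v2-gemma", use_fp16=True)
score = reranker.compute_score(["what is panda?", "The giant panda is a bear species endemic to China."])
print(score)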

(3) bge-reranker-v2-minicpm-layerwise

torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.reranker.decoder_only.layerwise \
    --model_name_or_path BAAI/bge-reranker-v2-minicpm-layerwise \
    --use_lora True \
    --lora_rank 32 \
    --lora_alpha 64 \
    --use_flash_attn True \
    --target_modules q_proj k_proj v_proj o_proj \
    --save_merged_lora_model True \
    --model_type decoder \
    --model_type from_finetuned_model \
    --start_layer 8 \
    --head_multi True \
    --head_type simple \
    --trust_remote_code True \
    --cache_dir ./cache/model \
    --train_data ./example_data/prompt_based/examples.jsonl \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation False \
    --query_instruction_for_rerank 'A: ' \
    --query_instruction_format '{}{}' \
    --passage_instruction_for_rerank 'B: ' \
    --passage_instruction_format '{}{}' \
    --output_dir ./test_decoder_only_base_bge-reranker-v2-minicpm-layerwise \
    --overwrite_output_dir \
    --learning_rate 2e-4 \
    --bf16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --weight_decay 0.01 \
    --deepspeed ../ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000

Here are some new arguments:

  • use_lora: If passed, will use LoRA (low-rank adaptation, a parameter-efficient training method) to train the model.
  • lora_rank: The rank of LoRA.
  • lora_alpha: The alpha parameter of LoRA.
  • lora_dropout: The dropout rate of the LoRA modules.
  • target_modules: The target modules to apply LoRA to.
  • modules_to_save: List of modules that should be saved in the final checkpoint.
  • use_flash_attn: If passed, will use flash attention to train the model.
  • save_merged_lora_model: If passed, will merge the LoRA modules and save the entire model.
  • model_type: Model type context, which should be one of ['from_raw_model', 'from_finetuned_model'].
  • start_layer: Specifies the layer from which to start computing scores.
  • head_multi: Indicates whether to use one or multiple classifiers.
  • head_type: The type of the classifier.
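
For the layerwise model, inference goes through LayerWiseFlagLLMReranker, where cutoff_layers selects which layer's score is returned. This is a sketch; the path is the --output_dir used above and the layer value is only illustrative:

from FlagEmbedding import LayerWiseFlagLLMReranker

# Load the finetuned layerwise checkpoint saved under --output_dir.
reranker = LayerWiseFlagLLMReranker("./test_decoder_only_base_bge-reranker-v2-minicpm-layerwise", use_fp16=True)
# cutoff_layers picks the layer(s) whose scores are returned; 28 is just an example.
score = reranker.compute_score(["what is panda?", "The giant panda is a bear species endemic to China."], cutoff_layers=[28])
print(score)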