File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in <module>
main()
File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
sigkill_handler(signal.SIGTERM, None)
File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/yfliu/anaconda3/envs/oneflow/bin/python3', '-u', 'projects/Llama/train_net.py', '--config-file', 'projects/Llama/configs/llama_sft.py']' died with <Signals.SIGABRT: 6>.
My script
set -e

if [ -z "$1" ]; then
    echo "Usage: $0 <number>"
    exit 1
fi

libai_path=../libai
cd $libai_path

# scripts split in case blocks.
case $1 in
1)
    # See https://github.com/Oneflow-Inc/libai/tree/main/projects/Llama for reference
    # Notice:
    # 1. Please make sure you have setup destination_path and checkpoint_dir
    #    For example, our checkpoint_dir is /data1/yfliu/models/LLaMA2/LLaMA2_hf_7B downloaded from https://llama.meta.com/llama-downloads/
    #    our destination dir is /data1/yfliu/alpaca
    # 2. You should also modify terms in projects/Llama/configs/llama_config.py
    python projects/Llama/utils/prepare_alpaca.py
    ;;
2)
    # full finetune
    # Please set the finetuning parameters in projects/Llama/configs/llama_sft.py, such as dataset_path and pretrained_model_path
    # Type python3 -m oneflow.distributed.launch -h for more usage
    FILE=projects/Llama/train_net.py
    CONFIG=projects/Llama/configs/llama_sft.py
    GPUS=1
    NODE=1
    NODE_RANK=0
    ADDR=127.0.0.1
    PORT=12345
    LOGDIR=/home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow
    export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true
    python3 -m oneflow.distributed.launch \
        --nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT --logdir $LOGDIR --redirect_stdout_and_stderr \
        $FILE --config-file $CONFIG
    ;;
esac
How I run the script
bash llama_sft.sh 2
The error above is raised when running SFT training, and I cannot locate where the problem actually comes from.
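The traceback only shows the launcher noticing that the worker died. Because the launcher is started with --logdir and --redirect_stdout_and_stderr, the worker's own output is presumably written under $LOGDIR instead of the terminal. Below is a minimal sketch for inspecting it. The helper name show_worker_logs is hypothetical, and the layout of the log directory is an assumption, so the function simply walks every file it finds there.

```shell
#!/usr/bin/env bash
# Sketch: print the last lines of every file under a launcher log directory.
# show_worker_logs is a hypothetical helper, not part of OneFlow or LiBai.
# No particular directory layout is assumed; all regular files are listed.
show_worker_logs() {
    local logdir="$1"
    local lines="${2:-50}"   # number of trailing lines to show per file

    if [ ! -d "$logdir" ]; then
        echo "log directory not found: $logdir" >&2
        return 1
    fi

    # Walk the directory and print a header plus the tail of each file.
    find "$logdir" -type f | sort | while IFS= read -r f; do
        echo "===== $f ====="
        tail -n "$lines" "$f"
    done
}
```

After a failed run, calling show_worker_logs /home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow should show whatever the worker printed just before it aborted, which may point to the failing check.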
Configuration: a single A100 GPU
SIGABRT: 6 error encountered during finetuning