[gemini] gemini support tensor parallelism. #4942
Commits on Nov 9, 2023
dc0dc0b
dd59ca2  [inference] Add smoothquant for llama (hpcaitech#4904)
  * [inference] add int8 rotary embedding kernel for smoothquant (hpcaitech#4843)
  * [inference] add smoothquant llama attention (hpcaitech#4850)
  * add smoothquant llama attention
  * remove useless code
  * remove useless code
  * fix import error
  * rename file name
  * [inference] add silu linear fusion for smoothquant llama mlp (hpcaitech#4853)
  * add silu linear
  * update skip condition
  * catch smoothquant cuda lib exception
  * process exception for tests
  * [inference] add llama mlp for smoothquant (hpcaitech#4854)
  * add llama mlp for smoothquant
  * fix down out scale
  * remove duplicate lines
  * add llama mlp check
  * delete useless code
  * [inference] add smoothquant llama (hpcaitech#4861)
  * add smoothquant llama
  * fix attention accuracy
  * fix accuracy
  * add kv cache and save pretrained
  * refactor example
  * delete smooth
  * refactor code
  * [inference] add smooth function and delete useless code for smoothquant (hpcaitech#4895)
  * add smooth function and delete useless code
  * update datasets
  * remove duplicate import
  * delete useless file
  * refactor codes (hpcaitech#4902)
  * refactor code
  * add license
  * add torch-int and smoothquant license
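
For context, the core of SmoothQuant is migrating activation outliers into the weights, per input channel, before int8 quantization. A minimal sketch of that smoothing step, assuming the standard scale formula s_j = max|X_j|^alpha / max|W_j|^(1-alpha); the function name and default alpha are illustrative, not this repo's API:

```python
import torch

def smooth_linear(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Fold per-channel smoothing scales into a linear layer's weight.

    act_absmax: max |activation| per input channel, shape (in_features,)
    weight:     linear weight, shape (out_features, in_features)
    At runtime activations are divided by `scales`, so the product is
    unchanged: (X / s) @ (W * s).T == X @ W.T, but X / s is easier to
    quantize to int8 because its outliers have been flattened.
    """
    w_absmax = weight.abs().amax(dim=0)  # per-input-channel weight range
    scales = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    return scales, weight * scales.unsqueeze(0)
```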
52707c6  Update flash_attention_patch.py
  To be compatible with a recent change in the Transformers library, where a new
  argument 'padding_mask' was added to the forward function of the attention
  layer (huggingface/transformers#25598).
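
A sketch of the compatibility pattern, assuming the usual monkey-patch setup; the function body and extra arguments are illustrative. The patched forward simply accepts the new keyword so newer Transformers releases can call it without a TypeError:

```python
from typing import Optional

import torch

def attention_forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,  # added upstream in huggingface/transformers#25598
    **kwargs,
):
    # Accepting (and ignoring) padding_mask keeps this patched forward callable
    # from both old and new Transformers versions; the flash-attention path
    # derives padding information from attention_mask instead.
    ...
```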
61ec9f7  [kernel] support pure fp16 for cpu adam and update gemini optim tests (hpcaitech#4921)
  * [kernel] support pure fp16 for cpu adam (hpcaitech#4896)
  * [kernel] fix cpu adam kernel for pure fp16 and update tests (hpcaitech#4919)
  * [kernel] fix cpu adam
  * [test] update gemini optim test
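
For background, "pure fp16" here means parameters, gradients, and both Adam moments live in half precision, with no fp32 master copy. A minimal single-tensor sketch of such a step (not the fused CPU kernel; staging the moment math in fp32 is an assumption about how rounding error is kept in check):

```python
import torch

@torch.no_grad()
def adam_step_fp16(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with every stored tensor in torch.float16.

    The moment updates are computed in fp32 locally, then stored back to
    fp16, limiting accumulated rounding error per step.
    """
    g32, m32, v32 = g.float(), m.float(), v.float()
    m32.mul_(beta1).add_(g32, alpha=1 - beta1)
    v32.mul_(beta2).addcmul_(g32, g32, value=1 - beta2)
    m_hat = m32 / (1 - beta1 ** step)   # bias correction
    v_hat = v32 / (1 - beta2 ** step)
    p.add_((-lr * m_hat / (v_hat.sqrt() + eps)).half())
    m.copy_(m32.half())
    v.copy_(v32.half())
```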
561553b  [format] applied code formatting on changed files in pull request 4908 (hpcaitech#4918)
  Co-authored-by: github-actions <[email protected]>
8d42002  [gemini] support gradient accumulation (hpcaitech#4869)
  * add test
  * fix no_sync bug in low level zero plugin
  * fix test
  * add argument for grad accum
  * add grad accum in backward hook for gemini
  * finish implementation, rewrite tests
  * fix test
  * skip stuck model in low level zero test
  * update doc
  * optimize communication & fix gradient checkpoint
  * modify doc
  * cleaning codes
  * update cpu adam fp16 case
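
As a refresher on the feature itself, gradient accumulation runs several micro-batch backward passes before a single optimizer step, so the effective batch size grows without extra activation memory. A generic PyTorch sketch (the Gemini plugin and backward-hook wiring are omitted; names are illustrative):

```python
def train_epoch(model, optimizer, criterion, dataloader, accum_steps=4):
    """Accumulate gradients over `accum_steps` micro-batches per update."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(dataloader):
        loss = criterion(model(x), y) / accum_steps  # scale so summed grads average out
        loss.backward()                              # grads accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```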
da55732  [hotfix] fix torch 2.0 compatibility (hpcaitech#4936)
  * [hotfix] fix launch
  * [test] fix test gemini optim
  * [shardformer] fix vit
775ea1b
0074178  [format] applied code formatting on changed files in pull request 4820 (hpcaitech#4886)
  Co-authored-by: github-actions <[email protected]>
907aa98
31fddbc  [Refactor] Integrated some lightllm kernels into token-attention (hpcaitech#4946)
  * add some req for inference
  * clean codes
  * add codes
  * add some lightllm deps
  * clean codes
  * hello
  * delete rms files
  * add some comments
  * add comments
  * add doc
  * add lightllm deps
  * add lightllm chatglm2 kernels
  * add lightllm chatglm2 kernels
  * replace rotary embedding with lightllm kernel
  * add some comments
  * add some comments
  * add some comments
  * add
  * replace fwd kernel att1
  * fix a arg
  * add
  * add
  * fix token attention
  * add some comments
  * clean codes
  * modify comments
  * fix readme
  * fix bug
  * fix bug
  Co-authored-by: cuiqing.li <[email protected]>
  Co-authored-by: CjhHa1 <[email protected]>
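
The rotary embedding these kernels accelerate rotates each query/key feature pair by a position-dependent angle. A plain-PyTorch reference sketch of the interleaved-pair variant (illustrative, not the lightllm Triton kernel):

```python
import torch

def apply_rotary(x: torch.Tensor, positions: torch.Tensor, theta: float = 10000.0):
    """x: (T, d) with d even; positions: (T,) token indices."""
    d = x.shape[-1]
    inv_freq = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    ang = positions.float().unsqueeze(-1) * inv_freq                       # (T, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]       # each feature pair is one 2-D point
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # 2-D rotation by the pair's angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```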
8633a87  [test] merge old components to test to model zoo (hpcaitech#4945)
  * [test] add custom models in model zoo
  * [test] update legacy test
  * [test] update model zoo
  * [test] update gemini test
  * [test] remove components to test
9d543af  [inference] add reference and fix some bugs (hpcaitech#4937)
  * add reference and fix some bugs
  * update gptq init
  Co-authored-by: Xu Kai <[email protected]>
fe79560  [Inference] Add Bench Chatglm2 script (hpcaitech#4963)
  * add bench chatglm
  * fix bug and make utils
  Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
a610046  [Pipeline inference] Combine kvcache with pipeline inference (hpcaitech#4938)
  * merge kvcache with pipeline inference and refactor the code structure
  * support ppsize > 2
  * refactor pipeline code
  * do pre-commit
  * modify benchmark
  * fix benchmark
  * polish code
  * add docstring and update readme
  * refactor the code
  * fix some logic bug of ppinfer
  * polish readme
  * fix typo
  * skip infer test
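
For context, the KV cache being merged here stores each layer's past keys and values so every decode step attends over history without recomputing it. A minimal sketch of such a cache (shapes and names are illustrative):

```python
import torch

class KVCache:
    """Preallocated per-layer key/value cache for incremental decoding."""

    def __init__(self, batch, heads, max_len, head_dim, dtype=torch.float16):
        self.k = torch.zeros(batch, heads, max_len, head_dim, dtype=dtype)
        self.v = torch.zeros_like(self.k)
        self.len = 0

    def append(self, k_new, v_new):
        # k_new / v_new: (batch, heads, t_new, head_dim) for the newest token(s)
        t = k_new.shape[2]
        self.k[:, :, self.len:self.len + t] = k_new
        self.v[:, :, self.len:self.len + t] = v_new
        self.len += t
        # Return views over the valid prefix for this step's attention.
        return self.k[:, :, :self.len], self.v[:, :, :self.len]
```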
3b8137d
9fce43b  [Inference] Dynamic Batching Inference, online and offline (hpcaitech#4953)
  * [inference] Dynamic Batching for Single and Multiple GPUs (hpcaitech#4831)
  * finish batch manager
  * 1
  * first
  * fix
  * fix dynamic batching
  * llama infer
  * finish test
  * support different lengths generating
  * del prints
  * del prints
  * fix
  * fix bug
    Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
  * [inference] Async dynamic batching (hpcaitech#4894)
  * finish input and output logic
  * add generate
  * test forward
  * 1
  * [inference]Re push async dynamic batching (hpcaitech#4901)
  * adapt to ray server
  * finish async
  * finish test
  * del test
    Co-authored-by: yuehuayingxueluo <[email protected]>
  * Revert "[inference]Re push async dynamic batching (hpcaitech#4901)" (hpcaitech#4905)
    This reverts commit fbf3c09.
  * Revert "[inference] Async dynamic batching (hpcaitech#4894)"
    This reverts commit fced140.
  * Revert "[inference] Async dynamic batching (hpcaitech#4894)" (hpcaitech#4909)
    This reverts commit fced140.
  * Add Ray Distributed Environment Init Scripts
  * support DynamicBatchManager base function
  * revert _set_tokenizer version
  * add driver async generate
  * add async test
  * fix bugs in test_ray_dist.py
  * add get_tokenizer.py
  * fix code style
  * fix bugs about No module named 'pydantic' in ci test
  * fix bugs in ci test
  * fix bugs in ci test
  * fix bugs in ci test
  * [infer]Add Ray Distributed Environment Init Scripts (hpcaitech#4911)
  * Revert "[inference] Async dynamic batching (hpcaitech#4894)"
    This reverts commit fced140.
  * Add Ray Distributed Environment Init Scripts
  * support DynamicBatchManager base function
  * revert _set_tokenizer version
  * add driver async generate
  * add async test
  * fix bugs in test_ray_dist.py
  * add get_tokenizer.py
  * fix code style
  * fix bugs about No module named 'pydantic' in ci test
  * fix bugs in ci test
  * fix bugs in ci test
  * fix bugs in ci test
  * support dynamic batch for bloom model and is_running function
  * [Inference]Test for new Async engine (hpcaitech#4935)
  * infer engine
  * infer engine
  * test engine
  * test engine
  * new manager
  * change step
  * add
  * test
  * fix
  * fix
  * finish test
  * finish test
  * finish test
  * finish test
  * add license
    Co-authored-by: yuehuayingxueluo <[email protected]>
  * add assertion for config (hpcaitech#4947)
  * [Inference] Finish dynamic batching offline test (hpcaitech#4948)
  * test
  * fix test
  * fix quant
  * add default
  * fix
  * fix some bugs
  * fix some bugs
  * fix
  * fix bug
  * fix bugs
  * reset param
  Co-authored-by: yuehuayingxueluo <[email protected]>
  Co-authored-by: Cuiqing Li <[email protected]>
  Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
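
Conceptually, a dynamic batch manager of this kind keeps a waiting queue, admits requests into the running batch whenever there is room, and retires finished sequences after every decode step. A minimal sketch, with all class and method names (including the assumed `engine.decode_step`) chosen for illustration rather than taken from the engine:

```python
from collections import deque

class DynamicBatchManager:
    """Continuous batching: grow/shrink the running batch every decode step."""

    def __init__(self, engine, max_batch_size=32):
        self.engine = engine                  # assumed to expose decode_step(batch)
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def add_request(self, req):
        self.waiting.append(req)

    def step(self):
        # Admit queued requests while there is room in the batch.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        finished = self.engine.decode_step(self.running)  # one token per sequence
        # Retire sequences that hit EOS or their length limit.
        self.running = [r for r in self.running if r not in finished]
        return finished
```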
62eb99f  [Kernels] Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (hpcaitech#4965)
  * adding flash-decoding
  * clean
  * adding kernel
  * adding flash-decoding
  * add integration
  * add
  * adding kernel
  * adding kernel
  * adding triton 2.1.0 features for inference
  * update bloom triton kernel
  * remove useless vllm kernels
  * clean codes
  * fix
  * adding files
  * fix readme
  * update llama flash-decoding
  Co-authored-by: cuiqing.li <[email protected]>
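
Flash-decoding parallelizes single-query attention over a long KV cache by scoring each chunk independently and then merging the partial outputs with their log-sum-exp weights. A plain-PyTorch reference sketch of that reduction (illustrative, not the Triton kernel):

```python
import torch

def flash_decode(q, k, v, chunk=256):
    """q: (d,), k/v: (T, d). Returns softmax(k @ q / sqrt(d)) @ v, chunk by chunk."""
    d = q.shape[-1]
    outs, lses = [], []
    for s in range(0, k.shape[0], chunk):
        scores = (k[s:s + chunk] @ q) / d ** 0.5          # (t,) partial logits
        lses.append(torch.logsumexp(scores, dim=0))       # chunk's log-normalizer
        outs.append(torch.softmax(scores, dim=0) @ v[s:s + chunk])
    w = torch.softmax(torch.stack(lses), dim=0)           # each chunk's share of mass
    return (torch.stack(outs) * w.unsqueeze(-1)).sum(0)   # exact global softmax result
```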
fa1cbd3  fix ColossalEval (hpcaitech#4992)
  Co-authored-by: Xu Yuanchen <[email protected]>
3209431  [doc] Update doc for colossal-inference (hpcaitech#4989)
  * update doc
  * Update README.md
  Co-authored-by: cuiqing.li <[email protected]>
f0482f4  [hotfix] Fix the bug where process groups were not being properly released (hpcaitech#4940)
  * Fix the bug where process groups were not being properly released.
  * test
  * Revert "test"
    This reverts commit 479900c.
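
In PyTorch, "releasing" a process group means explicitly destroying it so its NCCL/Gloo communicators are freed rather than leaked. A minimal sketch of the pattern (the helper and its body are illustrative):

```python
import torch.distributed as dist

def make_and_release_subgroup(ranks):
    """Create a subgroup, use it, then free its communicator resources."""
    group = dist.new_group(ranks=ranks)    # every rank must call this collectively
    try:
        ...                                # collectives on `group` go here
    finally:
        dist.destroy_process_group(group)  # release NCCL/Gloo resources for the group
```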
cd8ad65
5266946
ab8468c  [Pipeline Inference] Merge pp with tp (hpcaitech#4993)
  * refactor pipeline into new CaiInferEngine
  * update llama modeling forward
  * merge tp with pp
  * update docstring
  * optimize test workflow and example
  * fix typo
  * add assert and todo
f9c1920  [release] update version (hpcaitech#4995)
  * [release] update version
  * [hotfix] fix ci
2043b9d  [gemini] gemini support tp
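
As background on what "Gemini supports TP" entails, tensor parallelism shards each linear layer across ranks, e.g. a column-parallel split where every rank computes its slice of the output features and the slices are gathered. A minimal forward-only sketch, with class and initialization details assumed for illustration (not Gemini's actual classes):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Each rank owns out_features // world_size output columns of W."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.world_size = dist.get_world_size()
        assert out_features % self.world_size == 0
        self.weight = nn.Parameter(torch.empty(out_features // self.world_size, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x):
        local = x @ self.weight.t()                   # (..., out / world_size)
        shards = [torch.empty_like(local) for _ in range(self.world_size)]
        dist.all_gather(shards, local.contiguous())   # collect every rank's slice
        return torch.cat(shards, dim=-1)              # full (..., out_features)
```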
da1915d
9fd9e69  update checkpointIO
a89f2fd  support fused layernorm
2406cb0  update fusedlayernorm
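
Fused LayerNorm support usually means preferring a single-kernel implementation such as APEX's FusedLayerNorm and falling back to the stock PyTorch op when it is unavailable. A hedged sketch of that selection pattern:

```python
import torch.nn as nn

def build_layernorm(hidden_size: int, eps: float = 1e-5) -> nn.Module:
    """Prefer APEX's fused kernel; fall back to the stock PyTorch op."""
    try:
        from apex.normalization import FusedLayerNorm
        return FusedLayerNorm(hidden_size, eps=eps)  # one fused CUDA kernel
    except ImportError:
        return nn.LayerNorm(hidden_size, eps=eps)    # portable fallback
```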
a0509a6
12cd780
0110902
86a5eca
6f13876
5f16e4f
adead50
ed825dc
37494c3
73da4ca
cf2bc63
6c85a9e
8dd4b41
3d8319e
66ffed5  modify tp gather method
c40c459
bc575a2