Date: 2023-07-03
- [Experimental] Added support for GPT-NeoX models.
- [Experimental] Added support for BLOOM models.
- [Prototype] Added support for LLaMA models.
- Added support for more flexible tensor-parallel configurations for GPT2, OPT, and BLOOM. The number of attention heads no longer needs to be evenly divisible by `tp_degree`. (Note: `tp_degree` still needs to satisfy the runtime topology constraint for collective communication (i.e. Allreduce). For more details on supported topologies, see Tensor-parallelism support and https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/collective-communication.html.) A minimal sketch follows this list.
- Added multi-query / multi-group attention support for GPT2.
- Fixed NaN issues for the GPT2 model.
- Fixed gibberish output for OPT and GPT-NeoX models.
- Resolved an issue where NaN values could be produced when the `context_length` argument was used in GPT2/OPT.
- Known issue: cache reorder support for beam search is missing.
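As a rough illustration of the more flexible tensor-parallel support, the sketch below loads GPT2 (12 attention heads) with `tp_degree=8`, which does not divide the head count evenly. Class and helper names follow the library's public demo scripts; exact signatures may differ between releases.

```python
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.gpt2.model import GPT2ForSampling

# Split the HuggingFace checkpoint into per-layer files (one-time step).
hf_model = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(hf_model, './gpt2-split')

# GPT2 has 12 attention heads; tp_degree=8 does not divide 12 evenly,
# which this release now allows (8 is still a supported collective topology).
neuron_model = GPT2ForSampling.from_pretrained('./gpt2-split', batch_size=1,
                                               tp_degree=8, amp='f16')
neuron_model.to_neuron()  # shard weights across NeuronCores and compile
```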
Date: 2023-06-12
- Added `int8` weight storage for `GPT2` models (see the sketch after this list).
- Improved prompt context encoding performance for `GPT2` models.
- Improved collective communications performance for tp-degrees 4, 8, and 24 on Inf2.
- Improved collective communications performance for tp-degrees 8 and 32 on Trn1.
- Added support for the `--model-type=transformer-inference` compiler flag for optimized decoder-only LLM inference (see the example after this list).
- Added padding to the `GPT-J` `linear` layer to correctly handle odd vocabulary sizes.
- Resolved issues where the HuggingFace `generate` method produced incorrect results when `beam_search` was used.
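A minimal sketch of enabling `int8` weight storage for a GPT2 model. It assumes the `NeuronConfig`/`QuantizationConfig` options and the `'s8'`/`'f16'` dtype names from the library's quantization documentation; treat these names as assumptions for this release.

```python
from transformers_neuronx.config import NeuronConfig, QuantizationConfig
from transformers_neuronx.gpt2.model import GPT2ForSampling

# Store weights in int8 on device and dequantize to fp16 for compute
# (names assumed from the library's quantization documentation).
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype='s8', dequant_dtype='f16'),
)
model = GPT2ForSampling.from_pretrained('./gpt2-split', batch_size=1,
                                        tp_degree=2, amp='f16',
                                        neuron_config=neuron_config)
model.to_neuron()
```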
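The `--model-type=transformer-inference` flag is a Neuron compiler option; one common way to pass it is through the `NEURON_CC_FLAGS` environment variable before compilation, as sketched below.

```python
import os

# Ask the Neuron compiler to apply its decoder-only-LLM optimizations.
# This must be set before the model is compiled (i.e. before to_neuron()).
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'
```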
Date: 2023-04-28
- Added `transformers-neuronx` artifacts to the PyPI repository.
- Added support for the HuggingFace `generate()` method (a generation sketch follows this list).
- Added support for model serialization, including model saving, loading, and weight swapping (a save/load sketch follows this list).
- Added support for caching compiled artifacts.
- Improved performance by removing unnecessary KV-cache tensor resetting.
- Improved prompt context encoding performance (`OPT`, `GPT2`).
- Incorrect `GPT-J` `amp_callback` import: fixed so that the `GPT-J` demo now imports the correct `amp_callback` function.
- Known issue: incorrect output with HuggingFace `beam_search`. When the HuggingFace `generate` method is configured to use `beam_search`, it can produce incorrect results for certain configurations. It is recommended to use other generation methods such as `sample` or `greedy_search` instead (see the generation sketch after this list).
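A sketch of driving a compiled Neuron model through the HuggingFace `generate()` API. The `HuggingFaceGenerationModelAdapter` wrapper name is taken from the library's generation utilities and should be treated as an assumption for this release; per the known issue above, it uses sampling rather than `beam_search`.

```python
from transformers import AutoConfig, AutoTokenizer
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

# Compile the Neuron model, then wrap it so it exposes generate().
neuron_model = GPT2ForSampling.from_pretrained('./gpt2-split', tp_degree=2, amp='f16')
neuron_model.to_neuron()
wrapped = HuggingFaceGenerationModelAdapter(AutoConfig.from_pretrained('gpt2'),
                                            neuron_model)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
input_ids = tokenizer('Hello, my name is', return_tensors='pt').input_ids

# Use sampling (or greedy search) instead of beam_search, which is the
# known-issue configuration called out above.
output = wrapped.generate(input_ids, do_sample=True, top_k=50, max_length=128)
print(tokenizer.decode(output[0]))
```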
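A sketch of the serialization flow. The `save`/`load` method names below are hypothetical, based on the save/load pattern in later documentation; the exact entry points in this release may differ.

```python
from transformers_neuronx.gpt2.model import GPT2ForSampling

# First run: compile the model, then persist the compiled artifacts.
model = GPT2ForSampling.from_pretrained('./gpt2-split', tp_degree=2, amp='f16')
model.to_neuron()
model.save('./gpt2-compiled')  # hypothetical: write compiled artifacts to disk

# Later run: reload the cached artifacts instead of recompiling from scratch.
model = GPT2ForSampling.from_pretrained('./gpt2-split', tp_degree=2, amp='f16')
model.load('./gpt2-compiled')  # hypothetical: reuse the saved artifacts
model.to_neuron()
```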
Date: 2023-02-24
- Added error handling to check if the desired generated sequence length is valid based on the model configuration
- Improved logging:
- Reduced overly verbose compiler messages
- Disabled lazy module warnings
- Updated `src/transformers_neuronx/gptj/demo.py` to correctly use the `amp_callback` function from `transformers_neuronx.gpt2.demo`
- Extended the `gpt_demo.py` `save` function to support GPT-2 and GPT-J configs
Date: 2023-02-08
First release of `transformers-neuronx`, a new library that enables LLM model inference on Inf2 & Trn1 using the Neuron SDK. `transformers-neuronx` contains optimized model implementations that are checkpoint-compatible with HuggingFace Transformers, and currently supports Transformer decoder models like GPT2, GPT-J, and OPT. A minimal usage sketch follows.
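A minimal end-to-end sketch of the sampling flow, with class and method names taken from the library's public demo scripts (exact signatures may vary between releases):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.gpt2.model import GPT2ForSampling

# Split the HuggingFace checkpoint, then load it into the Neuron implementation.
save_pretrained_split(AutoModelForCausalLM.from_pretrained('gpt2'), './gpt2-split')
model = GPT2ForSampling.from_pretrained('./gpt2-split', batch_size=1,
                                        tp_degree=2, amp='f16')
model.to_neuron()  # shard weights and compile for the NeuronCores

# Tokenize a prompt and sample a continuation on device.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
input_ids = tokenizer('Hello, my name is', return_tensors='pt').input_ids
with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(generated[0]))
```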