CTranslate2 3.12.0

@guillaumekln guillaumekln released this 17 Apr 18:22
· 237 commits to master since this release

New features

  • Add methods Generator.generate_tokens and Translator.generate_tokens, returning a generator that yields tokens as soon as they are produced by the model (not compatible with beam search)
  • Improve the performance of rotary embeddings on CPU with an alternative implementation, enabled by setting rotary_interleave=False in the model specification (may require permuting the QK weights)
  • Support a variable number of input frames in the method Whisper.align to improve batch support
  • Expose the flag low_cpu_mem_usage in the Transformers converter to reduce memory usage when loading large models (requires the package accelerate)
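
The streaming methods can be consumed like any Python generator. A minimal sketch of the consumption pattern, using a hypothetical `fake_generate_tokens` stand-in for an actual `Translator.generate_tokens` call (which would require a converted model on disk):

```python
# Sketch only: `fake_generate_tokens` stands in for the real
# Translator.generate_tokens / Generator.generate_tokens methods,
# which yield tokens one at a time (beam search is not supported).

def fake_generate_tokens(prompt_tokens):
    # Stand-in for the model call: yields tokens as they are "generated".
    for token in ["▁Hello", "▁world", "</s>"]:
        yield token

def stream_detokenize(token_iter, end_token="</s>"):
    """Accumulate streamed SentencePiece-style tokens into a string."""
    pieces = []
    for token in token_iter:
        if token == end_token:
            break
        pieces.append(token.replace("▁", " "))
    return "".join(pieces).strip()

print(stream_detokenize(fake_generate_tokens(["▁Hi"])))  # prints: Hello world
```

In a real application, each token can be forwarded to the user (e.g. over a websocket) as soon as it is yielded, instead of waiting for the full result.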
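
Permuting QK weights for a non-interleaved rotary layout amounts to reordering rows of the Q and K projections within each head, so that interleaved rotary pairs (x0, x1, x2, x3, ...) become two contiguous halves (x0, x2, ..., x1, x3, ...). A NumPy sketch of one such permutation; this illustrates the idea and is not necessarily the exact transform required by every model:

```python
import numpy as np

def permute_qk_non_interleaved(weight, num_heads):
    """Reorder the rows of a Q or K projection matrix so that rotary
    dimensions that were interleaved per head become split into halves.

    weight: array of shape (num_heads * head_dim, input_dim)
    """
    out_dim, in_dim = weight.shape
    head_dim = out_dim // num_heads
    # Group rows per head into (pair, 2) blocks, then gather all even
    # rows first and all odd rows second.
    w = weight.reshape(num_heads, head_dim // 2, 2, in_dim)
    w = w.transpose(0, 2, 1, 3)
    return w.reshape(out_dim, in_dim)
```

For a single head with rows [r0, r1, r2, r3], the result is [r0, r2, r1, r3], i.e. even-indexed rotary dimensions followed by odd-indexed ones.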

Fixes and improvements

  • Fix crash in Whisper.align when num_frames // 2 <= median_filter_width
  • Raise an error if arguments end_token or suppress_sequences contain tokens that are not in the vocabulary
  • Optimize the quantization of FP16 weights during model conversion
  • In the Transformers converter, also load the model weights in FP16 when the selected quantization is int8_float16
  • Update the Whisper timestamp decoding rules to prevent the generation of segments with zero duration
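
For context on the quantization-related items above: int8 quantization of weight matrices is commonly implemented with per-row absolute-maximum scales. The following NumPy sketch illustrates that general scheme; it is an assumption for illustration, not CTranslate2's actual implementation:

```python
import numpy as np

def quantize_int8_rowwise(weights_fp16):
    """Per-row absmax int8 quantization sketch (not CTranslate2's code).

    Returns the int8 weights and one FP16 scale per output row, so the
    original values can be approximated as q * scale.
    """
    w = weights_fp16.astype(np.float32)        # compute scales in FP32
    scales = np.abs(w).max(axis=1) / 127.0     # one scale per row
    scales[scales == 0.0] = 1.0                # avoid division by zero
    q = np.clip(np.round(w / scales[:, None]), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

w = np.array([[0.5, -1.0], [2.0, 0.25]], dtype=np.float16)
q, s = quantize_int8_rowwise(w)
```

Loading the weights in FP16 up front (as the converter now does for int8_float16) halves the peak memory needed before this quantization step, compared to loading in FP32.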