
add chunk_length parameter to Whisper #1909

Closed · wants to merge 6 commits

Conversation


@MahmoudAshraf97 (Contributor) commented Jul 7, 2024

distil-whisper models perform best with chunk sizes shorter than the 30 s used by the original Whisper models, so this PR introduces the option to build the engine with a different chunk length.
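
For reference, here is the arithmetic behind that 30 s limit: Whisper computes 100 mel frames per second of audio and its conv stack downsamples by 2, so a 30 s chunk occupies 1500 encoder positions. A minimal sketch (the function name is made up for illustration):

def encoder_positions(chunk_length_s: int,
                      frames_per_second: int = 100,
                      conv_downsample: int = 2) -> int:
    # Rows of the positional embedding table used by a chunk of this length.
    return chunk_length_s * frames_per_second // conv_downsample

print(encoder_positions(30))  # 1500, the original Whisper chunk
print(encoder_positions(15))  # 750, e.g. a shorter distil-whisper chunk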

Summary of the changes in this PR:

  • Whisper encoder now supports changing chunk_size
  • The example has been updated to support remove_input_padding in the decoder
  • conv1d now supports inputs with more than one dynamic dimension
  • The Whisper decoder should now support inflight batching when built with paged_kv_cache and run through the executor, although there is no clear way to feed the encoder input and the prompt to the tensorrt_llm.bindings.Request class, since it only accepts lists of tokens for all of its inputs while the encoder output is a float tensor

Enabling remove_input_padding in the encoder wasn't as easy as I thought: all of my attempts failed at the step where the positional embeddings are added to the conv output, because the chunk size is not defined at build time and the positional embeddings tensor's first dim is 1500, which corresponds to 30 s inputs. When the chunk_size is known at build time it's easy to slice the positional embeddings tensor to the correct size and add it to the conv output, but when the chunk size is unknown, the build fails at fetching the correct indices, for example:

import tensorrt_llm.functional as F

# Build one [0, input_length) index sequence per request, concatenate them,
# and gather the matching rows of the positional embedding table.
positional_embeddings = F.gather(
    positional_embedding,
    dim=0,
    indices=F.concat(
        [F.arange(0, input_length, "int32") for input_length in input_lengths.unbind()]
    ),
)

## only for padded input
positional_embeddings = F.view(positional_embeddings, [-1, chunk_size, hidden_size])
##
x = x + positional_embeddings

input_lengths.unbind() fails because the shape of input_lengths is [-1], i.e. dynamic, so the number of sequences to unbind isn't known at build time.
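
For contrast, the case that does work (chunk_size fixed at build time) only needs a static slice of the embedding table. A minimal sketch of that idea, written with plain PyTorch tensors rather than TensorRT-LLM graph ops and with made-up sizes:

import torch

chunk_length = 15                            # seconds, fixed at build time
encoder_seq_len = chunk_length * 100 // 2    # 100 mel frames/s, conv stride 2 -> 750
hidden_size = 1280                           # e.g. the large / distil-large width

positional_embedding = torch.randn(1500, hidden_size)    # full 30 s table
x = torch.randn(4, encoder_seq_len, hidden_size)          # conv output, batch of 4

# Static slice: the length is a constant at build time, so no dynamic
# gather/unbind over input_lengths is required.
x = x + positional_embedding[:encoder_seq_len]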

Removing input padding from the encoder isn't that important, TBH, since we expect encoder inputs to have the same shape and size except for the last window of an audio. It would mainly be beneficial in scenarios where the requests are many separate audio files that are all shorter than 30 s and vary a lot in length.

On the other hand, remove_input_padding is important on the decoder side because it's required to enable inflight batching. From a quick trial on a 30 min audio file, the larger the batch size, the slower the generation:

efficiency = generation loops needed / actual generation loops (where the actual loops are determined by the longest sequence in the output)

Time is for decoding only, over the whole 30 min (mean ± std. dev. of 7 runs, 1 loop each):

batch_size   efficiency   time per run
1            1.00         626 ms ± 17.5 ms
2            0.87         716 ms ± 50.9 ms
4            0.80         755 ms ± 31.1 ms
8            0.67         1.05 s ± 21 ms
16           0.56         1.33 s ± 38 ms
32           0.45         1.85 s ± 34.4 ms

As we can see, the time taken increases with batch size, which is counterproductive for large workloads; hence the need for inflight batching.
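
A minimal sketch of one plausible reading of the efficiency metric above: with static batching each batch runs for as many decode loops as its longest sequence, so efficiency is the share of those loops doing useful work (the function name and example lengths are made up):

def batching_efficiency(output_lengths, batch_size):
    # Useful decode steps divided by total decode slots: every batch runs
    # for max(lengths) loops, and its shorter members idle for the difference.
    useful = sum(output_lengths)
    total = 0
    for i in range(0, len(output_lengths), batch_size):
        batch = output_lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return useful / total

lengths = [120, 80, 60, 140]                 # tokens per chunk of a long audio
print(batching_efficiency(lengths, 1))       # 1.0
print(batching_efficiency(lengths, 4))       # ~0.71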

@yuekaizhang

@MahmoudAshraf97 Hi, thanks for your effort. I will take this PR into our internal GitLab. Also, we will add your name to the co-author list and credit your work in the release notes for the Whisper IFB feature.

@yuekaizhang

@MahmoudAshraf97 Hi, I just tried the conv1d solution with more than one dynamic dimension by setting the input up as below:

    x = Tensor(name="x",
               dtype=self._dtype,
               shape=[-1, self.config.n_mels, -1],
               dim_range=OrderedDict([
                   ("batch_size", [bs_range]),
                   ("feature_dim", [self.config.n_mels]),
                   ("feature_len_range", [1, 1000, 3000]),
               ]))

However, the build process failed. It seems the slice operator needs to know the value of x.shape[1].

I was wondering why you set a fixed config.chunk_length here rather than letting it be dynamic.

@MahmoudAshraf97
Contributor Author

@MahmoudAshraf97 Hi, I just tried the conv1d solution with more than one dynamic dimension by setting the input up as below:

    x = Tensor(name="x",
               dtype=self._dtype,
               shape=[-1, self.config.n_mels, -1],
               dim_range=OrderedDict([
                   ("batch_size", [bs_range]),
                   ("feature_dim", [self.config.n_mels]),
                   ("feature_len_range", [1, 1000, 3000]),
               ]))

However, the build process failed. It seems the slice operator needs to know the value of x.shape[1].

I was wondering why you set a fixed config.chunk_length here rather than letting it be dynamic.

As I mentioned in my trials in the PR description, this was a step towards making it work, but I couldn't complete it because of the slice operator (or the other operators that add the positional embeddings to x). Before this change the build failed at the first conv layer; now it passes the conv layers and fails at a later stage, so I guess we are halfway there.

@yuekaizhang

As I mentioned in my trials in the PR description, this was a step towards making it work, but I couldn't complete it because of the slice operator (or the other operators that add the positional embeddings to x). Before this change the build failed at the first conv layer; now it passes the conv layers and fails at a later stage, so I guess we are halfway there.

@MahmoudAshraf97 I see, thanks. Btw, the remove_input_padding issue for the decoder has been fixed. The code will sync to GitHub about a week later.

cross_attention_mask = torch.ones(
    [encoder_outputs.shape[0], 1,
     encoder_outputs.shape[1]]).int().cuda()
cross_attention_mask = (
@galv

Hi @MahmoudAshraf97, if I understand correctly, you are making this change because distil-whisper can work with dynamic chunk sizes, unlike Whisper, which must use fixed 30-second chunks. Am I understanding correctly? Thank you.

@MahmoudAshraf97
Contributor Author

Hi @galv, this PR contains 2 main changes:

  1. The encoder is no longer restricted to 30 s inputs; this helps in the case of distil-whisper, as you mentioned
  2. The decoder now supports remove_input_padding and accepts packed inputs to save memory (sketched below)
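
A minimal sketch of what packed (remove_input_padding) decoder inputs look like compared to padded ones; the token ids and tensor names here are only illustrative, not the exact TensorRT-LLM interface:

import torch

requests = [[50258, 50259, 50359],           # three requests of uneven length
            [50258, 50259],
            [50258, 50259, 50359, 50363]]

# Padded layout: [batch, max_len], unused slots filled with a pad id.
max_len = max(len(r) for r in requests)
padded = torch.zeros(len(requests), max_len, dtype=torch.int32)
for i, r in enumerate(requests):
    padded[i, :len(r)] = torch.tensor(r, dtype=torch.int32)

# Packed layout: one flat [total_tokens] tensor plus per-request lengths,
# so no memory or compute is spent on padding.
packed = torch.tensor([t for r in requests for t in r], dtype=torch.int32)
input_lengths = torch.tensor([len(r) for r in requests], dtype=torch.int32)

print(padded.shape)            # torch.Size([3, 4])
print(packed.shape)            # torch.Size([9])
print(input_lengths.tolist())  # [3, 2, 4]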

@kaiyux mentioned this pull request Jul 23, 2024
@MahmoudAshraf97
Contributor Author

Closing this since it was merged.
