Why are timestamps always on exact one-second boundaries? #1341

billyc · 2023-10-02T11:08:46Z

Hello, I am using whisper.cpp to create .SRT subtitle files from audio. Everything is working beautifully, except the timestamps are always on one-second boundaries. In all of the examples I see online, the start/end times of spoken sentences seem to have sub-second accuracy.

Is there a setting that controls this?

My setup:

MacBook Air M1 (2020)
whisper.cpp code built from GitHub
ggml-large.bin model

Command:

fname=tr11.mp4

~/git/whisper.cpp/whisper -m ~/git/whisper.cpp/models/ggml-large.bin \
  --language tr \
  -t 7 \
  -osrt \
  -of $fname \
  -f $fname.wav

Sample output:

whisper_init_from_file_no_state: loading model from '/Users/billy/git/whisper.cpp/models/ggml-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5
whisper_model_load: mem required  = 3557.00 MB (+   71.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 2951.27 MB
whisper_model_load: model size    = 2950.66 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB

system_info: n_threads = 7 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

main: processing 'tr11.mp4.wav' (23131162 samples, 1445.7 sec), 7 threads, 1 processors, lang = tr, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:02.000]   [Jenerik]
[00:00:02.000 --> 00:00:03.000]   Sol
[00:00:03.000 --> 00:00:06.000]   Toprak
[00:00:06.000 --> 00:00:08.000]   Ateş
[00:00:08.000 --> 00:00:10.000]   Hava
[00:00:10.000 --> 00:00:14.000]   Geçmişte dört ulus barış ve uyum içinde yaşıyordu.
[00:00:14.000 --> 00:00:18.000]   Sonra ateş ulusunun saldırmasıyla her şey değişti.
[00:00:18.000 --> 00:00:22.000]   Yalnızca dört elemenden üstün olan avatar önüne de geçirebilirdi.
[00:00:22.000 --> 00:00:26.000]   Ama dünyanın ona en çok ihtiyaç duyduğu bir anda ortadan kayboldu.
[00:00:26.000 --> 00:00:28.000]   Aradan yüz yıl geçti.
...etc...

As you can see, every line is output as if it were spoken precisely on a one-second boundary. Is this fixable? What have I done wrong?

Thanks in advance, happy to provide more info...
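
To check the whole transcript rather than only the first few lines, the console output can be scanned for timestamps that land on a whole second. The following is a minimal Python sketch, not part of whisper.cpp: the function name `whole_second_fraction` and the regular expression are illustrative assumptions, and it expects the bracketed console format shown above (the .srt file itself uses a comma before the milliseconds).

```python
import re

# Matches whisper.cpp console lines such as
# "[00:00:02.000 --> 00:00:03.000]   Sol" and captures both millisecond fields.
# Assumption: the bracketed HH:MM:SS.mmm format shown in the sample output.
TIMESTAMP_LINE = re.compile(
    r"\[\d{2}:\d{2}:\d{2}\.(\d{3}) --> \d{2}:\d{2}:\d{2}\.(\d{3})\]"
)


def whole_second_fraction(transcript: str) -> float:
    """Return the fraction of start/end timestamps whose millisecond part is 000."""
    total = 0
    whole = 0
    for line in transcript.splitlines():
        match = TIMESTAMP_LINE.search(line)
        if match is None:
            continue  # skip model-loading log lines and blank lines
        for millis in match.groups():
            total += 1
            if millis == "000":
                whole += 1
    # An input with no timestamps at all yields 0.0 instead of dividing by zero.
    return whole / total if total else 0.0


sample = """\
[00:00:00.000 --> 00:00:02.000]   [Jenerik]
[00:00:02.000 --> 00:00:03.000]   Sol
[00:00:03.000 --> 00:00:06.000]   Toprak
"""

print(whole_second_fraction(sample))
```

A result of 1.0 means every start and end time sits exactly on a one-second boundary, as in the sample above; anything lower means at least some segments carry sub-second timestamps.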

ulatekh · 2024-06-04T21:02:07Z

That's a limitation of the original Whisper model. Derivative projects such as WhisperX apply additional techniques (e.g. forced alignment with a wav2vec 2.0 model) to try to improve timestamp accuracy.
