transcribe : Initial prompt parameter and Text extraction time #439
-
I am combining two questions here. The code I am using looks roughly like this:

```python
import time
import concurrent.futures

from faster_whisper import WhisperModel

model_size = "large-v2"

# Define the initial prompt (comma-separated domain terms)
initial_prompt = "..."

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

start_time = time.time()
segments, info = model.transcribe("audio.mp3", initial_prompt=initial_prompt)
elapsed_time = time.time() - start_time
```

**Initial prompt parameter**

a) I am passing around 3,500 domain-specific terms as comma-separated text to the initial prompt. It does not work.
b) If I pass only a few terms it works (for example "Injection, perineal, Solensia"). How do I pass all of these terms as an initial prompt to reduce transcription mistakes?

**Execution time**

The code above runs in 7 seconds for a 10-minute audio file on a single NVIDIA Tesla V100 GPU. However, extracting the text from the segments object takes around 40 seconds:

```python
lst_segments = list(segments)

def process_segment(segment):
    # (cleanup logic elided in the original post)
    return segment.text, segment

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(process_segment, lst_segments)

output_text = ''.join(text for text, _ in results)
elapsed_time = time.time() - start_time
```

How can I improve the data extraction? Does anyone have a different approach? I am also removing the "uh"s and "um"s; that part does not take long.

Please let me know some pointers.
-
The prompt cannot go beyond 448 tokens: https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py#L700
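Given that cap, one workaround is to keep only as many terms as roughly fit the token budget. A minimal sketch, assuming the terms are comma-separated and using a crude ~4-characters-per-token estimate (for an exact count you would tokenize the prompt with the model's tokenizer instead):

```python
def truncate_terms(terms, max_tokens=448, chars_per_token=4):
    """Keep a prefix of the term list that roughly fits the token budget.

    Uses a rough ~4 characters per token heuristic, so it deliberately
    undershoots rather than risk exceeding the 448-token limit.
    """
    budget = max_tokens * chars_per_token
    kept, used = [], 0
    for term in terms:
        cost = len(term) + 2  # account for the ", " separator
        if used + cost > budget:
            break
        kept.append(term)
        used += cost
    return ", ".join(kept)

terms = ["Injection", "perineal", "Solensia"]  # ... up to 3,500 terms
prompt = truncate_terms(terms)
```

Since only a fraction of 3,500 terms can ever fit, it usually pays to sort the list so the most error-prone terms come first.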
-
To remove stop words, use something like:
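The snippet this reply referred to was not captured in this export; a minimal sketch for stripping the filler words mentioned in the question ("uh", "um") with a regex pass:

```python
import re

# Matches standalone fillers (uh, uhh, um, umm, hm, hmm...) plus an
# optional trailing comma/period and any whitespace after them.
FILLERS = re.compile(r"\b(?:uh+|um+|hm+)\b[,.]?\s*", flags=re.IGNORECASE)

def remove_fillers(text: str) -> str:
    cleaned = FILLERS.sub("", text)
    # Collapse any doubled spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(remove_fillers("Uh, so the, um, perineal injection went fine."))
# → "so the, perineal injection went fine."
```

A regex pass like this is effectively free compared to the transcription itself, which matches the observation in the question that the filler-removal step is fast.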
-
Read the README again and see the warning about the returned segments being a generator. |
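This is the key point for the timing question: `transcribe()` returns a generator, so the call itself does almost no work and the actual decoding happens while the segments are consumed. The behavior can be demonstrated with plain Python (the `slow_segments` generator below is a stand-in for faster-whisper's segment generator, not its real code):

```python
import time

def slow_segments(n, delay=0.05):
    """Stand-in for a lazy segment generator: each item is produced
    only when the consumer asks for it."""
    for i in range(n):
        time.sleep(delay)  # simulates decoding one segment
        yield f"segment {i} "

start = time.time()
segments = slow_segments(10)
creation_time = time.time() - start   # near zero: nothing decoded yet

start = time.time()
text = "".join(segments)              # all the work happens here
consumption_time = time.time() - start
```

So in the question's code, the 7 seconds is not the transcription and the 40 seconds is not "extraction": the transcription itself runs during `list(segments)`. The thread pool adds nothing, because the generator is consumed sequentially either way; a single pass such as `"".join(s.text for s in segments)` is the simplest correct way to collect the text.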