I'm enjoying using OpenVino/Whisper to generate captions for videos in spoken languages that DaVinci Resolve doesn't support, or when I also need to create an English translation.
While the speech-to-text works very nicely in terms of correctly interpreting the spoken words, I think the handling of the resulting labels could be improved. I was told on the OpenVino GitHub that the issues below are related to Whisper and not OpenVino.
I'm seeing two things that could do with some innovation love:
Case 1: Orphan words at the end of a sentence, resulting in single-word labels. This happens when a sentence is, say, 70 characters long and the maximum segment size is set to 65. See the example below:
In this case it would be better if the sentence were split into "and does it actually have then a significant impact" and "on the quality". Orphan words are quite hard to read because the word "whizzes by" in the video very quickly.
This could be handled with a parameter such as:
end-of-sentence length (words): 3
i.e. the last label of a sentence should contain at least that many words. There will, of course, still be edge cases where one-word labels are created (such as long pauses).
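A minimal sketch of how such a parameter could work, assuming the input is a single sentence of plain text (real captions would also carry timestamps): greedily fill labels up to the maximum length, then pull words back from the previous label until the final label has at least the configured number of words. `split_caption`, `max_chars` and `min_tail_words` are hypothetical names for illustration, not existing Whisper or OpenVino options.

```python
# Hypothetical sketch of orphan-word avoidance for one sentence.
# max_chars / min_tail_words are illustrative, not real Whisper/OpenVino options.

def split_caption(text: str, max_chars: int = 65, min_tail_words: int = 3) -> list[str]:
    words = text.split()
    labels: list[list[str]] = []
    current: list[str] = []
    for word in words:
        # Start a new label when adding this word would exceed max_chars.
        if current and len(" ".join(current + [word])) > max_chars:
            labels.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        labels.append(current)

    # Orphan avoidance: move words from the previous label into a too-short
    # final label, without shrinking the previous label below min_tail_words
    # and without pushing the final label past max_chars.
    if len(labels) >= 2:
        prev, last = labels[-2], labels[-1]
        while (
            len(last) < min_tail_words
            and len(prev) > min_tail_words
            and len(" ".join([prev[-1]] + last)) <= max_chars
        ):
            last.insert(0, prev.pop())
    return [" ".join(label) for label in labels]
```

With the sentence from the example (66 characters), `max_chars=65` and `min_tail_words=3`, this yields "and does it actually have then a significant impact" and "on the quality"; with `min_tail_words=1` it falls back to the orphaned "quality".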
Case 2: New labels started at "in-word" punctuation. See an example below:
Other examples are "it's" being split at the apostrophe and numbers such as "2.2" being split at the decimal point. In all these cases the full word should be kept intact and not split across two labels.
(The screenshot also illustrates a less important variation of case 1: it would make sense to split the label so that the last sentence's two words ("So I") move to the next line/label.)
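One possible approach, assuming the labels are built from Whisper's subword tokens: merge the tokens into whole words first, so a label break can only fall between words. The sketch assumes (as with Whisper's GPT-2-style byte-level BPE) that a token beginning with a space starts a new word and any other token continues the previous one. `merge_tokens` is a hypothetical helper, and the token list below is illustrative, not actual Whisper output.

```python
# Hypothetical sketch: merge subword tokens into whole words before choosing
# label break points. Assumes a token starting with a space begins a new word;
# other tokens (e.g. "'s", ".", "2") are glued onto the previous word.

def merge_tokens(tokens: list[str]) -> list[str]:
    words: list[str] = []
    for token in tokens:
        if token.startswith(" ") or not words:
            words.append(token.strip())
        else:
            words[-1] += token
    return words
```

For example, `merge_tokens([" It", "'s", " version", " 2", ".", "2"])` returns `["It's", "version", "2.2"]`, so neither "it's" nor "2.2" can end up split across two labels.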
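The "So I" variation could be handled in a similar way: when a label ends with fewer than `min_tail_words` words of a new sentence, move those words to the start of the next label. This is only a sketch: it treats any word ending in `.`, `?` or `!` as a sentence end (so abbreviations such as "e.g." would be misdetected) and does not re-check the maximum label length. `move_sentence_starts` and `min_tail_words` are hypothetical names, and the example labels are made up.

```python
# Hypothetical sketch: move a short sentence start at the end of a label
# (e.g. "... quality. So I") to the beginning of the next label.
# Labels are lists of words; min_tail_words is an illustrative parameter.

SENTENCE_END = (".", "?", "!")

def move_sentence_starts(labels: list[list[str]], min_tail_words: int = 3) -> list[list[str]]:
    result = [list(label) for label in labels]
    for i in range(len(result) - 1):
        label = result[i]
        # Positions of sentence-ending words, excluding the label's last word
        # (if the label itself ends a sentence there is nothing to move).
        ends = [j for j, w in enumerate(label[:-1]) if w.endswith(SENTENCE_END)]
        if not ends:
            continue
        tail = label[ends[-1] + 1:]
        if len(tail) < min_tail_words:
            result[i] = label[: ends[-1] + 1]
            result[i + 1] = tail + result[i + 1]
    return result
```

For instance, the labels `["on", "the", "quality.", "So", "I"]` and `["think", "it", "does."]` become `["on", "the", "quality."]` and `["So", "I", "think", "it", "does."]`.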