How to translate subtitle .srt #14

Open
ewwink opened this issue Apr 7, 2024 · 1 comment

ewwink commented Apr 7, 2024

I use this command:

python3 translate.py \
--sentences_path input.srt \
--output_path result.srt \
--source_lang eng_Latn \
--target_lang ind_Latn \
--model_name facebook/nllb-200-distilled-600M \
--precision fp16

with this input.srt:

1
00:00:07,312 --> 00:00:09,993
Hello.

2
00:00:09,994 --> 00:00:11,227
Where are you right now?

3
00:00:11,228 --> 00:00:13,360
Right now I am on my way
to South Dakota.

4
00:00:13,361 --> 00:00:16,093
Gonna do a little camping,
do a little fishing.

5
00:00:16,094 --> 00:00:17,426
Good for you, Colter.

but the resulting result.srt has problems:

  • wrong order
  • empty lines are replaced with (dalam bahasa Inggris)
  • unknown text is appended to the index lines

1
00:00:07,312 --> 00:00:09,993
Hei, apa yang kau lakukan?
(dalam bahasa Inggris) <-- this should be an empty line
2 (satu) <-- the '(satu)' should not exist
00:00:09,994 --> 00:00:11,227
Di mana kau sekarang?
(dalam bahasa Inggris) ....
3 Pemberantasan Korupsi <-- this should not exist either
00:00:11,228 --> 00:00:13,360
Saat ini aku sedang dalam perjalanan
ke Dakota Selatan.
(dalam bahasa Inggris) ...
4
00:00:13,361 --> 00:00:16,093
Akan pergi berkemah sedikit,
lakukan sedikit memancing.
(dalam bahasa Inggris) ...
5
00:00:16,094 --> 00:00:17,426
Bagus untukmu, Colter.
(dalam bahasa Inggris) ...

stt commented May 5, 2024

I had a similar need. The issue of course boils down to EasyTranslate requiring every line in the input file to be translatable.

The attached patch makes it so that a line containing only numbers and/or non-alphabetical characters is not translated; it is set aside and then printed back out during the output phase. There may be a cleaner way, but it appears that whatever is added to the PyTorch Dataset structure has to be compatible with accelerator.prepare(), so as a workaround a collate_fn wrapper separates out any non-tokenized items.
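
For readers who only want the idea, here is a minimal standalone sketch of the set-aside-and-reinsert step. It is not the code from the patch (which works inside the Dataset and collate_fn); the function names are made up for illustration, and "translatable" is assumed to mean "contains at least one letter".

```python
from typing import List, Optional, Tuple


def is_translatable(line: str) -> bool:
    # Assumption: a line is worth translating if it has at least one letter.
    # str.isalpha() is Unicode-aware, so non-Latin scripts count as letters.
    return any(ch.isalpha() for ch in line)


def split_lines(lines: List[str]) -> Tuple[List[str], List[Optional[str]]]:
    """Separate translatable text from lines that must pass through unchanged.

    Returns (texts, slots). `slots` has one entry per input line: None where
    a translated line goes, or the original line kept verbatim (cue numbers,
    timestamps, blank lines).
    """
    texts: List[str] = []
    slots: List[Optional[str]] = []
    for line in lines:
        if is_translatable(line):
            texts.append(line)
            slots.append(None)
        else:
            slots.append(line)
    return texts, slots


def merge_lines(slots: List[Optional[str]], translated: List[str]) -> List[str]:
    """Put translated lines back into their original positions."""
    if sum(s is None for s in slots) != len(translated):
        raise ValueError("translated line count does not match the input")
    it = iter(translated)
    return [next(it) if s is None else s for s in slots]
```

Only `texts` would be handed to the translator; `merge_lines` then rebuilds the file. This assumes the translator returns exactly one output line per input line, in the same order.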

IMO, optimally the project could be reworked so that it is easier to call iteratively from a separate utility while parsing a file. As a smaller change, a parameter could be added to translate.py that specifies a regex selecting which lines to translate. Either way, I didn't have the motivation to attempt a cleaner solution, so I didn't open a PR. I do use this to translate SRT files, though, so maybe it helps you.
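
To make the regex idea concrete, a rough sketch of what such a parameter could look like. The `--translate_regex` flag is hypothetical: translate.py does not define it, and the helper names are made up for illustration.

```python
import argparse
import re
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Line-selection sketch")
    # Hypothetical option, NOT an existing translate.py flag.
    parser.add_argument(
        "--translate_regex",
        default=r"[^\W\d_]",  # matches any line with at least one letter
        help="Only lines matching this regex are sent to the model.",
    )
    return parser


def select_lines(lines: List[str], pattern: str) -> List[int]:
    """Return the indices of the lines the regex selects for translation."""
    rx = re.compile(pattern)
    return [i for i, line in enumerate(lines) if rx.search(line)]


def apply_translations(
    lines: List[str], indices: List[int], translated: List[str]
) -> List[str]:
    """Overwrite the selected lines with their translations, keep the rest."""
    if len(indices) != len(translated):
        raise ValueError("translated line count does not match the selection")
    out = list(lines)
    for i, text in zip(indices, translated):
        out[i] = text
    return out
```

With the default pattern, cue numbers, timestamps and blank lines in an SRT file are never selected, so they reach the output untouched.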

EasyTranslate_retain-nontext.patch.txt

(put in the code directory and run patch -p1 <EasyTranslate_retain-nontext.patch.txt)
