Would it be possible to train a German model? #82

gitapii · 2024-01-20T19:52:28Z

Hi,

I recently tested this repo as a NuGet package, and it seems to be a very good Paddle OCR solution for .NET. Would it also be possible to train/fine-tune a German model (maybe locally), or to use the inference model from 'PaddlePaddle/PaddleOCR#1048'?

It's quite similar to English, but you have 4 more characters (ä, ö, ü, ß). At the moment, the model recognizes them as (a, o, u) without the dots above. It would be great.

Kind regards,

n0099 · 2024-01-20T20:59:51Z

> It's quite similar to English, but you have 4 more characters (ä, ö, ü, ß).

https://en.wikipedia.org/wiki/List_of_Latin-script_letters
https://en.wikipedia.org/wiki/Template:ISO_15924_script_codes_and_related_Unicode_data
https://knowyourmeme.com/memes/theyre-the-same-picture

In fact, a model can ONLY recognize characters from the dictionary that was fixed at training time, because recognition just outputs a list of indexes into that dictionary. If you pair the model with a dict.txt other than the one used during training, the indexes won't line up and the output will be meaningless characters.
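To make the index mismatch concrete, here is a minimal Python sketch (not PaddleSharp or PaddleOCR code). The two tiny dictionaries and the `decode` helper are made up for illustration, and real CTC decoding also handles a blank token and repeated indexes, which this skips.

```python
def decode(indices, dictionary):
    """Map predicted class indexes to characters (hypothetical helper)."""
    return "".join(dictionary[i] for i in indices)


# Made-up dictionary the model is assumed to have been trained with.
train_dict = ["a", "o", "u", "ä", "ö", "ü", "ß", "G", "r", "e"]
# A different, made-up dictionary of the same length.
other_dict = ["G", "r", "e", "a", "o", "u", "x", "y", "z", "w"]

# Pretend the model read "Größe": it only ever emits indexes.
indices = [train_dict.index(c) for c in "Größe"]

right = decode(indices, train_dict)  # same dictionary as in training
wrong = decode(indices, other_dict)  # mismatched dictionary
print(right, wrong)
```

The same indexes give the intended word only with the training-time dictionary. With any other dictionary they point at unrelated entries.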
https://github.com/PaddlePaddle/PaddleOCR/blob/1bc550064457b9ab7821f92f16ac5629239ae95a/doc/doc_ch/models_list.md?plain=1#L45
They describe the latest v4 model, ch_PP-OCRv4_det, as "[Latest] original ultra-lightweight model, supporting Chinese, English, and multilingual text detection", but some Latin letter variants are missing from the dictionary, so they will never be recognized as they should be:

PaddleSharp/src/Sdcb.PaddleOCR.Models.Online/LocalDictOnlineRecognizationModel.cs

Line 38 in 167e760

    
           public static LocalDictOnlineRecognizationModel ChineseV4 => new("ch_PP-OCRv4_rec", "ppocr_keys_v1.txt", new Uri("https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_rec_infer.tar"), ModelVersion.V4);

ppocr_keys_v1.txt

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/dicts/ppocr_keys_v1.txt

Line 5841 in 167e760

ä

ö does not exist

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/dicts/ppocr_keys_v1.txt

Line 6274 in 167e760

ü

ß does not exist

There is also a v3 model trained with latin_dict.txt, which contains more letter variants:

PaddleSharp/src/Sdcb.PaddleOCR.Models.Online/LocalDictOnlineRecognizationModel.cs

Line 194 in 167e760

    
           public static LocalDictOnlineRecognizationModel LatinV3 => new("latin_PP-OCRv3_rec", "latin_dict.txt", new Uri("https://paddleocr.bj.bcebos.com/PP-OCRv3/multilingual/latin_PP-OCRv3_rec_infer.tar"), ModelVersion.V3);

latin_dict.txt

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/dicts/latin_dict.txt

Line 134 in 167e760

ä

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/dicts/latin_dict.txt

Line 151 in 167e760

ö

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/dicts/latin_dict.txt

Line 156 in 167e760

ü

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/dicts/latin_dict.txt

Line 129 in 167e760

ß
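A quick way to check this kind of coverage yourself is to look up the needed characters in a dictionary's lines. The sketch below is in Python with a hypothetical `missing_chars` helper; the two lists are short stand-ins that mirror the findings above, not the real file contents.

```python
GERMAN_EXTRA = ["ä", "ö", "ü", "ß"]


def missing_chars(dict_lines, required):
    """Return the required characters that have no entry in the dictionary.

    dict_lines is any iterable of lines, one dictionary entry per line.
    Only line endings are stripped, since a space can be a valid entry.
    """
    available = {line.rstrip("\r\n") for line in dict_lines}
    return [c for c in required if c not in available]


# Stand-in excerpts (assumptions), shaped like lines read from a dict file.
ppocr_keys_excerpt = ["a\n", "ä\n", "ü\n", "o\n"]
latin_excerpt = ["a\n", "ä\n", "ö\n", "ü\n", "ß\n"]

print(missing_chars(ppocr_keys_excerpt, GERMAN_EXTRA))
print(missing_chars(latin_excerpt, GERMAN_EXTRA))
```

To run it on a real dictionary, pass an open file instead of a list, for example `open("latin_dict.txt", encoding="utf-8")`.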

If you want to use the oldest v2 model, german_mobile_v2.0_rec_infer from PaddlePaddle/PaddleOCR#1048 (comment), which seems to be trained with german_dict.txt, then you may define:

public static LocalDictOnlineRecognizationModel GermanV2 => new("german_mobile_v2.0_rec_infer", "german_dict.txt", new Uri("https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/german_mobile_v2.0_rec_infer.tar"), ModelVersion.V2);

but this shouldn't work, because all dictionaries copied from PaddleOCR/ppocr/utils/dict for use in Sdcb.PaddleOCR.Models.Local*|Online are bundled in Sdcb.PaddleOCR.Models.Shared as embedded assembly resources, and german_dict.txt is not among them:

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/Sdcb.PaddleOCR.Models.Shared.csproj

Lines 29 to 44 in 167e760

    
    <ItemGroup>
      <EmbeddedResource Include="dicts\arabic_dict.txt" />
      <EmbeddedResource Include="dicts\chinese_cht_dict.txt" />
      <EmbeddedResource Include="dicts\cyrillic_dict.txt" />
      <EmbeddedResource Include="dicts\devanagari_dict.txt" />
      <EmbeddedResource Include="dicts\en_dict.txt" />
      <EmbeddedResource Include="dicts\japan_dict.txt" />
      <EmbeddedResource Include="dicts\ka_dict.txt" />
      <EmbeddedResource Include="dicts\korean_dict.txt" />
      <EmbeddedResource Include="dicts\latin_dict.txt" />
      <EmbeddedResource Include="dicts\ppocr_keys_v1.txt" />
      <EmbeddedResource Include="dicts\table_structure_dict.txt" />
      <EmbeddedResource Include="dicts\table_structure_dict_ch.txt" />
      <EmbeddedResource Include="dicts\ta_dict.txt" />
      <EmbeddedResource Include="dicts\te_dict.txt" />
    </ItemGroup>

These resources are read by

PaddleSharp/src/Sdcb.PaddleOCR.Models.Shared/SharedUtils.cs

Line 21 in 167e760

public static List<string> LoadDicts(string dictName)

which is called from

PaddleSharp/src/Sdcb.PaddleOCR.Models.Online/LocalDictOnlineRecognizationModel.cs

Line 31 in 167e760

    
           return new StreamDictFileRecognizationModel(RootDirectory, SharedUtils.LoadDicts(DictName), Version);

gitapii · 2024-01-20T21:43:18Z

Thank you very much! I overlooked latin_dict. The four characters mentioned are in it and are recognized as well, so it already works when selecting LocalFullModels.LatinV3 as the model. I'm going to optimize it, thx!

juvebogdan · 2024-01-22T20:11:46Z

@gitapii Hi. I want to use German as well. Did you have to fine-tune, or does it work out of the box?

n0099 mentioned this issue Oct 21, 2024

Is it possible to load a recognition model that I trained myself with PaddleOCR? #103

Open
