Multilingual Model Training Project #7252

Evezerest · 2022-08-18T13:12:42Z

Evezerest
Aug 18, 2022
Collaborator

We write a notebook tutorial about how to train a multilingual language model with the synthesis data. It is very useful if you do not have much real data and the official model is not satisfied in your scenario. You can follow this tutorial step by step to fine-tune a new model. Also, when your characters are out of the official dictionary, training a new model is quite necessary too. You can find a lot of corpus and dictionaries in the pinned issue Multilingual OCR Development Plan from the community.

For the released 80 PP-OCR multilingual models, if there is a specific or typical problem with the language model (eg: the missing space problem of the previous English, Spanish model), we hope you open a new discussion on GitHub, so we can track the issue continuously, identify problems and then train a new version of the language model.

For partial users who want to recognize multiple languages at the same time, the current solution is to use the language classification model PULC in PaddleClas to classify the language, and then send it into the corresponding OCR model. We'll release a new tutorial later. Meanwhile, we set up a poll to record the multilingual languages needs, hope that everyone can vote on the languages that they want to recognize at the same time, and the official will train a single model to support multilingual recognition according to the feedback.

Any questions or insights about how to train your own language model are welcome under this discussion, we believe with the help of the community members and the official team, everyone is able to train a deep learning model even if they don‘t know much about DL theory.

shariqhameedca · 2023-08-09T05:36:48Z

shariqhameedca
Aug 9, 2023

Hi, Evezerest. I am trying to finetune the Arabic recognition model for Urdu language. Can you please give a general guideline and the things I have to be careful for. Also, can the notebook workflow be used for this task or do I have to make some changes because of the fact that Urdu is an RTL language?

0 replies

hritikakolkar · 2023-09-10T07:01:18Z

hritikakolkar
Sep 10, 2023

I am trying to train a Arabic Model from scratch, on text generated from text_renderer toolkit but the results are not improving how much data should I feed. I also tried to finetune but the result are becoming worst. Please can someone guide me

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multilingual Model Training Project #7252

{{title}}

Replies: 2 comments

{{title}}

{{title}}

Select a reply

Multilingual Model Training Project #7252

Evezerest Aug 18, 2022 Collaborator

Replies: 2 comments

shariqhameedca Aug 9, 2023

hritikakolkar Sep 10, 2023

Evezerest
Aug 18, 2022
Collaborator

shariqhameedca
Aug 9, 2023

hritikakolkar
Sep 10, 2023