Hebrew Rashi script support #6

Open
wants to merge 3 commits into
base: main
2 changes: 1 addition & 1 deletion README.md
@@ -1,3 +1,3 @@
# tessdata_contrib

-User contributed (non Google) data repository for Tesseract 4 and 5 (Akkadian, Ancient Greek, Old Persian languages, ...)
+User contributed (non Google) data repository for Tesseract 4 and 5 (Akkadian, Ancient Greek, Old Persian languages, Hebrew Rashi Script ...)
182 changes: 182 additions & 0 deletions heb_rashi/TRAINING.md
@@ -0,0 +1,182 @@


# Creating Tesseract training data

The instructions below were tested on Debian 11 Linux.

```
sudo apt update
sudo apt install tesseract-ocr libtesseract-dev unzip tmux openmpi-bin
mkdir -p ~/tessdata_heb_rashi/langdata/heb
```


## Word list

Generate a [word list](tesseract_4.1.1/langdata/heb/heb.wordlist) of Torah literature vocabulary from [Sefaria's MongoDB dump](https://storage.googleapis.com/sefaria-mongo-backup/dump_small.tar.gz), with words ordered by decreasing frequency.

```
# Download the dump, decompress and import it:

wget https://storage.googleapis.com/sefaria-mongo-backup/dump_small.tar.gz
tar -xzf dump_small.tar.gz
mongorestore dump


# Convert ' (i.e. \x{27}) and \" to geresh/gershayim, replace each run of non-Hebrew characters with a newline, and empty lines that consist only of gershayim:

mongo sefaria --eval "db.getCollection('texts').find({'language': 'he'}).forEach(function(f){print(tojson(f, '', true));})" | perl -Mutf8 -CS -pE 's/\x{27}/׳/g;s/\\"/״/g;s/\P{Hebrew}+/\n/g;s/^״+$//gm' > texts_hebrew_only.txt


# Count occurrences, sort by decreasing frequency, and drop words that occur fewer than 16 times:

sort --buffer-size=1G texts_hebrew_only.txt | uniq -c | sort -gr | awk '{if ($1 > 15) print $2}' > heb.wordlist
```
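
To see what the last pipeline does, here is a minimal sketch on toy input. The helper name `frequent_words` and the sample words are made up for illustration; the real run uses the threshold 15 from the command above.

```shell
# Hypothetical helper: read one word per line on stdin and print the words
# that occur more than $1 times, most frequent first.
frequent_words() {
    sort | uniq -c | sort -gr | awk -v min="$1" '$1 > min { print $2 }'
}

# Toy input: "a" occurs 3 times, "b" twice, "c" once.
printf '%s\n' b a a a b c | frequent_words 1
```

With a threshold of 1 this prints `a` and `b` and drops `c`. Empty input lines are counted like any other line, so they can show up as empty lines in the output.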

Afterwards, manually delete the empty lines at the beginning and the end of the `langdata/heb/heb.wordlist` file.
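
If you prefer not to edit the file by hand, one possible alternative is to filter out blank lines with `sed`. Note that this sketch removes every empty or whitespace-only line, not only those at the start and end; the helper name and the sample file are illustrative.

```shell
# Hypothetical helper: print file $1 without empty or whitespace-only lines.
strip_blank_lines() {
    sed '/^[[:space:]]*$/d' "$1"
}

# Demo on a throwaway file with blank lines at both ends.
demo=$(mktemp)
printf '\n\nword1\nword2\n\n' > "$demo"
strip_blank_lines "$demo"
rm -f "$demo"
```

On the real word list this could be used as `strip_blank_lines heb.wordlist > heb.wordlist.clean`, followed by moving the cleaned file into place.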


## Fonts

The [Rashi fonts](https://drive.google.com/file/d/1Um3yGV7dT_6AEs7DQU_oC5GpHwZlzR9N/view?usp=sharing) used to render the synthetic training images are listed in [tesseract_4.1.1/langdata/heb/okfonts.txt](tesseract_4.1.1/langdata/heb/okfonts.txt). All fonts from `FontsRashi/Working` worked well for training:

```
text2image --list_available_fonts --fonts_dir FontsRashi/Working
0: BenOr Rashi
1: Guttman Rashi
2: Guttman Rashi Bold
3: Mekorot-Rashi Bold
4: Mekorot-Rashi Medium
5: Mekorot-Rashi Medium Italic
6: PFT_Rashi
7: PFT_Rashi Light
8: RashiAmiti
9: Rashy
10: ZWXLDX+RasheeMF-Medium Medium
```


If you have an idea how to fix the fonts in `FontsRashi/NonWorking` (for example with [FontForge](https://fontforge.org/) or a similar tool), please [report](https://gitlab.com/pninim.org/tessdata_heb_rashi/-/issues) it!



## Determine xheights for Rashi fonts

The file [tesseract_4.1.1/langdata/Hebrew.xheights](tesseract_4.1.1/langdata/Hebrew.xheights) was extended with entries for the Rashi fonts using `grctraining` as follows:

```
git clone https://ancientgreekocr.org/grctraining.git
cd grctraining
make tools/xheight
cd tools/
./xheight 'Mekorot-Rashi Medium'
# ... and so on for all fonts listed in okfonts.txt
```
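
Repeating the call for every font can be scripted. The following is a minimal sketch, assuming the font list file holds one font name per line; the function name `xheight_all` and the `XHEIGHT_CMD` override are hypothetical and exist only for illustration.

```shell
# Run the xheight tool once per font name listed in file $1.
# XHEIGHT_CMD is a hypothetical override; it defaults to ./xheight.
xheight_all() {
    while IFS= read -r font || [ -n "$font" ]; do
        [ -n "$font" ] || continue          # skip empty lines
        # Detach stdin so the tool cannot consume the rest of the font list.
        "${XHEIGHT_CMD:-./xheight}" "$font" < /dev/null
    done < "$1"
}
```

Run from the `tools/` directory, a call such as `xheight_all /path/to/okfonts.txt` would then invoke `./xheight` for each listed font in turn.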


## Download remaining necessary files

```
cd ~/tessdata_heb_rashi/langdata

wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Hebrew.unicharset
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.unicharset
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.xheights
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/font_properties

cd heb

wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/desired_characters
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/forbidden_characters
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/heb.numbers
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/heb.punc
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/heb.singles_text
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/heb.unicharambigs
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/heb/heb.unicharset
```


## ScrollView

**Optional step:** download and build ScrollView to visualize the training process. It is not needed when `lstmtraining` is run with `--debug_interval -1` (prints text information on every iteration) or `--debug_interval 0` (prints text information every 100 iterations).

```
# Create ScrollView (optional)

sudo apt install openjdk-11-jdk
git clone https://github.com/tesseract-ocr/tesseract
cd tesseract
./autogen.sh
./configure
cd java
make ScrollView.jar

# Now run lstmtraining from this directory so that it can use ScrollView.jar
```

## Generate training corpus

```
cd /usr/share/tesseract-ocr
sudo rm language-specific.sh tesstrain.sh tesstrain_utils.sh
sudo wget https://raw.githubusercontent.com/tesseract-ocr/tesseract/4.0/src/training/language-specific.sh
sudo wget https://raw.githubusercontent.com/tesseract-ocr/tesseract/4.0/src/training/tesstrain.sh
sudo wget https://raw.githubusercontent.com/tesseract-ocr/tesseract/4.0/src/training/tesstrain_utils.sh
sudo chmod a+x *sh
```



### Create training data

Use all but one font:

```
mkdir -p ~/tessdata_heb_rashi/output/train
/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working --lang heb --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir output/train --fontlist 'BenOr Rashi' 'Guttman Rashi Bold' 'Mekorot-Rashi Bold' 'Mekorot-Rashi Medium' 'Mekorot-Rashi Medium Italic' 'PFT_Rashi' 'PFT_Rashi Light' 'RashiAmiti' 'Rashy' 'ZWXLDX+RasheeMF-Medium Medium'
```

Since the above command may take a long time, it is recommended to run it inside [tmux](https://github.com/tmux/tmux). It may also help to run the command for each font in its own `tmux` session, e.g.: `/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working --lang heb --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir output/train --fontlist 'BenOr Rashi'` and so on.
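
The per-font variant could be scripted roughly as follows. This is only a sketch: the function name, the session naming scheme and the `TMUX_CMD` override are assumptions made for illustration, and the loop lists just two of the fonts. As written it is a dry run that prints the `tmux` invocations.

```shell
# Sketch: start one detached tmux session per font.
# TMUX_CMD is a hypothetical override used for dry runs; it defaults to tmux.
train_font_in_tmux() {
    font="$1"
    # Session name derived from the font name, e.g. "BenOr Rashi" -> "BenOr_Rashi".
    session=$(printf '%s' "$font" | tr -c 'A-Za-z0-9' '_')
    "${TMUX_CMD:-tmux}" new-session -d -s "$session" \
        "/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working --lang heb --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir output/train --fontlist '$font'"
}

# Dry run: print the tmux invocations instead of executing them.
TMUX_CMD=echo
for f in 'BenOr Rashi' 'Guttman Rashi Bold'; do
    train_font_in_tmux "$f"
done
```

Remove the `TMUX_CMD=echo` line to start real sessions. A session ends when its command finishes, so consider redirecting the output to a log file if you want to keep it.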

### Create evaluation data

Use the remaining font for evaluation:

```
mkdir -p ~/tessdata_heb_rashi/output/evaluate
/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working --lang heb --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir output/evaluate --fontlist 'Guttman Rashi'
```

### Train

Check the number on the first line of `output/train/heb/heb.unicharset` (72 here) and use it as the output size in the `--net_spec` argument below (`O1c72`):

```
OMP_THREAD_LIMIT=11 lstmtraining --debug_interval 0 \
--traineddata ~/tessdata_heb_rashi/output/train/heb/heb.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c72]' \
--model_output ~/tessdata_heb_rashi/output/base --learning_rate 20e-4 \
--train_listfile ~/tessdata_heb_rashi/output/train/heb.training_files.txt \
--eval_listfile ~/tessdata_heb_rashi/output/evaluate/heb.training_files.txt \
--max_iterations 50000 &>~/tessdata_heb_rashi/output/basetrain.log
```
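
To avoid copying the number by hand, the network specification can be built from the unicharset file, whose first line holds the number of entries. A minimal sketch; the helper name `net_spec_for` is made up for illustration, and the layer string is the one used above.

```shell
# Hypothetical helper: print the net_spec for unicharset file $1,
# using the entry count from its first line as the output size.
net_spec_for() {
    size=$(head -n 1 "$1" | tr -d '[:space:]')
    printf '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c%s]\n' "$size"
}
```

It could then be passed to training as `--net_spec "$(net_spec_for ~/tessdata_heb_rashi/output/train/heb/heb.unicharset)"`.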


# Test results
```
lstmeval --model ~/tessdata_heb_rashi/output/base_checkpoint \
--traineddata ~/tessdata_heb_rashi/output/train/heb/heb.traineddata \
--eval_listfile ~/tessdata_heb_rashi/output/evaluate/heb.training_files.txt
```

# Convert checkpoint to best (float) traineddata
```
lstmtraining --stop_training --continue_from ~/tessdata_heb_rashi/output/base_checkpoint --traineddata ~/tessdata_heb_rashi/output/train/heb/heb.traineddata --model_output ~/tessdata_heb_rashi/output/heb_rashi.traineddata
```

# Convert checkpoint to fast (integer) traineddata
```
lstmtraining --stop_training --convert_to_int --continue_from ~/tessdata_heb_rashi/output/base_checkpoint --traineddata ~/tessdata_heb_rashi/output/train/heb/heb.traineddata --model_output ~/tessdata_heb_rashi/output/heb_rashi.fast.traineddata
```

Binary file added heb_rashi/best/heb_rashi.traineddata
Binary file not shown.
Binary file added heb_rashi/fast/heb_rashi.traineddata
Binary file not shown.