README.md
__init__.py
generator.py
renderer.py
run_generate.py
scan_fonts.py
utils.py

README.md

Synthetic data generator

Generates synthetic image-text pairs that imitate Japanese manga, for training OCR models.

Features:

- uses either text from a corpus or random text
- text overlaid on background images
- drawing of text bubbles
- various fonts and font styles
- variety of text layouts:
  - vertical and horizontal text
  - multi-line text
  - furigana (added randomly)
  - tate chū yoko (short horizontal runs set inside vertical text)
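As an illustration of how these layouts can be expressed for a browser engine, here is a minimal sketch using standard HTML and CSS: `writing-mode` for vertical text, `<ruby>`/`<rt>` for furigana, and `text-combine-upright` for tate chū yoko. The helper names (`ruby`, `tate_chu_yoko`, `build_css`, `build_body`) and the font name are assumptions made for this example; they are not taken from `renderer.py`.

```python
from html import escape
from typing import Sequence


def ruby(base: str, reading: str) -> str:
    """Wrap `base` in a <ruby> element so that `reading` is shown as furigana."""
    return f"<ruby>{escape(base)}<rt>{escape(reading)}</rt></ruby>"


def tate_chu_yoko(text: str) -> str:
    """Mark a short run (e.g. digits) to be set horizontally inside a vertical line."""
    return f'<span class="tcy">{escape(text)}</span>'


def build_css(vertical: bool, font_family: str, font_size_px: int) -> str:
    """Return a stylesheet selecting the writing direction, font and size."""
    writing_mode = "vertical-rl" if vertical else "horizontal-tb"
    return (
        f"body {{ writing-mode: {writing_mode}; font-family: '{font_family}'; "
        f"font-size: {font_size_px}px; }}\n"
        ".tcy { text-combine-upright: all; }"
    )


def build_body(lines: Sequence[str]) -> str:
    """Join already-formatted HTML fragments into multi-line text."""
    return "<br>".join(lines)


# Hypothetical sample: two lines, one with furigana, one with tate chū yoko.
lines = [
    ruby("漢字", "かんじ") + "を読む",
    "全" + tate_chu_yoko("12") + "巻",
]
html = build_body(lines)
css = build_css(vertical=True, font_family="Noto Sans JP", font_size_px=32)
```

Switching `vertical` to `False` changes only the stylesheet; the markup stays the same, which is the main convenience of leaving typesetting to the browser.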

Text is rendered with html2image, a wrapper around the headless mode of the Chrome/Chromium browser. This is not an elegant solution, and it is very slow, but generation only needs to run once, and with parallelization the processing time is manageable (~17 min per 10,000 images on a 16-thread machine).
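To put the quoted figure in perspective: 17 min is 1020 s, so 10,000 images works out to roughly 0.1 s per image of wall-clock time, or about 1.6 s per image per worker if all 16 threads are kept busy. The sketch below shows one way such a fan-out could look. It is an assumption for illustration, not the code in `run_generate.py`: `render_all` and `seconds_per_image` are hypothetical names, and a thread pool is used here only because each render would mostly wait on an external browser process (a process pool is the other obvious choice).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def render_all(
    texts: Sequence[str],
    render_one: Callable[[str], str],
    workers: int = 16,
) -> List[str]:
    """Apply `render_one` to every text using a pool of workers, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_one, texts))


def seconds_per_image(total_minutes: float, n_images: int) -> float:
    """Average wall-clock seconds spent per generated image."""
    return total_minutes * 60 / n_images


# Figures quoted above: ~17 min per 10,000 images on a 16-thread machine.
wall_clock = seconds_per_image(17, 10_000)
per_worker = wall_clock * 16  # assumes all 16 workers are fully utilized
```

In practice `render_one` would be whatever function renders a single sample; any callable that takes a text and returns a result fits this shape.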

The upside of this approach is that the fairly complex problem of typesetting and text rendering (especially when handling both horizontal and vertical text) is offloaded to the browser engine, which keeps the codebase relatively simple and extensible.
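A minimal sketch of what handing a snippet to the browser through html2image can look like. It uses html2image's `Html2Image(output_path=...)` constructor and its `screenshot(html_str=..., css_str=..., save_as=..., size=...)` method; everything else (the function names, the default size, the file name) is an assumption for this example rather than the contents of `renderer.py`. Running `render_to_png` requires html2image and a Chrome/Chromium installation.

```python
from html import escape
from pathlib import Path
from typing import Tuple


def build_page(text: str, vertical: bool = True, font_size_px: int = 32) -> Tuple[str, str]:
    """Return an (html, css) pair for a single piece of text."""
    writing_mode = "vertical-rl" if vertical else "horizontal-tb"
    html = f"<p>{escape(text)}</p>"
    css = f"body {{ writing-mode: {writing_mode}; font-size: {font_size_px}px; }}"
    return html, css


def render_to_png(
    text: str,
    out_dir: str,
    file_name: str = "sample.png",  # hypothetical default
    size: Tuple[int, int] = (400, 400),  # (width, height) in pixels, hypothetical default
) -> Path:
    """Render `text` to a PNG in `out_dir` via headless Chrome/Chromium."""
    # Imported lazily so that build_page() stays usable without html2image installed.
    from html2image import Html2Image

    hti = Html2Image(output_path=out_dir)
    html, css = build_page(text)
    hti.screenshot(html_str=html, css_str=css, save_as=file_name, size=size)
    return Path(out_dir) / file_name
```

All layout decisions live in the HTML and CSS strings, so the Python side only has to choose parameters and collect the resulting image.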

The high-level generation pipeline is as follows:

1. Preprocess the text (truncate and/or split into lines, add random furigana).
2. Render the text on a transparent background, using the HTML engine.
3. Select a background image from the backgrounds dataset.
4. Overlay the text on the background, optionally drawing a bubble around the text.
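The steps above can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the code in `generator.py`: the function names, the line-length limits and the 50% bubble probability are all hypothetical, furigana insertion and browser rendering (step 2) are left out, and the overlay is shown as the standard alpha "over" blend for a single pixel.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_into_lines(text: str, max_line_len: int, max_lines: int) -> List[str]:
    """Step 1: truncate `text` to fit, then split it into fixed-length lines."""
    text = text[: max_line_len * max_lines]
    return [text[i : i + max_line_len] for i in range(0, len(text), max_line_len)]


def composite_over(
    fg: Tuple[int, int, int, int], bg: Tuple[int, int, int]
) -> Tuple[int, ...]:
    """Step 4: blend one RGBA text pixel over an opaque RGB background pixel."""
    r, g, b, a = fg
    alpha = a / 255
    return tuple(round(alpha * f + (1 - alpha) * c) for f, c in zip((r, g, b), bg))


def plan_sample(
    text: str,
    backgrounds: Sequence[str],
    rng: random.Random,
    max_line_len: int = 10,  # hypothetical limit
    max_lines: int = 3,  # hypothetical limit
) -> Dict[str, object]:
    """Collect the random choices for one sample (rendering itself is omitted)."""
    lines = split_into_lines(text, max_line_len, max_lines)  # step 1
    background = rng.choice(backgrounds)  # step 3
    draw_bubble = rng.random() < 0.5  # step 4, assumed 50% chance of a bubble
    return {"lines": lines, "background": background, "draw_bubble": draw_bubble}


# Hypothetical usage with a seeded generator for reproducibility.
sample = plan_sample("abcdefghijkl", ["bg_001.jpg"], random.Random(0), max_line_len=5, max_lines=2)
```

A fully transparent text pixel leaves the background unchanged and a fully opaque one replaces it, which is why the text has to be rendered on a transparent background in step 2.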

Examples

Images generated with text from CC-100 Japanese corpus

Images generated with random text
