assets/                        # assets (see description below)
manga_ocr/                     # release code (inference only)
manga_ocr_dev/                 # development code
    env.py                     # global constants
    data/                      # data preprocessing
    synthetic_data_generator/  # generation of synthetic image-text pairs
    training/                  # model training
List of fonts with metadata, used by the synthetic data generator. CSV with columns:
- font_path: path to the font file, relative to FONTS_ROOT
- supported_chars: string of characters supported by this font
- num_chars: number of supported characters
- label: common/regular/special (used to sample regular fonts more often than special ones)

The provided file is just an example; you have to generate a similar file for your own set of fonts, using the manga_ocr_dev/synthetic_data_generator/scan_fonts.py script. Note that label will be filled in as regular by default; you have to label your special fonts manually.
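The label column drives how often each font is drawn during generation. A minimal sketch of weighted sampling, with made-up rows and weights (the actual metadata file comes from scan_fonts.py):

```python
import random

# Hypothetical rows as they might appear in the fonts metadata CSV;
# the real file is produced by scan_fonts.py.
fonts = [
    {"font_path": "fonts/regular_a.ttf", "label": "regular"},
    {"font_path": "fonts/regular_b.ttf", "label": "regular"},
    {"font_path": "fonts/fancy.ttf", "label": "special"},
]

# Assumed sampling weights: regular fonts are drawn far more often.
label_weights = {"common": 1.0, "regular": 1.0, "special": 0.1}

def sample_font():
    weights = [label_weights[f["label"]] for f in fonts]
    return random.choices(fonts, weights=weights, k=1)[0]

random.seed(0)
counts = {"regular": 0, "special": 0}
for _ in range(1000):
    counts[sample_font()["label"]] += 1
```

The exact weights are a design choice; anything that keeps special fonts rare but present works.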
Example of a CSV used for synthetic data generation. CSV with columns:
- source: source of the text
- id: unique id of the line
- line: line from a language corpus
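Such a file can be written with the standard csv module. A sketch with placeholder corpus lines (the source name and ids below are invented for illustration):

```python
import csv
import io

# Placeholder rows; in practice these come from your language corpus.
rows = [
    {"source": "my_corpus", "id": "my_corpus_0000001", "line": "これはテストです"},
    {"source": "my_corpus", "id": "my_corpus_0000002", "line": "漫画のセリフの例"},
]

buf = io.StringIO()  # stands in for e.g. lines/0000.csv
writer = csv.DictWriter(buf, fieldnames=["source", "id", "line"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```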
CSV with columns:
- len: length of the text
- p: probability of text of this length occurring in manga

Used by the synthetic data generator to roughly match the natural distribution of text lengths. Computed from the Manga109-s dataset.
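A table like this can be derived from any sample of text lengths, then used to draw a target length during generation. A sketch with made-up lengths:

```python
import random
from collections import Counter

# Hypothetical text lengths gathered from a corpus (placeholder data).
lengths = [1, 2, 2, 3, 3, 3, 5, 5]

# Normalize counts into a len -> p mapping (the len_to_p table).
counts = Counter(lengths)
total = sum(counts.values())
len_to_p = {length: n / total for length, n in sorted(counts.items())}

# During generation, sample a target text length from the distribution.
random.seed(0)
target_len = random.choices(list(len_to_p), weights=len_to_p.values(), k=1)[0]
```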
List of all characters supported by the tokenizer.
env.py contains global constants used across the repo. Set your paths to data etc. there.
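A minimal env.py might look like the following; the paths are examples to replace with your own locations (only the constant names mentioned in this document are used here):

```python
from pathlib import Path

# Example locations -- point these at your own data.
MANGA109_ROOT = Path("/data/manga109")
FONTS_ROOT = Path("/data/fonts")
DATA_SYNTHETIC_ROOT = Path("/data/synthetic")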
- Download the Manga109-s dataset.
- Set MANGA109_ROOT so that your directory structure looks like this:

  <MANGA109_ROOT>/
      Manga109s_released_2021_02_28/
          annotations/
          annotations.v2018.05.31/
          images/
          books.txt
          readme.txt

- Preprocess Manga109-s with data/process_manga109s.py.
- Optionally generate synthetic data (see below).
- Train with manga_ocr_dev/training/train.py.
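The preprocessing step essentially pairs cropped text regions with their transcriptions. A rough sketch of reading one Manga109-s annotation file; the XML snippet is a stand-in, and the exact schema should be checked against the dataset's own documentation:

```python
import xml.etree.ElementTree as ET

# Stand-in for one annotation file; the real schema is documented
# with the Manga109-s dataset itself.
xml_text = """
<book title="ExampleBook">
  <pages>
    <page index="0" width="1654" height="1170">
      <text id="t1" xmin="10" ymin="20" xmax="110" ymax="60">こんにちは</text>
    </page>
  </pages>
</book>
"""

root = ET.fromstring(xml_text)
samples = []
for page in root.iter("page"):
    for text in page.iter("text"):
        # Bounding box of the text region, to be cropped from the page image.
        box = tuple(int(text.get(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        samples.append({"page": int(page.get("index")), "box": box, "text": text.text})
```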
Generated data is split into packages (named 0000, 0001, etc.) for easier management of a large dataset. Each package is assumed to have a similar data distribution, so that a properly balanced dataset can be built from any subset of packages.
The data generation pipeline assumes the following directory structure:

<DATA_SYNTHETIC_ROOT>/
    img/                # generated images (output from generation pipeline)
        0000/
        0001/
        ...
    lines/              # lines from corpus (input to generation pipeline)
        0000.csv
        0001.csv
        ...
    meta/               # metadata (output from generation pipeline)
        0000.csv
        0001.csv
        ...
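Setting up the directories for a new package can be sketched as follows; init_package is a hypothetical helper, and the zero-padded naming follows the package convention above:

```python
import tempfile
from pathlib import Path

# Stand-in for DATA_SYNTHETIC_ROOT from env.py.
data_synthetic_root = Path(tempfile.mkdtemp())

def init_package(root: Path, package_id: int) -> str:
    """Create the per-package directories under the synthetic data root."""
    name = f"{package_id:04d}"  # 0000, 0001, ...
    (root / "img" / name).mkdir(parents=True, exist_ok=True)
    (root / "lines").mkdir(exist_ok=True)  # holds <name>.csv input files
    (root / "meta").mkdir(exist_ok=True)   # holds <name>.csv output metadata
    return name

name = init_package(data_synthetic_root, 1)
```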
To use a language corpus for data generation, lines/*.csv files must be provided. For a small example of such a file, see assets/lines_example.csv.
To generate synthetic data:
- Generate backgrounds with data/generate_backgrounds.py.
- Put your fonts in <FONTS_ROOT>.
- Generate fonts metadata with synthetic_data_generator/scan_fonts.py.
- Optionally, manually label your fonts with common/regular/special labels.
- Provide <DATA_SYNTHETIC_ROOT>/lines/*.csv.
- Run synthetic_data_generator/run_generate.py for each package.
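The last step can be scripted over all packages. A sketch that only builds the commands; the --package flag is a guess at the script's CLI, so check run_generate.py for its actual arguments:

```python
# One command per package; --package is hypothetical -- check
# run_generate.py for the real interface before running these.
n_packages = 3
commands = [
    ["python", "manga_ocr_dev/synthetic_data_generator/run_generate.py",
     "--package", f"{i:04d}"]
    for i in range(n_packages)
]
```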