End-to-End Scene Text Detection and Recognition System Resources

Author: Canjie Luo, Chongyu Liu

1.Datasets
- 1.1 Introduction
- 1.2 Comparison of Datasets
2. Summary of End-to-end Scene Text Detection and Recognition Methods
- 2.1 Comparison of methods
- 2.2 End-to-end scene text detection and recognition results
3. Survey
4. OCR Service
5. References and codes

1. Datasets

1.1 Introduction

SVT [15]：
- Introduction: There are 100 training images and 250 testing images downloaded from Google Street View of road-side scenes. The labelled text can be very challenging with a wide variety of fonts, orientations, and lighting conditions. A lexicon containing 50 words (SVT-50) is also provided for each image.
- Link: SVT-download
ICDAR 2003(IC03) [16]：
- Introduction: The dataset contains a varied array of photos of the world that contain scene text. There are 251 testing images with 50 word lexicons (IC03-50) and a lexicon of all test groundtruth words (IC03-Full).
- Link: IC03-download
ICDAR 2011(IC11) [17] :
- Introduction: The dataset is an extension to the dataset used for the text locating competitions of ICDAR 2003.It includes 485 natural images in total.
- Link: IC11-download
ICDAR 2013(IC13) [18]：
- Introduction: The dataset consists of 229 training images and 233 testing images. Most text are horizontal. Three speciﬁc lexicons are provided, named as “Strong(S)”, “Weak(W)” and “Generic(G)”. “Strong(S)” lexicon provides 100 words per-image including all words that appear in the image. “Weak(W)” lexicon includes all words that appear in the entire test set. And “Generic(G)” lexicon is a 90k word vocabulary.
- Link: IC13-download
ICDAR 2015(IC15) [19]：
- Introduction: The dataset includes 1000 training images and 500 testing images captured by Google glasses. The text in the scene is in arbitrary orientations. Similar to ICDAR 2013, it also provides “Strong(S)”, “Weak(W)” and “Generic(G)” lexicons.
- Link: IC15-download
Total-Text [20]：
- Introduction: Except for the horizontal text and oriented text, Total-Text also consists of a lot of curved text. Total-Text contains 1255 training images and 300 test images. All images are annotated with polygons and transcriptions in word-level. A “Full” lexicon contains all words in test set is provided.
- Link: Total-Text-download

1.2 Comparison of Datasets

Comparison of Datasets

Datasets

Language

Image

Text instance

Text Shape

Annotation level

Lexicon

Total

Train

Test

Total

Train

Test

Horizontal

Arbitrary-Quadrilateral

Multi-oriented

Char

Word

Text-Line

50

1k

Full

None

IC03

English

509

258

251

2266

1110

1156

✓

✕

✓

✕

✓

✕

IC11

English

484

229

255

1564

～

✓

✕

✓

✕

✓

IC13

English

462

229

233

1944

849

1095

✓

✕

✓

✕

✓

SVT

English

350

100

250

725

211

514

✓

✕

✓

✕

✓

✕

SVT-P

English

238

～

639

～

✓

✕

✓

✕

✓

✕

✓

✕

IC15

English

1500

1000

500

17548

122318

5230

✓

✕

✓

✕

✓

Total-Text

English

1525

1225

300

9330

～

✓

✕

✓

✕

✓

2. Summary of End-to-end Scene Text Detection and Recognition Methods

2.1 Comparison of methods

Method	Model	Code	Detection	Recognition	Source	Time	Highlight
Wang et al. [1]		✕	Sliding windows and Random Ferns	Pictorial Structures	ICCV	2011	Word Re-scoring for NMS
Wang et al. [2]		✕	CNN-based	Sliding windows for classification	ICPR	2012	CNN architecture
Jaderberg et al. [3]		✕	CNN-based and saliency maps	CNN classifier	ECCV	2014	Data mining and annotation
Alsharif et al. [4]		✕	CNN and hybrid HMM maxout models	Segmentation-based	ICLR	2014	Hybrid HMM maxout models
Yao et al. [5]		✕	Random Forest	Component Linking and Word Partition	TIP	2014	(1) Detection and recognition features sharing. (2) Oriented-text. (3) A new dictionary search method
Neumann et al. [6]		✕	Extremal Regions	Clustering algorithm to group characters	TPAMI	2015	Real-time performance(1.6s/image)
Jaderberg et al. [7]		✕	Region proposal mechanism	Word-level classification	IJCV	2016	Trained only on data produced by a synthetic text generation engine, requiring no human labelled data
Liao et al. [8]	TextBoxes	✓	SSD-based framework	CRNN	AAAI	2017	An end-to-end trainable fast scene text detector
Bŭsta et al. [9]	Deep TextSpotter	✕	Yolo v2	CTC	ICCV	2017	Yolov2 + RPN, RNN + CTC. It is the first end-to-end trainable detection and recognition system with high speed.
Li et al. [10]		✕	Text Proposal Network	Attention	ICCV	2017	TPN + RNN encoder + attention-based RNN
Sun et al. [22]	TextNet	✕	Scale-aware attention backbone and Perspective RoI Transform	Attention	ACCV	2018	Perspective RoI Transform for Irregular text recognition
Lyu et al. [11]	Mask TextSpotter	✓	Fast R-CNN with mask branch	Character segmentation	ECCV	2018	Precise text detection and recognition are acquired via semantic segmentation
He et al. [12]		✓	Text-Alignment Layer	Attention	CVPR	2018	Character attention mechanism: use character spatial information as explicit supervision
Liu et al. [13]	FOTS	✓	EAST with RoIRotate	CTC	CVPR	2018	Little computation overhead compared to baseline text detection network (22.6fps)
Liao et al. [14]	TextBoxes++	✓	SSD-based framework	CRNN	TIP	2018	Journal version of TextBoxes (multi-oriented scene text support)
Liao et al. [15]	Mask TextSpotter	✓	Mask R-CNN	Character segmentation + Spatial Attention Module	TPAMI	2019	Journal version of Mask TextSpotter(proposes Spatial Attention Module)
Xing et al. [23]	CharNet	✓	A character branch and a detection branch	Character level	ICCV	2019	Utilizing a character as basic element to overcome the main difficulty of joint optimization of text detection and RNN-based recognition
Feng et al. [24]	TextDragon	✕	Local box regression, center line segmentation and RoI Sliding	CTC	ICCV	2019	A new differentiable operator named RoISlide connect arbitrary shaped text detection and recognition
Qin et al. [25]		✕	Mask R-CNN with RoI masking	Attention	ICCV	2019	A simple yet effective RoI masking step to extract useful irregularly shaped text instance features
Qiao et al. [26]	Text Perceptron	✕	Mask R-CNN with Order-aware Semantic Segmentation and Boundary Regressions	Attention	AAAI	2020	A novel Shape Transform Module to transform the feature regions into regular morphologies
Wang et al. [27]		✕	Oriented Rectangular Box Detector and Boundary Point Detector	Attention	AAAI	2020	A set of points on the boundary of each text instance represents arbitrary shapes
Liu et al. [28]	ABCNet	✓	Bezier Curve Detection and BezierAlign	CTC	CVPR	2020	10 times faster than re-cent state-of-the-art methods with a competitive scene text spotting accuracy

2.2 End-to-end scene text detection and recognition results

Method

Model

Source

Time

SVT

SVT-50

IC03

IC11

IC13

IC15

Total-text

End-to-end

Spotting

End-to-end

Spotting

None

Full

None

Full

50

Full

None

S

W

G

S

W

G

S

W

G

S

W

G

Wang et al. [1]

ICCV

2011

~

51

~

Wang et al. [2]

ICPR

2012

46

~

72

67

~

Jaderberg et al. [3]

ECCV

2014

~

56

80

75

~

Alsharif et al. [4]

ICLR

2014

~

48

77

70

~

Yao et al. [5]

TIP

2014

~

48.6

~

Neumann et al. [6]

TPAMI

2015

68.1

~

45.2

~

35

19.9

15.6

35

19.9

15.6

~

Jaderberg et al. [7]

IJCV

2016

53

76

90

86

78

76

~

Liao et al. [8]

TextBoxes

AAAI

2017

64

84

~

87

91

89

84

94

92

87

~

36.3

48.9

~

Bŭsta et al. [9]

Deep TextSpotter

ICCV

2017

~

89

86

77

92

89

81

54

51

47

58

53

51

~

21.85

Li et al. [10]

ICCV

2017

66.18

84.91

~

87.7

~

91.08

89.8

84.6

94.2

92.4

88.2

~

Sun et al. [22]

TextNet

ACCV

2018

~

89.77

88.80

82.96

94.59

93.48

86.99

78.66

74.9

60.45

82.38

78.43

62.36

54.02

~

Lyu et al. [11]

Mask TextSpotter

ECCV

2018

~

92.2

91.1

86.5

92.5

92

88.2

79.3

73

62.4

79.3

74.5

64.2

52.9

71.8

~

He et al. [12]

CVPR

2018

~

91

89

86

93

92

87

82

77

63

85

80

65

~

Liu et al. [13]

FOTS

CVPR

2018

~

91.99

90.11

84.77

95.94

93.9

87.76

83.55

79.11

65.33

87.01

82.39

67.97

~

Liao et al. [14]

TextBoxes++

TIP

2018

64

84

~

93

92

85

96

95

87

73.3

65.9

51.9

76.5

69

54.4

~

Liao et al. [15]

Mask TextSpotter

TPAMI

2019

~

93.3

91.3

88.2

92.7

91.7

87.7

83

77.7

73.5

82.4

78.1

73.6

65.3

77.4

~

Xing et al. [23]

CharNet

ICCV

2019

~

85.05

81.25

71.08

~

69.2

~

Feng et al. [24]

TextDragon

ICCV

2019

~

82.54

78.34

65.15

86.22

81.62

68.03

48.8

74.8

39.7

72.4

~

Qin et al. [25]

ICCV

2019

~

85.51

81.91

69.94

~

70.7

~

Qiao et al. [26]

Text Perceptron

AAAI

2020

~

91.4

90.7

85.8

94.9

94

88.5

80.5

76.6

65.1

84.1

79.4

67.9

69.7

78.3

57

~

Wang et al. [27]

AAAI

2020

~

88.2

87.7

84.1

~

79.7

75.2

64.1

~

65

76.1

~

41.3

Liu et al. [28]

ABCNet

CVPR

2020

~

69.5

78.4

45.2

74.1

~

3. Survey

[A] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper

[B] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper

[C] [arXiv-2018] Long S, He X, Ya C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper

4. OCR Service

OCR	API	Free
Tesseract OCR Engine	×	√
Azure	√	√
ABBYY	√	√
OCR Space	√	√
SODA PDF OCR	√	√
Free Online OCR	√	√
Online OCR	√	√
Super Tools	√	√
Online Chinese Recognition	√	√
Calamari OCR	×	√
Tencent OCR	√	×

5. References and codes

[1] Wang K, Babenko B, Belongie S. End-to-end scene text recognition[C].2011 International Conference on Computer Vision. IEEE, 2011: 1457-1464. paper
[2] Wang T, Wu D J, Coates A, et al. End-to-end text recognition with convolutional neural networks[C]. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012). IEEE, 2012: 3304-3308. paper
[3] Jaderberg M, Vedaldi A, Zisserman A. Deep features for text spotting[C]. European conference on computer vision. Springer, Cham, 2014: 512-528. paper
[4] Alsharif O, Pineau J. End-to-End Text Recognition with Hybrid HMM Maxout Models[C]. In ICLR 2014. paper
[5] Yao C, Bai X, Liu W. A unified framework for multioriented text detection and recognition[J]. IEEE Transactions on Image Processing, 2014, 23(11): 4737-4749. paper
[6] Neumann L, Matas J. Real-time lexicon-free scene text localization and recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 38(9): 1872-1885. paper
[7] Jaderberg M, Simonyan K, Vedaldi A, et al. Reading text in the wild with convolutional neural networks[J]. International Journal of Computer Vision, 2016, 116(1): 1-20. paper
[8] Liao M, Shi B, Bai X, et al. Textboxes: A fast text detector with a single deep neural network[C]. In AAAI 2017. paper code
[9] Busta M, Neumann L, Matas J. Deep textspotter: An end-to-end trainable scene text localization and recognition framework[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 2204-2212. paper
[10] Li H, Wang P, Shen C. Towards end-to-end text spotting with convolutional recurrent neural networks[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 5238-5246. paper
[11] Lyu P, Liao M, Yao C, et al. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 67-83. paper code
[12] He T, Tian Z, Huang W, et al. An end-to-end textspotter with explicit alignment and attention[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5020-5029. paper code
[13] Liu X, Liang D, Yan S, et al. FOTS: Fast oriented text spotting with a unified network[C]. Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 5676-5685. paper code
[14] Liao M, Shi B, Bai X. Textboxes++: A single-shot oriented scene text detector[J]. IEEE transactions on image processing, 2018, 27(8): 3676-3690. paper code
[15] Minghui Liao, Pengyuan Lyu, Minghang He. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes[J]. IEEE transactions on pattern analysis and machine intelligence, 2019. paper code
[16] Wang,Kai, and S. Belongie. Word Spotting in the Wild. European Conference on Computer Vision(ECCV), 2010: 591-604. Paper
[17] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young,K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao,J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions:entries, results,and future directions. IJDAR, 7(2-3):105–122, 2005. paper
[18] Shahab, A, Shafait, F, Dengel, A: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: ICDAR, 2011. Paper
[19] D. Karatzas, F. Shafait, S. Uchida, et al. ICDAR 2013 robust reading competition. In ICDAR, 2013. Paper
[20] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D.Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. Paper
[21] Chee C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition.Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942.Paper
[22] Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding, TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network, Asian Conference on Computer Vision (ACCV), Cham, 2018, vol. 11363, no. 1, pp. 83–99.Paper
[23] Xing L, Tian Z, Huang W, Convolutional character networks.In ICCV, 2019.Paper code
[24] Feng W, He W, Yin F, et al. TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting.In ICCV, 2019.Paper
[25] Qin S, Bissacco A, Raptis M, et al. Towards unconstrained end-to-end text spotting.In ICCV, 2019.Paper
[26] Qiao L, Tang S, Cheng Z, et al. Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting.In AAAI 2020.Paper
[27] Wang H, Lu P, Zhang H, et al. All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting. In AAAI 2020.Paper
[28] Liu Y, Chen H, Shen C, et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network In CVPR, 2020.Paper code

If you find any problems in our resources, or any good papers/codes we have missed, please inform us at [email protected]. Thank you for your contribution.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
README.md		README.md
scut-dlvc.jpeg		scut-dlvc.jpeg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

End-to-End Scene Text Detection and Recognition System Resources

1. Datasets

1.1 Introduction

1.2 Comparison of Datasets

2. Summary of End-to-end Scene Text Detection and Recognition Methods

2.1 Comparison of methods

2.2 End-to-end scene text detection and recognition results

3. Survey

4. OCR Service

5. References and codes

Copyright

About

Releases

Packages

HCIILAB/Scene-Text-End2end

Folders and files

Latest commit

History

Repository files navigation

End-to-End Scene Text Detection and Recognition System Resources

1. Datasets

1.1 Introduction

1.2 Comparison of Datasets

2. Summary of End-to-end Scene Text Detection and Recognition Methods

2.1 Comparison of methods

2.2 End-to-end scene text detection and recognition results

3. Survey

4. OCR Service

5. References and codes

Copyright

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages