TL;DR: The text is already embedded in the .pdf, so there was no need to perform OCR. I did it anyway. Lessons learned:
- For Tesseract OCR, gImageReader is a good tool if you're on Windows: https://sourceforge.net/projects/gimagereader/
- OCR accuracy is much higher if you select a block of uniform text (same font, size, quality, etc.)
An archive of a local paper was recently published in a Facebook group. Using the front page of the first paper in the archive, I put together an OCR tool to see how easy it would be to pull text from the .pdfs. I used an online tool to convert that page from .pdf to .png. I also fooled around with the image settings to see how much the text extraction results would change.
Nearly the entire implementation comes from the code suggested in this post, with small modifications:
https://stackoverflow.com/questions/10947399/how-to-implement-and-do-ocr-in-a-c-sharp-project
I believe the following could improve the text extraction:
- Image processing
- Improving the training data (telling it which fonts are used, telling it what the correct results are; I don't know Tesseract well enough to say)
- Selecting blocks of similar text
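As a rough illustration of the image-processing idea in the list above, here is a minimal sketch of binarizing a grayscale page before handing it to OCR. It uses Otsu's method to pick a threshold, which is my own choice for the example and not something this project does. The function names are hypothetical, and the page is assumed to already be loaded as a list of rows of 0-255 integers (in practice you would get that from an imaging library).

```python
def otsu_threshold(pixels):
    """Pick the 0-255 threshold that maximizes between-class variance.

    `pixels` is assumed to be a list of rows of grayscale ints (0-255).
    """
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        sum_bg += level * hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    threshold = otsu_threshold(pixels)
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


# Tiny made-up "page": dark ink (30) on a light background (220).
page = [[30, 220], [220, 30]]
clean = binarize(page)
```

Whether this actually helps Tesseract on a given scan is something to test rather than assume; I have not measured it on the newspaper pages.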
If fairly reliable text could be extracted, it would be trivial to tokenize and index it with something like Solr or Elasticsearch to make a searchable archive.
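To give a feel for what "tokenize and index" means, here is a toy inverted index in plain Python. It is only a sketch of the idea, not how Solr or Elasticsearch work internally, and the page IDs and page text are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page IDs whose text contains it.

    `pages` is a dict of {page_id: extracted_text}.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, query):
    """Return the page IDs that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output for two pages.
pages = {
    "page-1": "Town council meets Monday",
    "page-2": "Council votes on school budget",
}
index = build_index(pages)
hits = search(index, "council budget")
```

A real search engine adds stemming, ranking, and fuzzy matching on top of this, and fuzzy matching matters a lot when the OCR text contains misread characters.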
Feel free to try it for yourself. You'll need to put the English training data (all files in the screenshot) in the tessdata directory in the project. Tesseract is licensed under the Apache License 2.0 (below).
Download and index them all!
Name: Tesseract OCR engine
Source: https://github.com/tesseract-ocr/tesseract
License: Apache license 2.0 (below)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.