

StephanOepen edited this page Aug 26, 2009 · 21 revisions


To select the most effective method of text extraction and to parameterize text correction, it is relevant to distinguish the various ways in which the PDF documents in the NORA collection were produced, e.g. with LaTeX vs. Microsoft Word vs. other word processing tools. In the LaTeX world, for example, it may matter which specific approach was used to output PDF, e.g. latex plus dvips plus ps2pdf vs. pdflatex vs. integrated tools like MiKTeX. Likewise, when using Word, results may vary according to which specific software version was used, or depending on whether Adobe Acrobat Distiller or another tool was used for PDF creation. When it comes to font choices and character encodings, more basic properties of the original environment used for PDF creation might also turn out to be relevant, e.g. the choice of operating system (Linux vs. Windows, say) and the default locale settings.

Presumably many dozens or hundreds of distinct software environments were at play in the production of the NORA PDF files, and hopefully most of this variation will be irrelevant for the WeScience0 effort. Furthermore, only quite limited information about the original environment is recorded in the PDF files, so it may at times be impossible to determine exactly where a document falls along the dimensions of variation listed above. However, we need to find out what information about the production process is actually available in PDF files, and we will need a simple tool to inspect PDF meta information and extract relevant parameters. Some of the text extraction tools for PDF (see the NoraExtraction page) may be usable for this task too.
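As a starting point for such an inspection tool, the following is a minimal sketch in Python (standard library only) and not a settled design. It assumes that the PDF document information dictionary is stored uncompressed and that the `/Producer`, `/Creator`, and `/CreationDate` values are literal strings, which will not hold for every file (hex strings, object streams, and encrypted files are ignored). The function names `extract_info` and `guess_toolchain`, the category labels, and the substring heuristics are all hypothetical and would need to be checked against the actual NORA files.

```python
import re
import sys

# Info-dictionary keys that hint at the production environment.
INFO_KEYS = ("Producer", "Creator", "CreationDate")

# A PDF literal string: "(...)" with backslash escapes and at most one
# level of balanced, unescaped nested parentheses (a simplification).
LITERAL = rb"\s*\(((?:\\.|[^\\()]|\([^()]*\))*)\)"


def extract_info(data: bytes) -> dict:
    """Return the first literal-string value found for each key in INFO_KEYS."""
    info = {}
    for key in INFO_KEYS:
        match = re.search(rb"/" + key.encode("ascii") + LITERAL, data)
        if match:
            # Simplified unescaping: drop the backslash, keep the next byte.
            raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
            info[key] = raw.decode("latin-1")
    return info


def guess_toolchain(info: dict) -> str:
    """Map Producer/Creator strings to a coarse, assumed category label."""
    producer = info.get("Producer", "").lower()
    creator = info.get("Creator", "").lower()
    if "pdftex" in producer:
        return "pdflatex"
    if "dvips" in creator:
        return "latex+dvips"
    if "word" in creator or "word" in producer:
        return "word"
    if "distiller" in producer:
        return "distiller"
    return "unknown"


if __name__ == "__main__":
    # Usage: python pdfmeta.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            info = extract_info(handle.read())
        print(path, guess_toolchain(info), info, sep="\t")
```

Running the script over the whole collection and counting the resulting labels would give a first impression of how much of the variation discussed above is actually recoverable from the files. Documents that come out as `unknown` would need a closer look, possibly with one of the existing extraction tools.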
