Unicode detection error #10247

akashdeepbansal · 2018-11-12T11:15:56Z

We are trying to detect symbols in a LaTeX-generated PDF file. PDF.js works very well for most symbols, but has problems with the following ones:

\phi
\varphi
\Upsilon
\epsilon
\Updownarrow
\mapsto
\longmapsto
\neq
\notin
\simeq
\cong

timvandermeij · 2018-11-12T22:15:00Z

In order to help with this, we need an example PDF file. Could you create one that contains these symbols and attach it to this issue? Moreover, what do you mean by "detecting symbols"? Is the rendering not working correctly, are the symbols missing from the text layer, et cetera?

akashdeepbansal · 2018-11-13T17:13:24Z

Please find attached a PDF containing some of these symbols. By "detecting symbols", I mean the following: if I open this PDF in Firefox, which uses pdf.js for rendering, and use Inspect Element, I do not find any of these symbols in the HTML. The visual rendering on the web is perfect, but when I try to access the symbols in the HTML, I don't find them as UTF-8 text.

I wanted to attach the HTML file as well, but GitHub does not allow me to do so. Let me know if there is a way I can share it. Please also let me know whether this answers your question.

pdfjs_test.pdf

Snuffleupagus · 2018-11-14T09:47:28Z

There isn't any ToUnicode data to be found in any of the fonts in this PDF file, and PDF Viewers are thus "on their own" when it comes to extracting text.
Note in particular that e.g. Adobe Reader (the PDF reference implementation) is not able to copy any of the symbols successfully in this file.

This is a bug in the PDF file itself, and please note that the PDF.js library is already doing a better job here than some other PDF viewers (since the text is mostly copyable).
Unfortunately, with no ToUnicode data available (and not even e.g. any Encoding data present), there's very little that can be done for these symbols and this issue is thus invalid.

akashdeepbansal · 2018-11-14T10:03:41Z

I completely agree with you. What I am interested in is how pdf.js extracts information for the other symbols, and why this is not possible for these few symbols.

We are interested in this in order to make mathematical equations accessible to screen readers. We would be grateful if you could help us understand how pdf.js is able to extract the other symbols' information.

Snuffleupagus · 2019-02-07T11:34:33Z

> What I am interested is how pdf.js is extract information for other symbols? And why it is not possible for these few symbols?

For standard glyphs, these maps are used to provide a reasonable fallback for missing ToUnicode data: https://github.com/mozilla/pdf.js/blob/master/src/core/glyphlist.js
In general though, it does not seem correct to add every single TeX/LaTeX-specific glyph just to deal with broken PDF files. (The few that do exist were added to fix a specific class of font-rendering, rather than text-selection, bugs on Linux; refer to PR #7705.)
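
To illustrate the fallback described above, here is a minimal sketch in Python. It is not the pdf.js implementation: the function name `resolve_unicode`, its parameters, and the six-entry table are hypothetical, and the table is only a tiny excerpt in the style of the Adobe Glyph List. The idea is that a viewer prefers the font's ToUnicode data, falls back to a glyph-name lookup when the font's Encoding supplies a standard name, and otherwise has nothing to go on.

```python
from typing import Dict, Optional

# Tiny illustrative excerpt of glyph-name -> code point pairs in the style of
# the Adobe Glyph List. This is NOT the actual table from glyphlist.js.
GLYPH_NAME_TO_UNICODE: Dict[str, int] = {
    "phi": 0x03C6,
    "Upsilon": 0x03A5,
    "epsilon": 0x03B5,
    "notequal": 0x2260,
    "notelement": 0x2209,
    "congruent": 0x2245,
}


def resolve_unicode(
    char_code: int,
    to_unicode: Optional[Dict[int, str]],
    encoding: Optional[Dict[int, str]],
) -> Optional[str]:
    """Hypothetical helper: map a character code in a font to Unicode text.

    to_unicode: char code -> Unicode string (the font's ToUnicode data), if any.
    encoding:   char code -> glyph name (the font's Encoding data), if any.
    """
    # 1. Explicit ToUnicode data always wins when it is present.
    if to_unicode is not None and char_code in to_unicode:
        return to_unicode[char_code]

    # 2. Otherwise fall back to the glyph name, if the font provides one
    #    and the name is a standard one found in the glyph list.
    if encoding is not None:
        glyph_name = encoding.get(char_code)
        if glyph_name in GLYPH_NAME_TO_UNICODE:
            return chr(GLYPH_NAME_TO_UNICODE[glyph_name])

    # 3. No ToUnicode data and no usable glyph name: nothing can be recovered.
    return None
```

With neither ToUnicode nor Encoding data, as reported for the attached file, the sketch returns `None` for every character code, which corresponds to symbols that render correctly but cannot be extracted as text.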

@timvandermeij All in all, given #10247 (comment), this probably ought to be closed as INVALID/WONTFIX.

timvandermeij added other information-requested labels Nov 12, 2018

timvandermeij added text-selection and removed other information-requested labels Nov 13, 2018

timvandermeij closed this as completed Feb 7, 2019
