Unicode detection error #10247

Closed
akashdeepbansal opened this issue Nov 12, 2018 · 5 comments

Comments

@akashdeepbansal

We are trying to detect symbols in a LaTeX-generated PDF file. PDF.js works very well for most symbols, but has problems with the following ones:

  1. \phi
  2. \varphi
  3. \Upsilon
  4. \epsilon
  5. \Updownarrow
  6. \mapsto
  7. \longmapsto
  8. \neq
  9. \notin
  10. \simeq
  11. \cong

@timvandermeij
Contributor

In order to help with this, we need an example PDF file. Could you create one that has these symbols in it and attach it to this issue? Moreover, what do you mean by "detecting symbols"? Is the rendering not working correctly, are the symbols missing from the text layer, et cetera?

@akashdeepbansal
Author

Please find attached a PDF containing some of these symbols. By "detecting symbols", I mean the following: if I render this PDF in Firefox, which uses pdf.js for rendering, and use Inspect Element, I do not find any of these symbols in the HTML. The visual rendering on the web is perfect, but when I try to access the symbols in the HTML, they are not there as UTF-8 characters.

I wanted to attach the HTML file as well, but GitHub does not allow me to do so. Let me know if there is a way to do that. Does this answer your question?

pdfjs_test.pdf

@Snuffleupagus
Collaborator

There isn't any ToUnicode data to be found in any of the fonts in this PDF file, and PDF viewers are thus "on their own" when it comes to extracting text.
Note in particular that e.g. Adobe Reader (the PDF reference implementation) is not able to copy any of the symbols successfully in this file.

This is a bug in the PDF file itself, and please note that the PDF.js library is already doing a better job here than some other PDF viewers (since the text is mostly copyable).
Unfortunately, with no ToUnicode data available (and not even e.g. any Encoding data present), there's very little that can be done for these symbols and this issue is thus invalid.
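
For readers unfamiliar with the term: a ToUnicode CMap is a small table embedded in a PDF font that maps character codes to Unicode values, with single entries written as hex pairs inside `beginbfchar` / `endbfchar` sections. The following is only a minimal sketch of reading such entries, not PDF.js code; the helper name `parseBfChar` and the character codes in the sample are made up for illustration, and `bfrange` sections are ignored.

```typescript
// Hypothetical helper, not part of PDF.js: reads only `bfchar` sections.
// Real ToUnicode CMaps may also contain `bfrange` sections, ignored here.
function parseBfChar(cmap: string): Record<number, string> {
  const result: Record<number, string> = {};
  const section = /beginbfchar([\s\S]*?)endbfchar/g;
  let s: RegExpExecArray | null;
  while ((s = section.exec(cmap)) !== null) {
    const body = s[1] || "";
    const pair = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
    let p: RegExpExecArray | null;
    while ((p = pair.exec(body)) !== null) {
      const code = parseInt(p[1] || "0", 16);
      // The target is UTF-16BE in hex: four hex digits per code unit.
      const units: string[] = (p[2] || "").match(/[0-9A-Fa-f]{4}/g) || [];
      result[code] = units
        .map((u) => String.fromCharCode(parseInt(u, 16)))
        .join("");
    }
  }
  return result;
}

// Made-up character codes, for illustration only.
const sampleCMap = [
  "2 beginbfchar",
  "<01> <03C6>", // code 0x01 -> U+03C6 GREEK SMALL LETTER PHI
  "<02> <2260>", // code 0x02 -> U+2260 NOT EQUAL TO
  "endbfchar",
].join("\n");

const toUnicode = parseBfChar(sampleCMap);
console.log(toUnicode[1], toUnicode[2]);
```

When a font carries no such table, as in the attached file, a viewer has nothing like `toUnicode` to consult and must guess from other information.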

@akashdeepbansal
Author

I completely agree with you. What I am interested in is how pdf.js extracts the information for the other symbols, and why that is not possible for these few symbols.

We are interested in this in order to make mathematical equations accessible to screen readers. We would be happy if you could help us understand how pdf.js is able to extract the information for the other symbols.

@Snuffleupagus
Collaborator

> What I am interested in is how pdf.js extracts the information for the other symbols, and why that is not possible for these few symbols.

For standard glyphs, these maps are used to provide a reasonable fallback for missing ToUnicode data: https://github.com/mozilla/pdf.js/blob/master/src/core/glyphlist.js
In general though, it does not seem correct to add every single TeX/LaTeX-specific glyph just to deal with broken PDF files. (The few that do exist were added to fix a specific class of font-rendering, rather than text-selection, bugs on Linux; refer to PR #7705.)

@timvandermeij All in all, given #10247 (comment), this probably ought to be closed as INVALID/WONTFIX.
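
The fallback order described in this thread (ToUnicode data first, then a glyph-name table for standard glyphs, otherwise nothing) can be sketched roughly as follows. This is not the PDF.js implementation: the `SketchFont` shape and the `resolveChar` function are hypothetical, and the table is a tiny illustrative subset whose names and code points follow the Adobe Glyph List.

```typescript
// Illustrative subset only; names and code points follow the Adobe Glyph List.
const GLYPH_TO_UNICODE: Record<string, number> = {
  phi: 0x03c6,
  epsilon: 0x03b5,
  Upsilon: 0x03a5,
  notequal: 0x2260,
};

// Hypothetical, simplified view of the font data a viewer might have.
interface SketchFont {
  toUnicode?: Record<number, string>; // char code -> Unicode string
  encoding?: Record<number, string>; // char code -> glyph name
}

function resolveChar(font: SketchFont, code: number): string | null {
  // 1. Prefer explicit ToUnicode data when the font provides it.
  const direct = font.toUnicode ? font.toUnicode[code] : undefined;
  if (direct !== undefined) {
    return direct;
  }
  // 2. Otherwise fall back to the glyph name, if it is a known one.
  const name = font.encoding ? font.encoding[code] : undefined;
  if (
    name !== undefined &&
    Object.prototype.hasOwnProperty.call(GLYPH_TO_UNICODE, name)
  ) {
    const codePoint = GLYPH_TO_UNICODE[name];
    if (codePoint !== undefined) {
      return String.fromCharCode(codePoint);
    }
  }
  // 3. Nothing to go on: the symbol cannot be mapped to text.
  return null;
}

// Made-up character codes; "mapsto" stands for any name missing from the table.
const withEncoding: SketchFont = { encoding: { 1: "phi", 2: "mapsto" } };
console.log(resolveChar(withEncoding, 1)); // "φ": the name is in the table
console.log(resolveChar(withEncoding, 2)); // null: the name is not in the table
console.log(resolveChar({}, 1)); // null: no ToUnicode and no encoding data
```

The last two calls correspond to the situation reported here: glyphs with non-standard names, or fonts with no usable mapping data at all, yield no text even though they render correctly.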
