Clarification of current PDF viewer support for AccSupp ActualText #2
Comments
There are documented limits on the length of various PDF constructs: see the PDF Reference Manual. It would not surprise me if Acrobat wasn't reflecting that. |
@josephwright interesting, thank you for pointing that out. For completeness, would you mind sharing the URL and page/section where this is written? |
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, toward the end: 'Implementation Limits'. |
@josephwright the AccSupp ActualText limit is which part of which page - is it "string (in content stream)", which has a limit of 32,767 bytes? That would be max 16,383 UTF16 characters, and would coincide with what Chrome renders. (Because PDF copy-paste is generally so unreliable, I guess trying to line up multiple AccSupps in order to get multiples of 32,767 bytes would be a generally bad idea. That would be a question for discussion in a separate thread though e.g. https://tex.stackexchange.com/questions/563803/how-make-a-latex-document-that-generates-a-pdf-from-which-copy-paste-works-corre .) |
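The arithmetic behind the 16,383 figure can be checked with a short Python sketch. It assumes, as the comment above does, that the 32,767-byte string limit applies to the ActualText payload and that the text is stored as UTF-16; neither assumption is confirmed in this thread, and the helper names are made up for illustration.

```python
# Limit for "string (in content stream)" as quoted in the comment above.
STRING_LIMIT_BYTES = 32767


def utf16_byte_length(text: str) -> int:
    """Number of bytes the text occupies as UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-be"))


def fits_in_string_limit(text: str, limit: int = STRING_LIMIT_BYTES) -> bool:
    """True if the UTF-16 form of the text stays within the byte limit."""
    return utf16_byte_length(text) <= limit


# Each character in the Basic Multilingual Plane takes 2 bytes.
max_bmp_chars = STRING_LIMIT_BYTES // 2
print(max_bmp_chars)                           # 16383
print(fits_in_string_limit("a" * 16383))       # True
print(fits_in_string_limit("a" * 16384))       # False
```

Characters outside the Basic Multilingual Plane take 4 bytes each in UTF-16, so the character count that fits can be lower than 16,383 for such text.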
After further testing we realized that Chrome always produces an AccSupp as one single line, meaning its AccSupp support is broken. Updated the topmost post in this thread accordingly. This means to date only Evince has AccSupp support that works. @josephwright @davidcarlisle , do you have any expectation that other PDF viewers will start delivering AccSupp ActualText correctly in the future? |
I have no idea what PDF viewers will do here in the future. But why do you claim that they don't handle it correctly already? Nowhere in the specification is /ActualText described as a means to store long code listings with thousands of characters. Actually the specification says "This replacement text (which should apply to as small a piece of content as possible) ...". |
@u-fischer thanks for your response. To understand this better, would you mind sharing your view on the following? Our goal is copy-paste that works just as it does in HTML/web browsers: leading spaces and empty lines are preserved, as are double spaces and all other characters within lines. Would you say that PDF copy-paste behavior is so ambiguous that, if you put several AccSupps in a sequence, you have no idea what will come out? If a sequence could be relied on, then the "small piece" wording you quoted from the PDF specification could be satisfied. Of course, what counts as a "small piece" is open: I think ten A4 pages' worth of characters is a small piece, though someone else could suggest that "small" means fewer than 100 characters. To satisfy that reading we could make one AccSupp per 100 characters. Two days from now we will submit bug reports to all popular PDF viewers with incorrect AccSupp handling, asking that they copy-paste AccSupp ActualText correctly. |
Just for completeness, in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf : the "small a piece" quote you gave is from the definition of ActualText on page 559, which reads:
Section "14.9.4 Replacement Text" is on page 615 and reads:
The last line, that a sequence of AccSupps should "be treated as if no word break is present between them", is interesting. For our copy-paste use case, we should make a TeX test with one AccSupp per character and see whether the viewer honors it. It should, shouldn't it? |
That is, I think, the more normal use. In fact you only really need it for characters that don't have a standard Unicode encoding in the font, which here is possibly just the white space; the other characters should copy naturally without it. |
Well, a PDF contains glyphs, not characters, and this distinction is important here: ActualText allows you to attach "characters" to a glyph (or a picture, or a small cluster of glyphs). As described here https://tex.stackexchange.com/a/564164/2388 this is important for languages where you can't simply assign one or more ToUnicode values to every glyph to represent the input, for example when font shaping changes the order. Such scripts use ActualText to add the correct characters in those cases (and so enable copy&paste), and they use it for small pieces, in the Devanagari example around two glyphs:
Marked Content operators with ActualText inside the page stream cannot be nested. If you start to add it around long pieces of text, you would break the copy&pasting of these scripts. This may not matter for your use case, but it means that it can't be a general solution to improve copy&pasting of code. (ActualText can also be added to structure elements, but it is rather unclear how well PDF viewers support this, and how they handle nesting in such cases.) Besides this: as a user I wouldn't want to have to extract long code listings by copy&paste from a PDF. I would greatly prefer an attachment or an embedded file that I can simply save or open in an editor, as I can then be more confident that I got the real code, and not something that got modified by the PDF viewer or the OS. |
Here follows a series of tests of PDF copy-paste behavior:
Test 1
This text:
corresponds to this UTF-16BE encoding (big-endian: the code unit 0048 is "H"):
0048006c006f000a002000200077000a0072000a000a006c0064007b
Here follows a TeX file that delivers this string as single-character AccSupps. None of the AccSupps has any visual representation.
That is, correct except verbatim copy-paste is broken as leading spaces and empty lines have been stripped.
The order is broken and the second "l" is gone.
I.e. it conveniently ignores the AccSupps altogether.
So we see that current PDF viewers have a very peculiar way of handling sequences of AccSupp ActualTexts. Specifically, an AccSupp ActualText that contains only a space or a newline will (tend to) be ignored. This gives weight to the idea of placing a whole copy-paste inside one single AccSupp. Let's try a middle-ground approach:
And only break between "H" and "l", between "o" and "w", between "r" and "l", and between "d" and "{". That is, this TeX:
just wow!
I'll now make a set of additional tests to look for unexpected behavior relating to AccSupps.
I.e. it now suddenly starts inserting a space character between each AccSupp. This is undesired, though I guess it can't be viewed as totally unreasonable.
That is total chaos: it sometimes picks the letter (A, D, E) and sometimes the ActualText, but when it picks the ActualText it strips double spaces and newlines.
That is, the whitespace between AccSupps that we saw when adding in characters A-E has now become newlines, which is even worse.
That is the same as Adobe Reader DC: it now adds newlines between AccSupps.
Gibberish: it produces the visual characters and then treats each AccSupp as either nothing or a newline.
So our testing to this point shows that only AccSupp ActualTexts with no visual content give correct verbatim copy-paste behavior. The next attempt has two parts. First comes a separate set of AccSupps that carry the intended text as ActualText and contain no visual text. This is one AccSupp per character, except that spaces and newlines are incorporated into AccSupps in such a way that there is never a space or newline at the beginning or end of an AccSupp. Then, separately, the ActualText is shown visually, with each visual character contained in an AccSupp with empty ActualText. This way the visual part should never be considered for copying by the PDF viewer, which avoids a "double copy" of the text in the copied result. TeX:
That is beautifully correct.
Same partial reverse order bug as above.
Its output is consistent with ignoring AccSupps altogether.
In essence this outcome was greatly encouraging. Its only shortcoming is that the selection for copying the whole section has to be made exactly at the beginning of the text block, and selecting text within the text block copies nothing. This is not how text selection normally works, and people would find it unintuitive.
The intended text is now organized into what we call a sequence of clusters. Each individual character is a cluster, except in the presence of spaces and newlines, which are incorporated into groups in such a way that there is never a space or newline at the beginning or end of a cluster. We then produce the TeX as follows, for each cluster:
Followed by, for each character contained in the cluster,
If this works, it combines successful verbatim copy-paste with the convenience of being able to make the text selection for copying at the approximate visual location in the document where the text is actually displayed. Also note that this satisfies AccSupp's requirement never to span a page boundary. For the same text as above, that means this TeX:
I.e. unfortunately not correct: the presence of visual elements is reflected here as spaces (between H and l, and between d and {) and newlines (between o and w, and between r and l).
Curiously the same behavior.
..so that's a total fail: visuals with AccSupp ActualText="" between the AccSupp ActualTexts mess up the copy-paste behavior. In test 6 I attempted to get correct verbatim copy-paste together with the ability to make selections within a text block and copy such parts, but this test failed. @u-fischer , do you see any mistake I made in test 6, that is, do you see any way to get correct verbatim copy-paste from selections within a text block? Any thoughts are much appreciated; if you have none I'll presume it's impossible with PDF (at least currently). (To discuss next: what bug reports to file with all the PDF viewers.) |
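The cluster rule used in the tests above (each character is its own cluster, except that spaces and newlines are absorbed so that no cluster begins or ends with one) can be sketched in Python. This is an illustration only, not the code used to generate the test files: the function name is hypothetical, and the handling of text that itself starts or ends with whitespace is an assumption, since the rule cannot be satisfied in that case.

```python
WHITESPACE = " \n"  # the thread only discusses spaces and newlines


def split_into_clusters(text: str) -> list[str]:
    """Split text so that whitespace only ever sits inside a cluster.

    Assumption: if the text itself starts or ends with whitespace, that
    whitespace stays attached to the first or last cluster, because the
    rule has no way to place it in the interior.
    """
    clusters = []
    current = ""
    for ch in text:
        if not current:
            current = ch
        elif ch in WHITESPACE or current[-1] in WHITESPACE:
            # Whitespace joins the running cluster, and the next visible
            # character closes the whitespace run inside the same cluster.
            current += ch
        else:
            clusters.append(current)
            current = ch
    if current:
        clusters.append(current)
    return clusters


sample = "Hlo\n  w\nr\n\nld{"
print(split_into_clusters(sample))
# ['H', 'l', 'o\n  w\nr\n\nl', 'd', '{']
```

For the test string, everything from "o" to the second "l" ends up in one cluster, because each whitespace run ties its two neighbouring characters together.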
What do you mean? Perhaps what you said now is the explanation to some of the issues we experienced in trying to get verbatim copy-paste with or without AccSupp ActualText.
Right, this is for snippets only, such as compilation instructions (where a configure or build command is typically 5-20 lines), or at the longest a short configuration file. File attachments are useful but also mean hassle, e.g. save the attachment to a file -> devise a location in the home or a temporary directory for the file -> open that file separately in an editor -> do select all+copy in that editor -> paste to the right location -> close the separate editor -> delete the temporary file. This is about 30 seconds of extra work per individual copy step. |
@davidcarlisle are you sure? Per my tests above it looks like copy-paste in PDF is largely broken by default with respect to space indentation and empty lines. My test of making an AccSupp for a whitespace character failed totally; did I miss anything? What are your thoughts about my "test 5" above as the currently optimal way of achieving verbatim copy-paste in PDF? Also, do you have any thoughts about how to make the "test 6" approach work? |
Well, it is nice that you are doing all these tests (I don't have the time currently), but basically you are confirming what I thought before: that copy&pasting of code is not reliably possible, and that using ActualText/accsupp is not the right way to enable it. PDF viewers use heuristics for copy&paste. This is quite nice, as they nowadays get the text more or less right, even if there are no real spaces, or if there are hyphens from hyphenation. They often even preserve some formatting and tabular layout. But there is no specification, and you don't know what you will get after the next update. If something changes you couldn't even claim that it is a bug.
In Adobe Reader I can simply double-click on an icon and the file opens in my editor (after a security question).
|
@u-fischer do you see any way that my "test 6" can be fixed so that it works, that is, some way to keep visual glyphs between ActualTexts from breaking the copy-paste? I see your point that attachments will clearly be well preserved. PDF also has a link-to-attachment function, which makes opening attachments more convenient, doesn't it? E.g. the user would just click the text box and the attachment would open. Direct copy-paste from a document still has a charm to it. If my understanding of the outcome of test 5 is correct, then I have proven that arbitrary-length verbatim copy-paste is possible (in Adobe Reader DC, and in other viewers too after they fix their bugs), though the whole text must be concentrated at one single location/coordinate in the PDF. |
It would help us if you could attach an example PDF on https://bugzilla.mozilla.org/show_bug.cgi?id=1669335. |
Hi, we have just benchmarked all the popular PDF viewers we could think of for AccSupp support, and this is the outcome:
AccSupp copy-paste support by PDF viewer, as of 24 September 2020
Full or limited support:
Broken support:
Adobe Acrobat Reader DC (2020.012.20043 on Windows): Empty lines and indentation are produced correctly. It would not process an AccSupp 100,000 characters long; it complained that the PDF was broken and told the user to contact the author. It would open a PDF with an AccSupp 8063 characters long, but it provides only a random selection of 4248 of them, i.e. Adobe Acrobat sometimes breaks the output.
Internet Explorer (on Windows): Appears to use Adobe Acrobat internally; same broken behavior as Acrobat.
Chrome (85.0.4183.21 on Windows and 85.0.4183.121 on macOS): Supported, but truncates to the first 16383 characters, and also removes newlines (which removes empty lines as well) and replaces double spaces with single spaces.
No support:
Firefox (80.0.1 64bit on Windows)
MuPDF (1.17.0 on Windows)
PDF-XChange Viewer (2.5 on Windows)
SumatraPDF (3.2 64bit on Windows)
Reproduction script:
Replace HEX with the hex-encoded UTF-16 form of the 8K or >100K character text.
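The HEX placeholder can be produced with a few lines of Python. This is a sketch under assumptions: it emits big-endian UTF-16 without a byte-order mark, matching the hex sample quoted in test 1, and the function names and the filler text are hypothetical. Whether the reproduction script expects a byte-order mark is not shown here.

```python
def to_accsupp_hex(text: str) -> str:
    """Hex-encode text as big-endian UTF-16, without a byte-order mark."""
    return text.encode("utf-16-be").hex()


def from_accsupp_hex(hex_string: str) -> str:
    """Inverse of to_accsupp_hex, useful for checking a payload."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


def filler_text(length: int) -> str:
    """Hypothetical filler: indented, numbered lines cut to `length` characters."""
    lines = []
    total = 0
    n = 0
    while total < length:
        line = f"  line {n}\n"
        lines.append(line)
        total += len(line)
        n += 1
    return "".join(lines)[:length]


# Cross-check against the hex string quoted in test 1.
sample = "Hlo\n  w\nr\n\nld{"
assert to_accsupp_hex(sample) == (
    "0048006c006f000a002000200077000a0072000a000a006c0064007b"
)

# An 8063-character payload, the size used in the Acrobat test above.
payload = to_accsupp_hex(filler_text(8063))
print(len(payload))  # 32252, i.e. 4 hex digits per character
```

The filler deliberately contains leading spaces and newlines, since those are the characters the viewers above tend to drop.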