Reference links should be bigger #1043

bfirsh · 2018-08-17T14:32:16Z

Section \ref{section-foo} turns into output that looks like this:

The link is a tiny number, which is fiddly to click. It would be much better if the link was the entire text "Section 3.1". This makes more sense semantically, too.

Unfortunately this is not easy to do automatically, because the preceding text is defined by the author. In the pre-LaTeXML version of Engrafo, we made an attempt at turning these into larger links by looking for preceding strings like "section", "figure", "fig.", etc.
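A minimal sketch of that kind of heuristic, assuming the text immediately before the link is available as a plain string. The word list and the name `split_unit_word` are illustrative only; this is not the code or the list Engrafo used.

```python
import re

# Hypothetical shortlist of unit words; not the list Engrafo actually used.
UNIT_WORDS = ["section", "sec.", "figure", "fig.", "table", "equation", "eq."]

# One unit word, plus any trailing whitespace, at the very end of the text.
# The lookbehind keeps "subsection" from matching "section".
_UNIT_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(w) for w in UNIT_WORDS) + r")(\s*)$",
    re.IGNORECASE,
)


def split_unit_word(preceding_text):
    """Split the text before a link into (rest, unit_word_with_spacing).

    Returns (preceding_text, "") when no known unit word ends the text.
    """
    match = _UNIT_RE.search(preceding_text)
    if match is None:
        return preceding_text, ""
    return preceding_text[: match.start()], preceding_text[match.start():]


rest, unit = split_unit_word("as shown in Section ")
print(repr(rest), repr(unit))  # 'as shown in ' 'Section '
```

The second element is what would be moved inside the link; anything not on the shortlist is left alone, which is the conservative choice when the preceding text is author-defined.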

matteosecli · 2018-08-18T08:49:44Z

What if having just a linked number is the intended output?

If I explicitly want to have "Section 3.1" as the link text, I prefer to resort to \autoref{section-foo}, which automatically produces the desired result (and keeps the semantics). If someone wants something even more fancy, he could use \hyperref[section-foo]{Fancy Section~\ref*{section-foo}}.

I understand that my approach requires taking action at the LaTeX source level; maybe putting this extra feature behind an option, and letting the user choose whether he wants a heavier post-processing of this kind or not, would be the best alternative (from my point of view).

bfirsh · 2018-08-18T09:32:49Z

Almost everybody on arXiv seems to do section \ref{...} or similar unfortunately. So, for the sake of usability, we'll probably have to clobber the occasional case where that is the intended output.

Adding it as an option sounds like a great idea. This probably applies to some of the other stuff we need to do in Engrafo too. It means the intention remains when you use LaTeXML to convert your own documents, but for the wild west of arXiv, we can enable options to massage documents into better output.

This might even work as a plugin in the meantime... I shall investigate...

matteosecli · 2018-08-21T09:09:04Z

Sounds great! 😉

brucemiller · 2018-09-04T15:48:48Z

Sounds sorta doable, at least to the extent that the section-word isn't obfuscated with styling, and to the extent that you have a plausible set of type words (including plurals, abbreviations, etc).

brucemiller · 2019-03-28T16:28:43Z

If the objective is simply to make \ref act like \autoref, then the following code snippet installed in a local style file for --preload should work:

RequirePackage('hyperref');
AtBeginDocument('\let\ref\autoref');

If the objective is to recognize the preceding unit name, that sounds much trickier. I suppose it's probably most feasible at the XML level, perhaps a clever XPath? You'd also need a reasonable dictionary of such words. I'd probably solicit a clever hacker to scan arXiv to generate a list of all words preceding a \ref :>
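A rough sketch of the XML-level idea, working on the generated HTML with Python's standard `xml.etree.ElementTree` rather than a single XPath. The five-word dictionary and the function name `widen_ref_links` are assumptions for illustration, and the sample markup is simplified; nothing here is part of LaTeXML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical dictionary of unit words; a real one would be much longer.
UNIT_RE = re.compile(
    r"(?<!\w)(section|figure|table|theorem|lemma)(\s*)$", re.IGNORECASE
)


def widen_ref_links(root):
    """Move a trailing unit word from the preceding text into each ltx_ref link."""
    for parent in root.iter():
        previous = None  # element sibling just before the current child
        for child in list(parent):
            classes = (child.get("class") or "").split()
            if child.tag == "a" and "ltx_ref" in classes:
                # ElementTree keeps the text before `child` in the parent's
                # .text (first child) or in the previous sibling's .tail.
                before = (parent.text if previous is None else previous.tail) or ""
                match = UNIT_RE.search(before)
                if match:
                    kept, moved = before[: match.start()], before[match.start():]
                    if previous is None:
                        parent.text = kept
                    else:
                        previous.tail = kept
                    child.text = moved + (child.text or "")
            previous = child
    return root


doc = ET.fromstring(
    '<p>See Section <a class="ltx_ref" href="#S3.SS1">3.1</a> for details.</p>'
)
widen_ref_links(doc)
print(ET.tostring(doc, encoding="unicode"))
# <p>See <a class="ltx_ref" href="#S3.SS1">Section 3.1</a> for details.</p>
```

Only the extent of the link changes; the visible text stays exactly as the author wrote it.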

bfirsh · 2019-03-28T18:50:06Z

FWIW it’s also fine if you would rather consider this out of scope for LaTeXML. Seems like a good candidate for a plugin or an Engrafo post-processor.

bfirsh · 2019-03-28T18:51:37Z

Also to be clear — yes, the original intention was the latter. Just rewriting \ref to \autoref will produce double unit names in most cases won’t it?

dginev · 2019-03-28T19:23:16Z

@brucemiller fair enough, I quickly put together a llamapun stats harvester, and will report with some curious data when it completes over the 08.2018 dataset. Usual caveat that it takes ~3 days to walk the corpus with the current components in place.

FWIW, it also feels as out-of-scope to latexml to me, as it goes beyond the original TeX markup in a direction that may or may not be in alignment with the original intention of the author.

dginev · 2019-03-30T14:08:43Z

Here is the report from arXMLiv 08.2018, with an excerpt of the top 10 words:

word, frequency
figure, 3290488
theorem, 3052607
section, 2802295
lemma, 2408488
table, 1544961
proposition, 1334759
and, 1031640
corollary, 476062
appendix, 416964
definition, 317534

https://gist.github.com/dginev/c83d239524e1380f7b0e5e92a24a5eb2

Comments from a first look:

- What is counted by llamapun: alphanumeric strings at the end of a text node that is the previous sibling of a span[@class=ltx_ref] or a[@class=ltx_ref] element.
- The top entries are consistent with what one would expect: a list of headings and logical blocks one tends to reference, as well as mathematical statements one tends to label and number (e.g. via amsthm's \newtheorem).
- Already in the top 10, one can see that you can't reliably upcase any prior word: "\ref{a} and \ref{b}" is one example, and "and" precedes refs over a million times.
- There are several other classes of \ref-ed objects:
  - names of scientists, in what I can only assume are names of notable theorems/objects, taking after their founders: laplacian, kaplansky theorem, etc. Some could be wrongly collected citations and hence considered noise for this purpose?
  - actual concepts, which I assume appear inside equations: inequality, watts, tensor
  - hits in French and Russian, which are also languages used in arXiv, though in a minority of papers
  - action verbs: appearing, combining, implies, satisfying, etc., which are probably mostly used with equation labels and math statement labels
  - conditional words: if, so, which I would assume lay out a proposition with referenced pieces
  - numeric literals: not sure what these are about. Would guess conversion noise, but maybe they have a purpose?
  - typos: my favourite type of entry in the frequency-1 category; you can find two different typos of "construction", constrution and consruction
- arXiv is a noisy corpus, which becomes evident as you scroll to the bottom entries. Some of the low-frequency hits are due to noise from conversion errors or author mistakes; others are just strange (youtube appeared 7 times).
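For readers who want to reproduce the counting on a small sample, here is a hedged Python approximation of the rule described in the first bullet. It is not the llamapun harvester; `count_pre_words` and the sample strings are made up, and the input is assumed to be the text nodes that directly precede each ref element.

```python
import re
from collections import Counter

# Last run of alphanumeric characters at the end of a text node,
# allowing only whitespace after it.
LAST_WORD_RE = re.compile(r"([^\W_]+)\s*$")


def count_pre_words(preceding_texts):
    """Count the lowercased final word of each text node that precedes a ref."""
    counts = Counter()
    for text in preceding_texts:
        match = LAST_WORD_RE.search(text)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Made-up sample inputs, not data from the arXMLiv report.
samples = ["see Figure ", "by Lemma ", "in Figure ", "Theorems 1 and "]
counts = count_pre_words(samples)
print(counts["figure"], counts["lemma"], counts["and"])  # 2 1 1
```

Even this toy run shows the "and" problem from the list above: the word directly before a ref is not always a unit name.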

Ok, will leave things here for a first round of discussion. The data seems to convey that we could grab a shortlist of standardized words from the most frequent entries and auto-upcase them, but that we can't do that reliably for arbitrary words. And this sounds like a stylistic choice for the final presentation, so may be best implemented as a step following latexml, and left to the discretion of the final resource editor.

P.S. I lowercased all words when counting the frequencies, to make the final report smaller.

dginev · 2019-04-08T00:18:51Z

I got no feedback on my report, I assume everyone here is busy. That said, is everyone OK with me closing here and continuing this line of work as a post-latexml step, where needed?

brucemiller · 2019-04-08T03:18:02Z

Sorry for delayed feedback; trying to get something else finished. This is a curious issue; it certainly could be seen as being in scope of latexml. At least to the extent that it only broadens the width of the hyperlink, w/o changing the actual text. And does it consistently/correctly enough that it isn't a new irritant. Your list of words is a good start; much more than the top 10 are worth including. One concern is plurals though. How do you deal with something like "equations \ref{eq:a}--\ref{eq:d}"? You might want a synthetic block that contains a--d. Alternatively, links to b and c, as well. You probably don't want the equivalent of "\ref{equation a}--\ref{d}" and certainly not using the plural "equations".
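One way to keep the plural case from becoming a new irritant is to detect ranges first, so a widening step can skip them or handle them separately. The sketch below only does the detection, and it works on LaTeX source purely for brevity; the plural list, the regular expression, and the name `find_ref_ranges` are all assumptions, not an agreed design.

```python
import re

# Hypothetical plural unit words; a real list would come from corpus data.
PLURAL_UNITS = {"equations", "figures", "sections", "tables", "theorems", "lemmas"}

# unit word, optional space or tie (~), \ref{first}, 1-3 hyphens, \ref{last}
RANGE_RE = re.compile(r"(\w+)[\s~]*\\ref\{([^}]*)\}\s*-{1,3}\s*\\ref\{([^}]*)\}")


def find_ref_ranges(latex_source):
    """Return (unit_word, first_label, last_label) for each plural-unit range."""
    return [
        (word, first, last)
        for word, first, last in RANGE_RE.findall(latex_source)
        if word.lower() in PLURAL_UNITS
    ]


print(find_ref_ranges(r"see equations \ref{eq:a}--\ref{eq:d} and Section \ref{s1}"))
# [('equations', 'eq:a', 'eq:d')]
```

What to do with a detected range (a synthetic block covering a--d, or extra links to the intermediate items) is left open here, as it is in the discussion.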

brucemiller · 2019-04-16T22:03:11Z

As I said, I'm not against this being part of LaTeXML (optional or default, depending on how consistently it can behave), nor am I against it being a post processing operation. But I'm not likely to find time to do much on it till after a release. So, maybe I should punt to you guys whether we should close or defer to next milestone?

brucemiller · 2019-09-11T15:35:18Z

I'm back to my original thinking that it's out-of-scope in the sense of going a bit too far changing the author's intent; but at the same time, I'd like LaTeXML to be the kind of tool that can enable that sort of reworking. But it's really more plugin or post processing territory. Since the thread has gone inactive, I'll go ahead and close, but if more discussion on strategies is wanted, feel free to reopen.

bfirsh mentioned this issue Aug 17, 2018

WIP: 2.0 arxiv-vanity/engrafo#324

Merged

20 tasks

dginev added enhancement postprocessing labels Aug 17, 2018

dginev added this to the LaTeXML-0.8.4 milestone Aug 17, 2018

dginev added the minor label Aug 17, 2018

bfirsh mentioned this issue Aug 27, 2018

Make click target for references bigger arxiv-vanity/engrafo#254

Open

dginev mentioned this issue Mar 28, 2019

Sample scanner for pre-words to .ltx_ref elements KWARC/llamapun#27

Merged

dginev modified the milestones: LaTeXML-0.8.4, LaTeXML-0.8.5 Apr 16, 2019

brucemiller closed this as completed Sep 11, 2019
