Reference links should be bigger #1043

Closed
bfirsh opened this issue Aug 17, 2018 · 13 comments

Comments

@bfirsh
Contributor

bfirsh commented Aug 17, 2018

Section \ref{section-foo} turns into output that looks like this:

[Screenshot of the rendered output, 2018-08-17]

The link is a tiny number, which is fiddly to click. It would be much better if the link were the entire text "Section 3.1". This makes more sense semantically, too.

Unfortunately, this is not easy to do automatically, because the preceding text is defined by the author. In the pre-LaTeXML version of Engrafo, we made an attempt at turning these into larger links by looking for preceding strings like "section", "figure", "fig.", etc.
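
To make the string-matching idea concrete, here is a minimal sketch in Python. It is not Engrafo's actual code: the word list, the function name `widen_ref_links`, and the assumption that each reference is rendered as a plain `<a>` element directly after the unit word are all hypothetical.

```python
import re

# Hypothetical shortlist of unit words; not taken from Engrafo.
UNIT_WORDS = ["sections", "section", "sec.", "figures", "figure", "fig.", "tables", "table"]

# A unit word, some whitespace (or &nbsp;), then a link whose text is the number.
_PATTERN = re.compile(
    r"\b(?P<unit>" + "|".join(re.escape(w) for w in UNIT_WORDS) + r")"
    r"(?P<gap>(?:\s|&nbsp;)+)"
    r"(?P<open><a\b[^>]*>)(?P<label>[^<]*)</a>",
    re.IGNORECASE,
)


def widen_ref_links(html: str) -> str:
    """Move a unit word that directly precedes a link inside that link."""
    return _PATTERN.sub(
        lambda m: f"{m['open']}{m['unit']}{m['gap']}{m['label']}</a>",
        html,
    )


print(widen_ref_links('See Section <a href="#S3.SS1">3.1</a>.'))
```

Words outside the list, such as the "and" in `\ref{a} and \ref{b}`, are left alone, which is also why the approach depends entirely on how good the word list is.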

@matteosecli
Contributor

What if having just a linked number is the intended output?

If I explicitly want to have "Section 3.1" as the text link, I prefer to resort to \autoref{section-foo}, which automatically produces the desired result (and keeps the semantics). If someone wants something even fancier, they could use \hyperref[section-foo]{Fancy Section~\ref*{section-foo}}.

I understand that my approach requires taking action at the LaTeX source level; maybe putting this extra feature behind an option, and letting the user choose whether they want heavier post-processing of this kind, would be the best alternative (from my point of view).

@bfirsh
Contributor Author

bfirsh commented Aug 18, 2018

Almost everybody on arXiv seems to do section \ref{...} or similar unfortunately. So, for the sake of usability, we'll probably have to clobber the occasional case where that is the intended output.

Adding it as an option sounds like a great idea. This probably applies to some of the other stuff we need to do in Engrafo too. It means the intention remains when you use LaTeXML to convert your own documents, but for the wild west of arXiv, we can enable options to massage documents into better output.

This might even work as a plugin in the meantime... I shall investigate...

@matteosecli
Contributor

Sounds great! 😉

@brucemiller
Owner

Sounds sorta doable, at least to the extent that the section-word isn't obfuscated with styling, and to the extent that you have a plausible set of type words (including plurals, abbreviations, etc).

@brucemiller
Owner

If the objective is simply to make \ref act like \autoref, then the following code snippet installed in a local style file for --preload should work:

RequirePackage('hyperref');
AtBeginDocument('\let\ref\autoref');

If the objective is to recognize the preceding unit name, that sounds much trickier. I suppose it's probably most feasible at the XML level, perhaps a clever XPath? You'd also need a reasonable dictionary of such words. I'd probably solicit a clever hacker to scan arXiv to generate a list of all words preceding a \ref :>
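
A rough sketch of what the XML-level variant could look like, using only Python's standard library. Everything here is an assumption made for illustration: the tiny word dictionary, the function name `absorb_unit_words`, and the reliance on the `ltx_ref` class to spot references. `xml.etree.ElementTree` supports only a subset of XPath, so the text before each reference is found by walking the children by hand; with a full XPath engine, an expression along the lines of `preceding-sibling::text()[1]` might do the same job.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical dictionary of unit words (lowercase).
UNIT_WORDS = {"section", "figure", "table", "theorem", "lemma"}

# Last word of a text node, plus the whitespace that follows it.
_TRAILING_WORD = re.compile(r"(\w+)(\s+)$")


def absorb_unit_words(root):
    """Move a unit word that directly precedes an ltx_ref element into it.

    Returns the number of references that were widened.
    """
    changed = 0
    for parent in root.iter():
        previous = None
        for child in list(parent):
            if "ltx_ref" in child.get("class", "").split():
                # Text before `child` is stored on the parent (first child)
                # or as the tail of the previous sibling.
                before = parent.text if previous is None else previous.tail
                match = _TRAILING_WORD.search(before or "")
                if match and match.group(1).lower() in UNIT_WORDS:
                    kept = before[: match.start()]
                    if previous is None:
                        parent.text = kept
                    else:
                        previous.tail = kept
                    child.text = match.group(1) + match.group(2) + (child.text or "")
                    changed += 1
            previous = child
    return changed


doc = ET.fromstring(
    '<p>See Section <a class="ltx_ref" href="#S3">3.1</a>'
    ' and <a class="ltx_ref" href="#S4">4</a>.</p>'
)
absorb_unit_words(doc)
print(ET.tostring(doc, encoding="unicode"))
```

The second reference is left untouched because "and" is not in the dictionary; a unit word hidden inside styling markup would be missed as well, which matches the caveat raised earlier in the thread.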

@bfirsh
Contributor Author

bfirsh commented Mar 28, 2019

FWIW it’s also fine if you would rather consider this out of scope for LaTeXML. Seems like a good candidate for a plugin or an Engrafo post-processor.

@bfirsh
Contributor Author

bfirsh commented Mar 28, 2019

Also, to be clear: yes, the original intention was the latter. Just rewriting \ref to \autoref will produce double unit names in most cases, won’t it?

@dginev
Collaborator

dginev commented Mar 28, 2019

@brucemiller fair enough, I quickly put together a llamapun stats harvester, and will report with some curious data when it completes over the 08.2018 dataset. Usual caveat that it takes ~3 days to walk the corpus with the current components in place.

FWIW, it also feels out of scope for latexml to me, as it goes beyond the original TeX markup in a direction that may or may not be in alignment with the original intention of the author.

@dginev
Collaborator

dginev commented Mar 30, 2019

Here is the report from arXMLiv 08.2018, with an excerpt of the top 10 words:

word, frequency
figure, 3290488
theorem, 3052607
section, 2802295
lemma, 2408488
table, 1544961
proposition, 1334759
and, 1031640
corollary, 476062
appendix, 416964
definition, 317534

https://gist.github.com/dginev/c83d239524e1380f7b0e5e92a24a5eb2

Comments from a first look:

  • What is counted by llamapun - alphanumeric strings, at the end of a text node that is the previous sibling of a span[@class=ltx_ref] or a[@class=ltx_ref] element.
  • The top entries are consistent with what one would expect - a list of headings and logical blocks one tends to reference, as well as mathematical statements one tends to label+number (via e.g. amsmath \newtheorem).
  • Already in the top 10, one can see you can't reliably upcase any prior word - \ref{a} and \ref{b} being one example, where "and" precedes refs over a million times.
  • There are several other classes of \ref-ed objects:
    • names of scientists, in what I can only assume are names of notable theorems/objects, taking after their founders: laplacian, kaplansky theorem, etc. Some could be wrongly collected citations and hence considered noise for this purpose?
    • actual concepts, which I assume appear inside equations: inequality, watts, tensor
    • hits in French and Russian, which are also languages used in arXiv, though in a minority of papers
    • action verbs - appearing, combining, implies, satisfying, etc, which are probably mostly used with equation labels, and math statement labels.
    • conditional words - if, so, which I would assume lay out a proposition with referenced pieces.
    • numeric literals - not sure what these are about. Would guess conversion noise, but maybe they have a purpose?
    • typos - my favourite type of entry in the frequency 1 category, you could find both typos of construction - constrution and consruction
  • arXiv is a noisy corpus, which becomes evident as you scroll to the bottom entries. Some of the low-frequency hits are due to noise from conversion errors, or author mistakes, others are just strange (youtube appeared 7 times).
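
For readers who want to reproduce a count like this on a single document, the counting rule from the first bullet can be sketched in Python. This is not the llamapun harvester; the function name `words_before_refs` is made up, and matching `ltx_ref` as one of several class tokens (rather than as the exact class value) is an assumption.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# An alphanumeric string at the very end of a text node (trailing spaces allowed).
_TRAILING_ALNUM = re.compile(r"([^\W_]+)\s*$")


def words_before_refs(root):
    """Count the lowercased word that ends the text node before each reference."""
    counts = Counter()
    for parent in root.iter():
        before = parent.text  # text preceding the first child
        for child in parent:
            tag = child.tag.rsplit("}", 1)[-1]  # drop an XML namespace, if any
            is_ref = "ltx_ref" in child.get("class", "").split()
            if tag in ("a", "span") and is_ref:
                match = _TRAILING_ALNUM.search(before or "")
                if match:
                    counts[match.group(1).lower()] += 1
            before = child.tail  # text preceding the next sibling
    return counts


doc = ET.fromstring(
    '<p>In Figure <a class="ltx_ref">1</a> and <a class="ltx_ref">2</a>,'
    ' and figure <span class="ltx_ref">3</span>.</p>'
)
print(words_before_refs(doc).most_common())
```

Even this toy input shows the "and" problem: the conjunction between two references is counted alongside the genuine unit word.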

Ok, will leave things here for a first round of discussion. The data seems to convey that we could grab a shortlist of standardized words from the most frequent entries and auto-upcase them, but that we can't do that reliably for arbitrary words. And this sounds like a stylistic choice for the final presentation, so may be best implemented as a step following latexml, and left to the discretion of the final resource editor.
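
As an illustration of how such a shortlist might be derived, the sketch below filters the top-10 excerpt from the report. The cutoff of one million occurrences and the small set of excluded function words are arbitrary assumptions chosen for the example, and `shortlist` is a hypothetical helper name.

```python
# Top-10 excerpt from the arXMLiv 08.2018 report (lowercased word -> count).
TOP_WORDS = {
    "figure": 3290488,
    "theorem": 3052607,
    "section": 2802295,
    "lemma": 2408488,
    "table": 1544961,
    "proposition": 1334759,
    "and": 1031640,
    "corollary": 476062,
    "appendix": 416964,
    "definition": 317534,
}

# Assumed exclusions: frequent words before a reference that are not unit names.
FUNCTION_WORDS = {"and", "if", "so"}


def shortlist(frequencies, min_count=1_000_000, exclude=FUNCTION_WORDS):
    """Return words at or above min_count, most frequent first, minus exclusions."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [word for word, count in ranked if count >= min_count and word not in exclude]


print(shortlist(TOP_WORDS))
```

Lowering `min_count` admits entries such as "corollary" and "appendix"; where to draw the line is exactly the stylistic judgement left to the final resource editor.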

P.S. I lowercased all words when counting the frequencies, to make the final report smaller.

@dginev
Collaborator

dginev commented Apr 8, 2019

I got no feedback on my report, I assume everyone here is busy. That said, is everyone OK with me closing here and continuing this line of work as a post-latexml step, where needed?

@brucemiller
Owner

brucemiller commented Apr 8, 2019 via email

@brucemiller
Owner

As I said, I'm not against this being part of LaTeXML (optional or default, depending on how consistently it can behave), nor am I against it being a post processing operation. But I'm not likely to find time to do much on it till after a release. So, maybe I should punt to you guys whether we should close or defer to next milestone?

@dginev dginev modified the milestones: LaTeXML-0.8.4, LaTeXML-0.8.5 Apr 16, 2019
@brucemiller
Owner

I'm back to my original thinking that it's out of scope, in the sense of going a bit too far in changing the author's intent; but at the same time, I'd like LaTeXML to be the kind of tool that can enable that sort of reworking. But it's really more plugin or post-processing territory. Since the thread has gone inactive, I'll go ahead and close, but if more discussion on strategies is wanted, feel free to reopen.
