Making Unicode supported LaTeX template as the default #7

Open
thammegowda opened this issue Jan 4, 2022 · 12 comments

Comments

@thammegowda

thammegowda commented Jan 4, 2022

We have been using pdfLaTeX as the default compiler/engine, but as we know, it isn't Unicode (non-Latin) friendly.
Though the instructions suggest using XeLaTeX, the generated PDF looks different from pdfLaTeX's in many ways.
For example (left: pdfLaTeX, right: XeLaTeX): look at the nuances in the fonts; the section headings on the right aren't as bold as pdfLaTeX's on the left. I believe the font weight isn't exactly the same.

[Screenshot: side-by-side comparison of the compiled PDF, pdfLaTeX on the left and XeLaTeX on the right]

My request/suggestion:
Move towards a Unicode-supporting template as a way of encouraging NLP in non-Latin languages.
Researchers working on non-Latin languages should also be able to paste qualitative examples (without resorting to non-vector images), right? So, how about making a Unicode-supporting template (i.e., XeLaTeX) the default?

If anyone is interested in testing the Unicode support of LaTeX templates, here is a file with UDHR titles in hundreds of languages:
udhr-title.txt

Thanks,

@mbollmann
Member

mbollmann commented Feb 9, 2022

So the proceedings template contains these lines, which are really specific to pdfLaTeX and shouldn't be used with the newer engines:

\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

If I compile that on Overleaf, download the PDF, and check the fonts that are used with pdffonts, I get this:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
WFBODD+NimbusMonL-Regu               Type 1            Custom           yes yes yes     60  0
JKDAPA+NimbusRomNo9L-ReguItal        Type 1            Custom           yes yes yes     73  0
ARYBWX+NimbusRomNo9L-Medi            Type 1            Custom           yes yes yes     56  0
HPBZPC+NimbusRomNo9L-Regu            Type 1            Custom           yes yes yes     58  0
ALVSVR+NimbusSanL-Bold               Type 1            Custom           yes yes yes     59  0

So at least Overleaf uses the "Nimbus" fonts when including the "times" package. That makes me think that with XeLaTeX or LuaLaTeX, the above lines in the template should be replaced with:

\usepackage{fontspec}
\setmainfont{Nimbus Roman}
\setsansfont{Nimbus Sans}
\setmonofont{Nimbus Mono}

(EDIT: TeX Gyre Termes is probably better, since it's supposed to be the same but with more features.)

I think it would make sense to check in the .sty file which TeX engine is used, and modify the font-related commands accordingly. @davidweichiang Would it make sense if I tried to prepare a pull request for something like this?
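
The engine check proposed above could be sketched roughly as follows. This is only an illustration, not the actual *ACL style file: it assumes the `iftex` package is available, and the choice of TeX Gyre Heros and TeX Gyre Cursor as sans and mono counterparts to TeX Gyre Termes is an assumption that does not come from this thread.

```latex
% Hypothetical fragment for a .sty file; not the real *ACL style code.
\RequirePackage{iftex}
\iftutex
  % XeLaTeX or LuaLaTeX: load OpenType fonts through fontspec.
  \RequirePackage{fontspec}
  \setmainfont{TeX Gyre Termes}
  \setsansfont{TeX Gyre Heros}   % assumed sans counterpart
  \setmonofont{TeX Gyre Cursor}  % assumed mono counterpart
\else
  % pdfLaTeX: keep the existing font setup from the template.
  \RequirePackage{times}
  \RequirePackage{latexsym}
  \RequirePackage[T1]{fontenc}
  \RequirePackage[utf8]{inputenc}
\fi
```

Whether the two branches produce visually matching output would still need to be checked, for example with pdffonts and a side-by-side comparison.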

@davidweichiang
Collaborator

This all sounds great. But we need to set it up so that it looks the same either way.

@davidweichiang
Collaborator

Also related if any pub chairs are still using it: yz-joey/ACLPUB#7

@davidweichiang
Collaborator

XeLaTeX has a major disadvantage, which is that arXiv does not support it. So I don't think it can be made the default (yet). But I definitely agree with making it an option.

@thammegowda In your example, the one on the right is set in Computer Modern, not Times Roman. So something is wrong with the font setup.

@thammegowda
Author

The modifications I made to add some Unicode text were:

  1. Enable babel:
\usepackage[english]{babel} % English as the main language
\babelprovide[import]{hindi}
\babelprovide[import]{arabic}
\babelprovide[import]{kannada}
\babelfont[*devanagari]{rm}{Lohit Devanagari}
\babelfont[*arabic]{rm}{Noto Sans Arabic}
  2. Paste some Arabic and Hindi text:
Hindi: \foreignlanguage{hindi}{मानव अधिकारों की सार्वभौम घोषणा} Arabic: \foreignlanguage{arabic}{الإعلان العالمي لحقوق الإنسان}
  3. Switch the compiler to XeLaTeX, since pdfTeX could not compile it.
    Also, I had to comment out \pdfoutput=1 for XeLaTeX.
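
One possible way to avoid commenting out \pdfoutput=1 by hand is to guard it, since XeTeX does not define that primitive. The fragment below is a sketch under that assumption; nothing in this thread confirms that arXiv's automatic engine detection still recognises the guarded form.

```latex
% First lines of the main .tex file (sketch, not part of the template).
% \ifdefined is an e-TeX primitive, so it works in pdfTeX, XeTeX and LuaTeX.
\ifdefined\pdfoutput
  \pdfoutput=1   % executed only where \pdfoutput exists, e.g. pdfLaTeX
\fi
\documentclass[11pt]{article}
```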

I didn't explicitly modify the fonts for English/Latin. Is the babel import messing up the default fonts for English? Sorry, I am not a *TeX pro. Here is my Overleaf project for reference: https://www.overleaf.com/project/61d4c64cbc3e72789d2de4bc

@mbollmann
Member

Well, I would say arXiv has a major disadvantage in that it doesn't support XeLaTeX/LuaLaTeX, but I can see how we should make sure to support it ;)

@thammegowda The default font is Computer Modern; to get the correct font for the current *ACL template, both \usepackage{times} and \usepackage[T1]{fontenc} are important.

@thammegowda
Author

@mbollmann I agree, and I hope arXiv realizes this shortcoming and makes an update.

Also, I have these two lines

\usepackage{times}
\usepackage[T1]{fontenc}

I didn't remove these two, so is XeLaTeX still using Computer Modern? That's surprising!

@mbollmann
Member

@thammegowda Ah, maybe it is overwritten by something else in your preamble then. I can't access your Overleaf project, it's restricted. Try to move the "times" import further down maybe?

@thammegowda
Author

@mbollmann
I think the babel package is causing the issue. If I move times, fontenc, and microtype below babel, the Latin fonts look as intended, but Arabic and Hindi stop working (the text doesn't even appear).

\usepackage[english]{babel} % English as the main language
\babelprovide[import]{hindi}
\babelprovide[import]{arabic}
\babelprovide[import]{kannada}
\babelfont[*devanagari]{rm}{Lohit Devanagari}
\babelfont[*arabic]{rm}{Noto Sans Arabic}

\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage{microtype}

Here is an Overleaf link: https://www.overleaf.com/read/vbyhzmssdkkb (worked for me in private/incognito)
If we could share a working example with this text, it'd be very useful.

Hindi: मानव अधिकारों की सार्वभौम घोषणा
Arabic: الإعلان العالمي لحقوق الإنسان

@mbollmann
Member

@thammegowda Not an expert with Babel, but I think as soon as you use a \babelfont, you need to define an explicit Latin font as well. I haven't found a way to get the exact same font as LaTeX's ptm family (which is what "times" uses), but if you add

\babelfont{rm}{TeX Gyre Termes}

before you load the other, language-specific fonts, you get something virtually indistinguishable from it.
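
Putting the pieces from this thread together, a minimal test document might look like the sketch below. It assumes XeLaTeX as the engine, that the three fonts are installed, and the plain article class as a stand-in for the *ACL style. The pdfLaTeX-specific times and fontenc lines are left out, the kannada import is omitted because no Kannada text is used, and right-to-left layout options for babel are not covered.

```latex
% Compile with XeLaTeX, as in the thread.
\documentclass{article} % assumption: stand-in for the *ACL style

\usepackage[english]{babel} % English as the main language
\babelprovide[import]{hindi}
\babelprovide[import]{arabic}

% Explicit Latin font first, then the script-specific fonts.
\babelfont{rm}{TeX Gyre Termes}
\babelfont[*devanagari]{rm}{Lohit Devanagari}
\babelfont[*arabic]{rm}{Noto Sans Arabic}

\begin{document}
Hindi: \foreignlanguage{hindi}{मानव अधिकारों की सार्वभौम घोषणा}

Arabic: \foreignlanguage{arabic}{الإعلان العالمي لحقوق الإنسان}
\end{document}
```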

@thammegowda
Author

That works! Thanks.

@venkatasg

venkatasg commented Mar 21, 2023

I was just looking into whether there were efforts to move away from pdflatex to make the ACL style files more Unicode friendly; glad I found this issue thread. I have 2 suggestions, and can help with the migration in these respects:

  • As an engine, LuaLaTeX is probably better than XeLaTeX for the reasons discussed here.
  • Move away from the Times font to Charter, as provided by packages like XCharter. We have many good free fonts today, like Charter, so there's no need to still be using Times. Furthermore, Charis SIL is based on Charter and will play well with it. Supporting IPA and a wide variety of character sets natively ought to be a priority, especially considering the object of our study.

Further decisions probably need to be made about sans-serif and monospaced fonts, but none that can't be solved with some research.


4 participants