How to typeset UTF-8 characters without affecting ASCII characters?

Question

The context

I'm working on a homework assignment for my Databases course. I want to typeset the ⨝ character (the JOIN operation from relational algebra), and I also want the following behavior when reading the PDF: when the character is selected and copied, it must be copied as it is.

I've done my research, and so far I've come up with the following:

lualatex main

\documentclass{article}
\usepackage{fontspec}
\usepackage{unicode-math}
\setmathfont{XITS Math}
\begin{document}
$A ⨝ B$
\end{document}

When copying the ⨝ character in all the PDF viewers I've tried (zathura, okular and firefox), the character is copied as it is. I thought I had accomplished my goal. However, a new problem arises.

The problem

The problem is that ASCII characters are not copied as ASCII characters in some PDF viewers. Okular is the only PDF viewer that copies A and B as ASCII characters (see below).

Using Firefox, the line is copied as

firefox --version

Mozilla Firefox 88.0.1

𝐴⨝𝐵

Using Okular, the line is copied as

okular --version

okular 21.04.0

A⨝B

Using Zathura, the line is copied as

zathura --version

zathura 0.4.7
girara 0.3.5 (runtime: 0.3.5)
(plugin) djvu (0.2.9) (/usr/lib/zathura/libdjvu.so)
(plugin) pdf-mupdf (0.3.6) (/usr/lib/zathura/libpdf-mupdf.so)

𝐴 ⨝ 𝐵

The question

Is there any way to create a document that meets the following conditions?

Typeset the JOIN character such that when copying it in a PDF viewer, the ⨝ is inserted into the clipboard.
Typeset the ASCII characters such that when copying them in a PDF viewer, ASCII characters are inserted into the clipboard.

In simpler words: is there a package that ensures that, when text is selected and copied from the generated PDF, the clipboard receives the same characters that were typed in the source?

Here, "PDF viewer" means any of the following: Okular, Zathura, or Firefox's built-in PDF viewer. I'm making this explicit because I know there are many bad PDF viewers out there that would behave differently in the scenario presented here.

Additional context

Behavior of `pdfgrep` and `pdftotext`

pdfgrep and pdftotext also interpret the ASCII characters of the PDF as non-ASCII characters.

pdfgrep '' main.pdf

𝐴⨝𝐵 1

pdftotext main.pdf
cat main.txt

𝐴⨝𝐵 1
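To confirm that an extracted letter really is not ASCII, even though it looks like one, the bytes can be counted directly. The following is only a minimal sketch: `non_ascii_bytes` is a helper name I made up for illustration, and it assumes the text is UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: count the bytes on standard input that fall
# outside 7-bit ASCII. LC_ALL=C makes tr operate on raw bytes rather
# than on multibyte characters.
non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# ASCII "A" (U+0041) is a single ASCII byte, so this prints 0.
printf 'A' | non_ascii_bytes

# "𝐴" (U+1D434) is the four UTF-8 bytes F0 9D 90 B4, so this prints 4.
printf '\360\235\220\264' | non_ascii_bytes
```

The same function could be fed the extracted text with `pdftotext main.pdf - | non_ascii_bytes`. Note that the ⨝ character itself contributes three non-ASCII bytes, so a non-zero count only signals a problem when it is larger than the count expected from ⨝ alone.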

Trying every font in my TeXLive distribution

I thought that this problem was caused by the font specified in \setmathfont. For this reason, I created the following script, which generates a PDF for each OTF font in the default TeXLive installation.

\documentclass{article}
\usepackage{fontspec}
\usepackage{unicode-math}
\setmathfont{...}
\begin{document}
foo $A ⨝ B$ bar
\end{document}

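found="$(locate "/usr/local/*.otf")"
total="$(echo "$found" | wc -l)"
counter=1
# Relies on word splitting, so font paths must not contain spaces.
for file in $found
do
    echo "Trying $file ($counter/$total)"
    echo "Trying $file ($counter/$total)" >> lualatex.log
    font=$(basename "$file")
    # Substitute the font file name for the "..." placeholder in main.tex.
    sed -i "s/\.\.\./$font/g" main.tex
    # Redirect stdout first and then stderr, so that both end up in the log.
    lualatex -interaction nonstopmode main >> lualatex.log 2>&1
    exit_code=$?
    # Restore the placeholder for the next iteration.
    sed -i "s/$font/\.\.\./g" main.tex
    rm -f main.aux
    if [ "$exit_code" = 0 ]
    then
        mv main.pdf "$font.pdf"
        pdftotext "$font.pdf"
    fi
    counter=$((counter + 1))
done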

The script took more than 30 minutes to finish and this is what I found. Of the 1702 *.otf fonts, the following fonts are the only ones that can typeset the ⨝ character.

grep -l -R --include="*.txt" '⨝' $my__experiments | sort

/home/beep1560/e/Asana-Math.otf.txt
/home/beep1560/e/Erewhon-Math.otf.txt
/home/beep1560/e/GFSNeohellenicMath.otf.txt
/home/beep1560/e/KpMath-Bold.otf.txt
/home/beep1560/e/KpMath-Light.otf.txt
/home/beep1560/e/KpMath-Regular.otf.txt
/home/beep1560/e/KpMath-Sans.otf.txt
/home/beep1560/e/KpMath-Semibold.otf.txt
/home/beep1560/e/NewCMMath-Book.otf.txt
/home/beep1560/e/NewCMMath-Regular.otf.txt
/home/beep1560/e/STIX2Math.otf.txt
/home/beep1560/e/STIX-Regular.otf.txt
/home/beep1560/e/XITSMath-Regular.otf.txt

Apparently, no font typesets ASCII characters as ASCII characters, because the following search finds no matching files:

grep -l -R --include="*.txt" 'A.*⨝' $my__experiments | wc -l

So I think this is enough to conclude that the problem cannot be solved by switching to another font.

I think the issue here is that what looks like ASCII characters are in fact not. When you typeset with lualatex, your typical ‘A’ in a formula is rendered as 𝐴, which is in fact the character U+1D434 MATHEMATICAL ITALIC CAPITAL A. Clearly, this is intentionally so. Whether there is a way to get what you want, I do not know. Here, more knowledgeable folks must pitch in.
– Harald Hanche-Olsen, Commented May 15, 2021 at 6:52
@HaraldHanche-Olsen thanks for answering. I've already found a solution (see the first answer).
– Rodrigo Morales, Commented May 15, 2021 at 6:58

Rodrigo Morales · Accepted Answer · 2021-05-15 06:55:07Z

You can get what you are looking for by setting the math-style option to upright:

\documentclass{article}
\usepackage{fontspec}
\usepackage{unicode-math}
\setmathfont[math-style=upright]{XITS Math}
\begin{document}
$A ⨝ B$ \\
\end{document}

I tested this solution in the same versions of the software that you mentioned (because we are the same person).

I found this by searching for "ASCII" in the official unicode-math documentation. Next time, make sure that you look at the official documentation first (in this scenario, $ texdoc unicode-math) and search for keywords (in this scenario, ASCII).
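For anyone who wants to repeat the check without opening a PDF viewer, here is a rough sketch of how the extracted text could be tested from the shell. `copies_as_typed` is a hypothetical helper name (not part of any package), and the pattern allows optional spaces because the viewers above disagree on spacing.

```shell
#!/bin/sh
# Hypothetical helper: read extracted PDF text on standard input and
# succeed only if some line contains ASCII "A", then the JOIN sign
# (U+2A1D), then ASCII "B", with optional spaces in between.
copies_as_typed() {
    grep -q 'A *⨝ *B'
}

# Text with plain ASCII letters passes the check.
printf 'A⨝B\n' | copies_as_typed && echo "ASCII letters preserved"

# Text with mathematical italic letters (U+1D434, U+1D435) fails it.
printf '𝐴⨝𝐵\n' | copies_as_typed || echo "letters were replaced"
```

With a compiled document, the check would be run as `pdftotext main.pdf - | copies_as_typed`, where `-` tells pdftotext to write to standard output. This only exercises pdftotext's extraction, so it complements the manual copy test in each viewer rather than replacing it.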
