My goal is to understand the relationship between character bytes and character tokens with respect to new line bytes. I likely do not have my facts straight.
When TeX reads a file of bytes, the encoding must be considered. Putting that aside, we can observe that a single line ending (LF or CRLF) is converted into a space. But what happens behind the scenes? Is a token created with the data pair (LF character code, catcode 10)?
Do two consecutive line endings become one single token with the data pair (space character code, catcode 5)?
When does catcode 5 "end of the line" come into play?
I know that a \par token is inserted when two consecutive line endings (that is, an empty line) are encountered.
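One way to observe what the line endings turn into, without relying on typeset output, is to store them in a macro and print its \meaning to the log. The following is a minimal sketch assuming the default LaTeX catcodes; the macro names \oneEOL and \twoEOL are made up for this illustration.

```latex
\documentclass{article}
% \oneEOL and \twoEOL are hypothetical names used only for this test.
% One line ending inside the replacement text:
\def\oneEOL{x
y}
% Two consecutive line endings (an empty line) inside the replacement text:
\def\twoEOL{x

y}
\begin{document}
\typeout{\meaning\oneEOL}% log should show: macro:->x y
\typeout{\meaning\twoEOL}% log should show: macro:->x \par y
\end{document}
```

If this reading is right, a single line ending is stored as an ordinary space token and an empty line as a \par token, and neither log line shows a character that carries catcode 5.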
Code
I attempted to visually show tokens with catcode 5, but I am still not sure if \tmp truly becomes catcode 5.
    \documentclass{article}
    \usepackage{fontspec}% xelatex
    \long\def\scan#1{#1\par\rule{\textwidth}{2pt}\par\xscan#1\relax}
    \long\def\xscan{\afterassignment\xxscan\let\tmp= }
    \long\def\xxscan{%
      \ifx\tmp\relax\else%
        \ifcat\tmp\space10 \else%
        \ifcat\tmp a11 \else%
        \ifcat\tmp 112 \else%...
        \ifcat\tmp 5 \else%
        \fi\fi\fi\fi
        \expandafter\xscan
      \fi}
    \begin{document}
    \scan{
    mac::exception == a
    }
    \end{document}
Notes
- XeLaTeX, being UTF-8 capable, presumably knows how to read two-byte (CRLF) line endings.
- Code modified from: How can I make LaTeX to recognize spaces in my macro (catcode 10)?
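
Regarding the first note: as far as I understand, the engine strips the platform's line terminator (LF or CRLF) when it reads a line, and TeX then appends the character whose code is stored in \endlinechar. The sketch below prints the relevant defaults to the log and shows that disabling \endlinechar makes the line ending disappear altogether. It assumes a default LaTeX setup; \noEOL is a hypothetical name.

```latex
\documentclass{article}
\begin{document}
\typeout{endlinechar = \the\endlinechar}% default is 13
\typeout{catcode of char 13 = \the\catcode`\^^M}% default is 5
\begingroup
\endlinechar=-1 % no character is appended to the lines read from here on
\gdef\noEOL{x
y}% \noEOL is a hypothetical name
\endgroup
\typeout{\meaning\noEOL}% log should show: macro:->xy
\end{document}
```

If this is correct, the raw LF or CR bytes never reach the tokenizer; what it sees at the end of each line is character 13, whose catcode is 5.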
