
Addendum

edited Jan 20, 2017 at 11:34


Every character has a category code; a character can generate a character token during the tokenization phase, but it need not do so.

Category codes serve two purposes: they are looked at during tokenization (when TeX absorbs text from an input file or the terminal), and also during token list processing.

Only characters with category code 1, 2, 3, 4, 6, 7, 8, 10, 11, 12 or 13 can generate character tokens (with the same category code). These codes mean, respectively:

1 begin group
2 end group
3 math shift
4 alignment
6 parameter
7 superscript
8 subscript
10 space
11 letter
12 other character
13 active character
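A character token keeps the category code it received when it was tokenized, even if the character's category code is changed afterwards. The following is a minimal plain TeX sketch of this; the macro name \at is only illustrative and not part of the original answer.

```tex
% Plain TeX sketch; \at is an illustrative name.
\catcode`\@=11
\def\at{@}                          % this @ is tokenized as a letter
\catcode`\@=12
\message{\expandafter\meaning\at}   % writes: the letter @
\message{\meaning @}                % writes: the character @
\message{\meaning $}                % writes: math shift character $
\bye
```

The first two messages differ only because the @ stored in \at was read while its category code was 11.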

Characters with category code 0, 5, 9, 14 or 15 never generate a character token with the same category code; no character token with one of those category codes can get through to TeX's internal token processor:

a character with category code 0 triggers the formation of a control sequence
a character with category code 9 is ignored
a character with category code 15 raises an error and then it is ignored
a character with category code 14 tells the tokenization processor to ignore it together with all other characters on the line
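These rules can be tried out by assigning the category codes to other characters. This is only a sketch in plain TeX; the characters |, ! and / are arbitrary choices made for illustration.

```tex
% Plain TeX sketch; the characters |, ! and / are arbitrary choices.
\catcode`\|=9        % | is now an ignored character
\message{[a|b|c]}    % writes: [abc]
\catcode`\!=14       % ! now starts a comment
\message{[x]}! everything from the ! to the end of this line is discarded
\catcode`\/=0        % / now starts a control sequence, like \
/message{[y]}        % writes: [y]
\bye
```

Assigning category code 15 to a character and then typing it would instead produce an error message about an invalid character.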

More interesting is category code 5, which is the object of your question. When TeX finds such a character, it discards whatever remains on the input line and generates a space character, with character code 32 and category code 10, as if it had been on the line to begin with. It then sets the scanner in the special state of ignoring blank spaces (category code 10) until something different comes along: if this is another character of category code 5, TeX generates a \par token; otherwise it enters the normal state.

Note that what is generated is a space character, not a space token: this space character is tokenized according to the normal rules, so it is ignored if it follows a control word (like \foo) but not after a control symbol (like \~).

A consequence of this is that the following inputs

\foo\baz

\foo
\baz

are completely equivalent. If the end-of-line in the second input generated a space token, there would be a difference; what is generated instead is a space character (not yet tokenized), which is then skipped because it follows the control word \foo.
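The equivalence can be checked by comparing box widths. The sketch below assumes plain TeX; the definitions of \foo and \baz, the control symbol \1 and the helper \test are hypothetical and chosen only for this illustration.

```tex
% Plain TeX sketch; \foo, \baz, \1 and \test are illustrative names.
\def\foo{A}\def\baz{B}\def\1{A}

% End-of-line after a control word: no space reaches the box.
\setbox0=\hbox{\foo\baz}
\setbox2=\hbox{\foo
\baz}
\message{\ifdim\wd0=\wd2 same\else different\fi}  % writes: same

% End-of-line after a control symbol: a space token is produced.
\setbox4=\hbox{\1\baz}
\setbox6=\hbox{\1
\baz}
\message{\ifdim\wd6>\wd4 wider\else same\fi}      % writes: wider

% An empty line yields a \par token.
\long\def\test#1{\message{\meaning#1}}
\test  % the blank line below becomes the argument, so this writes: \par

\bye
```

The helper \test is declared \long because its argument is the \par token itself.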

Note. What was said above about ignored characters might be misleading where control word formation is concerned. A control word starts with a category code 0 character followed by one of category code 11. Any character with a category code different from 11 stops the scanning, causes the control word formed so far to be tokenized, and is then examined anew (to be ignored, for instance, if it has category code 9).
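A small plain TeX sketch of this point follows; the ignored character | and the macros \foo and \foobar are assumptions made only for the example.

```tex
% Plain TeX sketch; | and the two macros are illustrative choices.
\catcode`\|=9            % | is now an ignored character
\def\foo{[foo]}
\def\foobar{[foobar]}
\message{\foo|bar}       % writes: [foo]bar
\bye
```

Although the | disappears, it still ends the control word, so TeX sees \foo followed by the letters b, a, r and never \foobar.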

Addendum about XeTeX and LuaTeX. When a UTF-8 encoded file is fed to the Unicode-aware engines, it's immaterial whether a character is one, two, three or four bytes long in its UTF-8 representation. These two engines perform a preliminary step that transforms UTF-8 byte sequences into Unicode characters, so what the tokenization processor sees is just one character (with its category code as assigned in the initialization table). The two engines are also able to cope with UTF-16 or UTF-32, little or big endian.
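As a rough illustration, assuming the file is saved as UTF-8 and processed with xetex or luatex using the plain format, a non-ASCII character can be given category code 11 and used in a control word like any other letter. The macro name \café is hypothetical.

```tex
% Sketch for xetex or luatex (plain format), file saved as UTF-8.
\catcode`é=11        % set explicitly, in case the format has not done so
\def\café{coffee}    % é is a single character, so this is one control word
\message{\café}      % writes: coffee
\bye
```

With an 8-bit engine the same file would be read byte by byte, so this sketch is not expected to work there.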

answered Jan 20, 2017 at 10:44

egreg
