
I am troubled by the way LuaTeX and XeLaTeX normalize Unicode composed characters, i.e. NFC/NFD.

See the following MWE:

\documentclass{article}
\usepackage{fontspec}
\setmainfont{Linux Libertine O}
\begin{document}
ᾳ GREEK SMALL LETTER ALPHA (U+03B1) + COMBINING GREEK YPOGEGRAMMENI (U+0345)

ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI (U+1FB3)
\end{document}

With LuaLaTeX I obtain:

[screenshot: example compiled with LuaLaTeX]

As you can see, LuaTeX does not normalize the Unicode input, and since Linux Libertine has a bug (http://sourceforge.net/p/linuxlibertine/bugs/266/), I get a bad character.

With XeLaTeX, I obtain:

[screenshot: example compiled with XeLaTeX]

As you can see, the input is normalized.

My three questions are:

  1. Why does XeLaTeX normalize (to NFC), even though I have not used \XeTeXinputnormalization?
  2. Did this feature change at some point? My earlier tests, with TeX Live 2012, gave a bad result (see the article I wrote at the time: http://geekographie.maieul.net/Normalisation-des-caracteres).
  3. Does LuaTeX have an option like \XeTeXinputnormalization in XeTeX?
  • Imho HarfBuzz (used by XeTeX since version XY) does an additional normalization, so the value of \XeTeXinputnormalization doesn't matter much. Commented Feb 19, 2015 at 15:56
  • The luatex manual (texdoc lua) section 2.3 says you can do Unicode normalisation in the file reader callback, but doesn't give an example. I have a feeling I've seen one somewhere, either an answer here, or the context sources or.... Commented Feb 19, 2015 at 16:56
  • yes, I have looked for NFC/NFD and found nothing... Commented Feb 19, 2015 at 17:00
  • indeed, HarfBuzz normalizes under some conditions (cgit.freedesktop.org/harfbuzz/tree/src/…), and XeTeX has used it since version 1.99. See khaledhosny.org/node/198 Commented Feb 19, 2015 at 18:21
  • ConTeXt does normalisation in Lua (but doesn't call it that); I also wrote code for that independently, as part of my Google Summer of Code project a long time ago (code.google.com/p/google-summer-of-code-2008-tex/downloads/list), but it's never been used anywhere that I know of. Commented Feb 19, 2015 at 19:10

1 Answer


I don't know the answer to the first two questions, as I don't use XeTeX, but I want to provide an option for the third one.

Thanks to Arthur's code I was able to create a basic package for Unicode normalization in LuaLaTeX. The code needed only slight modifications to work with current LuaTeX. I will post only the main Lua file here; the full project is available on GitHub as uninormalize.

Sample usage:

\documentclass{article}
\usepackage{fontspec}
\usepackage[czech]{babel}
\setmainfont{Linux Libertine O}
\usepackage[nodes,buffer=false, debug]{uninormalize}
\begin{document}
Some tests:
\begin{itemize}
  \item combined letter ᾳ %GREEK SMALL LETTER ALPHA (U+03B1) + COMBINING GREEK YPOGEGRAMMENI (U+0345)
  \item normal letter ᾳ% GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI (U+1FB3)
\end{itemize}
Some more combined and normal letters: óóōōöö

Linux Libertine does support some combined chars: \parbox{4em}{příliš}
\end{document}

(Note that the correct version of this file is on GitHub; the combined letters were transferred incorrectly in this example.)

The main idea of the package is the following: process the input, and when a letter followed by combining marks is found, replace the sequence with its normalized NFC form. Two methods are provided. My first approach was to use node-processing callbacks to replace decomposed glyphs with normalized characters. This would have the advantage that the processing could be switched on and off anywhere in the document, using node attributes. Another possible feature would be to check whether the current font contains the normalized character, and to keep the original form if it doesn't. Unfortunately, in my tests this fails with some characters: notably, a composed í ends up in the nodes as dotless ı + ´ instead of i + ´, which after normalization doesn't produce the correct character, so the original sequence is kept instead. But this produces output with bad placement of the accent. So this method needs either some correction, or it is totally wrong.
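For illustration, here is a minimal sketch of how the node variant can be hooked in. This is only my assumption about the wiring, not the actual package code; the attribute name is made up, and M.nodes refers to the module shown further below:

-- Sketch only: register the node normalizer in a node callback and
-- reserve a (hypothetical) attribute that could be used to switch
-- normalization off for parts of the document.
local uninormalize = require "uninormalize" -- the Lua module below
local off_attr = luatexbase.new_attribute("uninormalizeoff")

local function process(head)
  -- a fuller version could walk the glyph nodes and skip those
  -- carrying off_attr; this sketch hands the whole list over
  return uninormalize.nodes(head)
end

luatexbase.add_to_callback("pre_linebreak_filter", process, "uninormalize.nodes")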

So the other method is to use the process_input_buffer callback to normalize the input file as it is read from disk. This method doesn't allow using information from the fonts, nor does it allow turning normalization off in the middle of a line, but it is significantly easier to implement. The callback function may look like this:

function buffer_callback(line)
  return NFC(line)
end

which is a really nice finding after three days spent on the node-processing version.
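To activate it, the function just has to be registered in the process_input_buffer callback; with luatexbase the registration might look like this (the label string is arbitrary):

luatexbase.add_to_callback("process_input_buffer", buffer_callback, "uninormalize.buffer")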

For the curious, this is the Lua package:

local M = {}
dofile("unicode-names.lua")
dofile('unicode-normalization.lua')
local NFC = unicode.conformance.toNFC
local char = unicode.utf8.char
local gmatch = unicode.utf8.gmatch
local name = unicode.conformance.name
local byte = unicode.utf8.byte
local unidata = characters.data
local length = unicode.utf8.len

M.debug = false

-- for some reason variable number of arguments doesn't work
local function debug_msg(a,b,c,d,e,f,g,h,i)
  if M.debug then
    local t = {a,b,c,d,e,f,g,h,i}
    print("[uninormalize]", unpack(t))
  end
end

local function make_hash (t)
  local y = {}
  for _,v in ipairs(t) do
    y[v] = true
  end
  return y
end

local letter_categories = make_hash {"lu","ll","lt","lo","lm"}
local mark_categories = make_hash {"mn","mc","me"}

local function printchars(s)
  local t = {}
  for x in gmatch(s,".") do
    t[#t+1] = name(byte(x))
  end
  debug_msg("characters",table.concat(t,":"))
end

local categories = {}
local function get_category(charcode)
  local charcode = charcode or ""
  if categories[charcode] then
    return categories[charcode]
  else
    local unidatacode = unidata[charcode] or {}
    local category = unidatacode.category
    categories[charcode] = category
    return category
  end
end

-- get glyph char and category
local function glyph_info(n)
  local char = n.char
  return char, get_category(char)
end

local function get_mark(n)
  if n.id == 37 then
    local character, cat = glyph_info(n)
    if mark_categories[cat] then
      return char(character)
    end
  end
  return false
end

local function make_glyphs(head, nextn, s, lang, font, subtype)
  local g = function(a)
    local new_n = node.new(37, subtype)
    new_n.lang = lang
    new_n.font = font
    new_n.char = byte(a)
    return new_n
  end
  if length(s) == 1 then
    return node.insert_before(head, nextn, g(s))
  else
    local t = {}
    local first = true
    for x in gmatch(s,".") do
      debug_msg("multi letter",x)
      head, newn = node.insert_before(head, nextn, g(x))
    end
    return head
  end
end

local function normalize_marks(head, n)
  local lang, font, subtype = n.lang, n.font, n.subtype
  local text = {}
  text[#text+1] = char(n.char)
  local head, nextn = node.remove(head, n)
  --local nextn = n.next
  local info = get_mark(nextn)
  while(info) do
    text[#text+1] = info
    head, nextn = node.remove(head,nextn)
    info = get_mark(nextn)
  end
  local s = NFC(table.concat(text))
  debug_msg("We've got mark: " .. s)
  local new_n = node.new(37, subtype)
  new_n.lang = lang
  new_n.font = font
  new_n.char = byte(s)
  --head, new_n = node.insert_before(head, nextn, new_n)
  --head, new_n = node.insert_before(head, nextn, make_glyphs(s, lang, font, subtype))
  head, new_n = make_glyphs(head, nextn, s, lang, font, subtype)
  local t = {}
  for x in node.traverse_id(37,head) do
    t[#t+1] = char(x.char)
  end
  debug_msg("Variables ", table.concat(t,":"), table.concat(text,";"), char(byte(s)), length(s))
  return head, nextn
end

local function normalize_glyphs(head, n)
  --local charcode = n.char
  --local category = get_category(charcode)
  local charcode, category = glyph_info(n)
  if letter_categories[category] then
    local nextn = n.next
    -- guard against the end of the node list
    if nextn and nextn.id == 37 then
      --local nextchar = nextn.char
      --local nextcat = get_category(nextchar)
      local nextchar, nextcat = glyph_info(nextn)
      if mark_categories[nextcat] then
        return normalize_marks(head,n)
      end
    end
  end
  return head, n.next
end

function M.nodes(head)
  local t = {}
  local text = false
  local n = head
  -- for n in node.traverse(head) do
  while n do
    if n.id == 37 then
      local charcode = n.char
      debug_msg("unicode name",name(charcode))
      debug_msg("character category",get_category(charcode))
      t[#t+1]= char(charcode)
      text = true
      head, n = normalize_glyphs(head, n)
    else
      if text then
        local s = table.concat(t)
        debug_msg("text chunk",s)
        --printchars(NFC(s))
        debug_msg("----------")
      end
      text = false
      t = {}
      n = n.next
    end
  end
  return head
end

--[[
-- These functions aren't needed when processing the buffer.
-- We can call NFC on the whole input line.
local unibytes = {}

local function get_charcategory(s)
  local s = s or ""
  local b = unibytes[s] or byte(s) or 0
  unibytes[s] = b
  return get_category(b)
end

local function normalize_charmarks(t,i)
  local c = {t[i]}
  local i = i + 1
  local s = get_charcategory(t[i])
  while mark_categories[s] do
    c[#c+1] = t[i]
    i = i + 1
    s = get_charcategory(t[i])
  end
  return NFC(table.concat(c)), i
end

local function normalize_char(t,i)
  local ch = t[i]
  local c = get_charcategory(ch)
  if letter_categories[c] then
    local nextc = get_charcategory(t[i+1])
    if mark_categories[nextc] then
      return normalize_charmarks(t,i)
    end
  end
  return ch, i+1
end
-- ]]

function M.buffer(line)
  --[[
  local t = {}
  local new_t = {}
  -- we need to make a table with all uni chars on the line
  for x in gmatch(line,".") do
    t[#t+1] = x
  end
  local i = 1
  -- normalize next char
  local c, i = normalize_char(t, i)
  new_t[#new_t+1] = c
  while t[i] do
    c, i = normalize_char(t,i)
    -- local c = t[i]
    -- i = i + 1
    new_t[#new_t+1] = c
  end
  return table.concat(new_t)
  --]]
  return NFC(line)
end

return M
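As a sketch of how the two variants might be selected, the loader could register one callback or the other depending on the package options. This is an assumption about the .sty wiring, and the real uninormalize.sty on GitHub may differ; it assumes the module above is saved as uninormalize.lua:

-- use_nodes would correspond to the [nodes]/[buffer] package options
local normalize = require "uninormalize"
local use_nodes = false

if use_nodes then
  luatexbase.add_to_callback("pre_linebreak_filter",
                             normalize.nodes, "uninormalize.nodes")
else
  luatexbase.add_to_callback("process_input_buffer",
                             normalize.buffer, "uninormalize.buffer")
end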

And now it is time for some pictures.

without normalization:

[screenshot: output without normalization]

You can see that the composed Greek character is wrong; the other combinations are supported by Linux Libertine.

with node normalization:

[screenshot: output with node normalization]

The Greek letters are correct, but the í in the first příliš is wrong. This is the issue I was talking about.

and now the buffer normalization:

[screenshot: output with buffer normalization]

Everything is all right now.

  • that is very nice! The bug with the í is very strange, as without normalization we get the good character. However, the second method (buffer normalization) is enough, isn't it? Or do you have a good reason to prefer the first? Would one want to change it in the middle of the document? I think a font without the normalized forms is not good for a specific language ;-) Commented Feb 23, 2015 at 9:00
  • Thanks a lot for this work. Do you think it could be published (after more tests!) on CTAN? Commented Feb 23, 2015 at 9:00
  • @Maïeul the í bug probably depends on the used font; when I switch to Latin Modern, it is correct. The first method would be nice because it has greater potential for context-dependent modifications, but maybe it doesn't matter. Maybe I like it more because I spent three days on it, and in the end the better result comes from one three-line function :D Commented Feb 23, 2015 at 10:00
  • @Maïeul it should go to CTAN, I think, but I am really bad at releasing my packages on CTAN Commented Feb 23, 2015 at 10:01
  • Maybe it's like the bug with "Ezra" SIL: there is also character conversion inside the font. I know, for example, that Linux Libertine's old-style digits are not in the normal Unicode position, and Linux Libertine converts them. Commented Feb 23, 2015 at 10:03
