Here's a minimal example:
    \documentclass{minimal}
    \usepackage{fontspec}
    \setmainfont{Cambria}
    \begin{document}
    aöz
    \end{document}

Compiling this into a PDF using lualatex and extracting the text using pdftotext, I get the string aö z, that is:
    U+0061(a) U+006f(o) U+0308(combining diaeresis) U+0020(space) U+007a(z)

Two problems with this: there's an unnecessary space (due to some PDF formatting trickery, the uncompressed datastream shows Tm[<…>125<01C5>-124<…>]TJ), and I don't want the umlaut to be split into base and combining character because for some reason that renders weirdly when changing the font size. I want the output to be
    U+0061(a) U+00f6(ö) U+007a(z)

And the worst thing: with \setmainfont{Lucida Grande}, I get exactly that. Just not with Cambria.
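For reference, code-point listings like the ones above can be produced from the pdftotext output with a few lines of Python. This is only a sketch: the helper name `codepoints` is made up for illustration, and the labels in parentheses are Python's `unicodedata` names rather than the hand-written ones used in this post.

```python
import unicodedata

def codepoints(text):
    """Return a 'U+xxxx(name)' listing for every character in text."""
    parts = []
    for ch in text:
        # Fall back to a placeholder for characters without a Unicode name.
        name = unicodedata.name(ch, "unnamed").lower()
        parts.append("U+%04x(%s)" % (ord(ch), name))
    return " ".join(parts)

# The string extracted from the Cambria PDF, written with explicit escapes
# so that the combining diaeresis and the stray space are visible.
print(codepoints("ao\u0308 z"))
```

Writing the test string with `\u` escapes avoids any doubt about whether the editor or terminal has silently recomposed the umlaut.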
Both fonts are in TTF format. Checking them in fontforge shows that in both, the U+00f6 glyph is defined as a composite of U+006f and U+0308. The only difference is that Cambria sets the glyph's OTF Class to “Base Glyph”, while in Lucida Grande it is “Automatic” (no idea what that means).
It is a fontspec-specific problem:
    \documentclass{minimal}
    \usepackage{luaotfload}
    \font\foo={name:Cambria} at 10pt
    \begin{document}
    aöz \foo aöz
    \end{document}

generates what I expect (the first umlaut is missing, also as expected):
    U+0061(a) U+007a(z) U+0020(space) U+0061(a) U+00f6(ö) U+007a(z)

but
    \documentclass{minimal}
    \usepackage{fontspec}
    \begin{document}
    aöz \fontspec{Cambria} aöz
    \end{document}

generates
    U+0061(a) U+00f6(ö) U+007a(z) U+0020 U+0061(a) U+006f(o) U+0308 U+0020 U+007a(z)

The same happens when using \DeclareUTFcharacter{x00F6}{\foo} and replacing ö with \foo{}, so I guess it's not xunicode's fault?
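To separate the two symptoms, the extracted text can be normalized after the fact. The following Python sketch assumes the extracted string is exactly the Cambria one listed earlier; the variable names are only illustrative. It is a check on the pdftotext output, not a fix for the PDF itself.

```python
import unicodedata

extracted = "ao\u0308 z"   # what pdftotext returns for the Cambria PDF
wanted = "a\u00f6z"        # what I get with Lucida Grande

# NFC composes base letter + combining mark into the precomposed character.
recomposed = unicodedata.normalize("NFC", extracted)

# The decomposed umlaut is merged back into U+00f6 ...
print("\u00f6" in recomposed)
# ... but the extra space survives, so the strings still differ.
print(recomposed == wanted)
```

So normalization would only paper over the decomposition; the unwanted space comes from the glyph positioning in the PDF and would still have to be avoided at the font-loading level.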