Intl.Collator sorting in Node, Chrome, Webkit, Firefox and Postgres Collation differences

Question

I am comparing the sort order produced by Intl.Collator in three browsers (Chrome, Webkit, Firefox) and in Node with the order produced by a Postgres collation, and found that most of the implementations differ considerably from each other. For the locale 'en', the only two implementations that match each other are Chrome and Firefox. For example, the letters from MATHEMATICAL SANS-SERIF ITALIC SMALL A to Z are sorted together with the letters a-z by Intl.Collator with a locale such as 'en' in Chrome and Firefox, but not in Postgres. I had assumed that all implementations are based on the same CLDR data. The number of positions at which the sorted results of each pair of implementations differ (tested with Playwright) is:
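The difference described above can be checked on a handful of characters before sorting the whole code space. The following is only a minimal sketch and not part of the original test: `sortWith` and the sample array are hypothetical names, and the result depends on the ICU/CLDR data shipped with the runtime, so no expected output is given.

```typescript
// U+1D622 is MATHEMATICAL SANS-SERIF ITALIC SMALL A, U+1D623 is the matching SMALL B.
const sansItalicA = String.fromCodePoint(0x1d622);
const sansItalicB = String.fromCodePoint(0x1d623);

// Hypothetical helper: sorts a copy of `items` with Intl.Collator for `locale`.
function sortWith(locale: string, items: string[]): string[] {
  const collator = new Intl.Collator(locale);
  return items.slice().sort(collator.compare);
}

const sample = ['b', sansItalicB, 'z', sansItalicA, 'a'];
const sorted = sortWith('en', sample);

// Print code points so the order is readable even without font support.
console.log(sorted.map((c) => 'U+' + (c.codePointAt(0) as number).toString(16).toUpperCase()));
```

Running the same snippet in Node and in each browser console shows whether the mathematical letters are interleaved with a-z or grouped elsewhere.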

    Chrome - Node       1474
    Chrome - Webkit     34727
    Chrome - Firefox    0
    Chrome - Postgres   25781
    Webkit - Node       34727
    Webkit - Postgres   34727
    Node   - Postgres   34892

A more predictable sort order would make it possible to sort consistently in the frontend and in the backend, regardless of where the sorting happens.
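For reference, one order that does not depend on ICU or CLDR at all is plain code point order, at the cost of not being linguistically meaningful. The sketch below is an illustration only; `compareByCodePoint` is a hypothetical helper. It assumes a UTF-8 encoded Postgres database, where the "C" collation compares bytes and UTF-8 byte order coincides with code point order. JavaScript's default sort compares UTF-16 code units instead, which places supplementary characters before U+E000 to U+FFFF.

```typescript
// Hypothetical helper: compares two strings by Unicode code point, not by UTF-16 code unit.
function compareByCodePoint(a: string, b: string): number {
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    const ca = a.codePointAt(i) as number;
    const cb = b.codePointAt(j) as number;
    if (ca !== cb) return ca < cb ? -1 : 1;
    // Supplementary code points occupy two UTF-16 code units.
    i += ca > 0xffff ? 2 : 1;
    j += cb > 0xffff ? 2 : 1;
  }
  // The string with characters left over sorts after its prefix.
  return (a.length - i) - (b.length - j);
}

const mathA = String.fromCodePoint(0x1d622); // MATHEMATICAL SANS-SERIF ITALIC SMALL A
const replacement = String.fromCodePoint(0xfffd); // REPLACEMENT CHARACTER

const byCodePoint = [mathA, replacement, 'a'].sort(compareByCodePoint);
const byCodeUnit = [mathA, replacement, 'a'].sort();

console.log(byCodePoint.map((c) => (c.codePointAt(0) as number).toString(16)));
console.log(byCodeUnit.map((c) => (c.codePointAt(0) as number).toString(16)));
```

The first array is ordered 61, fffd, 1d622, while the default sort yields 61, 1d622, fffd because the surrogate pair starts with 0xD835.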

The code is as follows:

The Unicode data is parsed from 'UnicodeData.txt':

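    import * as fs from 'fs';
    import * as path from 'path';

    // Shape inferred from the mapping below: only the code value is kept.
    export interface UnicodeData {
      codeValue: string;
    }

    // Reads UCD/UnicodeData.txt and returns the first field of every non-empty line.
    export async function parseUnicodeData(unicodePath: string): Promise<UnicodeData[]> {
      return (await fs.promises.readFile(path.join(unicodePath, 'UCD/UnicodeData.txt'), 'utf-8'))
        .split('\n')
        .filter((line) => line !== '')
        .map((line) => line.split(';'))
        .map(([codeValue]) => ({codeValue}));
    }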
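To make the input format concrete, here is a small sketch of how a single UnicodeData.txt line maps to a character. It is an illustration, not the parser used in the test; `parseUnicodeDataLine` and `UnicodeRecord` are hypothetical names. Note that UnicodeData.txt lists large blocks (for example CJK ideographs and the surrogates) only as a pair of `First`/`Last` lines, so keeping one code value per line covers just the endpoints of those ranges.

```typescript
// Hypothetical record type: the first three of the semicolon-separated fields.
interface UnicodeRecord {
  codeValue: string;
  name: string;
  generalCategory: string;
}

// Hypothetical helper: splits one UnicodeData.txt line into its leading fields.
function parseUnicodeDataLine(line: string): UnicodeRecord {
  const fields = line.split(';');
  return {
    codeValue: fields[0] ?? '',
    name: fields[1] ?? '',
    generalCategory: fields[2] ?? '',
  };
}

// The UnicodeData.txt entry for U+0061.
const sampleLine = '0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041';
const record = parseUnicodeDataLine(sampleLine);

// The hexadecimal code value is turned into the actual character.
const character = String.fromCodePoint(parseInt(record.codeValue, 16));
console.log(record.name, JSON.stringify(character));
```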

In Postgres:

    CREATE TABLE unicode.character (
      character text PRIMARY KEY
    );

    -- Filtered codepoints: '0000', 'D800', 'DB7F', 'DB80', 'DBFF', 'DC00', 'DFFF'
    INSERT INTO unicode.character VALUES
      (chr(1)),
      (chr(2)),
      ...
      (chr(1114109));

Then in Node:

    import {Pool} from 'pg';
    import {parseUnicodeData} from 'util-unicode-parser';
    import {chromium, webkit, firefox} from 'playwright';

    const pool = new Pool();

    export async function compareCollations(
      unicodeDirectory: string,
      postgresCollationName: string,
      intlLocale: string,
      intlSettings = '{}',
    ) {
      // setting up playwright
      const pages = await Promise.all([chromium, webkit, firefox]
        .map((browserType) => browserType.launch({headless: true}))
        .map(async (browser) => (await browser).newContext())
        .map(async (context) => (await context).newPage()));

      // parsing unicode data
      const unicodeData = (await parseUnicodeData(unicodeDirectory))
        // codepoints filtered because they cannot be inserted into Postgres
        .filter(({codeValue}) => !['0000', 'D800', 'DB7F', 'DB80', 'DBFF', 'DC00', 'DFFF'].includes(codeValue));

      const browserCollationString = `[${unicodeData
        .map(({codeValue}) => `String.fromCodePoint(parseInt('${codeValue}', 16))`).join(',')
      }].sort(new Intl.Collator('${intlLocale}', ${intlSettings}).compare)`;

      // creating sorted arrays of all characters
      const nodeCollation = unicodeData
        .map(({codeValue}) => String.fromCodePoint(parseInt(codeValue, 16)))
        .sort(new Intl.Collator(intlLocale, JSON.parse(intlSettings)).compare);
      const chromeCollation = await pages[0].evaluate(browserCollationString);
      const webkitCollation = await pages[1].evaluate(browserCollationString);
      const firefoxCollation = await pages[2].evaluate(browserCollationString);
      const postgresCollation = (await pool.query(`SELECT character from unicode.character ORDER BY character COLLATE "${postgresCollationName}";`))
        .rows.map(({character}) => character);

      // comparing sorted arrays, only the first pair is shown
      console.log(chromeCollation
        .map((c, i) => [c, nodeCollation[i], c.codePointAt(0), nodeCollation[i].codePointAt(0)])
        .filter(([c, r, pc, rc]) => c !== r)
        .sort((a, b) => a[2] < b[2] ? -1 : 1)
        .length);
      // ...
    }
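The comparison at the end of the function counts the positions at which two sorted arrays hold different characters. A minimal sketch of that count for every pair of results could look as follows; `countPositionalMismatches` and `pairwiseMismatches` are hypothetical helpers, not part of the code above.

```typescript
// Hypothetical helper: number of indices at which the two arrays differ.
function countPositionalMismatches(a: readonly string[], b: readonly string[]): number {
  const length = Math.max(a.length, b.length);
  let mismatches = 0;
  for (let i = 0; i < length; i++) {
    if (a[i] !== b[i]) mismatches++;
  }
  return mismatches;
}

// Hypothetical helper: builds a "first - second" table for every pair of results.
function pairwiseMismatches(results: Record<string, readonly string[]>): Record<string, number> {
  const names = Object.keys(results);
  const table: Record<string, number> = {};
  names.forEach((first, i) => {
    names.slice(i + 1).forEach((second) => {
      table[first + ' - ' + second] = countPositionalMismatches(
        results[first] ?? [],
        results[second] ?? [],
      );
    });
  });
  return table;
}

// Toy data standing in for the sorted arrays of each implementation.
const mismatchTable = pairwiseMismatches({
  Chrome: ['a', 'b', 'c'],
  Node: ['a', 'c', 'b'],
  Firefox: ['a', 'b', 'c'],
});
console.log(mismatchTable);
```

A positional count like this overstates the real disagreement: a single character that is placed much earlier in one implementation shifts every character after it, so all of those positions are counted as different even when their relative order is the same.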

Can you share the code involved?

– Nico Haase, Commented Apr 5, 2024 at 7:43
