I am comparing the collation produced by Intl.Collator in three browsers (Chrome, Firefox, WebKit) and in Node with Postgres collations, and I found that most implementations order characters very differently. The only two implementations that match each other for the locale 'en' are Chrome and Firefox. For example, the letters from MATHEMATICAL SANS-SERIF ITALIC SMALL A to Z are sorted together with the letters a - z by Intl.Collator with locales such as en in Chrome and Firefox, but not in Postgres. I had assumed that all implementations are based on the same CLDR data. The number of positions at which the sorted character lists differ between implementations (tested with Playwright):

    Chrome - Node        1474
    Chrome - Webkit     34727
    Chrome - Firefox        0
    Chrome - Postgres   25781
    Webkit - Node       34727
    Webkit - Postgres   34727
    Node - Postgres     34892
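For reference, each count is the number of index positions at which two sorted arrays hold different characters, as in the comparison code further down. A minimal sketch of that counting step, where `countMismatches` is a hypothetical helper name and not part of the original code:

```typescript
// Hypothetical helper (not from the original code): counts the index
// positions at which two equally long arrays hold different strings.
function countMismatches(a: readonly string[], b: readonly string[]): number {
  if (a.length !== b.length) {
    throw new Error(`length mismatch: ${a.length} vs ${b.length}`);
  }
  let mismatches = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) mismatches++;
  }
  return mismatches;
}

// Two neighbours swapped: positions 1 and 2 differ.
console.log(countMismatches(['a', 'b', 'c', 'd'], ['a', 'c', 'b', 'd'])); // → 2

// One element moved from the end to the front: every position differs.
console.log(countMismatches(['a', 'b', 'c', 'd'], ['d', 'a', 'b', 'c'])); // → 4
```

Note that a positional count like this can overstate the disagreement: a single character placed elsewhere shifts every character between its two positions, so large numbers do not necessarily mean that many characters are ordered differently relative to each other.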

A more predictable sort order would make it possible to sort consistently in the frontend and the backend, independent of the implementation.
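To probe a single character in isolation, such as the MATHEMATICAL SANS-SERIF ITALIC SMALL A (U+1D622) mentioned above, something like the following sketch can be run in each environment. `relation` is a hypothetical helper introduced only for illustration, and the result is deliberately not shown because it depends on the implementation, which is the point of the question.

```typescript
type Compare = (a: string, b: string) => number;

// Hypothetical helper: maps a comparator result to '<', '=' or '>'.
function relation(compare: Compare, a: string, b: string): '<' | '=' | '>' {
  const result = compare(a, b);
  if (result < 0) return '<';
  if (result > 0) return '>';
  return '=';
}

// U+1D622 MATHEMATICAL SANS-SERIF ITALIC SMALL A
const mathSmallA = String.fromCodePoint(0x1d622);

// Intl.Collator.prototype.compare is a bound getter, so it can be passed around.
const collator = new Intl.Collator('en');
for (const other of ['a', 'b', 'z']) {
  console.log(`U+1D622 ${relation(collator.compare, mathSmallA, other)} ${other}`);
}
```

Running the same snippet in Node and in each browser console gives a quick per-character comparison without sorting the full character list.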

The code is as follows:

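The Unicode data is parsed from 'UnicodeData.txt':

    import * as fs from 'fs';
    import * as path from 'path';

    // Shape assumed from how the result is used below: only the code point (hex) is kept.
    export interface UnicodeData {
      codeValue: string;
    }

    // Reads UCD/UnicodeData.txt and returns the first field of every non-empty line.
    export async function parseUnicodeData(unicodePath: string): Promise<UnicodeData[]> {
      return (await fs.promises.readFile(path.join(unicodePath, 'UCD/UnicodeData.txt'), 'utf-8'))
        .split('\n')
        .filter((line) => line !== '')
        .map((line) => line.split(';'))
        .map(([codeValue]) => ({codeValue}));
    }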

In Postgres:

    CREATE TABLE unicode.character (
      character text PRIMARY KEY
    );

    -- Filtered codepoints: '0000', 'D800', 'DB7F', 'DB80', 'DBFF', 'DC00', 'DFFF'
    INSERT INTO unicode.character VALUES
      (chr(1)),
      (chr(2)),
      ...
      (chr(1114109));

Then in Node:

    import {Pool} from 'pg';
    import {parseUnicodeData} from 'util-unicode-parser';
    import {chromium, webkit, firefox} from 'playwright';

    const pool = new Pool();

    export async function compareCollations(
      unicodeDirectory: string,
      postgresCollationName: string,
      intlLocale: string,
      intlSettings = '{}',
    ) {
      // setting up playwright
      const pages = await Promise.all([chromium, webkit, firefox]
        .map((browserType) => browserType.launch({headless: true}))
        .map(async (browser) => (await browser).newContext())
        .map(async (context) => (await context).newPage()));

      // parsing unicode data
      const unicodeData = (await parseUnicodeData(unicodeDirectory))
        // codepoints filtered because they cannot be inserted into Postgres
        .filter(({codeValue}) => !['0000', 'D800', 'DB7F', 'DB80', 'DBFF', 'DC00', 'DFFF'].includes(codeValue));

      // expression evaluated inside each browser page
      const browserCollationString = `[${unicodeData
        .map(({codeValue}) => `String.fromCodePoint(parseInt('${codeValue}', 16))`).join(',')
      }].sort(new Intl.Collator('${intlLocale}', ${intlSettings}).compare)`;

      // creating sorted arrays of all characters
      const nodeCollation = unicodeData
        .map(({codeValue}) => String.fromCodePoint(parseInt(codeValue, 16)))
        .sort(new Intl.Collator(intlLocale, JSON.parse(intlSettings)).compare);
      const chromeCollation = await pages[0].evaluate(browserCollationString);
      const webkitCollation = await pages[1].evaluate(browserCollationString);
      const firefoxCollation = await pages[2].evaluate(browserCollationString);
      const postgresCollation = (await pool.query(`SELECT character from unicode.character ORDER BY character COLLATE "${postgresCollationName}";`))
        .rows.map(({character}) => character);

      // comparing sorted arrays (only the first pair is shown)
      console.log(chromeCollation
        .map((c, i) => [c, nodeCollation[i], c.codePointAt(0), nodeCollation[i].codePointAt(0)])
        .filter(([c, r, pc, rc]) => c !== r)
        .sort((a, b) => a[2] < b[2] ? -1 : 1)
        .length)
      // ...
    }
  • Can you share the code involved? Commented Apr 5, 2024 at 7:43
