Timeline for Transliterate wide-character input

Current License: CC BY-SA 4.0

6 events
when what by license comment
Mar 6 at 7:30 comment added Toby Speight It seems that the flat-map is a good choice for a lookup table that's created once and never modified. Lookup should be cheaper than std::map, and also scales as O(log 𝑛) in the length of from; we could use an unordered map when we exceed whatever threshold makes that faster for lookup (possibly larger than one might pass in process arguments?). If we're really clever, we might use a table for one or more subranges, when the from values are in clusters. All that can be transparent if we return a std::function to erase the type.
Mar 6 at 7:22 vote accept Toby Speight
Mar 2 at 22:35 history edited G. Sliepen CC BY-SA 4.0
deleted 24 characters in body
Mar 2 at 22:30 comment added G. Sliepen You wouldn't add all Unicode code points to the vector, only those spanned by the lowest one in from to the highest one. So unless you have both 0x0 and 0x10FFFF in it, you wouldn't use all 68 GB. As for std::flat_map, that's just a sorted vector of pairs, with the interface of std::map, and it just uses binary search for lookups and keeps the vector sorted on inserts.
Mar 2 at 20:16 comment added Toby Speight Oh, I meant to comment on the choice of std::map. There's probably a size where the overhead of std::unordered_map pays off, but I haven't benchmarked to find where that is. std::vector is an interesting choice, given that on my (Unicode) platform that would be 68 GB (0-0x10FFFF ✕ 4 bytes/codepoint) and I only have about 50 GB of virtual memory to share amongst all processes. I'm less familiar with std::flat_map; it seems that it's a way to use the two parallel strings without directly converting to pairs?
Mar 2 at 19:04 history answered G. Sliepen CC BY-SA 4.0
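
The comments above describe std::flat_map as a sorted vector of pairs searched by binary search, and suggest returning a std::function so the choice of container stays hidden from the caller. A minimal sketch of that idea follows, assuming C++17. The name make_transliterator and its signature are hypothetical, not taken from the answer, and a plain sorted std::vector stands in for std::flat_map so the example does not depend on C++23 library support.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original answer): builds a lookup
// function from two parallel strings, where from[i] maps to to[i].
std::function<wchar_t(wchar_t)>
make_transliterator(std::wstring_view from, std::wstring_view to)
{
    assert(from.size() == to.size());

    // A hand-rolled "flat map": (key, value) pairs kept sorted by key.
    std::vector<std::pair<wchar_t, wchar_t>> table;
    table.reserve(from.size());
    for (std::size_t i = 0; i < from.size(); ++i) {
        table.emplace_back(from[i], to[i]);
    }
    // stable_sort keeps equal keys in input order, so in this sketch the
    // first mapping for a repeated key is the one that lookup finds.
    std::stable_sort(table.begin(), table.end(),
        [](const auto& a, const auto& b) { return a.first < b.first; });

    // The table is built once and captured by value; the caller only sees
    // a callable, so the container could be swapped without changing callers.
    return [table = std::move(table)](wchar_t c) -> wchar_t {
        // Binary search: O(log n) in the number of mapped characters.
        auto it = std::lower_bound(table.begin(), table.end(), c,
            [](const auto& entry, wchar_t key) { return entry.first < key; });
        // Characters without a mapping pass through unchanged.
        return (it != table.end() && it->first == c) ? it->second : c;
    };
}
```

A caller would build the function once, for example auto tr = make_transliterator(L"abc", L"xyz");, and then apply tr to each input character. Because the return type is std::function, the same interface could later be backed by std::unordered_map or a dense table over a subrange, as the comments suggest, at the cost of an indirect call per lookup.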