Timeline for Transliterate wide-character input

Current License: CC BY-SA 4.0

6 events

when	what		by	license	comment
Mar 6 at 7:30	comment	added	Toby Speight		It seems that the flat-map is a good choice for a lookup table that's created once and never modified. Lookup should be cheaper than `std::map`, and also scales as O(log 𝑛) in the length of `from`; we could use an unordered map when we exceed whatever threshold makes that faster for lookup (possibly larger than one might pass in process arguments?). If we're really clever, we might use a table for one or more subranges, when the `from` values are in clusters. All that can be transparent if we return a `std::function` to erase the type.
Mar 6 at 7:22	vote	accept	Toby Speight
Mar 2 at 22:35	history	edited	G. Sliepen	CC BY-SA 4.0	deleted 24 characters in body
Mar 2 at 22:30	comment	added	G. Sliepen		You wouldn't add all Unicode code points to the vector, only those spanned by the lowest one in `from` to the highest one. So unless you have both 0x0 and 0x10FFFF in it, you wouldn't use all 68 GB. As for `std::flat_map`, that's just a sorted vector of pairs, with the interface of `std::map`, and it just uses binary search for lookups and keeps the vector sorted on inserts.
Mar 2 at 20:16	comment	added	Toby Speight		Oh, I meant to comment on the choice of `std::map`. There's probably a size where the overhead of `std::unordered_map` pays off, but I haven't benchmarked to find where that is. `std::vector` is an interesting choice, given that on my (Unicode) platform that would be 68 GB (0-0x10FFFF × 4 bytes/codepoint) and I only have about 50 GB of virtual memory to share amongst all processes. I'm less familiar with `std::flat_map`; it seems that it's a way to use the two parallel strings without directly converting to pairs?
Mar 2 at 19:04	history	answered	G. Sliepen	CC BY-SA 4.0
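
The ideas in the comments above (a vector spanning only the lowest to highest value in `from`, a sorted vector of pairs searched by binary search as `std::flat_map` does, and a `std::function` return to erase the type) can be sketched roughly as follows. This is a minimal illustration, not the code from the question or the answer: the name `make_transliterator`, the `dense_limit` parameter and its default of 4096 are all assumptions made for the example, and the threshold has not been benchmarked. It also assumes that when a character appears more than once in `from`, the first occurrence wins, which is an arbitrary choice here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: returns a function mapping from[i] to to[i].
// Characters that do not occur in `from` are returned unchanged.
std::function<char32_t(char32_t)>
make_transliterator(std::u32string_view from, std::u32string_view to,
                    std::size_t dense_limit = 4096)  // assumed threshold, not benchmarked
{
    assert(from.size() == to.size());
    if (from.empty()) {
        return [](char32_t c) { return c; };
    }

    const auto [lo_it, hi_it] = std::minmax_element(from.begin(), from.end());
    const char32_t lo = *lo_it;
    const char32_t hi = *hi_it;
    const std::size_t span = static_cast<std::size_t>(hi - lo) + 1;

    if (span <= dense_limit) {
        // Dense table covering only [lo, hi], initialised to the identity.
        std::vector<char32_t> table(span);
        for (std::size_t i = 0; i < span; ++i) {
            table[i] = lo + static_cast<char32_t>(i);
        }
        // Fill in reverse so that the first occurrence in `from` wins.
        for (std::size_t i = from.size(); i-- > 0; ) {
            table[from[i] - lo] = to[i];
        }
        return [lo, hi, table = std::move(table)](char32_t c) {
            return (c < lo || c > hi) ? c : table[c - lo];
        };
    }

    // Sparse case: a sorted vector of pairs with binary-search lookup,
    // which is conceptually what a flat map does.
    std::vector<std::pair<char32_t, char32_t>> pairs;
    pairs.reserve(from.size());
    for (std::size_t i = 0; i < from.size(); ++i) {
        pairs.emplace_back(from[i], to[i]);
    }
    // A stable sort followed by unique keeps the first mapping for each key.
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const auto& a, const auto& b) { return a.first < b.first; });
    pairs.erase(std::unique(pairs.begin(), pairs.end(),
                            [](const auto& a, const auto& b) { return a.first == b.first; }),
                pairs.end());

    return [pairs = std::move(pairs)](char32_t c) {
        const auto it = std::lower_bound(
            pairs.begin(), pairs.end(), c,
            [](const auto& p, char32_t key) { return p.first < key; });
        return (it != pairs.end() && it->first == c) ? it->second : c;
    };
}
```

A caller would write something like `auto tr = make_transliterator(U"abc", U"xyz");` and then apply `tr` to each input character. The dense table costs `span * sizeof(char32_t)` bytes, so it is only chosen when the values in `from` are close together; the caller sees the same `std::function` either way. An unordered map could be added as a third branch if measurements showed it to be faster for large inputs.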