CDB replacement for localization cache
Closed, Resolved · Public

Assigned To
Authored By
bvibber
Nov 20 2025, 10:00 PM

Description

Production's localization cache uses CDB database files in the k8s deployment images, which are very efficient to look up strings from but which are very slow to copy when anything in the database changes.

This can cause deployments to be delayed by about a half hour due to the time spent copying these large database files around. It _should_ be possible to arrange the data in a way that updates much more efficiently but is still fast enough to look up strings in MediaWiki with.

Notes (to be added to further):

  • container image diffs are _file-based_, not _line_- or _block_-based: any change to a file means re-shipping the whole file
  • therefore it's worth investigating either multiple small files, fewer of which change at once, or a file per version which builds on the differential from the previous version
  • which of these will perform better for generation? for update syncs? for string reads?
  • do we continue to build on CDB? use JSON or PHP but in different arrangements? something else?
  • check prior notes on T99740: Use static php array files for l10n cache at WMF (instead of CDB)
    • there is a lot of string duplication, especially with the keys - consider some indirection mapping to reduce space
  • consider compression of message payloads to further reduce space
    • performance impact on decode?
    • zlib? zlib with dictionary?
    • brotli? (comes with a standard dictionary with common words in many languages)
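The preset-dictionary idea above can be sketched with zlib's `zdict` support. This is an illustrative Python model only (the real cache is PHP); the sample message and dictionary contents are made up, but the mechanics are real: seed the compressor with substrings common across many messages, and short payloads compress far better — at the cost of having to ship the same dictionary to every reader.

```python
import zlib

# Illustrative sketch: plain deflate vs. deflate with a preset dictionary
# (zlib's zdict). Message and dictionary contents are made up for the demo.
message = b'You do not have permission to edit this page.'
shared_dict = b'permission to edit this page You do not have'

plain = zlib.compress(message, 9)

comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                        zlib.DEF_MEM_LEVEL, zlib.Z_DEFAULT_STRATEGY,
                        shared_dict)
with_dict = comp.compress(message) + comp.flush()

# Decompression needs the exact same dictionary available.
decomp = zlib.decompressobj(zlib.MAX_WBITS, shared_dict)
restored = decomp.decompress(with_dict)
```

For a message almost entirely covered by the dictionary, `with_dict` is a fraction of the size of `plain` — which is why the decode-time and dictionary-distribution costs are the open questions, not the ratio.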

Prep before session:

  • do some exploratory hacking testing a few alternate layouts to see how they perform for lookups, and how well they isolate changes into per-version synced files

At the session:

  • hash out prior notes and any new surprises anyone comes up with
  • pick an experimental layout and figure out what a production version would look like on the MediaWiki end and for the cache generation
  • try implementing!
  • ???
  • non-profit!

Event Timeline

Pinging particularly @Krinkle and @Joe who have both been interested in this topic historically, and might (or might not!) want to get involved in the Hackathon in Milan.

bd808 updated the task description. (Show Details)

Riffing on a comment by Tim Starling about indirecting strings, I whipped up a quick proof of concept changing the CDB-backed cache to use indirection tables for message keys and strings, and compress the serialized strings with deflate:

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1209576

On my local dev setup this reduces the total size of a rebuilt CDB localization cache from 850M to 300M with only a modest increase in generation runtime. I have not attempted to benchmark the more expensive reads (it has to read from three CDBs instead of one for each lookup).

Seems like a respectable improvement in the amount of data that has to be sent over the wire, but I'd like to see another order of magnitude improvement...

About half the used space now (on my setup with only a few extensions) is the per-language mapping .cdb files, and half is the compressed string table. Modest improvements could be made to the compression with a cleverly generated dictionary, but there might be bigger gains in making the lookup indexes smaller than CDB: since we have sequential index numbers, do we even need a general hash table lookup? A flat binary index on the index numbers is probably more efficient.

Change #1209576 had a related patch set uploaded (by Bvibber; author: Bvibber):

[mediawiki/core@master] WIP POC experiment for localisation cache

https://gerrit.wikimedia.org/r/1209576

Got my dev setup's l10n cache from original 850M down to *164M* by replacing the CDB for the id->id mappings with fixed binary array files, which should also be faster to query.

[Update] I replaced the CDB for the id->string mapping file too: a binary-array index file plus a data file containing all the items concatenated, with an offset and length stored at each position in the index.

This gets the l10n cache directory down to 140M from 850M originally! Not bad for a start.
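The index-plus-data layout described above can be modeled in a few lines. This is a Python sketch of the idea, not the patch itself: the strings, field widths (u32 offset, u24 length), and byte order are illustrative stand-ins for whatever the actual binary files use.

```python
import struct

# Sketch: one concatenated data blob plus a fixed-width index of
# (offset: u32, length: u24) per string id, so a lookup is a single
# indexed read instead of a CDB hash probe.
strings = ['Main Page', 'Edit', 'View history']

data = b''
index = b''
for s in strings:
    raw = s.encode('utf-8')
    # u32 offset, then u24 length (the low 3 bytes of a little-endian u32)
    index += struct.pack('<I', len(data)) + struct.pack('<I', len(raw))[:3]
    data += raw

ENTRY = 7  # 4-byte offset + 3-byte length

def lookup(string_id: int) -> str:
    entry = index[string_id * ENTRY:(string_id + 1) * ENTRY]
    (offset,) = struct.unpack('<I', entry[:4])
    (length,) = struct.unpack('<I', entry[4:7] + b'\x00')
    return data[offset:offset + length].decode('utf-8')
```

Because every index entry is the same width, finding entry `i` is pure arithmetic — no hashing, no probing — which is the "should also be faster to query" claim above.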

Thanks :)

I think the shared compression dictionary is a dead-end, it's not worth the savings to generate it.

Instead, next weekend I'll try adding block-based compression: bundle up strings by count or block size and compress them together. This has served us well in external storage etc., and should give much better locality than I could get out of the shared dictionary. If adjacent strings in message-key order within a language tend to appear in the same block, this would also be a win for an in-process cache.

Once I've done that, I'll run a synthetic benchmark on message loads and some page-load time tests on a blank page read, Special:SpecialPages, and maybe a couple of others, over a few languages (maybe en, ru, zh-hant?). Fun times, this might shape up to be useful.

Ok here's my holiday hacking plan for later this week:

  • Install more extensions to my dev setup to better simulate production environment!
  • Drop the shared compression dictionary generation
  • Add optional block indirection to compress ~64 KiB of message data together
  • Add in-process read caches to avoid re-reading and re-decompressing the same blocks
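The in-process read cache in the last bullet is the simplest piece; a hypothetical Python sketch (the dict stands in for the on-disk data file, which real code would seek/read by position and length):

```python
from functools import lru_cache
import zlib

# Hypothetical in-process block cache: each compressed block is read and
# decompressed at most once per process, keyed by block id.
compressed_blocks = {0: zlib.compress(b'serialized block of messages')}

@lru_cache(maxsize=128)
def load_block(block_id: int) -> bytes:
    return zlib.decompress(compressed_blocks[block_id])
```

With block compression, repeated lookups into the same ~64 KiB block pay the decompression cost only once, which is what makes the block size a locality tradeoff rather than a pure size knob.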

Slight change to file layout:

  • l10n_cache-keys.cdb
    • key string -> key id
  • l10n_cache-(code).bin
    • key id -> string id (u24)
  • l10n_cache-blocks.bin
    • string id -> block id (u24)
    • if block compression is disabled, this file is skipped
      • (block id == string id)
  • l10n_cache-index.bin
    • block id -> [pos, len] (u32, u24)
  • l10n_cache-data.bin
    • concatenated optionally-compressed strings
    • for block compression mode, each block is a serialized array mapping string ids to strings
      • target 32 KiB or 64 KiB blocks, allow them to spill over (some items are > 80 KiB)
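A toy end-to-end model of the five-file layout above, with Python lists and dicts standing in for the .cdb/.bin files (all keys, strings, and sizes are illustrative; JSON stands in for whatever serialization the real cache uses):

```python
import json
import zlib

# Stand-ins for the five files in the layout sketch:
keys = {'mainpage': 0, 'edit': 1}    # l10n_cache-keys.cdb: key string -> key id
lang_en = [0, 1]                     # l10n_cache-en.bin:   key id -> string id
string_to_block = [0, 0]             # l10n_cache-blocks.bin: string id -> block id

# One block: a serialized map of string id -> string, compressed together.
block0 = zlib.compress(json.dumps({'0': 'Main Page', '1': 'Edit'}).encode())
block_index = [(0, len(block0))]     # l10n_cache-index.bin: block id -> (pos, len)
block_data = block0                  # l10n_cache-data.bin:  concatenated blocks

def get_message(lang: list, key: str) -> str:
    # Walk the indirection chain: key -> key id -> string id -> block -> string.
    string_id = lang[keys[key]]
    block_id = string_to_block[string_id]
    pos, length = block_index[block_id]
    block = json.loads(zlib.decompress(block_data[pos:pos + length]))
    return block[str(string_id)]
```

Note how the per-language file holds only small fixed-width ids — that's the piece that changes per deployment, while the shared key/string/block files change rarely.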

Contexts to do performance testing over:

  • default cache
  • CDB
  • binary w/ no compression
  • binary w/ string compression
  • binary w/ block compression

Measure for each mode:

  • recache time
  • cache size
  • for each of [en, ru, zh-hant]
    • synthetic benchmark measuring median time loading random message keys
    • latency of an empty page
    • latency of Special:Specialpages

Did a quick benchmark of whole web requests; since the difference is clearly visible enough I didn't bother to do a synthetic benchmark:

image.png (718×2 px, 164 KB)

https://docs.google.com/spreadsheets/d/14n6r7LQ5bDuAO2G8gSso3QZ__ByQzrUCtC1JvJYr-g0/edit?gid=0#gid=0

My binary cache code seems reasonably competitive with CDB, with a small hit for larger compression blocks, but PHP arrays with the opcode cache still absolutely blow them out of the water for read speed consistency.

I think I'll try adding some indirection modes to the static arrays:

  • consider a key indirection table
  • consider a common string indirection table
    • and/or replicate fallback language logic (don't encode strings that are the same as the fallback language returns, so languages that don't have a lot of strings don't have to spend space duplicating other strings)
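The "don't encode strings that are the same as the fallback language" idea boils down to storing only the diff per language and walking the fallback chain at read time. A minimal Python sketch, with made-up languages and messages:

```python
# Each language stores only messages that differ from its fallback chain;
# lookup walks the chain until a language defines the key.
messages = {
    'en':    {'edit': 'Edit', 'save': 'Save'},
    'en-gb': {},                                      # nothing differs from en
    'de':    {'edit': 'Bearbeiten', 'save': 'Speichern'},
}
fallback = {'en-gb': 'en', 'de': 'en'}

def get_message(code: str, key: str) -> str:
    while code is not None:
        if key in messages.get(code, {}):
            return messages[code][key]
        code = fallback.get(code)    # None ends the chain at the root
    raise KeyError(key)
```

Languages with few overrides (like en-gb here) collapse to nearly empty files, which is exactly where the disk and opcache savings come from.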

Change #1212717 had a related patch set uploaded (by Bvibber; author: Bvibber):

[mediawiki/core@master] WIP POC LCStaticArrays mode to do fallback logic at read time

https://gerrit.wikimedia.org/r/1212717

I've made a stab at using the existing fallback language logic in LCStaticArrays by adding a "fallback" mode that doesn't merge in fallback keys, and instead loads them from the other cached languages at runtime:

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1212717/1

Seems to work correctly for messages at no measurable performance penalty on page views to a blank page or Special:Specialpages, though I've got some edge cases where it's not pulling up localization right on en-gb or zu so let me fix that before we take those performance numbers as gospel :)

Very promising though! Simply cutting out the merges reduces the array size on disk from 718M to 158M which is a huge reduction and should compress better. Assuming this holds when I fix stuff. :D

It might also still be worth de-duplicating key strings into integer keys.

I thought this would fix the date formats on en-gb and zu (pulling up all fallbacks and using the merge logic on arrays) but it didn't; I still have to figure this out:

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1212717/2

Fixed the merging of arrays and preloads; seems to be working quite well, with no cost when no fallbacks are used and a _possible_ slight cost visible on, for instance, zh-hant, which loads several fallbacks. It still beats CDB consistently. :)

It seems like it costs 1ms to load a big language file like en or zh-hant or zh-hans; loading en and en-gb or zu is cheap, but loading all three of zh-hant, zh-hans, and en as well as the tiny zh-tw and zh-hk has some cost, probably in instantiating those big hashmaps of strings.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1212717

image.png (1×3 px, 567 KB)

Awesome results!

I'm curious what the opcache shared memory_usage looks like for the "PHP arrays" and "PHP arrays with fallback separated" cases. The shared memory_usage is sensitive to what you've requested since the last server restart, but we'd want to know whether stripping fallbacks decreases used_memory after all the different language/page benchmarks on a given setup are done, compared to the setup including fallbacks.

When we last tried LCStoreStaticArray, it required a ~1GB increase to opcache and that was considered too large (ref T99740#5941838, T99740#6088175). I can see it saving a lot, but I could also see how interned strings might have baked in some savings already between disk size and memory size.

The string interning cache seems to de-duplicate well enough, but the total cache memory definitely seems a *lot less* with the smaller files from indirecting fallback and/or keys -- the opcode cache increase over CDB is reduced from 332 MB to just 72 MB in my local test, and likely a bigger relative win in production. Key indirection is a smaller additional win, down to 61 MB over CDB.

String cache takes 113 MB over CDB for the full set, 106 MB for the fallback set with or without message key indirection.

I _suspect_ that there's a cost of a couple of opcode entries per key-value pair of an array literal, and those likely encode file position for error reporting during execution (line and column), so simply _disappearing a bunch of strings_ from many files saves a lot of space in the opcode cache, even though it doesn't save any in the string cache. Looks like roughly 70 bytes we can drop per string, per language that doesn't encode it.

And that's probably a win!

image.png (424×1 px, 96 KB)

Change #1213050 had a related patch set uploaded (by Bvibber; author: Bvibber):

[mediawiki/core@master] WIP POC String blobbing for LCStoreStaticArray

https://gerrit.wikimedia.org/r/1213050

Current state:
https://docs.google.com/spreadsheets/d/14n6r7LQ5bDuAO2G8gSso3QZ__ByQzrUCtC1JvJYr-g0/edit?gid=0#gid=0

Size and performance:

image.png (1×3 px, 501 KB)

Opcode cache usage:

image.png (974×2 px, 307 KB)

Magenta line is this version: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1213050/5

With the combination of:

  • separating fallbacks
  • indirecting message keys
  • indirecting message strings into references into one big string
  • packing key id -> string position mappings into another big string

I get a *massive* reduction in total opcode cache in exchange for a payload of fewer, larger strings in cache.
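The "packing key id -> string position mappings into another big string" bullet trades one array entry per key for a fixed-width slice of a single packed string. A Python sketch of the mechanism (field widths here are illustrative u32 pairs, not necessarily what the patch uses):

```python
import struct

# One packed binary string instead of an array with one entry per key id;
# the i-th (offset, length) pair lives at a fixed byte position.
positions = [(0, 9), (9, 4), (13, 12)]   # (offset, length) per key id

packed = b''.join(struct.pack('<II', off, ln) for off, ln in positions)

def position_for(key_id: int) -> tuple:
    # Each entry is 8 bytes: u32 offset + u32 length, little-endian.
    return struct.unpack_from('<II', packed, key_id * 8)
```

The point of the trade is that a single long string is one cache entry, while an equivalent PHP array literal costs per-element opcodes and zvals.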

The biggest win in disk space and opcode cache size comes from separating the fallbacks alone. There _may_ be a small performance penalty for languages with multiple large fallback arrays to load, though the packing into large strings should reduce that. This is on the order of 1, maybe 2ms tops on my test machine with requests of ~50ms and 56ms without the fallback, specifically for zh-hant which loads three large language files (zh-hant, zh-hans, and en as well as a couple smaller ones).

More of the total opcache memory is in the string interning, but the total usage seems to be down if I'm reading things correctly so I think this is a win?

There's still some possible low-hanging fruit: LocalisationCache stores a list of all message keys in the file as an array, seemingly to validate keys before doing on-spec lookups? But this could be replaced with a check on the global key array with some adjustment, and again save thousands of opcodes worth of instantiating strings and adding them to an array at load time.

A suggestion: when reporting the latencies, it would be useful to also include the standard deviation, which ab(1) provides, as the numbers are so close that the differences might well be within one standard deviation of each other across the various solutions. That might help pick the solution with the best size/performance tradeoff.

Change #1213050 abandoned by Bvibber:

[mediawiki/core@master] WIP POC String blobbing for LCStoreStaticArray

Reason:

Abandoning in favor of 1212717; the savings from the RAM overhead of using a few large strings instead of many smaller strings may be illusory, and the approach is likely to have fragmentation problems, per a check-in with Sara Golemon from the PHP core team. Sticking with small strings and minimizing the amount of duplicated arrays and setup.

https://gerrit.wikimedia.org/r/1213050

Change #1209576 abandoned by Bvibber:

[mediawiki/core@master] WIP POC experiment for localisation cache

Reason:

Abandoned in favor of 1212717; the opcache absolutely wipes the floor with this for performance, and I can get similar size and RAM improvements by tweaking LCStaticArrays.

https://gerrit.wikimedia.org/r/1209576

Experiment benchmarks in CPU and RAM so far:
https://docs.google.com/spreadsheets/d/14n6r7LQ5bDuAO2G8gSso3QZ__ByQzrUCtC1JvJYr-g0/edit?gid=0#gid=0

Current patch version (cyan line in above spreadsheet):
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1212717/14

  • I've removed the binary index mode and the string blobbing as they don't seem like clear wins
  • Keeping the LCStoreStaticArray modifications to use with opcode cache
    • the fallback option does fallback at read time, so it avoids storing redundant data -- a massive disk and RAM win
    • the indirectKeys option indirects message keys to integers through global mapping arrays, reducing the size of each language file and its RAM impact moderately more
      • this also allows replacing the $data['list']['messages'] key (a full message list repeated in every language file) with one common list that can be referenced cheaply

We're looking at a disk space reduction vs current version (on my dev workstation with a modest set of extensions) from 718M for l10n.php cache files to just 98M, with opcache RAM savings from 630M to 342M (vs 298M using CDB). And it takes only a few seconds to generate the whole set, both of which should lead to much much shorter rebuild and deployment times when changing localizations.

Production data sizes will be larger by a factor of 2-3 I imagine, and probably the "large array size with few localised items" issue is even further exacerbated there by having a lot of extensions with partial localizations. I think we're onto something. :)

I'm chatting a little with Sara Golemon on the PHP core team about strings and arrays in the opcache, and she advises that it's tuned for many small strings, not for fewer large strings, so the "blobbing" and packed-index experiments, while they maybe saved a little RAM, were likely to lead to increased fragmentation in production. We're probably better off with the many-small-strings caching; garbage collection is by simple reference counting, so it shouldn't create any "oh but it's slow to traverse the object graph" problems.

A container running a web server with several parallel versions of MediaWiki on it should still deduplicate common strings, but I believe it won't deduplicate array data -- so those string mapping arrays being reduced in size is gonna be a HUGE win in production with potentially 2 or 3 MediaWiki versions in flight.

If this sounds useful, I think the way to clean it up is to move the runtime fallback/merging handling logic fully to LocalisationCache and have LCStoreStaticArray worry only about the message key indirection. Then the extra handlesFallback method on LCStore and friends can be dropped. Will poke that later as time permits.

> A suggestion: when reporting the latencies, it would be useful to also include the standard deviation, which ab(1) provides, as the numbers are so close that the differences might well be within one standard deviation of each other across the various solutions. That might help pick the solution with the best size/performance tradeoff.

Excellent point, I'll include this with my next set of tests later this week.

It does look like all of the static-arrays-based methods perform very similarly in terms of CPU/wall-clock time, making the disk & opcache RAM sizes the primary metric for picking between them. I'm mainly testing here to ensure there are no big regressions against the current static-arrays version. :) If there is a time cost to fallback, it's barely at the edge of my measurement for the zh-hant worst case (~1-2 ms on times that fluctuate by ~1 ms).

Change #1212717 merged by jenkins-bot:

[mediawiki/core@master] Language: shrink LCStoreStaticArray by doing fallback at read time

https://gerrit.wikimedia.org/r/1212717

As the final, more conservative refactor has been merged I'm going to close this task out as resolved: T99740 can cover further live testing of LCStoreStaticArray with the reduced disk/ram impact.

Woohoo! Now I have to find other tasks for the May Hackathon. ;)

Aklapper renamed this task from "Hackathon project: CDB replacement for localization cache" to "CDB replacement for localization cache". Dec 18 2025, 6:11 PM