String storage optimization

Question

I'm looking for some C++ library that would help to optimize memory usage by storing similar (not exact) strings in memory only once. It is not FlyWeight or string interning which is capable to store exact objects/strings only once. The library should be able to analyze and understand that, for example, two particular strings of different length have identical first 100 characters, this substring should be stored only once.
Example 1:

std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch=test1"<br/> std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch=test2"<br/>

in this case it is obvious that the only difference in these two strings is the last character, so it would be a great saving in memory if we hold only one copy of "http://www.contoso.com/some/path/app.aspx?ch=test" and then two additional strings "1" and "2"
Example 2:

std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch1=test1"<br/> std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch2=test2"<br/>

this is more complicated case when there are multiple identical substrings : one copy of "http://www.contoso.com/some/path/app.aspx?ch", then two strings "1" and "2", one copy of "=test" and since we already have strings "1" and "2" stored we don't need any additional strings.
So, is there such a library? Is there something that can help to develop such a library relatively fast? strings are immutable, so there is no need to worry about updating indexes or locks for threadsafety

Such a "substring matching" algorithm can be found in many compression programs such as gzip. But keep in mind that it's generally more efficient to store '1' than a pointer to "1". Depending on your chunked-string implementation, you may not see an improvement for substrings smaller than 16 bytes. — MSalters
– MSalters, Commented Nov 12, 2014 at 8:08
Theoretically, yes, but in practice that means you only need to scan for sizeable substring matches. That makes a huge difference (exponential decrease in useless matches). For instance, you apparently didn't realize that ".contoso" and ".com" contain multiple identical substrings, namely ".co". — MSalters
– MSalters, Commented Nov 12, 2014 at 8:57
I guess when you solving the longest common substring problem a threshold can be set not to deal with strings shorter than 8 characters — kreuzerkrieg
– kreuzerkrieg, Commented Nov 12, 2014 at 9:00
It's more subtle than that. You initially don't deal with longer strings either. You search for 8 byte matches. Only when you find a potential match do you scan to see where the first disagreement is. Remember that you're looking at O(N*N) potential matches. — MSalters
– MSalters, Commented Nov 12, 2014 at 9:05

k06a · Accepted Answer · 2014-11-12 08:09:49Z

If strings have common prefix the solution may be - using radix tree (also known as trie) (http://en.wikipedia.org/wiki/Radix_tree) for string representation. So you can only store pointer to tree leaf. And get whole string by growing up to tree root.
```
hello world hello winter hell [2] / h-e-l-l-o-' '-w-o-r-l-d-[0] \ i-n-t-e-r-[1] 
```
Here is one more solution: http://en.wikipedia.org/wiki/Rope_(data_structure)

libstdc++ implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a00223.html

SGI documentation: http://www.sgi.com/tech/stl/Rope.html

But I think you need to construct your strings for rope to work properly. Maybe found longest common prefix and suffix for every new string with previous string and then express new string as concatenation of previous string prefix, then uniq part and then previous string suffix.

Cheng · Accepted Answer · 2014-11-12 07:51:54Z

For example 1, what I can come up with is Radix Tree, a space-optimized version from Trie. I did a simple google and found quite a few implementations in C++.
For example 2, I am also curious about the answer!

See @k06a answer, it is general solution for above problem, it provides building blocks, however, I really don't want to develop full solution

Hadi Brais · Accepted Answer · 2014-11-12 08:33:42Z

First of all, note that std::string is not immutable and you have to make sure that none of these strings are accidentally modified.

This depends on the pattern of the strings. I suggest using hash tables (std::unordered_map in C++11). The exact details depend on how you are going to access these strings.

The two strings you have provided differ only after the "?ch" part. If you expect that many strings will have long common prefixes where these prefixes are almost of the same size. You can do the following:

Let's say the size of a prefix is 43 chars. Let s be a string. Then, we can consider s[0-42] a key into the hash table and the rest of the string as a value.

For example, given "http://www.contoso.com/some/path/app.aspx?ch=test1" the key would be "http://www.contoso.com/some/path/app.aspx?" and "ch=test1" would be the value. if the key already exists in the hash table you can just add the value to the collection of values associated with key. Otherwise, add the key/value pair.

This is just an example, what the key is and what the value is depend on how you are going to access these strings.

Also if all string have "=test" in them, then you don't have to store this with every value. You can just store it once and then insert it when retrieving a string. So given the value "ch1=test1", what will be stored is just "ch11". This depends on the pattern of the strings.

Collectives™ on Stack Overflow

String storage optimization

3 Answers 3

Comments

1 Comment

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Related