7

Please consider the following data model:

data Artist = Artist Text data Song = Song Artist Text data Catalogue = Catalogue (Set Artist) (Set Song) 

You can see that the Artists are referred to from both the Songs and the Catalogue. The Catalogue contains a list of all artists referred to from Songs, so the same values of Artist get referred to from two places.

Suppose we were to generate the Catalogue value using multiple applications of the following function:

insertSong :: Song -> Catalogue -> Catalogue insertSong song@(Song artist title) (Catalogue artists songs) = Catalogue (Set.insert artist artists) (Set.insert song songs) 

It's evident that the Catalogue would get filled by references to the same values of Artist as the Songs refer to, thus saving the memory by not storing the copies of those values.

The problem is that when I try to recreate the catalogue from serialized data by separately deserializing a set of artists and a set of songs, the application occupies way more memory than when it generated the same value of Catalogue with insertSong. I suspect that it is caused by the lost relation between same Artists referred to from Songs and the Catalogue, which is why I get copies of values of Artist occupying extra memory.

The only solution I see is to first deserialize the set of artists and then to deserialize the set of songs while forcefully replacing the values of Artist with the ones from the first set.

So my questions are:

  1. Am I right in my suspicion?
  2. Will the solution I see work?
  3. Are there any better ways to solve this?

3 Answers 3

6
  1. That sounds plausible.
  2. It should work if done right. In particular, you have to ensure that everything is eagerly evaluated, to avoid referencing the old Text values from thunks.
  3. You can choose a smarter serialization format. For example, when you serialize songs, store the index of the artist in the artists list instead of the full artist name. Then look it up during deserialization.

Note that sharing will be also lost if you do any kind of computation on your strings (i.e. even if artist1 and artist2 are the same and shared, f artist1 and f artist2 are probably not). If this becomes a problem, you can do similar changes to your data structures, too.

Sign up to request clarification or add additional context in comments.

Comments

3

A simple solution seems to cache data using a somewhat degenerated map:

{-# LANGUAGE DeriveDataTypeable, RankNTypes #-} import Control.Monad import Control.Monad.State import Data.Map (Map) import qualified Data.Map as M type Cache a = Map a a 

We can then query this cache if there is already an entry equal to this one, and replace it with the cached one:

cached :: (Ord a) => a -> State (Cache a) a cached x = state $ \m -> case M.lookup x m of Just x' -> (x', m) Nothing -> (x, M.insert x x m) 

This way, if we load several equal elements of type a, we transform them into a single. This can be done during deserialization or once at the end.


Perhaps it would be possible to generalize it further and use SYB to map all values of some given type in a data structure through a cache:

import Data.Data (Data) import Data.Generics.Aliases (mkM) import Data.Generics.Schemes (everywhereM) import Data.Typeable (Typeable) replaceFromCache :: (Ord a, Typeable a, Data b) => b -> State (Cache a) b replaceFromCache = everywhereM (mkM cached) 

Then we can replace all artists in some data structure like

data Artist = Artist String deriving (Eq, Ord, Typeable) cacheAllArtists :: (Data b) => b -> b cacheAllArtists b = evalState (replaceFromCache b) (M.empty :: Cache Artist) 

Or we can use Proxy phantom type to create a general version:

cacheAll :: (Ord a, Typeable a, Data b) => Proxy a -> b -> b cacheAll p = flip evalState (emptyOf p) . replaceFromCache where emptyOf p = asTypeOf2 M.empty p asTypeOf2 :: f a -> Proxy a -> f a asTypeOf2 = const cacheAllArtists :: (Data b) => b -> b cacheAllArtists = cacheAll (Proxy :: Proxy Artist) 

(Disclaimer: I haven't tested any of the above code.)

4 Comments

The idea on generics is very interesting. The problem is general enough for such a library to be developed. Thanks. I'll look into it.
@NikitaVolkov I'll gladly participate.
That's great! Although I must admit I'm not ready to dive in such a project just yet, but I'll keep in touch with you if I'll get back to it.
I've accidentally stumbled upon a project, which approaches the issue. See RefSerialize.
1

I've accidentally stumbled upon a project, which approaches the issue. See RefSerialize.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.