Caching system for dynamically created files?

Question

I have a web server that is dynamically creating various reports in several formats (pdf and doc files). The files require a fair amount of CPU to generate, and it is fairly common to have situations where two people are creating the same report with the same input.

Inputs:

raw data input as a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words
the version of the report creation tool

When a user attempts to generate a report, I would like to check to see if a file already exists with the given input, and if so return a link to the file. If the file doesn't already exist, then I would like to generate it as needed.

What solutions are already out there? I've cached simple HTTP requests before, but the keys were extremely simple (usually database IDs).
If I have to do this myself, what is the best way? The input can be several hundred words, and I was wondering how I should go about transforming the strings into keys sent to the cache.

// entire input: uses too much memory, one-to-one mapping
cache['one two three four five six seven eight nine ten eleven...']
// short keys
cache['one two'] => 5 results, then I must narrow these down even more
Is this something that should be done in a database, or is it better done within the web app code (Python in my case)?

Thank you, everyone.

Which web framework are you using? Some frameworks have caching features built-in.
– nosklo, Commented Oct 19, 2009 at 10:46

S.Lott · Accepted Answer · 2009-10-19 11:28:06Z

This is what Apache is for.

Create a directory that will have the reports.

Configure Apache to serve files from that directory.

If the report exists, redirect to a URL that Apache will serve.

Otherwise, the report doesn't exist, so create it. Then redirect to a URL that Apache will serve.
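The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: `report_url`, `generate`, `report_dir`, and the `/reports` URL prefix are hypothetical names, and the key is assumed to already be safe to use as a file name. The temporary-file-plus-rename step is an addition so that the web server never sees a half-written report when two users request the same one at once.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def report_url(key: str, ext: str, generate: Callable[[], bytes],
               report_dir: Path, url_prefix: str = "/reports") -> str:
    """Return the URL of a cached report, generating the file first if needed.

    `key` is assumed to be filesystem-safe; `generate` is a hypothetical
    callable that does the expensive work and returns the report bytes.
    """
    report_dir.mkdir(parents=True, exist_ok=True)
    filename = f"{key}.{ext}"
    target = report_dir / filename

    if not target.exists():
        data = generate()
        # Write to a temporary file in the same directory, then rename it
        # into place so a partially written report is never served.
        fd, tmp_name = tempfile.mkstemp(dir=str(report_dir), suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
            os.replace(tmp_name, target)
        finally:
            # Clean up only if the rename did not happen.
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)

    # The web app would redirect the user to this URL, which the
    # web server serves as a static file.
    return f"{url_prefix}/{filename}"
```

If two requests race, both may generate the report, but each rename replaces the file as a whole, so the result stays consistent.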

There's no "hashing". You have a key ("a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words") and a value, which is a file. Don't waste time on a hash. You just have a long key.

You can compress this key somewhat by making a "slug" out of it: remove punctuation, replace spaces with _, that kind of thing.
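A slug function along these lines might look like the sketch below. The name `slugify` and the exact rules (lowercasing, dropping everything except letters, digits, underscores, and whitespace) are assumptions for illustration. Note that a slug is lossy: with equations in the input, `a+b` and `a-b` would both reduce to `ab`, and many filesystems cap file names at 255 bytes, so a slug alone may not be enough to identify a report.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text, drop punctuation, and join words with underscores."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    return re.sub(r"\s+", "_", text.strip())   # whitespace runs become "_"
```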

You should create an internal surrogate key which is a simple integer.

You're simply translating a long key to a "report" which either exists as a file or will be created as a file.

How would you recommend mapping the input string to the file, though? Hashing will have too many collisions for small strings, won't it?
@Bill, consider using SHA256 or similar for the hash. Work out the expected cost: the damage (in $$) multiplied by the likelihood of a collision. If the resulting figure is not small enough, use a bigger hash. Rinse, repeat. Sometimes you can't live with a hash collision at all, e.g. medical data or some such. Never depend on a hash in that case.
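As a sketch of the hashing approach suggested in this comment, the two inputs from the question (the raw input string and the tool version) could be combined into one fixed-length key with Python's standard `hashlib`. The function name `cache_key` and the NUL separator are assumptions; the separator only ensures that different version/input pairs cannot produce the same byte stream.

```python
import hashlib


def cache_key(raw_input: str, tool_version: str) -> str:
    """Derive a 64-character hex key from the report input and tool version."""
    h = hashlib.sha256()
    h.update(tool_version.encode("utf-8"))
    h.update(b"\0")  # separator between the two fields
    h.update(raw_input.encode("utf-8"))
    return h.hexdigest()
```

The resulting key is short enough to use as a file name or cache key regardless of how long the input is, and a new tool version automatically yields a new key.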

John La Rooy · Accepted Answer · 2009-10-19 10:56:26Z


The usual thing is to use a reverse proxy like Squid or Varnish
