1

I have a web server that is dynamically creating various reports in several formats (pdf and doc files). The files require a fair amount of CPU to generate, and it is fairly common to have situations where two people are creating the same report with the same input.

Inputs:

  • raw data input as a string (equations, numbers, and lists of words) of arbitrary length; about 99% of inputs will be under roughly 200 words
  • the version of the report creation tool

When a user attempts to generate a report, I would like to check whether a file already exists for the given input and, if so, return a link to it. If the file doesn't exist yet, I would like to generate it on demand.

  1. What solutions are already out there? I've cached simple HTTP requests before, but the keys were extremely simple (usually database IDs).

  2. If I have to do this myself, what is the best way? The input can be several hundred words, and I am wondering how I should transform the strings into keys for the cache.

    // entire input: uses too much memory, one-to-one mapping
    cache['one two three four five six seven eight nine ten eleven...']

    // short keys
    cache['one two'] => 5 results, then I must narrow these down even more

  3. Is this something that should be done in a database, or is it better done within the web app code (Python in my case)?

Thank you, everyone.

1
  • Which web framework are you using? Some frameworks have caching features built-in. Commented Oct 19, 2009 at 10:46

2 Answers

2

This is what Apache is for.

Create a directory that will have the reports.

Configure Apache to serve files from that directory.

If the report exists, redirect to a URL that Apache will serve.

Otherwise, the report doesn't exist, so create it. Then redirect to a URL that Apache will serve.
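The check-then-create flow above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `REPORT_DIR`, `URL_PREFIX`, `build_report`, and `report_url` are hypothetical names, the file name is assumed to have been derived from the input already, and the real generator would replace the placeholder.

```python
import os
import tempfile

# Hypothetical locations: REPORT_DIR is the directory Apache is configured
# to serve, URL_PREFIX is the URL path that maps onto it.
REPORT_DIR = os.path.join(tempfile.gettempdir(), "reports")
URL_PREFIX = "/reports/"


def build_report(raw_input, path):
    """Placeholder for the CPU-heavy report generator (assumption)."""
    with open(path, "w") as f:
        f.write("report for: " + raw_input)


def report_url(file_name, raw_input):
    """Return the URL to redirect to, generating the file only if missing."""
    os.makedirs(REPORT_DIR, exist_ok=True)
    path = os.path.join(REPORT_DIR, file_name)
    if not os.path.exists(path):
        # Write under a temporary name, then rename, so Apache never
        # serves a half-written report.
        tmp_path = path + ".tmp"
        build_report(raw_input, tmp_path)
        os.replace(tmp_path, path)
    return URL_PREFIX + file_name
```

The web app would then answer the request with a redirect to the returned URL. Two simultaneous requests for the same missing report could still both generate it; this sketch does not add locking.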


There's no "hashing". You have a key ("a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words") and a value, which is a file. Don't waste time on a hash. You just have a long key.

You can compress this key somewhat by making a "slug" out of it: remove punctuation, replace spaces with _, that kind of thing.

You should create an internal surrogate key which is a simple integer.

You're simply translating a long key to a "report" which either exists as a file or will be created as a file.
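A minimal sketch of the slug and surrogate-key ideas, assuming an in-memory dict stands in for whatever table would hold the mapping in practice. The names `slugify` and `surrogate_key` are made up for illustration.

```python
import itertools
import re

_next_id = itertools.count(1)
_key_to_id = {}  # (tool_version, raw_input) -> surrogate integer


def slugify(raw_input):
    """Lowercase, drop punctuation, replace whitespace runs with '_'."""
    text = re.sub(r"[^\w\s]", "", raw_input.lower())
    return re.sub(r"\s+", "_", text.strip())


def surrogate_key(raw_input, tool_version):
    """Return a small integer for the full (version, input) key."""
    key = (tool_version, raw_input)
    if key not in _key_to_id:
        _key_to_id[key] = next(_next_id)
    return _key_to_id[key]
```

Note that a slug is lossy: with this `slugify`, `"x + y"` and `"x - y"` both reduce to `x_y`, which matters when the input contains equations. The surrogate key is therefore looked up from the exact input and tool version, and the integer (for example `17.pdf`) can serve as the file name.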


2 Comments

How would you recommend mapping the input string to the file, though? Hashing will have too many collisions for small strings, won't it?
@Bill, consider using SHA256 or similar for the hash. Work out the cost of a collision (damage in $$) multiplied by the likelihood of a collision. If the resulting figure is not small enough, use a bigger hash; rinse and repeat. Sometimes you can't live with a hash collision at all, e.g. with medical data. Never depend on a hash in that case.
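The SHA-256 suggestion in the comment above could look roughly like this in Python, using the standard `hashlib` module. The function name `cache_key` and the NUL separator between the two inputs are assumptions made for this sketch.

```python
import hashlib


def cache_key(raw_input, tool_version):
    """Derive a fixed-length key from the tool version and the raw input."""
    h = hashlib.sha256()
    h.update(tool_version.encode("utf-8"))
    # Separator so ("1", "2x") and ("12", "x") cannot hash the same bytes
    # (assumes the version string never contains a NUL character).
    h.update(b"\0")
    h.update(raw_input.encode("utf-8"))
    return h.hexdigest()  # 64 hexadecimal characters
```

The digest is the same length however long the input is, so it can be used directly as a file name such as `cache_key(text, version) + ".pdf"`.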
1

The usual approach is to use a reverse proxy such as Squid or Varnish.
