I have roughly 4 million records per day and have to keep 7 years' worth online, so we are looking at roughly 10.2 billion records that I need to be able to search through. The users expect searches to be fast enough for a UI, returning results in 3-5 seconds.
Due to politics out of my control, I can't use an off-the-shelf database solution, because that would mean handing the database to another team to manage (don't ask), which means I lose the ability to optimize the hardware and software: they run a one-size-fits-all service for databases and charge (internally) by the GB. I'm sure I'm going to get comments suggesting I push back on this; I already have, and management understand that what they are asking me to do is ridiculous.
I have been looking at using Lucene as the core of my solution: storing the actual data partitioned by type and by day in flat files, then using a Lucene document to index the fields that are searched on, with the only "Stored" field being the id of the record (so that I can read the full record back from the flat file).
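Something like the sketch below is what I have in mind for the indexing side (the field names, paths, and the encoded record id are just placeholders I made up; exact field classes differ between Lucene versions, this assumes roughly Lucene 5+):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class EventIndexer {
    public static void main(String[] args) throws Exception {
        // One index per partition, e.g. per day (path is a placeholder)
        Directory dir = FSDirectory.open(Paths.get("/data/index/2015-01-01"));
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            // Fields that are searched on: indexed but NOT stored
            doc.add(new StringField("eventId", "E-12345", Field.Store.NO));
            doc.add(new StringField("actionType", "UPDATE", Field.Store.NO));
            // Doc values so the "latest action" queries can sort by time
            doc.add(new NumericDocValuesField("timestamp", 1420070400000L));
            // The ONLY stored field: enough to locate the record in the flat file
            // (how the id maps to a file/offset is my assumption, see below)
            doc.add(new StoredField("recordId", "eventactions_2015-01-01.dat:123456789"));
            writer.addDocument(doc);
        }
    }
}
```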
I am not exactly clued up on Lucene or hard drives, but as I understand it there will be initial IO/seek time to search the index; then, once I have all the matching Lucene document IDs, I read those documents, which incurs further IO/seek time; and then I read the actual record from the flat file. Given the size of the dataset, I can't imagine this will be very quick, which worries me slightly.
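For the flat-file read itself, my working assumption is that the stored record id can encode the file (or the file is already known from the partition) plus a byte offset and length, so each record costs one seek and one read, roughly:

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FlatFileReader {
    // Read one record at a known byte offset.
    // The (file, offset, length) layout is my assumption about how the
    // stored Lucene id maps to the flat file, not anything Lucene provides.
    public static String readRecord(String path, long offset, int length) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```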
Lucene has a maximum of roughly 2.1 billion documents per index, so I will require multiple indices here.
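My understanding is that the partitioned indices can still be searched as one logical index via a MultiReader, something like this (I haven't tried this at anything close to this scale):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Open one reader per daily/partition index (paths are placeholders)
        IndexReader day1 = DirectoryReader.open(FSDirectory.open(Paths.get("/data/index/2015-01-01")));
        IndexReader day2 = DirectoryReader.open(FSDirectory.open(Paths.get("/data/index/2015-01-02")));

        // MultiReader presents the partitions as a single logical index,
        // while each underlying index stays under the ~2.1 billion doc limit
        try (MultiReader multi = new MultiReader(day1, day2)) {
            IndexSearcher searcher = new IndexSearcher(multi);
            // ... run queries against 'searcher' as usual
        }
    }
}
```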
Does this approach, on the face of it, look like it could work?
The data I am storing is event-action data. Most of the queries will group by event id and get the latest event-action details for a particular event. Some of the queries will analyse large sets of events and their individual event-actions.
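To make the main access pattern concrete, the "latest event-action for a given event id" query would look roughly like this on top of the placeholder fields above (the analytical queries over large sets of events are the part I'm least sure Lucene will handle well):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class LatestActionQuery {
    // Return the stored record id of the most recent action for one event.
    // Relies on the NumericDocValuesField("timestamp") added at index time.
    public static String latestRecordId(IndexSearcher searcher, String eventId) throws Exception {
        Sort newestFirst = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
        TopDocs top = searcher.search(new TermQuery(new Term("eventId", eventId)), 1, newestFirst);
        if (top.scoreDocs.length == 0) {
            return null; // no actions recorded for this event
        }
        ScoreDoc hit = top.scoreDocs[0];
        Document doc = searcher.doc(hit.doc);
        return doc.get("recordId"); // used to pull the full record from the flat file
    }
}
```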