I have roughly 4 million records per day and have to keep 7 years' worth online, so we are looking at roughly 10.2 billion records that I need to be able to search through. The users expect searches to be fast enough for a UI, returning results in 3-5 seconds.
Due to politics out of my control, I can't use an off-the-shelf database solution, because that would mean handing the database to another team to manage (don't ask), which means I lose the ability to optimize the hardware and software: they run a one-size-fits-all service for databases and charge (internally) by the GB. I'm sure I'm going to get comments suggesting I push back on this; I already have, and management understand that what they are asking me to do is ridiculous.
I have been looking at using Lucene as the core of my solution: storing the actual data partitioned by type and by day in flat files, then using a Lucene document to index the fields that are searched on, with the only "Stored" field being the id of the record (so that I can read the full record back from the flat file).
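Something like the sketch below is what I have in mind for the indexing side (the field names, paths, and the encoded record id are just placeholders I made up; exact field classes differ between Lucene versions, this assumes roughly Lucene 5+):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class EventIndexer {
    public static void main(String[] args) throws Exception {
        // One index per partition, e.g. per day (path is a placeholder)
        Directory dir = FSDirectory.open(Paths.get("/data/index/2015-01-01"));
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            // Fields that are searched on: indexed but NOT stored
            doc.add(new StringField("eventId", "E-12345", Field.Store.NO));
            doc.add(new StringField("actionType", "UPDATE", Field.Store.NO));
            // Doc values so the "latest action" queries can sort by time
            doc.add(new NumericDocValuesField("timestamp", 1420070400000L));
            // The ONLY stored field: enough to locate the record in the flat file
            // (how the id maps to a file/offset is my assumption, see below)
            doc.add(new StoredField("recordId", "eventactions_2015-01-01.dat:123456789"));
            writer.addDocument(doc);
        }
    }
}
```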
I am not exactly clued up on Lucene or hard drives, but as I understand it there will be initial IO/seek time to search the index; then, once I have all the matching Lucene document IDs, I read those documents, which incurs further IO/seek time; and then I read the actual record from the flat file. Given the size of the dataset, I can't imagine this will be very quick, which worries me slightly.
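For the flat-file read itself, my working assumption is that the stored record id can encode the file (or the file is already known from the partition) plus a byte offset and length, so each record costs one seek and one read, roughly:

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FlatFileReader {
    // Read one record at a known byte offset.
    // The (file, offset, length) layout is my assumption about how the
    // stored Lucene id maps to the flat file, not anything Lucene provides.
    public static String readRecord(String path, long offset, int length) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```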
Lucene has a maximum of roughly 2.1 billion documents per index, so I will require multiple indices here.
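My understanding is that the partitioned indices can still be searched as one logical index via a MultiReader, something like this (I haven't tried this at anything close to this scale):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Open one reader per daily/partition index (paths are placeholders)
        IndexReader day1 = DirectoryReader.open(FSDirectory.open(Paths.get("/data/index/2015-01-01")));
        IndexReader day2 = DirectoryReader.open(FSDirectory.open(Paths.get("/data/index/2015-01-02")));

        // MultiReader presents the partitions as a single logical index,
        // while each underlying index stays under the ~2.1 billion doc limit
        try (MultiReader multi = new MultiReader(day1, day2)) {
            IndexSearcher searcher = new IndexSearcher(multi);
            // ... run queries against 'searcher' as usual
        }
    }
}
```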
Does this approach, on the face of it, look like it could work?
The data I am storing is event-action data. Most of the queries will group by event id and get the latest event-action details for a particular event. Some of the queries will analyse large sets of events and their individual event-actions.
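To make the main access pattern concrete, the "latest event-action for a given event id" query would look roughly like this on top of the placeholder fields above (the analytical queries over large sets of events are the part I'm least sure Lucene will handle well):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class LatestActionQuery {
    // Return the stored record id of the most recent action for one event.
    // Relies on the NumericDocValuesField("timestamp") added at index time.
    public static String latestRecordId(IndexSearcher searcher, String eventId) throws Exception {
        Sort newestFirst = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
        TopDocs top = searcher.search(new TermQuery(new Term("eventId", eventId)), 1, newestFirst);
        if (top.scoreDocs.length == 0) {
            return null; // no actions recorded for this event
        }
        ScoreDoc hit = top.scoreDocs[0];
        Document doc = searcher.doc(hit.doc);
        return doc.get("recordId"); // used to pull the full record from the flat file
    }
}
```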