I have been tasked with implementing a solution (app and DB) to store the data samples from a huge sensor array. The array currently consists of about 20,000 sensors, but it will soon grow to as many as 100,000 sensors. Each sensor sends a data sample every 10 seconds, and each sample is 28 bytes in size.
Doing the sums thus leads to:
- 8,640 samples per sensor per day (86,400 s / 10 s)
- ~242 kB of data per sensor per day (8,640 × 28 bytes)
- 864 million samples per day across the full 100,000 sensors
Now I have been wondering what the best way would be to store/retrieve the data. I "joined" this project after the software had already been specified, so it needs to be implemented on a Windows platform using SQL Server.
The current solution in my head is to create a DB with two tables to store the data samples. The first serves as a sort of index into the second, which stores the collated samples in a binary field on a per-sensor, per-day basis:
Table 1:
- RecordID - BigInt - Identity
- SensorID - BigInt - Primary Key
- Date - DateTime - Primary Key (yyyy-mm-dd)

Table 2:
- RecordID - BigInt - Primary Key (from an insert into Table 1)
- Data - Binary

Basically I will write the samples from all sensors into temporary files (one per sensor). At the end of each day I will then create an entry in Table 1, use the generated RecordID, and dump the file into the Data field in Table 2.
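A minimal T-SQL sketch of what I have in mind; the table names, the VARBINARY(MAX) type, and the variables are just illustrative assumptions, not a finished design:

```sql
-- Illustrative schema for the two-table idea (names/types are assumptions, not final).
CREATE TABLE SensorDays (
    RecordID   BIGINT IDENTITY(1,1) NOT NULL UNIQUE,
    SensorID   BIGINT   NOT NULL,
    [Date]     DATETIME NOT NULL,               -- yyyy-mm-dd, time part always midnight
    CONSTRAINT PK_SensorDays PRIMARY KEY (SensorID, [Date])
);

CREATE TABLE SensorDayData (
    RecordID   BIGINT NOT NULL
        CONSTRAINT PK_SensorDayData PRIMARY KEY
        CONSTRAINT FK_SensorDayData_SensorDays REFERENCES SensorDays (RecordID),
    Data       VARBINARY(MAX) NOT NULL          -- one day's collated samples (~242 kB)
);

-- End-of-day load for one sensor: create the index row, then store the blob.
DECLARE @SensorID BIGINT        = 42,            -- example values only
        @Date     DATETIME      = '2024-01-01',
        @DayBlob  VARBINARY(MAX) = 0x;           -- would be the day's temporary file contents

INSERT INTO SensorDays (SensorID, [Date])
VALUES (@SensorID, @Date);

INSERT INTO SensorDayData (RecordID, Data)
VALUES (SCOPE_IDENTITY(), @DayBlob);
```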
This way I end up with only 100,000 entries in the table per day, instead of 864 million. The data should be available on the LAN or a high-speed WAN, so retrieval of the sensor data on a whole-day basis would be acceptable.
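Retrieving a whole day for one sensor would then just be a key lookup, something along these lines (again only a sketch, reusing the illustrative names from above):

```sql
-- Fetch one sensor's full day of samples as a single blob.
DECLARE @SensorID BIGINT = 42, @Date DATETIME = '2024-01-01';   -- example values

SELECT d.Data
FROM SensorDays AS s
JOIN SensorDayData AS d
    ON d.RecordID = s.RecordID
WHERE s.SensorID = @SensorID
  AND s.[Date]   = @Date;
```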
Although all of the data has to be stored, most of it will probably never be read. So the number of reads on the table(s) will not be hugely greater than the number of writes.
I know that I could implement something using the file system by just storing the path to the data files, but I have read that SQL Server outperforms NTFS when the binary fields are less than 256 kB. (A gray area exists between 256 kB and 1 MB, and NTFS far outperforms SQL Server for binary sizes greater than 1 MB.)
I am also slightly wary of storing the data from 100,000 sensors in individual files without causing problems in the file system, either by having huge numbers of files in a single folder or by having a complex tree structure with only a few files in each folder, and that is without even taking file fragmentation into account.
Can anyone offer me some practical advice/comments on the above?
Are there obvious pitfalls that I am going to fall into?
The sample data does compress quite nicely: a 242 kB file compresses to about 85 kB. Can I, however, implement some type of compression at the database level so that the sample data (column) is compressed automatically?
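For illustration, the kind of thing I mean (this sketch assumes SQL Server 2016 or later, where the built-in COMPRESS()/DECOMPRESS() functions apply GZIP to a varbinary; it is explicit per statement rather than truly automatic, and it reuses the illustrative names from above):

```sql
-- Explicit GZIP compression of the blob on insert (COMPRESS/DECOMPRESS need SQL Server 2016+).
DECLARE @RecordID BIGINT        = 1,
        @DayBlob  VARBINARY(MAX) = 0x;   -- example values; would be the day's file contents

INSERT INTO SensorDayData (RecordID, Data)
VALUES (@RecordID, COMPRESS(@DayBlob));

-- And the reverse on read:
SELECT DECOMPRESS(Data) AS Data
FROM SensorDayData
WHERE RecordID = @RecordID;
```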
Is SQL Server an obviously wrong choice for this project?
Is my two-table design wise, or could I just as well combine it into a single table that would still be as "performant" as the two tables?
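By a single table I mean roughly the following, again just an illustrative sketch with assumed names and types, with the blob stored inline next to the key columns:

```sql
-- Single-table alternative: the blob lives in the same row as the key.
CREATE TABLE SensorDaySamples (
    SensorID   BIGINT         NOT NULL,
    [Date]     DATETIME       NOT NULL,   -- yyyy-mm-dd
    Data       VARBINARY(MAX) NOT NULL,   -- the day's collated samples (~242 kB)
    CONSTRAINT PK_SensorDaySamples PRIMARY KEY (SensorID, [Date])
);
```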