I have been tasked with implementing a solution (app and DB) to store the data samples from a huge sensor array. The array currently consists of about 20,000 sensors, but it will soon grow to as many as 100,000 sensors. Each sensor sends a data sample every 10 seconds, and each sample is 28 bytes in size.
Doing the sums thus leads to:
- 8,640 samples per sensor per day (86,400 s / 10 s)
- ~242 kB of data per sensor per day (8,640 × 28 bytes)
- 864 million samples per day across the full 100,000 sensors
Now I have been wondering what the best way would be to store/retrieve the data. I "joined" this project after the software had already been specified, so it needs to be implemented on a Windows platform using SQL Server.
The current solution in my head is to create a DB with two tables to store the data samples. The first serves as a sort of index into the second, which stores the collated samples in a binary field on a per-sensor, per-day basis:
Table 1:
- RecordID - BigInt - Identity
- SensorID - BigInt - Primary Key
- Date - DateTime - Primary Key (yyyy-mm-dd)

Table 2:
- RecordID - BigInt - Primary Key (from an insert into Table 1)
- Data - Binary

Basically I will write the samples from all sensors into temporary files (one per sensor). At the end of each day I will then create an entry in Table 1, use the generated RecordID, and dump the file into the Data field in Table 2.
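A minimal T-SQL sketch of what I have in mind; the table names, the VARBINARY(MAX) type, and the variables are just illustrative assumptions, not a finished design:

```sql
-- Illustrative schema for the two-table idea (names/types are assumptions, not final).
CREATE TABLE SensorDays (
    RecordID   BIGINT IDENTITY(1,1) NOT NULL UNIQUE,
    SensorID   BIGINT   NOT NULL,
    [Date]     DATETIME NOT NULL,               -- yyyy-mm-dd, time part always midnight
    CONSTRAINT PK_SensorDays PRIMARY KEY (SensorID, [Date])
);

CREATE TABLE SensorDayData (
    RecordID   BIGINT NOT NULL
        CONSTRAINT PK_SensorDayData PRIMARY KEY
        CONSTRAINT FK_SensorDayData_SensorDays REFERENCES SensorDays (RecordID),
    Data       VARBINARY(MAX) NOT NULL          -- one day's collated samples (~242 kB)
);

-- End-of-day load for one sensor: create the index row, then store the blob.
DECLARE @SensorID BIGINT        = 42,            -- example values only
        @Date     DATETIME      = '2024-01-01',
        @DayBlob  VARBINARY(MAX) = 0x;           -- would be the day's temporary file contents

INSERT INTO SensorDays (SensorID, [Date])
VALUES (@SensorID, @Date);

INSERT INTO SensorDayData (RecordID, Data)
VALUES (SCOPE_IDENTITY(), @DayBlob);
```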
This way I end up with only 100,000 entries in the table per day, instead of 864 million. The data should be available on the LAN or a high-speed WAN, so retrieval of the sensor data on a whole-day basis would be acceptable.
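Retrieving a whole day for one sensor would then just be a key lookup, something along these lines (again only a sketch, reusing the illustrative names from above):

```sql
-- Fetch one sensor's full day of samples as a single blob.
DECLARE @SensorID BIGINT = 42, @Date DATETIME = '2024-01-01';   -- example values

SELECT d.Data
FROM SensorDays AS s
JOIN SensorDayData AS d
    ON d.RecordID = s.RecordID
WHERE s.SensorID = @SensorID
  AND s.[Date]   = @Date;
```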
Although all of the data has to be stored, most of it will probably never be read. So the number of reads on the table(s) will not be hugely greater than the number of writes.
I know that I could implement something using the file system by just storing the path to the data files, but I have read that SQL Server outperforms NTFS when the binary fields are less than 256 kB. (A gray area exists between 256 kB and 1 MB, and NTFS far outperforms SQL Server for binary sizes greater than 1 MB.)
I am also slightly wary of storing the data from 100,000 sensors in individual files without causing problems in the file system, either by having huge numbers of files in a single folder or by having a complex tree structure with only a few files in each folder, and that is without even taking file fragmentation into account.
Can anyone offer me some practical advice/comments on the above?
Are there obvious pitfalls that I am going to fall into?
The sample data does compress quite nicely: a 242 kB file compresses to about 85 kB. Can I, however, implement some type of compression at the database level so that the sample data (column) is compressed automatically?
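For illustration, the kind of thing I mean (this sketch assumes SQL Server 2016 or later, where the built-in COMPRESS()/DECOMPRESS() functions apply GZIP to a varbinary; it is explicit per statement rather than truly automatic, and it reuses the illustrative names from above):

```sql
-- Explicit GZIP compression of the blob on insert (COMPRESS/DECOMPRESS need SQL Server 2016+).
DECLARE @RecordID BIGINT        = 1,
        @DayBlob  VARBINARY(MAX) = 0x;   -- example values; would be the day's file contents

INSERT INTO SensorDayData (RecordID, Data)
VALUES (@RecordID, COMPRESS(@DayBlob));

-- And the reverse on read:
SELECT DECOMPRESS(Data) AS Data
FROM SensorDayData
WHERE RecordID = @RecordID;
```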
Is SQL Server an obviously wrong choice for this project?
Is my two-table design wise, or could I just as well combine it into a single table that would still be as "performant" as the two tables?
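By a single table I mean roughly the following, again just an illustrative sketch with assumed names and types, with the blob stored inline next to the key columns:

```sql
-- Single-table alternative: the blob lives in the same row as the key.
CREATE TABLE SensorDaySamples (
    SensorID   BIGINT         NOT NULL,
    [Date]     DATETIME       NOT NULL,   -- yyyy-mm-dd
    Data       VARBINARY(MAX) NOT NULL,   -- the day's collated samples (~242 kB)
    CONSTRAINT PK_SensorDaySamples PRIMARY KEY (SensorID, [Date])
);
```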