
I have 1,000 sensors, and I need to store the value of each sensor every second, for a month. That is the theoretical target; the most realistically achievable unit of time will be determined once I start testing, but it will be 5 seconds at most.

If I do a single entry for each sensor at each second, I can get away with just five columns: entry_id, dim_id, dim_second, dim_date, sensorValue. However, that equates to (86,400 * 1,000) * 31 = 2,678,400,000 rows in a month. That is a whole lot of rows.
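
For concreteness, the narrow layout would be a sketch like this (the table name and column types are my guesses, nothing is settled yet):

    CREATE TABLE sensor_fact (
        entry_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate key
        dim_id      SMALLINT UNSIGNED NOT NULL,               -- which sensor
        dim_date    DATE NOT NULL,                            -- calendar dimension key
        dim_second  MEDIUMINT UNSIGNED NOT NULL,              -- second of day, 0..86399
        sensorValue FLOAT NOT NULL,
        PRIMARY KEY (entry_id),
        KEY (dim_id, dim_date, dim_second)
    ) ENGINE=InnoDB;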

If instead I had 1,004 columns (one per sensor, plus the key columns), I could do 86,400 * 31 = 2,678,400 rows. That is a ton of columns.

Most of what I've read online tells me it is undesirable to use columns this way in a relational database. However, if I do an entry per sensor, do I dump everything into a single fact table? Of course I wouldn't make 1,000 tables, but should sensors be grouped together in fact tables, maybe by unit (flow, amperage, etc.), to make the data set smaller for queries?

Or maybe a relational database is not the proper choice for this application, and I should consider NoSQL? I have never worked with it.

I'm using a 16 GB, Intel Core i7-4650U machine with a 1 TB SSD for the development environment.

EDIT: I should note that my data pipeline already summarizes this data and dumps it into other tables. The issue is that this has to be an on-prem solution, and I want to be able to bring back big batches of raw data for analysis. Hence the long period and small grain.
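
The summaries are roughly of this shape; the table and column names below are illustrative, not my actual schema:

    -- Illustrative day-level rollup; the real summary tables differ
    INSERT INTO summary_daily (dim_id, dim_date, reading_count, min_value, max_value, avg_value)
    SELECT dim_id, dim_date, COUNT(*), MIN(sensorValue), MAX(sensorValue), AVG(sensorValue)
    FROM sensor_fact
    GROUP BY dim_id, dim_date;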


2 Answers


I don't know about MySQL/MariaDB, but a PostgreSQL database allows you to store arrays. An array of 86,400 sensorValues would allow you to have one record per entry_id, dim_id, dim_date.

create table sensorvalues (
    entry_id    serial primary key,
    dim_id      int,        -- which sensor
    dim_date    date,
    sensorvalue int[86400]  -- one element per second of the day
);
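
Querying one sensor over a period is then ordinary SQL with an array subscript. A sketch, assuming one row per sensor per day and PostgreSQL's 1-based arrays:

    -- Value at second 120 of each day for sensor 42 over one week (illustrative)
    select dim_id, dim_date, sensorvalue[120] as value_at_second_120
    from sensorvalues
    where dim_id = 42
      and dim_date between date '2020-02-01' and date '2020-02-07';

    -- Or expand a whole day back into (second, value) rows
    select dim_date, s.ordinality as second_of_day, s.value
    from sensorvalues,
         unnest(sensorvalue) with ordinality as s(value, ordinality)
    where dim_id = 42 and dim_date = date '2020-02-03';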
  • This would solve the storage issue but I think it would call for complex code when running queries on a specific sensor over a specific time period. Commented Feb 23, 2020 at 2:42
  • Do you consider the use of an SQL select statement complex? Commented Feb 24, 2020 at 12:46
  • Not at all. However, arrays are index-based, i.e. row[33] corresponding to sensor 30's value in my example. I see a potential for code readability issues. Commented Feb 24, 2020 at 12:51
  • I have no idea what "row[33]" has to do with a "sensor 30 value". The layout of the table could be as simple as the one shown; see edit. Commented Feb 24, 2020 at 13:06
  • What does selecting a specific index from the array look like? What I'm getting at is that each sensor has a unique name. The data has to be accessible for generating reports, and we may need to look at just a few sensors over a time period. Linking a positional index to the name creates room for error for people who follow up on my code. Commented Feb 24, 2020 at 13:42

Do not normalize the date and time; simply use TIMESTAMP or DATETIME, possibly to some fraction of a second.

Do have a sensor_id column. Is that what dim_id is?

Plan A

CREATE TABLE Fact (
    sensor_id SMALLINT UNSIGNED NOT NULL,  -- for join to/from table Sensors
    dt DATETIME NOT NULL,                  -- the second
    sensor_value ... NOT NULL,             -- perhaps SMALLINT, perhaps FLOAT
    PRIMARY KEY(sensor_id, dt),
    INDEX(dt)                              -- if needed
) ENGINE=InnoDB
PARTITION BY RANGE(TO_DAYS(dt)) ...

But try to do all the summarizing that anyone needs; then toss old Fact data with DROP PARTITION.
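
A sketch of that maintenance cycle; the partition names and ranges here are illustrative assumptions, not part of the plan above:

    -- Hypothetical daily partitions; names and ranges are illustrative
    ALTER TABLE Fact PARTITION BY RANGE (TO_DAYS(dt)) (
        PARTITION p20200201 VALUES LESS THAN (TO_DAYS('2020-02-02')),
        PARTITION p20200202 VALUES LESS THAN (TO_DAYS('2020-02-03')),
        PARTITION pfuture   VALUES LESS THAN MAXVALUE
    );

    -- Once a day has been summarized, discard its raw rows instantly
    ALTER TABLE Fact DROP PARTITION p20200201;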

Plan B

CREATE TABLE Fact (
    dt DATETIME NOT NULL,        -- the second
    sensor_values JSON NOT NULL, -- mostly an array of sensor_id => value
    PRIMARY KEY(dt)
) ENGINE=InnoDB

This assumes you rarely need to access the Fact data. That is, it assumes the summaries handle most situations.
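
Pulling a single sensor back out of the JSON is doable but clumsier than Plan A. A sketch using MySQL's JSON_EXTRACT, assuming the blob is keyed by sensor_id:

    -- Assuming sensor_values is stored as {"1": 3.2, "2": 7.9, ...}
    SELECT dt,
           JSON_EXTRACT(sensor_values, '$."42"') AS sensor_42
    FROM Fact
    WHERE dt >= '2020-02-01' AND dt < '2020-02-02';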

  • Rick James, thanks for answering - your website has been a great reference for me throughout this process. sensor_id corresponds to my dim_id table, where I store related attributes. I took the approach of normalizing dim_date in case I need calendar info for a day, and dim_second because the time is related to a shift and shift reports may be needed. I liked the idea of an array because it creates fewer entries and seems easily accessible, though I haven't tested it yet. I can keep the fact data for manual analysis and creating pattern-analysis software. Commented Feb 28, 2020 at 1:14
  • I am already summarizing about 90% of the data I will need for most reports with a data pipeline in other tables, at the day level. Commented Feb 28, 2020 at 1:15
  • 1
    @SamiA - JOINing to get a date range is very likely to be a lot slower than simply specifying a range against a DATETIME in the Fact table. Please provide a sample SELECT so I can either explain further, or back off. In my experience, the summary tables handle all the requirements, and the Fact table just "collects dust". Commented Feb 28, 2020 at 5:12
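
To make that comparison concrete, the two query shapes look roughly like this; the dimension-keyed variant and its table names are assumptions for illustration:

    -- Plan A style: range directly on DATETIME, served by PRIMARY KEY(sensor_id, dt)
    SELECT sensor_id, dt, sensor_value
    FROM Fact
    WHERE sensor_id = 42
      AND dt >= '2020-02-01' AND dt < '2020-02-08';

    -- Dimension-keyed style: the same range needs joins before it can filter
    SELECT f.dim_id, d.calendar_date, s.second_of_day, f.sensorValue
    FROM sensor_fact AS f                              -- hypothetical fact table
    JOIN dim_date   AS d ON d.date_id   = f.dim_date
    JOIN dim_second AS s ON s.second_id = f.dim_second
    WHERE f.dim_id = 42
      AND d.calendar_date BETWEEN '2020-02-01' AND '2020-02-07';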
