
I have 1,000 sensors, and I need to store the value of each sensor every second, for a month. That is the theoretical target; the most realistically achievable unit of time will be determined once I start testing, but it will be 5 seconds at most.

If I do a single entry for each sensor at each second, I can get away with just five columns: entry_id, dim_id, dim_second, dim_date, sensorValue. However, that equates to (86,400 * 1,000) * 31 = 2,678,400,000 rows in a month. That is a whole lot of rows.
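
For concreteness, the narrow layout would be a sketch like this (the table name and column types are my guesses, nothing is settled yet):

    CREATE TABLE sensor_fact (
        entry_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate key
        dim_id      SMALLINT UNSIGNED NOT NULL,               -- which sensor
        dim_date    DATE NOT NULL,                            -- calendar dimension key
        dim_second  MEDIUMINT UNSIGNED NOT NULL,              -- second of day, 0..86399
        sensorValue FLOAT NOT NULL,
        PRIMARY KEY (entry_id),
        KEY (dim_id, dim_date, dim_second)
    ) ENGINE=InnoDB;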

If instead I had 1,004 columns (one per sensor, plus the key columns), I could do 86,400 * 31 = 2,678,400 rows. That is a ton of columns.

Most of what I've read online tells me it is undesirable to use columns this way in a relational database. However, if I do an entry per sensor, do I dump everything into a single fact table? Of course I wouldn't make 1,000 tables, but should sensors be grouped together in fact tables, maybe by unit (flow, amperage, etc.), to make the data set smaller for queries?

Or maybe a relational database is not the proper choice for this application, and I should consider NoSQL? I have never worked with it.

I'm using a 16 GB, Intel Core i7-4650U machine with a 1 TB SSD for the development environment.

EDIT: I should note that my data pipeline already summarizes this data and dumps it into other tables. The issue is that this has to be an on-prem solution, and I want to be able to bring back big batches of raw data for analysis. Hence the long period and small grain.
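
The summaries are roughly of this shape; the table and column names below are illustrative, not my actual schema:

    -- Illustrative day-level rollup; the real summary tables differ
    INSERT INTO summary_daily (dim_id, dim_date, reading_count, min_value, max_value, avg_value)
    SELECT dim_id, dim_date, COUNT(*), MIN(sensorValue), MAX(sensorValue), AVG(sensorValue)
    FROM sensor_fact
    GROUP BY dim_id, dim_date;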


2 Answers


I don't know about MySQL/MariaDB, but a PostgreSQL database allows you to store arrays. An array of 86,400 sensorValues would allow you to have one record per entry_id, dim_id, dim_date.

create table sensorvalues (
    entry_id    serial primary key,
    dim_id      int,        -- which sensor
    dim_date    date,
    sensorvalue int[86400]  -- one element per second of the day
);
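
Querying one sensor over a period is then ordinary SQL with an array subscript. A sketch, assuming one row per sensor per day and PostgreSQL's 1-based arrays:

    -- Value at second 120 of each day for sensor 42 over one week (illustrative)
    select dim_id, dim_date, sensorvalue[120] as value_at_second_120
    from sensorvalues
    where dim_id = 42
      and dim_date between date '2020-02-01' and date '2020-02-07';

    -- Or expand a whole day back into (second, value) rows
    select dim_date, s.ordinality as second_of_day, s.value
    from sensorvalues,
         unnest(sensorvalue) with ordinality as s(value, ordinality)
    where dim_id = 42 and dim_date = date '2020-02-03';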
  • This would solve the storage issue but I think it would call for complex code when running queries on a specific sensor over a specific time period. Commented Feb 23, 2020 at 2:42
  • Do you consider the use of an SQL select statement complex? Commented Feb 24, 2020 at 12:46
  • Not at all. However, arrays are index-based, i.e. row[33] corresponding to sensor 30's value in my example. I see a potential for code readability issues. Commented Feb 24, 2020 at 12:51
  • I have no idea what "row[33]" has to do with a "sensor 30 value". The layout of the table could be as simple as the one shown; see edit. Commented Feb 24, 2020 at 13:06
  • What does selecting a specific index from the array look like? What I'm getting at is that each sensor has a unique name. The data has to be accessible for generating reports, and we may need to look at just a few sensors over a time period. Linking a positional index to the name creates room for error for people who follow up on my code. Commented Feb 24, 2020 at 13:42

Do not normalize the date and time; simply use TIMESTAMP or DATETIME, possibly to some fraction of a second.

Do have a sensor_id column. Is that what dim_id is?

Plan A

CREATE TABLE Fact (
    sensor_id SMALLINT UNSIGNED NOT NULL,  -- for join to/from table Sensors
    dt DATETIME NOT NULL,                  -- the second
    sensor_value ... NOT NULL,             -- perhaps SMALLINT, perhaps FLOAT
    PRIMARY KEY(sensor_id, dt),
    INDEX(dt)                              -- if needed
) ENGINE=InnoDB
PARTITION BY RANGE(TO_DAYS(dt)) ...

But try to do all the summarizing that anyone needs; then toss old Fact data with DROP PARTITION.
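
A sketch of that maintenance cycle; the partition names and ranges here are illustrative assumptions, not part of the plan above:

    -- Hypothetical daily partitions; names and ranges are illustrative
    ALTER TABLE Fact PARTITION BY RANGE (TO_DAYS(dt)) (
        PARTITION p20200201 VALUES LESS THAN (TO_DAYS('2020-02-02')),
        PARTITION p20200202 VALUES LESS THAN (TO_DAYS('2020-02-03')),
        PARTITION pfuture   VALUES LESS THAN MAXVALUE
    );

    -- Once a day has been summarized, discard its raw rows instantly
    ALTER TABLE Fact DROP PARTITION p20200201;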

Plan B

CREATE TABLE Fact (
    dt DATETIME NOT NULL,        -- the second
    sensor_values JSON NOT NULL, -- mostly an array of sensor_id => value
    PRIMARY KEY(dt)
) ENGINE=InnoDB

This assumes you rarely need to access the Fact data. That is, it assumes the summaries handle most situations.
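
Pulling a single sensor back out of the JSON is doable but clumsier than Plan A. A sketch using MySQL's JSON_EXTRACT, assuming the blob is keyed by sensor_id:

    -- Assuming sensor_values is stored as {"1": 3.2, "2": 7.9, ...}
    SELECT dt,
           JSON_EXTRACT(sensor_values, '$."42"') AS sensor_42
    FROM Fact
    WHERE dt >= '2020-02-01' AND dt < '2020-02-02';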

  • Rick James, thanks for answering - your website has been a great reference for me throughout this process. sensor_id corresponds to my dim_id table, where I store related attributes. I took the approach of normalizing dim_date in case I need calendar info for a day, and dim_second because the time is related to a shift and shift reports may be needed. I liked the idea of an array because it creates fewer entries and seems easily accessible, though I haven't tested it yet. I can keep the fact data for manual analysis and creating pattern-analysis software. Commented Feb 28, 2020 at 1:14
  • I am already summarizing about 90% of the data I will need for most reports with a data pipeline in other tables, at the day level. Commented Feb 28, 2020 at 1:15
  • 1
    @SamiA - JOINing to get a date range is very likely to be a lot slower than simply specifying a range against a DATETIME in the Fact table. Please provide a sample SELECT so I can either explain further, or back off. In my experience, the summary tables handle all the requirements, and the Fact table just "collects dust". Commented Feb 28, 2020 at 5:12
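
To make that comparison concrete, the two query shapes look roughly like this; the dimension-keyed variant and its table names are assumptions for illustration:

    -- Plan A style: range directly on DATETIME, served by PRIMARY KEY(sensor_id, dt)
    SELECT sensor_id, dt, sensor_value
    FROM Fact
    WHERE sensor_id = 42
      AND dt >= '2020-02-01' AND dt < '2020-02-08';

    -- Dimension-keyed style: the same range needs joins before it can filter
    SELECT f.dim_id, d.calendar_date, s.second_of_day, f.sensorValue
    FROM sensor_fact AS f                              -- hypothetical fact table
    JOIN dim_date   AS d ON d.date_id   = f.dim_date
    JOIN dim_second AS s ON s.second_id = f.dim_second
    WHERE f.dim_id = 42
      AND d.calendar_date BETWEEN '2020-02-01' AND '2020-02-07';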
