I have historical data from around 20 machine sensors, with a time resolution of one second, stored in CSV files that need to be imported into a SQL database before further data handling and analysis.
A representative mockup of the data to import looks like this:
    +---------------------+------------------+------------------+------------------+-----+-------------------+
    | timestamp           | name_of_sensor_0 | name_of_sensor_1 | name_of_sensor_2 | ... | name_of_sensor_19 |
    +---------------------+------------------+------------------+------------------+-----+-------------------+
    | 2019-12-25 05:35:20 | 10               | 11               | 12               | ... | 19                |
    | 2019-12-25 05:35:21 | 20               | 21               | 22               | ... | 29                |
    | 2019-12-25 05:35:22 | 30               | 31               | 32               | ... | 39                |
    | 2019-12-25 05:35:23 | 40               | 41               | 42               | ... | 49                |
    | 2019-12-25 05:35:24 | 50               | 51               | 52               | ... | 59                |
    | 2019-12-25 05:35:25 | 60               | 61               | 62               | ... | 69                |
    +---------------------+------------------+------------------+------------------+-----+-------------------+

For each sensor there is some descriptive metadata that should also be available in the database, since it might contain important information needed to gain insights during further data analysis. Each sensor has the following metadata:
- designator
- acquisition_channel
- system_code
- unit_code
- sub_system_code
- function_code
- major_counting_number
- minor_counting_number
- measurement_unit
- description

In order to combine the readings and the metadata, I thought about using two SQL tables. One contains all the metadata as given above:
SensorTable

- id
- designator
- acquisition_channel
- unit_code
- system_code
- sub_system_code
- function_code
- major_counting_number
- minor_counting_number
- measurement_unit
- description

And another table contains the readings of each sensor at a given timestamp. To be able to combine the data from both tables, I could perform a JOIN on the sensor_id, which is a foreign key (a query sketch is included at the end of this post):
ReadingsTable

- id
- timestamp
- sensor_id
- value

Both table definitions are implemented using `sqlalchemy` like this:

    from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import relationship

    Base = declarative_base()


    class Sensor(Base):
        __tablename__ = 'sensors'

        id = Column(Integer, primary_key=True, autoincrement=True)
        designator = Column(String, unique=True, nullable=False)
        acquisition_channel = Column(String)
        unit_code = Column(String)
        system_code = Column(String)
        sub_system_code = Column(String)
        function_code = Column(String)
        major_counting_number = Column(String)
        minor_counting_number = Column(String)
        measurement_unit = Column(String, nullable=False)
        description = Column(String)

        readings = relationship('Reading')


    class Reading(Base):
        __tablename__ = 'readings'

        id = Column(Integer, primary_key=True, autoincrement=True)
        timestamp = Column(DateTime, nullable=False)
        sensor_id = Column(Integer, ForeignKey('sensors.id'), nullable=False)
        value = Column(Float, nullable=False)
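To get the CSV data into this schema, I plan to do something along these lines (a simplified sketch that continues from the class definitions above; pandas, the file name `machine_data.csv`, the SQLite URL and the `'unknown'` placeholder for the measurement unit are only assumptions for illustration):

    # Simplified import sketch -- file name, database URL and the metadata
    # placeholders are assumptions, not the real values.
    import pandas as pd
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine('sqlite:///sensors.db')
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # One Sensor row per sensor column of the CSV (metadata shortened here).
    df = pd.read_csv('machine_data.csv', parse_dates=['timestamp'])
    sensors = {}
    for name in df.columns.drop('timestamp'):
        sensors[name] = Sensor(designator=name, measurement_unit='unknown')
        session.add(sensors[name])
    session.commit()

    # Reshape the wide CSV (one column per sensor) into the long format of the
    # readings table and insert all rows.
    long_df = df.melt(id_vars='timestamp', var_name='designator', value_name='value')
    session.bulk_save_objects([
        Reading(timestamp=row.timestamp.to_pydatetime(),
                sensor_id=sensors[row.designator].id,
                value=float(row.value))
        for row in long_df.itertuples(index=False)
    ])
    session.commit()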
This table design looked pretty obvious to me and should fulfill the very basic principles of normalization. However, after having a look at the resulting table rows, I am wondering whether I need to (or can) normalize the timestamp column any further. Every row in the ReadingsTable contains the reading of one sensor at a given timestamp. Since all sensors measure at exactly the same time, I get a lot of duplicate timestamps in that column. Recalling my data mockup from above, an excerpt of the ReadingsTable would look as follows:

    +-----+---------------------+-----------+-------+
    | id  | timestamp           | sensor_id | value |
    +-----+---------------------+-----------+-------+
    | 60  | 2019-12-25 05:35:22 | 1         | 30    |
    | 61  | 2019-12-25 05:35:22 | 2         | 31    |
    | 62  | 2019-12-25 05:35:22 | 3         | 32    |
    | ... | ...                 | ...       | ...   |
    +-----+---------------------+-----------+-------+

Do I need to normalize the timestamp column any further because of the duplicate entries for each timestamp? How could I do this? Which adjustments should I make to my database / table design?
I had a look at this answer, which suggests a very similar approach but still does not address the duplicates in the timestamp column.
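For completeness, this is roughly the kind of JOIN I have in mind for combining both tables (a sketch; it assumes the `session` from the import snippet above and uses the sensor name `name_of_sensor_0` from the mockup):

    # Fetch all readings of one sensor together with its metadata via the
    # foreign key -- the sensor name and session are taken from the sketches above.
    readings_of_sensor_0 = (
        session.query(Reading.timestamp, Reading.value, Sensor.measurement_unit)
        .join(Sensor, Reading.sensor_id == Sensor.id)
        .filter(Sensor.designator == 'name_of_sensor_0')
        .order_by(Reading.timestamp)
        .all()
    )
    for timestamp, value, unit in readings_of_sensor_0:
        print(timestamp, value, unit)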