Skip to main content
added 363 characters in body
Source Link
Rick James
  • 80.8k
  • 5
  • 52
  • 119

Do NOT normalize any "continuous" or "numeric" values, such as a timestamp.

First of all, TIMESTAMP and DATETIME each take 5 bytes. INT takes 4 bytes. So, that's not much savings. A 3-byte MEDIUMINT UNSIGNED (0..16M) is still not enough savings to make up for the following...

More importantly, if doing a range scan, having the "range" in a "dimension" table will make the query very inefficient.

Readings.id may not be necessary -- The sensor_id plus the timestamp is unique; make the combination the PRIMARY KEY for Reading. Meanwhile, sensor_id should be as small as practical. TINYINT UNSIGNED allows 256 sensors; is that enough? (Note: It will be repeated in readings a huge number of times.)

PS to Akina: When fractional seconds was added, TIMESTAMP went from 4 bytes to 5 or more. The more is if you have fractions. DATETIME went from a packed-decimal representation of 8 bytes to only 5 (or more)

Indexes

Sensors will have its id and any others you will regularly use. Since it is a tiny table; the choices are not important.

The indexes for Readings are very important because of the size of the table ant the likelihood of big queries. You can't really decide until you have the SELECTs sketched out. But I can guess:

PRIMARY KEY(sensor_id, timestamp) INDEX(timestamp) 

The PK allows the efficient gathering of info on a single sensor. And it is adequately efficient for the 20 inserts per second that you appear to need.

I don't know if the secondary index will be of any use.

PARTITIONing

The only likely use for partitioning is if you intend to delete "old" data. We can discuss PARTITION BY RANGE for timeseries. Otherwise, don't use PARTITIONs.

Summary Table(s)

When graphing a month's worth of data to the second. No matter how the data is stored and indexed, it will take a lot of time to fetch a few millin rows, only to throw away most of the info when building the graph. We can discuss Summary Tables further.

References (for partitioning, summary tables, etc): http://mysql.rjweb.org/

Do NOT normalize any "continuous" or "numeric" values, such as a timestamp.

First of all, TIMESTAMP and DATETIME each take 5 bytes. INT takes 4 bytes. So, that's not much savings. A 3-byte MEDIUMINT UNSIGNED (0..16M) is still not enough savings to make up for the following...

More importantly, if doing a range scan, having the "range" in a "dimension" table will make the query very inefficient.

Readings.id may not be necessary -- The sensor_id plus the timestamp is unique; make the combination the PRIMARY KEY for Reading. Meanwhile, sensor_id should be as small as practical. TINYINT UNSIGNED allows 256 sensors; is that enough? (Note: It will be repeated in readings a huge number of times.)

PS to Akina: When fractional seconds was added, TIMESTAMP went from 4 bytes to 5 or more. The more is if you have fractions. DATETIME went from a packed-decimal representation of 8 bytes to only 5 (or more)

Indexes

Sensors will have its id and any others you will regularly use. Since it is a tiny table; the choices are not important.

The indexes for Readings are very important because of the size of the table ant the likelihood of big queries. You can't really decide until you have the SELECTs sketched out. But I can guess:

PRIMARY KEY(sensor_id, timestamp) INDEX(timestamp) 

The PK allows the efficient gathering of info on a single sensor. And it is adequately efficient for the 20 inserts per second that you appear to need.

I don't know if the secondary index will be of any use.

PARTITIONing

The only likely use for partitioning is if you intend to delete "old" data. We can discuss PARTITION BY RANGE for timeseries. Otherwise, don't use PARTITIONs.

Do NOT normalize any "continuous" or "numeric" values, such as a timestamp.

First of all, TIMESTAMP and DATETIME each take 5 bytes. INT takes 4 bytes. So, that's not much savings. A 3-byte MEDIUMINT UNSIGNED (0..16M) is still not enough savings to make up for the following...

More importantly, if doing a range scan, having the "range" in a "dimension" table will make the query very inefficient.

Readings.id may not be necessary -- The sensor_id plus the timestamp is unique; make the combination the PRIMARY KEY for Reading. Meanwhile, sensor_id should be as small as practical. TINYINT UNSIGNED allows 256 sensors; is that enough? (Note: It will be repeated in readings a huge number of times.)

PS to Akina: When fractional seconds was added, TIMESTAMP went from 4 bytes to 5 or more. The more is if you have fractions. DATETIME went from a packed-decimal representation of 8 bytes to only 5 (or more)

Indexes

Sensors will have its id and any others you will regularly use. Since it is a tiny table; the choices are not important.

The indexes for Readings are very important because of the size of the table ant the likelihood of big queries. You can't really decide until you have the SELECTs sketched out. But I can guess:

PRIMARY KEY(sensor_id, timestamp) INDEX(timestamp) 

The PK allows the efficient gathering of info on a single sensor. And it is adequately efficient for the 20 inserts per second that you appear to need.

I don't know if the secondary index will be of any use.

PARTITIONing

The only likely use for partitioning is if you intend to delete "old" data. We can discuss PARTITION BY RANGE for timeseries. Otherwise, don't use PARTITIONs.

Summary Table(s)

When graphing a month's worth of data to the second. No matter how the data is stored and indexed, it will take a lot of time to fetch a few millin rows, only to throw away most of the info when building the graph. We can discuss Summary Tables further.

References (for partitioning, summary tables, etc): http://mysql.rjweb.org/

Source Link
Rick James
  • 80.8k
  • 5
  • 52
  • 119

Do NOT normalize any "continuous" or "numeric" values, such as a timestamp.

First of all, TIMESTAMP and DATETIME each take 5 bytes. INT takes 4 bytes. So, that's not much savings. A 3-byte MEDIUMINT UNSIGNED (0..16M) is still not enough savings to make up for the following...

More importantly, if doing a range scan, having the "range" in a "dimension" table will make the query very inefficient.

Readings.id may not be necessary -- The sensor_id plus the timestamp is unique; make the combination the PRIMARY KEY for Reading. Meanwhile, sensor_id should be as small as practical. TINYINT UNSIGNED allows 256 sensors; is that enough? (Note: It will be repeated in readings a huge number of times.)

PS to Akina: When fractional seconds was added, TIMESTAMP went from 4 bytes to 5 or more. The more is if you have fractions. DATETIME went from a packed-decimal representation of 8 bytes to only 5 (or more)

Indexes

Sensors will have its id and any others you will regularly use. Since it is a tiny table; the choices are not important.

The indexes for Readings are very important because of the size of the table ant the likelihood of big queries. You can't really decide until you have the SELECTs sketched out. But I can guess:

PRIMARY KEY(sensor_id, timestamp) INDEX(timestamp) 

The PK allows the efficient gathering of info on a single sensor. And it is adequately efficient for the 20 inserts per second that you appear to need.

I don't know if the secondary index will be of any use.

PARTITIONing

The only likely use for partitioning is if you intend to delete "old" data. We can discuss PARTITION BY RANGE for timeseries. Otherwise, don't use PARTITIONs.