I am a student who has been given the task of designing a database for sensor data. My university currently has a large database that is being filled with this data, but a lot of what is being stored is not necessary. They want me to extract some of the fields from the existing database and insert them into a new one, which will only hold the 'essentials'. I will need to extract every row from the old one, as well as fetch new data once a day.
- There are 1500 sensors.
- They generate a reading every minute.
- Approximately 2.1 million readings every day
- The current database has about 250 million rows.
The typical query will select the readings for a set of sensors within a given time span.
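For reference, such a query would look roughly like this (table and column names are taken from the schema further down; the sensor IDs and the time range are made-up examples):

SELECT SENSOR_ID, TIMESTAMP, VALUE
FROM READINGS
WHERE SENSOR_ID IN (12, 57, 104)
  AND TIMESTAMP BETWEEN 1609459200 AND 1612137600;  -- 2021-01-01 to 2021-02-01 (UNIX epoch, UTC)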
I was initially naive about the added complexity that large amounts of data introduce, so I grossly underestimated the time needed for this task. Because of this, and because I don't have access to the server from home, I am here asking for help and input.
The initial design looks like this:
CREATE TABLE IF NOT EXISTS SENSORS (
    ID smallint UNSIGNED NOT NULL AUTO_INCREMENT,
    NAME varchar(500) NOT NULL UNIQUE,
    VALUEFACETS varchar(500) NOT NULL,
    PRIMARY KEY (ID)
);

CREATE TABLE IF NOT EXISTS READINGS (
    ID int UNSIGNED AUTO_INCREMENT,
    TIMESTAMP int UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    PRIMARY KEY (ID),
    INDEX (TIMESTAMP),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);

Design Question
My first question is whether I should keep an auto-incremented key for the readings, or if it would be more beneficial to have a composite key on TIMESTAMP (UNIX epoch) and SENSOR_ID.
This question applies both to the fact that I have to insert 2.1 million rows per day, as well as the fact that I want to optimize for the aforementioned queries.
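To make the alternative concrete, this is roughly what I mean by the composite-key version (same columns, just without the surrogate ID; the key order shown follows the order I described above):

CREATE TABLE IF NOT EXISTS READINGS (
    TIMESTAMP int UNSIGNED NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    PRIMARY KEY (TIMESTAMP, SENSOR_ID),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);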
Initial Bulk insert:
After a lot of trial and error and finding a guide online, I have found that inserting with LOAD DATA INFILE will best suit this purpose. I have written a script that selects 500 000 rows at a time from the old database and writes them (all 250 million) to a CSV file, which will look like this:
TIMESTAMP,SENSOR_ID,VALUE,STATUS
2604947572,1399,96.434564,1432543

My plan is then to sort it with GNU sort and split it into files containing 1 million rows each.
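For illustration, a single 500 000-row chunk of that export could look something like this. OLD_READINGS, its columns, the ID window, and the output path are all placeholders, not the real names in the existing database:

-- One 500 000-row window of the export; the script advances the ID range each pass.
-- The output path must be under the server's secure_file_priv directory.
SELECT UNIX_TIMESTAMP(R.CREATED_AT), R.SENSOR_ID, R.VALUE, R.STATUS
FROM OLD_READINGS R
WHERE R.ID > 0 AND R.ID <= 500000
INTO OUTFILE '/var/lib/mysql-files/readings_chunk_001.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';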
Before inserting these files, I will remove the index on TIMESTAMP and run these commands:
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION transaction_isolation = 'READ-UNCOMMITTED';
SET sql_log_bin = 0;

After inserting, I will of course revert these changes.
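For reference, each chunk would then be loaded roughly like this. The file name is a placeholder, and I assume the secondary index got the default name TIMESTAMP:

-- Drop the secondary index before the bulk load, then load each 1-million-row chunk.
ALTER TABLE READINGS DROP INDEX `TIMESTAMP`;

LOAD DATA INFILE '/var/lib/mysql-files/readings_chunk_001.csv'
INTO TABLE READINGS
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES                           -- skip the CSV header row
(TIMESTAMP, SENSOR_ID, VALUE, STATUS);   -- ID is left to auto-increment

-- ... repeat for the remaining files, then rebuild the index:
ALTER TABLE READINGS ADD INDEX (TIMESTAMP);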
Is this plan at all viable?
Can the inserts be sped up if I sort the CSV by SENSOR_ID and TIMESTAMP instead of TIMESTAMP and SENSOR_ID?
Once the index is back in place after the bulk insert, will inserting roughly 2 million rows each day still be feasible?
Is it possible to do the daily inserts with regular INSERT statements, or will I have to use LOAD DATA INFILE to keep up with the input load?
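By "regular INSERT statements" I mean batching many rows per statement, roughly like this (the values are made up):

INSERT INTO READINGS (TIMESTAMP, SENSOR_ID, VALUE, STATUS)
VALUES
    (1612137600, 1, 20.125, 0),
    (1612137600, 2, 19.870, 0),
    (1612137660, 1, 20.250, 0);
-- ... up to a few thousand rows per statement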
my.cnf
Every configuration is default except for these:
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_size=5GB
innodb_flush_method=O_DIRECT
innodb_doublewrite=0

Are there any other optimizations I need for this particular purpose?
The server has 8 GB of RAM and runs mysqld Ver 8.0.22 on Ubuntu 20.04.
Any thoughts, ideas or inputs would be greatly appreciated.