I am a student who has been given the task of designing a database for sensor data. My university currently has a large database that is being filled with this data, but a lot of what is being stored is not necessary. They want me to extract some of the fields from the existing database and insert them into a new one, which will only hold the 'essentials'. I will need to extract every row from the old one, as well as fetch new data once a day.
- There are 1500 sensors.
- They generate a reading every minute.
- Approximately 2.1 million readings every day
- The current database has about 250 million rows.
The typical query will select the readings for a set of sensors within a given time span.
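For reference, such a query would look roughly like this (table and column names are taken from the schema further down; the sensor IDs and the time range are made-up examples):

SELECT SENSOR_ID, TIMESTAMP, VALUE
FROM READINGS
WHERE SENSOR_ID IN (12, 57, 104)
  AND TIMESTAMP BETWEEN 1609459200 AND 1612137600;  -- 2021-01-01 to 2021-02-01 (UNIX epoch, UTC)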
I was initially naive about the added complexity that large amounts of data introduce, so I grossly underestimated the time needed for this task. Because of this, and because I don't have access to the server from home, I am here asking for help and input.
The initial design looks like this:
CREATE TABLE IF NOT EXISTS SENSORS (
    ID smallint UNSIGNED NOT NULL AUTO_INCREMENT,
    NAME varchar(500) NOT NULL UNIQUE,
    VALUEFACETS varchar(500) NOT NULL,
    PRIMARY KEY (ID)
);

CREATE TABLE IF NOT EXISTS READINGS (
    ID int UNSIGNED AUTO_INCREMENT,
    TIMESTAMP int UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    PRIMARY KEY (ID),
    INDEX (TIMESTAMP),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);

Design Question
My first question is whether I should keep an auto-incremented key for the readings, or if it would be more beneficial to have a composite key on TIMESTAMP (UNIX epoch) and SENSOR_ID.
This question applies both to the fact that I have to insert 2.1 million rows per day, as well as the fact that I want to optimize for the aforementioned queries.
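To make the alternative concrete, this is roughly what I mean by the composite-key version (same columns, just without the surrogate ID; the key order shown follows the order I described above):

CREATE TABLE IF NOT EXISTS READINGS (
    TIMESTAMP int UNSIGNED NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    PRIMARY KEY (TIMESTAMP, SENSOR_ID),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);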
Initial Bulk insert:
After a lot of trial and error and finding a guide online, I have found that inserting with LOAD DATA INFILE will best suit this purpose. I have written a script that selects 500 000 rows at a time from the old database and writes them (all 250 million) to a CSV file, which will look like this:
TIMESTAMP,SENSOR_ID,VALUE,STATUS
2604947572,1399,96.434564,1432543

My plan is then to sort it with GNU sort and split it into files containing 1 million rows each.
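For illustration, a single 500 000-row chunk of that export could look something like this. OLD_READINGS, its columns, the ID window, and the output path are all placeholders, not the real names in the existing database:

-- One 500 000-row window of the export; the script advances the ID range each pass.
-- The output path must be under the server's secure_file_priv directory.
SELECT UNIX_TIMESTAMP(R.CREATED_AT), R.SENSOR_ID, R.VALUE, R.STATUS
FROM OLD_READINGS R
WHERE R.ID > 0 AND R.ID <= 500000
INTO OUTFILE '/var/lib/mysql-files/readings_chunk_001.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';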
Before inserting these files, I will remove the index on TIMESTAMP and run these commands:
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION transaction_isolation = 'READ-UNCOMMITTED';
SET sql_log_bin = 0;

After inserting, I will of course revert these changes.
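For reference, each chunk would then be loaded roughly like this. The file name is a placeholder, and I assume the secondary index got the default name TIMESTAMP:

-- Drop the secondary index before the bulk load, then load each 1-million-row chunk.
ALTER TABLE READINGS DROP INDEX `TIMESTAMP`;

LOAD DATA INFILE '/var/lib/mysql-files/readings_chunk_001.csv'
INTO TABLE READINGS
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES                           -- skip the CSV header row
(TIMESTAMP, SENSOR_ID, VALUE, STATUS);   -- ID is left to auto-increment

-- ... repeat for the remaining files, then rebuild the index:
ALTER TABLE READINGS ADD INDEX (TIMESTAMP);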
Is this plan at all viable?
Can the inserts be sped up if I sort the CSV by SENSOR_ID and TIMESTAMP instead of TIMESTAMP and SENSOR_ID?
Once the index is back in place after the bulk insert, will inserting roughly 2 million rows each day still be feasible?
Is it possible to do the daily inserts with regular INSERT statements, or will I have to use LOAD DATA INFILE to keep up with the input load?
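By "regular INSERT statements" I mean batching many rows per statement, roughly like this (the values are made up):

INSERT INTO READINGS (TIMESTAMP, SENSOR_ID, VALUE, STATUS)
VALUES
    (1612137600, 1, 20.125, 0),
    (1612137600, 2, 19.870, 0),
    (1612137660, 1, 20.250, 0);
-- ... up to a few thousand rows per statement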
my.cnf
Every configuration is default except for these:
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_size=5GB
innodb_flush_method=O_DIRECT
innodb_doublewrite=0

Are there any other optimizations I need for this particular purpose?
The server has 8 GB of RAM and runs mysqld Ver 8.0.22 on Ubuntu 20.04.
Any thoughts, ideas or inputs would be greatly appreciated.