
I am using Home Assistant with MariaDB, hosted on a Pi 4 with 8 GB of RAM. I have been collecting data for a long time, and a single table has grown to 20 GB. The storage is an M.2 NVMe drive attached via USB 3. CPU usage is normally below 20%, but it climbs to 40% and briefly spikes a few times a day. Memory usage has never been above 2 GB, with more than 6 GB free.

Right now roughly 350 sensors periodically record data to a single table.

Issues to solve:

  1. Performance gets slower over time.
  2. Too much data in one table creates a single point of failure; all my data could be lost by a single error.
  3. Backups are slow, and errors occur because the entire table is locked during a backup and new data can't be added while it is locked.

Solutions I came up with:

  1. Split the table by date (one table per month).
  2. Split the table by sensor number (each sensor gets its own table).
  3. Ask the community for alternative solutions.

Regarding solution 1

All sensors' data is combined, so even if I only ask for 12 sensors the query has to sift through data from all 350. When I add sensors, the table will continue to grow. Will it grow to be a problem? I hope not.

Regarding solution 2

Will having 350 tables (maybe 500 in the future) cause performance issues of its own, even though most tables are used less frequently?

Implementation

I envision creating a view and/or triggers and procedures. I would create a view with the existing table name and use that to split the data.

In either solution 1 or 2, how would I identify and query only the necessary tables transparently, using a view or something similar? When Home Assistant executes a query, it should get back all relevant data as if everything were still in a single table.

Using solution 1, the date would have to be extracted so that if the date range spans more than a single month, the FROM clause would have to be altered to include all relevant months. Instead of SELECT * FROM table WHERE date BETWEEN ..., I would need the same query, but the FROM would be from 062023, 052023, 042023 (or whatever months are relevant).

Using solution 2, if Home Assistant asks SELECT * FROM table WHERE sensor IN (1, 2, 3, 5, 9, 11), the query would need to be rewritten to select from the tables for sensors 1, 2, 3, 5, 9 and 11.

Would using a view allow me to make the necessary changes transparently?
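
For solution 1, this is a minimal sketch of what I imagine, assuming hypothetical per-month tables named states_062023, states_052023, and so on (I realize a UNION ALL view like this is read-only, so inserts would still have to be routed somewhere else):

CREATE VIEW states AS
      SELECT * FROM states_062023
UNION ALL SELECT * FROM states_052023
UNION ALL SELECT * FROM states_042023;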

Is there another solution I have overlooked?

CREATE TABLE `states` (
  `state_id` bigint(20) NOT NULL AUTO_INCREMENT,
  `domain` varchar(64) DEFAULT NULL,
  `entity_id` char(0) DEFAULT NULL,
  `state` varchar(255) DEFAULT NULL,
  `attributes` char(0) DEFAULT NULL,
  `event_id` bigint(20) DEFAULT NULL,
  `last_changed` char(0) DEFAULT NULL,
  `last_changed_ts` double DEFAULT NULL,
  `last_updated` char(0) DEFAULT NULL,
  `created` datetime(6) DEFAULT NULL,
  `last_updated_ts` double DEFAULT NULL,
  `old_state_id` bigint(20) DEFAULT NULL,
  `attributes_id` bigint(20) DEFAULT NULL,
  `context_id` char(0) DEFAULT NULL,
  `context_user_id` char(0) DEFAULT NULL,
  `context_parent_id` char(0) DEFAULT NULL,
  `origin_idx` smallint(6) DEFAULT NULL,
  `context_id_bin` tinyblob DEFAULT NULL,
  `context_user_id_bin` tinyblob DEFAULT NULL,
  `context_parent_id_bin` tinyblob DEFAULT NULL,
  `metadata_id` bigint(20) DEFAULT NULL,
  `last_reported_ts` double DEFAULT NULL,
  PRIMARY KEY (`state_id`),
  KEY `ix_states_last_updated_ts` (`last_updated_ts`),
  KEY `ix_states_context_id_bin` (`context_id_bin`(16)),
  KEY `ix_states_attributes_id` (`attributes_id`),
  KEY `ix_states_old_state_id` (`old_state_id`),
  KEY `ix_states_metadata_id_last_updated_ts` (`metadata_id`,`last_updated_ts`),
  CONSTRAINT `states_ibfk_1` FOREIGN KEY (`old_state_id`) REFERENCES `states` (`state_id`),
  CONSTRAINT `states_ibfk_2` FOREIGN KEY (`attributes_id`) REFERENCES `state_attributes` (`attributes_id`),
  CONSTRAINT `states_ibfk_3` FOREIGN KEY (`metadata_id`) REFERENCES `states_meta` (`metadata_id`)
) ENGINE=InnoDB AUTO_INCREMENT=373236082 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
  • 20 GB of data is a small amount of data. You shouldn't need to even think about partitioning your data for performance reasons. What you should do is add your current table's definition, the definitions of its indexes, and an example query that is slow, ideally with its EXPLAIN ANALYZE output. Commented Jun 12, 2023 at 19:03
  • You need to look into Table Partitioning. It will achieve the desired results by breaking your table into subtables called partitions, improving performance, reducing backup time, etc. Commented Jun 12, 2023 at 20:21
  • @LuisAlbertoBarandiaran - Can you explain how PARTITIONing would help with backup time? Commented Jun 16, 2023 at 21:12
  • @RickJames - Using mysqldump, you can specify a WHERE clause that covers the partition definition. In that scenario, you can create smaller backups of the information that are more manageable (and performant) than a single backup of the entire table. Commented Jun 17, 2023 at 4:54

2 Answers


Seems like adding 2-3G to the innodb_buffer_pool_size will help performance, as will increasing the innodb_log_file_size to at least 2G. If these don't significantly help performance, raise the issue with upstream Home Assistant. Implementing your own mechanism without integration will probably break something.
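
As a rough sketch, assuming MariaDB 10.2+ and no other memory-hungry apps on the Pi (the 4G figure is only a guess; tune to taste):

-- check the current settings first
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_log_file_size';

-- the buffer pool can be resized at runtime
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;

-- innodb_log_file_size is not dynamic: set it under [mysqld] in the server
-- config (e.g. innodb_log_file_size = 2G) and restart the server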

It looks like a MariaDB backup mechanism isn't used by Home Assistant. There are non-locking mechanisms that can perform live backups. A replication replica kept solely for backups will eliminate storage-based failure and provide a server that can be paused during the backup.

  • While the buffer does help with the responsiveness of the GUI, it doesn't help with things like when Home Assistant does database upgrades/changes. Commented Sep 27, 2024 at 18:12

This primary key helps with many of the issues you present.

PRIMARY KEY(sensor_id, ts) 

(where ts is whatever timestamp you have -- and is unique per sensor)

I don't understand your "too much data" issue. Do you fear running out of disk space? That would lead to a crash, not loss of existing data. Granted, you could not collect more data until the issue was resolved.

Will you be purging "old" data? If so, use PARTITION BY RANGE(TO_DAYS(ts)) (see ref below). Otherwise, partitioning is unlikely to help with any performance issues.
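
A minimal sketch of such a table, using a hypothetical readings layout rather than Home Assistant's actual schema (names, sizes, and partition boundaries are only illustrative):

CREATE TABLE readings (
  sensor_id  SMALLINT UNSIGNED NOT NULL,
  ts         DATETIME(6) NOT NULL,
  state      VARCHAR(255),
  PRIMARY KEY (sensor_id, ts)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(ts)) (
  PARTITION p202404 VALUES LESS THAN (TO_DAYS('2024-05-01')),
  PARTITION p202405 VALUES LESS THAN (TO_DAYS('2024-06-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- purging a month becomes a quick metadata operation instead of a huge DELETE
ALTER TABLE readings DROP PARTITION p202404;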

Since the data is essentially write-once, you could manipulate partitions in complex ways to avoid dumping old partitions, but that would be messy. Instead, I suggest you do mysqldump with a WHERE clause to pull out only one hour (or day, or whatever) at a time. The data would not be "ready to reload," but at least it would not be lost. Consider, instead, storing the data in a flat file before inserting it into the database. If you write it as a CSV, then it (or one file per day) would be ready for LOAD DATA to reinsert.
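
A rough sketch of the CSV round trip, reusing the hypothetical readings table from above and an assumed file path:

-- export one day to CSV (the path is an assumption; the server must be allowed to write there)
SELECT sensor_id, ts, state
  INTO OUTFILE '/var/backups/readings_2024-05-01.csv'
  FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n'
  FROM readings
  WHERE ts >= '2024-05-01' AND ts < '2024-05-02';

-- reload it later if needed
LOAD DATA INFILE '/var/backups/readings_2024-05-01.csv'
  INTO TABLE readings
  FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n'
  (sensor_id, ts, state);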

Multiple, manually maintained tables are a big no-no. That sentiment has been stated many times in this and other forums. Partitioning, especially by time, is a common (and viable) alternative. Partitioning provides the "view" semantics transparently.

innodb_buffer_pool_size should be about 5G for your 8GB Pi, assuming no other apps are running.

Another tip: minimize datatype sizes. E.g., sensor_id (for 500 sensors) could be a 2-byte SMALLINT UNSIGNED.

More

  • 100 sensor tables -- No. One table with sensor_id as first column in PRIMARY KEY.

  • TINYBLOB is effectively a VARBINARY(255); switch to an explicitly sized BINARY column instead.

  • Index prefixing is usually useless. Shrink the BINARY to a reasonable size, then index it without prefixing.

  • BIGINT takes 8 bytes; most uses will never need such large numbers. Shrink down to a reasonable INT or smaller -- saving space will help some with speed (see the sketch after this list).

  • Consider normalizing those VARCHARs; a 1-byte TINYINT will save a bunch of space.

  • Single-threaded? Are you referring to INSERTs, or to SELECTs? I can help with either.

  • The boss says you can't change things -- The job market is wide open.
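
A minimal sketch of the datatype shrinking, assuming it is done on a copy of the table (called states_copy here) rather than the live one, since the live table has FOREIGN KEYs and is managed by Home Assistant's recorder:

ALTER TABLE states_copy
  MODIFY metadata_id           SMALLINT UNSIGNED,   -- ~500 sensors fit easily
  MODIFY old_state_id          INT UNSIGNED,        -- assumes ids stay below ~4.3 billion
  MODIFY attributes_id         INT UNSIGNED,
  MODIFY context_id_bin        BINARY(16),          -- fixed size, indexable without a prefix
  MODIFY context_user_id_bin   BINARY(16),
  MODIFY context_parent_id_bin BINARY(16);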

Workaround

Collect the data in the bloated schema that you show. (Few, if any, indexes should be on the table.) In parallel, have another job that copies the data into another table with smaller datatypes, normalized strings, partitioned, etc. This should nearly keep up with the data that is coming in. Perform all queries against this table. Periodically [hourly?] toss the Home Assistant table and start it over.
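
A rough sketch of such a copy job, assuming a hypothetical slimmer target table states_slim and a one-row bookkeeping table that remembers the last copied state_id (all names are illustrative):

-- one-time setup
CREATE TABLE copy_progress (last_state_id BIGINT NOT NULL);
INSERT INTO copy_progress VALUES (0);

-- repeated job: copy only the new rows into the slimmer, partitioned table
INSERT INTO states_slim (state_id, metadata_id, state, last_updated_ts)
  SELECT state_id, metadata_id, state, last_updated_ts
  FROM states
  WHERE state_id > (SELECT last_state_id FROM copy_progress);

UPDATE copy_progress
  SET last_state_id = (SELECT IFNULL(MAX(state_id), 0) FROM states_slim);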

Issues:

  • How to get started with 370M rows already in place.
  • How to do the swap gracefully and safely.

Summary tables

This is a way to avoid lots of indexes on the Fact table, and to speed up queries significantly.

In particular, if you can store all the useful info in summary tables, the need for the giant table goes away.
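
A minimal sketch of one such summary table, again using the hypothetical readings layout from above and an arbitrary hourly granularity:

CREATE TABLE readings_hourly (
  sensor_id SMALLINT UNSIGNED NOT NULL,
  hr        DATETIME NOT NULL,             -- start of the hour
  cnt       INT UNSIGNED NOT NULL,
  min_val   DOUBLE,
  max_val   DOUBLE,
  sum_val   DOUBLE,                        -- store SUM; derive AVG as sum_val / cnt
  PRIMARY KEY (sensor_id, hr)
);

-- roll up the previous hour (run from cron or an EVENT; assumes numeric states)
INSERT INTO readings_hourly (sensor_id, hr, cnt, min_val, max_val, sum_val)
  SELECT sensor_id,
         DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'),
         COUNT(*), MIN(state + 0), MAX(state + 0), SUM(state + 0)
  FROM readings
  WHERE ts >= NOW() - INTERVAL 1 HOUR      -- simplistic window; a real job would track exact boundaries
  GROUP BY sensor_id, DATE_FORMAT(ts, '%Y-%m-%d %H:00:00');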

My Blogs

  • "too much data" (no storage capacity isn't an issue). Recently homeassistant made a change to the index, and the upgrade was running for 8 hours and failed. In addition the size has now grown to 104gb. While the upgrade was running I lost 8+ hours of data. If I create a view pretending to be able single table when it is really many will that be transparent to homeassitant. Commented Sep 27, 2024 at 18:08
  • Massive datasets usually need more careful attention than is provided by just a VIEW. In addition to my Partition blog (mentioned above), see my mysql.rjweb.org/doc.php/mysql_sensor Commented Sep 28, 2024 at 0:22
  • I have added the CREATE TABLE statement. Given the structure of the table, how would a view be affected? What if I made 100+ identical tables, except the names would be states1 ... states100, and put each sensor in its own table? Another thing I noticed: because of the structure of the database, my database is effectively single-threaded. While your link has some good info, I cannot influence the Home Assistant project, so I have to deal with things the best I can. Commented Sep 28, 2024 at 14:19
  • I added some more. Alas, 370M rows is rather late for discussing schema design flaws. Commented Sep 28, 2024 at 19:34
  • Since Home Assistant stores all/most of the data in a single table, it is hard to reduce the field types. Sure, rain may be <10 inches per day, but volts are 0-120 V, temperature is -30 F to 90 F, and wind direction is 0-360 degrees, so Home Assistant had to go with a one-size-fits-all solution. I am trying to solve the all-eggs-in-one-basket issue. Also, if Home Assistant used multiple tables, each table could use 25% of the CPU; as it is, I only see 25% CPU utilization, especially during database upgrades. The last upgrade took 8 hours at 25% CPU when it should have taken 2 hours at 100% CPU. Commented Sep 28, 2024 at 20:21
