
One of the issues I am encountering at present is that we have certain very large tables (>10 million rows). When we reference these large tables or create joins, queries are extremely slow.

One hypothesis for solving the issue is to create pre-computed tables, where the computation for the use cases is done in advance; instead of referencing the raw data, we would query the pre-computed table.
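
For concreteness, here is a minimal sketch of what such a pre-computed table could look like in MySQL. The table and column names (an orders fact table, a daily per-customer summary) are made up for illustration and are not our actual schema:

    -- Hypothetical raw table: orders(id, customer_id, order_date, amount)

    -- Pre-computed (summary) table: one row per customer per day
    CREATE TABLE orders_daily_summary (
        order_date    DATE          NOT NULL,
        customer_id   INT UNSIGNED  NOT NULL,
        order_count   INT UNSIGNED  NOT NULL,
        total_amount  DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (order_date, customer_id)
    );

    -- Populate (or refresh) the summary from the raw data
    INSERT INTO orders_daily_summary (order_date, customer_id, order_count, total_amount)
    SELECT order_date, customer_id, COUNT(*), SUM(amount)
    FROM orders
    GROUP BY order_date, customer_id
    ON DUPLICATE KEY UPDATE
        order_count  = VALUES(order_count),
        total_amount = VALUES(total_amount);

    -- Reports would then hit the small summary table instead of the 10M+ row raw table
    SELECT order_date, SUM(total_amount) AS revenue
    FROM orders_daily_summary
    GROUP BY order_date;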

Are there any resources on how to implement this? Should we use only MySQL, or can we also use Pandas or other such modules to accomplish the same thing?

Which is the optimal way?

  • I don't use ClickHouse, but typically indexes are a good way to optimize joins. Have you considered creating indexes for the join lookups? (A sketch follows these comments.) Commented Sep 7, 2022 at 14:59
  • The problem with pre-computed tables, which are commonly called summary tables, is that you're never sure if the table needs to be re-computed. Checking it is at least as costly as doing the query against the raw data. So it's unsuitable if you need the summary table to be up to date, and your raw data changes frequently. Commented Sep 7, 2022 at 15:01
  • Agreed with Bill^. You should understand why your queries are slow, don't just assume it's because of the number of rows in your tables - that's rarely the reason. More often queries run slow with bigger tables because the queries themselves aren't designed as efficiently as they can be, or there's an architecture problem like missing indexes. Commented Sep 7, 2022 at 16:58
  • I agree with that point. The present design is not optimal and tends to cause large delays when these tables are referenced. Commented Sep 8, 2022 at 9:39
  • However, considering the scale of things, we may soon move to NoSQL or otherwise change the underlying storage architecture. Given the current state of things, we have to make do with what we have at present. Commented Sep 8, 2022 at 9:41
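
A minimal sketch of the index suggestion from the first comment, reusing the hypothetical orders schema above plus an invented order_items table; all names are illustrative:

    -- Hypothetical join that is slow because order_items.order_id has no index
    SELECT o.id, SUM(oi.quantity * oi.unit_price) AS order_total
    FROM orders AS o
    JOIN order_items AS oi ON oi.order_id = o.id
    GROUP BY o.id;

    -- An index on the join column lets MySQL look up matching rows directly
    -- instead of scanning all of order_items for every order
    ALTER TABLE order_items ADD INDEX idx_order_id (order_id);

    -- EXPLAIN shows whether the index is actually used
    EXPLAIN SELECT o.id, SUM(oi.quantity * oi.unit_price)
    FROM orders AS o
    JOIN order_items AS oi ON oi.order_id = o.id
    GROUP BY o.id;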

1 Answer


Yes.

See my blog on Summary Tables. It discusses their purpose (similar to what you describe), how to build them, some metrics on properly sizing them, etc.

Often I see upwards of a 10-fold speedup.

A well-designed Data Warehouse uses the "Fact" table only when you need to fetch individual entries, which is rare. Most queries can be done against the Summary table(s).

And, by using PARTITIONing, you can efficiently toss "old" Fact rows, while keeping the Summary data "forever". This makes disk space more manageable.
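
A minimal sketch of that pattern, assuming a hypothetical fact table partitioned by month (all names and date ranges are illustrative only):

    -- Fact table partitioned by month so old raw data can be dropped cheaply
    CREATE TABLE fact_orders (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        order_date DATE NOT NULL,
        amount     DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (id, order_date)   -- partitioning column must be in every unique key
    )
    PARTITION BY RANGE (TO_DAYS(order_date)) (
        PARTITION p2022_07 VALUES LESS THAN (TO_DAYS('2022-08-01')),
        PARTITION p2022_08 VALUES LESS THAN (TO_DAYS('2022-09-01')),
        PARTITION p2022_09 VALUES LESS THAN (TO_DAYS('2022-10-01')),
        PARTITION pfuture  VALUES LESS THAN MAXVALUE
    );

    -- Tossing a month of raw Fact rows is a quick partition drop, not a huge DELETE;
    -- the Summary table built from those rows is kept untouched
    ALTER TABLE fact_orders DROP PARTITION p2022_07;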

It is usually good to heavily 'normalize' the Fact table, saving disk space. Meanwhile, the Summary tables can be 'denormalized', improving speed.
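
As a rough illustration of that split (the schema is hypothetical): the Fact table stores compact ids, while the Summary table repeats the human-readable values so reports need no joins at all:

    -- Normalized Fact table: compact foreign keys keep the 10M+ rows small
    CREATE TABLE fact_sales (
        product_id  INT UNSIGNED      NOT NULL,  -- FK into a products lookup table
        store_id    SMALLINT UNSIGNED NOT NULL,  -- FK into a stores lookup table
        sale_date   DATE              NOT NULL,
        amount      DECIMAL(10,2)     NOT NULL
    );

    -- Denormalized Summary table: readable names repeated per row
    CREATE TABLE sales_daily_summary (
        sale_date     DATE          NOT NULL,
        product_name  VARCHAR(100)  NOT NULL,
        store_name    VARCHAR(100)  NOT NULL,
        total_amount  DECIMAL(12,2) NOT NULL,
        sale_count    INT UNSIGNED  NOT NULL,
        PRIMARY KEY (sale_date, product_name, store_name)
    );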

If you want more specifics, please divulge more info.

  • Thank you. Can you shed more light with some examples of the MySQL queries that would be run to extract summary tables out of large tables (>10 million rows)? Commented Sep 8, 2022 at 10:28
  • @databasequestion - I would need to see SHOW CREATE TABLE for the existing tables. And some hints of what the "reports" need to present. Commented Sep 9, 2022 at 23:36
