Is a clustered index on a child table in a parent/child relationship the most optimal index?

Question

A clustered index sorts the row data in a table according to the ordering of the fields in the index. Given that, why would you (or would you not) want to create a clustered index on the child table in a parent/child relationship?

Let's assume the classic order-with-orderlines example, where you never fetch the orderlines without first going through the associated order. I would think that having a clustered index on both the orderid and the orderlineid columns (as clustered indexes must be unique) forces all orderlines for a single order to be physically located adjacent to each other, which would make reads very fast. Is this correct? And if so, are there downsides to this approach?

They do not HAVE to be, but if they are not they would be made unique internally by SQL Server (which is the database I'm working with, so YMMV for other systems...). There is some overhead for this, so I would try to avoid it. See: stackoverflow.com/questions/4332982/…
– JDT, Commented Feb 4, 2015 at 22:01

Thomas Kejser · Accepted Answer · 2015-02-06 14:37:55Z

Your assumption about adjacency is correct.

If we use TPC-H as an example: clustering the LINEITEMS table on ORDERID will locate all order lines belonging to the same order physically adjacent on disk. This speeds up queries that fetch all order lines for a given ORDERID. Clustering on the foreign key to the parent also allows fast merge joins between the child and the parent.
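The adjacency effect can be illustrated with a toy model. This is only a sketch, not how any particular engine lays out pages: the page size, the order numbers, and the `pages_touched` helper are all made up for illustration, and the "heap" is assumed to keep rows in arrival order.

```python
PAGE_SIZE = 4  # rows per page; a made-up number for illustration


def pages_touched(stored_rows, order_id, page_size=PAGE_SIZE):
    """Return the set of page numbers that hold rows for one order."""
    return {
        position // page_size
        for position, (row_order_id, _line_number) in enumerate(stored_rows)
        if row_order_id == order_id
    }


# Lines for four hypothetical orders arrive interleaved,
# as concurrent sessions might insert them.
arrival_order = [
    (order_id, line_number)
    for line_number in range(1, 5)
    for order_id in (101, 102, 103, 104)
]

heap_rows = list(arrival_order)         # toy heap: rows stay in arrival order
clustered_rows = sorted(arrival_order)  # toy clustered table: sorted by (order_id, line_number)

heap_pages = pages_touched(heap_rows, 102)
clustered_pages = pages_touched(clustered_rows, 102)
print(len(heap_pages), len(clustered_pages))  # prints "4 1"
```

In this model, fetching the four lines of order 102 touches four pages in arrival order but only one page when the rows are kept sorted on the order key.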

There are a few downsides to the clustering approach:

The entire table must be kept sorted on disk. If you expect a great many inserts and ORDERID is not generated sequentially, page splits will be more expensive. This is something you can throw hardware at.
If ORDERID is generated sequentially, you will create a hotspot at the end of the table. In some database engines (for example, SQL Server) this is a problem at high insert rates. In SQL Server, this typically kicks in around 5K-10K inserts/sec.
The clustered index keys either have to be unique (e.g. ORDERID, LINENUMBER) or be padded with a hidden column to make them unique. Since the composite clustering key must be present in all other indexes, this makes the secondary, non-clustered indexes larger.
Storing the table clustered forces an extra B-tree traversal when you locate data via a secondary index (unless the secondary index covers the query). If you instead kept the table as a heap, all other indexes would carry only an 8-byte overhead and the number of B-tree traversals would be cut in half.
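The last point, one search versus two, can be sketched with sorted lists standing in for B-trees. This is a simplified model under stated assumptions: the rows, the product codes, and both lookup functions are hypothetical, each `bisect_left` call stands for one B-tree traversal, and a list position stands in for the heap's row identifier.

```python
from bisect import bisect_left

# Hypothetical order lines: (order_id, line_number, product_id).
ROWS = [(2, 1, "C"), (1, 2, "B"), (3, 1, "E"), (1, 1, "A"), (2, 2, "D")]

# Clustered layout: the table itself is sorted on (order_id, line_number),
# and the secondary index stores that clustering key next to product_id.
CLUSTERED_TABLE = sorted(ROWS)
CLUSTERED_KEYS = [(o, l) for o, l, _p in CLUSTERED_TABLE]
SECONDARY_ON_CLUSTERED = sorted((p, o, l) for o, l, p in CLUSTERED_TABLE)

# Heap layout: rows stay where they were inserted, and the secondary
# index stores the row's position (standing in for a row identifier).
HEAP_TABLE = list(ROWS)
SECONDARY_ON_HEAP = sorted((p, rid) for rid, (_o, _l, p) in enumerate(HEAP_TABLE))


def lookup_clustered(product_id):
    """Return (row, searches): index search, then a second search by clustering key."""
    i = bisect_left(SECONDARY_ON_CLUSTERED, (product_id,))  # traversal 1
    _p, order_id, line_number = SECONDARY_ON_CLUSTERED[i]
    j = bisect_left(CLUSTERED_KEYS, (order_id, line_number))  # traversal 2
    return CLUSTERED_TABLE[j], 2


def lookup_heap(product_id):
    """Return (row, searches): index search, then a direct fetch by row position."""
    i = bisect_left(SECONDARY_ON_HEAP, (product_id,))  # traversal 1
    _p, rid = SECONDARY_ON_HEAP[i]
    return HEAP_TABLE[rid], 1  # no second search, the position is already known


print(lookup_clustered("C"))  # ((2, 1, 'C'), 2)
print(lookup_heap("C"))       # ((2, 1, 'C'), 1)
```

Both lookups return the same row; the clustered variant needs a second search because its secondary index stores a key rather than a location.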

In the vast majority of cases, you will want to cluster both the parent and the child on the same leading key. But if you expect the child table to be accessed via many different indexes, it may be worth considering the alternatives.
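When both tables are clustered on the same leading key, both inputs are already sorted on ORDERID, which is the precondition for a merge join. A minimal sketch of the idea follows, assuming unique order ids in the parent; the `merge_join` function and the sample rows are illustrative only and do not reflect any engine's actual implementation.

```python
def merge_join(orders, order_lines):
    """Join two inputs that are both sorted on order_id in one forward pass.

    orders:      list of (order_id, customer), unique order_id, sorted
    order_lines: list of (order_id, line_number, product), sorted
    """
    result = []
    i = 0
    for order_id, customer in orders:
        # Skip child rows whose key is smaller than the current parent key.
        while i < len(order_lines) and order_lines[i][0] < order_id:
            i += 1
        # Emit every child row that matches the current parent.
        while i < len(order_lines) and order_lines[i][0] == order_id:
            _order_id, line_number, product = order_lines[i]
            result.append((order_id, customer, line_number, product))
            i += 1
    return result


# Hypothetical sample data, already in clustered (sorted) order.
orders = [(101, "alice"), (102, "bob"), (103, "carol")]
order_lines = [(101, 1, "nut"), (101, 2, "bolt"), (103, 1, "gear")]

joined = merge_join(orders, order_lines)
for row in joined:
    print(row)
```

Neither input is sorted or hashed at query time: each list is read once, front to back, which is why the shared clustering key makes this join cheap.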

My understanding is that the "clustered index hotspot" problem is not real in modern versions of MS SQL (dba.stackexchange.com/a/1586/1896).
– Jon of All Trades, Commented Feb 6, 2015 at 15:31
It very much IS a problem in even the latest versions. The problem moved from a locking to a latching issue. See this: kejser.org/boosting-insert-speed-by-generating-scalable-keys
– Thomas Kejser, Commented Feb 6, 2015 at 16:54
Hotspots are generally a good thing rather than a bad thing. Most workloads won't be doing singleton 10K inserts/sec into a table and, if so, should be using bulk insert instead.
– Dan Guzman, Commented Feb 7, 2015 at 15:58
@DanGuzman: How are hotspots a good thing? You are right that most workloads don't start out with 10K inserts/sec. The problem occurs when you grow to that speed and find yourself trying to manage a huge table with an IDENTITY column that no longer scales.
– Thomas Kejser, Commented Feb 7, 2015 at 21:39
@enorl76, the upshot of our discussion is it depends :-) Random keys avoid latch contention at very high insert rates. The storage system must be capable of sustaining more IOPS with large tables and random keys compared to the same with incremental keys. Incremental keys with large tables perform better than random keys on a slow storage system because less IO is needed. Given that SSDs are becoming commonplace and more RAM is available, random keys are more palatable nowadays than in the past.
– Dan Guzman, Commented Jul 27, 2018 at 1:56
