Is there a more efficient way to join these tables?

Question

I have a couple of tables in a DB2 database. Table_1 looks like this (the actual table is 9.2 million rows):

Customer_ID	Offer	Item_list
A	X	1
A	Y	2
B	Y	2

Table_2 looks like this (the actual table is 83k rows):

Item_list	Item_ID
1	111
1	222
1	333
2	111
2	444

I want to join the tables to return a list of items for each customer. So for the example above, the result should look like this:

Customer_ID	Item_ID
A	111
A	222
A	333
A	444
B	111
B	444

I have written the code below to do this. However, it seems to take forever to run (I finally killed it after 25 minutes with no results). Is there a more efficient way of getting the results I'm looking for please?

SELECT A.CUSTOMER_ID, B.ITEM_ID
FROM TABLE_1 A
LEFT JOIN TABLE_2 B
    ON A.ITEM_LIST = B.ITEM_LIST
GROUP BY A.CUSTOMER_ID, B.ITEM_ID;

I mean, it's literally just a left join. It doesn't get much simpler. What do your indices look like? Also, do you really need the GROUP BY?
– BenderBoy, Commented Oct 10, 2023 at 10:44
But also I don't really get your schema. Since you can apparently have multiple customers per offer, you're going to get a lot of duplication. Have you tried switching the tables around? You would miss out on itemless offers, but you may be alright with that?
– BenderBoy, Commented Oct 10, 2023 at 10:49
Are you sure the specified result matches the sample table data?
– jarlh, Commented Oct 10, 2023 at 11:45
You want a query that reads 8 million rows and joins them with 83k rows. It's bound to be slow. But what's the problem? I assume such a heavy query is for a nightly process, not an interactive one, right?
– The Impaler, Commented Oct 10, 2023 at 13:33
@TheImpaler True if there really is no WHERE clause. @SJRCoding may want to separate query time and fetch time in the analysis. With that much data, I'd bet the fetch time is the problem.
– dougp, Commented Oct 10, 2023 at 15:27

Charles · Accepted Answer · 2023-10-10 14:43:07Z

Generally speaking, in an RDBMS there's one way to join related tables. In other words, the set of foreign keys is determined during design.

The type of join is also mostly determined at design time. There's some leeway at query time, but mostly that's when the relation is designed as an outer join and you're only interested in the inner.

In your case, it seems unlikely that TABLE_1 would have a row with NULL in Item_list, or a value in Item_list that doesn't exist in TABLE_2, so you should be using INNER JOIN, not an outer join.

But changing the join type is unlikely to have much of an effect on performance.

The only thing that is likely to help performance is ensuring that there is an index on TABLE_1 over (Customer_ID, Item_list) and one on TABLE_2 over (Item_list, Item_ID).
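As a rough sketch, the two indexes could be created like this. The table and column names come from the question; the index names are made up for illustration, so pick whatever fits your naming conventions.

```sql
-- Hypothetical index names; tables and columns are as given in the question.
-- Index on TABLE_1 over (Customer_ID, Item_list)
CREATE INDEX IX_TABLE_1_CUST_LIST
    ON TABLE_1 (CUSTOMER_ID, ITEM_LIST);

-- Index on TABLE_2 over (Item_list, Item_ID)
CREATE INDEX IX_TABLE_2_LIST_ITEM
    ON TABLE_2 (ITEM_LIST, ITEM_ID);
```

Whether the optimizer actually uses them depends on your data and statistics, so it is worth checking the access plan before and after.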

Lastly, your GROUP BY isn't necessary. I suspect you want ORDER BY instead, though leaving both off would help with performance.

Given the content of the question, use an INNER JOIN. stackoverflow.com/questions/2726657/…
Without the GROUP BY you'll get an extra row (A, 111). DISTINCT may be faster. stackoverflow.com/questions/581521/…
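Putting the two comments together, one possible rewrite of the query from the question uses an INNER JOIN with DISTINCT. This is only a sketch using the names from the question; whether it is actually faster than the GROUP BY version is something to measure on your own data.

```sql
-- Sketch only: INNER JOIN as suggested, with DISTINCT in place of GROUP BY
-- to remove repeated (customer, item) pairs such as (A, 111).
SELECT DISTINCT A.CUSTOMER_ID, B.ITEM_ID
FROM TABLE_1 A
INNER JOIN TABLE_2 B
    ON A.ITEM_LIST = B.ITEM_LIST;
```

On the sample data in the question this returns the six rows shown in the expected result, although the row order is not guaranteed without an ORDER BY.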
