
NOTE: Please don't respond by telling me that I probably don't understand what I am looking at. You can't possibly know that and it's wrong. Just don't answer if that's all you have to say.

I'm looking for a term akin to the "N+1 problem" but for a different issue. I know that the term 'anti-pattern' may raise some hackles but I'm pretty certain that the approach described below is either an implementation error or comes from some fundamental misunderstandings about database normalization. If anyone can explain any legitimate reason for doing this, I'd love to read it.

Background: I'm doing some analysis on a complex database defined by a 3rd-party. I keep finding an odd pattern (in the general sense) to the way that the parent-child relationships are populated with data.

I thought of a concrete example of this style:

    ---------------          ---------------
    | date        |          | day_of_week |
    ---------------          ---------------
    | PK          |     +--> | PK          |
    | date        |     |    | number      |
    | day_of_week | ----+    | name        |
    | data ...    |          | weekend     |
    ---------------          ---------------
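In SQL terms the structure would be roughly this (a sketch only - the column types are assumed, and I've written date_table for the table the diagram calls date):

    -- Reference table as depicted (types assumed).
    CREATE TABLE day_of_week (
        pk      INTEGER PRIMARY KEY,
        number  INTEGER NOT NULL,      -- 1..7
        name    VARCHAR(10) NOT NULL,  -- 'sun', 'mon', ...
        weekend BOOLEAN NOT NULL
    );

    -- Dependent table, one row per date ("date_table" stands in
    -- for the table the diagram calls "date").
    CREATE TABLE date_table (
        pk          INTEGER PRIMARY KEY,
        date        DATE NOT NULL,
        day_of_week INTEGER NOT NULL REFERENCES day_of_week (pk),
        data        VARCHAR(100)       -- stand-in for "data ..."
    );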

OK. Looks like a relatively typical normalized structure. But when I look at the data in the tables, I find something puzzling:

    ------------------------------    ----------------------------
    | date                       |    | day_of_week              |
    ------------------------------    ----------------------------
    | PK | date     | dow | data |    | PK | num | name  | wkend |
    ------------------------------    ----------------------------
    | 1  | 23-05-28 | 1   | bcd  |    | 1  | 1   | sun   | true  |
    | 2  | 23-05-29 | 2   | cde  |    | 2  | 2   | mon   | false |
    | 3  | 23-05-30 | 3   | def  |    | 3  | 3   | tue   | false |
    | 4  | 23-05-31 | 4   | efg  |    | 4  | 4   | wed   | false |
    | 5  | 23-06-01 | 5   | fgh  |    | 5  | 5   | thurs | false |
    | 6  | 23-06-02 | 6   | ghi  |    | 6  | 6   | fri   | false |
    | 7  | 23-06-03 | 7   | hij  |    | 7  | 7   | sat   | true  |
    | 8  | 23-06-04 | 8   | ijk  |    | 8  | 1   | sun   | true  |
    | 9  | 23-06-05 | 9   | jkl  |    | 9  | 2   | mon   | false |
    | 10 | 23-06-06 | 10  | klm  |    | 10 | 3   | tue   | false |
    | 11 | 23-06-07 | 11  | lmn  |    | 11 | 4   | wed   | false |
    | 12 | 23-06-08 | 12  | mno  |    | 12 | 5   | thurs | false |
    | 13 | 23-06-09 | 13  | nop  |    | 13 | 6   | fri   | false |
    | 14 | 23-06-10 | 14  | opq  |    | 14 | 7   | sat   | true  |
    | 15 | 23-06-11 | 15  | pqr  |    | 15 | 1   | sun   | true  |
    | ...| ...      | ... | ...  |    | ...| ... | ...   | ...   |

In some cases that I've dug into, a small set of unique data values makes up half or more of the rows in the reference table. The repeated data values are literally identical. I am seeing this in many places in the DB. Sometimes there are multiple tables pointing to the same reference table and, again, there's a separate entry for each dependent row regardless of whether that reference data already exists.
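To put a number on it, a query along these lines (using my example's names - the real tables are different) shows how few distinct value combinations the reference table actually holds:

    -- Total rows vs. genuinely distinct value combinations in the
    -- reference table (surrogate key excluded). For the sample data
    -- above, total_rows keeps growing while distinct_rows stays at 7.
    SELECT
        (SELECT COUNT(*) FROM day_of_week) AS total_rows,
        (SELECT COUNT(*)
           FROM (SELECT DISTINCT number, name, weekend
                   FROM day_of_week) AS d) AS distinct_rows;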

I don't know the full scope of this, but I believe a significant portion of the DB storage is being used to store these redundant rows. In addition, the inserts associated with these reference rows could be adding significant load. Both the storage requirements and performance of this system are problematic.

Is this something that has a name? I can think of a few choice terms for this. 'Doing it wrong' is one on the more polite end of the spectrum. Does 'anti-normalization' make sense? It's not really de-normalized. It's more of a normalization face-plant. 'Normalization face-plant' would be a fun term but doesn't sound professional.

I've seen this happen in systems I was involved in due to incorrect implementation. The problem I have is that the 3rd-party makes a habit of dismissing such critiques by saying that my team "doesn't understand" the solution. It's not clear whether they actually believe this or are simply using it as a strategy. I would like to be able to point to an independent source confirming this is not a good approach so I can avoid a time-consuming, frustrating and potentially fruitless debate.

  • Did you ask them if they can explain why the system is designed like this? Can you share their response? Commented May 31, 2023 at 18:12
  • @RikD I haven't yet. I have a strong suspicion that it's related to hand-rolling your own ORM. One of my colleagues had to fight for months, write up a paper and involve executives to get them to fix something roughly as obvious (and easier to fix.) I want to be prepared because it's really hard to keep it cool when they are so disrespectful and wrong at the same time. Commented May 31, 2023 at 18:30
  • Two columns being trivially functionally dependent by design is an obvious anti-pattern (easily fixed with computed columns though). But breaking a big table up, can be a simple case of vertical partitioning to avoid juggling all the extra columns you often don't need. Commented May 31, 2023 at 18:38
  • @Deduplicator In one example, there are 6 data columns on the ref table and 4-9 on the children. The FK is required in all the child tables I've checked. Regardless, why wouldn't you remove the duplicates? Why not reference the existing rows instead of doing an insert on the parent for every child insert? Commented May 31, 2023 at 19:19
  • 1
    In the example data, shouldn’t the table on the right be ref instead of parent? Commented Jun 1, 2023 at 5:05

3 Answers


Answer after the question was edited:

There are several issues with the depicted data. The most serious is a violation of third normal form in the second table, since weekday name and weekend flag will always be the same for a given weekday number - but the weekday number is not unique in the table. Therefore the weekday name and weekend flag should be extracted to a separate lookup table.
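Concretely, the fix would be something like this (a sketch using the names from the example):

    -- Lookup table: exactly seven rows, ever.
    CREATE TABLE weekday (
        number  INTEGER PRIMARY KEY,   -- 1..7
        name    VARCHAR(10) NOT NULL,
        weekend BOOLEAN NOT NULL
    );

    -- The date table then carries just the weekday number as a
    -- foreign key, instead of a per-date copy of the weekday data.
    CREATE TABLE date_table_redesigned (
        pk      INTEGER PRIMARY KEY,
        date    DATE NOT NULL,
        weekday INTEGER NOT NULL REFERENCES weekday (number),
        data    VARCHAR(100)
    );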

There is arguably a redundancy in even storing weekday number for each date, since weekday can be algorithmically determined from the date. But this is not strictly a violation of normalization, and storing the weekday may be justified for practical reasons.
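For example, in PostgreSQL (other vendors have their own equivalents) the weekday can be derived on the fly, or stored as a generated column so it cannot drift out of sync with the date:

    -- Derive the weekday when querying. Note ISODOW numbers
    -- Monday as 1 and Sunday as 7; the sample data numbers
    -- Sunday as 1, so the actual mapping would differ.
    SELECT d.date, EXTRACT(ISODOW FROM d.date) AS weekday_number
    FROM date_table d;

    -- Or persist it as a generated column (PostgreSQL 12+):
    ALTER TABLE date_table
        ADD COLUMN weekday_number INTEGER
        GENERATED ALWAYS AS (CAST(EXTRACT(ISODOW FROM date) AS INTEGER)) STORED;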

Then there is a 1:1 relationship between the two tables depicted. This is not strictly a violation of normalization either, but problematic since it makes inserts more complex and has a cost in joins. 1:1 relationships are sometimes used deliberately in a pattern called vertical partitioning or row splitting, but then the partitions usually share the same primary key. Unless other tables in the database have a foreign key to the second table, the partitioning does not seem useful.
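For contrast, a deliberate vertical partition would normally look something like this, with both halves sharing the same primary key (hypothetical table names):

    -- Hot half: small columns that are scanned constantly.
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        placed_on DATE NOT NULL,
        status    VARCHAR(20) NOT NULL
    );

    -- Cold half: wide, rarely-read columns; same key, strict 1:1.
    CREATE TABLE order_details (
        order_id INTEGER PRIMARY KEY REFERENCES orders (order_id),
        notes    TEXT,
        payload  TEXT
    );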

I don't think there is a common term covering these issues, except the rather general "redundant data". But the violation of 3rd normal form in itself should be enough to request a redesign of the schema.

Violation of normalization is typically due to bad database design, but in some cases it may be deliberate for performance reasons. Especially in databases designed for analytics (rather than OLTP), such as data warehouses, denormalization and row splitting are often used to optimize certain analyses.

It could also just be that the second table is a mistaken attempt to create a lookup table.


Original answer:

What you see is called vertical partitioning or row splitting or sixth normal form.

Having repeated values in a column is not in itself problematic. E.g. if you have a boolean column you will have lots of repeated values, but this is perfectly fine.

Having 1:1 tables (as this example seems to be) does not violate normalization. But from a practical perspective, it is more inconvenient for queries and updates than just having a single table. There may, however, be practical considerations which offset this inconvenience. Some I can think of:

  • Performance: In some cases performance can be increased by vertical partitioning of tables.

  • Organization or permissions: The columns may "belong" to different applications or departments and have different access rights.

It might also be that the data is actually one-to-many, but just one-to-one in most cases. E.g. most people have one nationality, but a small fraction has multiple.

Bottom line: You have to bite the bullet and ask the person who designed this for the reason. You should be open to the possibility that it is designed like that for a good reason.

It sounds like you are searching for a justification to dismiss this design without needing to understand the rationale first. And the other party says you "don't understand" the solution. Honestly, this seems more like a people or communication issue than a database design issue.

  • Explain to me how having hundreds of millions of rows with the exact same data values improves performance for joins. Consider indexes and inserts. Why would joining the child rows to a single parent row be less performant? Commented Jun 1, 2023 at 13:41
  • @JimmyJames: It is called vertical partitioning or row splitting or 6th normal form, if you want to look further into it. It can in some cases increase performance, e.g. if queries scan and select from some columns significantly more often than others. But whether there is any benefit very much depends on the particular database and workload. Commented Jun 1, 2023 at 15:26
  • Can you address the question I asked? E.g.: How is joining a thousand rows to a thousand rows going to be better/faster than joining 1000 rows to 6 rows? The results will be identical. Commented Jun 1, 2023 at 16:29
  • @JimmyJames: I doubt it would be faster. But I think you are conflating different issues with this question. Vertical partitioning does not increase or decrease the number of rows, it increases the number of tables. But if you are talking about extracting duplicate values in a column to a lookup table, that would just introduce a join. E.g. imagine you have a boolean column and then extract the two values to a lookup table with two rows - you would still need an FK to the lookup table for each row in the main table. This would introduce a join you may not need. Commented Jun 1, 2023 at 16:49
  • 1
    @JimmyJames if the two tables are 1:1 then it is not a lookup table, it is a vertical partition. Lookup tables and vertical portions are independent concerns - you can have both, a vertical partition and a lookup table for the unique values. Or you can have either, or none. Perhaps you think the vertical partition was supposed to be a lookup table but just badly designed? This is of course possible but probably more likely it is deliberate use of vertical partitioning. Commented Jun 1, 2023 at 22:09

If anyone can explain any legitimate reason for doing this, I'd love to read it.

Well...

1 to 0..1
The "1 to 0..1" between super and sub-classes is used as a part of "all classes in separate tables" strategy for implementing inheritance.

A "1 to 0..1" can be represented in a single table with "0..1" portion covered by NULL-able fields. However, if the relationship is mostly "1 to 0" with only a few "1 to 1" rows, splitting-off the "0..1" portion into a separate table might save some storage (and cache performance) benefits. Some databases are thriftier at storing NULLs than others, so a "cut-off point" where this strategy becomes viable can vary considerably.

Stack Overflow - When I Should Use One To One Relationship
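As a sketch of that strategy (hypothetical names), the subclass table shares the superclass key and only has rows for the subset it applies to:

    -- Superclass: every vehicle gets a row here.
    CREATE TABLE vehicle (
        vehicle_id INTEGER PRIMARY KEY,
        make       VARCHAR(50) NOT NULL
    );

    -- Subclass: only trucks get a row here ("1 to 0..1").
    CREATE TABLE truck (
        vehicle_id   INTEGER PRIMARY KEY REFERENCES vehicle (vehicle_id),
        payload_tons NUMERIC(6,2) NOT NULL
    );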

You may be thinking: I don't see that.

You don't see that, yet. Remember, you said "any".

1 to 1
The real "1 to 1" vertically partitions the data, which may have implications for caching. Databases typically implement caches at the page level, not at the level of individual fields, so even if you select only a few fields from a row, the whole page that row belongs to will typically be cached. If a row is very wide and the selected fields relatively narrow, you'll end up caching a lot of information you don't actually need. In a situation like that, it may be useful to vertically partition the data, so only the narrower, more frequently used portion of the rows gets cached, allowing more of them to fit into the cache and making the cache effectively "larger".

Another use of vertical partitioning is to change the locking behavior: databases typically cannot lock individual fields, only whole rows. By splitting the row, you are allowing a lock to take place on only one of its halves.

Triggers are also typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do that could make this impractical. For example, Oracle doesn't let you modify the mutating table - by having separate tables, only one of them may be mutating so you can still modify the other one from your trigger.

Separate tables may allow more granular security.

These considerations are irrelevant in most cases, so in most cases you should consider merging the "1 to 1" tables into a single table.

Stack Overflow - When I Should Use One To One Relationship

And I'll add one: time. The data looks like this today, but tomorrow it might well look different. Office databases can go years before the first time someone has a second home address associated with them. You won't see the need for that in day-one data.

As for a name, I'd call it "an unjustified 1 to 1 relationship" and invite them to justify it. If they're telling you that you don't understand just ask them to explain it. The world has enough IT mysteries.

As for the dupes in the parent table, the only justification I can think of is that this is a table for data entry. People just type in a field and nothing checks for dupes. You could dedupe, but then when someone comes back and changes a value to something unique, you'd have to give it a new row all over again.

It may be that the DB optimizes this for you silently in the background. I know Java silently interns strings, which is part of why we have to use .equals() rather than ==. Who knows what the modern DBs are up to.

  • This doesn't make sense. Breaking up the table and keeping 100's of millions of duplicate rows (aside from surrogate keys) are two completely different things. What would you imagine prevents them from pointing two child rows to the same parent? Commented May 31, 2023 at 19:24
  • @JimmyJames Apart from the apps business rules nothing. But just because the existing data doesn't point two children at the same parent doesn't mean the app has a business rule not to point two children at one parent. Just means it hasn't happened yet. Commented May 31, 2023 at 19:34
  • "Just means it hasn't happened yet" Exactly, that's the problem. Why hasn't it happened yet? Maybe after the 100 millionth row with exactly the same data is inserted, one of those rows (0.000001%) will be reused. Problem solved! Commented May 31, 2023 at 19:40
  • I don't think I'm making this clear. It isn't about the number of rows. It's about if the data represents the results of every use case yet. Commented May 31, 2023 at 19:49
  • "It may be the DB optimizes this for you silently in the background." No. I am looking at the table. It has over 500 million rows which each have a separate PK value. "have to give it new row all over again." Yes, that's a better solution than adding the same values 100 million times. Commented May 31, 2023 at 19:49

Anti-normalization it is then. I'm still open to other names.

JacquesB suggested "violation of 3rd normal form" but I don't think that really captures how insanely stupid this is or how bad the impact is on the performance of the DB and application.
