NOTE: Please don't respond by telling me that I probably don't understand what I am looking at. You can't possibly know that and it's wrong. Just don't answer if that's all you have to say.
I'm looking for a term akin to the "N+1 problem" but for a different issue. I know that the term 'anti-pattern' may raise some hackles but I'm pretty certain that the approach described below is either an implementation error or comes from some fundamental misunderstandings about database normalization. If anyone can explain any legitimate reason for doing this, I'd love to read it.
Background: I'm doing some analysis on a complex database defined by a 3rd-party. I keep finding an odd pattern (in the general sense) to the way that the parent-child relationships are populated with data.
I thought of a concrete example of this style:
--------------- | date | --------------- --------------- | PK | | day_of_week | | date | --------------- | day_of_week | --> | PK | | data ... | | number | --------------- | name | | weekend | --------------- OK. Looks like a relatively typical normalized structure. But when I look at the data in the tables, I find something puzzling:
------------------------------ ----------------------------- | date | | day_of_week | -----------------------------| ----------------------------- | PK | date | dow | data | | PK | num | name | wkend | ------------------------------ ----------------------------- | 1 | 23-05-28 | 1 | bcd | | 1 | 1 | sun | true | | 2 | 23-05-29 | 2 | cde | | 2 | 2 | mon | false | | 3 | 23-05-30 | 3 | def | | 3 | 3 | tue | false | | 4 | 23-05-31 | 4 | efg | | 4 | 4 | wed | false | | 5 | 23-06-01 | 5 | fgh | | 5 | 5 | thurs | false | | 6 | 23-06-02 | 6 | ghi | | 6 | 6 | fri | false | | 7 | 23-06-03 | 7 | hij | | 7 | 7 | sat | true | | 8 | 23-06-04 | 8 | ijk | | 8 | 1 | sun | true | | 9 | 23-06-05 | 9 | jkl | | 9 | 2 | mon | false | | 10 | 23-06-06 | 10 | klm | | 10 | 3 | tue | false | | 11 | 23-06-07 | 11 | lmn | | 11 | 4 | wed | false | | 12 | 23-06-08 | 12 | mno | | 12 | 5 | thurs | false | | 13 | 23-06-09 | 13 | nop | | 13 | 6 | fri | false | | 14 | 23-06-10 | 14 | opq | | 14 | 7 | sat | true | | 15 | 23-06-11 | 15 | pqr | | 15 | 1 | sun | true | |... |... | |... | |... | ... | ... | ... | In some cases that I've dug into a small set of unique data values in the reference table can make up than half of the rows (or more) in the reference table. The repeated data values are literally identical. I am seeing this in many places in the DB. Sometimes there are multiple tables pointing to the reference table and again, there's a separate entry for each dependent row regardless of whether that reference data already exists.
I don't know the full scope of this, but I believe a significant portion of the DB storage is being used to store these redundant rows. In addition, the inserts associated with these reference rows could be adding significant load. Both the storage requirements and performance of this system are problematic.
Is this something that has a name? I can think of a few choice terms for this. 'Doing it wrong' is one on the more polite end of the spectrum. Does 'anti-normalization' make sense? It's not really de-normalized. It's more of a normalization face-plant. 'Normalization face-plant' would be a fun term but doesn't sound professional.
I've seen this happen in systems I was involved in due to incorrect implementation. The problem I have is that the 3rd-party makes a habit of dismissing such critiques by saying that my team "doesn't understand" the solution. It's not clear whether they actually believe this or if they are simply using it as a strategy. It would like to be able to point to an independent source confirming this is not a good approach so I can avoid a time-consuming, frustrating and potentially fruitless debate.
refinstead ofparent?