Advice for my DB design and how to handle null / measuring errors

Question

I am seeking advice for my own DB design. I need to make a big step and start a huge load of data processing. What I decide now will have big consequences for my project.

I have this table. It holds data that was acquired online from an external source.

app_data
+--------------+-------+------------+--------+---------+-------+
| surrogateKey | appid | created    | owners | viewers | price |
+--------------+-------+------------+--------+---------+-------+
|            1 |     2 | 1472428100 |     10 |      25 | 10,00 |
|            2 |     2 | 1472428200 |     11 |      50 | 10,00 |
|            3 |     2 | 1472428300 |     15 |      50 | 10,00 |
|            4 |     2 | 1472428400 |     22 |      51 |  8,00 |
|            5 |     2 | 1472428500 |   null |      50 |  8,00 |
|            6 |     2 | 1472428600 |     20 |      49 |  8,00 |
|            7 |     2 | 1472428700 |     25 |      50 | 10,00 |
| ...          |       |            |        |         |       |
+--------------+-------+------------+--------+---------+-------+

CREATE TABLE app_data(
    surrogateKey BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    appID INT UNSIGNED NOT NULL,
    created TIMESTAMP NOT NULL,
    owners INT UNSIGNED,
    owners_variance MEDIUMINT UNSIGNED,
    viewers MEDIUMINT UNSIGNED, -- TINYINT would be better but application layer cannot handle tinyint
    viewers_variance MEDIUMINT UNSIGNED,
    price MEDIUMINT UNSIGNED,
    CONSTRAINT fk_appData_app_appID FOREIGN KEY (appID)
        REFERENCES app(appID) ON UPDATE CASCADE
) ENGINE=INNODB;

1. Does it make sense to create this surrogateKey? It adds no value and will not be used for joins. appid and created would make a fine composite key, and I will use these columns for joining tables.
2. appID and created will be used for joining, comparisons, range queries, and so on. How should I create indexes? One index on each? (appID, created)? (created, appID)?
3. This table will be HUGE and take 95% of the storage space. How can I optimize it further performance-wise?
4. How should I handle random null values, as in the owners column? These basically mean: the thermometer was broken that day... I was contemplating either turning them to zero or leaving them as null. null does mean that there were no measurements that day, whereas zero would be ambiguous but easier for analysis.
5. How should I handle values that don't make sense? For example, the upper limit for viewers is 50. There is one value of 51, which means: this value is WRONG. Should I turn it to null? To 50?
6. Should I normalize price into something like this? Normalization would make the data more difficult to analyse and result in more complicated queries.


CREATE TABLE(
    appID INT NOT NULL PRIMARY KEY,
    created TIMESTAMP,
    start_date TIMESTAMP,
    end_date TIMESTAMP,
    price INT
);

Thanks!

Community · Accepted Answer · 2017-04-13 12:43:01Z

Some of these questions are not so much about database design as about data cleaning and analysis - in those cases the correct answer really depends on the specifics of your analysis task.

1. If the surrogate key doesn't add value (you won't even use it for joins) and you are certain that (appid, created) will always be unique, then just use (appid, created) as a composite primary key.
2. If you'll always be joining, querying, etc. on (appid, created) together, then the index that results from making those columns the multi-column primary key will give you what you need. The order shouldn't matter unless you're also going to query one of the two columns independently. (Edit: Rick James makes a good point about the likelihood of querying for equality on appid and some BETWEEN range for created. As he says, in that case go with (appid, created) for your order.)
3. That depends on the queries you'll be making. Can you provide examples? Indexing where you need it and avoiding joins are your easiest options.
4. This is really a data analysis question, but in general I would strongly recommend keeping missing values as null and transforming or dropping those values as needed for a given operation. Depending on the analysis you're performing, the appropriate no-op value for missingness may differ (e.g. 0 for sums, but 1 for products; for means the missing rows have to be excluded rather than replaced), or you may just want to exclude those rows altogether while still keeping a record that they were recorded.
5. Again, this is very much a data cleaning / analysis decision, but in general I would leave observed data intact (even if you think it's invalid / nonphysical / otherwise not possible), and only drop it during your analysis. Assuming you have the storage space, more information is always better than less: what if you later find out that some supposedly erroneous value actually was possible?
6. From a performance perspective I would avoid needing to join tables, but clearly normalization would be desirable from a consistency / referential integrity perspective.
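
The points about null values and out-of-range values can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs anywhere; that is an assumption on my part, since the question targets MySQL, although AVG and COUNT skip NULL in both. The rows are the sample rows from the question, and VIEWERS_MAX is a hypothetical name for the stated upper limit of 50.

```python
import sqlite3

VIEWERS_MAX = 50  # upper limit stated in the question (hypothetical constant name)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_data ("
    " appID INTEGER NOT NULL, created INTEGER NOT NULL,"
    " owners INTEGER, viewers INTEGER,"
    " PRIMARY KEY (appID, created))"
)
rows = [
    (2, 1472428100, 10, 25),
    (2, 1472428200, 11, 50),
    (2, 1472428300, 15, 50),
    (2, 1472428400, 22, 51),    # out-of-range reading, stored as observed
    (2, 1472428500, None, 50),  # missing measurement, stored as NULL
    (2, 1472428600, 20, 49),
    (2, 1472428700, 25, 50),
]
conn.executemany("INSERT INTO app_data VALUES (?, ?, ?, ?)", rows)

# AVG skips NULL: the mean is taken over the 6 real measurements only.
avg_null = conn.execute("SELECT AVG(owners) FROM app_data").fetchone()[0]

# Replacing NULL with 0 silently drags the mean down.
avg_zero = conn.execute(
    "SELECT AVG(COALESCE(owners, 0)) FROM app_data"
).fetchone()[0]

# COUNT(*) versus COUNT(column) shows how many measurements are missing.
total, measured = conn.execute(
    "SELECT COUNT(*), COUNT(owners) FROM app_data"
).fetchone()

# Invalid readings stay in the table and are excluded per query.
valid_viewers = conn.execute(
    "SELECT COUNT(*) FROM app_data WHERE viewers <= ?", (VIEWERS_MAX,)
).fetchone()[0]

print(avg_null, avg_zero, total, measured, valid_viewers)
```

Keeping the NULL lets each query decide how to treat the gap; once it has been overwritten with 0, the two averages above can no longer be told apart.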

If price changes at regular time intervals, it might be possible to break your created column into two columns: one that describes a coarser time period where there's a constant price, and one that describes the more granular timestamp within that broader period. For example:

app_data:
|-------+------------------+----------------+--------+---------+-------|
| appid | coarsetimeperiod | finetimeperiod | owners | viewers | price |
|-------+------------------+----------------+--------+---------+-------|
|     2 |                1 |              1 |     10 |      25 | 10.00 |
|     2 |                1 |              2 |     11 |      50 | 10.00 |
|     2 |                1 |              3 |     15 |      50 | 10.00 |
|     2 |                2 |              1 |     22 |      51 |  8.00 |
|     2 |                2 |              2 |   null |      50 |  8.00 |
|     2 |                2 |              3 |     20 |      49 |  8.00 |
|     2 |                3 |              1 |     25 |      50 | 10.00 |
|-------+------------------+----------------+--------+---------+-------|

You would have a FOREIGN KEY (coarsetimeperiod, price) REFERENCES price_data(coarsetimeperiod, price) defined on app_data, where the price_data table looks like:

price_data:
|------------------+-------|
| coarsetimeperiod | price |
|------------------+-------|
|                1 | 10.00 |
|                2 |  8.00 |
|                3 | 10.00 |
|------------------+-------|

This would get you the referential integrity benefits of normalization but without the performance penalty of needing to continuously perform joins.
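
A rough sketch of that composite foreign key follows. It again uses Python's sqlite3 module as a stand-in for MySQL (an assumption, chosen so the example is self-contained), stores price as integer cents because SQLite has no DECIMAL type, and picks its own primary keys, which the description above does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE price_data (
    coarsetimeperiod INTEGER PRIMARY KEY,  -- one price per period
    price            INTEGER NOT NULL,     -- cents (assumption)
    UNIQUE (coarsetimeperiod, price)       -- target of the composite FK
);
CREATE TABLE app_data (
    appid            INTEGER NOT NULL,
    coarsetimeperiod INTEGER NOT NULL,
    finetimeperiod   INTEGER NOT NULL,
    owners           INTEGER,
    viewers          INTEGER,
    price            INTEGER NOT NULL,
    PRIMARY KEY (appid, coarsetimeperiod, finetimeperiod),
    FOREIGN KEY (coarsetimeperiod, price)
        REFERENCES price_data (coarsetimeperiod, price)
);
INSERT INTO price_data VALUES (1, 1000), (2, 800), (3, 1000);
""")

# Consistent row: period 1 really is priced at 1000 cents.
conn.execute("INSERT INTO app_data VALUES (2, 1, 1, 10, 25, 1000)")

# Inconsistent row: period 2 is priced at 800, so 1000 is rejected.
try:
    conn.execute("INSERT INTO app_data VALUES (2, 2, 1, 22, 51, 1000)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The price is read straight from app_data, with no join.
prices = conn.execute("SELECT price FROM app_data WHERE appid = 2").fetchall()
print(rejected, prices)
```

The price column is still duplicated in app_data, so storage is not reduced; what the constraint buys is that a row cannot carry a price that disagrees with its period.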

Rick James · Accepted Answer · 2017-01-16 23:44:58Z

(Others may disagree, but here goes...)

If you have a good "natural" PRIMARY KEY, use it. In my own tables, I find that is the case about 2/3 of the time.

What makes it good?

Must be UNIQUE (since the PK is, by definition, unique)
Must never be NULL (again, a PK requirement)
'Composite' is OK, as with your example: (appID, created)
If there are no (or few) secondary keys, its size matters little; otherwise it should be "small". This is because every secondary index implicitly includes the PK columns.
A 'natural' key is more efficient for scanning.
Since your queries might be WHERE appID = constant AND created BETWEEN..., this order is much better: PRIMARY KEY(appID, created).
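
The composite key and the range query from the last point can be sketched as follows. This is only an approximation: it runs on SQLite through Python's sqlite3 module rather than on MySQL, and uses a WITHOUT ROWID table as the nearest analogue to an InnoDB clustered primary key. The generated rows are made-up test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID stores rows in primary-key order, roughly like an InnoDB
# clustered primary key (an approximation, not the same engine).
conn.execute(
    "CREATE TABLE app_data ("
    " appID INTEGER NOT NULL, created INTEGER NOT NULL,"
    " owners INTEGER, viewers INTEGER,"
    " PRIMARY KEY (appID, created)) WITHOUT ROWID"
)
# Made-up data: 3 apps with 7 readings each, 100 seconds apart.
conn.executemany(
    "INSERT INTO app_data VALUES (?, ?, ?, ?)",
    [(app, 1472428000 + 100 * i, 10 + i, 50)
     for app in (1, 2, 3) for i in range(1, 8)],
)

query = ("SELECT created, owners FROM app_data"
         " WHERE appID = ? AND created BETWEEN ? AND ?"
         " ORDER BY created")
params = (2, 1472428200, 1472428400)

result = conn.execute(query, params).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is a readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
print(result)
print(plan)
```

With appID first, the equality test narrows the search to one app and the BETWEEN range then covers a contiguous run of keys, so the plan should report a search on the primary key rather than a full scan.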

Don't use INT (4 bytes) when MEDIUMINT (3 bytes) will suffice; etc. If the value is 0..50, then TINYINT UNSIGNED (1 byte) is better.

How did you store 10,00 in INT? Perhaps you meant DECIMAL(8,2)?

Do not normalize anything you might want to do a range test on, such as price or created. In general, don't normalize 'continuous' values. Nor small values. For example, it is hardly worth normalizing a 5-digit zip code to replace it with a 3-byte MEDIUMINT.

Richard L. Dawson · Accepted Answer · 2017-01-16 23:32:01Z

While I'm not an expert in MySQL, I have designed a few databases... Since appID doesn't appear to be unique, it cannot be your primary key by itself. If your created column is unique, as it appears to be here, then it would be a good natural key and could quite easily be your primary key on its own. If you need something more identifying, you can combine it with appID for a two-column natural primary key.

The thing about your primary key is that it is also most often the column your data is sorted on (the clustered key). If your data is inserted in sequential order and the created column represents that order, then I would say use that for your primary key and create non-clustered indexes on the other columns you need to make your application more performant. This will also cut down on page splits as you add data and keep your table fragmentation lower.

How you handle NULL values and values outside your acceptable range is a design decision. Many people (myself included) feel that NULLs are a problem waiting to happen and should not be allowed into the data. They introduce three-valued logic and have to be "handled" with cumbersome SQL syntax later on. That said, can you define a value to replace the NULL that means the equivalent of "no data submitted"? An empty string can most often do this, but for numeric values it becomes trickier.
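
To make the three-valued logic concrete, here is a small sketch using Python's sqlite3 module (a stand-in for MySQL, which is an assumption; comparisons with NULL follow the same standard SQL rules in both). The count helper exists only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_data (created INTEGER PRIMARY KEY, owners INTEGER)")
conn.executemany(
    "INSERT INTO app_data VALUES (?, ?)",
    [(1472428400, 22), (1472428500, None), (1472428600, 20)],
)

def count(where):
    """Count rows matching a WHERE clause (helper for this sketch only)."""
    sql = f"SELECT COUNT(*) FROM app_data WHERE {where}"
    return conn.execute(sql).fetchone()[0]

# A comparison with NULL is neither true nor false; it evaluates to NULL.
null_eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

equals_20 = count("owners = 20")
not_20 = count("owners <> 20")  # the NULL row is not counted here
not_20_or_missing = count("owners <> 20 OR owners IS NULL")
missing = count("owners IS NULL")

print(null_eq_null, equals_20, not_20, not_20_or_missing, missing)
```

The two complementary-looking filters, owners = 20 and owners <> 20, together match only two of the three rows; the NULL row has to be asked for explicitly with IS NULL.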

For values outside your design range (viewers = 51) you can define the viewers column as a char (2) value and use "NA" if a NULL or incorrect data comes in. Or you can leave it as a 51 and filter that in your queries.

Your question about normalization and the price column is an easy one if all we are looking at is the columns you've defined here. Price appears to be an attribute of whatever these rows represent, and so should stay with the data. The usual rule is that you design for third normal form and then denormalize for performance. Don't fall into the trap of turning your table into a spreadsheet, though, by trying to make it too flexible. I hope this answers your questions.

If it doesn't I suggest you get a bit more detailed information. A good resource is "Database Design for Mere Mortals" by a fella named Michael Hernandez.

Have a great day. Richard

SQL.RK · Accepted Answer · 2017-01-17 12:40:38Z

Here are my 2 cents...

In the RDBMS world, huge tables are usually partitioned so that each index scan stays small. In this case, created is a good candidate for the partition key. Depending on the volume of records, you can go for daily, weekly, or monthly partitions.

Regarding the index, try (created, appID), since created will have a higher cardinality than appID.

Regarding invalid values, the best option is not to store them in the main table but to move them to error tables. The best option is not always practical in our world, though. In that case, it is better to store the actual values and filter out invalid records in your queries.
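
A minimal sketch of the error-table approach, written with Python's sqlite3 module as a stand-in for MySQL (an assumption). The app_data_errors table, the rejection_reason function, and the reason strings are hypothetical names invented for this example; the limit of 50 comes from the question.

```python
import sqlite3

VIEWERS_MAX = 50  # upper limit from the question

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_data (
    appID INTEGER NOT NULL, created INTEGER NOT NULL,
    owners INTEGER NOT NULL, viewers INTEGER NOT NULL,
    PRIMARY KEY (appID, created)
);
-- Hypothetical error table: same columns, nullable, plus a reason.
CREATE TABLE app_data_errors (
    appID INTEGER NOT NULL, created INTEGER NOT NULL,
    owners INTEGER, viewers INTEGER, reason TEXT NOT NULL
);
""")

def rejection_reason(owners, viewers):
    """Return why a reading is invalid, or None if it passes the checks."""
    if owners is None:
        return "owners missing"
    if viewers is None:
        return "viewers missing"
    if viewers > VIEWERS_MAX:
        return "viewers above limit"
    return None

incoming = [
    (2, 1472428300, 15, 50),
    (2, 1472428400, 22, 51),    # above the limit
    (2, 1472428500, None, 50),  # missing measurement
    (2, 1472428600, 20, 49),
]
for app_id, created, owners, viewers in incoming:
    reason = rejection_reason(owners, viewers)
    if reason is None:
        conn.execute("INSERT INTO app_data VALUES (?, ?, ?, ?)",
                     (app_id, created, owners, viewers))
    else:
        conn.execute("INSERT INTO app_data_errors VALUES (?, ?, ?, ?, ?)",
                     (app_id, created, owners, viewers, reason))

good = conn.execute("SELECT COUNT(*) FROM app_data").fetchone()[0]
errors = conn.execute(
    "SELECT created, reason FROM app_data_errors ORDER BY created"
).fetchall()
print(good, errors)
```

Nothing is thrown away here: the rejected readings keep their original values in the error table, so they can be moved back if a rule later turns out to be wrong.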
