I am looking for advice on my database design. I am about to start processing a large volume of data, and what I decide now will have big consequences for my project.
I have the following table. It holds data that was collected online from an external source.

    app_data
    +--------------+-------+------------+--------+---------+-------+
    | surrogateKey | appid | created    | owners | viewers | price |
    +--------------+-------+------------+--------+---------+-------+
    | 1            | 2     | 1472428100 | 10     | 25      | 10,00 |
    | 2            | 2     | 1472428200 | 11     | 50      | 10,00 |
    | 3            | 2     | 1472428300 | 15     | 50      | 10,00 |
    | 4            | 2     | 1472428400 | 22     | 51      | 8,00  |
    | 5            | 2     | 1472428500 | null   | 50      | 8,00  |
    | 6            | 2     | 1472428600 | 20     | 49      | 8,00  |
    | 7            | 2     | 1472428700 | 25     | 50      | 10,00 |
    | ...          |       |            |        |         |       |
    +--------------+-------+------------+--------+---------+-------+

    CREATE TABLE app_data(
        surrogateKey     BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
        appID            INT UNSIGNED NOT NULL,
        created          TIMESTAMP NOT NULL,
        owners           INT UNSIGNED,
        owners_variance  MEDIUMINT UNSIGNED,
        viewers          MEDIUMINT UNSIGNED, -- TINYINT would be better but application layer cannot handle tinyint
        viewers_variance MEDIUMINT UNSIGNED,
        price            MEDIUMINT UNSIGNED,
        CONSTRAINT fk_appData_app_appID
            FOREIGN KEY (appID) REFERENCES app(appID) ON UPDATE CASCADE
    ) ENGINE=INNODB;

Does it make sense to create this surrogateKey? It adds no value, and it will not be used for joins. appID and created would make a fine composite key, and I will use those columns for joining tables.
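For comparison, the composite-key variant of the table could look roughly like this. It is only a sketch of the option being asked about, not a recommendation: it assumes there is at most one row per (appID, created), which the sample rows suggest but do not guarantee, and the `app` table is a stub added so the example stands alone.

```sql
-- Stub of the parent table (assumption: its real columns are not shown in the post).
CREATE TABLE IF NOT EXISTS app(
    appID INT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=INNODB;

-- Alternative definition of app_data without surrogateKey.
-- Assumption: (appID, created) is unique.
CREATE TABLE app_data(
    appID            INT UNSIGNED NOT NULL,
    created          TIMESTAMP NOT NULL,
    owners           INT UNSIGNED,
    owners_variance  MEDIUMINT UNSIGNED,
    viewers          MEDIUMINT UNSIGNED,
    viewers_variance MEDIUMINT UNSIGNED,
    price            MEDIUMINT UNSIGNED,
    PRIMARY KEY (appID, created),
    -- The leading appID column of the primary key also serves the foreign key.
    CONSTRAINT fk_appData_app_appID
        FOREIGN KEY (appID) REFERENCES app(appID) ON UPDATE CASCADE
) ENGINE=INNODB;
```

In InnoDB the primary key is the clustered index, so with this definition the rows of one appID are stored together in created order.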
appID and created will be used for joins, comparisons, range queries, and so on. How should I index them? One index on each column? A composite index on (appID, created)? On (created, appID)?
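Spelled out as DDL, the three candidates look roughly like this. The index names are made up for illustration, and the statements are alternatives to compare, not something to create all at once. The `EXPLAIN` at the end is one way to check which index the optimizer actually picks for a typical "one app, one time range" query.

```sql
-- Option A: one single-column index per column.
-- (InnoDB already creates an index on appID for the foreign key if no usable one exists.)
CREATE INDEX idx_appData_appID   ON app_data (appID);
CREATE INDEX idx_appData_created ON app_data (created);

-- Option B: composite index, appID first.
CREATE INDEX idx_appData_appID_created ON app_data (appID, created);

-- Option C: composite index, created first.
CREATE INDEX idx_appData_created_appID ON app_data (created, appID);

-- Check the chosen index for an equality on appID plus a range on created.
EXPLAIN
SELECT created, owners, viewers
FROM app_data
WHERE appID = 2
  AND created >= FROM_UNIXTIME(1472428100)
  AND created <  FROM_UNIXTIME(1472428500);
```

A composite index can be used through its leftmost columns, so option B fits queries that fix appID and scan a range of created, while option C fits queries that scan a time range across all apps.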
This table will be huge and will take up 95% of the storage space. How can I optimize it further for performance?
How should I handle random null values, as in the owners column? They basically mean "the thermometer was broken that day". I am considering either turning them into zero or leaving them as null. null means that there was no measurement that day, whereas zero would be ambiguous but easier to analyse.
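To make the difference concrete, here is a small sketch using the seven owners values from the sample rows, inlined as a derived table so it runs without the real data. Aggregate functions such as `AVG` and `COUNT(column)` skip null, so converting null to zero changes the results.

```sql
SELECT
    AVG(owners)              AS avg_nulls_skipped,  -- 103 / 6, the null row is ignored
    AVG(COALESCE(owners, 0)) AS avg_nulls_as_zero,  -- 103 / 7, the null row counts as 0
    COUNT(*)                 AS all_rows,           -- 7
    COUNT(owners)            AS measured_rows       -- 6
FROM (
    SELECT 10 AS owners
    UNION ALL SELECT 11
    UNION ALL SELECT 15
    UNION ALL SELECT 22
    UNION ALL SELECT NULL
    UNION ALL SELECT 20
    UNION ALL SELECT 25
) AS sample;
```

If the column keeps its null values, the zero-filled view is still available at query time through `COALESCE`, whereas a stored zero can no longer be told apart from a real measurement of zero.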
How should I handle values that don't make sense? For example, the upper limit for viewers is 50. There is one row with 51, which means the value is wrong. Should I turn it into null? Into 50?
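Both options can be previewed without modifying the stored data. The query below is a sketch against the app_data table as defined above; the limit of 50 is the one stated in the question, and the output column names are made up.

```sql
-- Show each out-of-range row next to the two candidate replacements.
SELECT
    appID,
    created,
    viewers                                           AS viewers_raw,
    CASE WHEN viewers > 50 THEN NULL ELSE viewers END AS viewers_as_null,  -- option 1: treat as missing
    LEAST(viewers, 50)                                AS viewers_clamped   -- option 2: cap at the limit
FROM app_data
WHERE viewers > 50;
```

Running this first also shows how many rows are affected, which may matter for the decision.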
Should I normalize price into something like this? Normalization would make the data harder to analyse and lead to more complicated queries.

    CREATE TABLE app_price( -- table name is a placeholder; none was chosen yet
        appID      INT NOT NULL PRIMARY KEY,
        created    TIMESTAMP,
        start_date TIMESTAMP,
        end_date   TIMESTAMP,
        price      INT
    );

Thanks!