Which one of those two schemas is more scalable?

Question

I'm toying with two schemas and I can't decide which is more scalable. The schema is for a Q&A site, and it's built in MySQL. People post questions and answers, and like/dislike/favourite questions and answers. A question can have many answers/likes/dislikes, and an answer can have many likes/dislikes.

To display a question to a user, both schemas require the same number of joins, but the joins are handled differently:

Schema 1

questions(id, title, body, userId)
questionLikes(id, questionId, userId)
questionDislikes(id, questionId, userId)
questionComments(id, questionId, body, userId)
answers(id, questionId, body, userId)
answerLikes(id, answerId, userId)
answerDislikes(id, answerId, userId)
answerComments(id, answerId, userId, body)
favourites(id, questionId, userId)

This is more normalized and easier to develop for, but is it scalable? There seems to be a lot of repeated information. The join sequence to fetch a question for a user (we want to include his like/dislike activity) is:

select question
join answers
join questionLikes
join questionDislikes
join questionComments
join favourites
join answers to answerLikes
join answers to answerDislikes
join answers to answerComments
(multiply answer joins by number of answers)
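As a rough illustration of the Schema 1 read path, the sketch below fetches one question and its answers together with the viewing user's like/dislike flags. It is only a sketch under stated assumptions: SQLite (via Python's `sqlite3`) stands in for MySQL so it runs anywhere, the sample rows and the viewer id 99 are made up, and each user is assumed to have at most one like or dislike per item. Comments and favourites are left out to keep it short.

```python
import sqlite3

# SQLite stands in for MySQL purely so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions        (id INTEGER PRIMARY KEY, title TEXT, body TEXT, userId INTEGER);
CREATE TABLE questionLikes    (id INTEGER PRIMARY KEY, questionId INTEGER, userId INTEGER);
CREATE TABLE questionDislikes (id INTEGER PRIMARY KEY, questionId INTEGER, userId INTEGER);
CREATE TABLE answers          (id INTEGER PRIMARY KEY, questionId INTEGER, body TEXT, userId INTEGER);
CREATE TABLE answerLikes      (id INTEGER PRIMARY KEY, answerId INTEGER, userId INTEGER);
CREATE TABLE answerDislikes   (id INTEGER PRIMARY KEY, answerId INTEGER, userId INTEGER);

-- Made-up sample data: user 99 liked the question and disliked answer 2.
INSERT INTO questions VALUES (1, 'Which schema?', 'Schema 1 or 2?', 10);
INSERT INTO answers VALUES (1, 1, 'Use one posts table', 20), (2, 1, 'Keep them split', 30);
INSERT INTO questionLikes VALUES (1, 1, 99);
INSERT INTO answerDislikes VALUES (1, 2, 99);
""")

params = {"viewer": 99, "question": 1}

# Putting the viewer's userId in the ON clause keeps each LEFT JOIN
# to at most one matching row per question or answer.
question_row = conn.execute("""
    SELECT q.id,
           ql.id IS NOT NULL AS liked,
           qd.id IS NOT NULL AS disliked
    FROM questions q
    LEFT JOIN questionLikes    ql ON ql.questionId = q.id AND ql.userId = :viewer
    LEFT JOIN questionDislikes qd ON qd.questionId = q.id AND qd.userId = :viewer
    WHERE q.id = :question
""", params).fetchone()

answer_rows = conn.execute("""
    SELECT a.id,
           al.id IS NOT NULL AS liked,
           ad.id IS NOT NULL AS disliked
    FROM answers a
    LEFT JOIN answerLikes    al ON al.answerId = a.id AND al.userId = :viewer
    LEFT JOIN answerDislikes ad ON ad.answerId = a.id AND ad.userId = :viewer
    WHERE a.questionId = :question
    ORDER BY a.id
""", params).fetchall()

print(question_row)  # (1, 1, 0)
print(answer_rows)   # [(1, 0, 0), (2, 0, 1)]
```

Splitting the read into one query for the question and one for its answers means the number of joins does not grow with the number of answers.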

Schema 2

posts(id, postTypeId, userId, title, body)
postTypeId(id, postType)
comments(id, postId, userId)
votes(id, voteTypeId, userId)
voteTypeId(id, voteType)

This is less normalized but more compact, and it seems like it would scale better, but it is a pain in the neck because of self joins and other development issues (conditional validation). The join sequence to fetch a question is:

select the question and its answers in the same read, using where @id for the question and @questionId for the answers; then, for each row, join the following:
join votes as likes on voteType 1
join votes as dislikes on voteType 2
join comments
join favourites
(multiply joins by number of rows)
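For comparison, here is a minimal sketch of the Schema 2 read. Several things are assumptions that the schema listing above does not spell out: a `postId` column on `votes`, a `questionId` column on `posts` linking an answer to its question, and a unique key on (`postId`, `userId`) so a user has at most one vote per post. SQLite stands in for MySQL, and the sample rows are made up. With that unique key, a single join to `votes` can return the viewer's `voteTypeId` (1 = like, 2 = dislike) instead of joining `votes` twice.

```python
import sqlite3

# SQLite stands in for MySQL purely so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- questionId (on posts) and postId (on votes) are assumed columns.
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    postTypeId INTEGER,      -- assumed: 1 = question, 2 = answer
    questionId INTEGER,      -- NULL for questions
    userId INTEGER,
    title TEXT,
    body TEXT
);
CREATE TABLE votes (
    id INTEGER PRIMARY KEY,
    postId INTEGER,
    voteTypeId INTEGER,      -- 1 = like, 2 = dislike
    userId INTEGER,
    UNIQUE (postId, userId)  -- one vote per user per post
);

-- Made-up sample data: user 99 liked the question and disliked answer 3.
INSERT INTO posts VALUES
    (1, 1, NULL, 10, 'Which schema?', 'Schema 1 or 2?'),
    (2, 2, 1,    20, NULL, 'Use one posts table'),
    (3, 2, 1,    30, NULL, 'Keep them split');
INSERT INTO votes VALUES (1, 1, 1, 99), (2, 3, 2, 99);
""")

# One read returns the question and its answers, each with the viewer's vote.
rows = conn.execute("""
    SELECT p.id, p.postTypeId, v.voteTypeId AS viewerVote
    FROM posts p
    LEFT JOIN votes v ON v.postId = p.id AND v.userId = :viewer
    WHERE p.id = :question OR p.questionId = :question
    ORDER BY p.id
""", {"viewer": 99, "question": 1}).fetchall()

print(rows)  # [(1, 1, 1), (2, 2, None), (3, 2, 2)]
```

A `None` in the last column means the viewer has not voted on that post.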

So which will scale better? I know I can add some additional fields to store counts so that no joins are necessary. But both require the same number of joins and I can't make up my mind.

I did not get very far in reading your question, but why would you have two different tables for questionLikes and questionDislikes? I guess the same remark applies to the rest of your schema.
– Patrick Honorez, Commented Nov 30, 2010 at 13:44
Because questions and answers might have the same ID since they're different objects.
– Mohamad, Commented Nov 30, 2010 at 13:55

smirkingman · Accepted Answer · 2010-11-30 13:39:11Z

1

I would go even further than Schema 2. The question is, what are the entities in your model? Answer: users and posts. A post can be a question, an answer, a vote, a comment or whatever, but it's always a post. Thus

posts(id, postTypeId, userId, title, body) postTypeId(id, postType)

BTW, both of the selects you mention retrieve everything (or were they just to show the worst possible join?).

I wouldn't see myself fetching his questions and his answers and his comments and... all in one go. Which use case would require everything like that?
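To make the "everything is a post" idea concrete, here is a small sketch, not a definitive implementation. The `parentId` column (linking an answer or vote to the post it belongs to) and the specific post types ('question', 'answer', 'like', 'dislike') are assumptions added for illustration; the lookup table keeps the name `postTypeId` from the listing above. SQLite stands in for MySQL and the rows are made up.

```python
import sqlite3

# SQLite stands in for MySQL purely so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE postTypeId (id INTEGER PRIMARY KEY, postType TEXT);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    postTypeId INTEGER REFERENCES postTypeId(id),
    parentId INTEGER REFERENCES posts(id),  -- assumed column, NULL for questions
    userId INTEGER,
    title TEXT,
    body TEXT
);

-- Assumed post types; votes are stored as posts too.
INSERT INTO postTypeId VALUES (1, 'question'), (2, 'answer'), (3, 'like'), (4, 'dislike');

-- Made-up data: one question with one answer and two likes, plus a dislike on the answer.
INSERT INTO posts VALUES
    (1, 1, NULL, 10, 'Which schema?', 'Schema 1 or 2?'),
    (2, 2, 1,    20, NULL, 'Use one posts table'),
    (3, 3, 1,    99, NULL, NULL),
    (4, 3, 1,    98, NULL, NULL),
    (5, 4, 2,    99, NULL, NULL);
""")

# Count everything attached directly to post 1, grouped by type.
counts = dict(conn.execute("""
    SELECT t.postType, COUNT(*)
    FROM posts p
    INNER JOIN postTypeId t ON t.id = p.postTypeId
    WHERE p.parentId = ?
    GROUP BY t.postType
""", (1,)).fetchall())

print(counts)  # {'answer': 1, 'like': 2}
```

The trade-off is that one table now carries rows with very different shapes (votes have no title or body), which is the conditional-validation cost mentioned in the question.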

4 Comments

Mohamad Over a year ago

smirkingman, thank you! What I meant was, for any user who is browsing a question, I would need to get the question, its answers, and the likes/dislikes of the question and of each answer (those numbers can be denormalized). But if a user is logged in and has voted on a question or its answers, I need to grab his votes wherever he voted, be it on the question or on the answers belonging to that question. It's not too dissimilar to Stack Overflow, really. I hope that makes sense.

smirkingman Over a year ago

With the model I suggest, it's easy: "select count(*) from posts inner join posttype on posttype=vote". But I think that still doesn't answer your question. Maybe explain the result you're trying to achieve; what do you want to do with "the answers, the likes/dislikes of the question, and of each answer (those numbers can be denormalized)"?

Mohamad Over a year ago

If I wanted to show how many likes/dislikes/answers a question/answer has, I can use extra columns in the posts table and update them via callbacks. That way no joins are necessary to count. But if I am a logged-in user, & I view a question & its answers, I may want to know if I have liked/disliked this question or its answers previously. I need to join the likes/dislikes tables for Qs and As on @userId (kind of like upmod/downmod here). In terms of scalability, is it better to have that info in a couple of tables (posts/votes) or in a bunch of tables (questions/answers/question likes/dislikes, etc.)?

smirkingman Over a year ago

I would recommend normalisation until A) you have hard proof that there's a performance issue, and B) you've tested that denormalising into "extra columns with callbacks" really improves things. Remember that a putative performance gain from adding a denormalised column will be offset by the extra memory that it will use up. Start with a clean solution that works. Worry about performance later. Anyway, by the time it's in production extra CPUs and memory will be even cheaper >;-)
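The counter-column idea discussed in these comments could look roughly like the sketch below. It is illustrative only: the `likeCount` column, the `postId` column on `votes`, the unique key on (`postId`, `userId`) and the `like_post` helper are all hypothetical names, and SQLite stands in for MySQL. The vote row stays the source of truth; the counter is a cache updated in the same transaction, so it can always be checked against a plain `COUNT(*)`.

```python
import sqlite3

# SQLite stands in for MySQL purely so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    body TEXT,
    likeCount INTEGER NOT NULL DEFAULT 0   -- hypothetical denormalised counter
);
CREATE TABLE votes (
    id INTEGER PRIMARY KEY,
    postId INTEGER,                        -- assumed column
    voteTypeId INTEGER,                    -- 1 = like, 2 = dislike
    userId INTEGER,
    UNIQUE (postId, userId)                -- one vote per user per post
);
INSERT INTO posts (id, body) VALUES (1, 'Schema 1 or 2?');
""")


def like_post(conn, post_id, user_id):
    """Hypothetical callback: record a like and bump the cached counter together."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO votes (postId, voteTypeId, userId) VALUES (?, 1, ?)",
            (post_id, user_id),
        )
        conn.execute(
            "UPDATE posts SET likeCount = likeCount + 1 WHERE id = ?",
            (post_id,),
        )


like_post(conn, 1, 98)
like_post(conn, 1, 99)
try:
    like_post(conn, 1, 99)  # duplicate vote: UNIQUE fails, counter is untouched
except sqlite3.IntegrityError:
    pass

# The cached value should agree with the normalised count.
cached = conn.execute("SELECT likeCount FROM posts WHERE id = 1").fetchone()[0]
counted = conn.execute(
    "SELECT COUNT(*) FROM votes WHERE postId = 1 AND voteTypeId = 1"
).fetchone()[0]
print(cached, counted)  # 2 2
```

Because the normalised `votes` rows are kept, the counter can be added later, once measurements show the `COUNT(*)` is actually a bottleneck, without changing the rest of the schema.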
