Guarantee consistency of data across microservices accessing a sharded cluster in MongoDB

Question

My application is essentially a bunch of microservices deployed across Node.js instances. One service might write some data while a different service will read those updates. (Specific example: I'm processing data that is inbound to my solution using a processing pipeline. Stage 1 does something, stage 2 does something else to the same data, etc. It's a fairly common pattern.)

So, I have a large data set (~250GB now, and I've read that once a DB gets much larger than this size, it is impossible to introduce sharding to a database, at least, not without some major hoop jumping). I want to have a highly available DB, so I'm planning on a replica set with at least one secondary and an arbiter.

I am still researching my 'sharding' options, but I think that I can shard my data by the 'client' that it belongs to and so I think it makes sense for me to have 3 shards.

First question, if I am correct, if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?

Second question. I've read conflicting info about what 'majority' means... If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down? If the Arbiter is still there, the election can happen and I'll still have a Primary. But does 'majority' refer to members of the replica set? Or to Secondaries? So, if I only have a Primary and I try to write with the 'majority' option, will I ever get an acknowledgement? If there is only a Primary, then 'majority' would mean a write to the Primary alone triggers the acknowledgement. Or would this just block until my timeout was reached, and then I would get an error?

Third question... I'm assuming that as long as I do writes with 'majority' acknowledgement and do reads from all the Primaries, I don't need to worry about causally consistent data? I've read that doing reads from 'Secondary' nodes is not worth the effort. If reading from a Secondary, you have to worry about 'eventual consistency' and since writes are getting synchronized, the Secondaries are essentially seeing the same amount of traffic that the Primaries are. So there isn't any benefit to reading from the Secondaries. If that is the case, I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards. Is this correct?

Fourth (and last) question... When are causally consistent sessions worthwhile? If I understand correctly, and I'm not sure that I do, then I think it is when I have a case like a typical web app (not some distributed application, like my current one), where there is just one (or two) nodes doing the reading and writing. In that case, I would use causally consistent sessions and do my writes to the Primary and reads from the Secondary. But, in that case, what would the benefit of reading from the Secondaries be, anyway? What am I missing? What is the use case for causally consistent sessions?

NEVER RUN AN ARBITER ON A DATA BEARING NODE. If that node fails, you lose your quorum and your replica set becomes unavailable.
– Markus W Mahlberg, Commented Jun 28, 2018 at 7:21
@MarkusWMahlberg Thanks for the comment. I've read, I think even in the MongoDB documentation, that it is suggested to run the Arbiter on the Primary node. Part of my overall confusion is that the MongoDB documentation lacks a certain amount of 'context' that makes particular details useful. You can see that from my confusion about causally consistent sessions... In general, the concept makes sense. But I'm not sure why I care?
– Chris, Commented Jun 28, 2018 at 16:32

Wan B. · Accepted Answer · 2018-08-14 06:44:20Z

if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?

A replica set Arbiter is still an instance of mongod. It's just that an Arbiter does not have a copy of the data and cannot become a Primary. You should have 3 instances per shard, which means 9 instances in total.

Since you mentioned that you would like to have a highly available database deployment, please note that the minimum recommended replica set configuration for a production deployment is a Primary with two Secondaries.
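The instance count can be checked with a small sketch, written in JavaScript since the question mentions Node.js. The shard names, host names, and the helper `buildShardConfig` are hypothetical; only the document shape (`_id`, `members`, `arbiterOnly`) follows the replica set configuration format. The Arbiter is placed on its own host, and config servers and mongos routers are not counted.

```javascript
// Hypothetical helper: builds a Primary/Secondary/Arbiter replica set
// configuration document for one shard. Host names are made up.
function buildShardConfig(shardName) {
  return {
    _id: shardName,
    members: [
      { _id: 0, host: `${shardName}-a.example.net:27018` },
      { _id: 1, host: `${shardName}-b.example.net:27018` },
      // The arbiter is still a mongod process; it votes but holds no data.
      { _id: 2, host: `${shardName}-arb.example.net:27018`, arbiterOnly: true },
    ],
  };
}

const shardConfigs = ["shard0", "shard1", "shard2"].map(buildShardConfig);

// Every member, arbiter included, is one mongod instance.
const totalMongods = shardConfigs.reduce((n, cfg) => n + cfg.members.length, 0);

// Only members without arbiterOnly hold a copy of the data.
const dataBearingMongods = shardConfigs.reduce(
  (n, cfg) => n + cfg.members.filter((m) => !m.arbiterOnly).length,
  0
);

console.log(totalMongods);       // 9
console.log(dataBearingMongods); // 6
```

Replacing each Arbiter with a second Secondary keeps the total at 9 instances but raises the data-bearing count to 9 as well.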

If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down?

When either the Primary or the Secondary becomes unavailable, a w:majority write will either:

Wait indefinitely (if no timeout is set),
Wait until the unavailable node is restored, or
Fail with a timeout error (if a timeout is set).

This is because an Arbiter carries no data and is unable to acknowledge writes, but is still counted as a voting member. See also Write Concern for Replica Sets.
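The behaviour above can be sketched as plain arithmetic. This is a simplified model, not driver code: it assumes every member has one vote, and the member objects with their `up` flag are hypothetical. It uses the rule from the MongoDB write concern documentation that the number of acknowledgements needed for w:majority is the smaller of a majority of all voting members (arbiters included) and the number of data-bearing voting members.

```javascript
// Acknowledgements required for w:"majority", assuming every member votes.
function writeMajorityCount(members) {
  const voting = members.length;
  const dataBearing = members.filter((m) => !m.arbiterOnly).length;
  return Math.min(Math.floor(voting / 2) + 1, dataBearing);
}

// Only reachable data-bearing members can acknowledge a write.
function canAcknowledgeMajority(members) {
  const reachable = members.filter((m) => !m.arbiterOnly && m.up).length;
  return reachable >= writeMajorityCount(members);
}

// Primary/Secondary/Arbiter, everything up.
const psaHealthy = [{ up: true }, { up: true }, { arbiterOnly: true, up: true }];
// Primary/Secondary/Arbiter with one data-bearing node down.
const psaOneDown = [{ up: true }, { up: false }, { arbiterOnly: true, up: true }];
// Primary with two Secondaries, one Secondary down.
const pssOneDown = [{ up: true }, { up: true }, { up: false }];
// A replica set that consists of a single Primary.
const primaryOnly = [{ up: true }];

console.log(canAcknowledgeMajority(psaHealthy));  // true
console.log(canAcknowledgeMajority(psaOneDown));  // false: the write waits or times out
console.log(canAcknowledgeMajority(pssOneDown));  // true
console.log(canAcknowledgeMajority(primaryOnly)); // true: the Primary alone is the majority
```

Setting `wtimeout` in the write concern document, for example `{ w: "majority", wtimeout: 5000 }`, turns the indefinite wait in the `psaOneDown` case into a timeout error.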

I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards

Correct. MongoDB Sharding scales horizontally by distributing load across shards, while MongoDB Replication provides high availability.

If you read only from the Primary and also specify readConcern:majority, the application will read data that has been acknowledged by a majority of the replica set members. This data is durable in the event of a partition (i.e. it will not be rolled back). See also Read Concern 'majority'.
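One way to apply these settings to every microservice is through the connection string. The sketch below only builds the string; the mongos host names, the database name, and the 5000 ms timeout are hypothetical, while `w`, `wtimeoutMS`, `readConcernLevel`, and `readPreference` are standard MongoDB connection string options.

```javascript
// Hypothetical mongos router addresses and database name.
const routers = ["mongos0.example.net:27017", "mongos1.example.net:27017"];
const databaseName = "pipeline";

// Majority writes with a timeout, majority reads, served by the Primaries.
const uriOptions = new URLSearchParams({
  w: "majority",
  wtimeoutMS: "5000",
  readConcernLevel: "majority",
  readPreference: "primary",
});

const uri = `mongodb://${routers.join(",")}/${databaseName}?${uriOptions.toString()}`;

console.log(uri);
// mongodb://mongos0.example.net:27017,mongos1.example.net:27017/pipeline?w=majority&wtimeoutMS=5000&readConcernLevel=majority&readPreference=primary
```

Keeping the options in the connection string means each pipeline stage picks up the same read and write guarantees without repeating them per operation.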

What is the use case for causally consistent sessions?

Causal Consistency is used if the application requires an operation to be logically dependent on a preceding operation (causal). For example, a write operation that deletes all documents based on a specified condition and a subsequent read operation that verifies the delete operation have a causal relationship. This is especially important in a sharded cluster environment, where write operations may go to different replica sets.
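The delete-then-verify example could look roughly like the sketch below. It assumes `client` is an already connected `MongoClient` from the official `mongodb` Node.js driver; the function name `deleteThenVerify` and its parameters are hypothetical. Majority read and write concerns are set on the collection because the causal guarantees depend on them.

```javascript
// Assumes `client` is a connected MongoClient from the `mongodb` driver.
// Database name, collection name, and filter are supplied by the caller.
async function deleteThenVerify(client, dbName, collectionName, filter) {
  const collection = client.db(dbName).collection(collectionName, {
    readConcern: { level: "majority" },
    writeConcern: { w: "majority" },
  });

  // Operations that share this session are causally ordered.
  const session = client.startSession({ causalConsistency: true });
  try {
    // The write: delete everything that matches the condition.
    const { deletedCount } = await collection.deleteMany(filter, { session });
    // The dependent read: it observes the delete because it runs in the same session.
    const remaining = await collection.countDocuments(filter, { session });
    return { deletedCount, remaining };
  } finally {
    await session.endSession();
  }
}
```

A call such as `deleteThenVerify(client, "pipeline", "events", { status: "expired" })` (database, collection, and filter are made-up names) would be expected to report `remaining` as 0 unless another writer inserts matching documents in between.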
