
I recently started with Ceph: I inherited one large cluster for maintenance and am now building a recovery cluster. By trial and error I managed to create CRUSH rules that fit my purpose, but I failed to understand the syntax of the CRUSH rule definition. Could someone please explain it (without referencing the Ceph docs, since they don't explain this)?
Here is the setup of my production cluster:
20 hosts distributed in 2 rooms, 2 racks in each room, 5 servers per rack, 10 OSDs per host, 200 OSDs in total.
Someone wanted a super-safe setup, so replication is 2/4 and the rules are (supposedly) defined to replicate to the other room as well: 2 copies in each rack, 4 in total for every object. Here is the rule:

rule replicated_nvme {
    id 4
    type replicated
    min_size 1
    max_size 100
    step take default class nvme
    step choose firstn 0 type room
    step choose firstn 2 type rack
    step chooseleaf firstn 1 type host
    step emit
}

On my new cluster I have a smaller setup, just 2 racks with 2 servers in each, for testing. I tried this rule, similar to the one above but without the room step:

rule replicated-nvme {
    id 6
    type replicated
    step take default class nvme
    step choose firstn 0 type rack
    step chooseleaf firstn 1 type host
    step emit
}

However, this doesn't produce the desired result (with replication 2/4 it should also copy to the other rack, with each copy on a different server). What I got is 2 replicas on servers in different racks, and the 2 additional copies are not created. Ceph reports this:

pgs: 4/8 objects degraded (50.000%)
     1 active+undersized+degraded

and I see that only 2 OSDs are used, not 4!

So I played around and just changed it to this:

rule replicated-nvme {
    id 6
    type replicated
    step take default class nvme
    step choose firstn 0 type rack
    step chooseleaf firstn 0 type host
    step emit
}

and it works. Pool PGs are replicated to 4 OSDs across 2 racks (2 OSDs per rack). The only difference is chooseleaf firstn 0 type host instead of chooseleaf firstn 1 type host.
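For reference, this is roughly how I check which OSDs a PG actually maps to (the pool and object names below are just examples):

ceph pg ls-by-pool testpool          # list the pool's PGs with their UP/ACTING OSD sets
ceph osd map testpool someobject     # show which PG and which OSDs a given object maps to
ceph osd tree                        # show the rack/host/OSD hierarchy the rule operates on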

The questions are:

  • What is the difference between choose and chooseleaf?
  • What is the meaning of the number after firstn?
  • How is the hierarchy of the steps defined, i.e. what is evaluated first and what comes after?

In short, I would like to know the syntax of CRUSH rules.
Just for clarification: although the production cluster has an even number of hosts per room/rack and an even replication rule, the object distribution is not perfectly even, i.e. the PG distribution may differ by up to 10% per OSD.
I suspect that the 1st rule defined above is wrong and that the distribution ends up more or less equal purely because of the large number of OSDs.

1 Answer


The docs are quite clear about the options, so I'm not sure what your complaint is. I'll try to break it down anyway, based on your example:

step take default class nvme
step choose firstn 0 type rack
step chooseleaf firstn 0 type host
  • Go into the "default" CRUSH root (usually the top of your ceph osd tree if you only have one root) and select only devices of class "nvme".
  • Then choose buckets of type "rack". If {num} == 0, choose pool-num-replicas buckets (or as many as are available), so in your case 2 racks.
  • In each rack, chooseleaf selects OSDs underneath buckets of type host; it is called "leaf" because it walks down the branch of the CRUSH tree all the way to the devices. Again, if {num} == 0 it chooses as many OSDs as pool-num-replicas (see the annotated rule below).
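Put back together as a complete rule, an annotated version of your working rule might look like this (the # comments are just annotations for readability):

rule replicated-nvme {
    id 6                                  # rule id, must be unique within the crushmap
    type replicated
    step take default class nvme          # start at the "default" root, restricted to nvme-class devices
    step choose firstn 0 type rack        # pick pool-num-replicas racks (only 2 exist here, so 2)
    step chooseleaf firstn 0 type host    # in each rack, descend to OSDs, each under a different host
    step emit                             # output the selected OSDs
}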

If you want to be more specific, you can replace the 0 with the exact number of buckets you want CRUSH to choose. For example, this should work as well to get 4 replicas in total, distributed across two racks:

step take default class nvme
step choose firstn 2 type rack
step chooseleaf firstn 2 type host
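Wrapped in a full rule definition, that might look like the following sketch (the rule name and id are only examples; pick an id that is free in your crushmap):

rule replicated-nvme-2racks {
    id 7                                  # example id, must not collide with an existing rule
    type replicated
    step take default class nvme
    step choose firstn 2 type rack        # exactly 2 racks
    step chooseleaf firstn 2 type host    # 2 OSDs per rack, each under a different host
    step emit
}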

The hierarchy is evaluated from top to bottom. You can also play with crushtool; it can be quite valuable if you want to test the distribution without modifying the current crushmap live, which can cause unexpected data movement. The data distribution of your cluster depends on a couple of factors, and the applied rule is not the most important one in that regard. Is the balancer enabled? ceph balancer status shows the current state. The number of PGs per OSD is also quite important: if your PGs are large and you only have a few of them, the data can't be distributed evenly.
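A rough sketch of how such a dry run could look (rule id, replica count and file names are just examples):

ceph osd getcrushmap -o crushmap.bin                   # export the compiled crushmap
crushtool -d crushmap.bin -o crushmap.txt              # decompile it to inspect or edit the rules
crushtool -i crushmap.bin --test --rule 6 --num-rep 4 --show-mappings     # simulate which OSDs would be picked
crushtool -i crushmap.bin --test --rule 6 --num-rep 4 --show-utilization  # estimate how evenly data would spread
ceph balancer status                                   # check whether the balancer is active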

  • I'm a bit slow, so I didn't get it from the docs. Thanks for the explanation! It's marvelous how a few sentences can explain what lengthy docs can't. Commented Aug 12, 2024 at 8:25
