The CRUSH Failure-Domain Hierarchy
Ceph's CRUSH map is a tree: osd → host → chassis → rack → row → room → datacenter (root at the top). When you set a pool's failure domain to "host," CRUSH guarantees no two copies (or EC chunks) of the same PG land on the same host — but says nothing about whether they land in the same rack. Choosing a higher level in the hierarchy protects against a bigger blast radius (a whole rack losing power) at the cost of requiring more distinct domains at that level to satisfy your protection scheme.
Most single-rack or small clusters use host as the failure domain, since that's the most granular level above the OSD itself and matches the most common real failure mode (a server dying). Multi-rack deployments with redundant power/networking per rack can justify rack as the failure domain — but only if there are enough racks to satisfy the protection scheme.
Minimum domains by protection scheme
| Scheme | Minimum Domains | Recommended Domains | min_size |
| Replicated size=2 | 2 | 3 | 1 |
| Replicated size=3 | 3 | 4 | 2 |
| Replicated size=4 | 4 | 5 | 3 |
| EC 4+2 | 6 | 7 | 5 |
| EC 8+3 | 11 | 12 | 9 |
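The rows above follow a fixed pattern, which a short Python sketch can make explicit. The function names are hypothetical (they are not part of Ceph), and the replicated min_size uses the size − 1 convention shown in the table.

```python
def replicated_requirements(size):
    """Domain counts and min_size for a replicated pool, following the table."""
    return {
        "minimum_domains": size,          # one distinct domain per copy
        "recommended_domains": size + 1,  # one spare domain to recover into
        "min_size": size - 1,             # convention used in the table
    }


def ec_requirements(k, m):
    """Domain counts and min_size for an erasure-coded k+m pool."""
    return {
        "minimum_domains": k + m,          # one distinct domain per chunk
        "recommended_domains": k + m + 1,  # one spare domain to recover into
        "min_size": k + 1,
    }


print(replicated_requirements(3))
print(ec_requirements(4, 2))
print(ec_requirements(8, 3))
```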
"Minimum" is the bare floor CRUSH needs to place data at all. "Recommended" adds one spare domain so the cluster can recover after losing a single domain without going degraded indefinitely — see the Usable Capacity calculator for how this same +1 reserve logic affects usable space.
min_size — Why I/O Halts Instead of Risking Data
min_size is the minimum number of copies (replicated) or chunks (EC) that must be available for a PG to serve I/O at all. For replicated pools the table above uses size − 1; Ceph's built-in default is size − ⌊size/2⌋, which gives the same value for size=2 and size=3 but 2 rather than 3 for size=4. For erasure coding it's k + 1. If the available copies/chunks drop below min_size (say, two simultaneous host failures on a size=3 pool with min_size=2), Ceph stops serving I/O on the affected PGs entirely rather than accept writes that can't be reliably protected. This looks alarming (the cluster appears to "freeze" for affected pools), but it's a safer failure mode than silently continuing without redundancy.
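As a rough illustration of that rule, the sketch below classifies a PG by how many copies or chunks are still available. The function name and the returned labels are made up for this example; they are not Ceph PG state names. The needed_to_read parameter is an assumption of the sketch: 1 for a replicated pool, k for an EC pool.

```python
def pg_io_state(available, min_size, needed_to_read):
    """Classify a PG by the number of copies/chunks still available."""
    if available >= min_size:
        return "serving I/O"
    if available >= needed_to_read:
        # Enough copies/chunks survive to reconstruct the data,
        # but too few to accept I/O safely.
        return "I/O paused, data recoverable"
    return "I/O paused, data unavailable"


# Replicated size=3, min_size=2
print(pg_io_state(available=2, min_size=2, needed_to_read=1))  # one host down
print(pg_io_state(available=1, min_size=2, needed_to_read=1))  # two hosts down

# EC 4+2, min_size=5
print(pg_io_state(available=4, min_size=5, needed_to_read=4))  # two chunks lost
```

Note the EC case: with four of six chunks left the data can still be reconstructed, yet I/O stays paused until at least five chunks are available again.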
Frequently Asked Questions
Why not just always use rack as the failure domain for safety?
Because it requires more physical domains to satisfy the same protection scheme. A single-rack cluster literally cannot use rack as a meaningful failure domain — there's only one rack, so CRUSH has nowhere else to place additional copies/chunks. Match the failure domain level to how many of that unit you actually have, not aspirationally to the safest-sounding option.
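One way to apply this advice is to count the buckets you have at each level and keep only the levels that meet your scheme's domain requirement. The helper below is a minimal sketch with hypothetical names and example inventories; it only compares counts and knows nothing about how evenly capacity is spread across those buckets.

```python
# CRUSH levels named in this article, ordered from smallest to largest blast radius.
LEVELS = ["host", "chassis", "rack", "row", "room", "datacenter"]


def viable_failure_domains(bucket_counts, required_domains):
    """Return the levels that have at least `required_domains` buckets."""
    return [
        level for level in LEVELS
        if bucket_counts.get(level, 0) >= required_domains
    ]


single_rack = {"host": 5, "rack": 1}   # hypothetical inventory
multi_rack = {"host": 12, "rack": 4}   # hypothetical inventory

print(viable_failure_domains(single_rack, 3))  # ['host']
print(viable_failure_domains(multi_rack, 3))   # ['host', 'rack']
```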
What's the difference between the firstn and indep CRUSH modes?
firstn is used for replicated pools: if a chosen OSD becomes unavailable, CRUSH skips it and the later choices shift up, which is fine because all replicas are interchangeable. indep is used for erasure-coded pools, where each chunk position (1st data chunk, 2nd parity chunk, etc.) is meaningful. indep mode replaces a failed position independently and leaves the other positions where they are; firstn's shifting would reassign chunk positions to different OSDs and force surviving chunks to move.
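The difference in how positions behave can be shown with a toy model. This is not the real CRUSH algorithm (which selects OSDs by hashing over the hierarchy); it only mimics the resulting orderings, and the function names and OSD IDs are invented for the example.

```python
def remap_firstn(mapping, failed, candidates):
    """Drop the failed OSD, shift survivors up, append a new OSD at the end."""
    survivors = [osd for osd in mapping if osd != failed]
    replacement = next(c for c in candidates if c not in mapping)
    return survivors + [replacement]


def remap_indep(mapping, failed, candidates):
    """Replace only the failed position; every other position stays put."""
    replacement = next(c for c in candidates if c not in mapping)
    return [replacement if osd == failed else osd for osd in mapping]


def positions_changed(old, new):
    """Count the positions whose OSD differs between two mappings."""
    return sum(a != b for a, b in zip(old, new))


mapping = [10, 11, 12, 13]   # hypothetical OSD IDs, one per position
spares = [20, 21]            # hypothetical replacement candidates

firstn_result = remap_firstn(mapping, failed=11, candidates=spares)
indep_result = remap_indep(mapping, failed=11, candidates=spares)

print(firstn_result, positions_changed(mapping, firstn_result))  # [10, 12, 13, 20] 3
print(indep_result, positions_changed(mapping, indep_result))    # [10, 20, 12, 13] 1
```

For replicas, the three changed positions under firstn do not matter because any OSD holding a copy is as good as another. For EC, each changed position would mean a chunk has to be relocated or rebuilt, which is why indep changes only one.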
How do I verify my CRUSH map actually has the domains I think it does?
Run ceph osd crush tree --show-shadow to see the full hierarchy including the per-device-class shadow trees, or ceph osd tree for a simpler host/OSD view. If you're mixing device classes (hdd/ssd/nvme) on the same hosts, make sure your crush rule specifies the device class — otherwise CRUSH may place PGs on the wrong tier.
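If you would rather count domains than eyeball the tree, the JSON form of the output can be tallied by bucket type. The sketch below assumes that ceph osd tree -f json returns an object with a top-level "nodes" list whose entries each carry a "type" field; check that against your release before relying on it. The sample document is hand-written for illustration, not captured from a real cluster.

```python
import json
from collections import Counter


def count_bucket_types(osd_tree_json):
    """Tally CRUSH nodes by type from `ceph osd tree -f json` style output."""
    tree = json.loads(osd_tree_json)
    # Assumed shape: {"nodes": [{"name": ..., "type": ...}, ...]}
    return Counter(node["type"] for node in tree.get("nodes", []))


# Hypothetical, trimmed-down sample: one root, two hosts, three OSDs.
SAMPLE = json.dumps({
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": -2, "name": "node-a", "type": "host"},
        {"id": -3, "name": "node-b", "type": "host"},
        {"id": 0, "name": "osd.0", "type": "osd"},
        {"id": 1, "name": "osd.1", "type": "osd"},
        {"id": 2, "name": "osd.2", "type": "osd"},
    ]
})

counts = count_bucket_types(SAMPLE)
print(counts["host"], counts["rack"])  # 2 0
```

A count of zero for a level (rack, in this sample) means that level cannot serve as a failure domain until the buckets exist in the CRUSH map.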
Can I change a pool's failure domain after creation?
Yes — failure domain lives in the CRUSH rule, not the pool itself. Create a new crush rule at the desired domain level and apply it with ceph osd pool set <pool> crush_rule <new-rule>. This triggers a full data rebalance as PGs move to satisfy the new placement rule, so treat it like any other major topology change — stage it and monitor recovery.
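For a replicated pool, the two steps might look like the commands this sketch builds. It only assembles argument lists and does not run anything. The helper name, pool name, and rule name are hypothetical; "default" is assumed to be the CRUSH root, and erasure-coded pools need a different rule-creation command that is not covered here.

```python
def failure_domain_change_commands(pool, rule_name, failure_domain,
                                   root="default", device_class=None):
    """Build (not run) the commands to move a replicated pool to a new rule."""
    create_rule = ["ceph", "osd", "crush", "rule", "create-replicated",
                   rule_name, root, failure_domain]
    if device_class:
        # Optional trailing argument restricts the rule to one device class.
        create_rule.append(device_class)
    apply_rule = ["ceph", "osd", "pool", "set", pool, "crush_rule", rule_name]
    return [create_rule, apply_rule]


# Hypothetical pool and rule names, for illustration only.
for cmd in failure_domain_change_commands("mypool", "replicated-rack", "rack",
                                          device_class="hdd"):
    print(" ".join(cmd))
```

Keeping the commands as data makes it easy to review them before execution, which suits a change that moves most of the pool's data.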