How Ceph Erasure Coding Works
Erasure coding splits each object into k data chunks and computes m parity chunks, then stores all k+m chunks on distinct failure domains. The cluster can lose any m of those chunks and still reconstruct the full object; that is the fault tolerance. Unlike replication, EC doesn't store full copies, so it needs far less raw storage for comparable durability: a 4+2 pool consumes 1.5× the logical data size and survives two lost chunks, whereas 3x replication consumes 3× to survive two lost copies.
The tradeoff is CPU cost (encoding/decoding) and reconstruction reads when chunks are missing, which is why EC suits sequential, less latency-sensitive workloads — RGW object storage, backups, cold archives — better than low-latency block storage.
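To make the chunk-and-parity idea concrete, here is a minimal sketch of the simplest possible erasure code: k data chunks plus a single XOR parity chunk (m=1). This is not what Ceph's plugins do internally (Reed-Solomon codes generalize the idea to m > 1); the function names `encode` and `reconstruct` are hypothetical and exist only for illustration.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k zero-padded data chunks and append one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = bytes(chunk_len)  # starts as all zeros
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return chunks + [parity]


def reconstruct(chunks: List[Optional[bytes]], k: int, length: int) -> bytes:
    """Rebuild the original data from k+1 chunks, at most one of which is None."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity chunk tolerates only one lost chunk")
    repaired = list(chunks)
    if missing:
        # XOR of all surviving chunks (data and parity) equals the lost chunk.
        survivors = [c for c in chunks if c is not None]
        rebuilt = bytes(len(survivors[0]))
        for chunk in survivors:
            rebuilt = xor_bytes(rebuilt, chunk)
        repaired[missing[0]] = rebuilt
    return b"".join(repaired[:k])[:length]


if __name__ == "__main__":
    original = b"hello ceph erasure"
    stored = encode(original, k=4)   # 4 data chunks + 1 parity chunk
    stored[2] = None                 # simulate losing one failure domain
    print(reconstruct(stored, k=4, length=len(original)) == original)
```

The reconstruction step reads every surviving chunk, which is the "reconstruction reads" cost described above.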
Efficiency by profile
| Profile | Efficiency (k/(k+m)) | Overhead | Tolerates | Min Domains |
| --- | --- | --- | --- | --- |
| 2+2 | 50.0% | 2.00× | 2 failures | 4 |
| 4+2 | 66.7% | 1.50× | 2 failures | 6 |
| 6+2 | 75.0% | 1.33× | 2 failures | 8 |
| 8+3 | 72.7% | 1.38× | 3 failures | 11 |
| 8+4 | 66.7% | 1.50× | 4 failures | 12 |
Ceph's own default EC profile is k=2, m=2, but the documentation recommends k=4, m=2 as a practical starting point for most clusters — it delivers roughly twice the usable capacity of 3x replication while keeping the domain count and CPU overhead manageable.
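The table values follow directly from k and m. The sketch below recomputes them and compares a profile against 3x replication; `ec_stats` and `replica_efficiency` are hypothetical helper names, not part of any Ceph tooling.

```python
def ec_stats(k: int, m: int) -> dict:
    """Derive the table columns for an erasure-code profile from k and m."""
    total = k + m
    return {
        "efficiency": k / total,   # usable fraction of raw capacity
        "overhead": total / k,     # raw bytes stored per logical byte
        "tolerates": m,            # chunk losses survivable per object
        "min_domains": total,      # one failure domain per chunk
    }


def replica_efficiency(copies: int) -> float:
    """Usable fraction of raw capacity for a replicated pool."""
    return 1 / copies


if __name__ == "__main__":
    for k, m in [(2, 2), (4, 2), (6, 2), (8, 3), (8, 4)]:
        s = ec_stats(k, m)
        print(f"{k}+{m}: {s['efficiency']:.1%} usable, "
              f"{s['overhead']:.2f}x raw, tolerates {s['tolerates']}, "
              f"needs {s['min_domains']} domains")
    ratio = ec_stats(4, 2)["efficiency"] / replica_efficiency(3)
    print(f"4+2 vs 3x replication: {ratio:.1f}x the usable capacity")
```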
Failure Domains, min_size, and the #1 EC Gotcha
Each of the k+m chunks must land on a distinct failure domain; otherwise losing one domain could take out more chunks than the profile tolerates. That means a 4+2 profile needs at least 6 hosts (with crush-failure-domain=host); with fewer, CRUSH cannot map every chunk to its own host, so the pool's PGs stay undersized or never become active. This tool recommends k+m+1 domains so the cluster has room to recover after losing one, rather than running at the bare minimum indefinitely.
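The domain-count rule can be expressed as a small check. This is a sketch of the arithmetic described above, not this tool's actual code; `check_domains` and its return labels are hypothetical.

```python
def check_domains(k: int, m: int, available_domains: int) -> str:
    """Classify a failure-domain count against an EC profile.

    Returns "insufficient" below k+m, "minimum" at exactly k+m,
    and "recommended" at k+m+1 or more.
    """
    minimum = k + m            # one distinct domain per chunk
    recommended = minimum + 1  # one spare domain to rebuild onto after a loss
    if available_domains < minimum:
        return "insufficient"
    if available_domains < recommended:
        return "minimum"
    return "recommended"


if __name__ == "__main__":
    for hosts in (5, 6, 7):
        print(hosts, "hosts for 4+2:", check_domains(4, 2, hosts))
```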
min_size for an EC pool defaults to k+1, not k. With min_size equal to k, the pool would keep serving I/O with zero spare parity chunks, and any further loss during that window would mean unrecoverable data loss. Ceph therefore halts I/O on a PG that drops below min_size rather than risking it.
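A rough model of how the number of surviving chunks maps to PG behavior, assuming min_size = k+1 as described above. The function name and state labels are hypothetical and much simpler than Ceph's real PG states.

```python
def pg_io_state(k: int, m: int, chunks_available: int) -> str:
    """Classify a PG by how many of its k+m chunks are still available."""
    if not 0 <= chunks_available <= k + m:
        raise ValueError("chunks_available must be between 0 and k+m")
    min_size = k + 1
    if chunks_available < k:
        return "data_unrecoverable"  # fewer than k chunks cannot rebuild the object
    if chunks_available < min_size:
        return "io_halted"           # exactly k chunks: recoverable, but no spare parity
    return "serving_io"


if __name__ == "__main__":
    for available in (6, 5, 4, 3):
        print(f"4+2 with {available} chunks:", pg_io_state(4, 2, available))
```

Note that under this model an m=1 profile has min_size equal to k+m, so losing a single chunk already pauses I/O on the affected PGs.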
The #1 EC gotcha: an erasure-code profile is locked in at pool creation. You cannot change k, m, or the plugin/technique on an existing pool — to change the profile you must create a new pool and migrate data. Plan capacity and fault tolerance carefully before running ceph osd pool create.
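Because the profile cannot be changed later, it can help to generate and review the exact commands before running them. The sketch below only builds command strings and does not execute anything; the helper name and the profile and pool names are hypothetical, and the `technique` key is assumed to apply to the jerasure plugin as used here.

```python
import shlex
from typing import List


def ec_pool_commands(profile: str, pool: str, k: int, m: int, pg_num: int,
                     failure_domain: str = "host",
                     plugin: str = "jerasure",
                     technique: str = "reed_sol_van") -> List[str]:
    """Return the two ceph CLI commands that define a profile and create the pool."""
    profile_cmd = [
        "ceph", "osd", "erasure-code-profile", "set", profile,
        f"k={k}", f"m={m}",
        f"crush-failure-domain={failure_domain}",
        f"plugin={plugin}", f"technique={technique}",
    ]
    # pg_num is passed twice: once as pg_num and once as pgp_num.
    pool_cmd = [
        "ceph", "osd", "pool", "create", pool,
        str(pg_num), str(pg_num), "erasure", profile,
    ]
    return [shlex.join(profile_cmd), shlex.join(pool_cmd)]


if __name__ == "__main__":
    for cmd in ec_pool_commands("ec-4-2", "ecpool", k=4, m=2, pg_num=128):
        print(cmd)
```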
Frequently Asked Questions
Can I use erasure coding for RBD volumes?
Yes. Since Luminous, EC pools support overwrites for RBD and CephFS when the pool sits on BlueStore and has allow_ec_overwrites set to true. EC pools do not support omap, so RBD image metadata still lives in a replicated pool, with the EC pool attached as the image's data pool. Performance is lower than replicated pools for small random writes because partial-chunk overwrites require a read-modify-write cycle, so EC is most often used for RGW object storage, backup targets, or CephFS data pools with mostly large sequential I/O.
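The read-modify-write penalty can be illustrated with a simplified alignment check: a write that does not cover whole stripes forces the OSD to read the rest of the stripe before it can recompute parity. This is a conceptual sketch, not Ceph's actual write path; the function name is hypothetical and the 4096-byte stripe unit is an assumed value that you should confirm for your own profile.

```python
def needs_read_modify_write(offset: int, length: int, k: int,
                            stripe_unit: int = 4096) -> bool:
    """Return True if a write only partially covers at least one stripe.

    A full stripe holds k * stripe_unit bytes of data. A write that starts
    and ends on stripe boundaries can compute parity from the new data alone.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    stripe_width = k * stripe_unit
    return offset % stripe_width != 0 or length % stripe_width != 0


if __name__ == "__main__":
    # With k=4 and a 4 KiB stripe unit, one full stripe is 16 KiB of data.
    print(needs_read_modify_write(0, 16384, k=4))  # aligned full-stripe write
    print(needs_read_modify_write(0, 4096, k=4))   # small write inside a stripe
```

Small random writes, typical of block workloads, almost always fall into the second case.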
Why is m=1 discouraged for production?
With m=1, the pool tolerates exactly one chunk loss. If a second failure hits another chunk of the same placement group before the first is recovered (a very real scenario during a host reboot or disk replacement), you lose data. m=2 (or higher) gives a buffer for overlapping failures, which is why most production guidance treats m=1 as appropriate only for non-critical or scratch data.
What does crush-failure-domain actually control here?
It tells CRUSH which level of your hierarchy must be distinct across the k+m chunks. host means no two chunks of the same object can land on the same host; rack raises that guarantee to the rack level (protecting against a whole rack going dark — top-of-rack switch, PDU, etc.) but requires more physical domains to satisfy the same profile. Use the CRUSH helper to check whether your physical topology actually supports the domain you pick.
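As a rough way to reason about this, you can count the distinct buckets at the chosen level and compare that count with k+m. The sketch below works on a hand-written description of the topology rather than a real CRUSH map; the data layout and function names are hypothetical.

```python
from typing import Dict, List


def count_failure_domains(osd_locations: List[Dict[str, str]], level: str) -> int:
    """Count distinct buckets (for example hosts or racks) at a hierarchy level."""
    return len({location[level] for location in osd_locations})


def supports_profile(osd_locations: List[Dict[str, str]], level: str,
                     k: int, m: int) -> bool:
    """True if there are at least k+m distinct domains at the given level."""
    return count_failure_domains(osd_locations, level) >= k + m


if __name__ == "__main__":
    # Hypothetical cluster: 6 hosts spread over 3 racks, one OSD listed per host.
    topology = [
        {"host": f"host{i}", "rack": f"rack{i % 3}"} for i in range(6)
    ]
    print("host level, 4+2:", supports_profile(topology, "host", 4, 2))
    print("rack level, 4+2:", supports_profile(topology, "rack", 4, 2))
```

In this example the same hardware satisfies 4+2 at the host level but not at the rack level, which is the cost of the stronger guarantee.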
What plugin and technique should I use?
jerasure with technique=reed_sol_van is the default, well-tested, portable choice for most clusters and is what this tool's generated CLI uses. isa can be faster on Intel hardware with ISA-L support; clay reduces recovery network traffic at the cost of more CPU, useful for clusters where recovery bandwidth is the bottleneck. Stick with jerasure/reed_sol_van unless you have a specific reason to deviate.
How do I size pg_num for an EC pool?
The same target of ~100 PGs per OSD applies, but PG-per-OSD math for EC pools should account for k+m chunks landing on k+m different OSDs per PG (vs. size copies for replication). Use the Ceph PG Calculator and select Erasure Code with this profile's k+m to get pg_num and pgp_num sized correctly for the pool you're about to create.
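A back-of-the-envelope version of that calculation divides the cluster-wide PG budget by k+m and rounds to a power of two. The rounding rule here (nearest power of two) and the `data_fraction` parameter are assumptions for illustration; the PG Calculator may round differently, so treat the result as a starting point.

```python
def suggest_pg_num(osd_count: int, k: int, m: int,
                   target_pgs_per_osd: int = 100,
                   data_fraction: float = 1.0) -> int:
    """Estimate pg_num for an EC pool.

    Each PG places one chunk on each of k+m OSDs, so the pool's PG count is
    the total PG budget divided by k+m, rounded to the nearest power of two.
    data_fraction is the share of cluster data expected in this pool.
    """
    raw = osd_count * target_pgs_per_osd * data_fraction / (k + m)
    if raw < 1:
        return 1
    lower = 1 << (int(raw).bit_length() - 1)  # largest power of two <= raw
    upper = lower * 2
    return lower if raw - lower < upper - raw else upper


if __name__ == "__main__":
    print(suggest_pg_num(osd_count=60, k=4, m=2))
```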