Dirichlet Label Skew is the canonical method for generating non-IID label distributions across federated clients (Hsu et al., 2019). It samples per-client label proportions from a Dirichlet distribution controlled by the concentration parameter $\alpha$.
For each client $k$, sample label proportions:

\[p_k \sim \text{Dirichlet}(\alpha \cdot \mathbf{1}_C)\]

where $C$ is the number of classes, $\mathbf{1}_C$ is the all-ones vector of length $C$, and $\alpha > 0$ is the concentration parameter.
Client $k$ receives samples of class $c$ in proportion $p_{k,c}$:

\[n_{k,c} = \lfloor p_{k,c} \cdot n_k \rfloor\]

where $n_k$ is the total number of samples held by client $k$. The implementation is located at src/unbitrium/partitioning/dirichlet.py.
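The two formulas above can be sketched with NumPy (an illustrative sketch, not the library implementation; all names are local to the example):

```python
import numpy as np

rng = np.random.default_rng(42)

C = 10       # number of classes
alpha = 0.5  # concentration parameter
n_k = 500    # samples held by client k

p_k = rng.dirichlet(alpha * np.ones(C))  # p_k ~ Dirichlet(alpha * 1_C)
n_kc = np.floor(p_k * n_k).astype(int)   # n_{k,c} = floor(p_{k,c} * n_k)

assert np.isclose(p_k.sum(), 1.0)  # proportions sum to unity
assert n_kc.sum() <= n_k           # flooring can drop a few samples
```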
Label proportions sum to unity:

\[\sum_{c=1}^C p_{k,c} = 1 \quad \text{for all } k\]

Verification: All sampled distributions are valid probability distributions.
As $\alpha \to \infty$, the proportions approach the uniform distribution:

\[\lim_{\alpha \to \infty} p_k = \frac{1}{C} \mathbf{1}_C\]

Verification: $\alpha = 1000$ produces near-uniform distributions.
As $\alpha \to 0$, the proportions approach a one-hot vector:

\[p_k \to e_c \text{ for some class } c \quad \text{as } \alpha \to 0\]

where $e_c$ is the $c$-th standard basis vector. Verification: $\alpha = 0.01$ produces highly concentrated distributions.
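Both limiting behaviors are easy to check empirically with NumPy. The following sketch (not the library's validation suite) averages over many draws:

```python
import numpy as np

rng = np.random.default_rng(0)
C, draws = 10, 1000

p_large = rng.dirichlet(1000.0 * np.ones(C), size=draws)  # alpha -> infinity
p_small = rng.dirichlet(0.01 * np.ones(C), size=draws)    # alpha -> 0

# Large alpha: every draw is close to the uniform vector (1/C, ..., 1/C).
assert np.abs(p_large - 1.0 / C).max() < 0.1
# Small alpha: draws are nearly one-hot, so the largest entry dominates.
assert p_small.max(axis=1).mean() > 0.9
```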
Given identical seeds, the partitioner produces identical partitions:

\[\text{seed}(s) \implies p_k^{(1)} = p_k^{(2)} \quad \text{for all } k\]

Verification: A fixed seed produces deterministic output.
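A minimal NumPy sketch of this property: two generators seeded identically produce bit-identical draws, so any partition derived from them matches too.

```python
import numpy as np

C, alpha, seed = 10, 0.5, 42

# Two independent generators with the same seed...
rng1 = np.random.default_rng(seed)
rng2 = np.random.default_rng(seed)

# ...yield the exact same Dirichlet proportions for every client.
p1 = rng1.dirichlet(alpha * np.ones(C), size=100)
p2 = rng2.dirichlet(alpha * np.ones(C), size=100)

assert np.array_equal(p1, p2)  # bit-identical, not just close
```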
All samples are assigned:

\[\sum_{k=1}^K \sum_{c=1}^C n_{k,c} = N\]

Verification: Total samples across clients equals dataset size.
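Note that the floor in the allocation formula alone can drop up to $C - 1$ samples per client, so exact conservation requires redistributing the remainders. A largest-remainder sketch (a hypothetical scheme for illustration, not necessarily what dirichlet.py does):

```python
import numpy as np

def allocate(p: np.ndarray, n: int) -> np.ndarray:
    """Largest-remainder allocation: floor first, then hand the leftover
    samples to the classes with the biggest fractional parts, so the
    counts sum exactly to n."""
    raw = p * n
    counts = np.floor(raw).astype(int)
    leftover = n - counts.sum()              # between 0 and C - 1
    order = np.argsort(raw - counts)[::-1]   # largest fractional part first
    counts[order[:leftover]] += 1
    return counts

rng = np.random.default_rng(7)
p = rng.dirichlet(0.5 * np.ones(10))
counts = allocate(p, 500)
assert counts.sum() == 500  # no samples lost or created
```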
Configuration:
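The configuration snippet for this scenario is missing; the expected ranges below and the summary diagram indicate $\alpha = 0.1$. A plausible reconstruction using the `DirichletLabelSkew` API from the usage example (the `num_clients` value is an assumption):

```python
from unbitrium.partitioning import DirichletLabelSkew

# Hypothetical reconstruction: alpha = 0.1 matches the expected ranges below.
partitioner = DirichletLabelSkew(
    alpha=0.1,        # high heterogeneity
    num_clients=100,  # assumed; not stated in this section
    seed=42,
)
```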
Expected Behavior:
| Metric | Expected Range |
|---|---|
| Avg classes per client | 2-4 |
| Max samples per client | 1000-2000 |
| Min samples per client | 100-300 |
| EMD from uniform | 0.7-0.9 |
Configuration:
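This configuration snippet is also missing; the expected ranges below and the summary diagram indicate $\alpha = 0.5$. A plausible reconstruction in the same style (the `num_clients` value is an assumption):

```python
from unbitrium.partitioning import DirichletLabelSkew

# Hypothetical reconstruction: alpha = 0.5 matches the expected ranges below.
partitioner = DirichletLabelSkew(
    alpha=0.5,        # moderate heterogeneity
    num_clients=100,  # assumed; not stated in this section
    seed=42,
)
```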
Expected Behavior:
| Metric | Expected Range |
|---|---|
| Avg classes per client | 5-8 |
| EMD from uniform | 0.3-0.5 |
| Label entropy | 1.5-2.0 |
Configuration:
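This configuration snippet is missing as well; the summary diagram's remaining entry ($\alpha = 1.0$, EMD 0.15-0.25, 7-10 classes per client) suggests it covers $\alpha = 1.0$. A hedged sketch (both parameter values are assumptions):

```python
from unbitrium.partitioning import DirichletLabelSkew

# Hypothetical reconstruction: alpha = 1.0 per the summary diagram.
partitioner = DirichletLabelSkew(
    alpha=1.0,        # mild heterogeneity
    num_clients=100,  # assumed; not stated in this section
    seed=42,
)
```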
Expected Behavior:
Configuration:
Expected Behavior:
| $\alpha$ | Heterogeneity | Typical Use Case |
|---|---|---|
| 0.01 | Extreme | Pathological testing |
| 0.1 | High | Standard non-IID benchmark |
| 0.5 | Moderate | Realistic heterogeneity |
| 1.0 | Mild | Light heterogeneity |
| 10.0 | Low | Near-IID |
| 100+ | Negligible | IID approximation |
```mermaid
graph LR
    A[alpha = 0.1] --> B[EMD: 0.7-0.9]
    A --> C[Classes/client: 2-4]
    D[alpha = 0.5] --> E[EMD: 0.3-0.5]
    D --> F[Classes/client: 5-8]
    G[alpha = 1.0] --> H[EMD: 0.15-0.25]
    G --> I[Classes/client: 7-10]
```
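The qualitative trend above can be reproduced with a short self-contained simulation (illustrative only: it uses floored allocation and counts classes with at least one sample, so the absolute numbers need not match the ranges in the tables):

```python
import numpy as np

def avg_classes_per_client(alpha: float, C: int = 10, K: int = 100,
                           n_k: int = 500, seed: int = 0) -> float:
    """Average number of classes with at least one sample per client
    under floored Dirichlet allocation."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(alpha * np.ones(C), size=K)
    counts = np.floor(p * n_k)
    return float((counts > 0).sum(axis=1).mean())

# Heterogeneity falls as alpha grows: clients hold more distinct classes.
low, mid, high = (avg_classes_per_client(a) for a in (0.1, 0.5, 10.0))
assert low < mid < high
```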
Input: $K = 1$

Expected Behavior: All $N$ samples are assigned to the single client, so the resulting partition is the full dataset regardless of $\alpha$ (this follows from the sample-conservation property above).
Input: $K > N/C$

Expected Behavior: With more clients than samples per class on average, flooring $p_{k,c} \cdot n_k$ frequently yields zero, so many clients receive no samples of a given class and some clients may end up very small.
Input: $C = 2$

Expected Behavior: The symmetric Dirichlet reduces to a Beta distribution: $p_{k,1} \sim \text{Beta}(\alpha, \alpha)$ and $p_{k,2} = 1 - p_{k,1}$.
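For $C = 2$, the symmetric Dirichlet and $\text{Beta}(\alpha, \alpha)$ coincide, sharing mean $1/2$ and variance $1/(4(2\alpha + 1))$. A quick numerical moment check (a sketch, not the library's test suite):

```python
import numpy as np

alpha, draws = 0.5, 20000
rng = np.random.default_rng(1)

# First coordinate of 2-class Dirichlet draws vs. direct Beta draws.
p_dirichlet = rng.dirichlet([alpha, alpha], size=draws)[:, 0]
p_beta = rng.beta(alpha, alpha, size=draws)

# Both samplers agree on mean (1/2) and variance (1/8 for alpha = 0.5).
assert abs(p_dirichlet.mean() - 0.5) < 0.02
assert abs(p_beta.mean() - 0.5) < 0.02
assert abs(p_dirichlet.var() - p_beta.var()) < 0.02
```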
Input: $\alpha < 0.001$

Expected Behavior: Sampled proportions are effectively one-hot, so each client holds essentially a single class; at such small concentrations, gamma-based samplers can also underflow to all-zero draws, which an implementation must guard against.
```python
from unbitrium.partitioning import DirichletLabelSkew

partitioner = DirichletLabelSkew(
    alpha=0.5,
    num_clients=100,
    seed=42,  # Reproducible
)

# Verify reproducibility
part1 = partitioner.partition(dataset)
part2 = partitioner.partition(dataset)
assert all(p1 == p2 for p1, p2 in zip(part1, part2))
```
```python
def set_partitioning_seed(seed: int) -> None:
    """Seed NumPy's global RNG so repeated partitioning runs match."""
    import numpy as np

    np.random.seed(seed)
```
Partitioning configuration reveals:
Breakdown:
Unbitrium produces identical partitions to LEAF when:
Output matches Flower's Dirichlet partitioner within numerical tolerance.
Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
Yurochkin, M., et al. (2019). Bayesian nonparametric federated learning of neural networks. In ICML.
Li, Q., Diao, Y., Chen, Q., & He, B. (2022). Federated learning on non-IID data silos: An experimental study. In ICDE.
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-01-04 | Initial validation report |
Copyright 2026 Olaf Yunus Laitinen Imanov and Contributors. Released under EUPL 1.2.