unbitrium

DirichletLabelSkew Validation Report

Overview

Dirichlet label skew is a standard method for generating non-IID label distributions across federated clients. Each client's label proportions are sampled from a symmetric Dirichlet distribution whose concentration parameter $\alpha$ controls the degree of heterogeneity: smaller $\alpha$ yields more skewed clients.

Mathematical Formulation

For each client $k$, sample label proportions:

\[p_k \sim \text{Dirichlet}(\alpha \cdot \mathbf{1}_C)\]

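where $C$ is the number of classes, $\mathbf{1}_C$ is the all-ones vector of length $C$, and $\alpha > 0$ is the concentration parameter.

Client $k$, which holds $n_k$ samples in total, receives samples of class $c$ in proportion $p_{k,c}$: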

\[n_{k,c} = \lfloor p_{k,c} \cdot n_k \rfloor\]
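The two formulas above can be sketched with NumPy. This is a minimal illustration, not the code in src/unbitrium/partitioning/dirichlet.py: the function name `sample_label_counts`, its parameters, and the choice of an equal $n_k$ for every client are assumptions made for the example.

```python
import numpy as np

def sample_label_counts(num_clients, num_classes, alpha, samples_per_client, seed=0):
    """Draw p_k ~ Dirichlet(alpha * 1_C) per client and floor to class counts."""
    rng = np.random.default_rng(seed)
    # One row per client: shape (K, C), each row lies on the probability simplex.
    proportions = rng.dirichlet(alpha * np.ones(num_classes), size=num_clients)
    # n_{k,c} = floor(p_{k,c} * n_k); flooring can leave a few samples unallocated.
    counts = np.floor(proportions * samples_per_client).astype(int)
    return proportions, counts

proportions, counts = sample_label_counts(
    num_clients=5, num_classes=10, alpha=0.5, samples_per_client=600
)
print(counts.sum(axis=1))  # each entry is at most 600
```

Because of the floor, a client's counts can sum to slightly less than $n_k$, so an implementation needs some rule for the leftover samples.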

Implementation Reference

The implementation is located at src/unbitrium/partitioning/dirichlet.py.


Invariants

Invariant 1: Probability Simplex

Label proportions sum to unity:

\[\sum_{c=1}^C p_{k,c} = 1 \text{ for all } k\]

Verification: All sampled distributions are valid probability distributions.
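A check for this invariant could look like the sketch below. The helper `on_simplex` and its tolerance are illustrative assumptions, not part of the library's test suite.

```python
import numpy as np

def on_simplex(proportions, atol=1e-8):
    """Return True if every row is non-negative and sums to 1 within atol."""
    proportions = np.asarray(proportions, dtype=float)
    nonnegative = bool(np.all(proportions >= 0.0))
    sums_to_one = bool(
        np.allclose(proportions.sum(axis=-1), 1.0, rtol=0.0, atol=atol)
    )
    return nonnegative and sums_to_one

rng = np.random.default_rng(0)
samples = rng.dirichlet(0.5 * np.ones(10), size=100)
print(on_simplex(samples))
```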

Invariant 2: Limit Behavior (High Alpha)

As $\alpha \to \infty$, each $p_k$ approaches the uniform distribution over classes:

\[\lim_{\alpha \to \infty} p_k = \frac{1}{C} \mathbf{1}_C\]

Verification: $\alpha = 1000$ produces near-uniform distributions.
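As a rough sketch of this verification, one can sample at $\alpha = 1000$ and measure the largest deviation from $1/C$. The values $C = 10$ and 100 clients, and the threshold mentioned below, are illustrative assumptions rather than the report's actual test settings.

```python
import numpy as np

num_classes = 10
rng = np.random.default_rng(0)
proportions = rng.dirichlet(1000.0 * np.ones(num_classes), size=100)

# Largest absolute gap between any sampled proportion and the uniform value 1/C.
max_deviation = float(np.abs(proportions - 1.0 / num_classes).max())
print(f"max deviation from uniform: {max_deviation:.4f}")
```

With these settings each proportion has a standard deviation of about 0.003, so a deviation bound such as 0.02 is a comfortable, if arbitrary, acceptance threshold.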

Invariant 3: Limit Behavior (Low Alpha)

As $\alpha \to 0$, each $p_k$ approaches a one-hot vector $e_c$ (the $c$-th standard basis vector):

\[p_k \to e_c \text{ for some } c \in \{1, \dots, C\} \text{ as } \alpha \to 0\]

Verification: $\alpha = 0.01$ produces highly concentrated distributions.
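One way to quantify "highly concentrated" is the share of the dominant class per client. The sketch below assumes $C = 10$ and 1000 sampled clients; the choice of statistic is an assumption for illustration and may differ from the report's own check.

```python
import numpy as np

rng = np.random.default_rng(0)
proportions = rng.dirichlet(0.01 * np.ones(10), size=1000)

# Share of the single most frequent class for each client.
dominant_mass = proportions.max(axis=1)
mean_dominant_mass = float(dominant_mass.mean())
print(f"mean dominant-class share: {mean_dominant_mass:.3f}")
```

A mean share close to 1 indicates that most clients are effectively single-class.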

Invariant 4: Reproducibility

Given an identical seed, repeated runs produce identical partitions:

\[\text{seed}(s) \implies p_k^{(1)} = p_k^{(2)}\]

Verification: Fixed seed produces deterministic output.
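The invariant can be illustrated at the level of the sampled proportions. The function `sample_proportions` is a hypothetical stand-in for the library's sampler, and the use of `numpy.random.default_rng` is an assumption about how seeding might be done.

```python
import numpy as np

def sample_proportions(seed, num_clients=100, num_classes=10, alpha=0.5):
    """Sample a (K, C) matrix of label proportions from a freshly seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha * np.ones(num_classes), size=num_clients)

run_a = sample_proportions(seed=42)
run_b = sample_proportions(seed=42)
run_c = sample_proportions(seed=43)

print(np.array_equal(run_a, run_b))  # same seed, bitwise-identical output
print(np.array_equal(run_a, run_c))  # different seed, different output
```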

Invariant 5: Sample Exhaustion

All samples are assigned:

\[\sum_{k=1}^K \sum_{c=1}^C n_{k,c} = N\]

Verification: Total samples across clients equals dataset size.
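Flooring $p_{k,c} \cdot n_k$ alone can leave up to $C - 1$ samples per client unassigned, so this invariant needs a rule for the remainder. The report does not state which rule the library uses; the sketch below shows one common option, the largest-remainder method, under the assumption that the proportions sum to 1. The name `allocate_counts` is illustrative.

```python
import numpy as np

def allocate_counts(proportions, total):
    """Integer counts that follow `proportions` and sum exactly to `total`."""
    proportions = np.asarray(proportions, dtype=float)
    exact = proportions * total
    counts = np.floor(exact).astype(int)
    shortfall = total - int(counts.sum())
    # Give one extra sample to the classes with the largest fractional parts.
    order = np.argsort(exact - counts)[::-1]
    counts[order[:shortfall]] += 1
    return counts

counts = allocate_counts([0.5, 0.3, 0.2], total=7)
print(counts, counts.sum())
```

If every client's counts sum to $n_k$ and the $n_k$ sum to $N$, the double sum over clients and classes equals $N$.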


Test Distributions

Distribution 1: High Heterogeneity

Configuration:

Expected Behavior:

| Metric | Expected Range |
| --- | --- |
| Avg classes per client | 2-4 |
| Max samples per client | 1000-2000 |
| Min samples per client | 100-300 |
| EMD from uniform | 0.7-0.9 |

Distribution 2: Moderate Heterogeneity

Configuration:

Expected Behavior:

| Metric | Expected Range |
| --- | --- |
| Avg classes per client | 5-8 |
| EMD from uniform | 0.3-0.5 |
| Label entropy | 1.5-2.0 |
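The metrics in these tables are not defined in the report, so the sketch below fixes one plausible reading of each, all of them assumptions: classes per client counts classes with at least one sample, EMD from uniform is computed as total variation distance (which equals the earth mover's distance when all distinct labels are at distance 1 from each other), and label entropy is Shannon entropy in nats. Other definitions would shift the expected ranges.

```python
import numpy as np

def classes_present(counts):
    """Number of classes with at least one sample, per client."""
    return (np.asarray(counts) > 0).sum(axis=1)

def distance_from_uniform(proportions):
    """Total variation distance to the uniform label distribution, per client."""
    proportions = np.asarray(proportions, dtype=float)
    uniform = 1.0 / proportions.shape[1]
    return 0.5 * np.abs(proportions - uniform).sum(axis=1)

def label_entropy(proportions):
    """Shannon entropy in nats, per client, treating 0 * log(0) as 0."""
    proportions = np.asarray(proportions, dtype=float)
    safe = np.where(proportions > 0, proportions, 1.0)
    return -(proportions * np.log(safe)).sum(axis=1)

one_hot = [[1.0, 0.0, 0.0, 0.0]]
uniform = [[0.25, 0.25, 0.25, 0.25]]
print(distance_from_uniform(one_hot), label_entropy(uniform))
```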

Distribution 3: Near-IID

Configuration:

Expected Behavior:

Distribution 4: Extreme Non-IID

Configuration:

Expected Behavior:


Expected Behavior

Alpha Interpretation Guide

| $\alpha$ | Heterogeneity | Typical Use Case |
| --- | --- | --- |
| 0.01 | Extreme | Pathological testing |
| 0.1 | High | Standard non-IID benchmark |
| 0.5 | Moderate | Realistic heterogeneity |
| 1.0 | Mild | Light heterogeneity |
| 10.0 | Low | Near-IID |
| 100+ | Negligible | IID approximation |
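To get a feel for these regimes, the sketch below sweeps the same $\alpha$ values and reports the effective number of classes per client, defined here as the exponential of the label entropy. That statistic, the function name, and the settings ($C = 10$, 2000 clients) are assumptions chosen for illustration; it is not one of the report's metrics.

```python
import numpy as np

def mean_effective_classes(alpha, num_clients=2000, num_classes=10, seed=0):
    """Average exp(entropy) of Dirichlet(alpha * 1_C) label proportions."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(alpha * np.ones(num_classes), size=num_clients)
    safe = np.where(p > 0, p, 1.0)  # avoids log(0); those terms contribute 0
    entropy = -(p * np.log(safe)).sum(axis=1)
    return float(np.exp(entropy).mean())

alphas = [0.01, 0.1, 0.5, 1.0, 10.0, 100.0]
effective = [mean_effective_classes(a) for a in alphas]
for a, e in zip(alphas, effective):
    print(f"alpha={a:<6} effective classes per client={e:.2f}")
```

The value rises from roughly one class at the extreme end toward $C$ as $\alpha$ grows, matching the ordering in the guide.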

Metric Ranges by Alpha

| $\alpha$ | EMD from uniform | Classes per client |
| --- | --- | --- |
| 0.1 | 0.7-0.9 | 2-4 |
| 0.5 | 0.3-0.5 | 5-8 |
| 1.0 | 0.15-0.25 | 7-10 |

Edge Cases

Edge Case 1: Single Client

Input: $K = 1$

Expected Behavior:

Edge Case 2: More Clients Than Samples Per Class

Input: $K > N/C$

Expected Behavior:

Edge Case 3: Binary Classification

Input: $C = 2$

Expected Behavior:

Edge Case 4: Very Small Alpha

Input: $\alpha < 0.001$

Expected Behavior:
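A practical concern in this regime is floating-point underflow: samplers that normalize independent gamma draws can see every draw underflow to zero, which yields an all-NaN or all-zero row. The sketch below shows one possible guard that replaces such rows with a one-hot vector on a uniformly chosen class, consistent with the $\alpha \to 0$ limit. The helper `repair_degenerate_rows` is hypothetical, and the report does not say whether the library applies any such guard.

```python
import numpy as np

def repair_degenerate_rows(proportions, rng):
    """Replace rows that are non-finite or sum to zero with a random one-hot row."""
    repaired = np.array(proportions, dtype=float, copy=True)
    bad = ~np.isfinite(repaired).all(axis=1) | (repaired.sum(axis=1) <= 0)
    num_classes = repaired.shape[1]
    for row in np.flatnonzero(bad):
        repaired[row] = 0.0
        repaired[row, rng.integers(num_classes)] = 1.0
    return repaired

rng = np.random.default_rng(0)
# Hand-built input: one valid row, one NaN row, one all-zero row.
raw = np.array([[0.2, 0.8, 0.0], [np.nan, np.nan, np.nan], [0.0, 0.0, 0.0]])
fixed = repair_degenerate_rows(raw, rng)
print(fixed)
```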


Reproducibility

Seed Configuration

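from unbitrium.partitioning import DirichletLabelSkew

partitioner = DirichletLabelSkew(
    alpha=0.5,
    num_clients=100,
    seed=42,  # fixed seed makes the partition reproducible
)

# Verify reproducibility: partitioning the same dataset twice must give
# identical client assignments. `dataset` is assumed to be loaded already.
part1 = partitioner.partition(dataset)
part2 = partitioner.partition(dataset)
assert all(p1 == p2 for p1, p2 in zip(part1, part2))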

Random State Management

import numpy as np

def set_partitioning_seed(seed: int) -> None:
    """Seed NumPy's global (legacy) random state."""
    np.random.seed(seed)
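Seeding the global state affects every other caller of `np.random` in the process, and any of them can in turn disturb the partitioner. An alternative, sketched below, is a dedicated `numpy.random.Generator` per partitioner. The helper `make_partitioning_rng` is hypothetical and is not claimed to be part of unbitrium.

```python
import numpy as np

def make_partitioning_rng(seed: int) -> np.random.Generator:
    """Return a dedicated generator so partitioning does not touch global state."""
    return np.random.default_rng(seed)

rng = make_partitioning_rng(42)
first = rng.dirichlet(0.5 * np.ones(10), size=3)

# Reseeding or consuming the global state elsewhere does not affect a fresh
# generator built from the same seed.
np.random.seed(123)
np.random.random(5)
second = make_partitioning_rng(42).dirichlet(0.5 * np.ones(10), size=3)

print(np.array_equal(first, second))
```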

Security Considerations

Data Leakage

Partitioning configuration reveals:

Mitigations

  1. Keep partitioning parameters confidential
  2. Use different seeds for production vs. benchmarking

Complexity Analysis

Time Complexity

\[T = O(N + K \cdot C)\]

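Breakdown: sampling $K$ proportion vectors of length $C$ costs $O(K \cdot C)$, and assigning the $N$ samples to clients costs $O(N)$.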

Space Complexity

\[S = O(K \cdot C + N)\]

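Breakdown: the $K \times C$ matrix of label proportions takes $O(K \cdot C)$, and the per-sample client assignment takes $O(N)$.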


Validation Against Reference Implementations

LEAF Benchmark

Unbitrium produces identical partitions to LEAF when:

Flower Framework

Partitions match those of Flower’s Dirichlet partitioner within tolerance.


References

  1. Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.

  2. Yurochkin, M., et al. (2019). Bayesian nonparametric federated learning of neural networks. In ICML.

  3. Li, Q., Diao, Y., Chen, Q., & He, B. (2022). Federated learning on non-IID data silos: An experimental study. In ICDE.


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.0.0 | 2026-01-04 | Initial validation report |

Copyright 2026 Olaf Yunus Laitinen Imanov and Contributors. Released under EUPL 1.2.