unbitrium

FeatureShiftClustering Validation Report

Overview

Feature Shift Clustering partitions data based on feature-space clustering, assigning samples to clients according to their cluster membership. This simulates covariate shift where clients have distinct feature distributions.

Mathematical Formulation

Given feature representations $\phi(x)$ for samples $x$, perform clustering:

\[c(x) = \arg\min_{j \in \{1, \ldots, K\}} \|\phi(x) - \mu_j\|^2\]

where $\mu_j$ is the centroid of cluster $j$ (corresponding to client $j$).

Assignment rule:

\[\mathcal{D}_k = \{(x, y) : c(x) = k\}\]

Implementation Reference

The implementation is located at src/unbitrium/partitioning/feature_shift.py.


Invariants

Invariant 1: Complete Assignment

Every sample assigned to exactly one client:

\[\bigcup_{k=1}^K \mathcal{D}_k = \mathcal{D}, \quad \mathcal{D}_i \cap \mathcal{D}_j = \emptyset \text{ for } i \neq j\]

Verification: No samples lost or duplicated.

Invariant 2: Cluster Consistency

Samples in same cluster share feature similarity:

\[x_1, x_2 \in \mathcal{D}_k \implies \|\phi(x_1) - \phi(x_2)\| \leq R_k\]

where $R_k$ is cluster $k$’s radius.

Verification: Intra-cluster distances bounded.

Invariant 3: Between-Cluster Separation

Clusters are reasonably separated:

\[\min_{k \neq j} \|\mu_k - \mu_j\| > 0\]

Verification: Distinct centroids for each client.

Invariant 4: Reproducibility

Fixed seed produces identical clusters.


Test Distributions

Distribution 1: MNIST Digit Styles

Configuration:

Expected Behavior:

Distribution 2: CIFAR-10 Visual Similarity

Configuration:

Expected Behavior:

Distribution 3: Domain Adaptation Setup

Configuration:

Expected Behavior:

Distribution 4: Overlapping Clusters

Configuration:

Expected Behavior:


Expected Behavior

Clustering Algorithm Options

Algorithm Characteristics Best For
K-Means Hard, spherical clusters Balanced clients
K-Means++ Better initialization General use
GMM Soft, elliptical Overlapping
Hierarchical Dendrogram structure Unknown $K$
DBSCAN Density-based Variable sizes

Metric Ranges

Metric Range Notes
silhouette_score $[-1, 1]$ Cluster quality
inertia $[0, \infty)$ Within-cluster variance
cluster_sizes Varies Samples per client
feature_dimension $(0, \infty)$ Embedding dimension

Edge Cases

Edge Case 1: More Clients Than Natural Clusters

Input: Data has 3 natural clusters, $K = 10$ requested

Expected Behavior:

Edge Case 2: Single Natural Cluster

Input: Homogeneous data, $K = 10$

Expected Behavior:

Edge Case 3: High-Dimensional Features

Input: $d = 10000$ feature dimensions

Expected Behavior:

Edge Case 4: Class Imbalance

Input: Dataset with 90% class A, 10% class B

Expected Behavior:


Reproducibility

Seed Configuration

from unbitrium.partitioning import FeatureShiftClustering

partitioner = FeatureShiftClustering(
    num_clients=10,
    clustering_algorithm="kmeans++",
    feature_extractor="resnet18",
    seed=42,
)

Feature Extraction

# Precompute features for reproducibility
features = partitioner.extract_features(dataset)
partitions = partitioner.cluster_and_assign(features)

Security Considerations

Feature Leakage

Clustering reveals:

Mitigations

  1. Use privacy-preserving clustering
  2. Add noise to feature representations
  3. Shuffle cluster-client mapping

Complexity Analysis

Time Complexity

Feature extraction: \(T_{extract} = O(N \cdot T_{model})\)

Clustering: \(T_{cluster} = O(N \cdot K \cdot d \cdot I)\)

where $I$ is number of iterations.

Space Complexity

\[S = O(N \cdot d + K \cdot d)\]

References

  1. Koh, P. W., et al. (2021). WILDS: A benchmark of in-the-wild distribution shifts. In ICML.

  2. Gulrajani, I., & Lopez-Paz, D. (2021). In search of lost domain generalization. In ICLR.

  3. Luo, M., et al. (2021). No fear of heterogeneity: Classifier calibration for federated learning with non-IID data. In NeurIPS.


Changelog

Version Date Changes
1.0.0 2026-01-04 Initial validation report

Copyright 2026 Olaf Yunus Laitinen Imanov and Contributors. Released under EUPL 1.2.