Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition

Parsa Rahimi Noshanagh, Sebastien Marcel
EPFL, Idiap
Switzerland
parsa.rahiminoshanagh@epfl.ch, marcel@idiap.ch

Abstract

Synthetic data generation is increasingly used in machine learning for training and data augmentation. Yet, current strategies often rely on external foundation models or datasets, whose usage is restricted in many scenarios due to policy or legal constraints. We propose ScoreMix, a self-contained synthetic generation method to produce hard synthetic samples for recognition tasks by leveraging the score compositionality of diffusion models. The approach mixes class-conditioned scores along reverse diffusion trajectories, yielding domain-specific data augmentation without external resources. We systematically study class-selection strategies and find that mixing classes distant in the discriminator’s embedding space yields larger gains, providing up to 3% additional average improvement, compared to selection based on proximity. Interestingly, we observe that condition and embedding spaces are largely uncorrelated under standard alignment metrics, and the generator’s condition space has a negligible effect on downstream performance. Across 8 public face recognition benchmarks, ScoreMix improves accuracy by up to 7 percentage points, without hyperparameter search, highlighting both robustness and practicality. Our method provides a simple yet effective way to maximize discriminator performance using only the available dataset, without reliance on third-party resources. Paper website: https://parsa-ra.github.io/scoremix/.

1 Introduction

Figure 1: ScoreMix. Adding carefully generated synthetic augmentations to the original training set boosts the discriminator’s performance, without relying on other sources of information (right). The first two subplots on the left show diffusion trajectories obtained under two different conditioning signals (Cond A/B). Using convex combinations of their score functions (ScoreMix A,B), we generate synthetic samples that interpolate between the two trajectories.

Synthetic dataset generation has emerged as a powerful tool for training models across a wide range of domains. A central application of this paradigm is data augmentation, which is indispensable for training strong discriminators, particularly when labeled data is limited. However, most existing strategies depend on external resources such as large foundation models or auxiliary datasets that are often impractical due to license restrictions, privacy concerns, or mismatched domains. This raises a central question: can we design a self-contained augmentation method that leverages only the available dataset to generate synthetic data and boost discriminative performance?

This paper introduces ScoreMix, an augmentation strategy that exploits the score composition phenomenon in diffusion models (Liu et al., 2022; Bradley et al., 2025).

Rather than relying on external generators such as Stable Diffusion (Esser et al., 2024) or FLUX (Labs, 2024), or even strong pre-trained backbones like SigLIP (Tschannen et al., 2025), ScoreMix produces synthetic data with convex combinations of class-conditioned scores during reverse diffusion. Crucially, both the generator and the initial discriminator are trained from scratch on the same dataset, ensuring a fully self-contained setup. This procedure yields hard on-manifold samples that enrich the training set with no information leakage.

We summarize the goal and contributions of this paper as follows.

• Goal: to develop a self-contained augmentation strategy—that is, one that does not rely on external datasets, commercial APIs, or third-party models—to maximize the performance of state-of-the-art discriminators solely with the available data.

Synthetic data generation is widely explored as an alternative to large-scale data collection. Early augmentation strategies are based on GANs (Frid-Adar et al., 2018) but do not scale well with the number of classes. Recent approaches use diffusion models, e.g. fine-tuning on ImageNet (Azizi et al., 2023), instance-level redraws (Kupyn & Rupprecht, 2024), and 3DMM-based rendering (Wood et al., 2021; Blanz & Vetter, 1999). These are effective but depend on external pretrained models or datasets. Face recognition (FR) is an important application of synthetic augmentation, with methods such as SynFace (Qiu et al., 2021), StyleGAN-based latent modeling (Rahimi et al., 2023), dual-condition diffusion (DCFace (Kim et al., 2023)), StyleGAN2-ADA for bias mitigation (Sevastopolskiy et al., 2023), attribute-conditioned diffusion (ID3 (Xu et al., 2024)), 3D rendering pipelines like DigiFace1M and RealDigiFace (Bae et al., 2023; Rahimi et al., 2024), and CLIP-guided sampling (VariFace (Yeung et al., 2024)). FR is attractive because collecting diverse face datasets is difficult. Benchmarks such as LFW (Huang et al., 2008), IJB-B/C (Whitelam et al., 2017), and AgeDB (Moschoglou et al., 2017) provide more reliable testing protocols than noisier ImageNet settings. Recently, Rahimi et al. (2025) introduced a self-contained strategy to produce challenging samples (AugGen). They train a diffusion generator on a target FR dataset and mix labels in the generator’s condition space. This relies on heuristics and a costly parameter search. Our work builds on AugGen while addressing its limitations, by leveraging score composition in diffusion models and aligning class selection with the geometry of the discriminator’s embedding space.

3 Proposed Method for Generating Augmentations

We first formally define the notion of a discriminator and a generator trained using the same dataset.

Discriminator.

Assume a dataset $\mathbf{D}_{\mathrm{orig}}=\{(\mathbf{X}_{i},y_{i})\}_{i=0}^{k-1}$, where each $\mathbf{X}_{i}\in\mathbb{R}^{H\times W\times 3}$ and $y_{i}\in\{0,\dots,l-1\}$ ($l<k$). The goal is to learn a discriminative model $f_{\theta_{\mathrm{dis}}}:\mathbf{X}\rightarrow\mathbf{y}$ that estimates $p(\mathbf{y}|\mathbf{X})$ (e.g., on ImageNet (Russakovsky et al., 2015) or CASIA-WebFace (Yi et al., 2014)). Typically, similar images have closer features under a distance $\mathrm{dist_{emb}}$ (e.g., cosine distance). We train $f_{\theta_{\mathrm{dis}}}$ via empirical risk minimization:

$$\theta_{\mathrm{dis}}^{*}=\underset{\theta_{\mathrm{dis}}\in\Theta_{\mathrm{dis}}}{\arg\min}\,\mathbb{E}_{(\mathbf{X},y)\sim\mathbf{D}_{\mathrm{orig}}}\bigl[\mathcal{L}_{\mathrm{dis}}(f_{\theta_{\mathrm{dis}}}(\mathbf{X}),\mathbf{y})\bigr], \tag{1}$$

where $\mathcal{L}_{\mathrm{dis}}$ is typically cross-entropy, and $\mathrm{h}_{\mathrm{dis}}$ collects all the training hyperparameters (e.g., learning rates).

Generative model.

Generative models seek to learn the data distribution, enabling the generation of new samples. We use diffusion models (Song et al., 2020; Anderson, 1982), which progressively add noise to data and train a denoiser $\mathrm{S}$. Following (Karras et al., 2022; 2024b), $\mathrm{S}$ is learned in two stages. First, for a given noise level $\sigma$, we add noise $\mathbf{N}$ to $E_{\mathrm{pre}}(\mathbf{X})$ (or to $\mathbf{X}$ directly in pixel-based diffusion) and remove it via:

$$\mathcal{L}(\mathrm{S}_{\theta_{den}};\sigma)=\mathbb{E}_{(\mathbf{X},y)\sim\mathbf{D}_{\mathrm{orig}},\,\mathbf{N}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})}\left[\|\mathrm{S}_{\theta_{den}}(E_{\mathrm{pre}}(\mathbf{X})+\mathbf{N};\mathrm{c}(y),\sigma)-\mathbf{X}\|^{2}_{2}\right], \tag{2}$$

where $\mathrm{c}(y)$ denotes the class condition, and $E_{\mathrm{pre}}(\cdot)$ and $D_{\mathrm{pre}}(\cdot)$ are pre-processing and post-processing functions acting as encoder and decoder (e.g., magnitude normalization or VAE-based compression). In the second stage, we sample different noise levels and minimize:

$$\theta_{den}^{*}=\underset{\theta_{den}\in\Theta_{den}}{\arg\min}\,\mathbb{E}_{\sigma\sim\mathcal{N}(\mu,\sigma^{2})}\bigl[\lambda_{\sigma}\,\mathcal{L}(\mathrm{S}_{\theta_{den}};\sigma)\bigr], \tag{3}$$

where $\lambda_{\sigma}$ weights each noise scale. Here the condition embedding $\mathbf{c}$ is learned together with the time embedding. For simplicity, we omit the den and dis subscripts used to distinguish the parameters of the denoiser and the discriminator, respectively. Instead, we use $\theta$ to denote parameters in general, with the specific meaning inferred from context.
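The two-stage objective of Eqs. (2) and (3) can be sketched numerically as follows. This is a minimal illustration, not the authors' training code: `denoiser` is a hypothetical callable, $E_{\mathrm{pre}}$ is taken to be the identity (pixel space), and the log-normal noise-level sampling with the weighting `(sigma^2 + sigma_data^2) / (sigma * sigma_data)^2` is an EDM-style assumption that the text above does not spell out.

```python
import numpy as np

def denoising_loss(denoiser, x, cond, sigma, rng):
    """Monte Carlo estimate of Eq. (2) at one noise level (E_pre = identity)."""
    noise = rng.normal(0.0, sigma, size=x.shape)  # noise with standard deviation sigma
    pred = denoiser(x + noise, cond, sigma)       # denoiser sees only the noisy input
    return float(np.mean(np.sum((pred - x) ** 2, axis=-1)))

def multi_level_loss(denoiser, x, cond, rng, n_levels=16,
                     p_mean=-1.2, p_std=1.2, sigma_data=0.5):
    """Weighted average of Eq. (2) over sampled noise levels, in the spirit of Eq. (3).

    Assumptions (not from the text): log-normal sigma sampling and the
    EDM-style weight lambda(sigma) below.
    """
    sigmas = np.exp(p_mean + p_std * rng.standard_normal(n_levels))
    weights = (sigmas ** 2 + sigma_data ** 2) / (sigmas * sigma_data) ** 2
    losses = np.array([denoising_loss(denoiser, x, cond, s, rng) for s in sigmas])
    return float(np.mean(weights * losses))
```

In an actual training loop the same quantity would be computed on mini-batches and differentiated with respect to the denoiser parameters; the NumPy version only shows how the noise level, the condition, and the weighting enter the loss.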

Conditional score estimation in diffusion models.

The predicted noise described in the previous section is proportional to the score function $\nabla_{\mathbf{X}_{t}}\log p_{t}(\mathbf{X}_{t}|c)$ (Song et al., 2020; Karras et al., 2024b). Given two distinct conditions, $c_{A}$ and $c_{B}$, we can obtain their respective conditional score predictions:

$$\mathbf{S}_{A}(\mathbf{X}_{t},t)=\mathrm{S}_{\theta}(\mathbf{X}_{t},t,c_{A}),\qquad\mathbf{S}_{B}(\mathbf{X}_{t},t)=\mathrm{S}_{\theta}(\mathbf{X}_{t},t,c_{B}) \tag{4}$$

Our work aims to generate novel synthetic data augmentations by composing information from two or more distinct conditional distributions learned by a diffusion model. We achieve this by linearly combining their respective score estimates during the reverse diffusion process.

3.1 Synthetic Augmentation via Convex Score Mixing

To generate synthetic samples that interpolate or combine aspects of both $c_{A}$ and $c_{B}$, we propose a mixed score $\mathbf{S}_{\text{mix}}$:

$$\mathbf{S}_{\text{mix}}(\mathbf{X}_{t},t)=\alpha\cdot\mathbf{S}_{A}(\mathbf{X}_{t},t)+\beta\cdot\mathbf{S}_{B}(\mathbf{X}_{t},t) \tag{5}$$

This mixed score 𝐒mix\mathbf{S}_{\text{mix}} is then used to guide the denoising step in a standard reverse diffusion sampler (e.g., DDIM (Song et al., 2020) or a second-order solver as in (Karras et al., 2024b)). Prior works have explored linear combinations of scores for compositional generation, often aiming to satisfy product-of-experts-like objectives or achieve disentangled concept manipulation (Liu et al., 2022; Bradley et al., 2025). These works typically focus on composing disparate concepts (e.g., “object” + “style”) or attributes.

Figure 2: Effect of mixing scores in ScoreMix. Each cell shows the image produced for one pair of inputs while sweeping $\alpha$ (horizontal, left $\rightarrow$ right) and $\beta$ (vertical, top $\rightarrow$ bottom). Randomness is fixed across images.

Figure 3: Qualitative comparison of ScoreMix augmentation. Rows show Orig ID1, Repro ID1, ScoreMix (Eq. 5, AutoGuidance=1.3), Repro ID2, and Orig ID2. The center column provides augmented samples whose subtle deviations from original ones improve discriminator performance.

In our work, we adapt this principle specifically for generating nuanced synthetic augmentations by mixing related conditional distributions. We hypothesize that for this application, maintaining the overall magnitude and directional integrity of the score is paramount for generating plausible, on-manifold samples. To the best of our knowledge, this is the first work to systematically investigate and leverage this form of multi-conditional score mixing specifically for generating synthetic data augmentations that lie "between" two defined conditional states. Such samples act as hard examples for the discriminator, further boosting its discriminative power and increasing the chance of capturing information missed during its initial training.

We empirically find that the most plausible and high-fidelity synthetic augmentations are generated when the mixing coefficients $\alpha$ and $\beta$ form a convex combination ($\alpha+\beta=1$). The theoretical rationale for this observation is rooted in several properties of score-based models.

The architectural advancements in models like EDM2 (Karras et al., 2024b), which focus on preserving activation and weight magnitudes, further bolster the argument for convex combinations. If individual conditional scores are already well-calibrated by the model architecture, their convex mix is one of the plausible ways to fuse their guidance without introducing extraneous magnitude distortions.
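One way to make the magnitude argument concrete is a standard triangle-inequality bound. The derivation below is an illustrative addition rather than a result stated in the paper; it assumes nonnegative weights and uses only the definition of $\mathbf{S}_{\text{mix}}$ in Eq. (5).

```latex
% Assumption: \alpha,\beta \ge 0 and \alpha+\beta=1 (convex weights).
\begin{align*}
\|\mathbf{S}_{\text{mix}}\|_2
  = \|\alpha\,\mathbf{S}_A+\beta\,\mathbf{S}_B\|_2
  \le \alpha\,\|\mathbf{S}_A\|_2+\beta\,\|\mathbf{S}_B\|_2
  \le \max\bigl(\|\mathbf{S}_A\|_2,\;\|\mathbf{S}_B\|_2\bigr).
\end{align*}
% Non-convex corner for comparison: \alpha=\beta=1 with \mathbf{S}_A=\mathbf{S}_B gives
\[
\|\mathbf{S}_A+\mathbf{S}_B\|_2 = 2\,\|\mathbf{S}_A\|_2 .
\]
```

Under convex weights the mixed score therefore never exceeds the larger of the two individual magnitudes, whereas weights that sum to more than one can inflate it, which is consistent with the off-diagonal behaviour discussed for Figure 2.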

In Figure 2, the effect of different values of $\alpha$ and $\beta$ is depicted. Numeric tick labels give the exact values in steps of $0.2$. Here, the class-conditional generator is trained using face images in which each class is a unique identity. Arrows beneath and at the side of the grid highlight the directions of increasing influence from each source. The extreme corners correspond to the unmixed original scores, $(\alpha,\beta)=(0,0)$ at the top-left, and the equally mixed $(1,1)$ at the bottom-right, while the descending diagonal where $\alpha+\beta=1$ illustrates the complementary trade-off between the two sources. Off-diagonal cells reveal the visual behaviour when the weights do not sum to $1$, which empirically reflects our previous discussion. See Appendix L for more samples.

3.2 Sampling Procedure

For generating samples, we employ the deterministic second-order sampler detailed in (Karras et al., 2024b; 2022). At each step $t$, the mixed score $\mathbf{S}_{\text{mix}}(\mathbf{X}_{t},t)$ from Equation 5 is used in place of the single conditional score to compute the update $\Delta\mathbf{X}_{t}$. The mixing parameter $\lambda$ (where $\alpha=1-\lambda$, $\beta=\lambda$) could be varied to generate a spectrum of synthetic augmentations. For simplicity and intuition, we set $\lambda=0.5$. Given the conditions $c_{A}$ and $c_{B}$, the detailed algorithmic procedure for mixing the conditions to generate a plausible mixed image is presented in Algorithm 1. We also apply autoguidance (Karras et al., 2024a) for sampling, using a model trained with fewer iterations as the guiding model. Some examples of the ScoreMix samples are depicted in the middle column of Figure 3. See Appendix L for more samples.

Algorithm 1 Sampling with Convex Conditional Score Mixing

Require: Denoising network $\mathrm{S}_{\theta}(\mathbf{X}_{t},t,\mathbf{c})$; conditions $c_{A}$, $c_{B}$; weights $\alpha=0.5$, $\beta=0.5$; solver steps $T$

Ensure: Final generated sample $\mathbf{X}_{0}$ (output image)

Initialize $\mathbf{X}_{T}\sim\mathcal{N}(\mathbf{0},\sigma_{T}^{2}\mathbf{I})$ ▷ Sample initial noise

for $t=T$ down to $1$ do

$\mathbf{S}_{A}\leftarrow\mathrm{S}_{\theta}(\mathbf{X}_{t},t,c_{A})$ ▷ Predict noise for A

$\mathbf{S}_{B}\leftarrow\mathrm{S}_{\theta}(\mathbf{X}_{t},t,c_{B})$ ▷ Predict noise for B

$\mathbf{S}_{\text{mix}}\leftarrow\alpha\cdot\mathbf{S}_{A}+\beta\cdot\mathbf{S}_{B}$ ▷ Convex combination

$\mathbf{X}_{t-1}\leftarrow\text{SamplerStep}(\mathbf{X}_{t},t,\mathbf{S}_{\text{mix}})$ ▷ Update with mixed score

end for

return $\mathbf{X}_{0}$
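Algorithm 1 can be exercised end to end on a toy problem. The sketch below makes several simplifying assumptions that are not part of the paper: each "class" is an isotropic Gaussian whose ideal denoiser is known in closed form (`toy_denoiser` stands in for the trained network), the solver is a first-order Euler step of the probability-flow ODE rather than the second-order sampler, and autoguidance is omitted. All function names and default values are hypothetical.

```python
import numpy as np

def toy_denoiser(x, sigma, mean, data_std=0.1):
    """Closed-form denoiser for data ~ N(mean, data_std^2 I); stand-in for S_theta."""
    gain = data_std ** 2 / (data_std ** 2 + sigma ** 2)
    return mean + gain * (x - mean)

def score_from_denoiser(x, sigma, denoised):
    """Score of the noised density at level sigma: (D(x) - x) / sigma^2."""
    return (denoised - x) / sigma ** 2

def scoremix_sample(mean_a, mean_b, lam=0.5, steps=50,
                    sigma_max=80.0, sigma_min=0.002, seed=0):
    """Euler sampler using the convex mixed score of Eq. (5)."""
    rng = np.random.default_rng(seed)
    alpha, beta = 1.0 - lam, lam                       # convex weights
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, steps), 0.0)
    x = rng.normal(0.0, sigma_max, size=np.shape(mean_a))  # initial noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        s_a = score_from_denoiser(x, sigma, toy_denoiser(x, sigma, mean_a))
        s_b = score_from_denoiser(x, sigma, toy_denoiser(x, sigma, mean_b))
        s_mix = alpha * s_a + beta * s_b               # convex combination
        # Probability-flow ODE with sigma(t) = t: dx/dsigma = -sigma * score.
        x = x + (sigma_next - sigma) * (-sigma * s_mix)
    return x

sample = scoremix_sample(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In this toy setting the mixed score equals the score of a Gaussian centred at $\alpha\mu_A+\beta\mu_B$, so the sample lands near the midpoint of the two class means; real face generators are not Gaussian, and the behaviour there is the empirical question studied in the paper.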

Algorithm 2 DistanceCorrelation

Require: $E\in\mathbb{R}^{l\times d_{E}}$ with $\operatorname{dist}_{\mathrm{emb}}$; $C\in\mathbb{R}^{l\times d_{C}}$ with $\operatorname{dist}_{\mathrm{cond}}$

Ensure: $\mathbf{e}$, $\mathbf{c}$

$\mathbf{e}\leftarrow[\,]$, $\mathbf{c}\leftarrow[\,]$ ▷ Init lists

for $i\leftarrow 1$ to $l-1$ do

for $j\leftarrow i+1$ to $l$ do

$u\leftarrow\operatorname{dist}_{\mathrm{emb}}(E_{i,:},E_{j,:})$

$v\leftarrow\operatorname{dist}_{\mathrm{cond}}(C_{i,:},C_{j,:})$

append $u$ to $\mathbf{e}$ and append $v$ to $\mathbf{c}$

end for

end for
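A direct NumPy transcription of Algorithm 2 is given below as a sketch. The loops follow the pseudocode with one row per class; the final Pearson correlation between the two distance lists is an assumption suggested by the algorithm's name, since the pseudocode itself only returns the lists. Function names are illustrative.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance 1 - cos(u, v)."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def l2_dist(u, v):
    """Euclidean (L2) distance."""
    return float(np.linalg.norm(u - v))

def pairwise_distance_lists(E, C, dist_emb=cosine_dist, dist_cond=cosine_dist):
    """Collect distances over all unordered class pairs (i < j), as in Algorithm 2."""
    num_classes = E.shape[0]
    e, c = [], []
    for i in range(num_classes - 1):
        for j in range(i + 1, num_classes):
            e.append(dist_emb(E[i], E[j]))   # embedding-space distance
            c.append(dist_cond(C[i], C[j]))  # condition-space distance
    return np.array(e), np.array(c)

def pearson(e, c):
    """Assumed summary statistic: Pearson correlation of the two lists."""
    return float(np.corrcoef(e, c)[0, 1])
```

For 10,000 classes the double loop visits about 50 million pairs, so a vectorized or batched variant would be preferable in practice; the loop form is kept here to mirror the pseudocode.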

4 Experiments

We show that ScoreMix improves face recognition (FR) under limited data, a critical setting given the difficulty of collecting large facial datasets. As FR requires distinguishing between millions of identities in a structured input space, it remains one of the most challenging discriminative tasks, which utilizes SOTA discriminative models that use margin losses (Deng et al., 2019).

4.1 Experimental Setup

Training data.

We use WebFace160K (Rahimi et al., 2025), a subset of WebFace4M (Zhu et al., 2021), selected for its balanced distribution of 10,000 identities with 11–24 samples each (approximately 160K images), matching the scale of commonly used datasets like CASIA-WebFace (Yi et al., 2014). We chose this dataset over CASIA-WebFace due to performance inconsistencies previously reported in (Rahimi et al., 2025). See Appendix I for details.

Discriminative model.

We adopt a standardized baseline. This baseline employs a face recognition (FR) system consisting of an IR50 backbone, modified according to the ArcFace implementation (Deng et al., 2019), paired with the ArcFace head (Deng et al., 2019) to incorporate a margin loss. Additionally, standard augmentations for face recognition tasks are applied to all models. These augmentations include (1) photometric transformations, (2) cropping, and (3) low-resolution adjustments to simulate common variations encountered in real-world scenarios. See Appendix J for details.

Generative model.

To train our generative model, we use a variant of the diffusion formulation (Karras et al., 2022; 2024b). For WebFace160K (Rahimi et al., 2025), the subset of WebFace4M (Zhu et al., 2021), we use the pixel-space variant of the diffusion model. Furthermore, the conditions are learned end-to-end using the diffusion objective with no explicit regularization.

4.2 Experiments on Face Recognition Benchmarks

FR benchmarks.

We evaluate our synthetic augmentation on two groups of public FR benchmarks. The first group (Avg-H in Table 1) contains high-quality datasets with variation in pose, lighting, and age: LFW (Huang et al., 2008), CFP-FP (Sengupta et al., 2016), CPLFW (Zheng & Deng, 2018), CALFW (Zheng et al., 2017), and AgeDB (Moschoglou et al., 2017). The second group captures more realistic and challenging conditions: IJB-B/C (Maze et al., 2018; Whitelam et al., 2017) and TinyFace (Cheng et al., 2019). Evaluation is based on verification accuracy, with thresholds chosen by cross-validation for the high-quality datasets, and on the true accept rate (TAR) at fixed false positive rates (FPRs of $10^{-6}$ and $10^{-5}$) for IJB-B/C, reflecting deployment scenarios.
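For readers unfamiliar with the TAR-at-fixed-FPR metric, the following sketch shows one common way to compute it from similarity scores. It is a generic illustration and not the official IJB-B/C evaluation protocol; the function name and the strict-inequality thresholding rule are assumptions.

```python
import numpy as np

def tar_at_fpr(genuine, impostor, fpr):
    """True accept rate at the threshold whose false positive rate is at most `fpr`.

    genuine:  similarity scores of same-identity pairs
    impostor: similarity scores of different-identity pairs
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.sort(np.asarray(impostor, dtype=float))[::-1]  # descending
    allowed = int(np.floor(fpr * len(impostor)))   # impostors allowed above threshold
    threshold = impostor[allowed] if allowed < len(impostor) else -np.inf
    return float(np.mean(genuine > threshold))     # accept only scores above threshold

tar = tar_at_fpr([0.35, 0.5, 0.9, 0.05], [0.1, 0.2, 0.3, 0.4], fpr=0.25)
```

Note that an FPR of $10^{-6}$ is only meaningful when more than a million impostor pairs are available; with fewer pairs the threshold collapses to the single highest impostor score.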

Table 1 also reports whether auxiliary models/datasets are used or not (Aux; the ideal case being N), and the training set sizes in terms of synthetic ($n^{s}$) and real ($n^{r}$) images. Following (Rahimi et al., 2025), as mentioned earlier, we adopt WebFace160K due to inconsistencies in CASIA-WebFace; results using different base datasets are separated by a double line. While ScoreMix roughly doubles the computational cost of AugGen, it consistently outperforms both AugGen and training on the original dataset across IR50 and even surpasses the stronger IR101 backbone trained on the original dataset, indicating that augmentation can yield greater gains than architectural scaling.

Takeaway: ScoreMix with $\lambda=0.5$ consistently improves discriminator performance when trained with a single dataset for synthetic data generation, surpassing the original discriminator and outperforming larger models.

Table 1: Comparison of $\mathrm{FR_{syn}}$ training (upper part), $\mathrm{FR_{real}}$ training (middle), and $\mathrm{FR_{mix}}$ training (bottom) using CASIA-WebFace/WebFace160K, evaluated in terms of accuracy on standard FR benchmarks. Avg-H is the average accuracy over all high-quality benchmarks. $n^{s}$ and $n^{r}$ are the numbers of synthetic and real images, respectively, and Aux indicates whether the dataset generation method uses an auxiliary network or dataset (Y) or not (N). † denotes models trained with IR101; all other models use IR50. Columns labeled B/C-1e-6 and B/C-1e-5 report TAR for IJB-B/C at FPRs of 1e-6 and 1e-5. TR1 is the rank-1 accuracy on the TinyFace benchmark.

| Method/Data | Aux | $n^{s}$ | $n^{r}$ | B-1e-6 | B-1e-5 | C-1e-6 | C-1e-5 | TR1 | Avg-H |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DigiFace1M | N/A | 1.2M | 0 | 15.31 | 29.59 | 26.06 | 36.34 | 32.30 | 78.97 |
| RealDigiFace | Y | 1.2M | 0 | 21.37 | 39.14 | 36.18 | 45.55 | 42.64 | 81.34 |
| DCFace | Y | 1.2M | 0 | 22.48 | 47.84 | 35.27 | 58.22 | 45.94 | 91.56 |
| AugGen | N | 0.6M | 0 | 29.40 | 54.54 | 45.15 | 61.52 | 52.33 | 88.78 |
| AugGen Repro | N | 0.6M | 0 | 15.71 | 45.97 | 31.54 | 58.61 | 53.61 | 90.64 |
| CASIA-WebFace | N/A | 0 | 0.5M | 1.02 | 5.06 | 0.73 | 5.37 | 58.12 | 94.21 |
| CASIA-WebFace † | N/A | 0 | 0.5M | 0.74 | 3.94 | 0.38 | 3.92 | 59.64 | 94.84 |
| WebFace160K | N/A | 0 | 0.16M | 32.13 | 72.18 | 70.37 | 78.81 | 61.51 | 92.50 |
| WebFace160K † | N/A | 0 | 0.16M | 34.84 | 74.10 | 72.56 | 81.26 | 62.59 | 93.32 |
| ScoreMix Repro | N | 0.2M | 0 | 28.15 | 57.71 | 54.66 | 67.06 | 56.38 | 92.47 |
| AugGen | N | 0.2M | 0.16M | 34.83 | 76.21 | 75.02 | 82.91 | 61.41 | 93.78 |
| ScoreMix (Ours) | N | 0.2M | 0.16M | 35.95 | 76.41 | 76.45 | 83.58 | 63.09 | 93.87 |

4.3 Which classes are best to mix?

In this section, we systematically study which classes are best for approaches like AugGen (Rahimi et al., 2025) or our ScoreMix. By “best,” we mean that the generated samples using the selected classes deliver the highest performance increase compared to the baseline discriminator. To determine this, we first compare the distances between every pair of classes in (i) the learned condition space of the generator and (ii) the embedding space of the discriminator. More precisely, given $l$ labels in our dataset, we train a discriminator that maps each class to an embedding vector, forming an embedding matrix $E\in\mathbb{R}^{l\times d_{E}}$ (i.e., the learned class centers used for margin losses). Similarly, for each class we have a unique condition vector that is mapped to the hidden latent of the denoiser network, forming a matrix $C\in\mathbb{R}^{l\times d_{C}}$.

For $E$, since it arises from the discriminator’s training, we use cosine distance as our metric, which we denote $\mathrm{dist_{emb}}$. For the condition space $C$, we experiment with two popular metrics, cosine distance and Euclidean (L2) distance, both denoted $\mathrm{dist_{cond}}$. This process is depicted in Algorithm 2.

We explore the following hypotheses:

    1. Classes that are closest in the embedding space may be less helpful: because the generator is imperfect, it cannot capture subtle differences between already similar classes, yielding samples that do not challenge the discriminator.
    2. Under common metrics (e.g., cosine or L2), interpolating between closer conditions may produce better overall samples, potentially improving the discriminator’s performance.
    3. A combination: select source classes that are both close in the condition space and distant in the embedding space.

For each setting, we select 10K class pairs and generate 20 samples per pair, matching the size of the original dataset. Results are shown in Table 2. The “Class Sel Mixing Strategy” column indicates how classes were chosen: Random (as in (Rahimi et al., 2025)), or based on their distances.

The first key observation is that adding these augmentations increases average discriminator performance by up to 6%, independent of the mixing strategy. To validate hypothesis (1), we compare strategies based on embedding distances and find that mixing pairs with larger embedding distances yields the greatest gains. In contrast, selecting classes according to condition distances (Close/Dist measured in cosine or $L_{2}$) has a negligible effect, thereby invalidating hypothesis (2). The critical role of the selection process is also evident from the “Diff.” column of Table 2. For instance, when source classes are chosen by their embedding-space distances, the gap in average performance between the Close and Dist strategies is 2.52 (substantially higher than the 0.11 or 0.56 observed when selecting in condition space), highlighting the importance of sampling based on the embedding space. Finally, regarding hypothesis (3), selecting on the two spaces jointly (Top/Worst Close Cond, Dist Embed) does not reach the gains achieved by selecting on the embedding space alone.

Takeaway: Choosing the source classes according to their distance in embedding space under common distances has more impact on the performance increase of the discriminator. Mixing the most distant classes is the most effective class selection strategy for increasing the performance of the discriminator.
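The distance-based selection can be sketched as a ranking of all class pairs by cosine distance between class centers. This is a hedged illustration: it assumes a global top-$K$ ranking in which a class may appear in several pairs, which the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(E):
    """Pairwise cosine distances between the rows (class centers) of E."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    return 1.0 - En @ En.T

def select_pairs(E, num_pairs, farthest=True):
    """Return the `num_pairs` most distant (or closest) class pairs (i < j)."""
    D = cosine_distance_matrix(E)
    rows, cols = np.triu_indices(E.shape[0], k=1)   # unordered pairs only
    order = np.argsort(D[rows, cols])               # ascending distance
    if farthest:
        order = order[::-1]
    chosen = order[:num_pairs]
    return [(int(rows[t]), int(cols[t])) for t in chosen]

centers = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
far_pairs = select_pairs(centers, num_pairs=1)
```

For 10,000 classes the dense distance matrix holds $10^{8}$ entries, so a blocked computation may be needed on memory-constrained machines.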

Learned discriminator features as generators condition.

As highlighted previously, under common metrics, there is no clear correspondence between the discriminator’s embedding space and the generator’s learned condition space; please refer to Appendix E for details. This motivates initializing the generator’s condition space with the discriminator’s class centers and freezing them, to see whether the missing correspondence can be enforced. We quickly find that this approach is not feasible: the generator fails to converge.

Takeaway: Diffusion generators tend not to converge or produce plausible results when we use the highly discriminative features as the frozen conditions.

Alignment between condition and recognition spaces.

We study whether the generator’s conditional embeddings preserve the discriminative geometry of a face recognition (FR) backbone. For each training snapshot, we extract one embedding per class from the generator’s conditioning module and compare them to the corresponding FR class centers. We report two complementary metrics: (i) Centered Kernel Alignment (CKA) (Kornblith et al., 2019), which captures global linear relational similarity; and (ii) CKNNA (Centered Kernel Nearest-Neighbor Alignment) (Huh et al., 2024), which emphasizes local neighborhood agreement via a soft $k$NN kernel. See Appendix F for their exact definition.
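As a point of reference, linear CKA between two per-class representation matrices can be computed in a few lines. The sketch below uses the feature-space form of linear CKA from Kornblith et al. (2019); whether the paper uses exactly this kernel is an assumption, since the exact definition is deferred to Appendix F.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between X (n x d1) and Y (n x d2), rows aligned by class."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))
```

The value is invariant to orthogonal transformations and isotropic rescaling of either space, which makes it suitable for comparing a condition space and an embedding space of different dimensionalities.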

Figure 4: Geometry preservation of various spaces measured using CKA during the training of the generator.

Figure 5: Alignment loss to class-centers before and after applying alignment regularization during the training of the generator.

Figure 6: Intra Class Similarity (ICS) before and after applying alignment regularization during the training of the generator.

Interpretation. Higher values ($\uparrow$) indicate that the studied spaces are geometrically aligned, i.e., class-wise relations are preserved. Empirically, we observe that phases of training with higher CKA/CKNNA correspond to more stable discriminative performance, suggesting as a future direction a condition regularization that explicitly encourages recognition-aware geometry.

To test whether alignment is backbone-specific or universal, we also compare the condition space against multiple recognition models (trained on the same dataset), treating their class centers as additional anchors. Figure 4 demonstrates consistent alignment across backbones, strengthening the evidence that the generator’s conditions capture dataset-intrinsic semantics (note the overlap of the solid lines). We observe that embeddings from different backbones trained on the same dataset but with distinct loss heads (e.g., Arc/Ada-IR50/100) exhibit highly similar geometric structures (i.e., dashed horizontal lines above 0.9). Their alignments with the condition space are also mutually consistent, although the condition space itself remains significantly farther from the embedding spaces than the embedding spaces are from each other. Since the condition space evolves throughout training, its geometry varies across steps. Nonetheless, it retains some structural similarity to the embedding spaces, unlike a random baseline (RandN), which is a matrix with the same number of rows as the condition or embedding space, initialized from a standard Gaussian.

Closely related to the nature of how we select the pairs, we introduce the following theorem, which investigates how the pair-wise distances (the selection process of the pairs for mixing) can be preserved in relation to CKA values.

Informal Theorem (CKA and Preservation of Local Geometry). Let $\rho=\mathrm{CKA}(X,Y)$ be the centered-kernel alignment between the normalized Grams $\widehat{K},\widehat{L}$, let $\Delta_{\widehat{K}}>0$ denote the (centered, normalized) triplet margin in the reference embedding (Appendix G), and define $N=\frac{n(n-1)}{2}$. Under the $\widehat{K}$-orthogonal, energy-matched Gaussian misalignment model (Appendix G), the relaxed probability bound that the triplet order is preserved in $Y$ is

$$\Pr[\Delta_{\widehat{L}}>0]\geq\Phi\!\left(\frac{\rho\,\Delta_{\widehat{K}}}{\sqrt{c_{\mathrm{mask}}\,(1-\rho)}}\right),\qquad c_{\mathrm{mask}}=\begin{cases}\dfrac{12}{N-1},&\text{Euclidean squared-distance margins},\\[6pt]\dfrac{2}{N-1},&\text{cosine-similarity margins},\end{cases}$$

which is strictly increasing in $\rho\in(0,1)$.

See Appendix G for a formal statement, proof, and experimental validation of the theorem.
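To get a feel for the bound, it can be evaluated numerically. The helper below is a small sketch that simply plugs values into the stated inequality using the standard normal CDF $\Phi$; the function name and the example values of $\rho$, margin, and $n$ are illustrative and not taken from the paper.

```python
import math

def preservation_lower_bound(rho, margin, n, metric="cosine"):
    """Evaluate Phi(rho * margin / sqrt(c_mask * (1 - rho))) for 0 <= rho < 1, n >= 3."""
    num_pairs = n * (n - 1) / 2                       # N in the theorem
    numerator = {"euclidean": 12.0, "cosine": 2.0}[metric]
    c_mask = numerator / (num_pairs - 1)
    z = rho * margin / math.sqrt(c_mask * (1.0 - rho))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

low = preservation_lower_bound(rho=0.5, margin=0.01, n=100)
high = preservation_lower_bound(rho=0.9, margin=0.01, n=100)
```

At $\rho=0$ the bound reduces to $\Phi(0)=0.5$, i.e., no better than chance, and it grows toward 1 as $\rho$ approaches 1 for any positive margin.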

Conjecture 4.2.

As $\mathrm{CKA}(X,Y)\to 1$, the preservation probability approaches $1$. Equivalently, in the limit of perfect alignment, almost all local geometric inequalities are preserved.

This suggests that the same observations and methodology carry over when the pairwise distances are taken from another discriminator trained on the same dataset (e.g., with a different loss or backbone), further supporting the robustness and importance of the sample selection strategy.

Table 2: Effect of different strategies for choosing classes to mix when generating augmentations to enhance the discriminator’s performance. Class Sel Mixing Strategy refers to how we select the classes to mix for the final generation. The Avg column is the average of all reported metrics. For each pair of rows grouped together (e.g., Close Embedding Cosine and Dist Embedding Cosine), the Diff. column gives the absolute difference of their average metrics, indicating the effectiveness of the studied selection strategy.

| Class Sel Mixing Strategy | $n^{s}$ | $n^{r}$ | B-1e-6 | B-1e-5 | C-1e-6 | C-1e-5 | TR1 | TR5 | Avg | Diff. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WebFace160K | 0 | 0.16M | 33.15 | 72.54 | 70.42 | 78.62 | 61.51 | 66.68 | 63.82 | N/A |
| Random | 0.2M | 0.16M | 34.83 | 76.21 | 75.02 | 82.91 | 61.41 | 66.60 | 66.17 | N/A |
| Close Embedding Cosine | 0.2M | 0.16M | 34.78 | 73.12 | 71.86 | 81.00 | 61.91 | 66.82 | 64.92 | 2.52 |
| Dist Embedding Cosine | 0.2M | 0.16M | 34.42 | 77.46 | 78.62 | 84.04 | 62.66 | 67.46 | 67.44 | |
| Close Condition Cosine | 0.2M | 0.16M | 37.61 | 76.38 | 74.43 | 82.71 | 62.29 | 67.65 | 66.84 | 0.11 |
| Dist Condition Cosine | 0.2M | 0.16M | 34.52 | 77.17 | 76.97 | 83.15 | 62.39 | 67.49 | 66.95 | |
| Close Condition L2 | 0.2M | 0.16M | 37.18 | 72.67 | 72.20 | 80.71 | 62.12 | 66.52 | 65.23 | 0.56 |
| Dist Condition L2 | 0.2M | 0.16M | 33.34 | 75.63 | 75.82 | 82.02 | 61.61 | 66.34 | 65.79 | |
| Top Close Cond, Dist Embed | 0.2M | 0.16M | 34.74 | 76.94 | 76.70 | 83.87 | 62.47 | 67.14 | 66.98 | 1.76 |
| Worst Close Cond, Dist Embed | 0.2M | 0.16M | 33.27 | 74.45 | 74.50 | 81.22 | 61.13 | 66.77 | 65.22 | |
| 3-Plet Sum Max | 0.2M | 0.16M | 31.91 | 74.74 | 74.36 | 81.73 | 63.26 | 68.16 | 65.69 | 1.07 |
| 3-Plet Sum Min | 0.2M | 0.16M | 31.56 | 73.80 | 73.11 | 80.27 | 61.96 | 67.02 | 64.62 | |
| Repro Aligned | 0.2M | 0 | 27.66 | 54.71 | 45.79 | 59.90 | 42.80 | 48.44 | 46.55 | N/A |

4.4 Beyond two classes

Here, we study whether we can exploit the gains we observed for more than two classes.

GPU-accelerated exact extreme m-plet mining.

We study the top-K subsets of size m ∈ {3,4} that optimize a permutation-invariant functional F of the \binom{m}{2} intra-set distances. Naively, m=3 requires Θ(N³) candidate evaluations (and m=4 requires Θ(N⁴)), which is prohibitive on CPUs even for moderate N. Our key observation is that the exhaustive search can be reorganized into tile-parallel column reductions that map to high-throughput matrix multiplications and fused argmax/argmin over candidates. This GPU-accelerated approach makes the search feasible even on consumer-level hardware for moderate N (less than an hour on an RTX 3090 Ti for m=3). To compare across m, we report both the sum and its size-invariant version (the mean), i.e., the sum divided by \binom{m}{2} (1 for pairs, 3 for triples, 6 for quads). In Table 2, we report 3-plet Sum/Mean under Min/Max objectives; while m=3 improves over the baseline, it does not match the simpler m=2 setting. These observations lead us to focus on m=2 in the main experiments and not to pursue mixing and training on 4-plets. See Appendix H for more technical details.
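To make the search space concrete, the following is a minimal CPU sketch of exhaustive 3-plet mining for small N, assuming F is the sum of the three pairwise distances. It is not the tile-parallel GPU kernel described above; the function name `top_k_triplets` and the toy distance matrix are illustrative assumptions.

```python
import numpy as np

def top_k_triplets(dist, k=3, maximize=True):
    """Return the k index triplets (i < j < l) with extreme pairwise-distance sum.

    dist is a symmetric (N, N) distance matrix. Memory is O(N^3), so this
    brute-force version is only meant for small N.
    """
    n = dist.shape[0]
    # score[i, j, l] = d(i, j) + d(i, l) + d(j, l), built by broadcasting.
    score = dist[:, :, None] + dist[:, None, :] + dist[None, :, :]
    # Keep each unordered triplet once by requiring i < j < l.
    idx = np.arange(n)
    valid = (idx[:, None, None] < idx[None, :, None]) & (
        idx[None, :, None] < idx[None, None, :]
    )
    fill = -np.inf if maximize else np.inf
    flat = np.where(valid, score, fill).ravel()
    order = np.argsort(-flat if maximize else flat)[:k]
    results = []
    for o in order:
        triplet = tuple(int(v) for v in np.unravel_index(o, score.shape))
        results.append((triplet, float(flat[o])))
    return results

# Toy usage: four points on a line (hypothetical data).
pts = np.array([0.0, 1.0, 2.0, 10.0])
toy_dist = np.abs(pts[:, None] - pts[None, :])
print(top_k_triplets(toy_dist, k=1, maximize=False))
```

Dividing each returned score by 3 gives the size-invariant mean objective for triples.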

Takeaway: With current SOTA diffusion-based generators, mixing more than two classes brings no recognition gain over mixing pairs.

4.5 The More Aligned, the Better?

As shown in Table 1, training a discriminator on generator reproductions yields lower performance than training on the original dataset. This is expected, since the generator cannot fully capture the fine-grained variations of the real data. To address this, we investigated whether aligning the generator outputs to the discriminator’s class centers could help. Figure 6 shows that our regularization indeed improves alignment of generated samples to class centers. However, this comes at the cost of reduced intra-class variability (higher intra-class similarity in Figure 6), which is crucial for capturing identity-preserving information. Consequently, recognition performance on the reproduction set decreases (see the last line of Table 2). Details of the setup, including Coverage and FD metrics and the loss combination with our novel SNR weighting, are provided in Appendix H.

Takeaway: Aligning generator outputs to class centers yields no additional benefit, showing that recognition performance can be achieved without this extra constraint when training on reproduction samples.

5 Conclusions

We have shown that the compositional properties of diffusion model scores can be exploited to substantially improve recognition performance. The approach surpasses the gains from scaling the discriminator capacity and highlights that synthetic augmentation is a more effective alternative. Our analysis further identified which class combinations are most useful for augmentation. Interestingly, we found no clear correlation under standard distance metrics between the generator’s condition space and the discriminator’s feature space, and forcing the generator to align with class centers during training did not improve discriminator accuracy. To strengthen robustness, we proved that class selection remains stable even under variations in backbone architectures. Finally, we establish a theoretical connection between geometrical alignment metrics (e.g., CKA) and the induced ordering of class pairs, which underpins the stability of our class-mixing strategy to changes in the discriminator selection.

Limitations.

While our method avoids the need for discriminator-based grid search (unlike AugGen, Rahimi et al. (2025)), it incurs a higher computational sampling cost: generating m-plets requires roughly m times the cost of AugGen. This may limit scalability in very large augmentation regimes.

Future work.

Our findings reveal little correlation between the generator’s condition space and the representation space of a strong discriminator. A promising future direction is to investigate whether explicit regularization of the condition space guided by discriminative geometry can improve augmentation quality without sacrificing sample diversity. In particular, exploring representation alignment techniques (e.g., contrastive or CKA-based objectives) may help bridge the gap between generative and discriminative spaces, potentially unlocking further gains in recognition performance.

Reproducibility statement.

All results in this paper are reproducible; the corresponding code and synthetic datasets will be publicly released.

Acknowledgment.

This research is based on work conducted in the SAFER project and supported by the Hasler Foundation’s Responsible AI program.

References

Appendix A Appendix

Appendix B More Examples on choosing α\alpha and β\beta

Figures 7, 8, and 9 show more examples for different values of α and β. In each panel, the ID combinations are fixed across the figures to also highlight the consistency of the IDs under different sources of randomness. Within each figure, all seeds were fixed to the same initial value, so that the figure isolates the effect of mixing the conditions and of the values of α and β.

Appendix C Issues with generalist models

Figure 7: Effect of mixing scores in ScoreMix. Sub-figures 7(a) to 7(d) show the images obtained for four different input pairs while sweeping the mixing coefficients α (horizontal axis, increasing left → right) and β (vertical axis, increasing top → bottom). All sources of randomness were fixed: every seed was set to the initial value 0.

Figure 8: Effect of mixing scores in ScoreMix. Sub-figures 8(a) to 8(d) show the images obtained for four different input pairs while sweeping the mixing coefficients α (horizontal axis, increasing left → right) and β (vertical axis, increasing top → bottom). All sources of randomness were fixed: every seed was set to the initial value 1.

Figure 9: Effect of mixing scores in ScoreMix. Sub-figures 9(a) to 9(d) show the images obtained for four different input pairs while sweeping the mixing coefficients α (horizontal axis, increasing left → right) and β (vertical axis, increasing top → bottom). All sources of randomness were fixed: every seed was set to the initial value 6.

Appendix D Alignment Augmented Loss

Here, we describe how we applied the alignment loss during training of the generator.

D.1 Preliminaries: The EDM2 Loss Function

We build upon the uncertainty-aware loss function from the EDM2 (Karras et al., 2024b) framework. At each training step, a noise level σ\sigma is sampled, and a clean image 𝑿{\bm{\mathsfit{X}}} is corrupted to 𝑿σ{\bm{\mathsfit{X}}}_{\sigma}. The network S​(𝑿,σ)\mathrm{S}({\bm{\mathsfit{X}}},\sigma) then predicts the denoised image 𝑿^0\hat{{\bm{\mathsfit{X}}}}_{0} and a log-variance term log⁡(𝐯)\log(\mathbf{v}). The loss is evaluated over a distribution of noise levels, training the network to denoise effectively across the entire corruption process:

ℒdiff=σ2+σdata2(σ⋅σdata)2⋅1exp⁡(log⁡(𝐯))⋅(𝑿^0−𝑿)2+log⁡(𝐯)\mathcal{L}_{\text{diff}}=\frac{\sigma^{2}+\sigma_{\text{data}}^{2}}{(\sigma\cdot\sigma_{\text{data}})^{2}}\cdot\frac{1}{\exp(\log(\mathbf{v}))}\cdot(\hat{{\bm{\mathsfit{X}}}}_{0}-{\bm{\mathsfit{X}}})^{2}+\log(\mathbf{v}) (6)

The negative values this loss can produce reflect the model learning to be confident (low predicted log⁡(𝐯)\log(\mathbf{v})) only when its denoising predictions are accurate.
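As a sanity check on equation 6, here is a minimal NumPy sketch of the uncertainty-weighted loss for a single sample. It assumes the squared error is averaged over pixels, a scalar log-variance, and a default `sigma_data=0.5`; these choices and the function name are illustrative assumptions, not the paper's training code.

```python
import numpy as np

def edm2_uncertainty_loss(x_hat, x, sigma, logvar, sigma_data=0.5):
    """Equation 6 for one sample: weighted squared error plus log-variance.

    The per-pixel mean and the sigma_data default are assumptions.
    """
    # Noise-dependent weight (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2.
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    sq_err = np.mean((x_hat - x) ** 2)
    # Dividing by exp(logvar) down-weights the error when the model is uncertain.
    return weight / np.exp(logvar) * sq_err + logvar

x = np.zeros((4, 4))
# A perfect prediction with a confident (negative) log-variance gives a negative loss.
print(edm2_uncertainty_loss(x, x, sigma=1.0, logvar=-2.0))
```

The last line illustrates the remark above: with zero reconstruction error the loss equals the predicted log-variance, which can be negative.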

D.2 Discriminator-Guided Alignment of the Denoising Path

While ℒdiff\mathcal{L}_{\text{diff}} guides the pixel-level accuracy of the prediction 𝑿^0\hat{{\bm{\mathsfit{X}}}}_{0} at each step, it does not explicitly enforce its semantic integrity. We introduce an auxiliary loss to align the network’s prediction at every timestep with its corresponding class identity.

We denote by \mathcal{F}_{\text{fr}}(\cdot) the feature extractor of a pre-trained face recognition (FR) model (usually f_{\theta_{\mathrm{dis}}} without the classification head). For each class k, we pre-compute the class center \mathbf{c}_{k}=\mathbb{E}_{{\bm{\mathsfit{X}}}\sim\text{class}_{k}}[\mathcal{F}_{\text{fr}}({\bm{\mathsfit{X}}})]. Alternatively, the class centers of f_{\theta_{\mathrm{dis}}} can be used.

For a given noisy input {\bm{\mathsfit{X}}}_{\sigma} from a sample of class k, the diffusion model \mathrm{S} predicts the denoised image \hat{{\bm{\mathsfit{X}}}}_{0}=\mathrm{S}({\bm{\mathsfit{X}}}_{\sigma},\sigma). We apply the alignment loss to this prediction:

| ℒalign=1−ℱfr​(S​(𝑿σ,σ))⋅𝐜k‖ℱfr​(S​(𝑿σ,σ))‖​‖𝐜k‖\mathcal{L}_{\text{align}}=1-\frac{\mathcal{F}_{\text{fr}}(\mathrm{S}({\bm{\mathsfit{X}}}_{\sigma},\sigma))\cdot\mathbf{c}_{k}}{\|\mathcal{F}_{\text{fr}}(\mathrm{S}({\bm{\mathsfit{X}}}_{\sigma},\sigma))\|\|\mathbf{c}_{k}\|} | (7) | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |

This loss acts as a semantic gradient, pulling the network’s prediction at each step towards the correct identity manifold, thereby guiding the entire denoising path.
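Equation 7 is one minus a cosine similarity. A minimal sketch on precomputed feature vectors follows; the feature extractor itself is not modeled, and the name `alignment_loss` and the `eps` guard are illustrative assumptions.

```python
import numpy as np

def alignment_loss(feat, center, eps=1e-12):
    """1 - cosine similarity between a predicted-image feature and a class center."""
    cos = np.dot(feat, center) / (np.linalg.norm(feat) * np.linalg.norm(center) + eps)
    return 1.0 - cos

center = np.array([1.0, 0.0])
print(alignment_loss(np.array([3.0, 0.0]), center))  # same direction: close to 0
print(alignment_loss(np.array([0.0, 2.0]), center))  # orthogonal: close to 1
```

The loss ranges from 0 (feature parallel to the class center) to 2 (antiparallel) and ignores feature magnitude.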

D.3 Noise-Aware Loss Weighting

The core challenge in applying this auxiliary loss is that the prediction, 𝑿^0=S​(𝑿σ,σ)\hat{{\bm{\mathsfit{X}}}}_{0}=\mathrm{S}({\bm{\mathsfit{X}}}_{\sigma},\sigma), is an estimate whose reliability is a direct function of the noise level σ\sigma. At high noise levels (low Signal-to-Noise Ratio), this prediction is a high-variance estimate. Enforcing a strict feature-space constraint on such a high-variance prediction can introduce conflicting gradients and destabilize training. Conversely, at low noise levels (high SNR), the prediction is a much more reliable, lower-variance estimate, making it an ideal target for semantic guidance.

We therefore modulate the alignment loss with a dynamic, SNR-aware weight wsnr​(σ)w_{\text{snr}}(\sigma) that scales the loss based on the reliability of the prediction:

wsnr​(σ)=exp⁡(−k⋅σ2)w_{\text{snr}}(\sigma)=\exp(-k\cdot\sigma^{2}) (8)

where kk is a hyperparameter. This weighting scheme ensures that the semantic guidance from ℒalign\mathcal{L}_{\text{align}} is applied most strongly only when the model’s denoised prediction is coherent and meaningful.
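A small sketch of equation 8 shows how quickly the weight decays with the noise level. The value `k=1.0` is an arbitrary illustration, not a setting reported in the paper.

```python
import math

def snr_weight(sigma, k=1.0):
    """Equation 8: exp(-k * sigma^2); close to 1 at low noise, close to 0 at high noise."""
    return math.exp(-k * sigma**2)

for sigma in (0.0, 0.5, 1.0, 2.0):
    print(sigma, snr_weight(sigma))
```

Larger k concentrates the alignment signal on the low-noise end of the denoising path.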

D.4 Curriculum for Stable Alignment

To further stabilize training, especially in the initial phases where the generator is still learning basic image structures, we introduce the alignment loss gradually. We define a start point, nstartn_{\text{start}}, and a ramp-up duration, nrampn_{\text{ramp}}, measured in training images. The curriculum weight, wrampw_{\text{ramp}}, scales the influence of the alignment loss based on the current training progress, ncurn_{\text{cur}}:

wramp=min⁡(max⁡(0,ncur−nstartnramp),1.0)w_{\text{ramp}}=\min\left(\max\left(0,\frac{n_{\text{cur}}-n_{\text{start}}}{n_{\text{ramp}}}\right),1.0\right) (9)

This allows the network to first learn basic image synthesis before being gently steered by the alignment objective.
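Equation 9 is a clipped linear ramp. The sketch below uses hypothetical image counts purely for illustration.

```python
def ramp_weight(n_cur, n_start, n_ramp):
    """Equation 9: 0 before n_start, linear over n_ramp images, then 1."""
    return min(max(0.0, (n_cur - n_start) / n_ramp), 1.0)

# Hypothetical schedule: start after 100 images, ramp over the next 200.
for n_cur in (0, 100, 200, 300, 1000):
    print(n_cur, ramp_weight(n_cur, n_start=100, n_ramp=200))
```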

D.5 Final Loss Formulation

Our final training objective is the expectation over the data distribution and noise levels, combining all components to guide the entire denoising trajectory towards producing semantically and visually accurate results:

ℒtotal=𝔼𝑿,k,σ​[ℒdiff+λ⋅wramp⋅wsnr​(σ)⋅ℒalign]\mathcal{L}_{\text{total}}=\mathbb{E}_{{\bm{\mathsfit{X}}},k,\sigma}\left[\mathcal{L}_{\text{diff}}+\lambda\cdot w_{\text{ramp}}\cdot w_{\text{snr}}(\sigma)\cdot\mathcal{L}_{\text{align}}\right] (10)

where λ\lambda is a scalar hyperparameter balancing the two objectives. This formulation provides a stable and principled method for training a diffusion generator that is guided by semantic constraints at every step of the generation process.

D.6 Evaluation Metrics

We evaluate identity fidelity and intra-class diversity in the feature space of a frozen face-recognition (FR) model, ℱfr​(⋅)\mathcal{F}_{\text{fr}}(\cdot), trained on the target domain. Let G​(z,k)G(z,k) be the image for class kk and seed zz, and Sk={z1,…,zN}S_{k}=\{z_{1},\dots,z_{N}\} a fixed set of NN seeds per class (kept constant across runs).

Feature normalization.

All feature vectors and class centers are ℓ2-normalized prior to computing any metric. Denote \mathbf{f}_{i,k}=\mathrm{norm}\!\left(\mathcal{F}_{\text{fr}}(G(z_{i},k))\right) and \mathbf{c}_{k}^{\text{target}}=\mathrm{norm}\!\left(\text{center from real data}\right). We compute the (pre-)centroid \tilde{\mathbf{c}}_{k}^{\text{gen}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{f}_{i,k} and then re-normalize it, \mathbf{c}_{k}^{\text{gen}}=\mathrm{norm}(\tilde{\mathbf{c}}_{k}^{\text{gen}}). With unit-norm vectors, the cosine distance reduces to d_{\cos}(\mathbf{a},\mathbf{b})=1-\mathbf{a}^{\top}\mathbf{b}.

Alignment Loss to Target Center (Fidelity).

Average cosine distance of samples to the real class center; this is the same quantity reported in the main paper (lower is better):

ℳalign​(k)=1N​∑i=1Ndcos​(𝐟i,k,𝐜ktarget).\mathcal{M}_{\text{align}}(k)=\frac{1}{N}\sum_{i=1}^{N}d_{\cos}(\mathbf{f}_{i,k},\mathbf{c}_{k}^{\text{target}}).

Intra-Class Cosine Similarity (Diversity).

Average cosine similarity of samples of the same class to the generated centroid (lower is better):

\mathcal{M}_{\text{ICS}}(k)=\frac{1}{N}\sum_{i=1}^{N}\bigl(1-d_{\cos}(\mathbf{f}_{i,k},\mathbf{c}_{k}^{\text{gen}})\bigr).

Centroid Shift (Bias).

Cosine distance between generated and target centers (lower is better):

ℳshift​(k)=dcos​(𝐜kgen,𝐜ktarget).\mathcal{M}_{\text{shift}}(k)=d_{\cos}(\mathbf{c}_{k}^{\text{gen}},\mathbf{c}_{k}^{\text{target}}).

Mode Coverage.

Fraction of evaluated classes whose generated centroid is nearest (by cosine similarity) to their own target center among the evaluated subset (higher is better):

| ℳcoverage=1|Keval|​∑k∈Keval𝕀​[k=arg⁡maxj∈Keval​𝐜kgen⊤​𝐜jtarget].\mathcal{M}_{\text{coverage}}=\frac{1}{|K_{\text{eval}}|}\sum_{k\in K_{\text{eval}}}\mathbb{I}\!\left[k=\underset{j\in K_{\text{eval}}}{\arg\max}\ \mathbf{c}_{k}^{\text{gen}\,\top}\mathbf{c}_{j}^{\text{target}}\right]. | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

(If target centers for all classes are available and a stricter criterion is desired, replace K_{\text{eval}} with K_{\text{all}} above.) We report mean ± standard deviation across k ∈ K_{\text{eval}} for distance-based metrics.
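The four feature-space metrics above can be sketched in a few lines of NumPy, assuming the FR features have already been extracted. The function names and the dictionary layout are illustrative assumptions.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale vectors along the last axis to unit l2 norm."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def class_metrics(feats, target_center):
    """Per-class metrics from (N, d) generated features and a (d,) real class center."""
    f = l2_normalize(feats)
    c_tgt = l2_normalize(target_center)
    c_gen = l2_normalize(f.mean(axis=0))      # re-normalized generated centroid
    return {
        "align": float(np.mean(1.0 - f @ c_tgt)),  # fidelity, lower is better
        "ics": float(np.mean(f @ c_gen)),          # intra-class cosine similarity
        "shift": float(1.0 - c_gen @ c_tgt),       # centroid shift, lower is better
        "c_gen": c_gen,
    }

def mode_coverage(gen_centers, target_centers):
    """Fraction of classes whose generated centroid is nearest to its own target center.

    Both inputs are (K, d) with unit-norm rows; row k belongs to class k.
    """
    sims = gen_centers @ target_centers.T
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(sims.shape[0])))
```

Collecting `c_gen` over the evaluated classes and stacking the rows gives the `gen_centers` input of `mode_coverage`.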

FD.

We also report the Fréchet Distance (FD) under various backbones, such as InceptionV3 (Szegedy et al., 2016) and DINOv2 (Oquab et al., 2023), as well as on the embeddings of the same discriminator, denoted \mathrm{FD}_{\mathrm{FR}}.
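For reference, the Fréchet distance between Gaussian fits of two feature sets can be sketched as follows. This is the standard closed form, not the paper's evaluation code; it assumes feature dimension d ≥ 2 and computes the trace of the matrix square root from the eigenvalues of the covariance product.

```python
import numpy as np

def frechet_distance(a, b):
    """Frechet distance between Gaussian fits of two (n, d) feature sets, d >= 2."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b; tiny negative or imaginary parts are numerical noise.
    eig = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

The backbone only enters through the features, so the same function applies to InceptionV3, DINOv2, or FR embeddings.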

We show the results in Figure 10. Here, we observe that light regularization for alignment tends to converge to similar values whether it is applied early or later (i.e., note where the orange and green dashed lines end for both the ICS and Alignment Loss, with the orange plot demonstrating much earlier regularization). We also observe that although the Alignment Loss is decreasing, the ICS is increasing, which causes the generated images to appear less diverse. We believe this is the main reason why reproduction with a more aligned generator penalizes the downstream performance of the discriminator on the reproduction dataset. Additionally, as highlighted in earlier works (Stein et al., 2023), FD\mathrm{FD} does not correlate well with sample quality and downstream performance (Rahimi et al., 2025). In contrast, FDDINOv2\mathrm{FD}_{\mathrm{DINOv2}} better captures this correlation. Moreover, highly discriminative features (e.g., FR features) also do not appear well suited for reporting sample quality.


Figure 10: Effect of alignment regularization on various metrics during the training of the diffusion-based generator.

Appendix E Illustration of Embedding and Condition Space

After normalizing and identifying the most similar pairs (i.e., those with cosine distance 0, or equivalently, cosine similarity 1), we shift these zero distances to –1 to improve visual contrast. The resulting distance matrices for all sample pairs, 𝐄\mathbf{E} and 𝐂\mathbf{C}, are shown in Figure 11. From these plots, we see no obvious correlation between the two spaces.


Figure 11: Shifted cosine distance matrices between each pair of classes in the condition and embedding spaces.

As another way of viewing this, we flatten the matrices and select a few pairs as a set \mathcal{S}:

\mathcal{S}\;\subseteq\;\{1,\dots,10000\}\times\{1,\dots,10000\},\qquad(i,j)\in\mathcal{S}.

Treating the distances as a 1D signal, where each tick of the x-axis corresponds to a unique combination of i and j, we obtain the plot in Figure 12. The red vertical lines mark pairs for which both the condition-space and the embedding-space distances are below 0.4. We also apply peak detection, especially in the embedding space, since we showed that the more distant two classes are in the embedding space, the more beneficial the synthetic samples are. Here, we again observe that these two spaces do not correlate well.


Figure 12: A few selected distances in the condition and embedding spaces. The x-axis enumerates a few unique combinations of classes.

Appendix F Alignment Metrics and Evaluation Protocol

Linear CKA (Kornblith et al., 2019).

Given representations X\in\mathbb{R}^{n\times d_{x}} and Y\in\mathbb{R}^{n\times d_{y}} for the same n items (e.g., class centers), let

H=In−1n​𝟏𝟏⊤,KX=H​X​X⊤​H,KY=H​Y​Y⊤​H.H=I_{n}-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},\quad K_{X}=HXX^{\top}H,\quad K_{Y}=HYY^{\top}H.

where \mathbf{1} is the all-ones vector of size n and I_{n} is the identity matrix. The (linear) CKA similarity is

| CKA​(X,Y)=⟨KX,KY⟩F‖KX‖F​‖KY‖F=‖X⊤​Y‖F2‖X⊤​X‖F​‖Y⊤​Y‖F.\mathrm{CKA}(X,Y)\;=\;\frac{\langle K_{X},K_{Y}\rangle_{F}}{\|K_{X}\|_{F}\,\|K_{Y}\|_{F}}\;=\;\frac{\|X^{\top}Y\|_{F}^{2}}{\|X^{\top}X\|_{F}\,\|Y^{\top}Y\|_{F}}. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |

Values near 1 indicate strong global relational alignment; values near 0 indicate weak or no alignment.
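A minimal NumPy sketch of linear CKA follows. It column-centers X and Y first, which is equivalent to applying H, so the feature-space form of the ratio above can be used directly; the function name is an illustrative assumption.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between (n, d_x) and (n, d_y) representations of the same n items."""
    # Subtracting column means is the same as left-multiplying by H.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / den)
```

The value is invariant to orthogonal transformations and isotropic scaling of either representation, so `linear_cka(x, 2.0 * x @ q)` stays at 1 for any orthogonal `q`.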

CKNNA (Huh et al., 2024).

CKNNA measures local (neighborhood) alignment. For a temperature τ>0\tau>0, define a soft neighbor kernel on XX:

| AX​(i,j)={exp⁡(⟨x^i,x^j⟩/τ)∑k≠iexp⁡(⟨x^i,x^k⟩/τ)if ​i≠j,0if ​i=j,x^i=xi‖xi‖2,A_{X}(i,j)\;=\;\begin{cases}\displaystyle\frac{\exp\big(\langle\hat{x}_{i},\hat{x}_{j}\rangle/\tau\big)}{\sum_{k\neq i}\exp\big(\langle\hat{x}_{i},\hat{x}_{k}\rangle/\tau\big)}&\text{if }i\neq j,\\[6.0pt] 0&\text{if }i=j,\end{cases}\quad\hat{x}_{i}=\frac{x_{i}}{\|x_{i}\|_{2}}, | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

and analogously AYA_{Y} for YY. Using centered versions A~X=H​AX​H\tilde{A}_{X}=HA_{X}H and A~Y=H​AY​H\tilde{A}_{Y}=HA_{Y}H, the (cosine-type) CKNNA similarity is

| CKNNA​(X,Y)=⟨A~X,A~Y⟩F‖A~X‖F​‖A~Y‖F.\mathrm{CKNNA}(X,Y)\;=\;\frac{\langle\tilde{A}_{X},\tilde{A}_{Y}\rangle_{F}}{\|\tilde{A}_{X}\|_{F}\,\|\tilde{A}_{Y}\|_{F}}. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

Smaller τ emphasizes sharper, more discrete neighborhoods; larger τ yields smoother neighborhoods. Higher values indicate better agreement of local neighborhoods across spaces. We set X = generator condition embeddings (one per class) and Y = FR class centers. We report CKA(X,Y) and CKNNA(X,Y) as similarities in [0,1] (higher is better). Intuitively, CKA captures global relational structure, while CKNNA emphasizes whether each class’s nearest neighbors (by angular similarity) are consistent across the two spaces. We repeat the above with centers from several recognition models trained on the same dataset (e.g., ArcFace, AdaFace). Consistently high alignment across backbones implies that the learned embedding spaces are highly similar.
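The soft-neighbor variant of CKNNA written out above can be sketched as follows. The code follows the formulas in this appendix, not any reference implementation; the default `tau=0.1` and the function names are illustrative assumptions.

```python
import numpy as np

def soft_neighbor_kernel(x, tau):
    """Row-wise softmax of cosine similarities / tau, with a zero diagonal."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    logits = xn @ xn.T / tau
    np.fill_diagonal(logits, -np.inf)             # exclude self-similarity
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def cknna(x, y, tau=0.1):
    """Cosine between the centered soft-neighbor kernels of x and y."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    ax = h @ soft_neighbor_kernel(x, tau) @ h
    ay = h @ soft_neighbor_kernel(y, tau) @ h
    return float(np.sum(ax * ay) / (np.linalg.norm(ax) * np.linalg.norm(ay)))
```

Lowering `tau` makes each row of the kernel concentrate on the nearest neighbors, which is the "sharper neighborhood" regime described above.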

Figure 13: CKA/CKNNA plots under different random initialization schemes. Panels: (a) CKA Normal Init of Rand; (b) CKA Uniform Init of Rand; (c) CKNNA Uniform Init of Rand.

Appendix G Full version and proof of Theorem 4.3

Theorem G.1 (CKA and local-order preservation under K^\widehat{K}-orthogonal, energy-matched Gaussian misalignment).

Let X,Y∈ℝn×dX,Y\in\mathbb{R}^{n\times d} and define the centered Gram matrices

K=H​X​X⊤​H,L=H​Y​Y⊤​H,K\;=\;HXX^{\top}H,\qquad L\;=\;HYY^{\top}H, (11)

with H=I−1n​𝟏𝟏⊤H=I-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}. Normalize K^≔K/‖K‖F\widehat{K}\coloneqq K/\|K\|_{F}, L^≔L/‖L‖F\widehat{L}\coloneqq L/\|L\|_{F}, and define the (linear) CKA

ρ≔⟨K^,L^⟩F∈[0,1).\rho\;\coloneqq\;\left\langle\widehat{K},\widehat{L}\right\rangle_{F}\in[0,1). (12)

For distinct indices (i,j,k)(i,j,k), define the squared-Euclidean triplet mask Ti;j​k∈𝕊nT_{i;jk}\in\mathbb{S}^{n} by

(Ti;j​k)j​j=+1,(Ti;j​k)k​k=−1,(Ti;j​k)i​j=(Ti;j​k)j​i=−1,(Ti;j​k)i​k=(Ti;j​k)k​i=+1,(T_{i;jk})_{jj}=+1,\quad(T_{i;jk})_{kk}=-1,\quad(T_{i;jk})_{ij}=(T_{i;jk})_{ji}=-1,\quad(T_{i;jk})_{ik}=(T_{i;jk})_{ki}=+1, (13)

and 0 elsewhere. Let \mathcal{S}_{c}\coloneqq\{M\in\mathbb{S}^{n}:M\mathbf{1}=\mathbf{0}\} and N\coloneqq\dim(\mathcal{S}_{c})=\frac{n(n-1)}{2}. (Footnote 1: Note that \dim\{M\}=\dim\{M^{\top}\}=n(n+1)/2. The centering map M\rightarrow M\mathbf{1} has rank n on \mathbb{S}^{n}, so \dim(\mathcal{S}_{c})=n(n+1)/2-n=n(n-1)/2.) Let T_{c}\coloneqq HT_{i;jk}H\in\mathcal{S}_{c} (so \|T_{c}\|_{F}\leq\|T_{i;jk}\|_{F}=\sqrt{6}). Define the centered, normalized triplet margins

ΔK^≔⟨Tc,K^⟩F,ΔL^≔⟨Tc,L^⟩F.\Delta_{\widehat{K}}\coloneqq\left\langle T_{c},\widehat{K}\right\rangle_{F},\qquad\Delta_{\widehat{L}}\coloneqq\left\langle T_{c},\widehat{L}\right\rangle_{F}. (14)

Assume the following misalignment model on the Hilbert space (𝒮c,⟨⋅,⋅⟩F)(\mathcal{S}_{c},\left\langle\cdot,\cdot\right\rangle_{F}):

    1. L^=ρ​K^+E\widehat{L}\;=\;\rho\,\widehat{K}\;+\;E with ⟨E,K^⟩F=0\left\langle E,\widehat{K}\right\rangle_{F}=0 (orthogonal decomposition);
    1. EE is a zero-mean Gaussian random element supported on {K^}⟂∩𝒮c\{\widehat{K}\}^{\perp}\cap\mathcal{S}_{c} that is isotropic on that (N−1)(N-1)-dimensional slice: its covariance is σ2​I\sigma^{2}I;
    1. the variance level is energy matched,
      σ2=1−ρ2N−1,\sigma^{2}\;=\;\frac{1-\rho^{2}}{\,N-1\,}, (15)
      which yields \mathbb{E}\|E\|_{F}^{2}=1-\rho^{2}.

Then, for any triplet with ΔK^>0\Delta_{\widehat{K}}>0,

| ℙ​[ΔL^>0]=Φ​(ρ​ΔK^​N−1‖Π⟂​Tc‖F​ 1−ρ2),Π⟂​Tc≔Tc−⟨Tc,K^⟩F​K^,\mathbb{P}\big[\Delta_{\widehat{L}}>0\big]\;=\;\Phi\!\left(\frac{\rho\,\Delta_{\widehat{K}}\,\sqrt{N-1}}{\;\|\Pi_{\perp}T_{c}\|_{F}\,\sqrt{\,1-\rho^{2}\,}}\right),\qquad\Pi_{\perp}T_{c}\coloneqq T_{c}-\left\langle T_{c},\widehat{K}\right\rangle_{F}\,\widehat{K}, | (16) | | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- |

where Φ is the standard normal CDF. The right-hand side is strictly increasing in ρ ∈ [0,1), and by continuity the ρ → 1 limit equals 1. (Footnote 2: When ρ = 1, we have σ² = 0, so E ≡ 0 and hence \Delta_{\widehat{L}}=\Delta_{\widehat{K}}>0 deterministically. Therefore \mathbb{P}[\Delta_{\widehat{L}}>0]=1, which also matches the limit of equation 16 as ρ ↑ 1.)

Proof.

All inner products and norms are Frobenius on 𝒮c\mathcal{S}_{c}. By the model, L^=ρ​K^+E\widehat{L}=\rho\,\widehat{K}+E with E∈{K^}⟂E\in\{\widehat{K}\}^{\perp} a.s. For the fixed triplet, define the continuous linear functional Δ​(⋅)≔⟨Tc,⋅⟩F\Delta(\cdot)\coloneqq\left\langle T_{c},\cdot\right\rangle_{F}. Then

ΔL^=Δ​(L^)=ρ​Δ​(K^)+Δ​(E)=ρ​ΔK^+⟨Tc,E⟩F=ρ​ΔK^+⟨Π⟂​Tc,E⟩F,\Delta_{\widehat{L}}\;=\;\Delta(\widehat{L})\;=\;\rho\,\Delta(\widehat{K})+\Delta(E)\;=\;\rho\,\Delta_{\widehat{K}}+\left\langle T_{c},E\right\rangle_{F}\;=\;\rho\,\Delta_{\widehat{K}}+\left\langle\Pi_{\perp}T_{c},E\right\rangle_{F}, (17)

since E∈{K^}⟂E\in\{\widehat{K}\}^{\perp}. By Gaussianity and isotropy on the slice,

| ⟨Π⟂​Tc,E⟩F∼𝒩​(0,σ2​‖Π⟂​Tc‖F2),σ2=1−ρ2N−1.\left\langle\Pi_{\perp}T_{c},E\right\rangle_{F}\sim\mathcal{N}\!\Big(0,\;\sigma^{2}\,\|\Pi_{\perp}T_{c}\|_{F}^{2}\Big),\quad\sigma^{2}=\frac{1-\rho^{2}}{N-1}. | (18) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- |

Hence \Delta_{\widehat{L}}\sim\mathcal{N}\!\Big(\rho\,\Delta_{\widehat{K}},\,\frac{1-\rho^{2}}{N-1}\,\|\Pi_{\perp}T_{c}\|_{F}^{2}\Big), and therefore

| ℙ​[ΔL^>0]=Φ​(ρ​ΔK^σ​‖Π⟂​Tc‖F)=Φ​(ρ​ΔK^​N−1‖Π⟂​Tc‖F​ 1−ρ2),\mathbb{P}\big[\Delta_{\widehat{L}}>0\big]\;=\;\Phi\left(\frac{\rho\,\Delta_{\widehat{K}}}{\sigma\,\|\Pi_{\perp}T_{c}\|_{F}}\right)\;=\;\Phi\left(\frac{\rho\,\Delta_{\widehat{K}}\,\sqrt{N-1}}{\;\|\Pi_{\perp}T_{c}\|_{F}\,\sqrt{\,1-\rho^{2}\,}}\right), | (19) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- |

which is equation 16. Monotonicity follows since f​(ρ)≔ρ/1−ρ2f(\rho)\coloneqq\rho/\sqrt{1-\rho^{2}} has f′​(ρ)=(1−ρ2)−3/2>0f^{\prime}(\rho)=(1-\rho^{2})^{-3/2}>0 on (0,1)(0,1) and is continuous at 0, and Φ\Phi is increasing. ∎
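To see what the mask in equation 13 measures, the sketch below builds T_{i;jk}, forms the unnormalized centered margin of Corollary G.2, and lets one check numerically that it equals the difference of squared Euclidean distances d²(i,j) − d²(i,k). The function names are illustrative assumptions.

```python
import numpy as np

def triplet_mask(n, i, j, k):
    """Squared-Euclidean triplet mask T_{i;jk} from equation 13."""
    t = np.zeros((n, n))
    t[j, j] = 1.0
    t[k, k] = -1.0
    t[i, j] = t[j, i] = -1.0
    t[i, k] = t[k, i] = 1.0
    return t

def triplet_margin(x, i, j, k):
    """Unnormalized margin <H T H, H X X^T H>_F for points given as rows of x."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    gram = h @ x @ x.T @ h
    t_c = h @ triplet_mask(n, i, j, k) @ h
    return float(np.sum(t_c * gram))

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
direct = np.sum((pts[0] - pts[2]) ** 2) - np.sum((pts[0] - pts[4]) ** 2)
print(triplet_margin(pts, 0, 2, 4), direct)  # the two values should agree
```

Dividing the margin by the Frobenius norm of the centered Gram matrix gives the normalized margin used in the theorem.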

Corollary G.2 (Unnormalized form).

With ΔK≔⟨Tc,K⟩F=‖K‖F​ΔK^\Delta_{K}\coloneqq\left\langle T_{c},K\right\rangle_{F}=\|K\|_{F}\,\Delta_{\widehat{K}} and ΔL≔⟨Tc,L⟩F\Delta_{L}\coloneqq\left\langle T_{c},L\right\rangle_{F}, we have

| ℙ​[ΔL>0]=Φ​(ρ​ΔK​N−1‖K‖F​‖Π⟂​Tc‖F​ 1−ρ2).\mathbb{P}\big[\Delta_{L}>0\big]\;=\;\Phi\!\left(\frac{\rho\,\Delta_{K}\,\sqrt{N-1}}{\;\|K\|_{F}\,\|\Pi_{\perp}T_{c}\|_{F}\,\sqrt{\,1-\rho^{2}\,}}\right). | (20) | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- |

Corollary G.3 (Universal lower bound).

For ρ∈[0,1]\rho\in[0,1], using ‖Π⟂​Tc‖F≤‖Tc‖F≤6\|\Pi_{\perp}T_{c}\|_{F}\leq\|T_{c}\|_{F}\leq\sqrt{6} and 1−ρ2≤2​(1−ρ)1-\rho^{2}\leq 2(1-\rho),

ℙ​[ΔL^>0]≥Φ​(ρ​ΔK^12N−1​(1−ρ)).\mathbb{P}\big[\Delta_{\widehat{L}}>0\big]\;\geq\;\Phi\!\left(\frac{\rho\,\Delta_{\widehat{K}}}{\sqrt{\;\frac{12}{\,N-1\,}\,\bigl(1-\rho\bigr)}}\right). (21)
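The exact expression in equation 16 and the universal bound in equation 21 are easy to compare numerically. The sketch below uses only the standard library; the margin 0.05 and n = 50 items are hypothetical values chosen for illustration, and the projected mask norm is set to its maximum √6.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def preservation_prob(rho, margin, n_items, proj_norm):
    """Equation 16; requires rho < 1. proj_norm is the norm of the projected mask."""
    big_n = n_items * (n_items - 1) // 2
    z = rho * margin * math.sqrt(big_n - 1) / (proj_norm * math.sqrt(1.0 - rho**2))
    return std_normal_cdf(z)

def preservation_lower_bound(rho, margin, n_items, c=12.0):
    """Equation 21 with c = 12 (Euclidean) or equation 24 with c = 2 (cosine)."""
    big_n = n_items * (n_items - 1) // 2
    z = rho * margin / math.sqrt(c / (big_n - 1) * (1.0 - rho))
    return std_normal_cdf(z)

for rho in (0.2, 0.5, 0.9):
    exact = preservation_prob(rho, 0.05, 50, math.sqrt(6.0))
    bound = preservation_lower_bound(rho, 0.05, 50)
    print(rho, exact, bound)
```

Both quantities increase with ρ, and the bound never exceeds the exact value, since 1 − ρ² ≤ 2(1 − ρ) on [0,1].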

G.1 Cosine distance and kernel-induced dissimilarities

Corollary G.4 (Cosine similarity case: exact bound and universal lower bound).

Let X~\widetilde{X} and Y~\widetilde{Y} be the row-normalized versions of XX and YY (each row scaled to unit ℓ2\ell_{2} norm). Define the centered cosine-similarity Gram matrices S≔H​X~​X~⊤​HS\coloneqq H\,\widetilde{X}\widetilde{X}^{\top}H, R≔H​Y~​Y~⊤​HR\coloneqq H\,\widetilde{Y}\widetilde{Y}^{\top}H, their normalizations S^≔S/‖S‖F\widehat{S}\coloneqq S/\|S\|_{F}, R^≔R/‖R‖F\widehat{R}\coloneqq R/\|R\|_{F}, and ρcos≔⟨S^,R^⟩F∈[−1,1]\rho_{\cos}\coloneqq\langle\widehat{S},\widehat{R}\rangle_{F}\in[-1,1]. For a triplet (i,j,k)(i,j,k), define the cosine-margin functional Δcos​(M)≔Mi​j−Mi​k\Delta^{\cos}(M)\coloneqq M_{ij}-M_{ik} via the symmetric mask

(Ti;j​kcos)i​j=(Ti;j​kcos)j​i=+12,(Ti;j​kcos)i​k=(Ti;j​kcos)k​i=−12,else ​0,(T^{\cos}_{i;jk})_{ij}=(T^{\cos}_{i;jk})_{ji}=+\tfrac{1}{2},\qquad(T^{\cos}_{i;jk})_{ik}=(T^{\cos}_{i;jk})_{ki}=-\tfrac{1}{2},\qquad\text{else~}~0,

so that Δcos​(M)=⟨Ti;j​kcos,M⟩F\Delta^{\cos}(M)=\langle T^{\cos}_{i;jk},M\rangle_{F} for any symmetric MM and ‖Ti;j​kcos‖F2=1\|T^{\cos}_{i;jk}\|_{F}^{2}=1. Let Tccos≔H​Ti;j​kcos​H∈𝒮cT^{\cos}_{c}\coloneqq H\,T^{\cos}_{i;jk}\,H\in\mathcal{S}_{c}, and set

ΔS^cos≔⟨Tccos,S^⟩F,ΔR^cos≔⟨Tccos,R^⟩F,Π⟂​Tccos≔Tccos−⟨Tccos,S^⟩F​S^.\Delta^{\cos}_{\widehat{S}}\coloneqq\langle T^{\cos}_{c},\widehat{S}\rangle_{F},\qquad\Delta^{\cos}_{\widehat{R}}\coloneqq\langle T^{\cos}_{c},\widehat{R}\rangle_{F},\qquad\Pi_{\perp}T^{\cos}_{c}\coloneqq T^{\cos}_{c}-\langle T^{\cos}_{c},\widehat{S}\rangle_{F}\,\widehat{S}. (22)

Under the $\widehat{S}$-orthogonal, energy-matched Gaussian isotropy model from Theorem G.1 with $N=\dim(\mathcal{S}_{c})=\frac{n(n-1)}{2}$ and $\sigma^{2}=(1-\rho_{\cos}^{2})/(N-1)$, for any triplet with $\Delta^{\cos}_{\widehat{S}}>0$ we have the exact identity

$$\mathbb{P}\big[\Delta^{\cos}_{\widehat{R}}>0\big]\;=\;\Phi\!\left(\frac{\rho_{\cos}\,\Delta^{\cos}_{\widehat{S}}\,\sqrt{N-1}}{\|\Pi_{\perp}T^{\cos}_{c}\|_{F}\,\sqrt{1-\rho_{\cos}^{2}}}\right). \tag{23}$$

Moreover, since $\|\Pi_{\perp}T^{\cos}_{c}\|_{F}\leq\|T^{\cos}_{c}\|_{F}\leq 1$ and $1-\rho_{\cos}^{2}\leq 2(1-\rho_{\cos})$ for $\rho_{\cos}\in[0,1]$, we obtain the universal lower bound

$$\mathbb{P}\big[\Delta^{\cos}_{\widehat{R}}>0\big]\;\geq\;\Phi\!\left(\frac{\rho_{\cos}\,\Delta^{\cos}_{\widehat{S}}}{\sqrt{\frac{2}{N-1}\,\bigl(1-\rho_{\cos}\bigr)}}\right). \tag{24}$$

Equivalently, for cosine distance $d_{\cos}(i,j)=1-\cos(i,j)$ the event $d_{\cos}(i,j)<d_{\cos}(i,k)$ is the same as $\Delta^{\cos}(S)>0$, so Equations 23 and 24 apply unchanged.

Proof.

Apply Theorem G.1 with $K\leftarrow S$, $L\leftarrow R$ and $T_{c}\leftarrow T^{\cos}_{c}$. The mask norm satisfies $\|T^{\cos}_{i;jk}\|_{F}^{2}=4\cdot(1/2)^{2}=1$, hence $\|T^{\cos}_{c}\|_{F}\leq 1$. Since $E\in\{\widehat{S}\}^{\perp}$ a.s., the variance of $\langle T^{\cos}_{c},E\rangle_{F}$ equals $\sigma^{2}\|\Pi_{\perp}T^{\cos}_{c}\|_{F}^{2}$, which gives Equation 23; the lower bound follows by the two inequalities above and the monotonicity of $\Phi$. ∎
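As an illustration of how the quantities in Equations 22 to 24 can be evaluated numerically, the following is a minimal NumPy sketch. The function names (`centered_cosine_gram`, `triplet_mask`, `triplet_order_probability`) are hypothetical and not part of the released code; the sketch assumes $\rho_{\cos}\in[0,1)$ and $\Delta^{\cos}_{\widehat{S}}>0$, as required by the statement above.

```python
import math
import numpy as np

def centered_cosine_gram(X):
    """Return (S_hat, H): the normalized centered cosine Gram matrix and H."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # row-normalize
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    S = H @ (Xn @ Xn.T) @ H
    return S / np.linalg.norm(S), H                     # Frobenius normalization

def triplet_mask(n, i, j, k):
    """Symmetric mask T with <T, M>_F = M_ij - M_ik for symmetric M."""
    T = np.zeros((n, n))
    T[i, j] = T[j, i] = 0.5
    T[i, k] = T[k, i] = -0.5
    return T

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def triplet_order_probability(X, Y, i, j, k):
    """Return (rho, delta_S, exact, lower) following Equations 22-24.

    Assumes 0 <= rho < 1 and delta_S > 0; no input validation is done.
    """
    n = X.shape[0]
    S_hat, H = centered_cosine_gram(X)
    R_hat, _ = centered_cosine_gram(Y)
    rho = float(np.sum(S_hat * R_hat))                  # Frobenius inner product
    T_c = H @ triplet_mask(n, i, j, k) @ H
    delta_S = float(np.sum(T_c * S_hat))
    T_perp = T_c - delta_S * S_hat                      # component orthogonal to S_hat
    N = n * (n - 1) // 2
    exact = std_normal_cdf(
        rho * delta_S * math.sqrt(N - 1)
        / (np.linalg.norm(T_perp) * math.sqrt(1.0 - rho ** 2))
    )
    lower = std_normal_cdf(rho * delta_S / math.sqrt(2.0 / (N - 1) * (1.0 - rho)))
    return rho, delta_S, exact, lower
```

If the margin comes out negative for a given triplet, swapping $j$ and $k$ flips its sign, so the same routine covers both orderings.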

Corollary G.5 (Kernel-induced triplet margins).

Let $k$ be a PSD kernel with centered Grams $G^{X}\coloneqq HK(X)H$, $G^{Y}\coloneqq HK(Y)H$ and $\widehat{G}^{X},\widehat{G}^{Y}$ their normalizations. If a triplet margin admits the linear form $\Delta^{k}(M)=\langle T^{k}_{i;jk},M\rangle_{F}$ with $T^{k}_{i;jk}\in\mathcal{S}_{c}$, then with $T^{k}_{c}\coloneqq HT^{k}_{i;jk}H$, $\rho_{k}\coloneqq\langle\widehat{G}^{X},\widehat{G}^{Y}\rangle_{F}$, and $\Pi_{\perp}T^{k}_{c}\coloneqq T^{k}_{c}-\langle T^{k}_{c},\widehat{G}^{X}\rangle_{F}\,\widehat{G}^{X}$,

\begin{align}
\mathbb{P}\big[\langle T^{k}_{c},\widehat{G}^{Y}\rangle_{F}>0\big] &= \Phi\!\left(\frac{\rho_{k}\,\langle T^{k}_{c},\widehat{G}^{X}\rangle_{F}\,\sqrt{N-1}}{\|\Pi_{\perp}T^{k}_{c}\|_{F}\,\sqrt{1-\rho_{k}^{2}}}\right), \tag{25}\\
\mathbb{P}\big[\langle T^{k}_{c},\widehat{G}^{Y}\rangle_{F}>0\big] &\geq \Phi\!\left(\frac{\rho_{k}\,\langle T^{k}_{c},\widehat{G}^{X}\rangle_{F}}{\sqrt{\frac{\|T^{k}_{c}\|_{F}^{2}}{N-1}\,(1-\rho_{k})}}\right), \tag{26}
\end{align}

under the same isotropic Gaussian misalignment model (and its universal relaxation), respectively.

G.2 Experimental Validation

We now verify that the simplified universal lower probability bound is consistent with empirical order preservation across different embedding spaces. The goal is to check whether higher alignment preserves the ordering, and hence the mix selection procedure. For each pair of spaces (e.g., Arc/Ada-IR50/IR101), we measured the top-$K$ set overlap, the Jaccard similarity, and the average rank gaps. We also computed the bound-based probability using $\Delta_{K}\approx 10^{-5}$ as the effective margin, since the values in the GapA and GapB columns were in about this range. If we define the lower bound of Equation 23 as $p_{\mathrm{lower-bound}}$, multiplying $p_{\mathrm{lower-bound}}$ by $K=20{,}000$ gives a rough estimate of the overlap that closely matches the observed values.

Table 3: Empirical vs. bound-based overlap at $K=20{,}000$. Overlap and Jaccard are computed directly from top-$K$ sets. $p_{\mathrm{lower-bound}}$ is the probability from the practical bound with measured CKA values. “Expected” is $K\cdot p_{\text{lower-bound}}$.

The bound consistently predicts overlaps of the right order of magnitude, with deviations of $\approx 5$–10% that are expected due to finite-sample effects and the coarse margin choice. Importantly, the relative ranking across pairs (higher CKA $\Rightarrow$ higher overlap/Jaccard) is preserved, supporting the validity of the bound as a practical predictor of order preservation.
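The overlap statistics used in this comparison can be sketched as follows. This is an illustrative NumPy version under our own assumptions (cosine distance, farthest pairs, and the hypothetical helper names `top_k_farthest_pairs`, `overlap_and_jaccard`, `expected_overlap`); it is not the evaluation script used for Table 3.

```python
import numpy as np

def top_k_farthest_pairs(X, k):
    """Return the set of k pairs (i, j), i < j, with the largest cosine distance."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T
    iu, ju = np.triu_indices(X.shape[0], k=1)           # strict upper triangle
    order = np.argsort(-D[iu, ju], kind="stable")[:k]
    return {(int(iu[t]), int(ju[t])) for t in order}

def overlap_and_jaccard(set_a, set_b):
    """Return (|A ∩ B|, |A ∩ B| / |A ∪ B|) for two top-K sets."""
    inter = len(set_a & set_b)
    union = len(set_a | set_b)
    return inter, inter / union

def expected_overlap(K, p_lower_bound):
    """Rough overlap estimate K * p_lower_bound, as in the 'Expected' column."""
    return K * p_lower_bound
```

Given embeddings of the same identities in two spaces, calling `top_k_farthest_pairs` on each and passing the two sets to `overlap_and_jaccard` yields the empirical quantities that the bound-based estimate is compared against.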

Appendix H Algorithmic design for exact extreme $m$-plets

Problem and scoring.

Given embeddings $X\in\mathbb{R}^{N\times D}$ and a distance $d(\cdot,\cdot)$, we seek top-$K$ sets $S$ of size $m$ maximizing or minimizing a symmetric functional $F$ of the $\binom{m}{2}$ pairwise distances within $S$. Examples include sum, mean, std, and order statistics of the pairwise distances; our reducers treat $F$ generically.

Pairs ($m=2$), exact.

We partition the strict upper triangle into $B\times B$ blocks, evaluate a distance block, mask $i\geq j$, and maintain on-device top-$K$ for nearest and farthest pairs. Block size $B$ is chosen experimentally, by monitoring GPU power usage under a simple memory budget, to maximize arithmetic intensity while keeping working buffers subquadratic.
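The blocked pair search can be sketched on the CPU with NumPy as below. This is a simplified illustration, not the GPU implementation: the function name `blocked_topk_pairs`, the use of cosine distance, and the sort-based running top-$K$ are our assumptions.

```python
import numpy as np

def blocked_topk_pairs(X, k, block=256, farthest=True):
    """Exact top-k pairs (i < j) by cosine distance, streamed over B x B blocks."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = Xn.shape[0]
    sign = 1.0 if farthest else -1.0                    # turn nearest into a max problem
    best_score = np.empty(0)
    best_pairs = np.empty((0, 2), dtype=np.int64)
    for i0 in range(0, n, block):
        for j0 in range(i0, n, block):                  # blocks touching the upper triangle
            D = 1.0 - Xn[i0:i0 + block] @ Xn[j0:j0 + block].T
            ii, jj = np.meshgrid(
                np.arange(i0, i0 + D.shape[0]),
                np.arange(j0, j0 + D.shape[1]),
                indexing="ij",
            )
            keep = ii < jj                              # mask out i >= j
            score = np.concatenate([best_score, sign * D[keep]])
            pairs = np.concatenate(
                [best_pairs, np.stack([ii[keep], jj[keep]], axis=1)]
            )
            top = np.argsort(-score, kind="stable")[:k] # merge block into running top-k
            best_score, best_pairs = score[top], pairs[top]
    return best_pairs, sign * best_score                # distances, best first
```

Only one $B\times B$ distance block and the running top-$K$ buffer are alive at any time, which is the property that keeps the working memory below quadratic in $N$.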

Triples ($m=3$), column-exact with global top-$K$.

We tile indices $I$ and $J$ with sizes $(T_{i},T_{j})$ and traverse their Cartesian product. Within each $(I,J)$ tile, a sub-batch of $P_{c}$ pair-columns $(i,j)$ is processed as follows:

    1. Compute the base pair distances $d_{ij}$ for the $P_{c}$ columns.
    2. Form two candidate matrices $A=XX_{I}^{\top}\in\mathbb{R}^{N\times P_{c}}$ and $B=XX_{J}^{\top}\in\mathbb{R}^{N\times P_{c}}$.
    3. For each column $c$ (a fixed $(i,j)$), evaluate $F(\{d_{ij}[c],\,d_{ik},\,d_{jk}\})$ for all $k\in[N]\setminus\{i,j\}$ via a fused reduction over the $k$-dimension, and select the exact argmax/argmin $k^{\star}$.
    4. Push the resulting triple $(i,j,k^{\star})$ and its score to a global device top-$K$.

This procedure is exact per column. Global top-$K$ is exact provided at most one $k$ per $(i,j)$ lies above the $K$-th frontier; if necessary, emitting the top-$M$ candidates per column and performing a $K$-way merge yields full exactness (in practice, $M=1$ sufficed under our settings). Arithmetic remains $\Theta(N^{3})$ but is streamed through GEMM-like blocks; peak memory is $O(NP_{c})$, independent of the total number of columns processed.
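The four steps above can be sketched in NumPy for the maximization case with $M=1$. This is a hedged illustration under our own assumptions: cosine distance, a sum reducer as the default $F$, flat batching of pair-columns instead of $(I,J)$ tiles, and a dictionary-based running top-$K$ that deduplicates a triple reached from different columns. The name `topk_triples` is hypothetical.

```python
import numpy as np

def topk_triples(X, k_top, pair_batch=128, reduce_fn=None):
    """Column-exact search for triples maximizing F; returns [(triple, score), ...]."""
    if reduce_fn is None:
        reduce_fn = lambda d: d.sum(axis=0)             # F = sum of the 3 distances
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = Xn.shape[0]
    iu, ju = np.triu_indices(n, k=1)                    # all pair-columns (i, j)
    best = {}                                           # sorted triple -> score
    for s in range(0, len(iu), pair_batch):
        I, J = iu[s:s + pair_batch], ju[s:s + pair_batch]
        cols = np.arange(len(I))
        d_ij = 1.0 - np.einsum("pd,pd->p", Xn[I], Xn[J])    # step 1: (Pc,)
        d_ik = 1.0 - Xn @ Xn[I].T                           # step 2: (N, Pc)
        d_jk = 1.0 - Xn @ Xn[J].T
        stacked = np.stack([np.broadcast_to(d_ij, d_ik.shape), d_ik, d_jk])
        score = np.array(reduce_fn(stacked))                # step 3: (N, Pc)
        score[I, cols] = -np.inf                            # exclude k == i
        score[J, cols] = -np.inf                            # exclude k == j
        k_star = score.argmax(axis=0)
        for c in cols:                                      # step 4: push to top-K
            key = tuple(sorted((int(I[c]), int(J[c]), int(k_star[c]))))
            best[key] = max(best.get(key, -np.inf), float(score[k_star[c], c]))
        if len(best) > k_top:                               # keep the buffer bounded
            best = dict(sorted(best.items(), key=lambda kv: -kv[1])[:k_top])
    return sorted(best.items(), key=lambda kv: -kv[1])[:k_top]
```

With one candidate per column the top-1 triple is always exact, because the globally best triple is also the best completion of each of its own pair-columns; lower ranks are exact only under the frontier condition stated above.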

Quads ($m=4$), per-triple exact greedy expansion.

Given a triple $(i,j,k)$, we evaluate all candidates $l\in[N]\setminus\{i,j,k\}$ in one batched pass by forming the six pairwise distances within $\{i,j,k,l\}$ and reducing by $F$ to obtain $l^{\star}$. This step is exact conditioned on the triple, but globally greedy (full $\Theta(N^{4})$ exact search is infeasible at scale). Note that we did not evaluate this expansion for improving discriminator performance in the results of Table 2; the exact pair results, however, were checked with the stochastic verifier.
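A minimal sketch of the greedy $3\to 4$ step is given below, assuming a precomputed symmetric distance matrix `D` and a sum reducer by default; the function name `expand_to_quad` and its parameters are hypothetical.

```python
import numpy as np

def expand_to_quad(D, triple, maximize=True, reduce_fn=None):
    """Given a triple (i, j, k), return (l_star, score) over all other indices l."""
    if reduce_fn is None:
        reduce_fn = lambda d: d.sum(axis=0)             # F = sum of the 6 distances
    i, j, k = triple
    n = D.shape[0]
    fixed = np.array([D[i, j], D[i, k], D[j, k]])       # distances inside the triple
    six = np.concatenate(
        [np.repeat(fixed[:, None], n, axis=1),          # (3, n): same for every l
         D[[i, j, k], :]],                              # (3, n): d_il, d_jl, d_kl
        axis=0,
    )
    score = np.array(reduce_fn(six), dtype=float)       # one score per candidate l
    score[[i, j, k]] = -np.inf if maximize else np.inf  # exclude members of the triple
    l_star = int(score.argmax() if maximize else score.argmin())
    return l_star, float(score[l_star])
```

The sweep touches each candidate $l$ once, so its cost per retained triple is linear in $N$.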

Complexity.

Pairs cost $\Theta(N^{2})$ distance evaluations with subquadratic memory per block. Triples perform two matrix–block multiplies per $(I,J)$ tile and a per-column reduction over $k$, totaling $\Theta(N^{3})$ arithmetic overall but only $O(NP_{c})$ peak memory. The greedy $3\to 4$ expansion adds a single $O(N)$ candidate sweep per retained triple.

Verification.

We provide a GPU-side stochastic verifier: draw $S$ random $m$-plets, score them, and report (i) strict top-1 violations and (ii) exceedances above the reported $K$-th threshold. Exceedances are partitioned into those already present in the report vs. genuinely new sets; we also record worst exceedance margins. This yields a high-power consistency check without an additional exhaustive pass.
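The verifier logic can be sketched on the CPU as follows, for the maximization case. This is an illustrative version under stated assumptions: the report is a dictionary mapping sorted index tuples to scores, the score is the sum of pairwise distances, and the names `sum_of_pairwise` and `stochastic_verify` are hypothetical.

```python
import numpy as np

def sum_of_pairwise(D, idx):
    """Sum of the pairwise distances within the index set idx."""
    idx = np.asarray(idx)
    return float(np.triu(D[np.ix_(idx, idx)], k=1).sum())

def stochastic_verify(D, reported, m, num_samples, rng,
                      score_fn=sum_of_pairwise, tol=1e-12):
    """Sample random m-plets and compare their scores against a top-K report."""
    values = sorted(reported.values(), reverse=True)
    top1, kth = values[0], values[-1]                   # best and K-th reported scores
    stats = {"top1_violations": 0, "known_exceedances": 0,
             "new_exceedances": 0, "worst_margin": 0.0}
    for _ in range(num_samples):
        s = tuple(sorted(rng.choice(D.shape[0], size=m, replace=False).tolist()))
        val = score_fn(D, s)
        if val > top1 + tol:                            # beats the reported best
            stats["top1_violations"] += 1
        if val > kth + tol:                             # above the K-th threshold
            if s in reported:
                stats["known_exceedances"] += 1
            else:
                stats["new_exceedances"] += 1           # a set the report missed
                stats["worst_margin"] = max(stats["worst_margin"], val - kth)
    return stats
```

A correct report should produce zero top-1 violations and zero new exceedances; any nonzero count points to a missed set, and the worst margin indicates how far above the threshold it lies.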

Appendix I Original Datasets $\mathrm{D}^{\mathrm{orig}}$

Table 4 summarizes key statistics of CASIA-WebFace (Yi et al., 2014), WebFace160K (Rahimi et al., 2025), and WebFace4M (Zhu et al., 2021). WebFace160K was curated to reduce the long-tail distribution of samples per identity, resulting in a more balanced dataset compared to CASIA-WebFace.

Table 4: Summary statistics of the datasets used as $\mathrm{D}^{\mathrm{orig}}$ in this work. The middle section reports the number of identities ($n$) and real images ($n^{r}$). For each dataset, we also report the minimum, maximum, and 25%, 50%, and 75% percentiles of the number of samples per identity.

Appendix J Discriminator Details

See Tab. 5 for hardware specifications and training hyperparameters used for the IR50 and IR101 discriminators. Training on the 200K dataset takes about $2\times 4$ 3090Ti GPU hours for the IR50 backbone and about $2.7\times 4$ GPU hours for IR101.

Table 5: Details of the Discriminator and its Training

Appendix K Generator Details

We used the small preset of the pixel-space EDM2 formulation, with a U-Net denoiser architecture. Training the generator required approximately 42 H100 GPU hours.

Appendix L More Samples

[Figure 14 panels: (a) IDs 6934 and 2767; (b) IDs 4566 and 2325; (c) IDs 8430 and 5412; (d) IDs 8476 and 2790]

Figure 14: Qualitative comparison of ScoreMix augmentation samples. Each subfigure has five columns: from the left, Orig ID1 and Repro ID1 represent samples from the original dataset used to train the generator and their reproductions from the same class using the generator, respectively. Similarly, from the right, Orig ID2 and Repro ID2 represent samples from another identity/class. The central column (3rd from the left) shows images generated by mixing scores of ID1 and ID2 according to Equation 5 using AutoGuidance of 1.3. These images serve as augmentations for Orig ID1 and Orig ID2 during discriminator training. Note the subtle differences between the ScoreMix samples and their source counterparts; we believe these differences contribute significantly to the discriminator’s improved performance beyond architectural enhancements.

[Figure 15 panels: (a) IDs 6934 and 2767; (b) IDs 4566 and 2325; (c) IDs 8430 and 5412; (d) IDs 8476 and 2790]

Figure 15: Qualitative comparison of ScoreMix augmentation samples. Each subfigure has five columns: from the left, Orig ID1 and Repro ID1 represent samples from the original dataset used to train the generator and their reproductions from the same class using the generator, respectively. Similarly, from the right, Orig ID2 and Repro ID2 represent samples from another identity/class. The central column (3rd from the left) shows images generated by mixing scores of ID1 and ID2 according to Equation 5 using AutoGuidance of 2.75. These images serve as augmentations for Orig ID1 and Orig ID2 during discriminator training. Note the subtle differences between the ScoreMix samples and their source counterparts; we believe these differences contribute significantly to the discriminator’s improved performance beyond architectural enhancements.

LLM Usage

We state that LLMs were used in this paper to improve wording, to proofread (e.g., long mathematical equations), and to summarize text so that it better reflects the key ideas behind our work. We also used LLMs to debug our code and to refactor it for better readability and organization.

Impact Statement

In our approach, we introduce a novel technique that leverages generative models to further improve state-of-the-art (SOTA) facial recognition (FR) systems, as demonstrated on publicly available medium-sized datasets. However, these same FR systems can inadvertently facilitate unauthorized identity preservation in deepfakes and other forms of fraudulent media when attackers mimic individuals without their consent.