ICASSP 2025 - Identity Drift Analysis
This website accompanies our ICASSP paper and provides interactive access to all materials referenced in the manuscript. For each target phrase, we include two sets of normalized speaker-similarity graphs: one generated with an ECAPA-TDNN backbone and one with a ResNet50 backbone. These graphs show how the similarity distributions shift under adversarial perturbations. We also provide scatter plots of signal-to-noise ratio (SNR) against similarity score for each phrase.
ID | Target Text | Phonetic Note | Model | #Samples | #Genuine | #Impostor | TMR @ 0.1% FMR | d′ |
---|---|---|---|---|---|---|---|---|
T1 | yes | mono-syll.; 1:2 V:C; glide+fric stop | ECAPA | 11881 | 109 | 11772 | 1.0000 | 9.68 |
T1 | | | ResNet50 | 11881 | 109 | 11772 | 1.0000 | 9.43 |
T2 | open the door | 4 syll.; 4:6 V:C; dental fric. + stops | ECAPA | 11881 | 109 | 11772 | 1.0000 | 9.11 |
T2 | | | ResNet50 | 11881 | 109 | 11772 | 1.0000 | 8.89 |
T3 | call emergency services | fricative-heavy; 8 syll.; 8:16 V:C | ECAPA | 11881 | 109 | 11772 | 0.9725 | 5.59 |
T3 | | | ResNet50 | 11881 | 109 | 11772 | 0.9817 | 6.22 |
T4 | the quick brown fox jumped over the lazy dog | pangram; 11 syll.; broad coverage | ECAPA | 11881 | 109 | 11772 | 0.9083 | 4.80 |
T4 | | | ResNet50 | 11881 | 109 | 11772 | 0.9817 | 5.14 |
T5 | shhh she sees the sea fish | fricatives /ʃ s z/ cluster; 6 syll. | ECAPA | 11881 | 109 | 11772 | 0.9908 | 7.46 |
T5 | | | ResNet50 | 11881 | 109 | 11772 | 1.0000 | 7.74 |
T6 | do go big bag dig | voiced stops chain; minimal vowels; 5 syll. | ECAPA | 11881 | 109 | 11772 | 1.0000 | 8.44 |
T6 | | | ResNet50 | 11881 | 109 | 11772 | 1.0000 | 8.35 |
T7 | two tall teachers talk to tim | /t/ alliteration; alveolar bursts; 7 syll. | ECAPA | 11881 | 109 | 11772 | 1.0000 | 7.79 |
T7 | | | ResNet50 | 11881 | 109 | 11772 | 1.0000 | 7.59 |
T8 | i whisper while walking wildly | approximants + /w/ clusters; 8 syll. | ECAPA | 11881 | 109 | 11772 | 1.0000 | 7.17 |
T8 | | | ResNet50 | 11881 | 109 | 11772 | 1.0000 | 7.33 |
T9 | pack my box with five dozen liquor jugs | pangram; many consonant clusters | ECAPA | 9025 | 95 | 8930 | 0.8632 | 4.63 |
T9 | | | ResNet50 | 9025 | 95 | 8930 | 0.9474 | 5.02 |
T10 | glib jocks quiz nymph to vex dwarf | pangram; high fricative/affricate load | ECAPA | 9025 | 95 | 8930 | 0.8421 | 4.75 |
T10 | | | ResNet50 | 9025 | 95 | 8930 | 0.9579 | 5.34 |
T11 | a mad boxer shot a quick gloved jab to the jaw of his dizzy opponent | mixed plosives/fricatives; many unstressed vowels | ECAPA | 9025 | 95 | 8930 | 0.7474 | 3.79 |
T11 | | | ResNet50 | 9025 | 95 | 8930 | 0.9263 | 4.45 |
T12 | just before twilight the wizard quickly jabbed five boxes of hazy quartz to vex a plump knight’s jovial frog | very long pangram; many clusters; vowel centralization | ECAPA | 6561 | 81 | 6480 | 0.4444 | 3.07 |
T12 | | | ResNet50 | 6561 | 81 | 6480 | 0.7160 | 3.63 |
T13 | twelve jolly grizzlies briskly danced over waxy benches while a flighty kitten kept humming jazz tunes in the background | fricative+affricate mix; many unstressed vowels | ECAPA | 3025 | 55 | 2970 | 0.5273 | 3.06 |
T13 | | | ResNet50 | 3025 | 55 | 2970 | 0.6364 | 3.46 |
T14 | quantum driven flux engines jam beneath zigzagging vortex panels as cryptic bioforms whisper behind polymorphic glass domes | dense consonant clusters; many fricatives/affricates | ECAPA | 2209 | 47 | 2162 | 0.6809 | 3.10 |
T14 | | | ResNet50 | 2209 | 47 | 2162 | 0.7447 | 3.78 |
T15 | while whispering winds wander westward jittery jackals jiggled jellies above velvet jars beyond flickering bonfires in a frozen jungle | repeated /w/ /ʤ/; long with many approximants | ECAPA | 3025 | 55 | 2970 | 0.6545 | 3.39 |
T15 | | | ResNet50 | 3025 | 55 | 2970 | 0.7091 | 3.54 |
T16 | kindly expedite bizarre frozen jumpsuits for victors whirlwind gala to maximize xenon emissions before daybreak | ‘x/ʒ/ks’ clusters; mixed stops/fricatives; multi-syllabic | ECAPA | 5184 | 72 | 5112 | 0.5278 | 3.09 |
T16 | | | ResNet50 | 5184 | 72 | 5112 | 0.6806 | 3.30 |
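For readers who want to compute comparable figures from their own score lists, the following is a minimal sketch of the two summary metrics in the table. It rests on assumptions not stated on this page: d′ is taken to be the common biometric sensitivity index (difference of means over the root of the averaged variances), and TMR at a fixed FMR is taken at the lowest threshold that admits no more than the allowed fraction of impostor scores. The function names are illustrative, and the manuscript's exact definitions may differ. The `trial_counts` helper encodes a pattern that the table's counts happen to follow (N genuine trials and N(N-1) impostor trials, N² in total); that reading is an inference from the numbers, not a documented protocol.

```python
import numpy as np

def d_prime(genuine, impostor):
    """Sensitivity index using the average of the two score variances (assumed definition)."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    pooled_std = np.sqrt(0.5 * (genuine.var() + impostor.var()))
    return float(abs(genuine.mean() - impostor.mean()) / pooled_std)

def tmr_at_fmr(genuine, impostor, fmr=0.001):
    """True match rate at the lowest threshold whose false match rate is <= fmr."""
    genuine = np.asarray(genuine, dtype=float)
    impostor_desc = np.sort(np.asarray(impostor, dtype=float))[::-1]
    # At most floor(fmr * N) impostor scores may be accepted.
    allowed = int(np.floor(fmr * impostor_desc.size))
    # Accept only scores strictly above the (allowed + 1)-th highest impostor score.
    threshold = impostor_desc[allowed]
    return float(np.mean(genuine > threshold))

def trial_counts(n):
    """(total, genuine, impostor) counts under the inferred N-by-N pairing pattern."""
    return n * n, n, n * (n - 1)
```

For example, `trial_counts(109)` matches the counts listed for T1 through T8, and with 11772 impostor scores a 0.1% FMR allows at most 11 false matches.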
The scatter plots show the relationship between signal-to-noise ratio (SNR) and similarity score for each target text, illustrating the trade-off between attack strength and audio quality.
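As a point of reference for the SNR axis, the sketch below computes SNR in decibels for an adversarial example, assuming the perturbation is the sample-wise difference between the perturbed and the clean waveform and that both arrays are time-aligned and of equal length. The function name `snr_db` is illustrative, and the paper's exact SNR definition is not given on this page.

```python
import numpy as np

def snr_db(clean, perturbed):
    """SNR in dB, treating (perturbed - clean) as the noise (assumed definition)."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(perturbed, dtype=float) - clean
    signal_power = np.sum(clean ** 2)
    noise_power = np.sum(noise ** 2)
    if noise_power == 0:
        # Identical signals: no perturbation, so SNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(signal_power / noise_power))
```

Under this definition a higher SNR means a quieter perturbation, so points toward the high-SNR end of a scatter plot correspond to weaker but less audible attacks.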