Pair-Wise Distance Metric Learning of Neural Network Model for Spoken Language Identification.

X Lu, P Shen, Y Tsao, H Kawai - INTERSPEECH, 2016 - academia.edu
Abstract
The i-vector representation and modeling technique has been successfully applied in spoken language identification (SLI). In modeling, a discriminative transform or classifier must be applied to emphasize variations correlated with language identity, since the i-vector representation encodes most of the acoustic variations (e.g., speaker variation, transmission channel variation, etc.). Owing to the strong nonlinear discriminative power of neural network (NN) modeling (including its deep form, the DNN), the NN has been used directly to learn the mapping function between the i-vector representation and language identity labels. In most studies, only the point-wise feature-label information is fed to the NN for parameter learning, which may result in model overfitting, particularly when training data are limited. In this study, we propose to integrate pair-wise distance metric learning into NN parameter optimization. In the representation space formed by the nonlinear transforms of the hidden layers, a distance metric learning objective is explicitly designed to minimize the pair-wise intra-class variation and maximize the inter-class variation. With the distance metric as a constraint in the point-wise learning, the i-vectors are transformed to a new feature space that is much more discriminative for samples belonging to different languages and much more similar for samples belonging to the same language. We tested the algorithm on an SLI task and obtained encouraging results, with more than 20% relative improvement in identification error rate.
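The abstract does not give the exact form of the objective, so the following is only a minimal sketch of how a pair-wise distance term might be added to a point-wise classification loss. It assumes a contrastive-style hinge penalty (a common stand-in, not necessarily the authors' formulation), Euclidean distance on hidden-layer outputs, and hypothetical names and parameters (`pairwise_metric_loss`, `combined_loss`, `margin`, `weight`) that do not come from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Point-wise loss: mean cross-entropy between softmax(logits) and integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def pairwise_metric_loss(hidden, labels, margin=1.0):
    """Pair-wise term on hidden representations (assumed contrastive form).

    Same-language pairs are penalized by their squared distance;
    different-language pairs are penalized only when closer than `margin`.
    """
    diffs = hidden[:, None, :] - hidden[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2) + 1e-12)  # all pair-wise Euclidean distances
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once
    pos, neg = same & upper, ~same & upper
    intra = (dists[pos] ** 2).mean() if pos.any() else 0.0
    inter = (np.maximum(0.0, margin - dists[neg]) ** 2).mean() if neg.any() else 0.0
    return intra + inter

def combined_loss(logits, hidden, labels, weight=0.1, margin=1.0):
    """Point-wise loss plus the weighted pair-wise term (`weight` is an assumed hyperparameter)."""
    return softmax_cross_entropy(logits, labels) + weight * pairwise_metric_loss(hidden, labels, margin)
```

In this sketch, `hidden` would be the output of one hidden layer for a mini-batch of i-vectors and `logits` the pre-softmax output of the same network; minimizing `combined_loss` with respect to the network parameters pulls same-language samples together in the hidden space while still fitting the language labels.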