Investigation of Acoustic Features for the Voice Activation Problem
Abstract
In this paper we examine the effect of different acoustic feature computation pipelines on classifying audio keywords with a convolutional neural network (CNN). We compare Mel-frequency cepstral coefficients (MFCCs) with a simple filterbank averaging technique, and we also examine the influence of MFCC computation parameters on classification quality. The results show that CNNs benefit from using prior knowledge in acoustic feature computation: in our experiments, switching from MFCCs to filterbank averaging led to a 30% drop in accuracy. Furthermore, the default MFCC parameter values used in many libraries may not be optimal for the voice activation problem: a frame length of 55 ms yielded better results than the default length of 20 ms.
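The frame-length finding can be reproduced with standard tooling. Below is a minimal sketch of computing MFCCs with a 55 ms analysis window instead of a shorter library default; the paper does not specify which toolkit was used, so the parameter names here follow librosa, the 10 ms hop and the file name are illustrative assumptions.

```python
import librosa

# Load a keyword utterance at 16 kHz (file name is hypothetical).
y, sr = librosa.load("keyword.wav", sr=16000)

frame_len = int(0.055 * sr)  # 55 ms window -> 880 samples at 16 kHz
hop_len = int(0.010 * sr)    # 10 ms hop (an assumed, common choice)

# 13 MFCCs per frame with the longer analysis window; kwargs are
# forwarded to the underlying mel spectrogram computation.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=frame_len, hop_length=hop_len)
print(mfccs.shape)  # (13, number_of_frames)
```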
