Receptive field in neural network keyword spotting models
Abstract
Many keyword spotting models use neural networks to detect acoustic events such as phonemes, word pieces, or whole words. The model is run on every frame (a segmented piece of audio), typically every 10 ms. To improve classification quality, the neural network uses audio features from both the frame under classification and several adjacent frames. This introduces a tradeoff: a receptive field that is too large may cause overfitting and increases the number of parameters and the latency, while one that is too small may not provide enough information to classify the audio event correctly. We investigate several policies for constructing the receptive field of a neural network for keyword spotting, including ways to make the receptive field sparser, such as frame skipping and frame stacking.
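The frame-selection idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the left/right context sizes, and the `skip` parameter are all hypothetical, and the features are synthetic random vectors standing in for per-frame acoustic features.

```python
import numpy as np

def receptive_field_indices(center, left=10, right=5, skip=1):
    """Indices of the frames feeding the classifier for frame `center`.

    skip=1 keeps every frame in the window (a dense receptive field);
    skip=2 keeps every other frame (a sparser field covering the same
    time span with fewer inputs). Hypothetical helper for illustration.
    """
    return list(range(center - left, center + right + 1, skip))

def stack_frames(features, indices):
    """Concatenate the selected frames' feature vectors into one input."""
    return np.concatenate([features[i] for i in indices])

# 100 frames of 40-dimensional per-frame features (synthetic data)
feats = np.random.randn(100, 40)

# Sparse receptive field: every other frame from 10 frames of left
# context to 5 frames of right context around frame 50
idx = receptive_field_indices(center=50, left=10, right=5, skip=2)
x = stack_frames(feats, idx)
print(len(idx), x.shape)
```

With `skip=2` the window spans the same 16-frame interval as the dense case but selects only 8 frames, halving the input dimensionality; this is the kind of sparsity/coverage tradeoff the policies in the paper explore.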