P56 Principal Components Analysis of amplitude envelopes from spectral channels: A preliminary comparison between music and speech
Introduction: The efficient coding hypothesis predicts that perceptual systems are optimally adapted to the statistics of natural signals. Previous work characterized the statistics of speech signals for 8 languages using Principal Components Analysis (PCA), arguing that 4 frequency channels would be sufficient to optimally represent clean speech in each of these 8 languages. Extending these data to cochlear implant simulations in English, it has been shown that 6 to 7 frequency bands would be sufficient to optimally represent vocoded speech.
However, research on music perception in cochlear implant (CI) listeners points to potential limits of these results. Performance on vocoded material, in normal-hearing listeners as well as in CI users, is systematically better for speech signals than for music. Our aim is to compare the statistical properties of natural music signals with previous findings on speech in order to evaluate their respective contributions to this theoretical proposal.
Method: Analyses were carried out using Matlab on music samples from the FMA open-source database (Free Music Archive, https://github.com/mdeff/fma). Signal-processing and statistical procedures mirrored those of previous studies on speech, and the total sample duration was comparable. Sample signals were passed through a gammatone filterbank (1/4-ERB bandwidth, approx. 100-120 channels) and the energy envelope of each channel was extracted. The resulting amplitude-modulation matrix was then submitted to PCA, and the PCs were independently rotated. Channels whose amplitude envelopes covary should be grouped into a single Principal Component. Because our aim was to compare speech and music, whose typical signal bandwidths differ, two upper frequency limits were compared (8000 Hz vs. 22000 Hz).
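The pipeline above (filterbank decomposition, envelope extraction, PCA of the channel-envelope matrix) can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: the gammatone filterbank is approximated here with Butterworth band-pass filters on ERB-spaced centre frequencies, the channel count and frequency range are hypothetical, the PC rotation step is omitted, and a noise signal stands in for the music samples.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def erb_space(f_low, f_high, n):
    """n centre frequencies equally spaced on the ERB-rate scale."""
    ear_q, min_bw = 9.26449, 24.7
    lo = ear_q * np.log(1 + f_low / (ear_q * min_bw))
    hi = ear_q * np.log(1 + f_high / (ear_q * min_bw))
    return (np.exp(np.linspace(lo, hi, n) / ear_q) - 1) * ear_q * min_bw

def envelope_matrix(x, fs, f_low=80.0, f_high=8000.0, n_ch=32):
    """Band-pass each channel, then take its Hilbert amplitude envelope."""
    envs = []
    for fc in erb_space(f_low, f_high, n_ch):
        bw = 24.7 * (4.37 * fc / 1000 + 1)          # ERB at fc (Glasberg & Moore)
        lo = max(fc - bw / 2, 1.0)
        hi = min(fc + bw / 2, fs / 2 - 1.0)
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        envs.append(np.abs(hilbert(sosfiltfilt(sos, x))))
    return np.array(envs).T                          # shape: (samples, channels)

def pca(E):
    """PCA via SVD of the mean-centred envelope matrix.
    Returns explained-variance ratios and component loadings."""
    Ec = E - E.mean(axis=0)
    _, s, vt = np.linalg.svd(Ec, full_matrices=False)
    return (s ** 2) / np.sum(s ** 2), vt

# Stand-in signal: 2 s of white noise (the study used FMA music samples).
rng = np.random.default_rng(0)
fs = 22050
x = rng.standard_normal(fs * 2)
E = envelope_matrix(x, fs, n_ch=32)
var_ratio, components = pca(E)
print(var_ratio[:4].round(3))    # variance explained by the first 4 PCs
```

Channels whose envelopes covary load onto the same component, so the number of PCs needed to reach a given cumulative variance gives the "optimal number of channels" in the sense used above.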
Results and discussion: The current graphical exploration of the statistical results provides only a partial description. As expected, more PCs appear to be required to characterize music samples than were estimated for speech. The optimal number of PCs for music seems to stabilize between 24 and 32 (vs. 4 to 7 channels according to previous speech studies), at least for frequencies up to 12000 Hz. For more systematic comparisons, the statistical analyses need to be refined in order to automatically determine the optimal number of Principal Components and to estimate the frequency boundaries between them. Results for music and speech will then be compared in order to identify possible discrepancies between optimal frequency channels.