According to the authors, the audio files were collected from
freesound.org. The source of each audio file is recorded in
the 'fsID' column of the metadata table. The duration of each clip can be
obtained from 'end' - 'start', which is at most 4 seconds. The gun shot sounds are the shortest.
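The duration calculation above can be sketched as follows. This is a minimal illustration, not the authors' code: the rows are made up, and every column name other than 'fsID', 'start', and 'end' is an assumption about how the metadata table is laid out.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: the values are invented, and the
# 'slice_file_name' and 'class' column names are assumptions.
SAMPLE_METADATA = """slice_file_name,fsID,start,end,class
a.wav,100,0.0,4.0,dog_bark
b.wav,101,1.5,2.1,gun_shot
c.wav,102,10.0,14.0,siren
d.wav,103,0.5,1.75,gun_shot
"""

def mean_duration_by_class(csv_text):
    """Return the mean clip duration ('end' - 'start') per class label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = float(row["end"]) - float(row["start"])
        totals[row["class"]] += duration
        counts[row["class"]] += 1
    return {label: totals[label] / counts[label] for label in totals}

durations = mean_duration_by_class(SAMPLE_METADATA)
# Print classes from shortest to longest mean duration.
for label, seconds in sorted(durations.items(), key=lambda item: item[1]):
    print(f"{label}: {seconds:.2f} s")
```

With the real metadata file, the same function could be fed the file contents instead of the sample string.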
We ran the dataset through librosa to convert the wav files into MFCCs. Mel-Frequency Cepstral Coefficients (MFCC) are a compact way of capturing the spectrum of a sound, such as a phoneme in speech, so that it can be used in voice recognition and machine learning.
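import librosa
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd

def display_wav(signal, fn, sr=22050):
    # Plot the waveform and save the figure to fn.
    # waveshow replaces waveplot, which was removed in librosa 0.10.
    librosa.display.waveshow(signal, sr=sr)
    plt.xlabel("Time")
    plt.ylabel("Amplitude")
    plt.savefig(fn, bbox_inches='tight')
    plt.show()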
The following graphs and sounds show how much of the original signal the MFCCs retain.
signal, sr = librosa.load("7383-3-0-0.wav", sr=22050)
display_wav(signal, '../images/dog_bark_plot.png')
import soundfile as sf
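# Compute 13 MFCCs (recent librosa versions require the keyword y=).
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
# Approximately reconstruct a waveform from the MFCCs.
wav = librosa.feature.inverse.mfcc_to_audio(mfccs)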
display_wav(wav, '../images/dog_bark_reversed.png')
In the image, the vertical height is determined by the number of coefficients (= n_mfcc),
and the width is the number of frames, which is determined by the sample rate (sr), the duration, and the hop length (512 samples by default in librosa). The left image has a height of n_mfcc=13
and a width of 173, from sr=22,050 and duration=4 sec. Each small square inside represents a single number (a coefficient value).
In this example, the returned MFCC is a two-dimensional array of 13 by 173.
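The 13 by 173 shape can be checked with a little arithmetic. The sketch below assumes librosa's default hop length of 512 samples and centered framing, under which the frame count is 1 + (number of samples // hop length). The helper name mfcc_shape is hypothetical and used only for illustration.

```python
def mfcc_shape(sr, duration_sec, n_mfcc, hop_length=512):
    """Expected (rows, columns) of an MFCC array under centered framing.

    Assumes librosa's defaults: hop_length=512 and center=True.
    """
    n_samples = int(sr * duration_sec)
    n_frames = 1 + n_samples // hop_length
    return n_mfcc, n_frames

shape = mfcc_shape(sr=22050, duration_sec=4, n_mfcc=13)
print(shape)  # (13, 173)
```

Clips shorter than 4 seconds produce fewer columns, so the width varies from file to file unless the signals are padded or truncated first.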
saying the query has the limit as 1600 columns.
Eunjeong Lee, ejlee127 at gmail dot com, last updated in Nov. 2020