detect number of peaks in audio recording

I’m trying to figure out how to detect the number of syllables in a corpus of audio recordings. I think a good proxy might be peaks in the wave file.

Here’s what I tried with a file of me speaking in English (my actual use case is in Kiswahili). The transcript of this example recording is: “This is me trying to use the timer function. I’m looking at pauses, vocalizations.” There are a total of 22 syllables in this passage.

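The seewave package in R has several functions that look promising. First, import the wave file with readWave() from the tuneR package.

library(tuneR)    # provides readWave()
library(seewave)  # provides timer(), meanspec(), fpeaks()

w <- readWave("YOURPATHHERE/test.wav")
w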
# Wave Object
# Number of Samples:      278528
# Duration (seconds):     6.32
# Samplingrate (Hertz):   44100
# Channels (Mono/Stereo): Stereo
# PCM (integer format):   TRUE
# Bit (8/16/24/32/64):    16

The first thing I tried was the timer() function. One of the things it returns is the duration of each vocalization. This function identifies 7 vocalizations, which is far short of 22 syllables. A quick look at the plot suggests that vocalizations do not equal syllables.

t <- timer(w, threshold=2, msmooth=c(400,90), dmin=0.1)
length(t$s)  # t$s holds the duration of each detected signal period
# [1] 7

I also tried the fpeaks function without setting a threshold. It returned 54 peaks.

ms <- meanspec(w)
peaks <- fpeaks(ms)

Note that fpeaks works on the mean spectrum, so the plot shows amplitude against frequency rather than time. Setting the threshold argument, as in fpeaks(ms, threshold=0.005), filters out noise and reduces the count to 23 peaks, which is pretty close to the actual number of syllables (22).

I’m not sure this is the best approach. The result will be sensitive to the value of the threshold parameter, and I have to process a big batch of files. Any better ideas about how to code this to detect peaks that represent syllables?
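For the batch-processing part, one hedged sketch is to wrap the meanspec and fpeaks steps in a function and apply it to every file in a folder. The helper name count_spectral_peaks, the wav_dir variable, and the reuse of threshold = 0.005 for every file are assumptions for illustration. The threshold would still need checking against a few hand-counted recordings.

```r
# Hypothetical helper (not part of seewave): count fpeaks() peaks in one file.
count_spectral_peaks <- function(path, threshold = 0.005) {
  w  <- tuneR::readWave(path)
  ms <- seewave::meanspec(w, plot = FALSE)   # mean frequency spectrum
  pk <- seewave::fpeaks(ms, threshold = threshold, plot = FALSE)
  NROW(pk)                                   # one row per detected peak
}

# Placeholder folder, same convention as the readWave() path.
wav_dir <- "YOURPATHHERE"
files   <- list.files(wav_dir, pattern = "\\.wav$", full.names = TRUE)

# One count per file; an empty folder simply gives an empty result.
counts <- vapply(files, count_spectral_peaks, numeric(1))
data.frame(file = basename(files), peaks = unname(counts))
```

This only automates the threshold approach. It does not remove the sensitivity to the threshold value.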


I don’t think what follows is the best solution, but @eipi10 had a good suggestion to check out this answer on CrossValidated. So I did.

A general approach is to smooth the data and then find peaks by comparing a local maximum filter to the smooth.

The first step is to create the argmax function:

library(zoo)  # provides zoo() and rollapply()

argmax <- function(x, y, w=1, ...) {
  n <- length(y)
  y.smooth <- loess(y ~ x, ...)$fitted                           # smooth the series
  y.max <- rollapply(zoo(y.smooth), 2*w+1, max, align="center")  # rolling maximum
  delta <- y.max - y.smooth[-c(1:w, n+1-1:w)]                    # zero where a point is its window's max
  i.max <- which(delta <= 0) + w
  list(x=x[i.max], i=i.max, y.hat=y.smooth)
}

Its return value includes the arguments of the local maxima (x), which answers the question, and the indexes into the x- and y-arrays where those local maxima occur (i).

I made minor modifications to the test plotting function: (a) to explicitly define x and y and (b) to show the number of peaks:

test <- function(x, y, w, span) {
  peaks <- argmax(x, y, w=w, span=span)

  plot(x, y, cex=0.75, col="Gray", main=paste("w = ", w, ", span = ", 
                                              span, ", peaks = ", 
                                              length(peaks$x), sep=""))
  lines(x, peaks$y.hat,  lwd=2)
  y.min <- min(y)
  sapply(peaks$i, function(i) lines(c(x[i],x[i]), c(y.min, peaks$y.hat[i]),
                                    col="Red", lty=2))
  points(x[peaks$i], peaks$y.hat[peaks$i], col="Red", pch=19, cex=1.25)
}

Like the fpeaks approach I mentioned in my original question, this approach also requires a good deal of tuning. I won’t know the “right” answer (i.e., the number of syllables/peaks) going into this, so I’m not sure how to define a decision rule.

test(ms[,1], ms[,2], 2, 0.01)
test(ms[,1], ms[,2], 2, 0.045)
test(ms[,1], ms[,2], 2, 0.05)
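One way to see how much the tuning matters, before touching real recordings, is to run the same smooth-then-local-maximum idea on a synthetic curve whose number of maxima is known. The sketch below is a base-R variant that does not need zoo. The helper count_peaks, the sine test signal, and the span grid are assumptions made for illustration, not part of the CrossValidated answer.

```r
# Hypothetical helper: loess smooth, then count the points that equal the
# maximum of their (2*w + 1)-point window.
count_peaks <- function(x, y, w = 2, span = 0.1) {
  n        <- length(y)
  y.smooth <- loess(y ~ x, span = span)$fitted
  # embed() lays out every window of length 2*w + 1 as one matrix row
  y.max    <- apply(embed(y.smooth, 2 * w + 1), 1, max)
  centre   <- y.smooth[(w + 1):(n - w)]
  sum(y.max - centre <= 0)
}

set.seed(1)
x <- seq(0, 1, length.out = 501)
# Five full sine cycles (five true maxima) plus Gaussian noise
y <- sin(2 * pi * 5 * x) + rnorm(length(x), sd = 0.2)

# Count peaks for several smoothing spans
spans  <- c(0.05, 0.1, 0.2, 0.4)
counts <- sapply(spans, function(s) count_peaks(x, y, span = s))
data.frame(span = spans, peaks = counts)
```

If the count matches the known number of maxima only over a narrow range of span, that suggests any fixed decision rule will be fragile on real recordings too.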

At this point fpeaks seems a little less complicated to me, but still not satisfying.

Source : Link , Question Author : Eric Green , Answer Author : Community
