I’m trying to figure out how to detect the number of syllables in a corpus of audio recordings. I think a good proxy might be peaks in the wave file.
Here’s what I tried with a file of me speaking in English (my actual use case is in Kiswahili). The transcript of this example recording is: “This is me trying to use the timer function. I’m looking at pauses, vocalizations.” There are a total of 22 syllables in this passage.
wav file: https://www.dropbox.com/s/koqyfeaqge8t9iw/test.wav?dl=0
The seewave package in R is great, and there are several potential functions. First things first, import the wave file.
library(seewave)
library(tuneR)
w <- readWave("YOURPATHHERE/test.wav")
w
# Wave Object
# Number of Samples: 278528
# Duration (seconds): 6.32
# Samplingrate (Hertz): 44100
# Channels (Mono/Stereo): Stereo
# PCM (integer format): TRUE
# Bit (8/16/24/32/64): 16
The first thing I tried was the timer() function. One of the things it returns is the duration of each vocalization. This function identifies 7 vocalizations, which is far short of 22 syllables. A quick look at the plot suggests that vocalizations do not equal syllables.
t <- timer(w, threshold=2, msmooth=c(400,90), dmin=0.1)
length(t$s)
# [1] 7
I also tried the fpeaks function without setting a threshold. It returned 54 peaks.
ms <- meanspec(w)
peaks <- fpeaks(ms)
This plots amplitude by frequency rather than time. Adding a threshold parameter equal to 0.005 filters out noise and reduces the count to 23 peaks, which is pretty close to the actual number of syllables (22).
I’m not sure this is the best approach. The result will be sensitive to the value of the threshold parameter, and I have to process a big batch of files. Any better ideas about how to code this to detect peaks that represent syllables?
Answer
I don’t think what follows is the best solution, but @eipi10 had a good suggestion to check out this answer on CrossValidated. So I did.
A general approach is to smooth the data and then find peaks by comparing a local maximum filter to the smooth.
The first step is to create the argmax function:
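argmax <- function(x, y, w=1, ...) {
  require(zoo)
  n <- length(y)
  # Smooth the data; extra arguments (e.g. span) are passed on to loess()
  y.smooth <- loess(y ~ x, ...)$fitted
  # Rolling maximum of the smooth over windows of 2*w+1 points
  y.max <- rollapply(zoo(y.smooth), 2*w+1, max, align="center")
  # Compare each window maximum with the smooth at the window centre
  delta <- y.max - y.smooth[-c(1:w, n+1-1:w)]
  # A centre that equals its window maximum is a local peak
  i.max <- which(delta <= 0) + w
  list(x=x[i.max], i=i.max, y.hat=y.smooth)
}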
Its return value includes the arguments of the local maxima (x)–which answers the question–and the indexes into the x- and y-arrays where those local maxima occur (i).
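To make the local-maximum step less opaque, here is a tiny worked example of just that part. It is only a sketch: I skip the loess() smoothing and pretend the vector is already smoothed, and I use base R's embed() in place of zoo's rollapply() so it runs without extra packages. The vector itself is made up for illustration.

```r
# Made-up "already smoothed" values (an assumption, purely for illustration)
y.smooth <- c(1, 3, 2, 5, 4, 2, 6, 1)
w <- 1                       # half-width of the window, as in argmax()
n <- length(y.smooth)

# embed() gives one row per full window of 2*w + 1 consecutive values
windows <- embed(y.smooth, 2 * w + 1)
y.max <- apply(windows, 1, max)        # rolling maximum: 3 5 5 5 6 6

# Values at the window centres (the first and last w points have no full window)
centre <- y.smooth[(w + 1):(n - w)]    # 3 2 5 4 2 6

# A centre that equals its own window maximum is a local peak
i.max <- which(y.max - centre <= 0) + w
i.max            # 2 4 7
y.smooth[i.max]  # 3 5 6
```

A larger w makes the filter stricter, because a point has to be the maximum over a wider neighbourhood before it counts as a peak.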
I made minor modifications to the test plotting function: (a) to explicitly define x and y and (b) to show the number of peaks:
test <- function(x, y, w, span) {
  peaks <- argmax(x, y, w=w, span=span)
  plot(x, y, cex=0.75, col="Gray", main=paste("w = ", w, ", span = ",
                                              span, ", peaks = ",
                                              length(peaks$x), sep=""))
  lines(x, peaks$y.hat, lwd=2)
  y.min <- min(y)
  sapply(peaks$i, function(i) lines(c(x[i],x[i]), c(y.min, peaks$y.hat[i]),
                                    col="Red", lty=2))
  points(x[peaks$i], peaks$y.hat[peaks$i], col="Red", pch=19, cex=1.25)
}
Like the fpeaks approach I mentioned in my original question, this approach also requires a good deal of tuning. I won’t know the “right” answer (i.e., the number of syllables/peaks) going into this, so I’m not sure how to define a decision rule.
par(mfrow=c(3,1))
test(ms[,1], ms[,2], 2, 0.01)
test(ms[,1], ms[,2], 2, 0.045)
test(ms[,1], ms[,2], 2, 0.05)
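To see how much the count depends on span without needing the recording, here is a rough sketch that runs the same smooth-then-local-maximum idea on a synthetic signal with a known number of crests (5). The helper name count_peaks, the sine-plus-noise signal, and the span values are all my own assumptions for illustration, and I use embed() from base R instead of zoo.

```r
# Hypothetical helper: count peaks of a loess smooth using a rolling maximum
count_peaks <- function(x, y, w, span) {
  y.smooth <- loess(y ~ x, span = span)$fitted
  n <- length(y.smooth)
  y.max <- apply(embed(y.smooth, 2 * w + 1), 1, max)  # max of each full window
  sum(y.max - y.smooth[(w + 1):(n - w)] <= 0)         # centres equal to their window max
}

# Synthetic signal (assumed): 5 crests of a sine wave plus a little noise
set.seed(1)
x <- seq(0, 5, length.out = 1000)
y <- sin(2 * pi * x) + rnorm(length(x), sd = 0.05)

# Count peaks for several smoothing spans, keeping w fixed
spans <- c(0.02, 0.05, 0.1, 0.3, 0.6)
counts <- sapply(spans, function(s) count_peaks(x, y, w = 2, span = s))
data.frame(span = spans, peaks = counts)
```

If the counts swing widely across spans even on a signal this clean, that is a hint that any decision rule for real speech would need to fix span (and w) using a few hand-counted recordings first.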
At this point fpeaks seems a little less complicated to me, but still not satisfying.
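Since I have a big batch of files to get through, this is roughly how I would wrap the fpeaks approach in a loop. Treat it as a sketch rather than a tested pipeline: the function name count_peaks_in_dir is hypothetical, the default threshold of 0.005 is just the value that happened to work for my one test file, and I am assuming fpeaks() returns one row per detected peak when plot = FALSE.

```r
# Hypothetical helper: count fpeaks() peaks for every .wav file in a folder.
# Requires the seewave and tuneR packages once there are files to process.
count_peaks_in_dir <- function(dir, threshold = 0.005) {
  files <- list.files(dir, pattern = "\\.wav$", full.names = TRUE,
                      ignore.case = TRUE)
  counts <- vapply(files, function(f) {
    w  <- tuneR::readWave(f)
    ms <- seewave::meanspec(w, plot = FALSE)            # mean frequency spectrum
    pk <- seewave::fpeaks(ms, threshold = threshold, plot = FALSE)
    NROW(pk)                                            # assumed: one row per peak
  }, integer(1))
  data.frame(file = basename(files), peaks = unname(counts),
             stringsAsFactors = FALSE)
}

# Usage (the path is a placeholder):
# count_peaks_in_dir("YOURPATHHERE")
```

This does nothing about the sensitivity to the threshold; it only makes it easier to compare counts across files for a given setting.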
Attribution
Source : Link , Question Author : Eric Green , Answer Author : Community