Determining whether a website is active using daily visits

Context:

I have a group of websites where I record the number of visits on a daily basis:

W0 = { 30, 34, 28, 30, 16, 13, 8, 4, 0, 5, 2, 2, 1, 2, .. } 
W1 = { 1, 3, 21, 12, 10, 20, 15, 43, 22, 25, .. }
W2 = { 0, 0, 4, 2, 2, 5, 3, 30, 50, 30, 30, 25, 40, .. } 
...
Wn 

General Question:

  • How do I determine which sites are the most active?

By this I mean sites that receive more visits overall or show a sudden increase in visits during the last few days. For illustration, in the small example above W0 was initially popular but is starting to show signs of abandonment, W1 shows a steady popularity (with an isolated peak), and W2 shows a significant rise after a quiet start.

Initial thoughts:

I found this thread on SO where a simple formula is described:


from math import log, sqrt

def trend_score(pageviews):
    # Simple baseline trend algorithm from the SO thread
    y2 = pageviews[-1]                 # pageviews for the most recent day
    y1 = pageviews[-2]                 # pageviews for the previous day
    slope = y2 - y1
    total_pageviews = sum(pageviews)   # total over the whole history
    trend = slope * log(1.0 + total_pageviews)
    error = 1.0 / sqrt(total_pageviews)
    return trend, error
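
For example, calling it on the W0 series from the context above (using the helper as rewritten here):

w0 = [30, 34, 28, 30, 16, 13, 8, 4, 0, 5, 2, 2, 1, 2]
trend, error = trend_score(w0)
print(trend, error)   # trend ≈ 5.17, error ≈ 0.08: only the two most recent days set the slope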

This looks good and easy enough, but I’m having a problem with it.

The calculation is based on slopes. That is fine, and it is one of the features I'm interested in, but IMHO it has problems for series that are not strictly increasing or decreasing. Imagine that for several days we have a constant number of visits (so the slope is 0); the trend above would then be zero, no matter how high the level of traffic is.
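
A quick check with the helper above makes this concrete (the numbers are made up for illustration):

busy_but_flat = [500] * 14              # two weeks of a constant 500 visits/day
trend, error = trend_score(busy_but_flat)
print(trend)                            # 0.0: the slope is zero, so the heavy traffic is ignored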

Questions:

  • How do I handle both aspects, a monotonic increase/decrease and a large absolute number of hits?
  • Should I use separate formulas?

Answer

It sounds like you are looking for an “online changepoint detection method.” (That’s a useful phrase for Googling.) Some useful recent (and accessible) papers are Adams & MacKay (a Bayesian approach) and Keogh et al. You might be able to press the surveillance package for R into service. Isolated large numbers of hits can be found using statistical process control methods.
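
As a rough illustration of the statistical-process-control idea for isolated spikes (a plain control-chart rule sketched here, not the Bayesian method of Adams & MacKay), one can flag any day whose count exceeds the running mean of the preceding days by k standard deviations; the names and the k/warmup values below are arbitrary choices for this sketch:

from statistics import mean, stdev

def flag_spikes(visits, k=3.0, warmup=5):
    # Shewhart-style rule: flag day i if it lies more than k standard
    # deviations above the mean of all earlier days.
    flagged = []
    for i in range(warmup, len(visits)):
        history = visits[:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and visits[i] > mu + k * sigma:
            flagged.append(i)
    return flagged

# W2 from the question: the jump from single digits to 30-50 visits stands out
w2 = [0, 0, 4, 2, 2, 5, 3, 30, 50, 30, 30, 25, 40]
print(flag_spikes(w2))   # [7, 8]: the first big jump and the day after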

Attribution
Source: Link, Question Author: Dan, Answer Author: whuber