From Wikipedia (http://en.wikipedia.org/wiki/Tweedie_distribution) we know that
The Tweedie distributions include a number of familiar distributions as well as some unusual ones, each being specified by the domain of the index parameter. We have the
normal distribution, p = 0,
Poisson distribution, p = 1,
compound Poisson–gamma distribution, 1 < p < 2,
gamma distribution, p = 2,
positive stable distributions, 2 < p < 3,
inverse Gaussian distribution, p = 3,
positive stable distributions, p > 3, and
extreme stable distributions, p = ∞.
For 0 < p < 1 no Tweedie model exists.
My question is can we determine significance from the optimal p if it is between 1-2. For instance, if we say that Poisson controls the frequency in which our Y appears and Gamma controls the size of our Y if it appears and if the log-likelihood of the model has an optimal value for p of 1.1 then is it fair to say that the data is driven more by frequency than size and vice versa?
so for example, if you run this code:
set.seed(1) x <- rep(1:100,times=10) freq <- sapply(x,function(arg) rpois(1,.25*arg)) size <- sapply(x,function(arg) rgamma(1,100*arg)) y <- freq*size df <- data.frame(cbind(x,y)) tweediep <- tweedie.profile(y~x, link.power = 0, data = df, do.smooth = FALSE, do.plot = TRUE) tweediep$xi.max
I get the result 1.3. Does that mean that this Y is more “driven by” frequency than severity? Does it mean something else altogether? Is there nothing I can discern from this estimated value of the parameter p ?
Edit: This is not a complete answer so I’m posting it as an edit and clarification rather than an answer to my own question.
I have attempted to anecdotally answer my own question via a limited experiment. Specifically, I want to create dummy data and a “frequency driven” and “severity driven” model from this data and observe the results to better understand the p parameter.
set.seed(1) #creating random variables x1<-rep(1:5,20) x2<-(runif(100))*10 x3<-rpois(100,5) x4<-rweibull(100,2,5) #creating parameters based off a linear combination of variables plus a noise term - all exponentiated fvars<-exp((x1+2*x2+.5*x3-x4+3*runif(1))/10)/5 svars<-exp((x1+2*x2+.5*x3-x4+3*runif(1))/5) summary(fvars) summary(svars) #freq is the number of occurances, sev is the size of the occurances freq<-rpois(100,fvars) sev<-rgamma(100,scale=svars,shape=1) summary(freq) summary(sev) results1<-freq*sev summary(results1) #load library if needed library(tweedie) #store all the data in a single data frame df1<-data.frame(cbind(x1,x2,x3,x4,results1))
This is my “base model”. I now find the optimal parameter p.
tweediep<-tweedie.profile(formula = results1~x1+x2+x3+x4,link.power = 0,do.smooth = TRUE,do.plot = TRUE) tweediep$xi.max
I get a result of 1.616327 . I went ahead and found the results model coefficients. Strictly for the purposes of answering my question, I don’t believe actually fitting the glm is necessary but perhaps the diagnostics of the glm will lead to more insight.
baseGLM<-glm(results1~x1+x2+x3+x4,data=df1,family=tweedie(var.power = tweediep$xi.max,link.power = 0)) summary(baseGLM)
It’s possible that the model isn’t “fit well enough” for this type of analysis to say anything meaningful, but to continue this line of reasoning, let’s try fitting a “frequency driven model”.
#freq driven freq<-rpois(100,fvars*3) sev<-rgamma(100,scale=svars/4,shape=1) summary(freq) summary(sev) results2<-freq*sev summary(results2) df1<-data.frame(cbind(df1,results2)) tweediep2<-tweedie.profile(formula = results2~x1+x2+x3+x4,link.power = 0,do.smooth = TRUE,do.plot = TRUE) tweediep2$xi.max
here our p parameter value is 1.718367 which is higher than our base value. Again, we fit the model.
GLMfreqDriven<-glm(results2~x1+x2+x3+x4,data=df1,family=tweedie(var.power = tweediep2$xi.max,link.power = 0)) summary(GLMfreqDriven)
Finally, let’s make our model more severity driven.
#severity driven freq<-rpois(100,fvars*.4) sev<-rgamma(100,scale=svars*5,shape=1) summary(freq) summary(sev) results3<-freq*sev summary(results3) df1<-data.frame(cbind(df1,results3)) tweediep3<-tweedie.profile(formula = results3~x1+x2+x3+x4,link.power = 0,do.smooth = TRUE,do.plot = TRUE) tweediep3$xi.max
Here our p parameter is 1.591937.
This is the opposite of my initial hypothesis. I originally thought that the more frequency driven a model is, the closer the p parameter would be to 1 (since 1 is a true Poisson model), and the more severity driven the model is, closer the p parameter is to 2. In my, admittedly limited, experiment the opposite was true. The Poisson distribution does allow zeros as a response variable. Perhaps the p parameter has nothing to do with “frequency driven” or “severity driven”, but is strictly a function of the number of zeros in the data (for p parameter values between 1 and 2).
One possible limitation I thought of with respect to my experiment design is that would be interesting to see how the results change if I force the results1, results2, and results3 variables to all have the same mean. The data here was fit arbitrarily. Several of instances where I am dividing or multiplying by a constant, I am doing so to get the data in such a way that the algorithm converges without error. I would be more confident if the results (1.61, 1.72, 1.59) were more spread out.
In conclusion, I can see possible explanations for results of my limited experiment, but would really be appreciative if someone could further validate and explain how the p parameter is affected by the data. Any discussion as to the differences between creating separate frequency and severity models versus creating compound Poisson-gamma models may also prove to be very insightful.
The Generalized Linear Models with Examples in R book by Peter Dunn and Gordon Smyth contains an illuminating discussion of Tweedie distributions.
If I maybe so blunt, and summarize your excellent question:
What is the relation of p and the underlying Poisson-Gamma model for a Tweedie distribution with 1<p<2?
As you already note in your question, the Tweedie distribution with 1<p<2 can be understood as a Poisson-Gamma model. To make it more concrete what this means, let's assume that
the observed y is
Dunn & Smyth give an example for this model, where N is the number of insurance claims and zi is the average cost for each claim. In that case the model would describe the total insurance payout.
The relation of p to the parameters of the Poisson and Gamma distribution is
where μ and ϕ are the mean and overdispersion parameters from the generalized linear model definition.