How to test whether a distribution follows a power law?

I have data on how many users post how many questions. For example,

[UserCount, QuestionCount] 
[2, 100] 
[9, 10] 
[3, 80] 
... ...

This means that 2 users each posted 100 questions, 9 users each posted 10 questions, and so on. So, how can I determine whether the UserCount, QuestionCount distribution follows a power law?

I found the poweRlaw package. However, I can only pass one group of numbers to do the evaluation. (The example provided in this package is word frequency.) So how do I use this package? Or do I have something wrong? I also have the data of each user’s question count, i.e., [100, 100, 10, 10, 10 ... ]. If I pass this data to the package, what I will get?

Answer

According to Clauset et al., this is how you test the power law tail with poweRlaw package:

  1. Construct the power law distribution object. In this case, your data is discrete, so use the discrete version of the class
data <- c(100, 100, 10, 10, 10 ...)
data_pl <- displ$new(data)
  1. Estimate the $x_{min}$ and the exponent $\alpha$ of the power law, and assign them to the power law object
est <- estimate_xmin(data_pl)
data_pl$xmin <- est$xmin
data_pl$pars <- est$pars

the last two line can be rewritten as one line

data_pl$xmin <- est

Also, at this point, you can see the KS statistic:

est$KS
  1. KS statistic tells you how well power law distribution fits your data, but it doesn’t tell you how likely your data is drawn from power law. So you also need a $p$ value. This is how you do it:
bs <- bootstrap_p(data_pl)
bs$p

This could take some time, so go and grab a cup of tea…

  1. Assuming you get a $p$ value and it’s greater than 0.05 or whatever your significant level is, you still need to exclude the possibility that no other alternative distribution fits the data better than power law. The poweRlaw package implements 3 other alternatives that you can compare with. Take log-normal for example:
data_alt <- dislnorm$new(data)
data_alt$xmin <- est$xmin
data_alt$pars <- estimate_pars(data_alt)
comp <- compare_distributions(data_pl, data_alt)

Note that the $x_{min}$ of log-normal distribution is set to that of power law, because compare_distributions function require the $x_{min}$s to be the same for both distributions. The comp object has two fields that’s interesting: comp$test_statistic indicates which one is a better fit, with positive number meaning data_pl is better, and negative otherwise; comp$p_two_side meaning how significant is the difference.

Repeat this step with disexp, dispois classes to compare power law with those alternatives.

Attribution
Source : Link , Question Author : tThirday , Answer Author : Zebra Propulsion Lab

Leave a Comment