How can I improve my analysis of the effects of reputation on voting?

I recently analyzed the effects of reputation on upvotes (see the blog post), and I now have a few questions about analyses and graphics that might be more enlightening (or more appropriate).

So a few questions (feel free to respond to any one in particular and ignore the others):

1. In its current incarnation, I did not mean-center the post number. I think this gives the false appearance of a negative correlation in the scatterplot, as there are more posts towards the lower end of the post count (you can see this doesn’t happen in the Jon Skeet panel, only in the mortal users panel). Is it inappropriate not to mean-center the post number (since I mean-centered the score by each user’s average score)?

2. It should be obvious from the graphs that score is highly right-skewed (and mean-centering did not change that). When fitting a regression line, I fit both ordinary linear models and a robust linear model using Huber M-estimation (via rlm in the MASS R package), and it made no difference to the slope estimates. Should I have considered a transformation of the data instead of robust regression? Note that any transformation would have to allow for zero and negative scores. Or should I have used some other type of model for count data instead of OLS?

3. I believe the last two graphics, in general, could be improved (which is related to improved modelling strategies as well). In my (jaded) opinion, I suspect that if reputation effects are real, they would be realized quite early in a poster’s history (I suppose that, if true, these might be better described as “you gave some excellent answers so now I will upvote all of your posts” effects than as “reputation by total score” effects). How can I create a graphic to demonstrate whether this is true, while accounting for the over-plotting? I thought a good way to demonstrate this might be to fit a model of the form:

$$Y = \beta_0 + \beta_1 X_1 + \alpha_1 Z_1 + \cdots + \alpha_k Z_k + \gamma_1 (Z_1 X_1) + \cdots + \gamma_k (Z_k X_1) + \epsilon$$

where $Y$ is the score minus the mean score per user (the same as in the current scatterplots), $X_1$ is the post number, and $Z_1, \ldots, Z_k$ are dummy variables representing arbitrary ranges of post numbers (for example, $Z_1$ equals 1 if the post number is 1 through 25, $Z_2$ equals 1 if the post number is 26 through 50, etc.). $\beta_0$ and $\epsilon$ are the grand intercept and error term, respectively. I would then examine the estimated $\gamma$ slopes (or display them graphically) to determine whether reputation effects appear early in a poster’s history. Is this a reasonable (and appropriate) approach?
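
A minimal sketch of this idea in Python, under assumptions that go beyond the description above: each block of post numbers is given its own intercept as well as its own slope, in which case the point estimates equal those from fitting a separate least-squares line within each block. The record layout `(user, post_number, score)` and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def center_scores_by_user(records):
    """records: iterable of (user, post_number, score) tuples (hypothetical layout).

    Returns (post_number, score minus that user's mean score) pairs.
    """
    records = list(records)
    by_user = defaultdict(list)
    for user, _, score in records:
        by_user[user].append(score)
    user_mean = {user: fmean(scores) for user, scores in by_user.items()}
    return [(n, score - user_mean[user]) for user, n, score in records]


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (xs must not all be equal)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_block(pairs, block_width=25):
    """Fit one least-squares line per block of post numbers.

    Block 0 covers posts 1..block_width, block 1 the next block_width, etc.
    Blocks with fewer than two distinct post numbers are skipped.
    """
    blocks = defaultdict(list)
    for n, y in pairs:
        blocks[(n - 1) // block_width].append((n, y))
    slopes = {}
    for b in sorted(blocks):
        xs = [n for n, _ in blocks[b]]
        ys = [y for _, y in blocks[b]]
        if len(set(xs)) >= 2:
            slopes[b] = ols_slope(xs, ys)
    return slopes
```

A slope that is clearly positive in the first block or two and near zero afterwards would be consistent with an early effect; the block width of 25 is as arbitrary here as in the text.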

It seems popular to fit some type of non-parametric smoothing line to scatterplots like these (such as loess or splines), but my experimentation with splines did not reveal anything enlightening (any evidence of positive effects early in a poster’s history was slight and sensitive to the number of splines I included). Since I have a hypothesis that the effects happen early on, is my modelling approach above more reasonable than splines?
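
As a cruder cross-check that sidesteps both over-plotting and smoother tuning, the scatterplot could be reduced to per-bin summaries of the centered score. The sketch below is only an illustration: the function name, the input layout of `(post_number, centered_score)` pairs, and the bin width of 10 are all assumptions.

```python
from collections import defaultdict
from statistics import fmean, median


def binned_summary(pairs, bin_width=10):
    """Summarise (post_number, centered_score) pairs per bin of post numbers.

    Returns rows (first_post, last_post, count, mean, median), ordered by bin.
    Reporting the count alongside the mean shows how thin the later bins are.
    """
    bins = defaultdict(list)
    for n, y in pairs:
        bins[(n - 1) // bin_width].append(y)
    rows = []
    for b in sorted(bins):
        ys = bins[b]
        first, last = b * bin_width + 1, (b + 1) * bin_width
        rows.append((first, last, len(ys), fmean(ys), median(ys)))
    return rows
```

Plotting the mean and median per bin, with point size or a bar showing the count, gives one mark per bin rather than one per post.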

Also note that although I’ve pretty much dredged all of this data, there are still plenty of other communities out there to examine (and some, like superuser and serverfault, have similarly large samples to draw from), so it is entirely reasonable to suggest that a future analysis use a hold-out sample to examine any relationship.
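
One way a hold-out sample might be set aside within a single site is sketched below, under assumptions of my own: the split is made by user rather than by post, so that all of a user’s posts land on the same side, and it is driven by a hash so that it is reproducible. The function name, the salt string, and the 30% fraction are hypothetical choices.

```python
import hashlib


def assign_split(user_id, holdout_fraction=0.3, salt="reputation-study"):
    """Deterministically assign a user to 'explore' or 'holdout'.

    Hashing the user id (rather than drawing random numbers) keeps the
    assignment stable across runs and keeps each user's posts together.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if u < holdout_fraction else "explore"
```

Holding out entire sites instead would be an alternative design; in either case the hold-out data should stay untouched until the model has been fixed.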

This is a brave try, but with these data alone, it will be difficult or impossible to answer your research question concerning the “effect of reputation on upvotes.” The problem lies in separating the effects of other phenomena, which I list along with brief indications of how they might be addressed.

• Learning effects. As reputation goes up, experience goes up; as experience goes up, we would expect a person to post better questions and answers; as their quality improves, we expect more votes per post. Conceivably, one way to handle this in an analysis would be to identify people who are active on more than one SE site. On any given site their reputation would increase more slowly than the amount of their experience, thus providing a handle for teasing apart the reputation and learning effects.

• Temporal changes in context. These are myriad, but the obvious ones would include:

  • Changes in numbers of voters over time, including an overall upward trend, seasonal trends (often associated with academic cycles), and outliers (arising from external publicity such as links to specific threads). Any analysis would have to factor this in when evaluating trends in reputation for any individual.

  • Changes in a community’s mores over time. Communities, and how they interact, evolve and develop. Over time they may tend to vote more or less often. Any analysis would have to evaluate this effect and factor it in.

  • Time itself. As time goes by, earlier posts remain available for searching and continue to garner votes. Thus, caeteris paribus, older posts ought to produce more votes than newer ones. (This is a strong effect: some people consistently high on the monthly reputation leagues have not visited this site all year!) This would mask or even invert any actual positive reputation effect. Any analysis needs to factor in the length of time each post has been present on the site.

• Subject popularity. Some tags are far more popular than others. Thus, changes in the kinds of questions a person answers can be confounded with temporal changes, such as a reputation effect. Therefore, any analysis needs to factor in the nature of the questions being answered.

• Views [added as an edit]. Questions are viewed by different numbers of people for various reasons (filters, links, etc.). It’s possible that the number of votes received by answers is related to the number of views, although one would expect a declining proportion as the number of views increases. (It’s a matter of how many people who are truly interested in the question actually view it, not the raw number. My own anecdotal experience is that roughly half the upvotes I receive on many questions come within the first 5-15 views, although eventually the questions are viewed hundreds of times.) Therefore, any analysis needs to factor in the number of views, but probably not in a linear way.

• Measurement difficulties. “Reputation” is the sum of votes received for different activities: initial reputation, answers, questions, approving questions, editing tag wikis, downvoting, and getting downvoted (in descending order of value). Because these components assess different things, and not all are under the control of the community voters, they should be separated for analysis. A “reputation effect” presumably is associated with upvotes on answers and, perhaps, on questions, but should not affect other sources of reputation. The starting reputation definitely should be subtracted (but perhaps could be used as a proxy for some initial amount of experience).

• Hidden factors. There can be many other confounding factors that are impossible to measure. For example, there are various forms of “burnout” in participation in forums. What do people do after an initial few weeks, months, or years of enthusiasm? Some possibilities include focusing on the rare, unusual, or difficult questions; providing answers only to unanswered questions; providing fewer answers but of higher quality; etc. Some of these could mask a reputation effect, whereas others could mistakenly be confused with one. A proxy for such factors might be changes in rates of participation by an individual: they could signal changes in the nature of that person’s posts.

• Subcommunity phenomena. A hard look at the statistics, even on very active SE pages, shows that a relatively small number of people do most of the answering and voting. A clique as small as two or three people can have a profound influence on the growth of reputation. A two-person clique will be detected by the site’s built-in monitors (and one such group exists on this site), but larger cliques probably won’t be. (I’m not talking about formal collusion: people can be members of such cliques without even being aware of it.) How would we separate an apparent reputation effect from activities of these invisible, undetected, informal cliques? Detailed voting data could be used diagnostically, but I don’t believe we have access to these data.

• Limited data. To detect a reputation effect, you will likely need to focus on individuals with dozens to hundreds of posts (at least). That drops the current population to fewer than 50 individuals. With so many sources of variation and confounding, that is far too small to tease out significant effects unless they are very strong indeed. The cure is to augment the dataset with records from other SE sites.
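
Two of the adjustments listed above, time on the site and views, can at least be turned into covariates for a model. The following is a minimal sketch under stated assumptions: the function and field names are hypothetical, and the logarithm is merely one concave choice for the "not in a linear way" relationship, not a form derived from these data.

```python
import math
from datetime import date


def exposure_covariates(posted_on, views, as_of):
    """Covariates for how long and how widely a post has been exposed.

    posted_on, as_of: datetime.date; views: non-negative view count.
    The logarithms are an assumed functional form, not an estimated one.
    """
    days_visible = max((as_of - posted_on).days, 1)  # count at least one day
    return {
        "days_visible": days_visible,
        "log_days_visible": math.log(days_visible),
        "log1p_views": math.log1p(views),  # log(1 + views); defined at zero views
    }
```

For example, `exposure_covariates(date(2011, 1, 1), 99, date(2011, 4, 11))` describes a post that has been visible for 100 days. Such covariates would enter a model alongside the reputation term; they do nothing about the other confounders in the list.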

Given all these complications, it should be clear that the exploratory graphics in the blog article have little chance of revealing anything unless it is glaringly obvious. Nothing leaps out at us: as expected, the data are messy and complicated. It’s premature to recommend improvements to the plots or to the analysis that has been presented: incremental changes and additional analysis won’t help until these fundamental issues have been addressed.
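
As an illustration of the measurement point, reputation might be split by source before any modelling. This sketch assumes a per-user list of `(source, points)` events; the event labels and the function name are hypothetical, and real SE data would need to be mapped onto them.

```python
from collections import Counter

# Hypothetical labels for the sources a reputation effect could plausibly act on.
COMMUNITY_VOTE_SOURCES = {"answer_upvote", "question_upvote"}


def split_reputation(events):
    """events: iterable of (source, points) pairs for one user.

    Returns (points from community upvotes on posts, points from all
    other sources, including any starting reputation).
    """
    totals = Counter()
    for source, points in events:
        totals[source] += points
    voted = sum(p for s, p in totals.items() if s in COMMUNITY_VOTE_SOURCES)
    other = sum(totals.values()) - voted
    return voted, other
```

Only the first component would be a candidate response variable; the second could be dropped or, for the starting reputation, kept as a rough proxy for prior experience.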