Question on Inference – Catching Cheating Students

In their paper “Catching cheating students”, Levitt and Lin propose a simple reduced-form method to identify cheating by students in exams.

The strategy works as follows: For each possible pair of students, they calculate the number of questions for which those students gave the same answer. They then estimate the following simple regression:

$$\text{similar\_answers}_i = \beta_0 + \beta_1\,\text{neighbor}_i + u_i,$$
where $\text{similar\_answers}_i$ refers to the number of similar answers for pair $i$, and $\text{neighbor}_i$ is an indicator that is equal to one if the students sit next to each other and takes the value of zero otherwise. Therefore, $\beta_1$ measures whether individuals who actually were sitting next to each other have a higher number of similar answers.
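To make the setup concrete, here is a minimal sketch of how the pair-level data and the regression could be built. The helper names `pair_table` and `ols_coefs`, the single-row seating, the toy data, and the copied answer sheet are illustrative assumptions, not taken from the paper. With a single dummy regressor, the OLS slope equals the difference in mean similar answers between neighbouring and non-neighbouring pairs, which the last lines check.

```python
import itertools
import numpy as np

def pair_table(answers):
    """Return (similar_answers, neighbor), one entry per pair of students.
    Seating assumption: students i and i+1 sit next to each other in one row."""
    n = len(answers)
    sim, neigh = [], []
    for i, j in itertools.combinations(range(n), 2):
        sim.append(int(np.sum(answers[i] == answers[j])))  # matching answers
        neigh.append(1 if j - i == 1 else 0)                # adjacent seats
    return np.array(sim, dtype=float), np.array(neigh, dtype=float)

def ols_coefs(y, x):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

rng = np.random.default_rng(0)
answers = rng.integers(0, 4, size=(6, 10))   # 6 students, 10 questions, 4 options
answers[1] = answers[0].copy()               # hypothetical: student 1 copies student 0

sim, neigh = pair_table(answers)
b0, b1 = ols_coefs(sim, neigh)
diff = sim[neigh == 1].mean() - sim[neigh == 0].mean()
print(b1, diff)   # the two numbers agree up to floating-point error
```

The sketch only reproduces the point estimate; it says nothing yet about whether the usual standard errors are appropriate.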

They simply estimate this model using OLS and do nothing special about the standard errors. My feeling is that this cannot be right because observations are related to each other within certain groups. First, one individual shows up in multiple pairs. Second, observations might also be related within rows. For example, if individuals 1 and 2 sit next to each other and cheat, but individual 2 also copies answers from individual 3, then the pairs would not be independent of each other.

My question: what would you do to account for such correlations?

Answer

While it is tempting to think that such pair relations are somehow autocorrelated and that this causes inference problems, the straightforward answer is that this is not a problem here.

The rationale is close to the typical clustering problem. Clustering does not distort the significance of variables that vary at the unit level; it makes variables that vary only at the cluster level look too significant. If we introduced a student-level variable instead of a pair-level one, it would turn out significant too often.
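The clustering claim can be illustrated with a small simulation. This is a sketch on a generic clustered data set, not on the student pairs: the function name `rejection_rate`, the 30 clusters of 20 units, the equal variances of the cluster effect and the noise, and the normal critical value 1.96 are all assumptions chosen for illustration. The true slope is zero in both cases, so every rejection is a false positive.

```python
import numpy as np

def rejection_rate(cluster_level_x, reps=400, G=30, m=20, seed=0):
    """Share of simulations in which conventional OLS rejects H0: slope = 0
    at the 5% level, although the true slope is zero."""
    rng = np.random.default_rng(seed)
    n = G * m
    cluster = np.repeat(np.arange(G), m)          # cluster id of every unit
    rejections = 0
    for _ in range(reps):
        if cluster_level_x:
            x = rng.normal(size=G)[cluster]       # regressor constant within cluster
        else:
            x = rng.normal(size=n)                # regressor varies at unit level
        # outcome = shared cluster effect + idiosyncratic noise (no effect of x)
        y = rng.normal(size=G)[cluster] + rng.normal(size=n)
        xc = x - x.mean()
        b1 = xc @ y / (xc @ xc)                   # OLS slope
        resid = (y - y.mean()) - b1 * xc
        se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))  # conventional SE
        rejections += abs(b1 / se) > 1.96         # normal approximation to the t-test
    return rejections / reps

unit_rate = rejection_rate(cluster_level_x=False)
cluster_rate = rejection_rate(cluster_level_x=True)
print("unit-level regressor:   ", unit_rate)
print("cluster-level regressor:", cluster_rate)
```

Under these assumptions the unit-level regressor should reject at roughly the nominal 5%, while the cluster-level regressor should reject far more often.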

Since this is a potential autocorrelation problem, the quantity in question is the p-value of the estimator for the parameter of interest, $\beta_1$. We worry about a potentially wrong number of false positives.


To check the false positive rate, I propose a Monte Carlo simulation under the following assumptions:

  • Students do not cheat. Since we are checking the false positive rate, there is no need to introduce a cheating mechanism.
  • n (250) students sit in one row; every student has two neighbours, except the first and the last, who have one.
  • Students take a test of k (20) questions, each with a (2) possible answers. Answers are random with equal probabilities.
  • Every possible pair of students is formed, and pairs sitting next to each other are marked as neighbours. The number of similar answers for each pair is calculated.
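The data-generating step described in the list above can be sketched compactly with NumPy broadcasting. This is only an illustrative variant under the same assumptions (one row of seats, no cheating, uniform random answers); the function name `simulate_pairs` and the fixed seed are hypothetical choices.

```python
import numpy as np

def simulate_pairs(n=250, k=20, a=2, seed=0):
    """Draw random answers for n students and return one entry per pair:
    sim_ans (number of matching answers) and neigh (1 if adjacent seats)."""
    rng = np.random.default_rng(seed)
    answers = rng.integers(a, size=(n, k))      # no cheating: i.i.d. uniform answers
    # matches[i, j] = number of questions on which students i and j agree
    matches = (answers[:, None, :] == answers[None, :, :]).sum(axis=2)
    i1, i2 = np.triu_indices(n, k=1)            # every unordered pair once, i1 < i2
    sim_ans = matches[i1, i2]
    neigh = (i2 - i1 == 1).astype(int)          # single row: adjacent indices
    return sim_ans, neigh

sim_ans, neigh = simulate_pairs()
print(len(sim_ans))      # number of pairs: n * (n - 1) / 2
print(neigh.sum())       # number of neighbouring pairs: n - 1
print(sim_ans.mean())    # close to k / a when answers are uniform
```

With n = 250 this yields 31125 pairs, of which only 249 are neighbours, so the neighbour group is small relative to the comparison group.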

Then the Monte Carlo simulation takes place (unit: a pair):

  • The regression $\text{similar\_answers}_i = \beta_0 + \beta_1\,\text{neighbours}_i + \varepsilon_i$ is estimated. The p-value of the $\beta_1$ estimator is saved.
  • The process is repeated N (10000) times.
  • The shares of runs in which the p-value was smaller than 0.5, 0.2, 0.1 and 0.05 are reported:
p < 0.50: 0.5227
p < 0.20: 0.2166
p < 0.10: 0.1147
p < 0.05: 0.0511

With 10000 Monte Carlo runs, the shares are close to the nominal levels. This looks like a fair enough argument that the false positive rate is not distorted.
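One way to judge how close the reported shares are to the nominal levels is to compare each gap with the Monte Carlo standard error of a proportion. The sketch below assumes the N runs are independent, so that each share is a binomial proportion; the variable names are illustrative.

```python
import math

N = 10000
# nominal level -> share of runs with a smaller p-value (numbers reported above)
observed = {0.50: 0.5227, 0.20: 0.2166, 0.10: 0.1147, 0.05: 0.0511}

z_scores = {}
for nominal, share in observed.items():
    # standard error of a proportion if the true rejection rate equals the nominal level
    se = math.sqrt(nominal * (1 - nominal) / N)
    z_scores[nominal] = (share - nominal) / se
    print(f"p < {nominal:.2f}: share = {share:.4f}, MC s.e. = {se:.4f}, "
          f"z = {z_scores[nominal]:.2f}")
```

Gaps within about two standard errors are compatible with simulation noise alone; larger ones would point to a small systematic departure from the nominal level.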


Replication code (Python):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from multiprocessing import Pool

# number of students:
n = 250
# number of possible answers and length of the test:
a = 2
k = 20
# number of monte carlo sims:
N = 10000
# number of processors:
cpu = 2

def get_pvals(iteration=0):
    # seed each run with its own index, so that worker processes
    # do not reuse the same random stream
    rng = np.random.default_rng(iteration)
    # one row of k random answers per student
    answers = rng.integers(a, size=(n, k))

    pairs = []
    for i1 in range(n):
        for i2 in range(i1 + 1, n):
            # i2 > i1 always holds, so the pair is adjacent only if i2 == i1 + 1
            neigh = int(i2 == i1 + 1)
            sim_ans = int(np.sum(answers[i1] == answers[i2]))
            pairs.append({"sim_ans": sim_ans, "neigh": neigh})

    d = pd.DataFrame(pairs)
    result = smf.ols(formula="sim_ans ~ neigh", data=d).fit()
    return result.pvalues["neigh"]

if __name__ == '__main__':
    with Pool(cpu) as p:
        pvals = np.array(p.map(get_pvals, range(N)))

    # share of simulations with a p-value below each threshold
    for threshold in (0.5, 0.2, 0.1, 0.05):
        print(f"p < {threshold:.2f}: {np.mean(pvals < threshold):.4f}")

Attribution
Source : Link , Question Author : bachelor , Answer Author : cure
