Seeking a Theoretical Understanding of Firth Logistic Regression

I am trying to understand Firth logistic regression (a method of handling perfect/complete or quasi-complete separation in logistic regression) so I can explain it to others in simplified terms. Does anyone have a dumbed-down explanation of what modification Firth estimation makes to MLE?

I have read, as best I could, Firth (1993) and I understand a correction is being applied to the score function. I am fuzzy on the origin and justification of the correction and what role the score function plays in MLE.

Sorry if this is rudimentary knowledge. The literature I’ve reviewed seems to require a much deeper understanding of MLE than I possess.

Firth’s paper is an example of higher-order asymptotics. The zeroth order, so to speak, is provided by the laws of large numbers: in large samples, $\hat \theta_n \approx \theta_0$ where $\theta_0$ is the true value. You may have learned that MLEs are asymptotically normal, roughly because they are based on nonlinear transformations of sums of i.i.d. variables (scores). This is the first-order approximation: $\hat\theta_n = \theta_0 + O(n^{-1/2}) = \theta_0 + v_1 n^{-1/2} + o(n^{-1/2})$, where $v_1$ is a normal variate with zero mean and variance $\sigma_1^2$ (or variance-covariance matrix) equal to the inverse of the Fisher information for a single observation. The likelihood ratio test statistic is then asymptotically $n(\hat\theta_n - \theta_0)^2/\sigma_1^2 \sim \chi^2_1$, or the corresponding quadratic form with the inverse covariance matrix in the multivariate case.
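To make that first-order picture concrete, here is a small simulation sketch (my own illustration, not from Firth's paper): for a Bernoulli($p$) sample the MLE is the sample proportion, the per-observation Fisher information is $I(p) = 1/(p(1-p))$, and $\sqrt{n}(\hat p - p_0)$ should be approximately normal with variance $1/I(p_0) = p_0(1-p_0)$. Squaring the standardized version gives the Wald statistic, which agrees with the likelihood ratio statistic to this order and is approximately $\chi^2_1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p0, n, reps = 0.3, 500, 20000

# MLE of a Bernoulli probability is the sample proportion
p_hat = rng.binomial(n, p0, size=reps) / n

# standardized first-order term v_1: variance is 1/I(p0) = p0*(1 - p0)
z = np.sqrt(n) * (p_hat - p0) / np.sqrt(p0 * (1 - p0))
wald = z ** 2                                   # ~ chi^2_1 in large samples

print("mean, sd of z:", z.mean().round(3), z.std().round(3))          # roughly 0 and 1
print("95th pct of Wald vs chi2_1 critical value:",
      np.quantile(wald, 0.95).round(2), stats.chi2.ppf(0.95, df=1).round(2))
```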
Higher-order asymptotics tries to learn something about that remainder $o(n^{-1/2})$, usually by teasing out the next term, of order $O(n^{-1})$. That way, estimates and test statistics can incorporate the small-sample biases of order $1/n$ (if you see a paper that says “our MLEs are unbiased,” the authors probably don’t know what they are talking about). The best-known correction of this kind is Bartlett’s correction for likelihood ratio tests. Firth’s correction is of that order, too: it adds the quantity $\frac12 \ln \det I(\theta)$ (top of p. 30) to the log-likelihood, and in large samples the relative contribution of that penalty disappears at the rate $1/n$, dwarfed by the sample information, which grows like $n$.
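As a rough sketch of what that penalty does in the logistic case (my own toy code, not from the paper: it maximizes $\ell(\beta) + \tfrac12 \ln \det I(\beta)$ with $I(\beta) = X^\top W X$, $W = \mathrm{diag}\{p_i(1-p_i)\}$, using a generic numerical optimizer rather than Firth's modified-score iterations): on a completely separated dataset the ordinary MLE of the slope runs off to infinity, while the penalized maximizer stays finite.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Toy data with complete separation: y = 0 whenever x < 0 and y = 1 whenever x > 0
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])                 # intercept + slope

def neg_penalized_loglik(beta):
    eta = X @ beta
    p = expit(eta)
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))     # ordinary log-likelihood
    W = p * (1.0 - p)
    info = X.T @ (W[:, None] * X)                         # Fisher information X'WX
    penalty = 0.5 * np.linalg.slogdet(info)[1]            # (1/2) ln det I(beta)
    return -(loglik + penalty)

res = minimize(neg_penalized_loglik, x0=np.zeros(2), method="BFGS")
print("Firth-penalized estimate (intercept, slope):", res.x)   # finite, unlike the MLE
```

The penalty term is exactly what keeps the estimate finite here: as the slope grows, the fitted probabilities approach 0 or 1, $W$ collapses, and $\ln \det I(\beta)$ goes to $-\infty$, so the penalized objective cannot be maximized at infinity.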