How to model the sum of Bernoulli random variables for dependent data?

I have almost the same questions like this:
How can I efficiently model the sum of Bernoulli random variables?

But the setting is quite different:

1. $S=\sum_{i=1,N}{X_i}$, $P(X_{i}=1)=p_i$, $N$~20, $p_i$~0.1

2. We have the data for the outcomes of Bernoulli random variables: $X_{i,j}$ , $S_j=\sum_{i=1,N}{X_{i,j}}$

3. If we estimate the $p_i$ with maximum likelihood estimation (and get $\hat p^{MLE}_i$), it turns out that $\hat P\{S=3\} (\hat p^{MLE}_i)$ is much larger then expected by the other criteria: $\hat P\{S=3\} (\hat p^{MLE}_i) – \hat P^{expected} \{S=3\}\approx 0.05$

4. So, $X_{i}$ and $X_{j}$ $(j>k)$ cannot be treated as independent (they have
small dependence).

5. There are some constrains like these: $p_{i+1} \ge p_i$ and $\sum_{s \le 2}\hat P\{S=s\}=A$ (known), which should help with the estimation of $P\{S\}$.

How could we try to model the sum of Bernoulli random variables in this case?

What literature could be useful to solve the task?

UPDATED

There are some further ideas:

(1) It’s possible to assume that the unknown dependence between ${X_i}$ begins after 1 or more successes in series. So when $\sum_{i=1,K}{X_i} > 0$, $p_{K+1} \to p’_{K+1}$ and $p’_{K+1} < p_{K+1}$.

(2) In order to use MLE we need the least questionable model. Here is an variant:

$P\{X_1,…,X_k\}= (1-p_1) … (1-p_k)$ if $\sum_{i=1,k}{X_i} = 0$ for any k
$P\{X_1,…,X_k,X_{k+1},…,X_N\}= (1-p_1) … p_k P’\{X_{k+1},…,X_N\}$ if $\sum_{i=1,k-1}{X_i} = 0$ and $X_k = 1$, and $P’\{X_{k+1}=1,X_{k+2}=1,…,X_N=1\} \le p_{k+1} p_{k+2} … p_N$ for any k.

(3) Since we interested only in $P\{S\}$ we can set $P’\{X_{k+1},…,X_N\} \approx P”\{\sum_{i=1,k}{X_i}=s’ ; N-(k+1)+1=l\}$ (the probability of $\sum_{i=k+1,N}{X_i}$ successes for N-(k+1)+1 summands from the tail). And use parametrization $P”\{\sum_{i=k,N}{X_i}=s’ ; N-k+1=l\}= p_{s’,l}$

(4) Use MLE for model based on parameters $p_1,…,p_N$ and $p_{0,1}, p_{1,1}; p_{0,2}, p_{1,2}, p_{2,2};…$ with $p_{s’,l}=0$ for $s’ \ge 6$ (and any $l$) and some other native constrains.

Is everything ok with this plan?

UPDATED 2

Some examples of empirical distribution $P\{S\}$ (red) compared with Poisson distribution (blue) (the poisson means are 2.22 and 2.45, sample sizes are 332 and 259):

For samples (A1, A2) with the poisson means 2.28 and 2.51 (sample sizes are 303 and 249):

For joined samlpe A1 + A2 (the sample size is 552):

Looks like some correction to Poisson should be the best model :).

One approach would be to model the $X$’s with a generalized linear model (GLM). Here, you would formulate $p_i$, the probability of success on the $i$’th trial as a (logistic linear) function of the recent observation history. So you’re essentially fitting an autoregressive GLM where the noise is Bernoulli and the link function is logit. The setup is:

$p_i = f(b + a_1 X_{i-1} + a_2 X_{i-2} + \ldots a_k X_{i-k})$, where

$f(x) = \frac{1}{1+\exp(x)}$, and

$X_i \sim Bernoulli(p_i)$

The parameters of the model are $\{b, a_1, \ldots a_k\}$, which can be estimated by logistic regression. (All you have to do is set up your design matrix using the relevant portion of observation history at each trial, and pass that into a logistic regression estimation function; the log-likelihood is concave so there’s a unique global maximum for the parameters). If the outcomes are indeed independent then the $a_i$’s will be set to zero; positive $a_i$’s mean that subsequent $p_i$’s increase whenever a success is observed.

The model doesn’t provide a simple expression for the probability over the sum of the $X_i$’s, but this is easy to compute by simulation (particle filtering or MCMC) since the model has simple Markovian structure.

This kind of model has been used with great success to model temporal dependencies between “spikes” of neurons in the brain, and there is an extensive literature on autoregressive point process models. See, e.g., Truccolo et al 2005 (although this paper uses a Poisson instead of a Bernoulli likelihood, but the mapping from one to the other is straightforward).