Let’s say we are given the following problem:
Predict which clients are most likely to stop buying in our shop in the next 3 months.
For each client we know the month in which they started buying in our shop, and we also have many behavioral features as monthly aggregates. The ‘oldest’ client has been buying for fifty months; let’s denote the time since a client began to buy by $t$ ($t \in [0, 50]$). It can be assumed that the number of clients is very large. If a client stops buying for three months and then comes back, they are treated as a new customer, so an event (stop buying) can occur only once per client.
Two solutions come to mind:
Logistic regression – For each client and each month (except perhaps the 3 most recent months), we know whether the client stopped buying or not, so we can build rolling samples with one observation per client and month. We can use the number of months since the client started as a categorical variable to obtain an equivalent of the baseline hazard function.
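The person-month layout described above can be sketched as follows. This is a minimal illustration, not a full pipeline: the function names, the field names (`client`, `t`, `event`) and the inputs `months_observed` and `churn_month` are all hypothetical, and how "stopped buying" is labeled from the 3-month window is assumed to be decided upstream. With no other covariates, a logistic regression on categorical $t$ reproduces the per-month event rates computed by `empirical_hazard`, which is why the month dummies play the role of a baseline hazard.

```python
from collections import Counter
from typing import Dict, List, Optional


def person_period_rows(client_id: str, months_observed: int,
                       churn_month: Optional[int]) -> List[Dict]:
    """Expand one client into one row per month at risk.

    months_observed: number of months the client was followed (hypothetical input).
    churn_month: month index t at which the client stopped buying,
                 or None if the client is still active (censored).
    """
    last = churn_month if churn_month is not None else months_observed - 1
    return [
        {"client": client_id, "t": t,
         "event": int(churn_month is not None and t == churn_month)}
        for t in range(last + 1)
    ]


def empirical_hazard(rows: List[Dict]) -> Dict[int, float]:
    """Share of at-risk clients who stop in each month t."""
    at_risk = Counter(r["t"] for r in rows)
    events = Counter(r["t"] for r in rows if r["event"])
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}


# Toy data (made up for illustration): client, months observed, churn month.
clients = [("a", 3, 2), ("b", 3, None), ("c", 1, 0)]
rows = [row for c in clients for row in person_period_rows(*c)]
hazard = empirical_hazard(rows)
```

Censored clients contribute rows with `event = 0` for every month they were observed, which is how this layout uses their information without pretending to know when they stop.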
Extended Cox model – This problem can also be modeled using the extended Cox model. It seems that this problem is more suited to survival analysis.
Question: What are the advantages of survival analysis in problems like this one? Survival analysis was invented for some reason, so there must be some serious advantage.
My knowledge in survival analysis is not very deep and I think that most potential advantages of the Cox model can also be achieved using logistic regression.
- An equivalent of the stratified Cox model can be obtained using an interaction of $t$ and the stratifying variable.
- An equivalent of a Cox model with interactions can be obtained by dividing the population into several sub-populations and estimating LR for every sub-population.
The only advantage I see is that the Cox model is more flexible; for example, we can easily calculate the probability that a client will stop buying in 6 months.
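For comparison, a multi-month probability can also be assembled from monthly hazards in the person-month setting: the chance of stopping within $k$ months is one minus the product of the monthly survival probabilities. The sketch below assumes the monthly hazards are already available, for example as predictions from a per-month model; the function name `prob_event_within` is hypothetical. With time-varying covariates, hazards for future months would need future covariate values, which is an additional assumption.

```python
from typing import Sequence


def prob_event_within(hazards: Sequence[float]) -> float:
    """Probability that the event happens in any of the given months.

    hazards[j] is the probability of stopping in month j,
    given that the client has not stopped before month j.
    """
    survival = 1.0
    for h in hazards:
        survival *= 1.0 - h  # probability of still buying after this month
    return 1.0 - survival


# Example: a constant 10% monthly hazard over 6 months.
p_six_months = prob_event_within([0.10] * 6)
```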
The problem with the Cox model is that, on its own, it predicts nothing: the “intercept” (the baseline hazard function) is never actually estimated by the partial likelihood, so the fit yields relative hazards rather than absolute risks. Logistic regression can be used to predict the risk or probability of some event, in this case whether or not a subject comes in to buy something in a specific month.
The problem with the assumptions behind ordinary logistic regression is that each person-month observation is treated as independent, regardless of whether observations came from the same person or the same month. This can be dangerous because some items are bought at two-month intervals, so consecutive person-month observations are negatively correlated. Alternatively, a customer can be retained or lost through good or bad experiences, which makes consecutive person-month observations positively correlated.
I think a good start to this prediction problem is to take a forecasting approach, where previous information informs our predictions about the next month’s business. A simple first step is to adjust for a lagged effect: an indicator of whether a subject came in last month, used as a predictor of whether they might come in this month.
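The lagged indicator can be sketched as a small preprocessing step on person-month rows. The field names (`client`, `month`, `bought`, `bought_prev`) and the function name are hypothetical, and dropping each client's first observed month (which has no history) is one possible choice rather than the only one.

```python
from typing import Dict, List, Tuple


def add_lagged_purchase(rows: List[Dict]) -> List[Dict]:
    """Attach last month's purchase indicator to each person-month row."""
    # Index purchases by (client, month) so the previous month can be looked up.
    bought: Dict[Tuple[str, int], int] = {
        (r["client"], r["month"]): r["bought"] for r in rows
    }
    out = []
    for r in rows:
        prev = bought.get((r["client"], r["month"] - 1))
        if prev is None:
            continue  # no previous month observed for this row; skip it
        out.append({**r, "bought_prev": prev})
    return out


# Toy data (made up for illustration).
purchases = [
    {"client": "a", "month": 0, "bought": 1},
    {"client": "a", "month": 1, "bought": 0},
    {"client": "a", "month": 2, "bought": 1},
    {"client": "b", "month": 0, "bought": 1},
    {"client": "b", "month": 1, "bought": 1},
]
lagged = add_lagged_purchase(purchases)
```

The resulting `bought_prev` column can then enter the logistic regression as an ordinary predictor alongside the other monthly features.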