I’m trying to model some data on train arrival times. I’d like to use a distribution that captures “the longer I wait, the more likely the train is going to show up”. It seems like such a distribution should look like a CDF, so that P(train shows up | waited 60 minutes) is close to 1. What distribution is appropriate to use here?
Multiplication of two probabilities
The probability for a first arrival at a time between t and t+dt (the waiting time) is equal to the product of
- the probability for an arrival between t and t+dt (which can be related to the arrival rate s(t) at time t)
- and the probability of no arrival before time t (otherwise it would not be the first).
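This latter term, the probability $P(n=0,t)$ of no arrival before time t, satisfies
$$P(n=0,t+dt) = \left(1 - s(t)\,dt\right) P(n=0,t)$$
or equivalently
$$\frac{\partial P(n=0,t)}{\partial t} = -s(t)\,P(n=0,t)$$
giving
$$P(n=0,t) = e^{-\int_0^t s(u)\,du}$$
and the probability distribution for waiting times is:
$$f(t) = s(t)\,e^{-\int_0^t s(u)\,du}$$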
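To make the product of the two probabilities concrete, here is a minimal numerical sketch. It assumes the waiting-time density takes the form f(t) = s(t) · exp(−∫₀ᵗ s(u) du), that is, the arrival rate times the probability of no earlier arrival. The function names and the linearly growing rate are hypothetical, chosen only for illustration.

```python
import math

def survival(rate, t, n_steps=10_000):
    """P(no arrival before t) = exp(-integral_0^t rate(u) du), using the midpoint rule."""
    h = t / n_steps
    integral = sum(rate((i + 0.5) * h) for i in range(n_steps)) * h
    return math.exp(-integral)

def waiting_time_pdf(rate, t, n_steps=10_000):
    """Density of the first-arrival time: rate at t times P(no arrival before t)."""
    return rate(t) * survival(rate, t, n_steps)

# Hypothetical example (not from the question): a rate that grows linearly, s(t) = 0.5 * t.
# Here the integral is 0.25 * t**2, so at t = 2 the density should be close to exp(-1).
linear_rate = lambda u: 0.5 * u
print(waiting_time_pdf(linear_rate, 2.0), math.exp(-1))
```

Any non-negative rate function can be passed in, which is convenient if s(t) is later estimated from data rather than assumed.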
Derivation of cumulative distribution.
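Alternatively, you could start from the probability of fewer than one arrival (that is, no arrival) by time t,
$$P(n<1 \mid t) = P(n=0 \mid t),$$
which is one minus the cumulative distribution of the waiting time. The probability density for the first arrival between time t and t+dt is then equal to minus its derivative:
$$f(t) = -\frac{d}{dt} P(n=0 \mid t)$$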
This approach/method is for instance useful in deriving the gamma distribution as the waiting time for the n-th arrival in a Poisson process. (waiting-time-of-poisson-process-follows-gamma-distribution)
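As a rough check of this route for the n-th arrival, the sketch below assumes a homogeneous Poisson process with rate λ, takes the probability of fewer than n arrivals by time t, differentiates it numerically, and compares the result with the gamma (Erlang) density λⁿ tⁿ⁻¹ e^(−λt) / (n−1)!. The function names and parameter values are illustrative assumptions.

```python
import math

def prob_fewer_than(n, lam, t):
    """P(fewer than n arrivals by time t) for a Poisson process with rate lam."""
    return sum(math.exp(-lam * t) * (lam * t) ** k / math.factorial(k) for k in range(n))

def erlang_pdf(n, lam, t):
    """Gamma (Erlang) density of the waiting time until the n-th arrival."""
    return lam ** n * t ** (n - 1) * math.exp(-lam * t) / math.factorial(n - 1)

def neg_derivative(n, lam, t, h=1e-5):
    """Minus the central-difference derivative of P(fewer than n arrivals by t)."""
    return -(prob_fewer_than(n, lam, t + h) - prob_fewer_than(n, lam, t - h)) / (2 * h)

# Illustrative values: third arrival, rate 0.5 per minute, evaluated at t = 4 minutes
print(neg_derivative(3, 0.5, 4.0), erlang_pdf(3, 0.5, 4.0))
```

The two printed numbers should agree to several decimal places, which is the numerical counterpart of the derivative argument.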
You might relate this to the waiting paradox (Please explain the waiting paradox).
Exponential distribution: If the arrivals are random like a Poisson process, then s(t)=λ is constant. The probability of a next arrival is independent of the time already spent waiting without an arrival (say, if you roll a fair die many times without a six, then for the next roll you will not suddenly have a higher probability of a six; see the gambler's fallacy). You will get the exponential distribution, and the pdf for the waiting times is: $f(t) = \lambda e^{-\lambda t}$
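The lack of memory in the exponential case can be illustrated with a small simulation. This is only a sketch: the rate of 0.1 per minute, the 5 minutes already waited, and the 10 extra minutes are made-up values.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

lam = 0.1             # assumed rate: one train per 10 minutes on average
already_waited = 5.0  # minutes spent waiting so far
extra = 10.0          # additional minutes we ask about

waits = [random.expovariate(lam) for _ in range(200_000)]

# Fresh start: fraction of waits longer than `extra`
p_fresh = sum(w > extra for w in waits) / len(waits)

# After waiting: among waits that already exceeded `already_waited`,
# the fraction that last at least `extra` minutes more
survivors = [w for w in waits if w > already_waited]
p_after_waiting = sum(w > already_waited + extra for w in survivors) / len(survivors)

# Both estimates should be close to exp(-lam * extra)
print(p_fresh, p_after_waiting, math.exp(-lam * extra))
```

Having waited 5 minutes does not change the chance of waiting 10 more, so this distribution does not capture "the longer I wait, the more likely the train shows up".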
Constant (uniform) distribution: If the arrivals occur at regular intervals (such as trains arriving according to a fixed schedule), then the probability of an arrival, when a person has already been waiting for some time, is increasing. Say a train is supposed to arrive every T minutes; then the arrival rate after already waiting t minutes is $s(t) = 1/(T-t)$, and the pdf for the waiting time will be: $$f(t) = \frac{1}{T-t}\, e^{-\int_0^t \frac{1}{T-u}\,du} = \frac{1}{T}$$ which makes sense, since every time between 0 and T should be equally likely to be the first arrival.
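The increasing arrival probability in the fixed-schedule case can also be seen in a simulation. The sketch below assumes a passenger shows up at a uniformly random moment within a schedule period of T = 10 minutes (an illustrative value), so the waiting time is uniform on (0, T). It estimates the chance that the train comes within the next minute, given that the passenger is still waiting at time t.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

T = 10.0   # assumed schedule: one train every 10 minutes
dt = 1.0   # width of the look-ahead window, in minutes

# A passenger arriving at a random moment waits Uniform(0, T) minutes
waits = [random.uniform(0.0, T) for _ in range(200_000)]

def empirical_hazard(t):
    """Among passengers still waiting at time t, the fraction served within the next dt minutes."""
    still_waiting = [w for w in waits if w > t]
    served = sum(w <= t + dt for w in still_waiting)
    return served / len(still_waiting)

for t in (0.0, 5.0, 8.0):
    # For a uniform waiting time the exact conditional probability is dt / (T - t)
    print(t, round(empirical_hazard(t), 3), dt / (T - t))
```

The estimated probability rises as t grows, while the waiting-time CDF is simply t/T, which reaches 1 at t = T.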
So it is this second case, with "then the probability of an arrival, when a person has already been waiting for some time is increasing", that relates to your question.
It might need some adjustments depending on your situation. With more information the probability s(t)dt for a train to arrive at a certain moment might be a more complex function.