I’m working on developing a model to predict total sales of a product. I have about a year and a half of bookings data, so I could do a standard time series analysis. However, I also have a lot of data about each ‘opportunity’ (potential sale) that was either closed or lost. ‘Opportunities’ are progressed along stages of a pipeline until they are closed or lost; they also have associated data about the prospective buyer, sales person, interaction history, industry, estimated size of bookings, etc.

My goal is ultimately to predict total bookings, but I want to account for all of this information about the current ‘opportunities’ which are the true ‘root cause’ of bookings.

One idea I have is to use two different models serially as follows:

1. Use historical ‘opportunities’ to build a model that predicts the bookings arising from an individual ‘opportunity’ (I’d probably use random forests or even plain old linear regression for this step).

2. Use the model from 1 to predict the estimated bookings of all ‘opportunities’ currently in the pipeline, then sum those estimates based on the month each ‘opportunity’ was created.

3. Use a time series model (possibly ARIMA?), using the 1.5 years of monthly historical time series data AND the predicted (using the model from 1) total bookings for all ‘opportunities’ created in that month.

Granted there would be a lag in those opportunities converting to actual bookings, but the time series model should be able to deal with the lag.
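
As a rough illustration of the first two steps, here is a minimal sketch. Everything in it is an assumption made for illustration: the opportunity records are made up, the field layout is hypothetical, and a simple per-stage ratio of realised to estimated bookings stands in for the random forest or linear regression mentioned above.

```python
from collections import defaultdict

# Hypothetical closed/lost opportunities: (stage, estimated_size, actual_bookings).
# actual_bookings is 0.0 for opportunities that were lost.
history = [
    ("proposal", 100.0, 100.0),
    ("proposal", 200.0, 0.0),
    ("negotiation", 150.0, 150.0),
    ("negotiation", 50.0, 50.0),
    ("negotiation", 100.0, 0.0),
    ("negotiation", 100.0, 100.0),
]


def fit_stage_ratios(rows):
    """Stand-in for the opportunity model: realised / estimated bookings per stage."""
    estimated = defaultdict(float)
    realised = defaultdict(float)
    for stage, size, booked in rows:
        estimated[stage] += size
        realised[stage] += booked
    return {stage: realised[stage] / estimated[stage] for stage in estimated}


def predict_pipeline_by_month(ratios, pipeline):
    """Predict each open opportunity, then sum the predictions by creation month."""
    totals = defaultdict(float)
    for created_month, stage, size in pipeline:
        totals[created_month] += ratios[stage] * size
    return dict(totals)


ratios = fit_stage_ratios(history)

# Hypothetical open pipeline: (creation_month, stage, estimated_size).
pipeline = [
    ("2024-01", "proposal", 300.0),
    ("2024-01", "negotiation", 100.0),
    ("2024-02", "negotiation", 200.0),
]
monthly_predicted = predict_pipeline_by_month(ratios, pipeline)
print(monthly_predicted)
```

The resulting `monthly_predicted` series is what would be passed to the time series model as an extra input alongside the historical monthly bookings; that last step is not shown here.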

How does this sound? I’ve done a lot of reading on time series and predicting sales, and from what I can tell this is a somewhat unique approach. Therefore I’d really appreciate any feedback!

**Answer**

You may end up with a model that seems to fit your current data OK, but it will come unstuck as soon as you try to produce an out-of-sample forecast. Consider producing a forecast for six months' time. You have no way of knowing what the opportunities will be in six months, so you will have to create another set of models to predict each of the inputs to your opportunity model. Once you do this, you will have lots of models feeding into your main model, and each of the little models will have its own prediction error attached to it. These errors will compound, but your main model will not know about them, and as a result all your prediction intervals will be grossly deflated, that is, far too narrow.
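
A toy simulation can make the interval problem concrete. This is only a sketch under made-up assumptions: bookings are `beta * x` plus Gaussian noise, the future input `x` is itself forecast with an independent Gaussian error, and all parameter values are hypothetical. The "naive" interval uses only the main model's noise, while the "full" interval also adds the variance passed through from the input forecast.

```python
import math
import random


def interval_coverage(n_trials=20000, beta=1.0, sigma_e=10.0, sigma_u=20.0, seed=0):
    """Return (naive, full) empirical coverage of nominal 95% intervals."""
    rng = random.Random(seed)
    z = 1.96  # two-sided 95% normal quantile
    # Naive half-width: ignores the error in the forecast input.
    naive_half = z * sigma_e
    # Full half-width: independent error variances add up.
    full_half = z * math.sqrt(sigma_e ** 2 + (beta * sigma_u) ** 2)

    naive_hits = 0
    full_hits = 0
    x_true = 100.0  # hypothetical true future pipeline value
    for _ in range(n_trials):
        x_hat = x_true + rng.gauss(0.0, sigma_u)      # sub-model forecast of the input
        y = beta * x_true + rng.gauss(0.0, sigma_e)   # bookings actually realised
        abs_error = abs(y - beta * x_hat)             # error of the chained forecast
        naive_hits += abs_error <= naive_half
        full_hits += abs_error <= full_half
    return naive_hits / n_trials, full_hits / n_trials


naive_cov, full_cov = interval_coverage()
print(f"naive interval coverage: {naive_cov:.2f}")
print(f"full interval coverage:  {full_cov:.2f}")
```

With these settings the naive interval covers the outcome far less often than its nominal 95%, while the interval that accounts for the input error stays close to 95%. In a real pipeline the sub-model errors are unlikely to be independent or Gaussian, so the correction would be harder than simply adding variances.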

**Attribution** *Source: Link, Question Author: the_fractal_mouse, Answer Author: Tim*