I am unable to interpret this graph. My dependent variable is the total number of movie tickets that will be sold for a show. The independent variables are the number of days left before the show, seasonality dummy variables (day of week, month of year, holiday), price, tickets sold to date, movie rating, and movie type (thriller, comedy, etc., as dummies). Also, please note that the movie hall’s capacity is fixed: it can host at most x people. I am building a linear regression model, and it does not fit my test data well, so I thought I would start with regression diagnostics. The data are from a single movie hall for which I want to predict demand.
This is a multivariate dataset. For every show date there are 90 rows, one for each number of days before the show. So, for 1 Jan 2016 there are 90 records. A ‘lead_time’ variable gives the number of days before the show. So for 1 Jan 2016, if lead_time has value 5, the row holds the tickets sold up to 5 days before the show date. The dependent variable, total tickets sold, has the same value in all 90 rows.
Also, as a side remark, is there any book that explains how to interpret a residual plot and improve the model afterwards?
The plot is very dense, so it is not easy to see all the trends there may be. You could also run formal tests for heteroscedasticity and autocorrelation to get additional diagnostics.
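As a rough illustration of what such tests compute, here is a minimal sketch in plain Python. It is not taken from your model: the data are synthetic, and the function names `durbin_watson` and `breusch_pagan_on_fitted` are made up for this example. The heteroscedasticity check is the studentized (Koenker) variant of the Breusch-Pagan test with the fitted values as the only explanatory variable, which is a simplification. In practice you would more likely use a statistics package than hand-rolled code.

```python
import math
import random


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def breusch_pagan_on_fitted(residuals, fitted):
    """Studentized Breusch-Pagan LM test with the fitted values as sole regressor.

    Regresses the squared residuals on the fitted values and returns
    (LM statistic, p-value), where LM = n * R^2 is compared with a
    chi-square distribution with 1 degree of freedom.
    """
    n = len(residuals)
    squared = [e ** 2 for e in residuals]
    mean_f = sum(fitted) / n
    mean_s = sum(squared) / n
    cov = sum((f - mean_f) * (s - mean_s) for f, s in zip(fitted, squared))
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    var_s = sum((s - mean_s) ** 2 for s in squared)
    r_squared = cov ** 2 / (var_f * var_s)  # R^2 of a one-regressor fit
    lm = n * r_squared
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-square(1) survival function
    return lm, p_value


# Synthetic example (an assumption, not your data): the residual spread
# grows with the fitted value, so the test should flag heteroscedasticity.
random.seed(0)
fitted = [float(i) for i in range(1, 501)]
residuals = [random.gauss(0, 0.05 * f) for f in fitted]

dw = durbin_watson(residuals)
lm, p_value = breusch_pagan_on_fitted(residuals, fitted)
print(f"Durbin-Watson: {dw:.2f}")
print(f"Breusch-Pagan LM: {lm:.1f}, p-value: {p_value:.4g}")
```

A small p-value from the second function suggests that the residual variance changes with the fitted values. Note that the Durbin-Watson statistic depends on the row order, so with 90 rows per show date the residuals should be ordered in a way that makes sense (for example by show date and lead time) before it is computed.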
What is visible is that over the first 100 values or so the variance of the residuals increases, which may hint at heteroscedasticity. Afterwards the variance seems to decrease again. This somewhat non-linear behavior of the variance may also point to the need for a different functional form (so maybe polynomial instead of linear). Another indication of this is the trend in the residuals at the high end of the fitted values (there aren’t any positive residuals anymore).
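Because the plot is so dense, it can help to summarize the residuals in bins of the fitted values instead of judging the spread by eye. The following is a minimal sketch in plain Python. The helper `binned_residual_summary` is a name invented for this example, and the data are synthetic: observed sales are capped at a hypothetical capacity of 100, which is only one possible way to produce residuals that are all negative at the high end of the fitted values.

```python
import random
import statistics


def binned_residual_summary(fitted, residuals, n_bins=10):
    """Sort by fitted value, split into equal-count bins, and summarize each bin."""
    pairs = sorted(zip(fitted, residuals))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any leftover observations.
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        res = [r for _, r in chunk]
        rows.append({
            "fitted_mean": statistics.mean(f for f, _ in chunk),
            "resid_mean": statistics.mean(res),
            "resid_sd": statistics.stdev(res),
            "share_positive": sum(r > 0 for r in res) / len(res),
        })
    return rows


# Synthetic data (an assumption for illustration only): a linear prediction
# that ignores a cap of 100 on the observed value.
random.seed(1)
capacity = 100
fitted = [random.uniform(20, 130) for _ in range(2000)]
observed = [min(capacity, f + random.gauss(0, 10)) for f in fitted]
residuals = [o - f for o, f in zip(observed, fitted)]

summary = binned_residual_summary(fitted, residuals)
for row in summary:
    print(
        f"fitted ~ {row['fitted_mean']:6.1f}  "
        f"mean resid {row['resid_mean']:7.2f}  "
        f"sd {row['resid_sd']:6.2f}  "
        f"share > 0 {row['share_positive']:.2f}"
    )
```

If the per-bin mean drifts away from zero, that supports trying a different functional form. If mainly the per-bin standard deviation changes, that points more towards heteroscedasticity.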