Help me here, please. Perhaps before even giving me an answer you may need to help me ask the question. I have never learned about time series analysis and do not know if that is indeed what I need. I have never learned about time smoothed averages and do not know if that is indeed what I need. My statistics background: I have 12 credits in biostatistics (multiple linear regression, multiple logistic regression, survival analysis, multi-factorial anova but never repeated measures anova).
So please look at my scenarios below. What are the buzzwords I should be searching for and can you suggest a resource to learn what I need to learn?
I want to look at several different data sets for totally different purposes but common to all of them is that there are dates as one variable. So a couple of examples spring to mind: clinical productivity over time (as in how many surgeries or how many office visits) or electric bill over time (as in money paid to electricity company per month).
For both of the above the near universal way to do it is to create a spreadsheet with month or quarter in one column and, in the other column, something such as electricity payment or number of patients seen in the clinic. However, counting per month leads to a lot of noise that has no meaning. For instance, if I usually pay the electricity bill on the 28th of every month but on one occasion I forget and only pay it 5 days later, on the 3rd of the next month, then one month will appear as if there was zero expense and the next month will show a ginormous expense. Since one has the actual dates of payment, why would one purposely throw away the very granular data by boxing it into expenses by calendar month?
Similarly, if I am out of town for 6 days at a conference then that month will appear to be very unproductive, and if those 6 days fell near the end of the month, the next month will be uncharacteristically busy since there will be a whole waiting list of people who wanted to see me but had to wait till I returned.
Then of course there are the obvious seasonal variations. Air conditioners use a lot of electricity so obviously one has to adjust for summer heat. Billions of children are referred to me for recurrent acute otitis media in the winter and hardly any in the summer and early fall. No child of school age gets scheduled for elective surgery in the first 6 weeks that schools return following the long summer vacation. Seasonality is just one independent variable that affects the dependent variable. There must be other independent variables some of which can be guessed and others that are not known.
A whole bunch of different issues crop up when looking at enrollment in a longstanding clinical study.
What branch of statistics lets us look at this over time by simply looking at events and their actual dates, without creating artificial boxes (months/quarters/years) that do not really exist?
I thought of making a weighted average count for any event. For instance, the number of patients seen this week would be equal to 0.5*nr seen this week + 0.25*nr seen last week + 0.25*nr seen next week.
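As a minimal sketch of the weighting just described, the 0.25/0.5/0.25 scheme can be written as a centered moving average using base R's `stats::filter`. The weekly counts below are made-up numbers for illustration only, and `weekly_counts`, `weights` and `smoothed` are hypothetical names.

```r
# Hypothetical weekly patient counts (invented for illustration)
weekly_counts <- c(20, 22, 5, 35, 21, 19, 23)

# Weights for last week, this week, next week
weights <- c(0.25, 0.5, 0.25)

# sides = 2 centers the window on the current week;
# the first and last weeks have no neighbor on one side and come out as NA
smoothed <- stats::filter(weekly_counts, filter = weights, sides = 2)

print(as.numeric(smoothed))
```

Note how the dip in week 3 and the rebound in week 4 are pulled toward each other, which is the effect being asked for; the two end weeks are left undefined rather than guessed.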
I want to learn more about this. What buzzwords should I be searching for?
I would start with robust time series filters (i.e. time-varying medians) because these are simpler and more intuitive.
Basically, the robust time filter is to time series smoothers what the median is to the mean: a summary measure (in this case a time-varying one) that is not sensitive to 'weird' observations so long as they do not represent the majority of the data. For a summary see here.
If you need more sophisticated smoothers (e.g. non-linear ones), you could use robust Kalman filtering (although this requires a slightly higher level of mathematical sophistication).
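To illustrate the median-versus-mean analogy, here is a small sketch on made-up data using only base R. It uses a plain running median (`stats::runmed`), not the repeated median filter from the robfilter package, and the series `x` is invented for the purpose.

```r
# Made-up series: a flat level of 10 with a single aberrant value
x <- c(10, 10, 10, 10, 100, 10, 10, 10, 10)

# Running mean over a window of 3 (centered); end points are NA
run_mean <- stats::filter(x, rep(1/3, 3), sides = 2)

# Running median over a window of 3
run_median <- stats::runmed(x, k = 3)

print(as.numeric(run_mean))
print(as.numeric(run_median))
```

The running mean is dragged up at every position whose window contains the outlier, while the running median ignores it because the aberrant value never forms the majority of a window.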
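    library(robfilter)

    # Annual flow of the river Nile, shipped with base R
    data(Nile)
    nile <- as.numeric(Nile)

    # Weighted repeated median filter with a window of 11 observations
    obj <- wrm.filter(nile, width=11)
    plot(obj)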
The last document contains a large number of references to papers and books. Other types of filters are implemented in the package, but the repeated median is a very simple one.