I’m looking for some robust techniques to remove outliers and errors (whatever the cause) from financial time-series data (i.e. tickdata).
Tick-by-tick financial time-series data is very messy. It contains huge (time) gaps when the exchange is closed, and makes huge jumps when the exchange opens again. When the exchange is open, all kinds of factors introduce trades at price levels that are wrong (they never occurred) and/or not representative of the market (a spike caused by an incorrectly entered bid or ask price, for example). This paper by tickdata.com (PDF) does a good job of outlining the problem, but offers few concrete solutions.
Most papers I can find online that mention this problem either ignore it (the tickdata is assumed filtered) or include the filtering as part of some huge trading model which hides any useful filtering steps.
Is anybody aware of more in-depth work in this area?
Update: this question seems similar on the surface, but:
- Financial time series is (at least at the tick level) non-periodic.
- The opening effect is a big issue because you can’t simply use the previous day’s data as initialisation, even though you’d really like to (otherwise you have nothing). External events might cause the new day’s open to differ dramatically from the previous day, both in absolute level and in volatility.
- Wildly irregular frequency of incoming data. Near the open and close of the day, the number of datapoints per second can be 10 times higher than the daily average. The other question deals with regularly sampled data.
- The “outliers” in financial data exhibit some specific patterns that could be detected with specific techniques not applicable in other domains and I’m -in part- looking for those specific techniques.
- In more extreme cases (e.g. the flash crash) the outliers might amount to more than 75% of the data over longer intervals (> 10 minutes). In addition, the (high) frequency of incoming data contains some information about the outlier aspect of the situation.
The problem is definitely hard.
Mechanical rules like flagging everything beyond +/- N1 standard deviations, +/- N2 times the MAD, or +/- N3 times the IQR will fail, because there are always some series that behave differently, for example:
- fixings such as interbank rates may be constant for some time and then suddenly jump
- similarly for, e.g., certain foreign exchange rates coming off a peg
- certain instruments are implicitly spreads; these may hover near zero for long periods and then suddenly jump manifold
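To make the failure mode concrete, here is a minimal sketch of such a mechanical rule (a rolling-median/MAD filter; the window and threshold are illustrative, not recommended values). On a fixing that sits at a constant level, the MAD collapses to zero, so the first genuine jump is inevitably flagged as an outlier:

```python
from statistics import median

def mad_outliers(prices, window=20, n=5.0):
    """Flag points more than n * MAD from the rolling median.

    Hypothetical sketch of the mechanical rule criticised above;
    window and n are illustrative parameters, not recommendations.
    """
    flags = []
    for i, p in enumerate(prices):
        hist = prices[max(0, i - window):i]
        if len(hist) < window:
            flags.append(False)  # not enough history yet
            continue
        m = median(hist)
        mad = median(abs(x - m) for x in hist)
        # A constant fixing gives MAD == 0, so *any* move gets flagged:
        flags.append(abs(p - m) > n * mad)
    return flags

# A fixing that is constant for a while and then jumps (a real move,
# not a data error) -- the rule flags the genuine jump:
series = [1.50] * 30 + [1.65]
print(mad_outliers(series)[-1])  # True: legitimate jump rejected
```

The same collapse happens with standard-deviation or IQR rules on any series that is flat or near-flat for a stretch, which is exactly the case of fixings, pegs, and spread instruments above.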
Been there, done that, … in a previous job. You could try to bracket each series using arbitrage relationships: e.g. assuming the USD/EUR and EUR/JPY feeds are presumed good, you can work out bands around where USD/JPY should be; likewise for derivatives off an underlying, etc.
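The triangular-arbitrage bracket can be sketched as follows. The quote conventions (EUR per USD, JPY per EUR) and the tolerance value are assumptions of this sketch; in practice the band would have to account for bid/ask spreads and timing mismatch between the feeds:

```python
def implied_band(usd_eur, eur_jpy, tol=0.002):
    """Band for USD/JPY implied by the two 'trusted' crosses.

    usd_eur: EUR per 1 USD; eur_jpy: JPY per 1 EUR (conventions are
    an assumption of this sketch). tol is a hypothetical tolerance
    covering spreads and feed-timing mismatch.
    """
    implied = usd_eur * eur_jpy          # JPY per 1 USD
    return implied * (1 - tol), implied * (1 + tol)

def is_plausible(usd_jpy_tick, usd_eur, eur_jpy, tol=0.002):
    """Accept a USD/JPY tick only if it lies inside the implied band."""
    lo, hi = implied_band(usd_eur, eur_jpy, tol)
    return lo <= usd_jpy_tick <= hi

# Example: 0.92 EUR per USD, 160 JPY per EUR -> implied ~147.2 JPY/USD
print(is_plausible(147.3, 0.92, 160.0))   # inside the band -> True
print(is_plausible(14.73, 0.92, 160.0))   # fat-finger decimal -> False
```

The appeal of this kind of check over purely statistical rules is that it catches the decimal-shift and fat-finger errors without rejecting genuine regime changes, since the trusted crosses move with the market.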
Commercial data vendors expend some effort on this, and those of us who are their clients know … it still does not exclude all errors.