I am intrigued by the concept of a Maximum Entropy Markov Model (MEMM), and I am thinking
of using it for a Part of Speech (POS) tagger. At the moment, I am using a conventional
Maximum Entropy (ME) classifier to tag each individual word. This uses a number of features, including the preceding two tags.
MEMMs use the Viterbi algorithm to find the optimal path through the Markov chain
(i.e. to find the complete optimal set of tags for the sentence, rather than an individual optimum for each word). Reading about it, this approach appears to have a wonderful elegance and simplicity. However, each stage relies only on the "results" of the previous stage (i.e. as in a Markov chain).
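To make that concrete, here is a minimal sketch of first-order Viterbi decoding for an MEMM. The function `p_tag(word, prev_tag)`, returning a `{tag: probability}` distribution, is a hypothetical stand-in for whatever the ME classifier provides; nothing here is tied to a particular library.

```python
import math

def viterbi(words, p_tag):
    """Find the best tag sequence under a first-order MEMM.

    p_tag(word, prev_tag) -> {tag: prob} is a hypothetical stand-in
    for the ME classifier's conditional distribution.
    """
    # best[tag] = (log-probability, path) of the best partial path
    # ending in that tag; "<s>" marks the sentence start.
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for prev, (logp, path) in best.items():
            for tag, p in p_tag(word, prev).items():
                cand = logp + math.log(p)
                if tag not in new_best or cand > new_best[tag][0]:
                    new_best[tag] = (cand, path + [tag])
        best = new_best
    # Return the path with the highest total log-probability.
    return max(best.values())[1]
```

Note that only one best path per ending tag survives each step, which is why the number of paths carried forward is the number of labels.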
My ME model, though, uses the previous two stages (i.e. the tags for the preceding two words). It seems I have two possible approaches:
As with a conventional Viterbi implementation, keep the set of paths stored according to
a single (the previous) stage. My ME classifier would use this stage, plus a 'frozen' stage before it (frozen into the path under consideration), to produce the transition probabilities.
Or I could write the algorithm to keep track of two stages. This is more complicated, and the result would no longer be a true Markov model, because each transition probability (i.e. from the ME model) would depend on the preceding two stages rather than one.
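For what it's worth, the second option can still be driven by standard Viterbi machinery if each Viterbi state is a *pair* of tags, making the chain first-order over pairs. A sketch, again with a hypothetical classifier interface `p_tag(word, prev2, prev1) -> {tag: prob}`:

```python
import math

def viterbi_2(words, p_tag):
    """Best tag sequence when the model conditions on two previous tags.

    p_tag(word, prev2, prev1) -> {tag: prob} is a hypothetical stand-in
    for an ME classifier conditioned on the previous two tags.
    """
    # State = (tag two back, previous tag). There are up to N*N such
    # states, which is exactly where the extra factor of N in cost comes
    # from. "<s>" pads the start of the sentence.
    best = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        new_best = {}
        for (p2, p1), (logp, path) in best.items():
            for tag, p in p_tag(word, p2, p1).items():
                state = (p1, tag)  # shift the window forward by one
                cand = logp + math.log(p)
                if state not in new_best or cand > new_best[state][0]:
                    new_best[state] = (cand, path + [tag])
        best = new_best
    return max(best.values())[1]
```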
It strikes me that the second approach will be more accurate, although it will also be more complicated.
I have yet to find any examples of this in my literature search. Has it been tried? Did the two-stage approach improve the overall accuracy?
(This really is a real question I'm facing, and the ML StackExchange site going live was pretty much perfect timing: I'd done a few days of book reading and online research and was about to start implementing. Here are my results. Although they aren't rigorous, I think they do answer my own question. I shall leave the question open for now in case anyone has any useful input, has tried something similar, or has some useful references.)
Okay, over the past couple of days I've coded this up. The code is not very efficient – lots of collection creation and copying – but the object of the exercise was to see whether it would work, and how well.
I am splitting my data randomly into two lists: training data, and test data.
I am running the test data through both the conventional Maximum Entropy POS tagger and my new MEMM tagger. Hence they see the same test data, allowing direct comparisons; due to the randomness in how the data is chosen, I see some variation between runs (typically about 0.2–0.4%).
The first test uses an MEMM tagger with a single stage (i.e. a true Markov chain). This consistently performed better than the simple ME tagger, by about 0.1–0.25%.
Next I tried the two-stage approach, which seems like it should be the more correct one. However, the results were even more marginal: often they were identical to the one-stage results, occasionally slightly worse, and perhaps a majority of the time slightly better (differences within about ±0.05%).
The MEMM tagger is slow. Admittedly I haven't applied any optimizations, but the one-stage version (a true Markov chain) is N times slower than the plain ME tagger (where N is the number of labels), because that is the number of paths carried between steps. The two-stage implementation is N*N times slower, because of the greater number of paths carried. Although optimizations might improve things, I think this is probably too slow for most practical applications.
One thing I am trying is applying a lower probability limit to the paths: the Viterbi paths are pruned at each iteration, with all paths below a certain probability (currently log(total path P) < -20.0) being discarded. This does run quite a bit faster, but the question remains whether it is worth it. I think it probably isn't.
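The pruning step itself is simple. One caveat in the sketch below: I interpret the cut-off as relative to the current best path (a log-space beam), since a fixed absolute threshold would eventually prune everything as log-probabilities only decrease with sentence length; this relative reading is an assumption.

```python
def prune(paths, margin=20.0):
    """Drop paths far below the best path at this Viterbi step.

    paths: {state: (log_prob, tag_path)}. Keeps only entries whose
    log-probability is within `margin` of the current best path
    (a beam in log-space, assumed rather than an absolute cut-off).
    """
    best = max(logp for logp, _ in paths.values())
    return {s: (lp, pth) for s, (lp, pth) in paths.items()
            if lp >= best - margin}
```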
Why do we not see any improvement? I think this is primarily down to how POS tags behave under the Maximum Entropy model. Although the model takes features based on the previous two tags, the immediately preceding tag is much more important than the one before it. Intuitively this makes sense for the English language (e.g. an adjective is usually followed by a noun or another adjective, but that doesn't really depend on what came before the adjective).