Given a sequence of inputs, I need to determine whether this sequence has a certain desired property. The property can only be true or false, that is, there are only two possible classes that a sequence can belong to.

The exact relationship between the sequence and the property is unclear, but I believe it is very consistent and should lend itself to statistical classification. I have a large number of cases to train the classifier on, although it might be slightly noisy, in the sense that there’s a slight probability that a sequence is assigned the wrong class in this training set.

Example training data:

```
Sequence 1: (7 5 21 3 3)     -> true
Sequence 2: (21 7 5 1)       -> true
Sequence 3: (12 21 7 5 11 1) -> false
Sequence 4: (21 5 7 1)       -> false
...
```

In rough terms, the property is determined by the set of values in the sequence (e.g. the presence of an “11” means that the property will almost certainly be false), as well as the *order* of the values (e.g. “21 7 5” significantly increases the chance that the property is true).

After training, I should be able to give the classifier a previously unseen sequence, like `(1 21 7 5 3)`, and it should output its confidence that the property is true.

Is there a well-known algorithm for training a classifier with this kind of input/output?

I have considered the naive Bayes classifier, which does not readily adapt to the fact that order matters, at least not without severely breaking the assumption that the inputs are independent. I’ve also investigated the hidden Markov model approach, which appears to be inapplicable because only a single output is available, instead of one output per input. What did I miss?

**Answer**

You could try probabilistic approaches similar to the naive Bayes classifier but with weaker assumptions. For example, instead of making the strong independence assumption, make a Markov assumption:

$$p(x \mid c) = p(x_0 \mid c) \prod_t p(x_t \mid x_{t-1}, c)$$

Here $c$ is your class label and $x$ is your sequence. You need to estimate two conditional distributions, one for $c = 1$ and one for $c = 0$.

By Bayes’ rule:

$$p(c = 1 \mid x) = \frac{p(x \mid c = 1)\, p(c = 1)}{p(x \mid c = 1)\, p(c = 1) + p(x \mid c = 0)\, p(c = 0)}.$$

Which distributions to pick for $p(x_t \mid x_{t-1}, c)$ depends on which other assumptions you can make about the sequences and how much data you have available.

For example, you could use:

$$p(x_t \mid x_{t-1}, c) = \frac{\pi(x_t, x_{t-1}, c)}{\sum_i \pi(x_i, x_{t-1}, c)}$$

With distributions like this, if there are 21 different numbers occurring in your sequences, you would have to estimate $21 \cdot 21 \cdot 2 = 882$ parameters $\pi(x_t, x_{t-1}, c)$, plus $21 \cdot 2 = 42$ parameters for $p(x_0 \mid c)$, plus 2 parameters for $p(c)$.
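The whole pipeline can be sketched in a few dozen lines: estimate the class prior, the start distribution $p(x_0 \mid c)$, and the transition counts per class, then apply Bayes’ rule to score a new sequence. This is a minimal illustration, not a tuned implementation; the Laplace smoothing constant `alpha` is an assumption added here so that unseen transitions do not get zero probability.

```python
from collections import defaultdict
import math

def train(sequences, labels, alpha=1.0):
    """Fit a first-order Markov model per class from labelled sequences.

    Counts are Laplace-smoothed with pseudo-count `alpha` (an assumption of
    this sketch) so unseen starts/transitions keep nonzero probability.
    """
    vocab = sorted({v for s in sequences for v in s})
    start = {c: defaultdict(float) for c in (0, 1)}  # counts for p(x0 | c)
    trans = {c: defaultdict(float) for c in (0, 1)}  # counts for p(x_t | x_{t-1}, c)
    prior = {0: 0.0, 1: 0.0}                         # counts for p(c)
    for seq, c in zip(sequences, labels):
        prior[c] += 1
        start[c][seq[0]] += 1
        for prev, cur in zip(seq, seq[1:]):
            trans[c][(prev, cur)] += 1

    def log_likelihood(seq, c):
        n = len(vocab)
        ll = math.log((start[c][seq[0]] + alpha) / (prior[c] + alpha * n))
        for prev, cur in zip(seq, seq[1:]):
            row = sum(trans[c][(prev, v)] for v in vocab)
            ll += math.log((trans[c][(prev, cur)] + alpha) / (row + alpha * n))
        return ll

    def predict_proba(seq):
        # Bayes' rule: p(c=1 | x) proportional to p(x | c=1) p(c=1),
        # computed in log space for numerical stability.
        total = prior[0] + prior[1]
        log_joint = {c: log_likelihood(seq, c) + math.log(prior[c] / total)
                     for c in (0, 1)}
        m = max(log_joint.values())
        z = sum(math.exp(v - m) for v in log_joint.values())
        return math.exp(log_joint[1] - m) / z

    return predict_proba

# Toy data from the question
seqs = [(7, 5, 21, 3, 3), (21, 7, 5, 1), (12, 21, 7, 5, 11, 1), (21, 5, 7, 1)]
labels = [1, 1, 0, 0]
predict = train(seqs, labels)
p = predict((1, 21, 7, 5, 3))  # confidence that the property is true
```

With only four training sequences the output is dominated by the smoothing, but with a large training set the counts take over, and the estimates approach the maximum-likelihood transition frequencies.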

If the assumptions of your model are not met, it can help to fine-tune the parameters directly with respect to the classification performance, for example by minimizing the average log-loss

$$-\frac{1}{|\mathcal{D}|} \sum_{(x, c) \in \mathcal{D}} \log p(c \mid x)$$

using gradient descent.
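As a rough sketch of that fine-tuning step, one can parameterize the transition rows as unconstrained logits (a softmax then yields valid distributions $p(x_t \mid x_{t-1}, c)$) and descend on the average log-loss. The toy vocabulary, sequences, learning rate, and the finite-difference gradient below are all illustrative assumptions; a real implementation would use analytic or automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4  # toy vocabulary {0, 1, 2, 3}, chosen here for illustration
seqs = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1]]  # made-up toy data
labels = [1, 1, 0, 0]

# Unconstrained logits; a row-wise softmax gives p(x_t | x_{t-1}, c),
# so the simplex constraint on each transition row holds automatically.
theta = rng.normal(scale=0.1, size=(2, V, V))
log_prior = np.log([0.5, 0.5])

def log_lik(theta, seq, c):
    logp = theta[c] - np.logaddexp.reduce(theta[c], axis=1, keepdims=True)
    return sum(logp[prev, cur] for prev, cur in zip(seq, seq[1:]))

def avg_log_loss(theta):
    loss = 0.0
    for seq, c in zip(seqs, labels):
        joint = np.array([log_lik(theta, seq, k) + log_prior[k] for k in (0, 1)])
        loss -= joint[c] - np.logaddexp.reduce(joint)  # -log p(c | x)
    return loss / len(seqs)

# Plain gradient descent; forward finite differences stand in for the
# analytic gradient, which is fine at this tiny scale.
eps, lr = 1e-5, 0.2
init_loss = avg_log_loss(theta)
for _ in range(300):
    base = avg_log_loss(theta)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t = theta.copy()
        t[idx] += eps
        grad[idx] = (avg_log_loss(t) - base) / eps
    theta -= lr * grad

final_loss = avg_log_loss(theta)
```

Because this optimizes the conditional likelihood $p(c \mid x)$ directly rather than the joint $p(x, c)$, it can compensate somewhat when the Markov assumption is violated, at the price of losing the generative interpretation of the parameters.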

**Attribution**
*Source: Link, Question Author: Roman Starkov, Answer Author: Lucas*