I know that the similarity between two sequences of the same length and with the same elements can be quantified by rank-order correlation. But how can I measure the similarity between two sequences of different lengths that have only some elements in common?
For example, if I have three rank ordered numeric sequences like this:
sequence A: 1,2,3,4,5,6,7,8,9
sequence B: 2,3,4,5,6,7,8,9,10,11,12,13
sequence C: 4,2,9,7,11,13,14,16,18
Intuitively, I would say sequences A and B are more similar, since they have more numbers in common and the common numbers appear in the same order in both sequences. Sequences A and C are less similar, since they have fewer numbers in common and the common numbers appear in different orders in each sequence. Is there a quantitative measure that captures both the order similarity of the common elements and the percentage of common elements in two sequences?
As mentioned in @ttnphns’ comment, there are plenty of dissimilarity measures. Have a look at the review by Studer & Ritschard (2015), who examine the sensitivity of the measures to ordering, position (timing) and duration (how many times a state is repeated). The measures addressed in that paper are all provided by the seqdist function of the TraMineR R package.
If you are primarily interested in the part that your two sequences do not share, an edit distance such as optimal matching may be the solution. Optimal matching measures the minimal cost of transforming one sequence into the other by means of indels (insertions or deletions) and substitutions, and you can set both the indel and the substitution costs. For example, if the difference between ranks 1 and 3 should count twice as much as the difference between ranks 1 and 2, you could set the substitution costs to the rank differences. Such a measure works for sequences of different lengths: it simply accounts for the cost of the indels needed to make the sequences of equal length.
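To make the idea concrete, here is a minimal Python sketch of an optimal matching style edit distance, applied to the three sequences from the question. It is a hand-written dynamic program, not the TraMineR implementation; the function name om_distance, its parameters, and the default costs (indel = 1, substitution = 2) are assumptions chosen for illustration.

```python
def om_distance(a, b, indel=1.0, sub_cost=None):
    """Minimal cost of turning sequence a into sequence b.

    Hypothetical helper for illustration only.
    indel    -- cost of one insertion or deletion (assumed default: 1)
    sub_cost -- function (x, y) -> substitution cost
                (assumed default: 0 if equal, else 2)
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 2.0
    n, m = len(a), len(b)
    # d[i][j] = cost of transforming a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel          # delete everything in a[:i]
    for j in range(1, m + 1):
        d[0][j] = j * indel          # insert everything in b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                              # deletion
                d[i][j - 1] + indel,                              # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),   # substitution
            )
    return d[n][m]


A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
C = [4, 2, 9, 7, 11, 13, 14, 16, 18]

# Constant substitution cost (assumed defaults)
print(om_distance(A, B), om_distance(A, C))

# Substitution cost set to the rank difference, as suggested above
rank_diff = lambda x, y: float(abs(x - y))
print(om_distance(A, B, sub_cost=rank_diff),
      om_distance(A, C, sub_cost=rank_diff))
```

With the constant costs assumed here, the distance is 5 between A and B and 14 between A and C, which matches the intuition that A is closer to B. The absolute values depend entirely on the chosen costs, so only the comparison between pairs is meaningful.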
If you prefer to focus more on the similarity in the ordering of the elements in the sequences, other measures, such as optimal matching of transitions, could be a better choice.
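Optimal matching of transitions is not reproduced here. As a much simpler illustration of looking at ordering separately from overlap, the following Python sketch reports two numbers for a pair of sequences: the share of common elements (Jaccard index) and a Kendall-type rank correlation computed only on the common elements. This is a swapped-in, hand-rolled approach and not one of the measures from the review; the function names are hypothetical, and the sketch assumes that no element is repeated within a sequence.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct elements common to both sequences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def kendall_tau_common(a, b):
    """Kendall tau-a on the elements shared by a and b.

    Assumes no element occurs twice in a sequence.
    Returns None when fewer than two elements are shared.
    """
    common = set(a) & set(b)
    pos_b = {x: i for i, x in enumerate(b)}
    # Shared elements in the order of a, replaced by their position in b
    ranks = [pos_b[x] for x in a if x in common]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return None
    concordant = sum(1 for p, q in pairs if p < q)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
C = [4, 2, 9, 7, 11, 13, 14, 16, 18]

for name, other in (("B", B), ("C", C)):
    print("A vs", name, jaccard(A, other), kendall_tau_common(A, other))
```

For the example data, A and B share 8 of 13 distinct elements in identical order (tau of 1), while A and C share 4 of 14 with a tau of 1/3. How to combine the two numbers into a single score, if at all, is a modelling choice this sketch leaves open.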
Hope this helps.