## Alternative to chi-square in evaluating the similarity of two distributions (ordered categorical variables)

In my study I compare Finnish and Russian expressions for different parts of the day. I conducted a survey and asked people to refer to a time interval with some non-numeric expression (e.g. if something happens between 1pm and 3pm you might refer to it with the phrase “afternoon”). The thing is that Russian, unlike …

## How to normalize the distance between two distributions

I’m creating a distance metric that is composed of multiple pairwise feature distances. The distance metric will be used in a clustering algorithm for a computer security problem; more specifically, the clustering algorithm will group together related “malicious instances”. Our hypothesis is that malicious instances will share similar characteristics (features), as opposed to benign …
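One possible way to keep each per-feature distance on a common scale is sketched below. It is only an illustration under stated assumptions, not the approach from the question: it assumes every feature is a discrete probability distribution, swaps in the Jensen-Shannon distance (with log base 2, which is bounded in [0, 1]) as the per-feature distance, and combines features with a weighted mean. The names `js_distance` and `combined_distance` are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions.

    Assumes p and q have the same length and each sums to 1.
    With log base 2 the result lies in [0, 1].
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute 0 by convention.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


def combined_distance(feature_distances, weights=None):
    """Weighted mean of per-feature distances (hypothetical combination rule).

    If every input distance is in [0, 1] and weights are non-negative,
    the result is also in [0, 1].
    """
    if weights is None:
        weights = [1.0] * len(feature_distances)
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, feature_distances)) / total
```

Because each term is already bounded, no single feature can dominate the combined value purely because of its scale; the weights remain a modelling choice.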

## Not able to understand KL decomposition

The bias-variance decomposition usually applies to regression data. We would like to obtain a similar decomposition for classification, when the prediction is given as a probability distribution over $C$ classes. Let $P = [P_1, \ldots, P_C]$ be the ground-truth class distribution associated with a particular input pattern. Assume the random estimator of class probabilities $\bar{P} = [\bar{P}_1, \ldots, \bar{P}_C]$ for the same input …

## Sentence sampling based on frequency

I have a database with 300k+ Russian sentences and their English translations. My goal is to use these sentences as flashcards, so that users can learn the top N most frequent Russian words (let’s assume N = 10k). A requirement is that the easiest sentences are shown first, and more complex sentences are introduced gradually …
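A minimal sketch of one way to order sentences by difficulty follows. It rests on several assumptions that are not in the question: words are obtained by lowercasing and splitting on whitespace (real Russian text would likely need lemmatization), word frequency is estimated from the sentence collection itself, and a sentence is as hard as its rarest word. The function names are hypothetical.

```python
from collections import Counter


def rank_words(sentences):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}


def difficulty(sentence, ranks):
    """Difficulty of a non-empty sentence = rank of its rarest word (assumption)."""
    return max(ranks[w] for w in sentence.lower().split())


def order_sentences(sentences, top_n):
    """Keep sentences that use only the top_n words, easiest first.

    Ties on difficulty are broken by sentence length, shorter first.
    """
    ranks = rank_words(sentences)
    usable = [
        s for s in sentences
        if s.split() and difficulty(s, ranks) <= top_n
    ]
    return sorted(usable, key=lambda s: (difficulty(s, ranks), len(s.split())))
```

With this ordering, a sentence only appears once every word in it falls within the current frequency cutoff, so raising `top_n` step by step introduces rarer vocabulary gradually.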