My question deals with how to be able to assert that an “improved”

evolutionary algorithm is indeed improved (at least from a statistic

point of view) and not just random luck (a concern given the

stochastic nature of these algorithms).Let’s assume I am dealing with a standard GA (before) and an “improved”

GA (after). And I have a suite of 8 test problems.I run both both of these algorithms repeatedly, for instance 10 times(?)

through each of the 8 test problems and and record how many

generations it took to come up with the solution. I would start out

with the same initial random population (using same seed).Would I use a paired t-test for means to verify that any difference

(hopefully an improvement) between the averages for each test question

would be statistically significant? Should I run these algorithms more

than 10 times for each test/pair?Any pitfalls I should be aware of? I assume I could use this approach

for any (evolutionary) algorithm comparison.Or am I really on the wrong track here? I am basically looking for a

way to compare two implementations of an evolutionary algorithm and

report on how well one might work compared to the other.Thanks!

**Answer**

You would not use a paired sample t-test. The reason for this is that a particular random seed cannot be assumed to bias the outcome of both algorithms in the same way, even if that random seed is only used to generate the population and not for later operations such as mutation and selection. In other words, its logically possible that, under one algorithm, a given population will evolve better than the average for that algorithm, but will perform in the opposite way under another. If you have reason to believe that there is a similar connection between seed and performance for both algorithms, you can test this using a Pearson correlation coefficient to compare each seed’s performance on both tests. By default, however, I would assume that there is no connection, especially if you have reasonably large populations.

As far as running more than 10 times, of course more samples are always better, though your computational resources obviously may be a limiting factor. It could be a good idea to generate a power curve, which will show you the relationship between the size of difference needed for statistical significance at you’re alpha level, and the SD and n. In other words, at a given n and SD, how big does the difference have to be? http://moon.ouhsc.edu/dthompso/CDM/power/hypoth.htm <– see bottom of page for power curve info.

Finally, if you are running a genetic algorithm that actually has a defined stopping point, as yours does, you can just do a plain unpaired t-test on the number of generations needed to find the solution. Otherwise, quantifying algorithm performance tends to get a bit trickier

As far as pitfalls, and generalizability of algorithm efficiency to other problems, you really cannot take effectiveness of your algorithm for granted when porting it to other problems. In my experience, genetic algorithms usually have to be tweaked quite a bit for each new problem that you apply them to. Having said that, depending on how diverse your set of 8 tests is, they may give you some indication of how generalizable your results are, and within which scope of applications they are generalizable.

**Attribution***Source : Link , Question Author : Levon , Answer Author : Matt Munson*