The following describes what I’m trying to accomplish, but it’s possible an alternative problem statement can describe my goal:

I want to

divide the following numbers into groups where the variances of the numbers within each group are not too large, and the differences between the averages of groups are not too small

compare the distribution obtained in the end with the “perfect” ones and see how “different” it is from being perfect.

Layman’s explanation of goalI’m trying to calculate income distribution, and determine the “income brackets” each population is in. The income bracket is supposed to be self-adjusting based on the input data.

My goal is to ultimately measure or calculate the difference between the income brackets. I assume there will be many brackets, and want to see how far “apart” each tier is.

Here is a sample of hourly income for a sample set of a population of 20, and a total income of 3587:

`Population= 10 pop=2 population=5 population =3 10, 11,13,14,14,14,14,14,15,20, 40,50 ,90,91,92,93,94 999,999,900`

How can I use mathematical concepts to group, sort and analyze data that acts like income distribution over a given population?

At the end of the calculation, I want to determine tiered income distribution, where a perfect distribution would look (something) like this

`(each person makes $10 more per hour than the previous; total is 3587) 89, 99, 109, 119, 129, 139, 149, 159, 169, 179, 189, 199, 209, 219, 229, 239, 249, 259, 269, 279`

or this:

`(evenly distributed groups of people make the same per hour) (gaps between income groups is consistent and not "too far") (income total is 3587) 99 99 99 129 129 129 159 159 159 199 199 199 229 229 229 269 269 269`

QuestionHow should I analyze the population groups, and measure the gap in a way that will tell me how much is needed to make it more like the last two model sets listed above?

**Answer**

Cluster analysis with a single variable makes perfect sense whenever there is

some dimension along which values can be arranged. This could be a

**measurement scale**, **time** or **space**.

Given ordered data on some **measurement scale**, there might be

interest in looking for relative breaks within a frequency distribution

(antimodes, in one terminology).

Note of caution: However, breaks defining bins that are, or that might seem, arbitrary are widely shunned in several areas of statistical science, and there is widespread and marked preference for binning with equal intervals, and very often for avoiding binning altogether when possible. This is partly a matter of taste, partly one of convention: practices have shifted as it becomes easier to store datasets in their entirety.

A **time** series could be divided into spells,

epochs, periods, whatever, ideally with relatively small differences within

subseries and relatively large differences between subseries. The same

problem arises for **space** whenever a single spatial dimension (horizontal or vertical) is to be subdivided. In geological and other sciences, this is often studied under the heading of zonation.

Note that any formal clustering should always be accompanied by appropriate

plotting of the data (for example, using a dot or quantile or line plot),

which indeed may make clear either that breaks are obvious (so that formal

clustering is merely decorative) or that convincing breaks do not exist (so

that formal clustering may be pointless).

Consider a toy example of values ordered by magnitude:

```
14 15 16 23 24 25 56 57 58
```

where it is evident that a three-group clustering

```
14 15 16 | 23 24 25 | 56 57 58
```

is sensible. Whether the ordering is on the values themselves, or in time or

in space, the data can always be laid out in one dimension, which gives

special structure to the problem. Thus, although more general clustering

methods can be used, that special structure ideally should be exploited. $k$

groups devised for $n$ values are defined by placing $k – 1$ markers (in the

example above, $k – 1 = 2$); there are $n – 1$ possible places to put them.

There are thus $n – 1 \choose k – 1$ possible clusterings. However, if $k$ is free

to vary, then the total number of possible clusterings is $2^{n – 1}$, as each

value can be in the same group as each neighbour, or not. For even modest $n$,

that is a large number.

The problem can be made precise (Fisher 1958; Hartigan 1975) by placing

markers to minimise, for a given number of groups, the

$$\text{sum over groups of variability around group centres}.$$

A sum of squared deviations from group means will spring to mind as the most

obvious possibility. Sum of absolute

deviations from group medians, and other measures, might well be entertained.

Hartigan (1975) showed how a dynamic programming approach makes such

computation straightforward and presented Fortran code. A Stata implementation (Cox 2007) is `group1d`

to be installed from SSC.

Cox, N.J. 2007. GROUP1D: Stata module for grouping or clustering in one dimension. http://ideas.repec.org/c/boc/bocode/s456844.html

Fisher, W.D. 1958. On grouping for maximum homogeneity. *Journal, American
Statistical Association* 53: 789-98.

Hartigan, J.A. 1975. *Clustering algorithms.* New York: John Wiley. Ch.6.

**Postscript** This approach seems to match the first part of the specific question. I have pitched it generally because I think the formulation is of some general interest (and because it was easy for me to recycle part of the documentation of Cox 2007). But if the specific goal is to compare an income distribution with a reference uniform distribution, I don’t see that binning has any part to play at all. That is a standard problem in economics for which Lorenz curves and inequality measures are the starting points. In essence, you can compare quantile to quantile or percent point to percent point.

**Attribution***Source : Link , Question Author : makerofthings7 , Answer Author : Nick Cox*