I’m trying to understand the context of the famous Minsky and Papert book “Perceptrons” from 1969, so critical to neural networks.
As far as I know, there were no other generic supervised learning algorithms yet except for the perceptron: decision trees started to become actually useful only in the late ’70s, random forests and SVMs are ’90s. It seems that the jackknife method was already known, but not k-fold cross-validation (’70s) or the bootstrap (1979?).
Wikipedia says the classical statistics frameworks of Neyman-Pearson and Fisher were still in disagreement in the ’50s, even though the first attempts at describing a hybrid theory had already appeared in the ’40s.
Therefore my question: what were the state-of-the-art methods of solving general problems of predicting from data?
I was curious about this, so I did some digging. I was surprised to find that recognizable versions of many common classification algorithms were already available in 1969 or thereabouts. Links and citations are given below.
It is worth noting that AI research was not always so focused on classification. There was a lot of interest in planning and symbolic reasoning, which are no longer in vogue, and labelled data was much harder to find. Not all of these articles may have been widely available then either: for example, the proto-SVM work was mostly published in Russian. Thus, this might over-estimate how much an average scientist knew about classification in 1969.
In a 1936 article in the Annals of Eugenics, Fisher described a procedure for finding a linear function which discriminates between three species of iris flowers, on the basis of their petal and sepal dimensions. That paper mentions that Fisher had already applied a similar technique to predict the sex of human mandibles (jaw bones) excavated in Egypt, in a collaboration with E. S. Martin and Karl Pearson (jstor), as well as in a separate cranial measurement project with a Miss Mildred Barnard (which I couldn’t track down).
The logistic function itself has been known since the 19th century, but mostly as a model for saturating processes, such as population growth or biochemical reactions. Tim links to JS Cramer’s article above, which is a nice history of its early days. By 1969, however, Cox had published the first edition of Analysis of Binary Data. I could not find the original, but a later edition contains an entire chapter on using logistic regression to perform classification. For example:
In discriminant analysis, the primary notion is that there are two distinct populations, defined by y=0,1, usually two intrinsically different groups, like two species of bacteria or plants, two different kinds of product, two distinct but rather similar drugs, and so on…. Essentially the focus in discriminant analysis is on the question: how do the two distributions differ most sharply? Often, this is put into a more specific form as follows. There is given a new vector x′ from an individual of unknown y. What can we say about that y….
Cover and Hart are often credited with inventing/discovering the k-nearest neighbor rule. Their 1967 paper contains a proof that, in the large-sample limit, the nearest-neighbor rule’s error rate is at most twice the Bayes error rate. However, they actually credit Fix and Hodges with inventing it in 1951, citing a technical report they prepared for the USAF School of Aviation Medicine (reprint via jstor).
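For readers who have not seen it written out, here is a minimal sketch of the k-nearest neighbor rule mentioned above. It is only an illustration, not code from any of the cited papers: the function name `knn_predict`, the Euclidean distance, and the toy data are all my own assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs. Euclidean distance is an
    assumption made for this sketch; other metrics are equally valid.
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    # Majority vote; ties are broken by whichever label was counted first.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
label_near_origin = knn_predict(train, (0.05, 0.05), k=3)
label_near_one = knn_predict(train, (1.0, 0.95), k=1)
```

With `k=1` this reduces to the plain nearest-neighbor rule.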
Rosenblatt published a technical report describing the perceptron in 1957 and followed it up with a book, Principles of Neurodynamics, in 1962. Continuous versions of backpropagation have been around since the early 1960s, including work by Kelley, Bryson, and Bryson & Ho (revised in 1975, but the original is from 1969). However, it wasn’t applied to neural networks until a bit later, and methods for training very deep networks are much more recent. This scholarpedia article on deep learning has more information.
I suspect using Bayes’ Rule for classification has been discovered and rediscovered many times; it is a pretty natural consequence of the rule itself. Signal detection theory developed a quantitative framework for deciding whether a given input was a “signal” or noise. Some of it came out of radar research after WWII, but it was rapidly adapted for perceptual experiments (e.g., by Green and Swets). I do not know who discovered that assuming independence between predictors works well, but work from the early 1970s seems to have exploited this idea, as summarized in this article. Incidentally, that article also points out that Naive Bayes was once called “idiot Bayes”!
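To make the independence assumption concrete, here is a rough sketch of a Naive Bayes classifier for discrete features. It is not taken from any of the historical sources: the function names, the add-one (Laplace) smoothing with `alpha`, the `n_values` parameter, and the toy “signal”/“noise” data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count class frequencies and per-class, per-feature value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict_naive_bayes(model, row, n_values=2, alpha=1.0):
    """Return the label maximizing log P(label) + sum_i log P(x_i | label).

    Treating the features as independent given the label is the "naive"
    step. `alpha` is add-one smoothing and `n_values` is the number of
    distinct values a feature can take (both assumptions of this sketch).
    """
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            seen = feature_counts[label][i][value]
            score += math.log((seen + alpha) / (count + alpha * n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical binary observations labelled as signal or noise.
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["signal", "signal", "signal", "noise", "noise", "noise"]
model = fit_naive_bayes(rows, labels)
guess_high = predict_naive_bayes(model, (1, 1))
guess_low = predict_naive_bayes(model, (0, 0))
```

Working in log space avoids underflow when many per-feature probabilities are multiplied together.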
Support Vector Machines
In 1962, Vapnik and Chervonenkis described the “Generalised Portrait Algorithm” (terrible scan, sorry), which looks like a special case of a support vector machine (or actually, a one-class SVM). Chervonenkis wrote an article entitled “Early History of Support Vector Machines” which describes this and their follow-up work in more detail. The kernel trick (kernels as inner products) was described by Aizerman, Braverman and Rozonoer in 1964. svms.org has a bit more about the history of support vector machines here.
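As a small illustration of what “kernels as inner products” means, the sketch below checks numerically that a degree-2 polynomial kernel on 2-D inputs equals an ordinary inner product after an explicit feature map. The particular kernel, the feature map `phi`, and the sample points are my own choices for the example and are not taken from the 1964 paper.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs such that phi(x) . phi(y) = (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# Hypothetical sample points.
x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, y)      # computed in the original 2-D space
explicit_value = dot(phi(x), phi(y))  # computed in the 3-D feature space
same = math.isclose(kernel_value, explicit_value)
```

The point is that the kernel gives the feature-space inner product without ever constructing `phi(x)`, which matters when the feature space is very large or infinite-dimensional.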