Let’s say we have the input (predictor) and output (response) data points A, B, C, D, E and we want to fit a line through the points. This is a simple problem to illustrate the question, but can be extended to higher dimensions as well.
The current best fit or hypothesis is represented by the black line above. The blue arrow (→) represents the vertical distance between a data point and the current best fit, obtained by drawing a vertical line from the point until it intersects the line.
The green arrow (→) is drawn such that it is perpendicular to the current hypothesis at the point of intersection, and thus represents the least distance between the data point and the current hypothesis.
For points A and B, the line drawn perpendicular to the current best guess happens to coincide with the vertical line (the line perpendicular to the x axis), so the blue and green arrows overlap; for points C, D and E they do not.
The least squares principle defines the cost function for linear regression using the vertical distance from each data point (A, B, C, D or E) to the estimated hypothesis (→) at any given training cycle:

J(θ) = (1/2m) Σ (hθ(xi) − yi)²
Here (xi,yi) represents the data points, and hθ(xi) represents the best fit.
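As a concrete sketch of that cost, here is the sum of squared vertical distances (the blue arrows) for a candidate line; the data values and parameters below are illustrative, not taken from the figure:

```python
import numpy as np

# Hypothetical data points (standing in for A-E) and a candidate line
# y = theta0 + theta1 * x; these values are assumptions for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def least_squares_cost(theta0, theta1, x, y):
    """Sum of squared vertical distances (blue arrows) between the
    data points and the current hypothesis, scaled by 1/(2m)."""
    m = len(x)
    h = theta0 + theta1 * x          # h_theta(x_i)
    return np.sum((h - y) ** 2) / (2 * m)

print(least_squares_cost(0.0, 1.0, x, y))
```

Minimizing this over theta0 and theta1 gives the ordinary least squares fit.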
The minimum distance between a point (A, B, C, D or E) and the current best guess is represented by a perpendicular line drawn from that point onto the line (green arrows).
The goal of the least squares function is to define an objective which, when minimized, gives the least distance between the hypothesis and all the points combined, but which won't necessarily minimize the distance between the hypothesis and any single input point.
Why don't we define the cost function for linear regression as the least distance between the input data point and the hypothesis, i.e. the length of the perpendicular dropped from the input data point onto the hypothesis, as given by (→)?
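The alternative objective proposed here can be sketched using the standard point-to-line distance formula; the variable names and sample points are illustrative assumptions:

```python
import numpy as np

def perpendicular_cost(theta0, theta1, x, y):
    """Sum of squared perpendicular (shortest-path) distances -- the green
    arrows -- from each point (x_i, y_i) to the line y = theta0 + theta1*x."""
    # Point-to-line distance for the line written as theta1*x - y + theta0 = 0.
    d = (theta1 * x - y + theta0) / np.hypot(theta1, 1.0)
    return np.sum(d ** 2)

# Illustrative points (assumed, not taken from the figure):
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
print(perpendicular_cost(0.0, 1.0, x, y))
```

Note that each perpendicular distance is the vertical distance divided by sqrt(1 + theta1²), so the two costs coincide only when the line is horizontal.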
When you have noise in both the dependent variable (vertical errors) and the independent variable (horizontal errors), the least squares objective function can be modified to incorporate these horizontal errors. The problem is how to weight the two types of errors. This weighting usually depends on the ratio of the variances of the two errors:
- If the variance of the vertical error is extremely large relative to the variance of the horizontal error, OLS is correct.
- If the variance of the horizontal error is extremely large relative to the variance of the vertical error, inverse least squares (in which x is regressed on y and the inverse of the coefficient estimate for y is used as the estimate of β) is appropriate.
- If the ratio of the variance of the vertical error to the variance of the horizontal error is equal to the ratio of the variances of the dependent and independent variables, we have the case of “diagonal” regression, in which a consistent estimate turns out to be the geometric mean of the OLS and inverse least squares estimators.
- If the ratio of these error variances is one, then we have the case of “orthogonal” regression, in which the sum of squared errors measured along a line perpendicular to the estimating line is minimized. This is what you had in mind.
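The four estimators above can be put side by side on simulated data. The setup below is an assumption for illustration (true slope 2, unit variances for both errors); the orthogonal slope uses the standard closed form for total least squares:

```python
import numpy as np

# Simulated errors-in-variables data (assumed: true slope 2, equal unit
# variances for the vertical and horizontal errors).
rng = np.random.default_rng(0)
x_true = np.linspace(0.0, 10.0, 200)
x = x_true + rng.normal(0.0, 1.0, x_true.size)        # horizontal error
y = 2.0 * x_true + rng.normal(0.0, 1.0, x_true.size)  # vertical error

x_c, y_c = x - x.mean(), y - y.mean()
sxx, syy, sxy = x_c @ x_c, y_c @ y_c, x_c @ y_c

# OLS: regress y on x (minimizes squared vertical errors).
beta_ols = sxy / sxx

# Inverse least squares: regress x on y, then invert the coefficient.
beta_ils = syy / sxy

# Diagonal regression: geometric mean of the OLS and ILS slopes.
beta_diag = np.sign(beta_ols) * np.sqrt(beta_ols * beta_ils)

# Orthogonal regression (error-variance ratio = 1): closed-form slope
# minimizing the sum of squared perpendicular distances.
beta_orth = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

print(beta_ols, beta_ils, beta_diag, beta_orth)
```

The OLS slope is attenuated toward zero by the horizontal error, the ILS slope is pushed away from zero, and the diagonal and orthogonal slopes land in between. (For general error-variance ratios this is Deming regression; SciPy also ships a general orthogonal-distance fitter in `scipy.odr`.)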
In practice, the great drawback of this procedure is that the ratio of the error variances is not usually known and cannot usually be estimated, so the path forward is not clear.