# Gradient and vector derivatives: row or column vector?

Quite a lot of references (including Wikipedia, http://www.atmos.washington.edu/~dennis/MatrixCalculus.pdf, and http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.php) define the derivative of a function with respect to a vector as the partial derivatives of the function arranged in a row (so the derivative of a scalar-valued function is a row vector). In this convention the gradient and the vector derivative are transposes of each other. The benefit of this convention is that we can interpret the derivative as a linear map that tells you the rate of change in each direction, while the gradient remains a vector: it gives the direction and magnitude of the greatest rate of change.

I recently read Gentle's Matrix Algebra (http://books.google.com/books/about/Matrix_Algebra.html?id=Pbz3D7Tg5eoC), and he seems to use another convention: he defines the gradient as equal to the vector derivative, resulting in a column arrangement (so the derivative of a scalar-valued function is a column vector). With this arrangement, every differentiation result is the transpose of the corresponding result in the other convention. The benefit of this convention, I'm guessing, is simply that the gradient and the derivative are equal, so for optimization tasks you can just differentiate instead of differentiating and then taking the transpose.

I think the tension is between the Jacobian and the gradient. In the row convention the Jacobian follows directly from the definition of the derivative, but you have to apply a transpose to get the gradient; in the column convention the gradient needs no transpose, but the Jacobian does. So if you prefer to think of the derivative as a linear map, the first convention makes sense; if you prefer to think of it as a vector/direction, the second makes sense. Either way, you just have to be consistent.
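To make the transpose relationship concrete, here is a minimal NumPy sketch (the names `A`, `x`, and the quadratic form are my own illustration, not taken from any of the references above). For $f(x) = x^\intercal A x$, the row-convention derivative is $x^\intercal(A + A^\intercal)$ and the column-convention gradient is $(A + A^\intercal)x$; the sketch checks that they are transposes of each other and that both agree with finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda x: x @ A @ x  # scalar-valued quadratic form f(x) = x^T A x

# Row convention: Df(x) is the (1, 3) row vector x^T (A + A^T)
row_deriv = (x @ (A + A.T)).reshape(1, 3)

# Column convention: grad f(x) is the (3, 1) column vector (A + A^T) x
col_grad = ((A + A.T) @ x).reshape(3, 1)

# The two conventions give transposes of each other
assert np.allclose(row_deriv, col_grad.T)

# Central finite differences recover the same partial derivatives
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(fd, row_deriv.ravel(), atol=1e-5)
```

The underlying numbers are identical in both conventions; only the arrangement (row vs. column) differs.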

Which of these conventions is more commonly used in Machine Learning? Am I going to get hopelessly confused if I spend too much time reading work in the “wrong” convention?

If you consider a linear map between vector spaces (such as the Jacobian) $J : U \rightarrow V$, $u \mapsto v = Ju$, the shapes have to agree with the matrix-vector definition: the components of $v$ are the inner products of the rows of $J$ with $u$.
In e.g. linear regression, the output (a scalar in this case) is a weighted combination of the features, $v = \mathbf{w}^\intercal \mathbf{u}$, which again requires the inner product.
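As a quick shape check of that argument (a sketch with made-up numbers; `J`, `u`, and `w` are hypothetical), the matrix-vector product only works out when each row of the Jacobian is paired with the input by an inner product, and the regression case is just the one-row special case:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
J = rng.standard_normal((m, n))  # Jacobian of a map from R^n to R^m
u = rng.standard_normal(n)

v = J @ u  # shape (m,): each component is an inner product <row_i(J), u>
assert all(np.isclose(v[i], J[i] @ u) for i in range(m))

# Linear regression special case: m = 1, so the Jacobian is a single row w^T
w = rng.standard_normal(n)
v_scalar = w @ u  # the weighted combination w^T u, a scalar
```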