# How to run linear regression in a parallel/distributed way for big data setting?

I am working on a very large linear regression problem, with data size so large that they have to be stored on a cluster of machines. It will be way too big to aggregate all the samples into one single machine’s memory (even disk)

To do regression these data, I am thinking about a parallel approach, i.e., run regression on each individual box, and then calculate the beta based on the statistics of each individual beta (probably a mean or a median)

does this make any sense ? if so, how should I get the total expected $R^2$ from each individual $R^2$?

Yes, running linear regression in parallel has been done. For example, Xiangrui Meng et al. (2016) for Machine Learning in Apache Spark. The way it works is using stochastic gradient descent (SGD). In section 3, core features, the author mentioned:

Generalized linear models are learned via optimization algorithms which parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations.

An example on how SGD works can be found in my answer here: How could stochastic gradient descent save time comparing to standard gradient descent?

Note, the notation is not consistent with the link I provided, I feel matrix notation is better in this question.

To do a linear regression we are trying to do

The derivative is

In small data settings, we can set the derivative to $0$ and solve it directly. (e.g., QR decomposition in R.) In big data settings, the data matrix $X$ is too big to be stored in memory, and may be hard to solve directly. (I am not familiar with how to do QR decomposition or Cholesky decomposition for huge matrices).

One way to parallelize this is by trying to use an iterative method: stochastic gradient descent, where we can approximate the gradient using a subset of the data. (If we use $X_s$, $y_s$ to represent a subset of the data, the gradient can be approximated by $2X_s^T(X_s\beta-y_s)$, and we can update $\beta$ with the approximated gradient).

In addition, for the $R^2$ statistic, we can compute $R^2$ for all data in parallel or approximate it by using a subset of the data.

Intuition on how it works (mapreduce paradigm):

I keep saying approximation using a subset; the intuition for why this works can be described in the following example: suppose I have 100 billion data points and we want to calculate the average of all data points. Suppose conducting such an operation takes a very long time, and further that the whole data cannot be stored in memory.

What we can do is to just take a subset, say 1 billion items, and calculate the average of these. The approximation thus produced should not be far away from the truth (i.e., using the whole data).

To parallelize, we can use 100 computers, with each of them taking a different subset of the 1 billion data points and calculating the average of these. (Commonly referred to as the MAP step). Finally, run another average on these 100 numbers (a.k.a. the REDUCE step).

Note the “mapreduce paradigm” would work well in some cases, but not well in others. For the example, the “average” operation mentioned earlier is very easy, because we know $\text{mean}()=\text{mean}(x)+\text{mean(y)}$, (assuming the length of $x$ and $y$ are the same). For some iterative methods, i.e., the current iteration is dependent on previous iteration results, it is hard to parallelize. Stochastic gradient descent solves this problem by approximating the gradient using a subset of data. And details can be found in @user20160 ‘s answer.

References:

Xiangrui Meng et al. (2016). MLlib: Machine Learning in Apache Spark