My questions are:
 Are generalized linear models (GLMs) guaranteed to converge to a global maximum? If so, why?
 Furthermore, what constraints are there on the link function to insure convexity?
My understanding of GLMs is that they maximize a highly nonlinear likelihood function. Thus, I would imagine that there are several local maxima and the parameter set you converge to depends on the initial conditions for the optimization algorithm. However, after doing some research I have not found a single source which indicates that there are multiple local maxima. Furthermore, I am not so familiar with optimization techniques, but I know the NewtonRaphson method and IRLS algorithm are highly prone to local maxima.
Please explain if possible both on an intuitive and mathematical basis!
EDIT: dksahuji answered my original question, but I want to add the followup question [2] above. (“What constraints are there on the link function to insure convexity?”)
Answer
The definition of exponential family is:
$$
p(x\theta) = h(x)\exp(\theta^T\phi(x) – A(\theta)),
$$
where $A(\theta)$ is the log partition function. Now one can prove that the following three things hold for 1D case (and they generalize to higher dimensions–you can look into properties of exponential families or log partition):

$ \frac{dA}{d\theta} = \mathbb{E}[\phi(x)]$

$ \frac{d^2A}{d\theta^2} = \mathbb{E}[\phi^2(x)] \mathbb{E}[\phi(x)]^2 = {\rm var}(\phi(x)) $

$ \frac{ \partial ^2A}{\partial\theta_i\partial\theta_j} = \mathbb{E}[\phi_i(x)\phi_j(x)] \mathbb{E}[\phi_i(x)] \mathbb{E}[\phi_j(x)] = {\rm cov}(\phi(x)) \Rightarrow \Delta^2A(\theta) = {\rm cov}(\phi(x))$
The above result prove that $A(\theta)$ is convex(as ${\rm cov}(\phi(x))$ is positive semidefinite). Now we take a look at likelihood function for MLE:
\begin{align}
p(\mathcal{D}\theta) &= \bigg[\prod_{i=1}^{N}{h(x_i)}\bigg]\ \exp\!\big(\theta^T[\sum_{i=1}^{N}\phi(x_i)] – NA(\theta)\big) \\
\log\!\big(p(\mathcal{D}\theta)\big) &= \theta^T\bigg[\sum_{i=1}^{N}\phi(x_i)\bigg] – NA(\theta) \\
&= \theta^T[\phi(\mathcal{D})] – NA(\theta)
\end{align}
Now $\theta^T[\phi(\mathcal{D})]$ is linear in theta and $A(\theta)$ is concave. Therefore, there is a unique global maximum.
There is a generalized version called curved exponential family which would also be similar. But most of the proofs are in canonical form.
Attribution
Source : Link , Question Author : DankMasterDan , Answer Author : gung – Reinstate Monica