
The Newton-Raphson Method

Setting the equations in Eq. 11 equal to zero results in a system of $ K+1$ nonlinear equations, each with $ K+1$ unknown variables. The solution to this system is a vector with elements $ \beta_k$. After verifying that the matrix of second partial derivatives is negative definite and that the solution is the global maximum rather than a local maximum, we can conclude that this vector contains the parameter estimates for which the observed data would have the highest probability of occurrence. However, solving a system of nonlinear equations is not easy: the solution cannot be derived algebraically as it can in the case of linear equations, and must instead be estimated numerically using an iterative process. Perhaps the most popular method for solving systems of nonlinear equations is Newton's method, also called the Newton-Raphson method.

Newton's method begins with an initial guess for the solution and then uses the first two terms of the Taylor polynomial, evaluated at the initial guess, to produce another estimate that is closer to the solution. This process continues until it (hopefully) converges to the actual solution. Recall that the Taylor polynomial of degree $ n$ for $ f$ at the point $ x = x_0$ is defined as the first $ n$ terms of the Taylor series for $ f$:

$\displaystyle \sum_{i=0}^n \frac{f^{(i)} (x_0)}{i!} (x - x_0)^i$ (17)

provided that the first $ n$ derivatives of $ f$ at $ x_0$ all exist. The first degree Taylor polynomial is also the equation of the line tangent to $ f$ at the point $ (x_0, f(x_0))$. The point at which this tangent line crosses the x-axis, $ (x_1, 0)$, serves as the next approximation to the root of $ f$, that is, the point where $ f(x) = 0$. The first step of Newton's method is therefore to take the first degree Taylor polynomial as an approximation for $ f$ and set it equal to zero:

$\displaystyle f(x_0) + f^{\prime}(x_0) \cdot (x - x_0) = 0$ (18)

Solving for $ x$, we have:

$\displaystyle x = x_0 - \frac{f(x_0)}{f^{\prime}(x_0)}$ (19)

This new value of $ x$ is the next approximation for the root. We let $ x_1 = x$ and continue in the same manner to generate $ x_2, x_3, \ldots$, until successive approximations converge.
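
To make the scalar iteration concrete, the following is a minimal Python sketch of this procedure. The function name newton, the example function $ f(x) = x^2 - 2$ (whose positive root is $ \sqrt{2}$), and the tolerance and iteration limit are illustrative choices, not taken from the text.

    # Minimal sketch of the scalar Newton-Raphson iteration described above.
    def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
        """Iterate x <- x - f(x)/f'(x) until successive approximations converge."""
        x = x0
        for _ in range(max_iter):
            x_new = x - f(x) / f_prime(x)      # Eq. 19
            if abs(x_new - x) < tol:           # successive approximations agree
                return x_new
            x = x_new
        raise RuntimeError("Newton's method did not converge")

    # Example: root of f(x) = x^2 - 2 starting from x0 = 1.
    root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
    print(root)  # approximately 1.41421356...
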

Generalizing Newton's method to a system of equations is not difficult. In our case, the equations whose roots we want to find are those in Eq. 11, the first partial derivatives of the log-likelihood function. Since Eq. 11 is a system of $ K+1$ equations whose roots must be found simultaneously, it is more convenient to express each step of the Newton-Raphson method in matrix notation. We can write Eq. 11 as $ l^{\prime}(\boldsymbol{\beta})$. Letting $ \boldsymbol{\beta}^{(0)}$ represent the vector of initial approximations for each $ \beta_k$, the first step of Newton-Raphson can be expressed as:

$\displaystyle \boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(0)} + [-l^{\prime\prime}(\boldsymbol{\beta}^{(0)})]^{-1} \cdot l^{\prime}(\boldsymbol{\beta}^{(0)})$ (20)

Let $ \boldsymbol{\mu}$ be a column vector of length $ N$ with elements $ \mu_i = n_i \pi_i$. Note that each element of $ \boldsymbol{\mu}$ can also be written as $ \mu_i = E(y_i)$, the expected value of $ y_i$. Using matrix multiplication, we can show that:

$\displaystyle l^{\prime}(\boldsymbol{\beta}) = \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{\mu})$ (21)

is a column vector of length $ K+1$ whose elements are $ \frac{\partial l(\beta)}{\partial \beta_k}$, as derived in Eq. 11. Now, let $ \boldsymbol{W}$ be a square matrix of order $ N$, with elements $ n_i\pi_i(1-\pi_i)$ on the diagonal and zeros everywhere else. Again, using matrix multiplication, we can verify that

$\displaystyle l^{\prime\prime}(\boldsymbol{\beta}) = - \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}$ (22)

is a $ (K+1) \times (K+1)$ square matrix with elements $ \frac{\partial^2 l(\beta)}{\partial \beta_k \partial\beta_{k^\prime}}$. Now, Eq. 20 can be written:

$\displaystyle \boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(0)} + [\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}]^{-1} \cdot \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{\mu})$ (23)
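
For concreteness (assuming, as in the earlier sections, that the design matrix stores elements $ x_{ik}$ with $ x_{i0} = 1$ for the intercept), the matrix products in Eq. 23 expand element-wise as

$\displaystyle \left[\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{\mu})\right]_k = \sum_{i=1}^N x_{ik}(y_i - n_i\pi_i) \qquad\text{and}\qquad \left[\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}\right]_{kk^\prime} = \sum_{i=1}^N x_{ik}\, n_i\pi_i(1-\pi_i)\, x_{ik^\prime}$

which match the first and second partial derivatives of the log likelihood given above.
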

Continue applying Eq. 23 until there is essentially no change between the elements of $ \boldsymbol{\beta}$ from one iteration to the next. At that point, the maximum likelihood estimates are said to have converged, and the inverse of the negative of Eq. 22, $ [\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}]^{-1}$, evaluated at the final estimates, provides the estimated variance-covariance matrix of the parameter estimates.
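
A minimal NumPy sketch of this iteration follows, assuming grouped binomial data with design matrix X (first column of ones), success counts y, and trial counts n. The function name, the starting value of zero for $ \boldsymbol{\beta}^{(0)}$, and the convergence criterion are illustrative assumptions, not prescribed by the text.

    import numpy as np

    def newton_raphson_logistic(X, y, n, tol=1e-8, max_iter=25):
        beta = np.zeros(X.shape[1])               # beta^(0): initial guess
        for _ in range(max_iter):
            eta = X @ beta                        # linear predictor
            pi = 1.0 / (1.0 + np.exp(-eta))       # fitted probabilities
            mu = n * pi                           # mu_i = n_i * pi_i = E(y_i)
            W = np.diag(n * pi * (1.0 - pi))      # diagonal weight matrix
            XtWX = X.T @ W @ X                    # negative of l''(beta), Eq. 22
            score = X.T @ (y - mu)                # l'(beta), Eq. 21
            step = np.linalg.solve(XtWX, score)   # [X^T W X]^{-1} X^T (y - mu)
            beta = beta + step                    # Eq. 23
            if np.max(np.abs(step)) < tol:        # essentially no change
                break
        cov = np.linalg.inv(XtWX)                 # estimated variance-covariance matrix
        return beta, cov

Solving the linear system for the update step, rather than forming an explicit inverse at every iteration, is a standard numerical choice; the inverse is computed only once at the end, for the variance-covariance matrix.
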



Scott Czepiel
http://czep.net/contact.html