
Parameter Estimation

The goal of logistic regression is to estimate the $ K+1$ unknown parameters $ \boldsymbol{\beta}$ in Eq. 1. This is done with maximum likelihood estimation, which entails finding the set of parameter values for which the probability of the observed data is greatest. The maximum likelihood equation is derived from the probability distribution of the dependent variable. Since each $ {y_i}$ represents a binomial count in the $ i^{th}$ population, the joint probability density function of $ \boldsymbol{Y}$ is:

$\displaystyle f(\boldsymbol{y}\vert\boldsymbol{\beta}) = \prod_{i=1}^N \frac{n_i!}{y_i!(n_i-y_i)!} \pi_i^{y_i}(1-\pi_i)^{n_i-y_i}$ (2)

For each population, there are $ {n_i\choose{y_i}}$ different ways to arrange $ {y_i}$ successes from among $ {n_i}$ trials. Since the probability of a success for any one of the $ {n_i}$ trials is $ {\pi_i}$, the probability of $ {y_i}$ successes is $ {\pi_{i}^{y_i}}$. Likewise, the probability of $ {n_i-y_i}$ failures is $ (1-\pi_{i})^{n_i-y_i}$.
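
As a concrete numerical illustration (not part of the original text), Eq. 2 can be evaluated directly once the $ \pi_i$ are known. The sketch below uses Python with hypothetical data; the function name and example values are invented for the example.

    import numpy as np
    from scipy.special import comb

    def joint_probability(y, n, pi):
        # Eq. 2: product over populations of C(n_i, y_i) * pi_i^y_i * (1 - pi_i)^(n_i - y_i)
        y, n, pi = map(np.asarray, (y, n, pi))
        return np.prod(comb(n, y) * pi**y * (1.0 - pi)**(n - y))

    # Hypothetical grouped data for N = 3 populations
    y = np.array([4, 7, 2])            # observed successes
    n = np.array([10, 12, 8])          # trials in each population
    pi = np.array([0.35, 0.60, 0.20])  # success probability in each population
    print(joint_probability(y, n, pi))

In practice such products are usually evaluated on the log scale to avoid numerical underflow, which also anticipates the log likelihood introduced below.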

The joint probability density function in Eq. 2 expresses the values of $ \boldsymbol{y}$ as a function of known, fixed values for $ \boldsymbol{\beta}$. (Note that $ \boldsymbol{\beta}$ is related to $ \boldsymbol{\pi}$ by Eq. 1.) The likelihood function has the same form as the probability density function, except that the roles of its arguments are reversed: the likelihood function expresses the values of $ \boldsymbol{\beta}$ in terms of known, fixed values for $ \boldsymbol{y}$. Thus,

$\displaystyle L(\boldsymbol{\beta}\vert\boldsymbol{y}) = \prod_{i=1}^N \frac{n_i!}{y_i!(n_i-y_i)!} \pi_i^{y_i}(1-\pi_i)^{n_i-y_i}$ (3)

The maximum likelihood estimates are the values for $ \boldsymbol{\beta}$ that maximize the likelihood function in Eq. 3. The critical points of a function (maxima and minima) occur when the first derivative equals 0. If the second derivative evaluated at that point is less than zero, then the critical point is a maximum (for more on this see a good calculus text, such as Spivak [14]). Thus, finding the maximum likelihood estimates requires computing the first and second derivatives of the likelihood function. Taking the derivative of Eq. 3 with respect to $ \boldsymbol{\beta}$ directly is difficult because of the many multiplicative terms. Fortunately, the likelihood equation can be considerably simplified.

First, note that the factorial terms do not contain any of the $ \pi_i$. As a result, they are essentially constants that can be ignored: maximizing the equation without the factorial terms will give the same result as if they were included. Second, using the fact that $ a^{x-y} = a^x/a^y$ and rearranging terms, the equation to be maximized can be written as:

$\displaystyle \prod_{i=1}^N \displaystyle\biggl(\frac{\pi_i}{1-\pi_i}\biggr)^{y_i}(1-\pi_i)^{n_i}$ (4)

Note that exponentiating both sides of Eq. 1 gives

$\displaystyle \displaystyle\biggl(\frac{\pi_i}{1-\pi_i}\biggr) = e^{\sum_{k=0}^{K} x_{ik}\beta_k}$ (5)

which, after solving for $ \pi_i$, becomes

$\displaystyle \pi_i = \displaystyle\biggl(\frac{e^{\sum_{k=0}^{K} x_{ik}\beta_k}}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}}\biggr)$ (6)
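
Eq. 6 is the logistic (sigmoid) transformation of the linear predictor $ \sum_{k=0}^{K} x_{ik}\beta_k$. As a minimal sketch (the names are mine, and I assume the usual convention that the first column of the design matrix is a constant $ x_{i0}=1$), Eq. 6 can be computed for every population at once:

    import numpy as np

    def probabilities(X, beta):
        # Eq. 6: pi_i = exp(sum_k x_ik * beta_k) / (1 + exp(sum_k x_ik * beta_k))
        eta = X @ beta                       # linear predictor for each population
        return np.exp(eta) / (1.0 + np.exp(eta))

    # Hypothetical design matrix (leading column of ones) and coefficients
    X = np.array([[1.0, 2.0],
                  [1.0, 3.5],
                  [1.0, 5.0]])
    beta = np.array([-1.0, 0.4])
    print(probabilities(X, beta))

The algebraically equivalent form $ 1/(1+e^{-\sum_{k=0}^{K} x_{ik}\beta_k})$ is often preferred for numerical stability; the code above simply transcribes Eq. 6 as written.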

Substituting Eq. 5 for the first term and Eq. 6 for the second term, Eq. 4 becomes:

$\displaystyle \prod_{i=1}^N (e^{\sum_{k=0}^{K} x_{ik}\beta_k})^{y_i} \displaystyle\biggl(1-\frac{e^{\sum_{k=0}^{K} x_{ik}\beta_k}}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}}\biggr)^{n_i}$ (7)

Use $ (a^x)^y=a^{xy}$ to simplify the first product, and rewrite the 1 in the second product as $ \frac{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}}$ so that the second factor reduces to $ \bigl(1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}\bigr)^{-n_i}$. Eq. 7 can now be written as:

$\displaystyle \prod_{i=1}^N (e^{y_i\sum_{k=0}^{K} x_{ik}\beta_k}) (1+e^{\sum_{k=0}^{K} x_{ik}\beta_k})^{-n_i}$ (8)

This is the kernel of the likelihood function to maximize. However, it is still cumbersome to differentiate and can be simplified a great deal further by taking its log. Since the logarithm is a monotonic function, any maximum of the likelihood function will also be a maximum of the log likelihood function and vice versa. Thus, taking the natural log of Eq. 8 yields the log likelihood function:

$\displaystyle l(\boldsymbol{\beta}) = \sum_{i=1}^N y_i \biggl(\sum_{k=0}^{K} x_{ik}\beta_k\biggr) - n_i \cdot \log(1+e^{\sum_{k=0}^{K} x_{ik}\beta_k})$ (9)
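
Eq. 9 translates directly into code. The following sketch (function and argument names are my own, continuing the conventions assumed above) evaluates the log likelihood for grouped data:

    import numpy as np

    def log_likelihood(beta, X, y, n):
        # Eq. 9: sum_i [ y_i * (x_i . beta) - n_i * log(1 + exp(x_i . beta)) ]
        eta = X @ beta
        return np.sum(y * eta - n * np.log1p(np.exp(eta)))

Because only the kernel in Eq. 8 was retained, this value differs from the log of Eq. 3 by the constant contributed by the factorial terms, which has no effect on where the maximum occurs.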

To find the critical points of the log likelihood function, set the first derivative of Eq. 9 with respect to each $ \beta_k$ equal to zero. In differentiating Eq. 9, note that

$\displaystyle \frac{\partial}{\partial\beta_k} \sum_{k=0}^{K} x_{ik}\beta_k = x_{ik}$ (10)

since the other terms in the summation do not depend on $ \beta_k$ and can thus be treated as constants. In differentiating the second half of Eq. 9, take note of the general rule that $ \frac{\partial}{\partial x} \log y = \frac{1}{y}\frac{\partial y}{\partial x}$. Thus, differentiating Eq. 9 with respect to each $ \beta_k$,


$\displaystyle \frac{\partial l(\beta)}{\partial \beta_k}$ $\displaystyle =$ $\displaystyle \sum_{i=1}^N y_i x_{ik} - n_i \cdot \frac{1}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}} \cdot \frac{\partial}{\partial \beta_k} \biggl( 1+e^{\sum_{k=0}^{K} x_{ik}\beta_k} \biggr)$  
  $\displaystyle =$ $\displaystyle \sum_{i=1}^N y_i x_{ik} - n_i \cdot \frac{1}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}} \cdot e^{\sum_{k=0}^{K} x_{ik}\beta_k} \cdot \frac{\partial}{\partial \beta_k} \sum_{k=0}^{K} x_{ik}\beta_k$  
  $\displaystyle =$ $\displaystyle \sum_{i=1}^N y_i x_{ik} - n_i \cdot \frac{1}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}} \cdot e^{\sum_{k=0}^{K} x_{ik}\beta_k} \cdot x_{ik}$  
  $\displaystyle =$ $\displaystyle \sum_{i=1}^N y_i x_{ik} - n_i \pi_i x_{ik}$ (11)
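
All $ K+1$ components of Eq. 11 can be computed at once as $ \boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{n}\circ\boldsymbol{\pi})$, where $ \circ$ denotes the element-wise product. A minimal sketch under the same assumptions as before:

    import numpy as np

    def gradient(beta, X, y, n):
        # Eq. 11: component k is sum_i (y_i - n_i * pi_i) * x_ik
        eta = X @ beta
        pi = np.exp(eta) / (1.0 + np.exp(eta))
        return X.T @ (y - n * pi)

The matrix form avoids an explicit loop over the $ K+1$ components.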

The maximum likelihood estimates for $ \boldsymbol\beta$ can be found by setting each of the $ K+1$ equations in Eq. 11 equal to zero and solving for each $ \beta_k$. Each such solution, if any exists, specifies a critical point, either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite; that is, if $ \boldsymbol{z}^T\boldsymbol{H}\boldsymbol{z} < 0$ for every nonzero vector $ \boldsymbol{z}$, where $ \boldsymbol{H}$ denotes that matrix (for a precise definition of matrix definiteness see [7]). Another useful property of this matrix is that the inverse of its negative, evaluated at the maximum likelihood estimates, provides the estimated variance-covariance matrix of the parameter estimates. The matrix is formed by differentiating each of the $ K+1$ equations in Eq. 11 a second time with respect to each element of $ \boldsymbol\beta$, denoted by $ \beta_{k^\prime}$. The general form of the matrix of second partial derivatives is


$\displaystyle \frac{\partial^2 l(\beta)}{\partial \beta_k \partial\beta_{k^\prime}}$ $\displaystyle =$ $\displaystyle \frac{\partial}{\partial\beta_{k^\prime}} \sum_{i=1}^N y_i x_{ik} - n_i x_{ik} \pi_i$  
  $\displaystyle =$ $\displaystyle \frac{\partial}{\partial\beta_{k^\prime}} \sum_{i=1}^N - n_i x_{ik} \pi_i$  
  $\displaystyle =$ $\displaystyle - \sum_{i=1}^N n_i x_{ik} \frac{\partial}{\partial\beta_{k^\prime}} \biggl( \frac{e^{\sum_{k=0}^{K} x_{ik}\beta_k}}{1+e^{\sum_{k=0}^{K} x_{ik}\beta_k}} \biggr)$ (12)

To evaluate the derivative in Eq. 12 we will make use of two general rules of differentiation. First, a rule for differentiating exponential functions:

$\displaystyle \frac{\mathrm{d}}{\mathrm{d} x} e^{u(x)} = e^{u(x)} \cdot \frac{\mathrm{d}}{\mathrm{d} x} u(x)$ (13)

In our case, let $ u(x) = \sum_{k=0}^{K} x_{ik}\beta_k$. Second, the quotient rule for differentiating the quotient of two functions:

$\displaystyle \biggl(\frac{f}{g}\biggr)^\prime(a) = \frac{g(a) \cdot f^\prime(a) - f(a) \cdot g^\prime(a)}{[g(a)]^2}$ (14)

Applying these two rules together allows us to evaluate the derivative in Eq. 12.


$\displaystyle \frac{\mathrm{d}}{\mathrm{d} x} \frac{e^{u(x)}}{1+e^{u(x)}}$ $\displaystyle =$ $\displaystyle \frac{(1+e^{u(x)}) \cdot e^{u(x)} \frac{\mathrm{d}}{\mathrm{d} x} u(x) - e^{u(x)} \cdot e^{u(x)} \frac{\mathrm{d}}{\mathrm{d} x} u(x)}{(1+e^{u(x)})^2}$  
  $\displaystyle =$ $\displaystyle \frac{e^{u(x)} \frac{\mathrm{d}}{\mathrm{d} x} u(x)}{(1+e^{u(x)})^2}$  
  $\displaystyle =$ $\displaystyle \frac{e^{u(x)}}{1+e^{u(x)}} \cdot \frac{1}{1+e^{u(x)}} \cdot \frac{\mathrm{d}}{\mathrm{d} x} u(x)$ (15)
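
Eq. 15 is the familiar result that the derivative of the logistic function $ e^{u}/(1+e^{u})$ is the function itself times $ 1/(1+e^{u})$ times the derivative of its argument. As a purely illustrative check (using the sympy library; the symbols are mine), the identity can be verified symbolically:

    import sympy as sp

    x = sp.symbols('x')
    u = sp.Function('u')(x)
    f = sp.exp(u) / (1 + sp.exp(u))

    lhs = sp.diff(f, x)                              # left-hand side of Eq. 15
    rhs = f * (1 / (1 + sp.exp(u))) * sp.diff(u, x)  # right-hand side of Eq. 15
    print(sp.simplify(lhs - rhs))                    # expected output: 0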

Thus, Eq. 12 can now be written as:

$\displaystyle - \sum_{i=1}^N n_i x_{ik} \pi_i(1-\pi_i)x_{ik^\prime}$ (16)
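
In matrix notation, Eq. 16 states that the matrix of second partial derivatives is $ -\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}$, where $ \boldsymbol{W}$ is the diagonal matrix with entries $ n_i\pi_i(1-\pi_i)$. The sketch below (names and example usage are mine, not from the text) assembles this matrix, checks negative definiteness through its eigenvalues, and inverts its negative to obtain the estimated variance-covariance matrix of the parameter estimates:

    import numpy as np

    def hessian(beta, X, n):
        # Eq. 16: element (k, k') is -sum_i n_i * pi_i * (1 - pi_i) * x_ik * x_ik'
        eta = X @ beta
        pi = np.exp(eta) / (1.0 + np.exp(eta))
        W = np.diag(n * pi * (1.0 - pi))
        return -X.T @ W @ X

    # Example usage at some estimate beta_hat (hypothetical):
    #   H = hessian(beta_hat, X, n)
    #   print(np.linalg.eigvalsh(H))   # all eigenvalues < 0  =>  negative definite
    #   cov = np.linalg.inv(-H)        # estimated variance-covariance matrix at the MLE

This matrix is the main ingredient of the Newton-Raphson iterations taken up in the next section.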


