0%

EM Algorithm

I'd like to talk about something about EM algorithm in my understanding.

This post is mainly based on Richard Xu's machine learning course.

Gaussian Mixture Model

Gaussian Mixture Model (GMM) (k-mixture) is defined as: \[ p(X | \Theta) = \sum_{l=1}^k \alpha_l \mathcal{N}(X | \mu_l, \Sigma_l) \tag{1} \]

\[ \sum_{l=1}^k \alpha_l = 1 \tag{2} \]

and \[ \Theta = \{ \alpha_1, \dots, \alpha_k, \mu_1, \dots, \mu_k, \Sigma_1, \dots, \Sigma_k \} \tag{3} \]

For data \(X = \{ x_1, \dots, x_n \}\), we introduce latent variable \(Z = \{ z_1, \dots, z_n \}\), each \(z_i\) indicates which mixture components \(x_i\) belongs to. (The introduction of latent variable should not change the marginal distribution of \(p(X)\).)

Then we can use MLE to estimate \(\Theta\) : \[ \Theta_{MLE} = \mathop{\arg\max}\limits_{\Theta} \Big( \sum_{i=1}^N \log \big[ \sum_{l=1}^k \alpha_l \mathcal{N}(x_i | \mu_l, \Sigma_l) \big] \Big) \tag{4} \]

This formula is difficult to solve because it is in 'log-of-sum' form. So, we solve this problem in an iterative way, called Expectation Maximization.

Expectation Maximization

Instead of performing \[ \Theta_{MLE} = \mathop{\arg\max}\limits{\Theta} \Big( \mathcal{L}(\Theta) \Big) = \mathop{\arg\max}\limits_{\Theta} \Big( \log \big( p(X|\Theta) \big ) \Big) \tag{5} \]

we assume some latent variable \(Z\) to the model, such that we generate a series of \(\Theta = \{ \Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(t)} \}\).

For each iteration of the E-M algorithm, we perform: \[ \Theta^{(g+1)} =\mathop{\arg\max}\limits_{\Theta} \Big( \int_Z \log \big( p(X, Z | \Theta) p (Z | X, \Theta^{(g)}) \big) \Big) dZ \tag{6} \]

We must ensure convergence: \[ \log p(X|\Theta^{(g+1)}) \ge \log p(X|\Theta^{(g)}) \tag{7} \] Proof : \[ E_{p(Z|X, \Theta^{(g)})} \Big[\log p(X|\Theta) \Big] = E_{p(Z|X, \Theta^{(g)})} \Big[\log p(X,Z|\Theta) - \log p(Z|X,\Theta) \Big] \tag{8} \]

\[ \log p(X|\Theta) = \int_Z \log p(X,Z|\Theta) p(Z|X, \Theta^{(g)}) dZ - \int_Z \log p(Z|X,\Theta) p(Z|X, \Theta^{(g)}) dZ \tag{9} \]

denote \[ Q(\Theta, \Theta^{(g)}) = \int_Z \log p(X,Z|\Theta) p(Z|X, \Theta^{(g)}) dZ \\ H(\Theta, \Theta^{(g)}) = \int_Z \log p(Z|X,\Theta) p(Z|X, \Theta^{(g)}) dZ \] then we have \[ \log p(X|\Theta) =Q(\Theta, \Theta^{(g)}) - H(\Theta, \Theta^{(g)}) \tag{10} \] Because \[ Q(\Theta^{(g)}, \Theta^{(g)}) \le Q(\Theta^{(g+1)}, \Theta^{(g)}) \\ H(\Theta^{(g)}, \Theta^{(g)}) \ge H(\Theta^{(g+1)}, \Theta^{(g)}) \] the second inequality can be derived using Jensen's inequality.

Hence , \[ \log p(X|\Theta^{(g+1)}) \ge \log p(X|\Theta^{(g)}) \tag{11} \]

Using EM algorithm to solve GMM

Put GMM into this frame work. \[ \Theta^{(g+1)} = \mathop{\arg\max}\limits_{\Theta} \big[ Q(\Theta, \Theta^{(g)}) \big] =\mathop{\arg\max}\limits_{\Theta} \Big( \int_Z \log \big( p(X, Z | \Theta) p (Z | X, \Theta^{(g)}) \big) \Big) dZ \tag{12} \] E-Step:

Define $ p(X, Z | )$ : \[ p(X, Z | \Theta) = \Pi_{i=1}^n p(x_i, z_i | \Theta) = \Pi_{i=1}^n p(x_i|z_i, \Theta) p(z_i|\Theta) = \Pi_{i=1}^n \alpha_{z_i} \mathcal{N}(\mu_{z_i}, \Sigma_{z_i}) \tag{13} \] Define \(p (Z | X, \Theta)\) : \[ p (Z | X, \Theta) = \Pi_{i=1}^{n} p(z_i|x_i, \Theta) = \Pi_{i=1}^{n} \frac{\alpha_{z_i} \mathcal{N} (\mu_{z_i}, \Sigma_{z_i})}{\sum_{l=1}^k \alpha_l \mathcal{N} (\mu_l, \Sigma_l)} \tag{14} \] Then \[ Q(\Theta, \Theta^{(g)}) = \sum_{z_1= 1}^k \sum_{z_2= 1}^k \dots \sum_{z_N= 1}^k \Big( \sum_{i=1}^N \big[ \log \alpha_{z_i} + \log \mathcal{N}(\mu_{z_i}, \Sigma_{z_i}) \big] * \Pi_{i=1}^{N} p(z_i| x_i, \Theta^{(g)}) \Big) \] \[ = \sum_{i=1}^N \sum_{l = 1}^k \big( \log \alpha_{l} + \log \mathcal{N}(\mu_{l}, \Sigma_{l}) \big) p(l| x_i, \Theta^{(g)}) \tag{15} \]

M-Step: \[ Q(\Theta, \Theta^{(g)}) = \sum_{i=1}^N \sum_{l = 1}^k \log (\alpha_{l}) p(l| x_i, \Theta^{(g)}) + \sum_{i=1}^N \sum_{l = 1}^k\log \mathcal{N}(\mu_{l}, \Sigma_{l}) p(l| x_i, \Theta^{(g)}) \tag{16} \] The first term contains only \(\alpha\) and the second term contains only \(\mu, \Sigma\), so we can maximize both terms independantly.

Maximizing \(\alpha\) means that: \[ \frac{\partial \sum_{i=1}^N \sum_{l = 1}^k \log (\alpha_{l}) p(l| x_i, \Theta^{(g)})}{\partial\alpha_1 \dots \partial\alpha_k} = 0 \tag{17} \] subject to \(\sum_{l=1}^k = 1\).

Solving this problem via Lagrangian Multiplier, we have \[ \alpha_l = \frac{1}{N} \sum_{i=1}^N p(l| x_i, \Theta^{(g)}) \tag{18} \] Similarly, we can solve \(\mu\) and \(\Sigma\).