Assuming the encoder's output distribution is Gaussian, the model only needs to predict its mean ($\mu$) and standard deviation ($\sigma$). We then sample the latent variable from this Gaussian distribution. The parameters of the sampled latent distribution should match those of the true distribution, which is enforced with a KL divergence term.
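As a minimal sketch of the sampling step and the KL term, the snippet below draws a latent from a diagonal Gaussian and evaluates the closed-form KL divergence. It assumes the target distribution is a standard normal $N(0, I)$, which is the usual VAE choice but is not stated above; the function names and the example values of `mu` and `sigma` are hypothetical.

```python
import math
import random


def sample_latent(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) per dimension as z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumption: the target distribution is a standard normal.
    """
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))


rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.5]  # hypothetical encoder outputs
z = sample_latent(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The KL term is zero only when `mu` is all zeros and `sigma` is all ones, so minimising it pulls the predicted parameters toward the assumed target.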
We want to replace $p_{\theta}(z|x)$ with $q_{\phi}(z)$, since we don't have the ground truth for $p_{\theta}(z|x)$.
$$
\begin{aligned}
\log p_\theta(x) &= E_{q_\phi(z)} [\log p_\theta (x)]\\
&= E_{q_\phi(z)} \left[\log \left(\frac{p_\theta (x,z)}{p_\theta (z|x)}\right)\right]\\
&= E_{q_\phi(z)} \left[\log \left(\frac{p_\theta (x,z)}{q_\phi(z)}\frac{q_\phi(z)}{p_\theta (z|x)}\right)\right]\\
&= E_{q_\phi(z)} \left[\log \left(\frac{p_\theta (x,z)}{q_\phi(z)}\right)\right]+E_{q_\phi(z)}\left[\log \left(\frac{q_\phi(z)}{p_\theta (z|x)}\right)\right]\\
&= {\color{orange}E_{q_\phi(z)} \left[\log \left(\frac{p_\theta (x,z)}{q_\phi(z)}\right)\right]}+KL(q_{\phi}(z)||p_{\theta}(z|x))\\
\log p_\theta(x)&= {\color{orange}\mathcal{L}_{ELBO}}+KL(q_{\phi}(z)||p_{\theta}(z|x))\\
{\color{orange}\mathcal{L}_{ELBO}} &=\log p_\theta(x) - KL(q_{\phi}(z)||p_{\theta}(z|x))\\
\end{aligned}
$$
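The decomposition $\log p_\theta(x) = \mathcal{L}_{ELBO} + KL(q_{\phi}(z)||p_{\theta}(z|x))$ can be checked numerically on a toy problem. The sketch below uses a discrete latent with three values; the joint probabilities `p_xz` and the distribution `q` are made-up numbers chosen only for illustration.

```python
import math

# Hypothetical joint p(x, z) for one fixed x and z in {0, 1, 2}.
p_xz = [0.10, 0.25, 0.05]
# Any valid distribution over z can play the role of q(z).
q = [0.2, 0.5, 0.3]

p_x = sum(p_xz)                      # marginal p(x) = sum_z p(x, z)
posterior = [p / p_x for p in p_xz]  # true posterior p(z | x)

# ELBO = E_q[ log( p(x, z) / q(z) ) ]
elbo = sum(qz * math.log(pz / qz) for qz, pz in zip(q, p_xz))
# KL( q(z) || p(z | x) ) = E_q[ log( q(z) / p(z | x) ) ]
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

log_px = math.log(p_x)
print(log_px, elbo + kl)  # the two values agree up to floating-point error
```

Because the KL term is non-negative, the ELBO never exceeds $\log p_\theta(x)$, and the gap closes exactly when `q` equals the true posterior.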
Variational Inference
Importance Sampling to ELBO
Variational EM
refs
Variational Autoencoder (Kingma & Welling, 2014)
Randomly initialise the centroids.
Find the nearest centroid for each unquantised vector.
If a centroid has low usage, randomly re-initialise it.
Compute the mean of the unquantised vectors assigned to each centroid.
Update each centroid with an exponential moving average (EMA) of the old centroid and the new mean.
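The steps above can be sketched as a single update function. This is an illustration under stated assumptions, not a reference implementation: the function name `update_codebook`, the `decay` and `min_usage` parameters, and the choice to re-initialise low-usage centroids from randomly picked input vectors are all hypothetical.

```python
import numpy as np


def update_codebook(codebook, vectors, decay=0.99, min_usage=1, rng=None):
    """One k-means-style EMA update; returns (new_codebook, assignments)."""
    rng = np.random.default_rng() if rng is None else rng
    # Squared distances between every vector (N, D) and every centroid (K, D).
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)  # index of the nearest centroid per vector
    counts = np.bincount(assign, minlength=len(codebook))

    new_codebook = codebook.copy()
    for k in range(len(codebook)):
        if counts[k] < min_usage:
            # Low-usage centroid: re-initialise from a random input vector (assumption).
            new_codebook[k] = vectors[rng.integers(len(vectors))]
        else:
            center = vectors[assign == k].mean(axis=0)  # mean of assigned vectors
            # EMA between the old centroid and the new mean.
            new_codebook[k] = decay * codebook[k] + (1.0 - decay) * center
    return new_codebook, assign


rng = np.random.default_rng(0)
vectors = rng.normal(size=(256, 4))  # stand-in for unquantised encoder outputs
# Random init: pick 8 distinct input vectors as the starting centroids.
codebook = vectors[rng.choice(len(vectors), size=8, replace=False)]
codebook, assign = update_codebook(codebook, vectors, rng=rng)
```

A `decay` close to 1 makes the centroids move slowly, which smooths out noise from individual batches.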
VQ loss: The L2 error between the embedding space and the encoder outputs.
Commitment loss: A measure to encourage the encoder output to stay close to the embedding space and to prevent it from fluctuating too frequently from one code vector to another.
where $\text{sg}[\cdot]$ is the stop_gradient operator.
$$
L = \underbrace{\|\mathbf{x} - D(\mathbf{e}_k)\|_2^2}_{\textrm{reconstruction loss}} +
\underbrace{\|\text{sg}[E(\mathbf{x})] - \mathbf{e}_k\|_2^2}_{\textrm{VQ loss}} +
\underbrace{\beta \|E(\mathbf{x}) - \text{sg}[\mathbf{e}_k]\|_2^2}_{\textrm{commitment loss}}
$$
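The three terms of the loss can be evaluated as in the sketch below. It computes forward values only: NumPy has no autodiff, so the stop_gradient operator acts as the identity here and the comments mark where it would matter in a framework (for example `.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX). The function name, argument names, and the default `beta=0.25` are assumptions for illustration.

```python
import numpy as np


def vq_vae_loss(x, x_recon, z_e, e_k, beta=0.25):
    """Forward value of the three-term loss.

    x_recon stands for D(e_k), z_e for E(x), e_k for the selected code vector.
    """
    recon = np.sum((x - x_recon) ** 2)
    # VQ loss: sg[E(x)] means gradients would only update the code vector e_k.
    vq = np.sum((z_e - e_k) ** 2)
    # Commitment loss: sg[e_k] means gradients would only update the encoder.
    commit = beta * np.sum((z_e - e_k) ** 2)
    return recon + vq + commit, (recon, vq, commit)


x = np.array([1.0, 2.0])
x_recon = np.array([1.0, 1.0])
z_e = np.array([0.0, 0.0])
e_k = np.array([1.0, 1.0])
total, (recon, vq, commit) = vq_vae_loss(x, x_recon, z_e, e_k)
```

The VQ and commitment terms have the same forward value up to the factor `beta`; they differ only in which parameters receive the gradient.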