Exploring Different Methods for Calculating the Kullback-Leibler (KL) Divergence in Variational Autoencoder (VAE) Training
Introduction
In this article, we will explore various methods for calculating the Kullback-Leibler divergence (KL), a crucial component of the loss function used during Variational Autoencoder (VAE) training. KL measures the dissimilarity between probability distributions and plays a pivotal role in VAEs. Typically, VAE loss functions involve KL divergence, often applied to Normal distributions using analytical techniques.
Analytical Calculation
Consider two Normal distributions, p(x)=N(μ1,σ1) and q(x)=N(μ2,σ2). The analytical formula for the KL divergence is:
KL(p,q) = log(σ2/σ1) + (σ1² + (μ1 - μ2)²)/(2σ2²) - 1/2
For instance, when the prior distribution is N(0,1) and the posterior distribution is N(μ2,σ2) for the latent space of a VAE, KL(post, prior) can be computed in TensorFlow as:
-0.5 * (1.0 + tf.math.log(σ2) - tf.math.square(μ2) - tf.math.square(σ2))
If we calculate KL(prior, post), the formula changes to:
-0.5 * (1 + tf.math.log(tf.math.square(σ2)) - ((1 + tf.math.square(μ2))/tf.math.square(σ2)))
These distinct expressions emphasize the importance of understanding the order of arguments in KL divergence calculations.
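As a quick check of the two orderings, the general closed-form expression above can be compared with tfd.kl_divergence for a single pair of Normal distributions. The sketch below is illustrative only (the values of μ2 and σ2 are arbitrary):
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
### illustrative posterior N(mu2, sigma2) and prior N(0, 1)
mu2 = tf.constant(1.5, dtype=tf.float64)
sigma2 = tf.constant(0.7, dtype=tf.float64)
prior = tfd.Normal(loc=tf.constant(0.0, dtype=tf.float64), scale=tf.constant(1.0, dtype=tf.float64))
post = tfd.Normal(loc=mu2, scale=sigma2)
### general closed form: KL(p,q) = log(s2/s1) + (s1^2 + (u1-u2)^2)/(2*s2^2) - 0.5
def kl_normal(u1, s1, u2, s2):
    return tf.math.log(s2 / s1) + (s1**2 + (u1 - u2)**2) / (2.0 * s2**2) - 0.5
print(kl_normal(mu2, sigma2, 0.0, 1.0), tfd.kl_divergence(post, prior))  ### KL(post, prior)
print(kl_normal(0.0, 1.0, mu2, sigma2), tfd.kl_divergence(prior, post))  ### KL(prior, post)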
Monte Carlo Method
An alternative method to compute KL divergence in TensorFlow involves the use of the Monte Carlo module. The KL divergence DKL between the distributions p(x) and q(x) can be defined as the expectation of ln(p(x)/q(x)) under the p distribution:
DKL(p,q) = E_p(x)[ln(p(x)/q(x))] = ∫ p(x) ln(p(x)/q(x)) dx
The integrand involves the log-probability ratio at each value of x, weighted by p(x); in practice, the expectation is approximated by averaging this ratio over samples drawn from p(x).
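Before turning to the tfp.monte_carlo module, this expectation can be approximated directly by drawing samples from p and averaging the log-probability ratio. The minimal sketch below uses two arbitrary Normal distributions and an illustrative sample size:
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
p = tfd.Normal(loc=1.0, scale=0.5)   ### illustrative p(x)
q = tfd.Normal(loc=0.0, scale=1.0)   ### illustrative q(x)
x = p.sample(100000, seed=7)         ### samples drawn from p
kl_estimate = tf.reduce_mean(p.log_prob(x) - q.log_prob(x))
print(kl_estimate, tfd.kl_divergence(p, q))  ### Monte Carlo estimate vs closed form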
In the sections that follow, we will explore these different methods for calculating KL divergence and their practical implications in VAE training.
TensorFlow Implementation
In TensorFlow, using the tfp.monte_carlo module to obtain the expectation, we have:
kl_mcpostprior = tfp.monte_carlo.expectation(
f=lambda x: postdist.log_prob(x) - priordist.log_prob(x),
samples=postdist.sample(batch_size, seed=s),
log_prob=postdist.log_prob,
use_reparameterization=False
)
kl_mcpriorpost = tfp.monte_carlo.expectation(
f=lambda x: priordist.log_prob(x) - postdist.log_prob(x),
samples=priordist.sample(batch_size, seed=s),
log_prob=priordist.log_prob,
use_reparameterization=False
)
where x is the set of samples from the p distribution.
Now, we move on to the TensorFlow code that runs these different KL divergence expressions for Normal distributions and MultivariateNormal distributions.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras
from tensorflow.keras import layers
from IPython.display import display,Markdown
from tensorflow.keras import backend as K
import matplotlib.pyplot as plt
tfd = tfp.distributions
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
## the import of os can be skipped if there is no compatibility issue
## with the OpenMP library and TensorFlow
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
print('############################### now Normal distribution ######################################')
batch_size = 2500
latd = 3 ### features which will be latent dim in VAE code in a future paper
s = 540 ### random seed
means = np.random.randint(low=-4, high=5, size=(batch_size, latd))
stddevs = np.random.uniform(0.1, 2.0, size=(batch_size, latd))
means = tf.cast(means, dtype=tf.float64)
stddevs = tf.cast(stddevs, dtype=tf.float64)
##priordist2 = tfd.Laplace(loc=tf.zeros(latd, dtype=tf.float64),scale=tf.ones(latd,dtype=tf.float64))
##postdist2 = tfd.Laplace(loc=means, scale=stddevs)
priordist2 = tfd.Normal(loc=tf.zeros(latd, dtype=tf.float64),scale=tf.ones(latd,dtype=tf.float64))
postdist2 = tfd.Normal(loc=means, scale=stddevs)
kl1 = tfd.kl_divergence(priordist2, postdist2)
kl2 = tfd.kl_divergence(postdist2, priordist2)
print('Normal (postdist,priordist), (priordist,postdist):', tf.reduce_mean(kl2), tf.reduce_mean(kl1))
print('### analytical Kl-divergence for Normal distribution only ######')
'''
Let p(x)=N(μ1,σ1) and q(x)=N(μ2,σ2)
KL(p,q) = log(s2/s1) + (s1^2 + (u1-u2)^2)/(2*s2^2) - 0.5
prior = N(0,1) and post = N(u2,s2)
'''
klpostp = -0.5*(1 + tf.math.log(stddevs) - tf.math.square(means) - tf.math.square(stddevs))
klppost = -0.5*(1 + tf.math.log(tf.math.square(stddevs)) - ((1+tf.math.square(means))/tf.math.square(stddevs)) )
##print(klppost, klpostp)
kl1_lossred = tf.reduce_mean(klppost,axis=1)
kl2_lossred = tf.reduce_mean(klpostp,axis=1)
##print(kl_lossred, kl1_lossred)
print('kl_loss (post,prior), (prior,post):', tf.reduce_mean(kl2_lossred), tf.reduce_mean(kl1_lossred))
print('################## KL divergence tfd kl Monte Carlo for Normal distribution #############')
kl_mcpostp2 = tfp.monte_carlo.expectation(f=lambda x: postdist2.log_prob(x) - priordist2.log_prob(x),
samples=postdist2.sample(batch_size, seed=s),
log_prob=postdist2.log_prob, use_reparameterization=False)
kl_mcppost2 = tfp.monte_carlo.expectation(f=lambda x: priordist2.log_prob(x) - postdist2.log_prob(x),
samples=priordist2.sample(batch_size, seed=s),
log_prob=priordist2.log_prob, use_reparameterization=False)
lossmcpostp2 = tf.reduce_mean(kl_mcpostp2)
lossmcppost2 = tf.reduce_mean(kl_mcppost2)
print('loss MC Normal (post2,prior2) and (prior2,post2) :', lossmcpostp2 , lossmcppost2)
print('\n\n################## KL divergence tfd kl divergence for MultiVariateNormal #############')
priordist = tfd.MultivariateNormalDiag(tf.zeros(latd, dtype=tf.float64),tf.ones(latd,dtype=tf.float64))
postdist = tfd.MultivariateNormalDiag(means, stddevs)
kl_div = tfd.kl_divergence(postdist, priordist)
kl1_div = tfd.kl_divergence(priordist, postdist)
##print('kl_div both :', kl_div, kl1_div)
kl_loss = tf.reduce_mean(kl_div)
kl_loss1 = tf.reduce_mean(kl1_div)
print('kl_loss (postdist,priordist) and (priordist,postdist) :', kl_loss,kl_loss1)
print('################## KL divergence tfd kl Monte Carlo for MultiVariateNormal #############')
kl_mcpostp = tfp.monte_carlo.expectation(f=lambda x: postdist.log_prob(x) - priordist.log_prob(x),
samples=postdist.sample(batch_size, seed=s),
log_prob=postdist.log_prob, use_reparameterization=False)
kl_mcppost = tfp.monte_carlo.expectation(f=lambda x: priordist.log_prob(x) - postdist.log_prob(x),
samples=priordist.sample(batch_size, seed=s),
log_prob=priordist.log_prob, use_reparameterization=False)
##print('kl MC:', kl_mcpostp,kl_mcppost )
lossmcpostp = tf.reduce_mean(kl_mcpostp)
lossmcppost = tf.reduce_mean(kl_mcppost)
print('loss MC (post,prior) and (prior,post) :', lossmcpostp , lossmcppost)
print('############################ \n kl_Monte_Carlo direct calculation for MultiVariateNormal : \n ##############################')
samples = postdist.sample(batch_size, seed=s)
samples = tf.cast(samples, dtype=tf.float64)
sm0=postdist.log_prob(samples)
mix01 = tf.reduce_mean(sm0)
print(' reduce mean mix01:', mix01)
sm1 = priordist.log_prob(samples)
mix1 = tf.reduce_mean(sm1)
print(' reduce mean mix1:', mix1)
#############################################
samples = priordist.sample(batch_size, seed=s)
samples = tf.cast(samples, dtype=tf.float64)
sm0=priordist.log_prob(samples)
mix011 = tf.reduce_mean(sm0)
print(' reduce mean mix01:', mix011)
sm1 = postdist.log_prob(samples)
mix11 = tf.reduce_mean(sm1)
print(' reduce mean mix1:', mix11)
print('kl MC post.mean - prior.mean, prior.mean - post.mean :', mix01 - mix1, mix011 - mix11)
Running this script gives:
############################### Normal distribution ######################################
####################### KL_divergence from TensorFlow module ###########
Normal (postdist,priordist), (priordist,postdist): tf.Tensor(3.6711214654924613, shape=(), dtype=float64) tf.Tensor(18.29558840485741, shape=(), dtype=float64)
### analytical Kl-divergence for Normal distribution only ######
kl_loss (post,prior), (prior,post): tf.Tensor(3.5999187224713096, shape=(), dtype=float64) tf.Tensor(18.580399376942015, shape=(), dtype=float64)
################## KL divergence tfd kl Monte Carlo for Normal distribution #############
loss MC Normal (post2,prior2) and (prior2,post2) : tf.Tensor(3.670799454143481, shape=(), dtype=float64) tf.Tensor(18.44077590768171, shape=(), dtype=float64)
#######################################################################################
################## KL divergence tfd kl divergence for MultiVariateNormal #############
################## KL divergence using Tensorflow KL module ###################
kl_loss (postdist,priordist) and (priordist,postdist) : tf.Tensor(11.013364396477384, shape=(), dtype=float64) tf.Tensor(54.88676521457222, shape=(), dtype=float64)
################## KL divergence tfd kl Monte Carlo for MultiVariateNormal #############
loss MC (post,prior) and (prior,post) : tf.Tensor(11.017891615998414, shape=(), dtype=float64) tf.Tensor(54.45004747827108, shape=(), dtype=float64)
###############################################################
kl_Monte_Carlo direct calculation for MultiVariateNormal :
###############################################################
reduce mean mix01: tf.Tensor(-3.8289334972249156, shape=(), dtype=float64)
reduce mean mix1: tf.Tensor(-14.840782579019738, shape=(), dtype=float64)
reduce mean mix01: tf.Tensor(-4.260726351396556, shape=(), dtype=float64)
reduce mean mix1: tf.Tensor(-59.391981852309705, shape=(), dtype=float64)
kl MC post.mean - prior.mean, prior.mean - post.mean : tf.Tensor(11.011849081794821, shape=(), dtype=float64) tf.Tensor(55.13125550091315, shape=(), dtype=float64)
We used three methods to calculate the KL divergence between two Normal distributions: the TensorFlow KL module, the analytical formulas, and the Monte Carlo module in TensorFlow. These three methods give very close results:
TensorFlow KL module: 3.6711214654924613, 18.29558840485741
Analytical formula KL: 3.5999187224713096, 18.580399376942015
TensorFlow MC module: 3.670799454143481, 18.44077590768171
The same exercise is repeated for the MultivariateNormal distribution. We used three methods to calculate the KL divergence: the TensorFlow KL module, the Monte Carlo module in TensorFlow, and a direct Monte Carlo calculation. As previously, the results are very close:
TensorFlow KL module: 11.013364396477384, 54.88676521457222
TensorFlow MC module: 11.017891615998414, 54.45004747827108
Direct Monte Carlo: 11.011849081794821, 55.13125550091315
Exploring Gaussian Mixture Models (GMM) with TensorFlow Monte Carlo Module
Introduction
In the previous section, we demonstrated the accuracy of TensorFlow’s Monte Carlo (MC) module for a univariate Gaussian distribution. Now, let’s extend our investigation to the Gaussian Mixture Model (GMM).
Short Summary of Gaussian Mixture Model (GMM)
A Gaussian Mixture Model (GMM) is a probabilistic model that represents a complex distribution as a weighted sum of multiple Gaussian distributions. Each Gaussian component in the mixture model contributes to the overall distribution with its mean, variance, and weight. GMMs are widely used for various applications, including clustering, density estimation, and data generation. They offer the flexibility to model data with multiple underlying patterns or clusters, making them a valuable tool in machine learning and statistics.
In a Gaussian Mixture Model (GMM), the mean and variance of the overall distribution are determined by combining the means and variances of its individual Gaussian components, weighted by their respective mixing coefficients. Here’s how you calculate the mean and variance of a GMM:
Mean of GMM: The mean μ_GMM of a GMM is computed as a weighted sum of the means μ_i of its individual Gaussian components:
μ_GMM = Σ_{i=1}^{N} w_i · μ_i
Where:
- N is the number of Gaussian components in the mixture.
- w_i is the weight (or mixing coefficient) of the i-th Gaussian component.
- μ_i is the mean of the i-th Gaussian component.
Variance of GMM: The variance σ²_GMM of a GMM is calculated using a weighted sum of the variances σ_i² of its individual Gaussian components, along with a correction term accounting for the spread of the means:
σ²_GMM = Σ_{i=1}^{N} w_i · (σ_i² + (μ_i - μ_GMM)²)
Where:
- N is the number of Gaussian components in the mixture.
- w_i is the weight (or mixing coefficient) of the i-th Gaussian component.
- σ_i² is the variance of the i-th Gaussian component.
- μ_i is the mean of the i-th Gaussian component.
- μ_GMM is the mean of the GMM.
These formulas show how the mean and variance of a GMM are derived from the means, variances, and mixing coefficients of its constituent Gaussian components. The GMM’s mean is a weighted average of the component means, and its variance accounts for both the component variances and the spread of the component means.
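As a small numerical check of these two formulas, the sketch below evaluates them in NumPy for a hypothetical three-component mixture (the weights, means, and standard deviations are arbitrary):
import numpy as np
### hypothetical three-component mixture, weights sum to 1
w = np.array([0.2, 0.5, 0.3])
mu = np.array([-2.0, 0.0, 3.0])
sd = np.array([0.5, 1.0, 1.5])
mu_gmm = np.sum(w * mu)                             ### weighted mean = 0.5
var_gmm = np.sum(w * (sd**2 + (mu - mu_gmm)**2))    ### component variances plus spread of the means = 4.475
print(mu_gmm, var_gmm, np.sqrt(var_gmm))
These are the same quantities returned by mixture.mean() and mixture.stddev() when such a mixture is built with tfd.MixtureSameFamily, as shown in the next section.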
Implementing Gaussian Mixture Models in TensorFlow
TensorFlow offers a versatile feature within its tfd module to generate Gaussian Mixture Models (GMMs) using tfd.MixtureSameFamily. These models are valuable tools for modeling complex data distributions.
While TensorFlow provides a convenient way to create GMMs, calculating the Kullback-Leibler (KL) divergence for GMMs poses a unique challenge. Unlike simple analytical formulas used for single Gaussian distributions, there’s no straightforward analytical formula for GMMs. Consequently, the KL_divergence module cannot be directly employed for KL divergence calculations in GMMs. To address this challenge, we turn to the Monte Carlo module for a practical solution.
In this section, we will begin by reviewing the implementation of a GMM model for 1D (univariate) Normal distributions. Subsequently, we will delve into the more complex implementation of GMMs for Multivariate Normal distributions. These implementations will showcase how to leverage TensorFlow’s capabilities to work with GMMs effectively.
Univariate GMM model
The code using tfd.MixtureSameFamily is described below for a GMM model with 3 Normal distributions and a batch of 12 experiments. The tfd.MixtureSameFamily needs two inputs:
1- mixture_distribution to define the weights of the different Normal distributions
2- components_distribution to define the distribution family
mixture = tfd.MixtureSameFamily(mixture_distribution=tfd.Categorical( probs = mixture_weights),
components_distribution=tfd.Normal(loc=means, scale=stddevs))
The mixture weights and the means and standard deviations of the 3 Normal distributions are defined using NumPy:
mixture_weights = np.random.dirichlet(np.ones(num_components), size=batch_size)
means = np.random.randint(low=-4, high=5, size=(batch_size, num_components)).astype(np.float32) # convert to float32
stddevs = np.random.uniform(0.1, 2.0, size=(batch_size, num_components)).astype(np.float32) # convert to float32
Once implemented in tfd.MixtureSameFamily, the mixture (GMM) distribution can be used to generate samples and to extract the GMM statistics:
mix = mixture.sample()
av = mixture.mean()
std = mixture.stddev().numpy()
An example is given below with the plot of the first batch sample. The plot corresponds to the following GMM:
means [ 2.0 0.0 -4]
stdv [1.098 0.634 0.651]
mixture weights [0.3213 0.4739 0.2047]
The full code is given in the window below:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras
from tensorflow.keras import layers
from IPython.display import display,Markdown
from tensorflow.keras import backend as K
import matplotlib.pyplot as plt
tfd = tfp.distributions
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
## the import of os can be skipped if there is no compatibility issue
## with the OpenMP library and TensorFlow
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
batch_size = 12
num_components = 3
#Define Gaussian mixture Model
mixture_weights = np.random.dirichlet(np.ones(num_components), size=batch_size)
means = np.random.randint(low=-4, high=5, size=(batch_size, num_components)).astype(np.float32) # convert to float32
stddevs = np.random.uniform(0.1, 2.0, size=(batch_size, num_components)).astype(np.float32) # convert to float32
print('means:', means)
print('stddevs:',stddevs)
print('mixture_weights:', mixture_weights)
####convert Numpy arrays to tf tensors #####
mixture_weights = tf.cast(mixture_weights, dtype=tf.float32)
means = tf.cast(means, dtype=tf.float32)
stddevs = tf.cast(stddevs, dtype=tf.float32)
mixture = tfd.MixtureSameFamily(mixture_distribution=tfd.Categorical( probs = mixture_weights),
components_distribution=tfd.Normal(loc=means, scale=stddevs))
##### mixture.sample() will generate 12 samples based on mixture distribution#####
mix = mixture.sample()
av = mixture.mean()
std = mixture.stddev().numpy()
print('############mix sample, average and stdv of mixture #######:\n')
print(mix)
print('average, stdv:', av, std)
##### plot of the distribution #####
##### mixture[0].prob() evaluates the density of the first batch mixture,
##### defined by means[0], stddevs[0] and mixture_weights[0]
x = np.linspace(-10, 10,100)
#### print the 12 batch probabilities for x=0.5 ####
print(mixture.prob(0.5).numpy())
#### print the first batch probability for x=0.5 ###
print(mixture[0].prob(0.5).numpy())
#### print the first batch probabilities for x in [-10,10] ####
print(mixture[0].prob(x).numpy())
#### plot the final distribution for the first batch ####
plt.plot(x,mixture[0].prob(x).numpy())
plt.show()
#### the following is the same as previously with N(0,1) distributions
#### used to calculate KL_divergence with the Monte Carlo module in the next section
z0 = tf.zeros((batch_size, num_components))
s0 = tf.ones((batch_size, num_components))
m0 = tf.fill((batch_size, num_components),1./num_components)
print('means0:', z0)
print('stdv1:',s0)
print('mix0:',m0)
mixture0 = tfd.MixtureSameFamily(mixture_distribution=tfd.Categorical( probs=m0),
components_distribution=tfd.Normal(loc=z0,scale=s0))
mix0 = mixture0.sample()
av0 = mixture0.mean()
std0 = mixture0.stddev().numpy()
print('############mix0 sample#######:\n',mix0)
print(mix0)
print('average0, stdv0:', av0, std0)
x = np.linspace(-3, 3,100)
print(mixture0[0].prob(x).numpy())
plt.plot(x,mixture0[0].prob(x).numpy())
plt.show()
Multivariate GMM model
The code is similar to the previous one; we just use a batch size of 1 to be able to plot the distribution.
latd = 2
batch_size = 1
num_components = 3
print('**** Mixtures of MultiVariateNormal distributions **** batch size - number of mixtures - latent dim ****')
mixture_weights = np.random.dirichlet(np.ones(num_components), size=batch_size)
means = np.random.randint(low=-4, high=5, size=(batch_size, num_components, latd)).astype(np.float64)
stddevs = np.random.uniform(0.1, 2.0, size=(batch_size, num_components ,latd)).astype(np.float64)
print('mixture_weights:',mixture_weights)
print('means:',means)
print('stddevs:',stddevs)
mixture = tfd.MixtureSameFamily(mixture_distribution=tfd.Categorical( probs = mixture_weights),
components_distribution=tfd.MultivariateNormalDiag(loc=means, scale_diag=stddevs))
print('*** samples of mixtures distribution *** samples (batch size) of dimension (latent dim) ***')
print(mixture.sample())
print('mean, stdv:', mixture.mean(),mixture.stddev())
sizetab = 100
x = np.linspace(-10. , 10., sizetab)
x, y = np.meshgrid(x,x)
#print(x)
data = np.stack((x.flatten(), y.flatten()), axis=1)
prob = mixture.prob(data).numpy()
ax = plt.axes(projection='3d')
plt.contour(x,y,prob.reshape((sizetab, sizetab)))
ax.plot_surface(x,y,prob.reshape((sizetab,sizetab)), cmap='plasma')
plt.show()
print('**** Mixtures of MultiVariateNormal distributions with means 0.0 and stdv 1.0 **** batch size - number of mixtures - latent dim ****')
#### this GMM0 model is used for calculating the KL_divergence ####
mixture_weights0 = np.full((batch_size,num_components), 1.0/num_components)
mean0 = np.full((batch_size,num_components,latd), 0.0)
stdv0 = np.full((batch_size,num_components,latd), 1.0)
print('mixture_weights0:',mixture_weights0)
print('mean0:',mean0)
print('stdv0:',stdv0)
mixture0 = tfd.MixtureSameFamily(mixture_distribution=tfd.Categorical( probs=mixture_weights0),
components_distribution=tfd.MultivariateNormalDiag(loc=mean0,scale_diag=stdv0))
print('mixture0 samples:', mixture0.sample())
#### with the mixture and mixture0 GMM models we can use the tfp Monte Carlo module
#### to calculate the KL_divergence
postdist = mixture
priordist = mixture0
kl_mc = tfp.monte_carlo.expectation(f=lambda x: postdist.log_prob(x) - priordist.log_prob(x),
samples=postdist.sample(1000, seed=42),
log_prob=postdist.log_prob, use_reparameterization=False)
print('kl MC:', kl_mc)
lossmc = tf.reduce_mean(kl_mc)
print('loss MC:', lossmc)
#### this lossmc can be used in a VAE TensorFlow code for training #####
The code above follows the same structure as the first one. The only different part is the plotting of the 2D distributions.
We will detail the plotting part and show an example.
sizetab = 100
x = np.linspace(-10. , 10., sizetab)
x, y = np.meshgrid(x,x)
#print(x)
data = np.stack((x.flatten(), y.flatten()), axis=1)
prob = mixture.prob(data).numpy()
ax = plt.axes(projection='3d')
plt.contour(x,y,prob.reshape((sizetab, sizetab)))
ax.plot_surface(x,y,prob.reshape((sizetab,sizetab)), cmap='plasma')
plt.show()
The Python script above generates and visualizes a 3D surface plot of a probability distribution using Matplotlib and NumPy.
1. `sizetab = 100`: This line initializes a variable `sizetab` and assigns it the value 100. This variable sets the number of grid points along each axis.
2. `x = np.linspace(-10. , 10., sizetab)`: This line creates an array `x` using NumPy’s `linspace` function. It generates 100 evenly spaced points between -10 and 10.
3. `x, y = np.meshgrid(x, x)`: This line uses `np.meshgrid` to create two 2D arrays `x` and `y` that represent a grid of points based on the 1D array `x`. It essentially creates a 2D coordinate grid for the `x` and `y` values.
4. `data = np.stack((x.flatten(), y.flatten()), axis=1)`: Here, the code flattens the `x` and `y` arrays into 1D arrays and stacks them column-wise using `np.stack` with `axis=1`, creating a 2D array `data` of (x, y) coordinate pairs. This `data` array is then evaluated under the mixture distribution.
5. `prob = mixture.prob(data).numpy()`: This line calculates the probability density for the points in the `data` array using the GMM represented by `mixture`. The result is stored in the `prob` variable.
6. `ax = plt.axes(projection='3d')`: This line creates a 3D axes object `ax` using Matplotlib for the subsequent 3D plot.
7. `plt.contour(x, y, prob.reshape((sizetab, sizetab)))`: This line generates contour lines on the `x`-`y` plane to visualize the probability distribution using Matplotlib's `contour` function. Reshaping the `prob` array into a 2D grid with `prob.reshape((sizetab, sizetab))` is necessary so that the visualization accurately represents the probability distribution across the defined grid of points (run the code and print the x, y, and prob data if this is not clear).
8. `ax.plot_surface(x, y, prob.reshape((sizetab, sizetab)), cmap='plasma')`: This line plots a 3D surface using `ax.plot_surface()`. It visualizes the probability values as a surface in 3D space. The `cmap='plasma'` argument specifies the colormap to use for coloring the surface.
9. `plt.show()`: Finally, this line displays the plot on the screen.
The plot of the 3D distribution for the following data is displayed below:
mixture_weights: [[0.21276701 0.73880999 0.048423 ]]
means: [[[ 2. 1.]
[ 1. -4.]
[-2. 4.]]]
stddevs: [[[0.85344813 0.53343337]
[1.91462427 0.93378822]
[0.4929878 0.34117561]]]
As we can see on the second plot, the 3 distributions have their relative size given by the weights [0.21276701 0.73880999 0.048423].
In the final section, we show how to calculate eigenvalues for tensor data, which, like everything presented in this paper, will be used in the VAE.
Principal component analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in data analysis and machine learning. Its primary purpose is to reduce the number of features (variables) in a dataset while preserving the most important information. Here’s a short summary of what PCA is and its key concepts:
1. **Dimensionality Reduction**: PCA helps reduce the dimensionality of a dataset by transforming the original features into a new set of features, known as principal components. These principal components are linear combinations of the original features.
2. **Orthogonal Transformation**: PCA performs an orthogonal linear transformation, which means that the new features (principal components) are uncorrelated with each other. This decorrelation property helps in simplifying the data.
3. **Variance Maximization**: PCA identifies the principal components in such a way that the first principal component explains the maximum variance in the data, the second principal component explains the second most variance, and so on. This ensures that the most important information is retained in the lower-dimensional representation.
4. **Eigenvalues and Eigenvectors**: PCA involves calculating the eigenvalues and eigenvectors of the covariance matrix of the original data. The eigenvectors represent the directions of maximum variance, and the eigenvalues indicate the amount of variance explained by each eigenvector.
5. **Selection of Components**: Depending on the desired level of dimensionality reduction, you can select a subset of the principal components to retain. Typically, you choose a sufficient number of components to explain a high percentage of the total variance.
6. **Applications**: PCA is widely used for various purposes, including data visualization, noise reduction, feature engineering, and simplifying complex datasets. It is particularly useful in scenarios where high-dimensional data needs to be analyzed or visualized.
7. **Loss of Interpretability**: While PCA reduces dimensionality, it often comes at the cost of interpretability because the new features (principal components) are linear combinations of the original features. However, it can be a valuable preprocessing step in machine learning tasks.
In summary, PCA is a mathematical technique for reducing the dimensionality of data while retaining as much of the original information as possible. It is widely used in data preprocessing and feature extraction to simplify complex datasets and aid in data analysis and visualization.
The easiest way to compute PCA while working on ML problems is to use the sklearn library, as it gives a very straightforward way to calculate PCA.
The code is given below with many explanatory comments, and we will detail a few more points afterwards.
################ definition of the distributions to be used for PCA analysis #####
batch_size = 25
latd = 3 ### features, which will be latent dim in VAE
means = np.random.randint(low=-4, high=5, size=(batch_size, latd))
stddevs = np.random.uniform(0.1, 2.0, size=(batch_size, latd))
means = tf.cast(means, dtype=tf.float64)
stddevs = tf.cast(stddevs, dtype=tf.float64)
postdist = tfd.MultivariateNormalDiag(means, stddevs)
print('################################################ PCA analysis #####################')
zsamp = postdist.sample()
##print(zsamp)
# Visualizing the sampled data
fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')
sca = ax.scatter(zsamp[:, 0],zsamp[:, 1],zsamp[:,2],s=10, cmap='plasma')
plt.colorbar(sca)
plt.title(' zsamp')
plt.show()
#### Using scikit-learn for PCA calculation
pca = PCA(n_components=3)
z_reduced = pca.fit_transform(zsamp)
#print('z reduced:', z_reduced)
##### the variance gives the eigen values after normalization
print('\n PCA variance explained z:', pca.explained_variance_ratio_)
pca_vectors = pca.components_
transfer_matrix = pca.components_.T
print('\n singular values:', pca.singular_values_)
print('\n pca vectors:',pca_vectors)
print('\n transfer:', transfer_matrix)
#### Reproducing z_reduced manually by projecting the centered data with the transfer_matrix
meanev = tf.reduce_mean(zsamp,axis=0)
zsamp=zsamp-meanev
zred = tf.matmul(zsamp,transfer_matrix)
### the ratio is one if the PCA is done correctly
ztot = zred/z_reduced
##print('\n ztot:', ztot)
#### # Eigendecomposition using tf.linalg.eigh ####
#### we get a tensor (batch,feature) (25,3)
#### transpose zsamp gives a tensor (3,25)
#### transpose(zsamp) * zsamp gives a (3,3) tensor
#### from the (3,3) tensor we can extract the 3 eigenvalues using tf.linalg.eigh
#### Note: We center zsamp (as done above) before this calculation
eigen_values, eigen_vectors = tf.linalg.eigh(tf.tensordot(tf.transpose(zsamp), zsamp, axes=1))
print('\n eigen:', eigen_values, eigen_vectors)
print('\n eigen T',tf.transpose(eigen_vectors))
zred1 = tf.matmul(zsamp,eigen_vectors)
#print('\n zred1:', zred1)
### as tf.linalg.eigh gives the eigenvectors in a different order than sklearn
### we need to reorder the zred1 tensor to compare with the PCA from sklearn
### by swapping the first and last columns
### to do that we need to convert the tf.tensor zred11 to a numpy array since tensors are immutable,
### meaning their values cannot be changed once they are created
### Once the changes are made, we reconvert zred11 back to a tensor
### using tf.constant
zred11 = tf.identity(zred1)
zred11 = zred11.numpy()
zred11[:,0] = zred1[:,2]
zred11[:,2] = zred1[:,0]
zred11 = tf.constant(zred11)
ztot1 = zred11/z_reduced
### the ratio must be 1 or -1 if the two PCA analysis match
##print('\n ztot1 ratio:',ztot1)
fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')
sca = ax.scatter(z_reduced[:, 0], z_reduced[:, 1], z_reduced[:, 2], s=10, cmap='plasma')
plt.colorbar(sca)
plt.title('PCA z')
plt.show()
Now, we run this code without printing the ratios for clarity:
################################################ PCA analysis #####################
PCA variance explained z: [0.46783778 0.35883506 0.17332716]
singular values: [17.7023502 15.50353457 10.77498308]
pca vectors: [[ 0.38341619 0.79801867 0.46492819]
[-0.91623937 0.26533671 0.30016971]
[ 0.11617852 -0.54107544 0.83291051]]
transfer: [[ 0.38341619 -0.91623937 0.11617852]
[ 0.79801867 0.26533671 -0.54107544]
[ 0.46492819 0.30016971 0.83291051]]
eigen: tf.Tensor([116.10026038 240.35958424 313.37320274], shape=(3,), dtype=float64) tf.Tensor(
[[-0.11617852 -0.91623937 0.38341619]
[ 0.54107544 0.26533671 0.79801867]
[-0.83291051 0.30016971 0.46492819]], shape=(3, 3), dtype=float64)
eigen T tf.Tensor(
[[-0.11617852 0.54107544 -0.83291051]
[-0.91623937 0.26533671 0.30016971]
[ 0.38341619 0.79801867 0.46492819]], shape=(3, 3), dtype=float64)
Let’s go step by step through the above results:
1. **PCA Variance Explained**:
— they represent the normalized eigenvalues returned by scikit-learn’s PCA module
— `PCA variance explained z: [0.46783778 0.35883506 0.17332716]`
— These values represent the proportion of the total variance in the data that is explained by each of the three principal components obtained from PCA.
— The first component explains approximately 46.8% of the variance, the second component explains 35.9%, and the third component explains 17.3%.
2. **Singular Values**:
— `singular values: [17.7023502 15.50353457 10.77498308]`
— These are the singular values obtained from the PCA transformation.
— Singular values indicate the magnitude of variance along each principal component.
— The first singular value is approximately 17.7, the second is 15.5, and the third is 10.8.
3. **PCA Vectors (Principal Components)**:
— `pca vectors: [[ 0.38341619 0.79801867 0.46492819]`
— `[[-0.91623937 0.26533671 0.30016971]`
— `[ 0.11617852 -0.54107544 0.83291051]]`
— These are the principal components (eigenvectors) of the data.
— Each row represents a principal component, and each column represents a feature (dimension).
— These vectors indicate the direction of maximum variance in the data.
4. **Transfer Matrix**:
— `transfer: [[ 0.38341619 -0.91623937 0.11617852]`
— `[ 0.79801867 0.26533671 -0.54107544]`
— `[ 0.46492819 0.30016971 0.83291051]]`
— This is the transpose of the PCA vectors (principal components) matrix.
— It can be used to project data points from the original feature space into the lower-dimensional PCA space.
5. **Eigenvalues and Eigenvectors**:
— `eigen: tf.Tensor([116.10026038 240.35958424 313.37320274], shape=(3,), dtype=float64)` and
— `tf.Tensor(
[[-0.11617852 -0.91623937 0.38341619]
[ 0.54107544 0.26533671 0.79801867]
[-0.83291051 0.30016971 0.46492819]], shape=(3, 3), dtype=float64)`
— These are the eigenvalues and eigenvectors calculated using TensorFlow’s `tf.linalg.eigh`.
— The eigenvalues represent the amount of variance explained by each corresponding eigenvector.
— The eigenvectors are the directions of maximum variance in the data.
— The eigenvalues are approximately 116.1, 240.4, and 313.4.
— The eigenvectors returned by `tf.linalg.eigh` are stored as columns, with the i-th column corresponding to the i-th eigenvalue (eigenvalues sorted in ascending order); this is the transpose of the row layout used by sklearn’s `pca.components_`.
6. **Transpose of Eigenvectors**:
— `eigen T tf.Tensor(
[[-0.11617852 0.54107544 -0.83291051]
[-0.91623937 0.26533671 0.30016971]
[ 0.38341619 0.79801867 0.46492819]], shape=(3, 3), dtype=float64)`
— This is the transpose of the eigenvectors matrix, which can be useful for certain calculations.
The main points are:
1- z_reduced contains the data points from zsamp represented in the PCA coordinates
2- z_reduced is the same whether computed with sklearn or with tf.linalg.eigh (up to column order and sign)
3- the eigenvalues are also the same: sklearn returns the eigenvalues normalized, whereas tf.linalg.eigh does not: [116.10026038 240.35958424 313.37320274]. If we divide these values by their sum we get [0.17332716 0.35883506 0.46783778], which (in reverse order) are the values returned by sklearn's PCA module.
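Using the eigenvalues and singular values printed above, this normalization can be checked in a few lines (a minimal sketch):
import numpy as np
### eigenvalues returned by tf.linalg.eigh in the run above (ascending order)
eig = np.array([116.10026038, 240.35958424, 313.37320274])
print(eig / eig.sum())   ### [0.17332716 0.35883506 0.46783778], i.e. explained_variance_ratio_ in reverse order
### the squared singular values reported by sklearn reproduce the same eigenvalues
sv = np.array([17.7023502, 15.50353457, 10.77498308])
print(sv**2)             ### approximately [313.373 240.360 116.100]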
Conclusion
In this paper, we have conducted a comprehensive review of critical concepts and methodologies that find direct application in Variational Autoencoder (VAE) coding in TensorFlow. Our exploration began with the analysis of Kullback-Leibler (KL) divergence, covering analytical, TensorFlow-based, and Monte Carlo-driven approaches. These insights into KL divergence are foundational for understanding the measure of dissimilarity between probability distributions, a cornerstone of VAEs.
Moreover, we have delved into Principal Component Analysis (PCA), recognizing its pivotal role in dimensionality reduction and feature extraction. Integrating PCA into VAE coding equips us with powerful tools for inspecting latent representations effectively.
This paper lays the groundwork for further exploration and the development of more sophisticated VAE models, contributing to a deeper understanding of probabilistic generative modeling and latent space representation in machine learning and artificial intelligence. It will be followed by a few other papers on VAE coding in TensorFlow.
References
- KL divergence theory: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- KL divergence between two univariate Gaussians: https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
- TensorFlow KL divergence module: https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/kl_divergence
- TensorFlow Monte Carlo expectation module: https://www.tensorflow.org/probability/api_docs/python/tfp/monte_carlo/expectation
- TensorFlow distributions: https://arxiv.org/pdf/1711.10604.pdf
- KL divergence using the Monte Carlo method: https://jamesmccaffrey.wordpress.com/2021/09/09/example-of-computing-kullback-leibler-divergence-for-continuous-distributions/ and http://joschu.net/blog/kl-approx.html
- PCA in sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- PCA with TensorFlow: https://medium.com/@mukesh.mithrakumar/principal-component-analysis-with-tensorflow-2-0-395aaf96bc