GANs

What is the difference between generative models and discriminative models?

Generative Model

Let the training dataset be data = {(x_i)}_{i=1}^N, where the x_i \in \mathbb{R}^d are called data points. These are iid samples from the true data distribution P(x).

The aim of a generative model is to estimate P(x) using the training data and sample new data points from it.

GANs (Generative Adversarial Networks), GMMs (Gaussian Mixture Models), etc., are examples of generative models. In GANs, we use the training data to train a neural network to learn the distribution of the data. Once this distribution has been learned, we can sample new data points from it.
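As a concrete illustration (not from the original post), here is a minimal sketch of the GMM case, assuming scikit-learn and NumPy are available; the toy bimodal dataset is made up for the example. The mixture is fit to the training samples, and new points are then drawn from the learned distribution.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy training data: iid samples from an (unknown to the model) bimodal P(x).
X = np.concatenate([rng.normal(-2.0, 0.5, size=(500, 1)),
                    rng.normal(+3.0, 1.0, size=(500, 1))])

# Estimate P(x) with a 2-component Gaussian mixture.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Sample new data points from the learned distribution.
X_new, _ = gmm.sample(n_samples=5)
print(X_new.ravel())
```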

Discriminative Model

In this case, the data is in a different format: data = {(x_i, y_i)}_{i=1}^N where x_i \in \mathbb{R}^d are the input features and y_i \in \mathbb{R}^k are the corresponding output labels or target variables.

The aim of a discriminative model is to estimate the conditional probability P(y|x) using the training data. In simple words, we are given an input x, and we want to predict the corresponding output y.

If y can take only discrete values, then the model is called a classification model. If y can take continuous values, then the model is called a regression model.
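Here is a minimal sketch of a discriminative model, assuming scikit-learn; the synthetic binary classification data is made up for the example. Logistic regression estimates P(y|x) directly, and predict_proba returns those estimated class probabilities for a query input.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Inputs x in R^2 and binary labels y (a synthetic classification problem).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A discriminative model: learn P(y|x) directly.
clf = LogisticRegression().fit(X, y)

x_query = np.array([[0.5, -0.1]])
print(clf.predict_proba(x_query))  # estimated P(y|x) for each class
print(clf.predict(x_query))        # predicted label
```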

Difference between generative and discriminative models

In generative models, we estimate the data distribution P(x) and sample from it to generate new data points that look like they could have come from the training dataset.

In discriminative models, we estimate the conditional distribution P(y|x).

Then what is a conditional generative model?

In most practical cases, we want the generative model to generate a new data point x conditioned on some input y.

In this case, the training dataset is data = {(x_i, y_i)}_{i=1}^N, and we want to estimate P(x|y) and sample from it.
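A common way to build such a model is a conditional GAN, whose generator receives the conditioning input y alongside the noise vector z. The sketch below assumes PyTorch, and all layer sizes and the MNIST-like dimensions are illustrative; it only shows the usual pattern of concatenating an embedding of y with z.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=64, n_classes=10, x_dim=784):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, x_dim),
            nn.Tanh(),
        )

    def forward(self, z, y):
        # Condition on y by concatenating its embedding with the noise z.
        return self.net(torch.cat([z, self.label_emb(y)], dim=1))

G = ConditionalGenerator()
z = torch.randn(4, 64)          # noise vectors
y = torch.tensor([0, 1, 2, 3])  # desired class labels (the conditioning input)
x_fake = G(z, y)                # samples generated conditioned on y
print(x_fake.shape)             # torch.Size([4, 784])
```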


KL Divergence (Kullback-Leibler Divergence)

KL divergence is a measure of how one probability distribution differs from another probability distribution.

Mathematically, for two probability distributions P(x) and Q(x), the KL divergence from Q to P is defined as:

For discrete distributions: \(D_{KL}(P||Q) = \sum_{x} p(x) \cdot \log\left(\frac{p(x)}{q(x)}\right)\)

For continuous distributions: \(D_{KL}(P||Q) = \int p(x) \cdot \log\left(\frac{p(x)}{q(x)}\right) \, dx\)

where the sum/integral is over all possible events x, and p(x) and q(x) are the probability mass/density functions of the distributions P and Q, respectively.
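As a quick numeric check of the discrete formula, here is a small sketch assuming NumPy; the two three-event distributions are made up for the example. Note that the two directions of the divergence generally give different values.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # p(x) for the three events
q = np.array([0.4, 0.4, 0.2])   # q(x) for the three events

kl_pq = np.sum(p * np.log(p / q))   # D_KL(P||Q)
kl_qp = np.sum(q * np.log(q / p))   # D_KL(Q||P)

print(kl_pq)  # ~0.0253 nats
print(kl_qp)  # ~0.0258 nats (the two directions differ)
```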

Intuition

Intuitively, KL divergence quantifies the information lost when Q is used to approximate P: the better Q matches P, the smaller the divergence.

Properties

KL divergence is always non-negative: \(D_{KL}(P||Q) \geq 0\), with equality if and only if P = Q. It is not symmetric, i.e., in general \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\), and it does not satisfy the triangle inequality, so it is not a true distance metric.

Usefulness in Machine Learning

Minimizing KL Divergence is equivalent to maximizing the likelihood.

For a typical ML problem, all we have are samples from the true distribution P(x), i.e., data = {(x_i)}_{i=1}^N, where the x_i \in \mathbb{R}^d are iid samples from P(x).

We do not know the true distribution P(x) explicitly.

We estimate the true distribution P(x) with a parametric model Q(x; \theta), where \theta denotes the model parameters.

We need to know how well our model Q(x; \theta) is performing. We can do this by calculating the KL divergence between P(x) and Q(x; \theta).

\[D_{KL}(P||Q) = \int p(x) \cdot \log \left( \frac{p(x)}{q(x; \theta)} \right) dx\]

\[D_{KL}(P||Q) = \mathbb{E}_{x \sim p(x)} \left[ \log \left( \frac{p(x)}{q(x; \theta)} \right) \right]\]

\[D_{KL}(P||Q) = \mathbb{E}_{x \sim p(x)} [\log (p(x))] - \mathbb{E}_{x \sim p(x)} [\log (q(x; \theta))]\]

We are trying to find the parameters \theta^* that minimize the KL divergence between P(x) and Q(x; \theta).

\[\theta^* = \arg\min_{\theta} D_{KL}(P||Q(x; \theta))\]

\[\theta^* = \arg\min_{\theta} \left( \mathbb{E}_{x \sim p(x)} [\log (p(x))] - \mathbb{E}_{x \sim p(x)} [\log (q(x; \theta))] \right)\]

Because \mathbb{E}_{x \sim p(x)} [\log (p(x))] does not depend on \theta, we can ignore it.

\[\theta^* = \arg\min_{\theta} \left( - \mathbb{E}_{x \sim p(x)} [\log (q(x; \theta))] \right)\]

\[\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim p(x)} [\log (q(x; \theta))]\]

\mathbb{E}_{x \sim p(x)} [\log (q(x; \theta))] is called the Expected Log Likelihood.

By the law of large numbers, we can approximate the expected log likelihood by the average log likelihood of the data:

\[\mathbb{E}_{x \sim p(x)} [\log (q(x; \theta))] \approx \frac{1}{N} \sum_{i=1}^{N} \log(q(x_i; \theta))\]
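Here is a small sketch of this approximation, assuming NumPy. Both p and q are taken to be Gaussians purely so that the expectation has a closed form, and the sample average of log q(x_i; \theta) is compared against the exact expected log likelihood as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, sigma_p = 0.0, 1.0     # true distribution p(x) = N(0, 1)
mu_q, sigma_q = 0.5, 1.2     # model q(x; theta) = N(0.5, 1.2^2)

def log_q(x):
    return -0.5 * np.log(2 * np.pi * sigma_q**2) - (x - mu_q)**2 / (2 * sigma_q**2)

# Closed-form E_{x~p}[log q(x; theta)] for two Gaussians.
expected = -0.5 * np.log(2 * np.pi * sigma_q**2) \
           - (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)

for N in (100, 10_000, 1_000_000):
    x = rng.normal(mu_p, sigma_p, size=N)
    # The sample average approaches the expectation as N grows.
    print(N, log_q(x).mean(), "vs closed form:", expected)
```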

Therefore, our optimization problem becomes:

\[\theta^* = \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log(q(x_i; \theta))\]

This is equivalent to maximizing the log likelihood of the data under the model q(x; \theta).

Hence, \theta^* is called the maximum likelihood estimate (MLE).
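To make the derivation concrete, here is a sketch assuming NumPy and SciPy: a Gaussian model q(x; \theta) = N(\mu, \sigma^2) is fit by minimizing the negative average log likelihood with a generic optimizer, and the recovered parameters match the closed-form Gaussian MLE (the sample mean and standard deviation). The data-generating parameters are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.5, size=5000)  # iid samples from the "true" P(x)

def neg_avg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # parameterize sigma > 0 via its log
    ll = -0.5 * np.log(2 * np.pi * sigma**2) - (data - mu)**2 / (2 * sigma**2)
    return -ll.mean()

res = minimize(neg_avg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)        # MLE found by the optimizer
print(data.mean(), data.std())  # closed-form Gaussian MLE for comparison
```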