All about GANs (Part 1)

In this blog series, we will discuss other popular variants of the GAN, such as:

  1. cGAN (Conditional GAN)
  2. WGAN (Wasserstein GAN)
  3. InfoGAN (Information GAN)
  4. BiGAN (Bidirectional GAN)
  5. CycleGAN
  6. DCGAN (Deep convolutional GAN)
  7. BigGAN, StyleGAN & StyleGAN2


The “All about GANs” blog series will be broken down into four parts. In this blog, Part 1, we will mainly discuss cGAN and WGAN; the other variants will be covered in upcoming parts.

In our previous blog, we introduced the GAN (Generative Adversarial Network), its working, and its mathematics. We also looked at its limitations and problems.

I would strongly recommend reading the previous blog to understand things in-depth:

Blog 1: Introduction to GAN (Generative Adversarial Networks)

This blog, similarly, focuses on discussing various types of GANs and how they work. In upcoming blogs, we will discuss an implementation of StyleGAN2 that generates photorealistic human face images, trained on the FFHQ dataset.

Conditional GAN (cGAN)

cGANs (Conditional Generative Adversarial Networks) improve on the basic GAN by feeding a condition ‘c’ into both the Generator and Discriminator networks.


While using the basic GAN, we encountered a problem: it tends to produce random images of objects/classes. In the case of the Fashion-MNIST dataset, the GAN randomly generated images of boots, shirts, or trousers. In short, we had no control over which objects were produced. This is where cGANs come in: they give us control over the class of object that is generated.


Comparing the architectures of the GAN and the cGAN, we can see that they are almost identical apart from ‘c’, the label information. It is fed along with ‘z’ to the generator, and along with ‘x’ to the discriminator.

The conditions, represented by ‘c’ (the labels), are passed along with the ‘z’ vector (latent space). The objective function we derived in the previous blog for the GAN was:

min_G max_D V(D, G) = E_{x~p_data(x)}[ log D(x) ] + E_{z~p_z(z)}[ log(1 - D(G(z))) ]

In the case of cGANs, we use the “conditional probability”, where ‘c’ represents the label information: the set of labels for the different classes of objects present in the dataset. Everything else works as in the basic GAN, but with conditional probabilities in place of the unconditional ones, so the equation becomes:

min_G max_D V(D, G) = E_{x~p_data(x)}[ log D(x|c) ] + E_{z~p_z(z)}[ log(1 - D(G(z|c))) ]
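To make the conditioning concrete, here is a minimal sketch, assuming PyTorch and illustrative Fashion-MNIST-like sizes (the layer widths and class names are our own, not from this post), of how ‘c’ is concatenated with ‘z’ in the generator and with ‘x’ in the discriminator:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for a Fashion-MNIST-like setup (assumptions for illustration)
Z_DIM, N_CLASSES, IMG_DIM = 100, 10, 28 * 28

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # z and the one-hot label c are simply concatenated before the first layer
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh(),
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # the image x and the same label c are concatenated as well
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + N_CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, c):
        return self.net(torch.cat([x, c], dim=1))

G, D = ConditionalGenerator(), ConditionalDiscriminator()
z = torch.randn(4, Z_DIM)
c = torch.eye(N_CLASSES)[torch.tensor([0, 3, 3, 7])]  # one-hot labels we choose
fake = G(z, c)       # (4, 784): images conditioned on the chosen classes
score = D(fake, c)   # (4, 1): probability that each (image, label) pair is real
```

Because we pick the label ‘c’ ourselves at sampling time, we control which class the generator produces, which is exactly the control the basic GAN lacked.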

Wasserstein GAN (WGAN)


Before we discuss WGAN, we need to understand a few other concepts.

  • Kullback-Leibler Divergence
  • Jensen-Shannon Divergence

KL (Kullback-Leibler) divergence and JSD (Jensen-Shannon divergence) are used while deriving the objective functions of auto-encoders and GANs respectively. Both have certain limitations, which is why WGAN uses the Wasserstein distance as its metric instead.

KL (Kullback-Leibler) Divergence

KL divergence is a measure of how one probability distribution differs from a second. It is used while deriving the objective function of VAEs (Variational Auto-Encoders) to minimize the distance between two probability distributions, and it comes in two forms:

Forward KL Divergence

Let’s take an example of two distributions P and Q, where P is a known distribution and Q is an unknown distribution that we try to fit to P by minimizing the divergence. Case 1 shows what this looks like.

Case-1 (Underestimation)

Mode seeking behavior

In this scenario of underestimation, there is a region where Q(x) = 0 while P(x) > 0. Substituting these values into the forward KL,

KL(P||Q) = Σ_x P(x) * log( P(x) / Q(x) )

wherever the denominator Q(x) is 0 (and P(x) > 0), the corresponding term becomes infinite and the divergence blows up, which is the opposite of our objective of minimizing it.

Case-2 (Overestimation)

Mean seeking behavior

In this scenario, Q(x) ≠ 0 over the entire region where P(x) > 0; instead, there are regions where P(x) = 0 while Q(x) > 0, since the green curve extends even further out. Substituting these values into the forward KL, the terms with P(x) = 0 simply contribute 0.

Hence, for the forward KL, overestimation avoids the problem of Q(x) = 0 causing the divergence to blow up, and for this reason it is known as the zero-avoiding solution.
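As a quick numerical check of the forward-KL behaviour described above, here is a small NumPy sketch (the three-bin distributions are made-up examples): the divergence stays finite when Q covers P’s support, and blows up when Q(x) = 0 somewhere that P(x) > 0.

```python
import numpy as np

def forward_kl(p, q):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); terms with P(x) = 0 contribute 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5, 0.0]
q_covers  = [0.4, 0.4, 0.2]  # Q(x) > 0 wherever P(x) > 0: finite divergence
q_missing = [1.0, 0.0, 0.0]  # Q(x) = 0 where P(x) > 0: divergence blows up

print(forward_kl(p, q_covers))   # finite (~0.223)
print(forward_kl(p, q_missing))  # inf
```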

Reverse KL Divergence

Similarly, we take the above two cases of underestimation and overestimation and substitute the values into the reverse KL equation, KL(Q||P) = Σ_x Q(x) * log( Q(x) / P(x) ).

Case-1 (Underestimation)

Mode seeking behavior

In the case of underestimation, the terms where Q(x) = 0 contribute 0, so we obtain:

Case-2 (Overestimation)

Mean seeking behavior

In the case of overestimation, there are regions where P(x) = 0 while Q(x) > 0, so the corresponding terms blow up and we obtain:


And generally, underestimation is chosen over overestimation, because overestimation tends to cover too much area. Hence we choose the reverse KL divergence, since in its case the problematic terms under underestimation contribute 0 instead of blowing up.
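The reverse-KL behaviour can be checked numerically too. In this NumPy sketch (again with made-up three-bin distributions), a Q that covers only one mode of P (underestimation) gives a finite reverse KL, while a Q that spills into a region where P(x) = 0 (overestimation) blows up:

```python
import numpy as np

def reverse_kl(p, q):
    # KL(Q || P) = sum_x Q(x) * log(Q(x) / P(x)); terms with Q(x) = 0 contribute 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = q > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = [0.5, 0.5, 0.0]        # P has two modes
q_under = [1.0, 0.0, 0.0]  # Q covers only one mode of P (underestimation)
q_over  = [0.4, 0.4, 0.2]  # Q spills into a region where P(x) = 0 (overestimation)

print(reverse_kl(p, q_under))  # finite: log 2
print(reverse_kl(p, q_over))   # inf
```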

However, in some cases both the forward and the reverse KL blow up and produce an infinite value. For example, consider two non-overlapping probability distributions: P concentrated at 0 and Q at 0.5.

In the above diagram, the two distributions are shown, and their KL divergence works out as follows.

At x = 0: P(0) = 1 while Q(0) = 0, so the term P(0) * log( P(0) / Q(0) ) = log(1/0) = ∞.

At x = 0.5: Q(0.5) = 1 while P(0.5) = 0, so the reverse-direction term Q(0.5) * log( Q(0.5) / P(0.5) ) = log(1/0) = ∞.


From this scenario we can see that when two probability distributions do not overlap, the KL divergence blows up to infinity.

To overcome this problem, we use JSD (Jensen-Shannon divergence).

JSD (Jensen-Shannon divergence)

Jensen-Shannon divergence is a method of measuring the similarity between two probability distributions. It is given as,

JSD(P||Q) = 1/2 * { KL(P||M) + KL(Q||M) }

where M = (P + Q)/2

On using the above case discussed in the KL divergence, we obtain the JSD as,

Hence, JSD overcomes the problem we encountered with the KL divergence: it produces the finite value log 2 when two probability distributions don’t overlap at all.
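The same toy setup makes the contrast visible: for two non-overlapping point masses, the KL divergence is infinite while the JSD settles at log 2. A small NumPy sketch (the two-bin encoding of the point masses is our own choice):

```python
import numpy as np

def kl(p, q):
    # KL(P || Q); terms with P(x) = 0 contribute 0
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    # JSD(P || Q) = 1/2 * KL(P || M) + 1/2 * KL(Q || M), with M = (P + Q) / 2
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two non-overlapping point masses: P at x = 0, Q at x = 0.5
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(kl(p, q))   # inf: the KL divergence blows up
print(jsd(p, q))  # 0.6931... = log 2: the JSD stays finite
```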

Alternate method

Hence, researchers have come up with a better metric for bringing two probability distributions together by minimizing the distance between them: one whose value neither blows up like the KL divergence nor saturates like the JSD. It is known as the Wasserstein metric.

Wasserstein Metric

It is a metric used to calculate the horizontal distance, or Wasserstein distance (Wd), between two different probability distributions; in the example below, the shift between them is denoted by “θ” [ 0 < θ < 1 ]. It’s also known as the Earth Mover’s distance.

Shifting away of distribution Q

Let’s consider a scenario in which P & Q are two probability distributions where Q keeps shifting away from the mean of P such that the distance between them keeps increasing as shown in the above figure.

As the mean of Q keeps shifting away from the mean of P (represented by the x-axis in the diagram below), so that the distance between them keeps increasing, the divergence curves behave as follows:


As the distance increases and the distributions no longer overlap, the KL divergence blows up to infinity while the JSD saturates at the constant value log 2.

Because the JSD is stuck at log 2, represented by a flat line, its derivative/slope there is close to 0, causing a vanishing gradient problem. Hence, as the distance between the two probability distributions increases, both the KL and JS divergences provide a gradient of 0, i.e. the generator learns nothing from gradient descent.

Therefore, the Wasserstein distance is a better distance metric. It is given as:

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x,y)~γ}[ ||x - y|| ]

where Π(Pr, Pg) is the set of all joint distributions whose marginals are Pr and Pg.

Real & generated distributions
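As a sketch of why the Wasserstein distance behaves better, the 1-D case of equal-size empirical samples has a simple closed form: sort both samples and average the pointwise gaps. The NumPy example below (the point-mass samples and the shift values are illustrative) shows the distance growing linearly with the shift θ instead of saturating the way the JSD does:

```python
import numpy as np

def w1(xs, ys):
    # Wasserstein-1 distance between two equal-size 1-D samples:
    # sort both and average the pointwise gaps (the closed form in 1-D)
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

p_samples = np.zeros(100)                   # P: all mass at 0
for theta in [0.25, 0.5, 1.0, 2.0]:
    q_samples = np.full(100, theta)         # Q: all mass shifted to theta
    print(theta, w1(p_samples, q_samples))  # grows linearly with the shift
```

Unlike the KL and JS divergences, this value keeps changing as Q moves, so it still provides a useful training signal when the distributions don’t overlap.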

We can also write the final objective as minimizing this distance between the real and generated distributions:

min_G W(Pr, Pg)

The equation we derived is intractable, and to make it tractable we use the “Kantorovich-Rubinstein duality”, from which we obtain:

W(Pr, Pg) = sup_{||f||_L ≤ 1} { E_{x~Pr}[ f(x) ] - E_{x~Pg}[ f(x) ] }

The supremum (maximum) is taken over all 1-Lipschitz continuous functions f.

1-Lipschitz continuous function: the slope of the curve must not exceed 1 anywhere; formally, |f(x) - f(y)| ≤ |x - y| for all x and y.

Quick Recap (k-Lipschitz continuity)

In the above diagram, the point of interest lies between x and y.

x and y are the neighborhood of the point of interest. Perpendicular lines are drawn from x and y, their corresponding function values f(x) and f(y) are marked on the curve, and from these the slope is computed.

If slope ≤ k, then it’s called k-Lipschitz continuity. If k=1, it’s called 1-Lipschitz continuity.
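This condition can be checked numerically: sample points, compute the slope between every pair, and compare the largest slope to k. A small NumPy sketch (the sampled interval and the test functions are our own choices):

```python
import numpy as np

def max_slope(f, xs):
    # Largest |f(x) - f(y)| / |x - y| over all sampled pairs of points
    ys = f(xs)
    num = np.abs(ys[:, None] - ys[None, :])
    den = np.abs(xs[:, None] - xs[None, :])
    mask = den > 0
    return float(np.max(num[mask] / den[mask]))

xs = np.linspace(-3.0, 3.0, 601)
print(max_slope(np.sin, xs))             # <= 1: sin is 1-Lipschitz
print(max_slope(lambda x: 3.0 * x, xs))  # 3: k-Lipschitz with k = 3, but not 1-Lipschitz
```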

The equation we obtained was,

Here, we replace the supremum with a maximum over a parameterized family of functions Dw:

max_{w ∈ W} E_{x~Pr}[ Dw(x) ] - E_{z~p(z)}[ Dw(G(z)) ]


To enforce the above condition, we use weight clipping, i.e. clipping the parameters (weights, biases) of the neural network to a range (-c, c), which plays the role of the neighborhood discussed for the Lipschitz continuity function.


There are a few limitations to using clipping to enforce the Lipschitz constraint on the discriminator when estimating the Wasserstein distance.

If the clipping parameter is large, then it can take a long time for any weights to reach optimality, thereby making it harder to train. If the clipping parameter is small, this leads to a vanishing gradient problem.

What makes WGAN different?

  1. In WGAN, the discriminator acts as a critic that measures the closeness between the two distributions rather than discriminating/classifying between real and fake. This turns the task into a regression problem.
  2. The discriminator works as a regressor instead of a classifier.
  3. Since it doesn’t classify images as real or fake like a regular GAN, it doesn’t use a sigmoid function in the last layer.
  4. The optimization method chosen in WGAN is RMSprop.
  5. There is no logarithmic term in the loss function.
  6. The weights ‘w’ are clipped to a range (-c, c) to satisfy the k-Lipschitz condition.
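Putting the points above together, here is a minimal sketch of one critic update, assuming PyTorch (the layer sizes and batch size are illustrative, though lr = 5e-5 and c = 0.01 match the WGAN paper’s defaults): no sigmoid, no log in the loss, RMSprop as the optimizer, and weight clipping after the step.

```python
import torch
import torch.nn as nn

# Illustrative sizes; not tied to any particular dataset
Z_DIM, IMG_DIM, CLIP = 64, 784, 0.01

critic = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
                       nn.Linear(256, 1))          # no sigmoid in the last layer
generator = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(),
                          nn.Linear(256, IMG_DIM), nn.Tanh())
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)  # RMSprop, not Adam

real = torch.randn(32, IMG_DIM)                    # stand-in for a batch of real images
fake = generator(torch.randn(32, Z_DIM)).detach()  # detached: this step trains the critic

# No log terms: the critic maximizes E[D(real)] - E[D(fake)],
# so we minimize the negated difference
critic_loss = -(critic(real).mean() - critic(fake).mean())
opt_c.zero_grad()
critic_loss.backward()
opt_c.step()

# Enforce the Lipschitz constraint: clip every parameter into [-CLIP, CLIP]
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-CLIP, CLIP)
```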


That’s all about cGAN and WGAN, which are definitely a step up from the basic GAN. Still, they have their limitations, and in the upcoming blogs we will cover better and more advanced GAN variants.

For any queries, or discussion you can contact me on LinkedIn and my details are given below.

A passionate & inquisitive learner, member of the Data Science Society (IMI Delhi). A strong passion for ML/DL, mathematics, quantum computing & philosophy.