All about GANs (Part 2)

Sudeep Das
10 min read · Jul 3, 2021

Welcome to the second part of the “All about GANs” series. The series is split into 4 parts. In this part we discuss three more variants of GANs:

  1. InfoGAN (Information GAN)
  2. BiGAN (Bidirectional GAN)
  3. CycleGAN

In the previous blog, we introduced the concept of a GAN, explained how it works, and discussed several of its variants. If you want to understand the basics of GANs, please go through that blog first.

Before we discuss the working of the InfoGAN, it’s important to understand the concept of Information Theory that will be used to derive its loss function.

Information Theory

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. For example, the statement “The sun rose in the east” carries almost no information, because it happens every day and is a universal truth. The statement “A solar eclipse will take place today” is far more informative, because eclipses are rare. In the context of ML, information theory is used to characterize probability distributions and to quantify the similarity between them.

Source: https://staging.news.st-chris.net/pi-1-jamie-odowd/#.YN3oAegzZPY

Information theory formalizes this intuition with three properties:

1. Likely events should have low information content, and events that are guaranteed to happen should have no information content

2. Unlikely events should have higher information content

3. Independent events should have additive information

These are the properties that the self-information of an event ‘x’ must satisfy. It is defined as:

I(x) = -log P(x)

Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy, which is the expected self-information:

H(X) = -Σ_x P(x) log P(x) (Shannon entropy)

Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy & distributions that are closer to uniform have high entropy.
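As a quick numerical check of the two definitions above, here is a minimal NumPy sketch. The function names and the example distributions are illustrative choices, not taken from any library, and logarithms are taken in base 2 so the results are in bits.

```python
import numpy as np

def self_information(p, base=2.0):
    """Self-information I(x) = -log P(x) of an outcome with probability p."""
    return float(-np.log(p) / np.log(base))

def shannon_entropy(probs, base=2.0):
    """Shannon entropy H(X) = -sum_x P(x) log P(x) of a discrete distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]  # convention: 0 * log(0) contributes nothing
    return float(-np.sum(nonzero * np.log(nonzero)) / np.log(base))

# Hypothetical example distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
nearly_certain = [0.97, 0.01, 0.01, 0.01]

print("entropy (uniform):       ", shannon_entropy(uniform))
print("entropy (nearly certain):", shannon_entropy(nearly_certain))
print("self-information of p=0.5:", self_information(0.5))
```

The uniform distribution gives the larger entropy, and the self-information of two independent events adds up, which matches the third property listed earlier.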

Joint Entropy

The combined entropy of two discrete random variables ‘X’ & ‘Y’ is the entropy of the pair (X, Y), known as the joint entropy:

H(X, Y) = -Σ_x Σ_y P(x, y) log P(x, y)

Conditional Entropy

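The conditional entropy of ‘X’ given ‘Y’ is the entropy of ‘X’ that remains once ‘Y’ is known, averaged over the values of ‘Y’. It is given as,

H(X|Y) = Σ_y P(y) H(X|Y = y) = -Σ_x Σ_y P(x, y) log P(x|y)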

On solving the above equation, we obtain,

H(X|Y) = H(X, Y) - H(Y) (Chain Rule)

H(X|Y) = 0; if value of ‘X’ is completely determined by ‘Y’

H(X|Y) = H(X); if X & Y are independent random variables
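The two limiting cases can be checked numerically. The sketch below computes H(X|Y) through the standard chain rule H(X|Y) = H(X, Y) - H(Y) from a joint probability table; the tables and function names are made up for illustration, with rows indexing X and columns indexing Y.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # skip zero-probability cells
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H(X|Y) = H(X, Y) - H(Y); rows index X, columns index Y."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)  # marginal distribution of Y
    return entropy(joint) - entropy(p_y)

# X is an exact copy of Y: knowing Y leaves no uncertainty about X.
determined = np.array([[0.5, 0.0],
                       [0.0, 0.5]])

# X and Y are independent: P(x, y) = P(x) P(y).
p_x = np.array([0.9, 0.1])
p_y = np.array([0.5, 0.5])
independent = np.outer(p_x, p_y)

print("H(X|Y), X determined by Y:", conditional_entropy(determined))
print("H(X|Y), independent:      ", conditional_entropy(independent))
print("H(X):                     ", entropy(p_x))
```

In the independent case the conditional entropy equals H(X), because observing Y tells us nothing about X.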

Mutual Information

Mutual information is the amount of information that can be obtained about one random variable by observing another. It measures the statistical dependence between two random variables ‘X’ & ‘Y’: the more independent they are, the lower the mutual information between them, and it is zero when they are completely independent.

Entropy Venn Diagram

The mutual information is given as,

I(X;Y) = -H(X,Y) + H(X) + H(Y)

Using the chain rule it can be written in two ways,

I(X;Y) = H(X) - H(X|Y)

(or)

I(X;Y) = H(Y) - H(Y|X)
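To make the formula concrete, here is a small NumPy sketch that computes I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint probability table. The tables are toy examples and the helper names are illustrative.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y); rows index X, columns index Y."""
    joint = np.asarray(joint, dtype=float)
    h_x = entropy(joint.sum(axis=1))  # marginal of X
    h_y = entropy(joint.sum(axis=0))  # marginal of Y
    return h_x + h_y - entropy(joint)

# Toy joint distributions (assumed for illustration).
copy_joint = np.array([[0.5, 0.0],
                       [0.0, 0.5]])             # Y is a copy of X
indep_joint = np.outer([0.9, 0.1], [0.5, 0.5])  # X and Y independent

print("I(X;Y), Y copies X: ", mutual_information(copy_joint))
print("I(X;Y), independent:", mutual_information(indep_joint))
```

When Y is a copy of X, the mutual information equals H(X) (one bit here); when the variables are independent it drops to zero, up to floating-point error.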

InfoGAN

InfoGAN is an unsupervised method that maximizes mutual information in order to learn disentangled representations of images. No labels are provided: InfoGAN learns the patterns among the images in an unsupervised manner and generates images that follow those patterns.

Source: https://www.researchgate.net/figure/The-architectures-of-InfoGAN-and-SCGAN-InfoGAN-attempts-to-separate-the-condition-which_fig1_327949545

In a standard GAN, the latent vector ‘z’ is highly entangled: its individual dimensions do not correspond to recognizable semantic features of the data.

InfoGAN therefore splits the generator input into two parts {c, z}: the noise ‘z’ and a latent code ‘c’ (labels/attributes) that is meant to capture the salient semantic features of the real samples in a disentangled way. The output of the generator is then ‘G(z, c)’.
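One common way to feed the pair {c, z} to a generator is to concatenate the noise vector with a one-hot latent code. The sketch below only builds that input; the dimensions (62 noise values, one 10-way categorical code) and the function name are illustrative assumptions, and no generator network is defined here.

```python
import numpy as np

def sample_generator_input(batch_size, noise_dim, num_categories, rng):
    """Build the generator input by concatenating noise z with a one-hot code c."""
    z = rng.standard_normal((batch_size, noise_dim))          # unstructured noise
    c_idx = rng.integers(0, num_categories, size=batch_size)  # sampled category per example
    c = np.eye(num_categories)[c_idx]                         # one-hot latent code
    return np.concatenate([z, c], axis=1), c_idx

rng = np.random.default_rng(0)
gen_input, c_idx = sample_generator_input(
    batch_size=8, noise_dim=62, num_categories=10, rng=rng
)
print(gen_input.shape)  # one row per example: noise followed by the code
```

Keeping `c_idx` around is useful later, because the auxiliary network Q is trained to recover exactly this code from the generated image.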

The generator output G(z, c), denoted ‘x̂’, is fed to another neural network that implements ‘Q(c|x)’ and produces an estimate “ c’ ” of the latent code. This network works like a reversed autoencoder: the code is first turned into an image and then recovered from it.

InfoGAN maximizes the mutual information between the latent code ‘c’ and the generated output ‘G(z, c)’, so the generator cannot ignore ‘c’ and learns to tie it to important features. The discriminator, as in a standard GAN, pushes the generated images to look real, so the generator cannot simply copy ‘c’ into the output in a trivially recognizable way and has to encode it in realistic images.

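The InfoGAN objective adds this mutual information as a regularization term to the standard GAN value function V(D, G):

min_G max_D V_I(D, G) = V(D, G) - λ I(c; G(z, c))

Here λ is the regularization constant, a hyperparameter that weights the mutual information term. According to the paper, setting λ to 1 is sufficient for discrete latent codes, and a smaller value is typically used when the latent code contains continuous variables.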

Minimization of G*

To minimize the objective with respect to G, we have to maximize “ λ I( c; G(z,c)) ”

I( c; G(z,c)) = H(c) - H(c|G(z,c)) → A

To maximize equation A, we have to minimize H(c|G(z,c)), ideally down to 0.

[ ∵ H(X|Y) = 0; if value of ‘X’ is completely determined by ‘Y’ ]

NOTE:

It’s difficult, almost intractable, to maximize I( c; G(z,c) ) directly, because doing so requires access to the posterior P(c|x). Instead, a lower bound on it is maximized, obtained by introducing an auxiliary distribution Q(c|x) that approximates P(c|x). This technique is known as Variational Information Maximization.

Source: https://arxiv.org/pdf/1606.03657.pdf
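For a categorical code, the lower bound takes the form E[log Q(c|x)] + H(c), and its first term is just the negative cross-entropy between the sampled code and the prediction of Q. The sketch below estimates the bound from a batch; the logits stand in for the output of a Q network, which is not defined here, and all names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_lower_bound(q_logits, c_idx, prior):
    """Monte Carlo estimate of E[log Q(c|x)] + H(c), in nats."""
    prior = np.asarray(prior, dtype=float)
    log_q = log_softmax(np.asarray(q_logits, dtype=float))
    expected_log_q = log_q[np.arange(len(c_idx)), c_idx].mean()
    h_c = -np.sum(prior * np.log(prior))  # entropy of the code prior
    return float(expected_log_q + h_c)

prior = np.full(4, 0.25)          # uniform prior over 4 categories (assumed)
c_idx = np.array([0, 1, 2, 3])    # codes that were fed to the generator

confident_logits = 50.0 * np.eye(4)[c_idx]  # Q recovers every code
uninformed_logits = np.zeros((4, 4))        # Q guesses uniformly

print("bound, Q recovers c:", info_lower_bound(confident_logits, c_idx, prior))
print("bound, Q guesses:   ", info_lower_bound(uninformed_logits, c_idx, prior))
```

When Q recovers the code perfectly the bound reaches H(c), its largest possible value; when Q carries no information about the code it falls to zero.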

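Using the above technique, with L_I(G, Q) denoting the variational lower bound on I(c; G(z, c)), we obtain the final objective as,

min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) - λ L_I(G, Q)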

BiGAN (Bi-directional GAN)

Source: https://arxiv.org/pdf/1605.09782.pdf

The above diagram shows the architecture of BiGAN. Its working principle is very similar to that of an autoencoder, which has an encoder block and a decoder block.

Source: https://www.compthree.com/blog/autoencoder/

In the case of BiGAN, the encoder and the decoder are separate blocks and there’s no direct connection between them. They can be written as,

Encoder: z’ = E(x)

Decoder (generator): x’ = G(z)

‘x’ is a sample taken from the distribution of the real images and ‘z’ is taken from the latent distribution. In the paper, the decoder is called the generator. Both pairs, (x, z’) from the encoder and (x’, z) from the decoder, are sent to the discriminator, which therefore works on the joint distribution of images and latent codes. It is trained to output ‘1’ for pairs that come from the encoder block, since they contain real images, and ‘0’ for pairs that come from the decoder block.

Once the discriminator can tell these two kinds of pairs apart, the encoder, decoder, and discriminator blocks are trained together so that the joint distributions of (x, z’) and (z, x’) are pushed to match.
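The pairing described above can be sketched with toy linear maps. Everything in this block is an illustrative assumption: the encoder, generator, and discriminator are random linear layers rather than trained networks, and the value function is written in the deterministic form E[log D(x, E(x))] + E[log(1 - D(G(z), z))].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bigan_value(d_enc_pairs, d_gen_pairs, eps=1e-12):
    """V = E[log D(x, E(x))] + E[log(1 - D(G(z), z))] from discriminator outputs."""
    d_enc_pairs = np.asarray(d_enc_pairs, dtype=float)
    d_gen_pairs = np.asarray(d_gen_pairs, dtype=float)
    return float(np.mean(np.log(d_enc_pairs + eps))
                 + np.mean(np.log(1.0 - d_gen_pairs + eps)))

rng = np.random.default_rng(0)
x_dim, z_dim, batch = 6, 2, 16

# Toy, untrained stand-ins for the three networks.
W_enc = 0.1 * rng.standard_normal((x_dim, z_dim))   # encoder E: x -> z'
W_gen = 0.1 * rng.standard_normal((z_dim, x_dim))   # generator G: z -> x'
w_disc = 0.1 * rng.standard_normal(x_dim + z_dim)   # discriminator on (image, code)

def discriminator(x, z):
    """Score a joint (image, code) pair with a probability in (0, 1)."""
    return sigmoid(np.concatenate([x, z], axis=1) @ w_disc)

x = rng.standard_normal((batch, x_dim))   # "real" samples
z = rng.standard_normal((batch, z_dim))   # latent samples

d_enc = discriminator(x, x @ W_enc)       # pairs (x, z') from the encoder
d_gen = discriminator(z @ W_gen, z)       # pairs (x', z) from the generator
print("V(D, E, G) on this batch:", bigan_value(d_enc, d_gen))
```

The discriminator tries to push this value up toward 0, while the encoder and generator try to push it down, which is what makes the joint distributions match at the optimum.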

The BiGAN training objective is defined as the minimax objective,

min_{G,E} max_D V(D, E, G)

where,

V(D, E, G) = E_{x ~ p_X}[ log D(x, E(x)) ] + E_{z ~ p_Z}[ log(1 - D(G(z), z)) ]

For a detailed understanding, kindly go through the blog by Hamaad Shah and an explanatory video by Ahlad Kumar.

Blog: Using Bidirectional Generative Adversarial Networks to estimate Value-at-Risk for Market Risk Management

Video: Objective function of BiGAN (Bidirectional GAN) / ALI architecture

CycleGAN

Source: https://link.springer.com/article/10.1007/s12194-019-00520-y/figures/2

CycleGAN is used for image-to-image translation: converting an image from one domain into another while keeping the attributes of the original image intact. The diagram below illustrates image-to-image translation. CycleGAN follows the unpaired approach rather than the paired approach, and other GANs such as DiscoGAN and DualGAN work similarly on unpaired data.

CycleGAN is an extension of the GAN architecture that involves the simultaneous training of two generator models and two discriminator models.

One generator takes samples (images) from the 1st domain as input and generates images for the 2nd domain, and the other generator takes images from the 2nd domain and generates images for the 1st domain. The discriminators distinguish between real & fake images in each domain, and their losses are used to update the generators.

CycleGAN uses a cycle consistency loss to enable training without the need for paired data. In other words, it can translate from one domain to another without a one-to-one mapping between the source and target domain.

Source: https://arxiv.org/pdf/1703.10593.pdf

The architecture of CycleGAN is basically divided into two networks that are connected together and they are as follows:

1st Network

Source: https://blog.jaysinha.me/train-your-first-cyclegan-for-image-to-image-translation/

The above diagram represents the working of the 1st network where a real image is taken from distribution A and is passed through the generator(A2B) that generates a fake image B. Then again, it’s passed through another generator(B2A) that generates a reconstructed image. The working is very similar to the working of an autoencoder.

After this, the L1 loss between the original image and the reconstructed image is minimized, and this is called the cycle-consistency loss.

The below diagram represents the 1st network.

Source: https://arxiv.org/pdf/1703.10593.pdf

‘x’ represents a sample taken from the real image distribution. It is converted into ‘Ŷ’ by the generator ‘G’, and ‘Ŷ’ is then passed through the other generator ‘F’ to produce ‘x̂’, an estimate of the original image. The cycle consistency loss is then calculated between ‘x’ and ‘x̂’.
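The forward cycle x → G(x) → F(G(x)) can be sketched in a few lines of NumPy. The generators below are toy invertible functions chosen for illustration, not neural networks, and the L1 distance is averaged over all elements, which differs from a summed L1 norm only by a constant factor.

```python
import numpy as np

def cycle_consistency_loss(x, G, F):
    """Forward cycle loss: mean absolute difference between F(G(x)) and x."""
    x_hat = F(G(x))  # translate to the other domain and back
    return float(np.mean(np.abs(x_hat - x)))

# Toy "generators" (assumptions for illustration only).
G = lambda x: 2.0 * x + 1.0          # domain X -> domain Y
F_good = lambda y: (y - 1.0) / 2.0   # exact inverse of G
F_bad = lambda y: y                  # ignores what G did

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))

print("cycle loss with a consistent F:  ", cycle_consistency_loss(x, G, F_good))
print("cycle loss with an inconsistent F:", cycle_consistency_loss(x, G, F_bad))
```

The loss is near zero only when F undoes G, which is the constraint that lets CycleGAN train without paired examples.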

Objective function

There are two losses here: the adversarial loss and the cycle consistency loss. These loss functions are combined to derive the final objective function of CycleGAN.

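Adversarial Loss

L_GAN(G, D_Y, X, Y) = E_{y ~ p_data(y)}[ log D_Y(y) ] + E_{x ~ p_data(x)}[ log(1 - D_Y(G(x))) ]

where D_Y is the discriminator for the 2nd domain.

Cycle Consistency Loss

E_{x ~ p_data(x)}[ ||F(G(x)) - x||_1 ]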

2nd Network

The 2nd network mirrors the 1st, with the direction reversed. Here a real image is taken from distribution B and passed through the generator (B2A), which generates a fake image A. That image is then passed through the other generator (A2B), which generates a reconstructed image.

The below diagram represents the 2nd network.

‘y’ represents a sample taken from the real image distribution. It is converted into ‘X̂’ by the generator ‘F’, and ‘X̂’ is then passed through the other generator ‘G’ to produce ‘ŷ’, an estimate of the original image. The cycle consistency loss is then calculated between ‘y’ and ‘ŷ’.
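Putting both directions together, the sketch below evaluates a combined objective of the form L_GAN(G, D_Y) + L_GAN(F, D_X) + λ L_cyc(G, F) on toy data. The generators and discriminators are simple stand-in functions, the adversarial term uses the log form, and the default weight `lam=10.0` is an assumed value for illustration.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """E[log D(real)] + E[log(1 - D(fake))] from discriminator outputs."""
    return float(np.mean(np.log(d_real + eps))
                 + np.mean(np.log(1.0 - d_fake + eps)))

def l1(a, b):
    """Mean absolute difference between two batches."""
    return float(np.mean(np.abs(a - b)))

def cyclegan_objective(x, y, G, F, D_X, D_Y, lam=10.0):
    """Two adversarial terms plus the weighted forward and backward cycle terms."""
    fake_y, fake_x = G(x), F(y)
    loss_gan_g = adversarial_loss(D_Y(y), D_Y(fake_y))   # G: X -> Y, judged by D_Y
    loss_gan_f = adversarial_loss(D_X(x), D_X(fake_x))   # F: Y -> X, judged by D_X
    loss_cyc = l1(F(fake_y), x) + l1(G(fake_x), y)       # x -> y -> x and y -> x -> y
    return loss_gan_g + loss_gan_f + lam * loss_cyc

# Toy stand-ins (assumptions): domain Y is domain X shifted by one.
G = lambda x: x + 1.0
F = lambda y: y - 1.0
F_broken = lambda y: y                             # does not invert G
undecided = lambda batch: np.full(len(batch), 0.5) # discriminator that cannot decide

x = np.zeros((4, 3))
y = np.ones((4, 3))

consistent = cyclegan_objective(x, y, G, F, undecided, undecided)
inconsistent = cyclegan_objective(x, y, G, F_broken, undecided, undecided)
print("objective with consistent generators:", consistent)
print("objective with a broken F:           ", inconsistent)
```

With the same discriminators, the only difference between the two values comes from the cycle terms, which shows how strongly λ penalizes generators that do not invert each other.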

Objective function

The 2nd network uses the same two losses, the adversarial loss and the cycle consistency loss, with the roles of the two domains swapped.

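Adversarial Loss

L_GAN(F, D_X, Y, X) = E_{x ~ p_data(x)}[ log D_X(x) ] + E_{y ~ p_data(y)}[ log(1 - D_X(F(y))) ]

where D_X is the discriminator for the 1st domain.

Cycle Consistency Loss

E_{y ~ p_data(y)}[ ||G(F(y)) - y||_1 ]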

CycleGAN Architecture

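The two networks (1st & 2nd) described above are combined to form the CycleGAN network, and their loss functions are combined to obtain the final objective function of CycleGAN, i.e.

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F)

where L_cyc(G, F) = E_x[ ||F(G(x)) - x||_1 ] + E_y[ ||G(F(y)) - y||_1 ] is the sum of the two cycle consistency losses and λ controls its relative weight.

And its optimization problem is given as,

G*, F* = arg min_{G,F} max_{D_X,D_Y} L(G, F, D_X, D_Y)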

CycleGAN has multiple applications and they are as follows:

• Style Transfer

• Object Transfiguration

• Photograph generation from Painting

• Photograph Enhancement

• Image colorization

Conclusions

If you have followed the article so far, you should have a clear idea of how these GAN variants work. I strongly recommend going through the original papers and the other works listed in the references.

In the upcoming blogs on GANs, we will be focusing more on the actual implementation of DCGAN for image generation. And in the last part, we will be discussing the architecture of StyleGAN and StyleGAN2 and their implementation using TensorFlow.

References

[1] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks https://arxiv.org/pdf/1703.10593.pdf

[2] Ahlad Kumar https://www.youtube.com/channel/UCP9YJJ24w6g38VMVMm6Thtg

[3] InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets https://arxiv.org/pdf/1606.03657.pdf

[4] Bidirectional GAN https://paperswithcode.com/method/bigan

[5] Adversarial Feature Learning https://arxiv.org/pdf/1605.09782.pdf

[6] Goodfellow, I., Bengio, Y. and Courville A. (2016). Deep Learning (MIT Press)

[7] A Gentle Introduction to CycleGAN for Image Translation https://machinelearningmastery.com/what-is-cyclegan/

[8] Implementation of CycleGAN using TensorFlow https://www.tensorflow.org/tutorials/generative/cyclegan
