StyleGAN (Style Generative Adversarial Network)

Sudeep Das
11 min read · Jul 17, 2021


This is the final part of the series “All about GANs”, where we will focus on StyleGAN. In this article, we will explore the architecture of StyleGAN in depth.

If you aren’t very familiar with how GANs work, or are interested in other GAN variants and their inner workings, you can refer to my previous blogs on GANs, listed below.

Blog 1: Introduction to GAN (Generative Adversarial Networks)

Blog 2: All about GANs (Part 1)

Blog 3: All about GANs (Part 2)

Blog 4: Scraping Instagram to Artificially create Sneaker Designs (DCGAN PyTorch)

Before diving into StyleGAN, there are a few concepts one should be aware of.

Basic Concepts (Pre-requisites)

1. Fidelity: How realistic and clear the generated images are; in other words, how close the generated images are to the real ones.

2. Diversity: The variety of images produced by the generator. High diversity means the samples cover the entire range of the real data distribution. If the generator produces only one realistic, high-fidelity image and nothing else, it is not a well-performing GAN; this failure mode is known as mode collapse.

Evaluation Metrics

A. Pixel distance

Pixel distance is not very reliable: a subtle change in an image can produce a large difference in pixel values. And as the resolution increases, the number of pixels also increases, so at higher resolutions two images that look similar can still have a very large pixel distance.
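To make this concrete, here is a minimal sketch (NumPy assumed, the image values are hypothetical): shifting an image by a single pixel leaves it visually unchanged, yet the raw pixel distance can be large.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((256, 256))             # stand-in for a grayscale image in [0, 1]
shifted = np.roll(img, shift=1, axis=1)  # same content, moved right by one pixel

# L2 pixel distance between two visually identical images
pixel_l2 = np.sqrt(np.sum((img - shifted) ** 2))
print(f"L2 pixel distance after a 1-pixel shift: {pixel_l2:.2f}")
```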

B. Feature Distance

Source: https://bit.ly/2TggMR8

Instead of comparing pixels at a granular level, feature distance compares the high-level (semantic) features of the images.

For feature extraction, we use a classifier model, generally via transfer learning with the pre-trained weights of a CNN. The last layer is not used, since it represents the class outputs, which are of no use here.
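As a rough sketch of that idea (torchvision assumed; the exact `weights` argument depends on the torchvision version), one can load a pre-trained Inception-v3 and replace its class-output layer with an identity so that it returns the pooled feature embedding instead of class scores.

```python
import torch.nn as nn
from torchvision import models, transforms

# Load a pre-trained classifier and drop its class-output layer
model = models.inception_v3(weights="IMAGENET1K_V1")
model.fc = nn.Identity()      # keep the 2048-dimensional pooled features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(299),
    transforms.CenterCrop(299),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# `img` is a hypothetical PIL image; its ImageNet embedding would be:
# features = model(preprocess(img).unsqueeze(0))   # shape (1, 2048)
```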

Source: https://morioh.com/p/cca080900a78

The pooling layers earlier in the network (relative to the class output) tend to hold important features that represent the image as a whole; such a layer is often called the feature layer.

The feature extractor used for StyleGAN was Inception-v3, trained on ImageNet, a dataset of about 14 million images. The features extracted with it are called ImageNet embeddings (also referred to as the feature space or embedding space).

C. Fréchet Inception Distance (FID)

It indicates the distance between the real and generated feature distributions: the smaller the value, the closer the two distributions are, or in other words, the more the generated images look like real images.

Lower FID = closer distributions. The number of samples also affects the FID score, since the noise in the estimate reduces as the number of samples increases.

The FID metric is the squared Wasserstein (Fréchet) distance between two multidimensional Gaussian distributions:

i) the Gaussian fitted to the features of the real training images (their feature-wise mean and covariance matrix), and

ii) the Gaussian fitted to the features of the images generated by the GAN (their feature-wise mean and covariance matrix).

It compares the means and covariances of the activations of one of the deeper layers of a convolutional neural network, Inception-v3. It is a better alternative to the Inception Score.
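A minimal sketch of that computation (NumPy and SciPy assumed; `real_feats` and `fake_feats` are hypothetical (N, 2048) arrays of Inception embeddings):

```python
import numpy as np
from scipy import linalg

def compute_fid(real_feats, fake_feats):
    # Fit a Gaussian to each set of features
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)

    # Squared Frechet distance between N(mu_r, cov_r) and N(mu_f, cov_f):
    # ||mu_r - mu_f||^2 + Tr(cov_r + cov_f - 2 * sqrt(cov_r @ cov_f))
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```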

D. Perceptual path length

Perceptual Path Length (PPL) measures how much an image changes as we interpolate between two latent codes, and whether that change occurs smoothly along the perceptual path.

Source: https://medium.com/analytics-vidhya/from-gan-basic-to-stylegan2-680add7abe82

As in the above diagram, the two dog images are the endpoints: they have different colors but largely share the same semantic features, and they are generated from two different latent variables. The blue line between them is the perceptual path.

We subdivide a latent space interpolation path into linear segments, where the total perceptual length of this segmented path is the sum of perceptual differences over each segment.

z1, z2 ∼ P(z), t ∼ U(0, 1), and ‘d’ evaluates the perceptual distance between the resulting images. Slerp denotes spherical interpolation, which is the most appropriate way of interpolating in our normalized input latent space.

An intermediate point along the path produces a mixture of the features of the two images. The average perceptual path length can also be computed in W: the two latent variables z1 and z2 are first mapped through the mapping network, then interpolated at ratios t and t + ε. Lerp denotes linear interpolation, which is used because W is not a normalized space.
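For reference, this is my transcription of the two PPL estimators from the StyleGAN paper (ε is a small step, 10⁻⁴ in the paper; f is the mapping network and g the synthesis network):

```latex
l_Z = \mathbb{E}\!\left[\frac{1}{\epsilon^{2}}\,
      d\big(G(\mathrm{slerp}(z_1, z_2;\, t)),\;
            G(\mathrm{slerp}(z_1, z_2;\, t+\epsilon))\big)\right]

l_W = \mathbb{E}\!\left[\frac{1}{\epsilon^{2}}\,
      d\big(g(\mathrm{lerp}(f(z_1), f(z_2);\, t)),\;
            g(\mathrm{lerp}(f(z_1), f(z_2);\, t+\epsilon))\big)\right]
```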

The shorter the perceptual path length between two images, the more similar they look; the larger it is, the more dissimilar they look.

Source: https://arxiv.org/pdf/1912.04958.pdf

StyleGAN (Style Generative Adversarial Network)

Source: https://towardsdatascience.com/explained-a-style-based-generator-architecture-for-gans-generating-and-tuning-realistic-6cb2be0f431

The above figure shows the entire architecture of StyleGAN, which consists of various components. These components/techniques are broadly listed as:

Source: https://arxiv.org/pdf/1812.04948.pdf

The researchers experimented with various techniques to gauge the change in performance on two different datasets, and the Fréchet inception distance (FID) for each configuration is given. The lower the FID, the better.

A: StyleGAN uses the progressive growing technique from ProGAN

B: Bilinear sampling technique used for upsampling & downsampling.

C: Addition of 8-layer MLP (Noise Mapping network) to disentangle the features. AdaIN is then used to apply styles

D: A learned constant input of size 4×4×512 fed to the first layer of G

E: Addition of Stochastic Variation

F: Style mixing & Mixing Regularization using AdaIN (coarse, medium, fine details)

Progressive Growing & Upsampling/Downsampling

StyleGAN makes it easier for the generator to produce high-resolution images by training it gradually, starting from low-resolution images and moving up to higher resolutions.

It borrows this technique from ProGAN (Progressive GAN). StyleGAN also gives you more control over the features that can be added to the images; because it lets you add features/styles to the images, it is known as StyleGAN.

Source: https://towardsdatascience.com/progan-how-nvidia-generated-images-of-unprecedented-quality-51c98ec2cbd2

Progressive growing takes place gradually: a double-sized image (for example, 8×8 from 4×4) is first produced by upsampling with a nearest-neighbor filter. The new learned layers are then faded in, so that initially, say, 99% of the output comes from the plain upsampling and only 1% from the new convolutional block.

Generator

For upsampling we use the nearest-neighbor filter (in the case of StyleGAN, bilinear sampling is used instead), which has no learned parameters, so it is relied on initially. Only a small fraction (say 1%) of the final generated image is passed through the convolution layer at first, because the convolution layer contains learnable parameters that still need to be trained.

Gradually over time, we shift the upsampling process to the convolutional layers that have learned their weights; the blend between the two paths is controlled by a parameter α.

Discriminator

In the case of the discriminator, a similar principle applies, except that instead of upsampling, the image is downsampled by 0.5×. Again, learnable parameters are present, and over time the discriminator gradually shifts to relying on its convolutional layers for downsampling.

Source: https://bit.ly/2TggMR8

And this is what a progressive-growing block looks like: it consists of a sampling (up/down) layer and two convolution layers, which learn parameters and capture more features from the training set. A minimal sketch of the fade-in is given below.
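This sketch of the fade-in on the generator side assumes PyTorch; the function and argument names are mine. α ramps from 0 to 1, so the output shifts gradually from the unlearned upsampling path to the new learned convolutional block.

```python
import torch.nn.functional as F

def fade_in(old_rgb, new_block_rgb, alpha):
    """Blend the upsampled old output with the new block's output."""
    # Unlearned 2x upsampling of the previous resolution's image
    upsampled = F.interpolate(old_rgb, scale_factor=2,
                              mode="bilinear", align_corners=False)
    # alpha = 0: pure upsampling; alpha = 1: pure learned convolution output
    return alpha * new_block_rgb + (1.0 - alpha) * upsampled
```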

Noise Mapping Network

The mapping network takes the latent (noise) vector ‘z’, a 512-dimensional vector, and passes it through the network to produce ‘w’. The mapping network consists of 8 fully connected layers (an MLP, multi-layer perceptron).
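A minimal sketch of such a mapping network (PyTorch assumed; the layer width of 512 matches the paper, while activation and normalization details are simplified):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)   # w has the same dimensionality as z

# Example: map four 512-dimensional noise vectors to four styles
# w = MappingNetwork()(torch.randn(4, 512))
```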

Z-space entanglement

The features in Z-space are entangled, meaning ‘z’ does not map onto output features in a one-to-one manner; because it is highly entangled, changing one feature also changes others. The reason is that ‘z’ has to follow the probability density function (p.d.f.) of the real images, which makes it hard for ‘z’ to map to the features individually.

Altering one value would alter the values of other features.

Source: https://bit.ly/2TggMR8

The mapping network converts the latent z into an intermediate latent space W. To achieve this disentanglement, z is passed through the 8-layer MLP (mapping network), which gives more control over the features. This lets you control the styles that can be added to the outcome.

And this is how ‘z’ is passed through the mapping network to become ‘w’. Instead of feeding ‘z’ to the progressive-growing layers directly, as in previous GAN models, a fixed learned constant of size 4×4×512 is used as the input, because adding ‘z’ there made no noticeable difference. A separate learned affine transformation A is applied to w at each layer, so that it acts as the style information.

Addition of Styles using AdaIN (Adaptive instance normalization)

Source: https://towardsdatascience.com/explained-a-style-based-generator-architecture-for-gans-generating-and-tuning-realistic-6cb2be0f431

The ‘w’ obtained from the noise mapping network is fed into the progressive-growing layers: it is supplied to the AdaIN layers that sit between the convolution layers, as can be seen in the above diagram. AdaIN is built from two main concepts:

1. Instance Normalization over Batch Normalization

2. Adaptive Instance Normalization

Instance Normalization vs Batch Norm

In instance normalization, the convolution outputs are normalized: they are transformed to have a mean of 0 and a standard deviation of 1. This is done using the mean and std of each instance instead of the entire batch.

Batch norm computes its statistics over the height and width of the image (highlighted in blue) across the entire batch of images. But every image is different, so it is important to normalize over a single instance (one example and one channel) of the batch.

That per-instance computation is indexed by the instance ‘i’; a small sketch of the difference is given below.
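A small sketch of that difference (PyTorch assumed): batch norm averages over the batch, height and width for each channel, while instance norm averages over height and width for each sample and channel separately.

```python
import torch

x = torch.randn(8, 3, 64, 64)                  # (batch, channels, H, W)

bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)  # one mean per channel
in_mean = x.mean(dim=(2, 3), keepdim=True)     # one mean per (sample, channel)

print(bn_mean.shape)   # torch.Size([1, 3, 1, 1])
print(in_mean.shape)   # torch.Size([8, 3, 1, 1])
```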

AdaIN (Adaptive Instance Normalization)

‘w’ is applied at multiple layers via the AdaIN layers to apply adaptive styles to the images. From ‘w’, two learnable affine layers produce the style parameters y(s) (scale) and y(b) (bias), which are then fed into the AdaIN layer.
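A hedged sketch of an AdaIN layer (PyTorch assumed; the class and parameter names are mine): the feature map is instance-normalized and then scaled and shifted by y(s) and y(b), each produced from w by a learned affine layer.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.instance_norm = nn.InstanceNorm2d(num_channels)
        self.style_scale = nn.Linear(w_dim, num_channels)  # produces y(s)
        self.style_bias = nn.Linear(w_dim, num_channels)   # produces y(b)

    def forward(self, x, w):
        normalized = self.instance_norm(x)       # per-instance mean 0, std 1
        y_s = self.style_scale(w)[:, :, None, None]
        y_b = self.style_bias(w)[:, :, None, None]
        return y_s * normalized + y_b
```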

The ‘w’ added to the earlier progressive generator blocks defines most of the coarser features, while its addition to the later stages defines the finer details. This gives us control over features at different scales: coarse, medium and fine.

Note:

It’s important to understand that much of the style of an image can be expressed in terms of statistics such as its mean, std, scale and bias. For example, a painting of a human or a landscape may vary from artist to artist, as each has a different style, but the main subject tends to remain the same, i.e., the human, the scenery, etc.

Style Mixing & Mixing Regularization

The ‘w’ obtained from the noise mapping network is added to all the layers of the progressive-growing block, which gives us more control. Here we can mix multiple ‘w’ values, such as w1, w2, w3, etc., obtained from different ‘z’ values.

Source: https://jonathan-hui.medium.com/gan-stylegan-stylegan2-479bdf256299

The ‘w’ added at the earlier stages defines most of the coarser features, because it sets the basic structure, while its addition at the later stages defines the finer details.

Since the per-layer styles can be controlled, we can obtain different images by feeding different styles to different layers. One such example is given below, where w1 belongs to one style and w2 belongs to another. Style mixing also helps in producing a lot of diversity.

Source: https://bit.ly/2TggMR8
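A minimal sketch of the style-mixing idea (Python; the names are hypothetical): one style w1 drives the layers before a crossover point and another style w2 drives the layers after it.

```python
def mix_styles(w1, w2, num_layers, crossover):
    """Return one style vector per synthesis layer."""
    return [w1 if layer < crossover else w2 for layer in range(num_layers)]

# Example: with 18 style inputs, crossover=8 means w1 controls the coarse
# layers (0-7) and w2 controls the medium and fine layers (8-17).
# styles = mix_styles(w1, w2, num_layers=18, crossover=8)
```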

Stochastic Variation

To add some degree of variation to the details of the images, noise is injected at different levels, producing different degrees of variation in the output. As you can see below, noise added in the earlier layers produces coarser changes such as large hair curls, whereas noise added in the later layers affects the finer details.

The noise is drawn from a normal distribution, scaled, and added to the feature map ‘x’ just before the AdaIN layer. The degree to which it affects the output is defined by a learned factor lambda (λ); since noise is added at two different convolutional layers in every block, these are denoted λ1 and λ2. This helps in controlling extremely minute details.
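A hedged sketch of that noise injection (PyTorch assumed; the class name is mine): single-channel Gaussian noise is scaled by a learned per-channel factor, the λ in the text, and added to the feature map before AdaIN.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        # learned per-channel scaling factor, initialized to zero
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        batch, _, height, width = x.shape
        noise = torch.randn(batch, 1, height, width, device=x.device)
        return x + self.weight * noise
```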

Truncation Trick

Generated images correspond to samples from a normal distribution: images generated near the peak of the distribution tend to have higher fidelity but lower diversity, while the ones near the tails tend to have higher diversity but lower fidelity. In the figure, the images at the top are quite similar apart from their colors and share the same background, whereas the images near the tail end are more diverse but do not look authentic and real.

Source: https://bit.ly/2TggMR8

Hence the images in the tail regions are truncated: they lie in areas of low probability density, so there are not enough such samples for the model to train on and learn from.

Its equation is given below, where ψ is called the style scale; in the StyleGAN paper the truncated style is computed as w’ = w̄ + ψ(w − w̄), where w̄ is the average of w over many samples. Generally, truncation is applied only at the lower-resolution layers, while the coarser styles are being defined, as it does not affect the details in the high-resolution layers.

w’ = Truncated w

w = Actual w

ψ = Style scale
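A minimal sketch of that truncation (Python; the value of ψ and the way w̄ is estimated are illustrative):

```python
def truncate(w, w_avg, psi=0.7):
    """Pull a style vector toward the average style by a factor psi."""
    return w_avg + psi * (w - w_avg)

# psi = 1 leaves w unchanged (full diversity);
# psi = 0 collapses every style to w_avg (maximum fidelity, no diversity).
```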

Issues

Blob-like artifacts: The AdaIN normalization used in StyleGAN for applying styles causes blob-like artifacts, as shown below.

Source: https://arxiv.org/pdf/1912.04958.pdf

Phase artifacts: In the below figure the teeth do not follow the pose but stay aligned to the camera, as indicated by the blue line.

Source: https://arxiv.org/pdf/1912.04958.pdf

References

[1] A Style-Based Generator Architecture for Generative Adversarial Networks; https://arxiv.org/pdf/1812.04948.pdf

[2] From GAN basic to StyleGAN2; https://medium.com/analytics-vidhya/from-gan-basic-to-stylegan2-680add7abe82

[3] Build Better Generative Adversarial Networks (GANs); https://www.coursera.org/learn/build-better-generative-adversarial-networks-gans

[4] Explained: A Style-Based Generator Architecture for GANs — Generating and Tuning Realistic Artificial Faces; https://towardsdatascience.com/explained-a-style-based-generator-architecture-for-gans-generating-and-tuning-realistic-6cb2be0f431

[5] GAN — StyleGAN & StyleGAN2; https://jonathan-hui.medium.com/gan-stylegan-stylegan2-479bdf256299

[6] Analyzing and Improving the Image Quality of StyleGAN; https://arxiv.org/pdf/1912.04958.pdf
