# Review — InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

## Besides Using Latent Vector z, Latent Code c is also Input to GAN, for Learning Disentangled Representations

In this story, **Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets** (InfoGAN), by OpenAI, is reviewed. In this paper:

- **InfoGAN** is designed to **maximize the mutual information between a small subset of the latent variables and the observation.**
- **A lower bound of the mutual information objective is derived** that can be optimized efficiently.
- By doing so, InfoGAN successfully **disentangles writing styles from digit shapes** on the MNIST dataset, and **disentangles the visual concepts** that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.

This is a paper in **2016 NIPS** with over **3000 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **InfoGAN Concept**
2. **InfoGAN Framework**
3. **Experimental Results**

# 1. **InfoGAN Concept**

## 1.1. Minimax Game Using Mutual Information

In information theory, mutual information between *X* and *Y*, *I*(*X*;*Y*), measures the “amount of information” learned from knowledge of random variable *Y* about the other random variable *X*.

- The mutual information can be expressed as the difference of two entropy terms:

- *I*(*X*;*Y*) is the reduction of uncertainty in *X* when *Y* is observed.
- If *X* and *Y* are independent, then *I*(*X*;*Y*) = 0, because knowing one variable reveals nothing about the other.
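The entropy decomposition mentioned above is the standard identity below, written in LaTeX with *H* denoting entropy and *H*(·|·) conditional entropy. It is reconstructed from the usual definition, since the equation itself is not shown in this text:

$$
I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
$$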

By contrast, if *X* and *Y* are related by a deterministic, invertible function, then maximal mutual information is attained.
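The two extreme cases above can be checked numerically. The following is a minimal sketch, assuming discrete variables described by a small joint probability table; the function name `mutual_information` and the toy tables are illustrative, not from the paper:

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                       # 0 * log 0 is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent variables: the joint is the product of the marginals.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Deterministic, invertible relation: Y = 1 - X.
invertible = [[0.0, 0.5],
              [0.5, 0.0]]

mi_independent = mutual_information(independent)  # 0.0
mi_invertible = mutual_information(invertible)    # log 2, i.e. H(X)
```

In the invertible case the mutual information equals the entropy of *X*, which is the largest value it can take for this marginal.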

- Similar mutual information inspired objectives have been considered before in the context of clustering [23–25].
- This interpretation makes it easy to formulate a cost:

Given any *x*~*P_G*(*x*), we want *P_G*(*c*|*x*) to have a small entropy. In other words, the information in the latent code *c* should not be lost in the generation process.

- In regular GAN, the minimax game is:
- Now, the following **information-regularized minimax game** is solved:
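For reference, the two games can be written as follows. These are reconstructed in the notation of the InfoGAN paper rather than taken from this text: *V*(*D*,*G*) is the standard GAN value function, and *λ* weights the mutual information term.

$$
\min_G \max_D V(D,G) = \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim \text{noise}}[\log(1 - D(G(z)))]
$$

$$
\min_G \max_D V_I(D,G) = V(D,G) - \lambda\, I(c; G(z,c))
$$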

## 1.2. Variational Mutual Information Maximization

- In practice, **the mutual information term** *I*(*c*;*G*(*z*,*c*)) is hard to maximize directly as it requires access to the posterior *P*(*c*|*x*).
- But a lower bound of it can be obtained by defining an auxiliary distribution *Q*(*c*|*x*) to approximate *P*(*c*|*x*).
- **A variational lower bound**, *L_I*(*G*,*Q*), of the mutual information:
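A sketch of how the bound arises, following the derivation in the InfoGAN paper. The inequality holds because the KL divergence between the true posterior *P* and the auxiliary distribution *Q* is non-negative:

$$
\begin{aligned}
I(c; G(z,c)) &= H(c) - H(c \mid G(z,c)) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c \mid x)}[\log P(c' \mid x)]\big] + H(c) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[D_{\mathrm{KL}}(P(\cdot \mid x) \,\|\, Q(\cdot \mid x)) + \mathbb{E}_{c' \sim P(c \mid x)}[\log Q(c' \mid x)]\big] + H(c) \\
&\ge \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c \mid x)}[\log Q(c' \mid x)]\big] + H(c)
\end{aligned}
$$

$$
L_I(G,Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}[\log Q(c \mid x)] + H(c) \le I(c; G(z,c))
$$

The second form only needs samples of *c* from the prior and *x* from the generator, so the posterior *P*(*c*|*x*) never has to be evaluated.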

- *L_I*(*G*,*Q*) is added to **GAN’s objectives** with no change to GAN’s training procedure, which results in **Information Maximizing Generative Adversarial Networks (InfoGAN).**
- InfoGAN is defined as the following minimax game with **a variational regularization of mutual information** *L_I*(*G*,*Q*) and a hyperparameter *λ*:
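Written out in the paper's notation (reconstructed here, since the equation is not shown in this text), the game replaces the intractable mutual information term with its lower bound and lets *Q* be optimized together with *G*:

$$
\min_{G,Q} \max_D V_{\text{InfoGAN}}(D,G,Q) = V(D,G) - \lambda L_I(G,Q)
$$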

# 2. InfoGAN Framework

- **The auxiliary distribution** *Q* is parametrized as a **neural network**.
- In most experiments, *Q* and *D* share all convolutional layers and there is **one final fully connected layer to output parameters for the conditional distribution** *Q*(*c*|*x*), which means InfoGAN only adds a negligible computation cost to GAN.
- It is observed that *L_I*(*G*,*Q*) always converges faster than normal **GAN objectives** and hence InfoGAN essentially comes for free with GAN.
- For **categorical latent code** *c_i*, we use the natural choice of **softmax nonlinearity** to represent *Q*(*c_i*|*x*).
- For **continuous latent code** *c_j*, there are **more options** depending on what is the true posterior *P*(*c_j*|*x*). In the experiments, simply treating *Q*(*c_j*|*x*) as a **factored Gaussian** is sufficient.
- The experiments are based on existing techniques introduced by DCGAN.
- Simply setting *λ* to 1 is sufficient for discrete latent codes. When the latent code contains continuous variables, a smaller *λ* is typically used.
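The two choices for *Q* described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the inputs stand in for the outputs of the final fully connected layer of *Q*, and using a single *λ* for both code types is a simplifying assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_log_q(logits, code_index):
    """log Q(c_i | x) for a categorical code, with Q given by a softmax."""
    return math.log(softmax(logits)[code_index])

def gaussian_log_q(codes, means, log_stds):
    """log Q(c_j | x) for continuous codes under a factored (diagonal) Gaussian."""
    total = 0.0
    for c, mu, log_std in zip(codes, means, log_stds):
        var = math.exp(2.0 * log_std)
        total += -0.5 * math.log(2.0 * math.pi) - log_std - 0.5 * (c - mu) ** 2 / var
    return total

def info_loss(cat_logits, cat_index, cont_codes, cont_means, cont_log_stds, lam=1.0):
    """Negative log-likelihood of the sampled codes under Q, scaled by lambda.

    Minimizing this maximizes the lower bound up to the constant H(c).
    """
    log_q = categorical_log_q(cat_logits, cat_index)
    log_q += gaussian_log_q(cont_codes, cont_means, cont_log_stds)
    return -lam * log_q

# Example: Q is uniform over 10 categories and predicts the continuous
# codes exactly with unit standard deviation.
loss = info_loss([0.0] * 10, 3, [0.2, -0.5], [0.2, -0.5], [0.0, 0.0])
# loss = log(10) + log(2 * pi), roughly 4.14
```

Because the codes are sampled from a fixed prior, *H*(*c*) is a constant, so in practice only the log-likelihood term needs gradients.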

# 3. Experimental Results

## 3.1. Mutual Information Maximization

- InfoGAN is trained on the MNIST dataset with a uniform categorical distribution on latent codes *c*~Cat(*K*=10,*p*=0.1).
- In the paper's plot of the lower bound during training, **the lower bound** *L_I*(*G*,*Q*) is quickly maximized to *H*(*c*) ≈ 2.30, which means the derived bound is tight and maximal mutual information is achieved.
- On the other hand, **the generator of regular GAN is not explicitly encouraged to maximize the mutual information with the latent codes.** Hence there is little mutual information between latent codes and generated images in regular GAN.
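The value 2.30 is simply the entropy of a uniform 10-way categorical code, measured in nats. A small sketch of the arithmetic, with a hypothetical helper `lower_bound_estimate` that averages log *Q*(*c*|*x*) over samples and adds *H*(*c*):

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def lower_bound_estimate(log_q_values, code_probs):
    """Monte Carlo estimate of L_I = E[log Q(c|x)] + H(c)."""
    return sum(log_q_values) / len(log_q_values) + entropy(code_probs)

K = 10
prior = [1.0 / K] * K
h_c = entropy(prior)  # ln 10, roughly 2.30

# If Q recovers every sampled code with certainty, log Q = 0 and L_I = H(c).
perfect = lower_bound_estimate([0.0] * 100, prior)

# If Q ignores x and always outputs the uniform prior, L_I is about 0.
uninformed = lower_bound_estimate([math.log(1.0 / K)] * 100, prior)
```

So a curve that climbs to about 2.30 corresponds to the first case, while a regular GAN would be expected to sit near the second.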

## 3.2. Disentangled Representation

## 3.2.1. MNIST

- **(a)**: The discrete code *c*₁ captures drastic change in shape. **Changing categorical code** *c*₁ switches between digits most of the time. **If InfoGAN is trained without any label,** *c*₁ can be used as a classifier that achieves **5% error rate** in classifying MNIST digits by matching each category in *c*₁ to a digit type. In the **second row of (a)**, we can observe a **digit 7 is classified as a 9.**
- **(b)**: **For regular GAN, no clear meaning** on changing categorical code *c*₁.
- **(c)-(d)**: **Two continuous codes** *c*₂ and *c*₃ are added to **capture variations** that are continuous in nature: *c*₂, *c*₃~Unif(-1, 1).
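The matching step in (a) can be sketched as follows. The exact matching procedure is not described here, so this assumes a simple majority vote per category (not forced to be one-to-one), and it assumes the predicted category for each real image comes from the most likely *c*₁ under *Q*. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def match_categories_to_digits(predicted_codes, true_digits):
    """Map each category of c1 to the digit it most often co-occurs with."""
    counts = defaultdict(Counter)
    for code, digit in zip(predicted_codes, true_digits):
        counts[code][digit] += 1
    return {code: ctr.most_common(1)[0][0] for code, ctr in counts.items()}

def error_rate(predicted_codes, true_digits, mapping):
    """Fraction of images whose matched digit differs from the true digit."""
    wrong = sum(
        1 for code, digit in zip(predicted_codes, true_digits)
        if mapping[code] != digit
    )
    return wrong / len(true_digits)

# Toy data: category 0 mostly holds 7s, but one 9 was assigned to it.
codes = [0, 0, 0, 1, 1, 2, 2, 2]
digits = [7, 7, 9, 3, 3, 9, 9, 9]

mapping = match_categories_to_digits(codes, digits)  # {0: 7, 1: 3, 2: 9}
rate = error_rate(codes, digits, mapping)            # 1 of 8 wrong: 0.125
```

Labels are only used after training to name the categories and score them; the codes themselves are learned without supervision.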

Particularly, continuous codes *c*₂, *c*₃ capture continuous variations in style: *c*₂ models **rotation** of digits and *c*₃ controls the **width**. Images are plotted with the codes swept **from -2 to 2**, covering a wide region that the network was never trained on, and we still get **meaningful generalization**.
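A sweep like the one described above can be sketched as follows. The code distributions follow the MNIST setup in this review; the noise dimension of 62, the Gaussian noise, and the function names are assumptions made for illustration, and the generator itself is omitted.

```python
import random

def sample_latent(num_categories=10, num_continuous=2, noise_dim=62, rng=random):
    """Sample the generator input: noise z, one-hot c1 ~ Cat(K=10, p=0.1),
    and continuous codes c2, c3 ~ Unif(-1, 1), concatenated into one list."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # assumed noise prior
    category = rng.randrange(num_categories)
    one_hot = [1.0 if k == category else 0.0 for k in range(num_categories)]
    continuous = [rng.uniform(-1.0, 1.0) for _ in range(num_continuous)]
    return z + one_hot + continuous

def traverse_code(latent, index, low=-2.0, high=2.0, steps=10):
    """Return copies of `latent` with entry `index` swept linearly from low to high."""
    rows = []
    for i in range(steps):
        row = list(latent)
        row[index] = low + (high - low) * i / (steps - 1)
        rows.append(row)
    return rows

rng = random.Random(0)
latent = sample_latent(rng=rng)
c2_index = 62 + 10                       # first continuous code follows z and c1
grid = traverse_code(latent, c2_index)   # 10 inputs, c2 going from -2 to 2
```

Each row of `grid` would be fed to the generator to produce one image in a row of the figure. Values outside [-1, 1] were never sampled during training, which is why the smooth results there count as generalization.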

## 3.2.2. 3D Faces & 3D Chairs

- For 3D faces, five continuous latent codes are used.

InfoGAN learns a **disentangled representation** that recovers azimuth (pose), elevation, lighting, and wide/narrow faces.

- For 3D chairs, four categorical codes and one continuous code are used.

InfoGAN is also able to continuously **interpolate between similar chair types of different widths** using a single continuous code.

## 3.2.3. Street View House Number (SVHN)

- The Street View House Number (SVHN) dataset is significantly more challenging for learning an interpretable representation because it is noisy, contains images of variable resolution and distracting digits, and does not have multiple variations of the same object.
- Four 10-dimensional categorical variables and two uniform continuous variables are used as latent codes.

InfoGAN can learn a **disentangled representation** that recovers lighting and plate context.

## 3.2.4. CelebA

- CelebA includes **200,000 celebrity images** with **large pose variations** and **background clutter**.
- The latent variation is modeled as 10 uniform categorical variables, each of dimension 10.
- Surprisingly, even in this complicated dataset, InfoGAN can recover azimuth as in 3D images even though in this dataset no single face appears in multiple pose positions.

Moreover, InfoGAN can disentangle other highly semantic variations like presence or absence of **glasses**, **hairstyles** and **emotion**, demonstrating that a level of visual understanding is acquired.

**While DC-IGN [7]** was shown to learn highly interpretable graphics codes, it **requires supervision**. It was previously not possible to learn a latent code for a variation that’s unlabeled, and hence salient latent factors of variation could not be discovered automatically from data.

By contrast, **InfoGAN is able to discover such variation on its own.**

(Btw, mutual information maximization is one of the essential elements for self-supervised learning as well.)

## Reference

[2016 NIPS] [InfoGAN]

Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

## Generative Adversarial Network (GAN)

**Image Synthesis:** **2014** [GAN] [CGAN] **2015** [LAPGAN] **2016** [AAE] [DCGAN] [CoGAN] [VAE-GAN] [InfoGAN] **2017** [SimGAN] [BiGAN] [ALI] [LSGAN] [EBGAN]

**Image-to-image Translation:** **2017** [Pix2Pix] [UNIT] [CycleGAN] **2018** [MUNIT]