Deep Learning Paper Series (08): Conditional GAN

劉智皓 (Chih-Hao Liu)

11 min read · Nov 16, 2023

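In the previous chapters we covered the theoretical foundations of GANs. Every GAN we discussed so far followed the same pattern: feed noise into the generator and get an image out.

That raises a natural question: can we feed in an image instead and get back an image in a different style? Yes, and this problem is known as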

image-to-image translation

So in this post we introduce a very well-known architecture for it, called

Conditional GAN

The original paper is titled:

Image-to-image translation with conditional adversarial networks

Conditional GAN is abbreviated as cGAN. The image-to-image model proposed in this paper is also widely known as Pix2Pix.

cGAN theory and idea

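The idea behind cGAN is very close to that of the original GAN. Building on what we learned about GANs, we can write down the following objective: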
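
For reference, the conditional adversarial objective as given in Isola et al. [1] can be written in the notation used here as follows. Note that in the paper the discriminator D receives the input image x alongside the real or generated image.

```latex
\mathcal{L}_{cGAN}(G, D) =
  \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]
```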

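The goal is to translate an image x~X into an image y~Y. The generator G takes the image x together with some noise z, and we want its output G(x,z) to be very close to the target image y. The discriminator, in turn, learns to tell the target image y apart from the translated image G(x,z) as well as it can. In the paper the discriminator is also shown the input x, so it actually compares the pairs (x, y) and (x, G(x,z)).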

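We can add one more constraint here. Because we want to map images from the X distribution to the Y distribution, we give training a clearer direction by adding a regularization term, an L1 distance in the paper [1], that pushes the translated image G(x,z) to match the target image y as closely as possible.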
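
Written out, the L1 regularization term from [1] measures the expected pixel-wise absolute difference between the target y and the generator output G(x,z):

```latex
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_{1}\big]
```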

Adding the two losses together gives the cGAN training objective:
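
In the paper [1], the two terms are combined with a weight λ on the L1 term (the paper uses λ = 100 in its experiments). Here L_cGAN denotes the adversarial loss and L_L1 the L1 regularizer:

```latex
G^{*} = \arg\min_{G}\,\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
```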

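Model architecture

In cGAN we translate x into y, and every pixel of the input has a corresponding pixel in the output, so in practice a U-Net is used as the generator architecture. This pixel-to-pixel correspondence is also why cGAN is called Pix2Pix.

The discriminator is a Markovian discriminator (PatchGAN): it looks at each N × N patch of the image and classifies that patch as real or fake.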

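Results

Effect of the L1 norm

Here we can look at the training results of the model.

If the model is trained with the L1 norm alone, it behaves much like an autoencoder: only the output is constrained pixel by pixel, and the paper reports blurry results in this setting. If it is trained with the GAN loss alone, some regions of the generated image come out distorted and implausible. Using both losses together gives the best translations.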
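
The three settings compared above differ only in which terms of the generator loss are switched on. The following is a minimal pure-Python sketch of that idea, not the repository's implementation: the function names `bce` and `generator_loss`, the flags `use_gan` and `use_l1`, and the toy list inputs are all hypothetical. It assumes the non-saturating generator loss (pushing the discriminator's patch outputs toward "real") and the weight λ = 100 reported in [1].

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single probability."""
    eps = 1e-12  # guard against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def generator_loss(patch_probs, fake_pixels, real_pixels,
                   use_gan=True, use_l1=True, lam=100.0):
    """Toy generator loss on plain Python lists (hypothetical helper).

    patch_probs -- discriminator "real" probabilities, one per patch of G(x, z)
    fake_pixels -- flattened generator output G(x, z)
    real_pixels -- flattened target image y
    lam         -- weight of the L1 term (assumed 100, as reported in [1])
    """
    loss = 0.0
    if use_gan:
        # Adversarial term: average over patches, target label "real" (1).
        loss += sum(bce(p, 1.0) for p in patch_probs) / len(patch_probs)
    if use_l1:
        # L1 term: mean absolute pixel difference, scaled by lam.
        l1 = sum(abs(f - r) for f, r in zip(fake_pixels, real_pixels)) / len(real_pixels)
        loss += lam * l1
    return loss
```

Calling `generator_loss(..., use_gan=False)` corresponds to the L1-only setting, `use_l1=False` to the GAN-only setting, and the defaults to the combined loss.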

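Effect of the U-Net architecture

We can also look at another experiment.

A plain encoder-decoder has no skip connections to link features at different receptive-field sizes, so the results produced with the U-Net architecture show noticeably more detail.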

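Discriminator receptive field

The authors also ran an experiment that varies the discriminator's receptive field (the patch size N). With a small receptive field the generated images are more diverse, mainly in color, but blurrier; with a larger receptive field they become sharper. In the paper [1], the 70 × 70 patch gave the best trade-off, and growing the receptive field to the full 286 × 286 image did not improve quality further.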
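
To make the patch size N concrete, the sketch below computes the receptive field of one output unit of a PatchGAN-style conv stack. It assumes the layer pattern of the `NLayerDiscriminator` shown in the implementation section (4 × 4 kernels, `n_layers` stride-2 convolutions followed by two stride-1 convolutions); the function name `patchgan_receptive_field` is hypothetical. A 1 × 1 discriminator uses a different architecture and is not covered by this formula.

```python
def patchgan_receptive_field(n_layers, kernel=4):
    """Receptive field (in input pixels) of one output unit of the conv stack."""
    # Assumed stride pattern: n_layers stride-2 convs, then two stride-1 convs
    # (the last of which produces the 1-channel prediction map).
    strides = [2] * n_layers + [1, 1]
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance in input pixels between neighbouring units
    for stride in strides:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


for n in (1, 3, 5):
    print(n, patchgan_receptive_field(n))
```

Under these assumptions, `n_layers` of 1, 3, and 5 give receptive fields of 16, 70, and 286 pixels, which line up with the patch sizes discussed above.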

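Other results

Here are some more results produced by cGAN.

Implementation

Finally, the part everyone cares about most: the implementation! I recommend this repository: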

GitHub - junyanz/pytorch-CycleGAN-and-pix2pix: Image-to-Image Translation in PyTorch (github.com)

We can see that its generator is a U-Net:

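import torch.nn as nn


# UnetSkipConnectionBlock is defined alongside this class in the repository's models/networks.py.
class UnetGenerator(nn.Module):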
    """Create a Unet-based generator"""
    def __init__(self, input_nc, output_nc, num_downs, ngf=64, norm_layer=nn.BatchNorm2d, use_dropout=False):
        """Construct a Unet generator
        Parameters:
            input_nc (int)  -- the number of channels in input images
            output_nc (int) -- the number of channels in output images
            num_downs (int) -- the number of downsamplings in UNet. For example, if |num_downs| == 7,
                                image of size 128x128 will become of size 1x1 at the bottleneck
            ngf (int)       -- the number of filters in the last conv layer
            norm_layer      -- normalization layer
        We construct the U-Net from the innermost layer to the outermost layer.
        It is a recursive process.
        """
        super(UnetGenerator, self).__init__()
        # construct unet structure
        unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None, submodule=None, norm_layer=norm_layer, innermost=True)  # add the innermost layer
        for i in range(num_downs - 5):          # add intermediate layers with ngf * 8 filters
            unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None, submodule=unet_block, norm_layer=norm_layer, use_dropout=use_dropout)
        # gradually reduce the number of filters from ngf * 8 to ngf
        unet_block = UnetSkipConnectionBlock(ngf * 4, ngf * 8, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
        unet_block = UnetSkipConnectionBlock(ngf * 2, ngf * 4, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
        unet_block = UnetSkipConnectionBlock(ngf, ngf * 2, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
        self.model = UnetSkipConnectionBlock(output_nc, ngf, input_nc=input_nc, submodule=unet_block, outermost=True, norm_layer=norm_layer)  # add the outermost layer
    def forward(self, input):
        """Standard forward"""
        return self.model(input)
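
As a rough check of the docstring's claim that a 128x128 image reaches 1x1 after 7 downsamplings, the sketch below lists the spatial size and channel width after each encoder step. It reads the widths off the constructor above (ngf, ngf * 2, ngf * 4, then ngf * 8 for every remaining block) and assumes each `UnetSkipConnectionBlock` halves height and width, which is not shown in this excerpt. The function name `unet_encoder_plan` is hypothetical.

```python
def unet_encoder_plan(image_size, num_downs, ngf=64):
    """Return [(spatial_size, channels), ...] after each downsampling step."""
    if num_downs < 5:
        raise ValueError("this sketch follows the constructor above, which needs num_downs >= 5")
    # Channel widths from outermost to innermost block.
    widths = [ngf, ngf * 2, ngf * 4] + [ngf * 8] * (num_downs - 3)
    plan = []
    size = image_size
    for channels in widths:
        size //= 2  # assumption: every block halves the spatial resolution
        plan.append((size, channels))
    return plan


for step in unet_encoder_plan(256, 8):
    print(step)
```

With `image_size=256` and `num_downs=8`, the last entry is a 1x1 bottleneck with 512 channels; the decoder then mirrors these steps, and the skip connections pair each encoder level with the decoder level of the same resolution.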

Its discriminator is a Markovian discriminator (PatchGAN discriminator):

import functools

import torch.nn as nn


class NLayerDiscriminator(nn.Module):
    """Defines a PatchGAN discriminator"""
    def __init__(self, input_nc, ndf=64, n_layers=3, norm_layer=nn.BatchNorm2d):
        """Construct a PatchGAN discriminator
        Parameters:
            input_nc (int)  -- the number of channels in input images
            ndf (int)       -- the number of filters in the first conv layer
            n_layers (int)  -- the number of conv layers in the discriminator
            norm_layer      -- normalization layer
        """
        super(NLayerDiscriminator, self).__init__()
        if type(norm_layer) == functools.partial:  # no need to use bias as BatchNorm2d has affine parameters
            use_bias = norm_layer.func == nn.InstanceNorm2d
        else:
            use_bias = norm_layer == nn.InstanceNorm2d
        kw = 4
        padw = 1
        sequence = [nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw), nn.LeakyReLU(0.2, True)]
        nf_mult = 1
        nf_mult_prev = 1
        for n in range(1, n_layers):  # gradually increase the number of filters
            nf_mult_prev = nf_mult
            nf_mult = min(2 ** n, 8)
            sequence += [
                nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=2, padding=padw, bias=use_bias),
                norm_layer(ndf * nf_mult),
                nn.LeakyReLU(0.2, True)
            ]
        nf_mult_prev = nf_mult
        nf_mult = min(2 ** n_layers, 8)
        sequence += [
            nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=1, padding=padw, bias=use_bias),
            norm_layer(ndf * nf_mult),
            nn.LeakyReLU(0.2, True)
        ]
        sequence += [nn.Conv2d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)]  # output 1 channel prediction map
        self.model = nn.Sequential(*sequence)
    def forward(self, input):
        """Standard forward."""
        return self.model(input)
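
Because the network ends in a 1-channel convolution rather than a single scalar, its output is a grid of per-patch real/fake scores. The sketch below estimates the side length of that grid from the layer settings visible in the code above (kernel 4, padding 1, `n_layers` stride-2 convolutions, then two stride-1 convolutions). The helper names `conv_out` and `patch_map_size` are hypothetical, and the standard convolution output-size formula is assumed.

```python
def conv_out(size, kernel=4, stride=1, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_map_size(image_size, n_layers=3):
    """Side length of the prediction map produced by the conv stack above."""
    size = image_size
    for _ in range(n_layers):  # stride-2 convolutions
        size = conv_out(size, stride=2)
    for _ in range(2):         # two stride-1 convolutions
        size = conv_out(size, stride=1)
    return size


print(patch_map_size(256))
```

For a 256x256 input with the default `n_layers=3`, this gives a 30x30 map, so each of the 900 outputs judges one overlapping patch of the image.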

Reference

[1] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1125–1134).