In this lab we will train an image colorization network which, given a grayscale image, hallucinates its RGB version.
Image generation/translation is a more memory- and computation-intensive task than image classification, and it is also harder to train. On the other hand, it is easy to obtain a practically infinite supply of labelled data. That is why, while it is still possible to run the training for the current assignment on a CPU, we recommend using a GPU (if available) or Google Colab.
In the latter case, you will find comments with additional commands to run.
Below is an example of the [Deoldify](https://deoldify.ai) colorization model output on a childhood B&W photo of Dmytro Mishkin:
We will train a Unet-style CNN with the content (aka [perceptual](https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf)) loss on the ImageNette dataset, which we have already used in the previous assignment. Now we will go over the functions, classes, and models you need to implement and train.
%% Cell type:code id: tags:
``` python
# FOR COLAB: uncomment and run the following code every time you are starting the session
#!pip install kornia==0.6.3
```
%% Cell type:markdown id: tags:
[Unet](https://arxiv.org/abs/1505.04597) is one of the most popular and well-performing architectures for image-to-image translation tasks (segmentation, colorization, restoration, etc.). It consists of an image encoder, similar to the one used for image classification, and a decoder, which gradually increases the spatial resolution back to the original. In addition, all blocks of the decoder are skip-connected to their same-resolution counterparts in the encoder.
There are many ways to implement Unet, and it can be based on any backbone architecture: [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), [VGGNet](https://arxiv.org/pdf/1409.1556), [ResNet](https://arxiv.org/abs/1512.03385), etc.
We will be using a trick from fast.ai and Deoldify by Jason Antic (Deoldify), Jeremy Howard (fast.ai), and Uri Manor (Salk Institute): [using an ImageNet-pretrained network to implement the Unet encoder part](https://www.fast.ai/2019/05/03/decrappify/). This saves a lot of training time compared to training from scratch.
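To make the skip-connection idea concrete, below is a minimal sketch of a single decoder step; the class and argument names are illustrative only and are not part of the required `UnetFromPretrained` interface.
``` python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlockSketch(nn.Module):
    """Illustrative decoder step: upsample, concatenate the encoder skip feature, convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Bring the decoder feature map to the resolution of its encoder counterpart
        x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear', align_corners=False)
        # Skip connection: concatenate along the channel dimension
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)
```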
Your task is to implement the class `UnetFromPretrained`.
Below are some code snippets that might help with the implementation.
%% Cell type:markdown id: tags:
### How to get pretrained model from [torchvision](https://pytorch.org/docs/stable/torchvision/models.html)
To get a pretrained model, pass `pretrained=True` to the model constructor.
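For example, a minimal sketch assuming a VGG16 backbone (any torchvision classification model can be loaded the same way):
``` python
import torchvision.models as models

# Download the ImageNet-pretrained weights together with the architecture
backbone = models.vgg16(pretrained=True)

# The convolutional part (the future Unet encoder) lives in `backbone.features`;
# printing it helps to decide where to cut it into same-resolution stages.
print(backbone.features)
```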
We recommend using [kornia.geometry.transform.resize](https://kornia.readthedocs.io/en/latest/geometry.transform.html) or [kornia.geometry.transform.rescale](https://kornia.readthedocs.io/en/latest/geometry.transform.html) for downscaling and upscaling, with the flag `antialias=True` enabled for downscaling. The difference between `resize` and `rescale` is just the input format: `resize` expects an explicit output size, whereas `rescale` expects an up- or downsampling factor.
Other ways of downsampling, e.g. max pooling, lose information for the backprop, and strided convolutions lead to artifacts. For a detailed reference on why anti-aliased downscaling is better, and for more advanced variants, see [The Devil is in the Decoder: Classification, Regression and GANs](https://arxiv.org/abs/1707.05847).
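A minimal usage sketch, assuming the kornia version pinned above:
``` python
import torch
import kornia

x = torch.rand(1, 3, 64, 64)  # NCHW image tensor

# Anti-aliased downscaling to an explicit target size ...
down = kornia.geometry.transform.resize(x, (32, 32), antialias=True)

# ... or by a scale factor
up = kornia.geometry.transform.rescale(down, 2.0)

print(down.shape, up.shape)  # torch.Size([1, 3, 32, 32]) torch.Size([1, 3, 64, 64])
```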
We also recommend checking this interactive paper/post about checkerboard artifacts in image generation:
[Deconvolution and Checkerboard Artifacts](https://distill.pub/2016/deconv-checkerboard/)
If you use normalization in the decoder, it is safer to use [GroupNorm](https://pytorch.org/docs/stable/nn.html#torch.nn.GroupNorm) (`nn.GroupNorm`) instead of `BatchNorm`, because of the small batch size.
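For instance, a hypothetical decoder convolution block with `GroupNorm` might look as follows (the number of groups is a free choice; it only has to divide the number of channels):
``` python
import torch.nn as nn

def conv_gn_relu(in_ch, out_ch, groups=8):
    """3x3 convolution + GroupNorm (independent of the batch size) + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(num_groups=groups, num_channels=out_ch),  # out_ch must be divisible by groups
        nn.ReLU(inplace=True),
    )
```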
**Don't hardcode the structure from the reference example: you need to write code that generates it AND uses the ImageNet-pretrained weights.**
As the loss we will use the so-called perceptual loss. Below is a figure from [Perceptual Losses for Real-Time Style Transfer and Super-Resolution](https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf) by Justin Johnson, Alexandre Alahi, and Li Fei-Fei, ECCV 2016, where it was first proposed.
It is a simple mean squared error (MSE) loss between the generated and target images, computed not at the pixel level but on a feature map of an ImageNet-pretrained network. Compared with the pixel-wise MSE loss, it tends to produce less blurry images; compared with the L1 loss, it better captures semantics and is more tolerant of small image misalignments.
It is also used as the first part of [NoGAN training](https://www.fast.ai/2019/05/03/decrappify/) by Jason Antic (Deoldify), Jeremy Howard (fast.ai), and Uri Manor (Salk Institute). A similar loss was also proposed in "[Learning Implicit Generative Models by Matching Perceptual Features](http://openaccess.thecvf.com/content_ICCV_2019/papers/dos_Santos_Learning_Implicit_Generative_Models_by_Matching_Perceptual_Features_ICCV_2019_paper.pdf)" by Cicero Nogueira dos Santos, Youssef Mroueh, Inkit Padhi, and Pierre Dognin, which inspired the current assignment.
Implement `class ContentLoss(nn.Module)`, which takes a pretrained architecture name and a `layer_id`, and creates the corresponding feature-extraction network during initialization. During `forward` it should pass the input image and the target image through this network and compute `F.mse_loss` between the resulting features. Hint: to save time and memory, you can use `with torch.no_grad():` when getting the features of the target (**but not the generated**) image.
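A rough illustration of the idea, not a complete `ContentLoss`; `feature_extractor` stands for whatever truncated pretrained network you build in `__init__`:
``` python
import torch
import torch.nn.functional as F

def perceptual_mse(feature_extractor, generated, target):
    """MSE between deep features instead of raw pixels."""
    feats_gen = feature_extractor(generated)   # gradients must flow back to the generator
    with torch.no_grad():                      # the target features need no gradients
        feats_tgt = feature_extractor(target)
    return F.mse_loss(feats_gen, feats_tgt)
```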
%% Cell type:code id: tags:
``` python
from cnn_training import weight_init
from colorization import UnetFromPretrained, ContentLoss