Implementing Neural Style Transfer Using TensorFlow 2.0

In this tutorial, you'll learn how to implement powerful applications like Prisma using TensorFlow 2.0.

Jun 21, 2019 · 11 min read

Convolutional neural networks (CNNs) are among the most powerful neural networks for image classification and analysis, and they are a critical element of many of the deep learning enabled applications that we see today. At a very high level, CNNs can learn internal feature-level representations of the images that are fed to them. This is what makes them so powerful. It turns out that this salient feature of CNNs is not only useful for tasks like image classification, but also for image construction. Applications like Deep Dream and Neural Style Transfer compose images based on layer activations within CNNs and their extracted features.

In this tutorial, you will be studying how Neural Style Transfer works and how it can be implemented using TensorFlow 2.0.

Note that in order to follow along with this tutorial, you need to know how CNNs work. If you are looking for resources to have a quick refresher on CNNs, give the following ones a try:

DataCamp's Convolutional Neural Networks for Image Processing course
CS231n: Convolutional Neural Networks for Visual Recognition

What is Neural Style Transfer?

For new entrants in the computer vision and deep learning field, the term neural style transfer can be a bit overwhelming. To understand each and every component of the term, consider the following two images:

In the context of neural style transfer, the left image is referred to as the content image and the image on the right side is referred to as the style image. You're interested in stylizing one image (the left one in this case) using another image (the right one). This is what constructs the last two words in the term - style transfer. To carry out the process, a neural network (CNN) is used to extract features, and a custom loss function built on those features is optimized, hence the first word - neural. When the above two images are fused using neural style transfer the final output looks like so (right one) -

You may ask what neural style transfer is useful for. Imagine you had an image of a drawing originally made by Vincent van Gogh, and you want to see how its style would look when applied to one of your own drawings. This is one of the applications where neural style transfer finds its use. Another example is photo filter applications like Prisma, which let you perform neural style transfer through a smooth user interface.

Isolation Between Content and Style

So far, the idea that you might have gotten about neural style transfer is that the process is all about combining the content of one image with the style of another. In fact, this is 100% true. In this section, let's understand what is actually meant by content and style in the context of CNNs.

A CNN is often a collection of several convolutional layers and pooling layers. Convolutional layers are responsible for extracting highly complex features from a given image while the pooling layers discard detailed spatial information that is not relevant for an image classification problem. The effect of this is that the CNN learns the content of a given image rather than specifics such as color and texture. As we go deeper into a CNN, the complexity of the features increases, and the outputs of the deeper convolutional layers are often referred to as content representations.

An example of style would be anything specific to image properties like texture, color, and so on - like a prominent use of a particular color within a drawing. Now the question is, how do you extract the style of an image? This is done by calculating correlations between the feature maps of a convolutional layer. Correlations give us a measure of how similar or related two or more variables are. To understand this, consider a learned convolutional layer which is composed of several feature maps. For each feature map, you can measure how strongly its detected features relate to the other feature maps in the same layer. This gives you an estimate of things like - is a certain color detected in the first feature map similar to a color in another map? Are there any shapes that are common across the feature maps? These traits and similarities define the style of an image. Measuring the similarity between the feature maps within a convolutional layer, and repeating this across several layers, yields a multi-scale representation of the given image that focuses on features like texture and color.

One vital point to keep in mind is that, while applying neural style transfer, you also need to ensure that the content of an image is preserved along with the desired style of another image. You will see how this is done in a later section. In the next section, you will learn how to extract the features from an image (the content image) and calculate the content loss.

Quantifying the Content Image and Calculating the Content Loss

The neural style transfer algorithm was first introduced by Gatys et al. in their 2015 paper, A Neural Algorithm of Artistic Style. This tutorial, however, takes reference from Image Style Transfer Using Convolutional Neural Networks, which is a follow-up to that paper.

The paper Image Style Transfer Using Convolutional Neural Networks employs a VGG-19 CNN architecture for extracting both the content and style features from the content and style images respectively. To get the content features, the second convolutional layer from the fourth block (of convolutional layers) is used. For convenience, the authors of the paper named it conv4_2. Once you get the content features, you will have to compare them to those of a target image to measure the content loss. What is a target image? Why is calculating the content loss required here? Let's take a step back and focus on these two questions.

To combine the content and style features into a single image, you will need to start with a target image, which is either a blank image or a copy of the content image. Now, to capture both the content and style features effectively using a CNN, you will need a custom loss function which you will optimize to get a smooth stylistic image constructed from the content and the style images. This custom loss function is essentially an amalgamation of two different losses:

Content loss, which makes sure that the net amount of content is preserved.
Style loss, which takes care of the amount of style getting transferred to the target image.

Now that you have a sense of the above questions let's return to content loss and define it. The content loss is defined as follows -
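In the notation used in this section, the squared-error content loss from Gatys et al. can be written as follows. The factor of 1/2 follows the paper; treat the exact scaling as an assumption, since the code later in this tutorial uses a mean of squared differences instead:

```latex
L_{content} = \frac{1}{2} \sum \left( T_c - C_c \right)^2
```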

where $T_c$ refers to the content representation of the target image and $C_c$ refers to the content representation of the content image. In other words, this loss function gives you a measure of how far apart the features of the content and target images are. The optimization aims to minimize this loss. In the next section, you will see how the style loss is determined.

The Style Loss and the Gram Matrix

For determining the style loss, the paper instructs you to take the representations (numbers essentially) from the following layers and obtain the gram matrices of the feature maps within these layers.

'conv1_1'
'conv2_1'
'conv3_1'
'conv4_1'
'conv5_1'

The advantage of using numerous layers for defining the style loss is that it allows for learning a multi-scale representation of the style contained in an image. The gram matrices help you determine how similar the features are across the feature maps within a single convolutional layer. This captures non-localized information about the image, which is what shapes its style. A gram matrix in this context is calculated in the following way.

Suppose the output of a convolutional layer has dimensions 8x8x16, meaning it contains 16 feature maps of size 8x8. You are now interested in finding the similarities between the features across these feature maps using a gram matrix. For that, you flatten the two spatial dimensions of each feature map into a 1D vector with 64 entries. Repeating this for every feature map gives a matrix of dimension 16x64 (say matrix A). Multiplying A by its transpose (which is 64x16) yields a gram matrix of dimension 16x16. Each entry (i, j) of the gram matrix measures the similarity between feature maps i and j. Pictorially this process resembles the following:
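The same steps can be traced in a few lines of plain Python. This is a minimal sketch on a made-up 2x2x3 feature volume (rather than 8x8x16, to keep the numbers readable); the name gram_matrix_py and the values are illustrative assumptions and are not part of the TensorFlow code used later:

```python
# Toy feature volume of shape (height=2, width=2, channels=3),
# stored as nested lists: features[row][col][channel].
features = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]],
    [[1.0, 1.0, 0.0], [2.0, 0.0, 1.0]],
]


def gram_matrix_py(volume):
    """Return the channels x channels Gram matrix of an (h, w, c) volume."""
    # Flatten the two spatial dimensions: one entry of length c per location.
    locations = [pixel for row in volume for pixel in row]
    channels = len(locations[0])
    # Matrix A from the text: one flattened feature map per row (c x h*w).
    a = [[loc[k] for loc in locations] for k in range(channels)]
    # G = A . A^T, so G[i][j] is the dot product of feature maps i and j.
    return [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(channels)]
            for i in range(channels)]


gram = gram_matrix_py(features)
for row in gram:
    print(row)
```

For this toy volume the result is a symmetric 3x3 matrix with one row and one column per feature map, just as the 8x8x16 case yields a 16x16 matrix.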

To calculate the style loss between the target and the style images, you take the mean squared distances between their gram matrices at each of the convolutional layers' blocks. Formally, this can be defined as the following:
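Using the symbols defined in this section, one way to write this weighted squared distance is shown here. It is a sketch consistent with those definitions, and the exact normalization is an assumption:

```latex
L_{style} = a \sum_{i} w_i \left( T_{s,i} - S_{s,i} \right)^2
```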

where $T_{s,i}$ is the gram matrix of the target image calculated at block $i$ and $S_{s,i}$ is the gram matrix of the style image calculated at block $i$. With $w_i$, you can provide custom weights to the different convolution blocks to attain a detailed representation of the style. Finally, $a$ is a constant that accounts for the values in each layer within the blocks. Let's put the two losses together to define the total loss which is optimized in the process of neural style transfer.

Defining the Total Loss

The total loss is now just a matter of addition with custom weights -
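Written out with the content and style losses from the previous sections (the names $L_{content}$, $L_{style}$ and $L_{total}$ are labels chosen here for illustration), the weighted sum is:

```latex
L_{total} = \alpha \, L_{content} + \beta \, L_{style}
```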

where $\alpha$ denotes the content weight and $\beta$ denotes the style weight. These help you maintain a good balance between the content and the amount of style getting transferred to the target image. The effect of different combinations of $\alpha$ and $\beta$ can be seen in the paper, and you are encouraged to check that out. You now have all the pieces needed to implement neural style transfer. Let's go head-on with it in the next section.

Neural Style Transfer in Action

For the implementation part, you will be using TensorFlow 2.0. It has a lot of extraordinary additions and is one of the most comprehensive updates to the library to date. If you are interested in learning about a few of these, you can check out this article.

The latest version of TensorFlow (at the time of writing this tutorial) is 2.0.0-beta0. If you do not have that installed yet, please get it installed first by following the instructions as specified here. You will start by importing the necessary packages and the content and the style images.

# Packages
import tensorflow as tf
from tensorflow.keras.applications.vgg19 import preprocess_input
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(7)

%matplotlib inline

print(tf.__version__)

2.0.0-beta0

# Load the content and style images
content = plt.imread('Content.jpeg')
style = plt.imread('Style.jpg')

# Display the images
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# Content and style images side-by-side
ax1.imshow(content)
ax1.set_title('Content Image')
ax2.imshow(style)
ax2.set_title('Style Image')
plt.show()

In the next step, you will write a helper function which loads an image as an array of numbers, scales its pixel values to floats in the range [0, 1], resizes it to 400x400, and adds a batch dimension to make it compatible with the model.

def load_image(image):
  image = plt.imread(image)
  img = tf.image.convert_image_dtype(image, tf.float32)
  img = tf.image.resize(img, [400, 400])
  # Shape -> (batch_size, h, w, d)
  img = img[tf.newaxis, :]
  return img

# Use load_image of content and style images
content = load_image('Content.jpeg')
style = load_image('Style.jpg')

# Verify the shapes
content.shape, style.shape

(TensorShape([1, 400, 400, 3]), TensorShape([1, 400, 400, 3]))

The images are reshaped. Now, you will load a pre-trained VGG19 model for extracting the features. As you will be using the model for extracting features, you will not need the classifier part of the model.

vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
vgg.trainable = False

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
80142336/80134624 [==============================] - 5s 0us/step

The authors have specified the names of the layers they used for extracting the content and style features. The VGG19 model you just loaded follows a similar naming convention. To take advantage of that, you will first print the names of all the layers present in the network.

# Print the layer names for convenience
for layer in vgg.layers:
  print(layer.name)

input_1
block1_conv1
block1_conv2
block1_pool
block2_conv1
block2_conv2
block2_pool
block3_conv1
block3_conv2
block3_conv3
block3_conv4
block3_pool
block4_conv1
block4_conv2
block4_conv3
block4_conv4
block4_pool
block5_conv1
block5_conv2
block5_conv3
block5_conv4
block5_pool

For the style features, however, you are only interested in the following layers (in the paper's naming, where conv1_1 corresponds to block1_conv1 in the list above, and so on):

'conv1_1'
'conv2_1'
'conv3_1'
'conv4_1'
'conv5_1'

For content features, you will need conv4_2 (block4_conv2 in the list above). You will store these in variables accordingly.

# Content layer
content_layers = ['block4_conv2']

# Style layer
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']


num_content_layers = len(content_layers)
num_style_layers = len(style_layers)
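The paper's names map mechanically onto the Keras layer names printed above. As a small illustration, the hypothetical helper below (paper_to_keras_name is not part of Keras or of this tutorial's code) performs that translation, assuming names of the form convB_N:

```python
import re


def paper_to_keras_name(paper_name):
    """Map a paper-style name such as 'conv4_2' to Keras' 'block4_conv2'."""
    match = re.fullmatch(r"conv(\d+)_(\d+)", paper_name)
    if match is None:
        raise ValueError("Unexpected layer name: {}".format(paper_name))
    block, index = match.groups()
    return "block{}_conv{}".format(block, index)


paper_style_layers = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
print([paper_to_keras_name(name) for name in paper_style_layers])
print(paper_to_keras_name("conv4_2"))
```

Writing the lists out by hand, as done above, is just as good for six layers; the helper only makes the correspondence explicit.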

Now create a custom VGG model which will be composed of the specified layers. This will help you run forward passes on the images and extract the necessary features along the way.

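def mini_model(layer_names, model):
  # Collect the outputs of the requested layers
  outputs = [model.get_layer(name).output for name in layer_names]
  # Build a model that maps the base model's input to those outputs
  model = Model([model.input], outputs)
  return model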

Defining a gram matrix is super easy in TensorFlow, and you can do it in the following way:

# Gram matrix
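def gram_matrix(tensor):
  # Drop the batch dimension -> (h, w, channels)
  temp = tf.squeeze(tensor, axis=0)
  # Flatten the spatial dimensions -> (h*w, channels)
  flat = tf.reshape(temp, [-1, temp.shape[-1]])
  # Inner products between feature maps -> (channels, channels)
  result = tf.matmul(flat, flat, transpose_a=True)
  # Restore the batch dimension -> (1, channels, channels)
  gram = tf.expand_dims(result, axis=0)
  return gram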

You will now define a custom model using the mini_model() function. This will be used for returning the content and style features from the respective images.

class Custom_Style_Model(tf.keras.models.Model):
  def __init__(self, style_layers, content_layers):
    super(Custom_Style_Model, self).__init__()
    self.vgg =  mini_model(style_layers + content_layers, vgg)
    self.style_layers = style_layers
    self.content_layers = content_layers
    self.num_style_layers = len(style_layers)
    self.vgg.trainable = False

  def call(self, inputs):
    # Scale back the pixel values
    inputs = inputs*255.0
    # Preprocess them with respect to VGG19 stats
    preprocessed_input = preprocess_input(inputs)
    # Pass through the mini network
    outputs = self.vgg(preprocessed_input)
    # Segregate the style and content representations
    style_outputs, content_outputs = (outputs[:self.num_style_layers],
                                      outputs[self.num_style_layers:])

    # Calculate the gram matrix for each layer
    style_outputs = [gram_matrix(style_output)
                     for style_output in style_outputs]

    # Assign the content representation and gram matrix in
    # a layer by layer fashion in dicts
    content_dict = {content_name:value
                    for content_name, value
                    in zip(self.content_layers, content_outputs)}

    style_dict = {style_name:value
                  for style_name, value
                  in zip(self.style_layers, style_outputs)}

    return {'content':content_dict, 'style':style_dict}

Now that you have defined the custom model let's use it on the images to get the content and style features accordingly -

# Note that the content and style images are loaded in
# content and style variables respectively
extractor = Custom_Style_Model(style_layers, content_layers)
style_targets = extractor(style)['style']
content_targets = extractor(content)['content']


In the paper, optimization was done using the L-BFGS algorithm, but you can use Adam also.

opt = tf.optimizers.Adam(learning_rate=0.02)

Let's now define the overall content and style weights and also the weights for each of the style representations as discussed earlier. Note that these are hyperparameters and are something you should play with.

# Custom weights for style and content updates
style_weight=100
content_weight=10

# Custom weights for different style layers
style_weights = {'block1_conv1': 1.,
                 'block2_conv1': 0.8,
                 'block3_conv1': 0.5,
                 'block4_conv1': 0.3,
                 'block5_conv1': 0.1}

Now comes the most crucial part, which makes the process of neural style transfer a lot more fun - the loss function.

# The loss function to optimize
def total_loss(outputs):
    style_outputs = outputs['style']
    content_outputs = outputs['content']
    style_loss = tf.add_n([style_weights[name]*tf.reduce_mean((style_outputs[name]-style_targets[name])**2)
                           for name in style_outputs.keys()])
    # Normalize
    style_loss *= style_weight / num_style_layers

    content_loss = tf.add_n([tf.reduce_mean((content_outputs[name]-content_targets[name])**2)
                             for name in content_outputs.keys()])
    # Normalize
    content_loss *= content_weight / num_content_layers
    loss = style_loss + content_loss
    return loss
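To see how the weights and the per-layer normalization interact, the same arithmetic can be run on made-up numbers in plain Python. This is only a sketch: toy_total_loss, the layer names s1, s2 and c1, and all values are illustrative assumptions standing in for flattened gram matrices and feature maps.

```python
def mean_squared_difference(a, b):
    """Mean of squared element-wise differences between two flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_total_loss(outputs, targets, layer_weights, style_weight, content_weight):
    # Weighted sum over style layers, scaled by style_weight / number of layers.
    style_loss = sum(
        layer_weights[name]
        * mean_squared_difference(outputs["style"][name], targets["style"][name])
        for name in outputs["style"])
    style_loss *= style_weight / len(outputs["style"])
    # Content layers are combined the same way, without per-layer weights.
    content_loss = sum(
        mean_squared_difference(outputs["content"][name], targets["content"][name])
        for name in outputs["content"])
    content_loss *= content_weight / len(outputs["content"])
    return style_loss + content_loss


# Made-up numbers standing in for flattened gram matrices and feature maps.
outputs = {"style": {"s1": [1.0, 3.0], "s2": [2.0, 2.0]},
           "content": {"c1": [0.0, 4.0]}}
targets = {"style": {"s1": [1.0, 1.0], "s2": [0.0, 2.0]},
           "content": {"c1": [0.0, 2.0]}}
layer_weights = {"s1": 1.0, "s2": 0.5}

loss = toy_total_loss(outputs, targets, layer_weights,
                      style_weight=100, content_weight=10)
print(loss)
```

With these numbers the style term contributes 150 and the content term 20, which shows how strongly a style weight of 100 against a content weight of 10 tilts the objective toward style.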

You will now write another function which will:

Calculate the gradients of the loss function you just defined.
Use these gradients to update the target image.

With GradientTape, you can take advantage of automatic differentiation, which can calculate the gradients of a function based on its composition. You will also use the tf.function decorator to speed up the operations. Read more about it here.

@tf.function()
def train_step(image):
  with tf.GradientTape() as tape:
    # Extract the features
    outputs = extractor(image)
    # Calculate the loss
    loss = total_loss(outputs)
  # Determine the gradients of the loss function w.r.t the image pixels
  grad = tape.gradient(loss, image)
  # Update the pixels
  opt.apply_gradients([(grad, image)])
  # Clip the pixel values that fall outside the range of [0,1]
  image.assign(tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0))
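The update-then-clip pattern in train_step can be illustrated without TensorFlow. The sketch below is a simplification under stated assumptions: it uses plain gradient descent rather than Adam, a hand-derived gradient rather than GradientTape, and a made-up squared-error loss on three "pixels"; toy_train_step and clip are hypothetical names.

```python
def clip(value, low=0.0, high=1.0):
    """Clamp a single value to the closed interval [low, high]."""
    return max(low, min(high, value))


def toy_train_step(pixels, target, learning_rate):
    """One gradient-descent step on sum((p - t)^2), followed by clipping."""
    # d/dp (p - t)^2 = 2 * (p - t); GradientTape would derive this for you.
    grads = [2.0 * (p - t) for p, t in zip(pixels, target)]
    updated = [p - learning_rate * g for p, g in zip(pixels, grads)]
    # Keep every pixel inside the valid [0, 1] range.
    return [clip(p) for p in updated]


pixels = [0.2, 0.9, 0.5]
target = [0.0, 1.5, 0.5]  # 1.5 is deliberately outside the valid range
for _ in range(50):
    pixels = toy_train_step(pixels, target, learning_rate=0.1)
print(pixels)
```

The second pixel is pulled toward 1.5 by the loss but is held at 1.0 by the clipping step, which is the role tf.clip_by_value plays for the target image.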

The only step remaining before you can start the optimization is defining the target image. For the target image, you will start from a copy of the content image.

target_image = tf.Variable(content)

You are now absolutely ready to train the network.

epochs = 10
steps_per_epoch = 100

step = 0
for n in range(epochs):
  for m in range(steps_per_epoch):
    step += 1
    train_step(target_image)
  plt.imshow(np.squeeze(target_image.read_value(), 0))
  plt.title("Train step: {}".format(step))
  plt.show()

WARNING: Logging before flag parsing goes to stderr.
W0617 16:21:34.491543 140709216896896 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

You are advised to play with the content and style weights to observe the changes in the target image.

Concluding Remarks

The paper referenced in this tutorial does not discuss further improving the quality of the constructed image. As you probably observed in the above results, the constructed image would benefit from smoother spatial transitions between content and style. To address this, total_variational_loss was introduced, which is analogous to using regularization. At a high level, total_variational_loss penalizes the high-frequency artifacts introduced by the original neural style transfer algorithm. Check out this tutorial if you are interested in implementing it.
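As a rough illustration of the total variation idea mentioned above, the sketch below sums the absolute differences between neighbouring pixels of a small grayscale grid. It is an assumption-laden toy (a single channel, nested lists, a hypothetical total_variation helper), not the implementation from any particular tutorial; for real image tensors, TensorFlow offers tf.image.total_variation.

```python
def total_variation(image):
    """Sum of absolute differences between neighbouring pixels of a 2D grid."""
    height, width = len(image), len(image[0])
    # Differences between each pixel and the one directly below it.
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(height - 1) for c in range(width))
    # Differences between each pixel and the one directly to its right.
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(height) for c in range(width - 1))
    return vertical + horizontal


smooth = [[0.5, 0.5, 0.5],
          [0.5, 0.5, 0.5]]
noisy = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0]]

print(total_variation(smooth))
print(total_variation(noisy))
```

The checkerboard-like grid scores far higher than the flat one, so adding this quantity to the total loss with a small weight would push the optimization toward smoother images.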
