Model Optimization
Most deep learning models are made up of millions or billions of parameters. To run inference, we need to load all of these parameters into memory, which means big models cannot be deployed easily on edge/embedded devices. This blog focuses on optimizing a model through different techniques and analyzing their effect on inference time, model size, and accuracy.
Model optimization means transforming a machine/deep learning model so that it has:
- Faster inference time
- Reduced network size (memory efficient)
We’re going to cover different topics, so buckle up:
- Quantization: reducing the number of bits used for model parameters
- Network Pruning: removing unimportant or redundant weights and neurons from the network
- Low-Rank Matrix Factorization: factorizing model weights so that the overall number of parameters is reduced
- Knowledge Distillation: obtaining a smaller network by mimicking the predictions of a larger one
We’ll mainly focus on optimizing PyTorch models, but the gist will be the same for TensorFlow, Caffe, etc.
1. Quantization
Quantization means converting/reducing the precision of model parameters to a lower bit width, typically 8-bit integers, without retraining the model.
Remapping from float32 to int8:
Calculation
To map the floating-point values to integer values, we need to calculate two key parameters:
- Scale
Step size between quantized values.
$$scale = \frac{max_{float} - min_{float}}{max_{int} - min_{int}}$$
For int8, $$min_{int} = -127$$ $$max_{int} = 127$$
- Zero point
Integer value that corresponds to the floating-point value of zero.
$$zero_{point} = min_{int} - \frac{min_{float}}{scale}$$
Quantize the Float Values
Once the scale and zero point are determined, each floating-point value 𝑥 can be quantized to an integer value 𝑞 using:
$$q = \text{round} \left( \frac{x}{scale} + zero_{point} \right)$$
De-Quantize the Float Values
To convert back from quantized values 𝑞 to floating-point values 𝑥, we use:
$$x = scale \cdot (q - zero_{point})$$
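As a quick worked example of the formulas above, here is a minimal sketch (plain tensor ops, not a library API) that quantizes a float tensor to int8 and de-quantizes it back:

import torch

def quantize_tensor(x, num_bits=8):
    # derive the float range from the tensor itself
    min_float, max_float = x.min().item(), x.max().item()
    min_int, max_int = -(2 ** (num_bits - 1) - 1), 2 ** (num_bits - 1) - 1  # [-127, 127] for int8
    scale = (max_float - min_float) / (max_int - min_int)
    zero_point = round(min_int - min_float / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), min_int, max_int).to(torch.int8)
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    return scale * (q.float() - zero_point)

x = torch.randn(5)
q, scale, zp = quantize_tensor(x)
x_hat = dequantize_tensor(q, scale, zp)  # x_hat approximates x up to rounding error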
PyTorch Availability
1. Half Precision
Half precision refers to performing computations using 16-bit floating point numbers (half precision) instead of the standard 32-bit floating point numbers (single precision) to accelerate the inference process.
Using half precision, the memory requirements and computational cost of inference can be reduced, leading to faster inference times.
Blockers:
i. As of now, half precision is not fully supported on CPU: operations such as convolution layers, stft, etc. do not support float16 on the CPU.
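A minimal sketch of half-precision inference (assuming a CUDA GPU is available; the model and the input shape here are placeholders):

import torch

model = ...  # any trained PyTorch model
model = model.half().cuda()                    # cast parameters to float16
input_fp16 = torch.randn(1, 10).half().cuda()  # inputs must be cast to float16 as well

with torch.no_grad():
    output = model(input_fp16)                 # forward pass runs in half precision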
2. Post-Training Dynamic Quantization
In dynamic quantization, the model weights are quantized ahead of time, while the activations are quantized dynamically at runtime.
During inference, the activations are collected for each input and analyzed to determine their dynamic range. Once the dynamic range is known, quantization parameters such as scale and zero point are calculated to map the floating-point activations to lower precision.
Implementation
import torch
model = … #pytorch model
quantized_model = torch.ao.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
output = quantized_model(input) #infer on quantized model
- Supported Layers:
- nn.Linear
- nn.LSTM / nn.GRU / nn.RNN
- nn.Embedding
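To measure the effect on model size, one quick option (a sketch, reusing model and quantized_model from the snippet above) is to serialize each state_dict into an in-memory buffer and compare sizes:

import io
import torch

def model_size_mb(m):
    # serialize the state_dict into an in-memory buffer and measure its size
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"fp32 model: {model_size_mb(model):.2f} MB")
print(f"dynamically quantized model: {model_size_mb(quantized_model):.2f} MB")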
3. Post-Training Static Quantization
Scales and zero points for all the activation tensors are pre-computed using some representative unlabeled data.
Given a floating point Dense Neural Network (DNN), we just have to run the DNN on some representative unlabeled data and collect the distribution statistics (scale and zero point) for all the activation layers.
This quantization minimizes the model performance degradation by estimating the range of numbers that the model interacts with using a representative dataset.
Blockers:
- Conv packing expects numerical padding as an input.
- The “same” padding computation is not implemented internally in the quantized convolutional layer.
Implementation
- First, we need to insert torch.quantization.QuantStub and torch.quantization.DeQuantStub operations before and after the network for the necessary tensor conversions.
Consider a simple PyTorch model with a single linear layer, Linear(10, 20):
import torch

class OriginalModel(torch.nn.Module):
    def __init__(self):
        super(OriginalModel, self).__init__()
        # QuantStub converts the incoming floating point tensors into quantized tensors
        self.quant = torch.ao.quantization.QuantStub()
        self.linear = torch.nn.Linear(10, 20)
        # DeQuantStub converts the quantized tensors back into floating point tensors
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # the QuantStub and DeQuantStub ops mark the region of the model to be quantized
        x = self.quant(x)
        x = self.linear(x)  # original model
        x = self.dequant(x)
        return x
- Model is instantiated and trained
# model is instantiated and trained
model_fp32 = OriginalModel()

# Prepare the model for static quantization
model_fp32.eval()
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32)

# Determine the best quantization settings by calibrating the model on a representative dataset
calibration_dataset = torch.utils.data.Dataset...
model_fp32_prepared.eval()
for data, label in calibration_dataset:
    model_fp32_prepared(data)
If the target environment is a mobile device, we must pass ‘qnnpack’ to the get_default_qconfig function; for server inference, use ‘fbgemm’ or ‘x86’.
- Convert calibrated model to a quantized model
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)
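After conversion, inference runs on the int8 model just like before; the QuantStub/DeQuantStub pair handles the float-to-int8 conversion at the model boundary. A minimal usage sketch (the input shape matches the Linear(10, 20) example above):

input_fp32 = torch.randn(1, 10)   # float input
output = model_int8(input_fp32)   # the linear layer runs internally in int8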
2. Pruning
Pruning is a technique for identifying and removing unimportant or redundant parameters, such as weights and neurons, from the network. Removing unimportant parameters from a trained neural network helps reduce its size and computational complexity. Here we’ll be doing post-training pruning, which involves pruning an already trained model. In general, we have two types of pruning:
- Global Unstructured Pruning
- It involves removing individual model weight parameters based on their magnitude or importance.
- It prunes the entire network at once, considering the importance of each parameter across the entire model.
- Even if weights are pruned to zero in global unstructured pruning, the underlying matrix operations are the same as those of the dense original network, so the inference time stays basically the same. One major limitation of DL frameworks today is the lack of sparse operations that could take advantage of this kind of sparsity.
- Local Structured Pruning
- It involves removing entire structured groups of weights, such as channels or neurons, based on their importance or contribution to the model’s performance.
- It prunes at the level of individual neurons, connections, or weights within a layer of the neural network.
- This method guarantees that connections to a neuron in a linear layer or channels in a convolutional layer will be zeroed out. This means that if a channel or neuron is completely unimportant, it can be safely removed without affecting the model’s performance. So we’ll focus more on local structured pruning (a short usage sketch follows this list).
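Both variants are available through torch.nn.utils.prune. A minimal sketch (the layer sizes and the 30% pruning amount are arbitrary choices for illustration):

import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 2))

# Global unstructured pruning: zero out the 30% smallest-magnitude weights across both layers
parameters_to_prune = [(model[0], 'weight'), (model[2], 'weight')]
prune.global_unstructured(parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.3)

# Local structured pruning: zero out 30% of the output neurons (rows) of the first linear layer,
# ranked by the L2 norm of their weight rows
prune.ln_structured(model[0], name='weight', amount=0.3, n=2, dim=0)

# make the pruning permanent (drops the mask and the stored original weights)
prune.remove(model[0], 'weight')
prune.remove(model[2], 'weight')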
Linear Layer
Fig.: (a) Dense network, (b) pruned network (dotted lines represent pruned neurons and connections), (c) network in which the pruned neurons are actually removed from the architecture.
Consider a single linear layer:
linear_layer = nn.Linear(in_features=10, out_features=100)
The weight of this linear layer will have shape:
torch.Size([100, 10])
- This means that each of the 100 output neurons has a connection to each of the 10 input neurons, resulting in a weight matrix of 100 rows and 10 columns.
Now, after we apply local structured pruning, it is guaranteed that some neurons will be zeroed out, so we can physically remove the connections to those neurons. The steps are as follows (see the sketch after this list):
- Identify redundant neurons: determine which neurons (rows in the weight matrix) are redundant. These neurons will have weights and biases that can be set to zero.
- Remove connections and build a smaller linear layer: physically remove the connections associated with these redundant neurons, i.e., drop the corresponding rows of the weight matrix and the associated biases, then create a new, smaller linear layer that holds the reduced weights and biases.
- Preserve the output shape: to maintain the original output shape, zero padding is added back to the output tensor at the positions corresponding to the removed neurons. This ensures that the output shape of the pruned layer matches the original layer.
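A minimal sketch of these steps for a single linear layer (it assumes structured pruning has already zeroed out whole rows of the weight matrix, as in the prune.ln_structured example above):

import torch
import torch.nn as nn

linear_layer = nn.Linear(in_features=10, out_features=100)
# assume structured pruning has already zeroed out some output neurons (rows of the weight matrix)

with torch.no_grad():
    # 1. identify the rows (output neurons) that are entirely zero
    keep_mask = linear_layer.weight.abs().sum(dim=1) != 0
    keep_idx = keep_mask.nonzero(as_tuple=True)[0]

    # 2. build a smaller layer that keeps only the surviving neurons
    pruned_layer = nn.Linear(10, len(keep_idx))
    pruned_layer.weight.copy_(linear_layer.weight[keep_idx])
    pruned_layer.bias.copy_(linear_layer.bias[keep_idx])

    # 3. preserve the output shape by scattering the pruned output back to its original positions
    x = torch.randn(1, 10)
    out = torch.zeros(1, 100)
    out[:, keep_idx] = pruned_layer(x)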
What if we have subsequent linear layers? When dealing with subsequent layers, the input to these layers will also be affected by the pruning of the previous layer. The steps are as follows (see the sketch after this list):
- Identify pruned neurons in the previous layer: determine which neurons were pruned in the previous layer.
- Adjust the weight matrix of the next layer: modify the weight matrix of the subsequent linear layer to account for the removed input connections. Essentially, this means removing the corresponding columns of the weight matrix of the next layer.
- Create a reduced linear layer: similar to the previous layer, create a new linear layer with the reduced number of input features.
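Continuing the sketch above, a subsequent layer drops the input columns that correspond to the removed neurons (the Linear(100, 50) next layer here is hypothetical):

next_layer = nn.Linear(in_features=100, out_features=50)

with torch.no_grad():
    # keep only the input columns that correspond to the surviving neurons of the previous layer
    reduced_next = nn.Linear(len(keep_idx), 50)
    reduced_next.weight.copy_(next_layer.weight[:, keep_idx])
    reduced_next.bias.copy_(next_layer.bias)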
Convolution Layer
In the case of a convolution layer, the idea is the same: we identify the redundant channels, perform the convolution without those channels, and later insert zeroed channels back at the indices of the pruned channels.
3. Low Rank Matrix Factorization
Matrix factorization decomposes a matrix into smaller sub-matrices that serve as an approximation of the original matrix while having fewer parameters.
The Singular Value Decomposition (SVD) lets us decompose any matrix A with n rows and m columns as:
$$A_{n \times m} = U_{n \times r} \, \Sigma_{r \times r} \, V_{r \times m}^T$$
where $\Sigma$ is a diagonal matrix holding the r singular values. If we keep only the k largest singular values and set the others to zero, we get a rank-k approximation of A.
Let’s see how matrix factorization helps with the model compression or faster inference.
Example:
Consider the weight matrix W between two layers with n = 1000 and m = 500, and rank k = 50.
Without low-rank matrix factorization:
Number of parameters in W = m × n = 500,000 parameters.
With low-rank matrix factorization:
W is decomposed into lower-rank matrices U and V.
Number of parameters in U = m × k = 500 × 50 = 25,000 parameters.
Number of parameters in V = k × n = 50 × 1000 = 50,000 parameters.
Total number of parameters = 25,000 + 50,000 = 75,000 parameters.
Low-rank approximation decreases the parameter count from nm to k(n + m).
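A quick sketch of this rank-k truncation with torch.linalg.svd, using the numbers from the example above (the random matrix is just a stand-in for a real weight matrix):

import torch

n, m, k = 1000, 500, 50
W = torch.randn(n, m)

# thin SVD: W = U @ diag(S) @ Vh
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# keep only the k largest singular values
U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]
W_approx = U_k @ torch.diag(S_k) @ Vh_k   # rank-k approximation of W

print(W.numel())                   # 500,000 parameters
print(U_k.numel() + Vh_k.numel())  # 75,000 parameters (S_k can be folded into one of the factors)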
Problem
We are given a matrix W and want to find a low rank matrix that produces the lowest approximation error to W.
- We need to find a rank for each layer. If we fix the rank k ahead of time for each layer, we can write the weight matrix at each layer as U_k V_k^T and optimize the network as usual over all U_k, V_k matrices.
Now, let’s explore how we can apply this low-rank decomposition to fully connected layers and convolution layers.
Post Training Weight/Kernel Decomposition
1. SVD on a Fully Connected Layer
A fully connected layer essentially does matrix multiplication of its input by a weight matrix, and then adds a bias b: Wx + b
We can take the SVD of the weight matrix and keep only the first k singular values.
Instead of a single fully connected layer, we implement it as two smaller fully connected layers:
- The first one will have a shape of m × k, with no bias, and its weights will be taken from S_k V_k^T.
- The second one will have a shape of k × n, with bias equal to b, and its weights will be taken from U_k.
The total number of weights drops from n × m to k(n + m).
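A minimal sketch of replacing a trained nn.Linear(m, n) with two smaller layers built from its truncated SVD (the layer sizes and k are illustrative):

import torch
import torch.nn as nn

m, n, k = 500, 1000, 50
fc = nn.Linear(m, n)                      # original layer; weight shape is (n, m)

with torch.no_grad():
    U, S, Vh = torch.linalg.svd(fc.weight, full_matrices=False)
    U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]

    first = nn.Linear(m, k, bias=False)   # weights S_k Vh_k, shape (k, m)
    first.weight.copy_(torch.diag(S_k) @ Vh_k)

    second = nn.Linear(k, n)              # weights U_k, shape (n, k); keeps the original bias
    second.weight.copy_(U_k)
    second.bias.copy_(fc.bias)

factorized = nn.Sequential(first, second)

x = torch.randn(1, m)
print((fc(x) - factorized(x)).abs().max())  # approximation error introduced by the rank-k truncation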
2. Tensor Decomposition on Convolution Layers
The idea is similar to the above: we take the weights of the convolution layer, apply a decomposition to the weight tensor, and represent the decomposed weights with smaller convolution layers that perform depth-wise and point-wise operations.
There are different algorithms for this:
- CP Decomposition
- Tucker Decomposition
4. Knowledge Distillation
It is a technique where we train a smaller, more efficient student model to mimic the behavior of a larger, more complex teacher model. The central idea is to transfer the ‘knowledge’ embedded within the cumbersome teacher model to the compact student model. Because of the student’s simpler architecture, KD enables faster inference and deployment on devices with limited resources.
Components of Knowledge Distillation
1. Knowledge
The valuable information embedded within a complex, well-trained deep learning model (the teacher) that we aim to transfer to a smaller model (the student). Knowledge can be categorized into:
- Response-Based Knowledge: information derived from the final output predictions of the teacher model. This typically involves soft probabilities (in classification tasks) or continuous values (in regression tasks).
- Feature-Based Knowledge: knowledge captured in the intermediate feature representations (outputs of hidden layers) of the teacher network. This encodes higher-level abstractions learned by the teacher.
- Relationship-Based Knowledge: knowledge about how different data samples interact with each other, or how different parts of the model interact.
2. Distillation
The mechanism by which knowledge is transferred from the teacher to the student. This involves:
- Loss Functions: loss functions tailored to the type of knowledge being transferred. Common choices include:
  - Kullback-Leibler (KL) divergence for response-based knowledge (often used with probability distributions).
  - Mean Squared Error (MSE) for response-based (continuous outputs) and feature-based knowledge.
  - Perceptual losses and Gram matrix losses for feature-based knowledge.
- Training Scheme:
  - Offline Distillation:
    - Teacher training: utilize a larger, pre-trained model.
    - Student training:
      - Feed data to both the teacher and the student models.
      - Calculate the distillation loss based on the difference between the teacher’s soft predictions and the student’s predictions.
      - Calculate the objective loss function based on the student’s predictions and the ground truth.
      - Combine the distillation loss and the objective loss to form the overall training loss (a minimal training-loop sketch follows this list).
  - Online Distillation: the teacher and the student are trained simultaneously, so knowledge is transferred during training itself.
  - Self-Distillation: the same model serves as both teacher and student (after some initial training).
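A minimal sketch of offline, response-based distillation with a KL-divergence loss (the teacher, student, dataloader, temperature T, and weighting alpha are all placeholders to be chosen for the task at hand):

import torch
import torch.nn.functional as F

teacher = ...     # large pre-trained teacher model
student = ...     # smaller student model to be trained
dataloader = ...  # labelled training data

T = 4.0           # temperature used to soften both distributions
alpha = 0.7       # weight of the distillation loss vs. the ground-truth loss

teacher.eval()
student.train()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for data, labels in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(data)            # soft targets from the frozen teacher
    student_logits = student(data)

    # distillation loss: KL divergence between the softened distributions
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)                                   # rescale so gradients stay comparable across temperatures

    # objective loss: ordinary cross-entropy against the ground truth
    objective_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * distill_loss + (1 - alpha) * objective_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()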
3. Teacher-Student Architecture
The relationship between the complexity and capacity of the teacher and the student models:
- Teacher Model: usually a large, pre-trained model with superior performance. This model could be a complex ensemble or a model trained with extensive compute resources.
- Student Model: a smaller, more computationally efficient model. It is designed to be deployable in scenarios with limited resources (e.g., mobile devices, embedded systems) while still maintaining good performance.