Efficient deep models via network binarization/quantization

30 Jul 2020
19:00-19:50
Main

Deep neural networks trained on large datasets have advanced the state of the art for a wide variety of tasks. However, under low-memory and limited-compute constraints, accuracy on the same problems drops considerably. With the growing demand for on-device AI, a significant body of recent work has focused on speeding up such models while retaining their original accuracy. A particularly hardware-friendly direction is network quantization, which represents the weights and/or the activations with fewer bits. In this presentation we will explore a series of model quantization approaches that tackle this problem from both a methodological and an architectural (i.e., how to design “quantizable” models) point of view, presenting along the way some of the most recent advances in the field. In the second part of the presentation, we will focus on the extreme case of quantization, i.e. binarization, where all values are limited to two states only (0 or 1). This extreme form of low-bit quantization in turn allows all the multiplications inside a given convolutional layer to be replaced with bitwise operations (illustrated in the sketches below). Finally, we will show that the presented approaches partially bridge the accuracy gap while offering significant speed gains.
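
To make the "fewer bits" idea concrete, here is a minimal NumPy sketch of uniform affine quantization. The function names and the asymmetric min/max scheme are illustrative assumptions for exposition, not the specific methods covered in the talk.

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Map a float tensor to num_bits-wide integers (asymmetric/affine scheme)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = float(x.max() - x.min()) / (qmax - qmin)
    if scale == 0.0:  # constant tensor: avoid division by zero
        scale = 1.0
    zero_point = int(np.clip(round(qmin - float(x.min()) / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized representation."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_uniform(x, num_bits=4)
print(np.abs(x - dequantize(q, s, z)).max())  # error shrinks as num_bits grows
```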
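
For the binary case, multiplications can be emulated with XNOR and popcount. The sketch below assumes the common encoding in which the stored bits {0, 1} stand for the values {-1, +1}, so the dot product of two binarized vectors of length n equals 2·popcount(XNOR(a, b)) − n; this is a standard illustration of the trick, not the exact kernel presented in the talk.

```python
import numpy as np

def binarize(x):
    # 1 bit per value: bit 1 encodes +1, bit 0 encodes -1 (sign binarization).
    return (x >= 0).astype(np.uint8)

def binary_dot(a_bits, b_bits):
    # XNOR marks positions where the underlying +/-1 values agree, so
    # dot = (#agreements) - (#disagreements) = 2 * popcount(XNOR) - n.
    n = a_bits.size
    agreements = int((~(a_bits ^ b_bits) & 1).sum())
    return 2 * agreements - n

a, b = np.random.randn(256), np.random.randn(256)
exact = float(np.sign(a) @ np.sign(b))              # dot product of +/-1 vectors
print(exact, binary_dot(binarize(a), binarize(b)))  # the two values match
```

In a real kernel the bits would be packed 32 or 64 per machine word and the popcount done by a single hardware instruction per word, which is where the speed gains over floating-point multiply-accumulate come from.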