Deep Learning Hunts for Signals Among the Noise

Over the past decade, advances in deep learning have transformed the fortunes of the artificial intelligence (AI) community. The neural network approach that researchers had largely written off by the end of the 1990s now seems likely to become the most widespread technology in machine learning. However, its proponents find it difficult to explain why deep learning often works well yet is prone to seemingly bizarre failures.

The success of deep learning followed rapid improvements in computational power, delivered by highly parallelized microprocessors, and the discovery of ways to train networks with enormous numbers of virtual neurons assembled into tens of linked layers. Before these advances, neural networks were limited to simple structures that were easily outclassed in image and audio classification tasks by other machine-learning architectures such as support vector machines.

Theorists have long assumed networks with hundreds of thousands of neurons and orders of magnitude more individually weighted connections between them should suffer from a fundamental problem: over-parameterization. There are so many weights that determine how much each neuron influences its neighbors that the network could simply find a way to encode the data used to train it. It would then correctly classify anything in the training set, but fail miserably when presented with new data.
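The memorization worry can be illustrated with a toy sketch. It is not a neural network: as a stand-in, it assumes a polynomial with as many coefficients as there are training points, fitted by Lagrange interpolation. The labeling rule, the 25% label noise, and all names are hypothetical choices made for illustration only.

```python
import random

def lagrange_predict(train_xs, train_ys, x):
    """Evaluate at x the unique polynomial passing through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_xs, train_ys)):
        term = yi
        for j, xj in enumerate(train_xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def sign(v):
    return 1 if v >= 0 else -1

def accuracy(train_xs, train_ys, xs, ys):
    """Fraction of (xs, ys) whose label matches the sign of the fitted polynomial."""
    hits = sum(sign(lagrange_predict(train_xs, train_ys, x)) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)

def true_label(x):
    # Hypothetical ground-truth rule, assumed for this sketch.
    return 1 if x > 0.5 else -1

rng = random.Random(0)
n = 12  # as many polynomial coefficients as training points

# Distinct training inputs, one per sub-interval of [0, 1).
train_xs = [(i + 0.5 * rng.random()) / n for i in range(n)]
# Roughly a quarter of the training labels are flipped to mimic noise.
train_ys = [true_label(x) * (-1 if rng.random() < 0.25 else 1) for x in train_xs]

test_xs = [rng.random() for _ in range(200)]
test_ys = [true_label(x) for x in test_xs]

train_acc = accuracy(train_xs, train_ys, train_xs, train_ys)
test_acc = accuracy(train_xs, train_ys, test_xs, test_ys)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Because the polynomial passes through every training point, it reproduces the training labels exactly, noise included. Nothing in the fit constrains its behavior between those points, which is the failure mode theorists expected of very large networks.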

In practice, deep neural networks do not easily succumb to the overfitting that over-parameterization implies; instead, they are surprisingly good at dealing with new data. When trained, they seem able to ignore parts of the training images that have little bearing on classification performance, rather than trying to build synaptic connections to deal with them.

Stefano Soatto, professor of computer science at the University of California, Los Angeles (UCLA), explains, “Most of the variability in images is irrelevant to the task. For instance, if you want to recognize a friend in a picture, you want to do so regardless of where she will be, how she will be dressed, whether she is partially occluded, what sensor will be used for the picture, etc. If you think of all the possible images of your friend, they are, for all practical purposes, infinite. So if you wanted a minimal representation—something that distills the essence of ‘your friend’ in every possible future image of her—that should be a much, much smaller object than an image.”

Unfortunately, networks can home in on details that are very different to those used by humans. This leads to sometimes intriguing failures. Researchers at Kyushu University in Japan discovered late last year that modification of just one individual pixel in an image could upset neural networks trained to classify objects and animals; a taxi might suddenly be misidentified as a dog with such a tiny change.
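A toy sketch can show how a single pixel might flip a decision. It makes strong simplifying assumptions: the "network" is a hand-written linear classifier over a 3x3 image, the weights and pixel values are invented for illustration, and the search is plain brute force rather than whatever method the Kyushu researchers used.

```python
def score(weights, bias, image):
    """Linear score: positive means 'taxi', otherwise 'dog'."""
    return sum(w * p for w, p in zip(weights, image)) + bias

def classify(weights, bias, image):
    return "taxi" if score(weights, bias, image) > 0 else "dog"

def one_pixel_attack(weights, bias, image, candidates=(0.0, 1.0)):
    """Return the first (pixel index, new value) that changes the label, or None."""
    original = classify(weights, bias, image)
    for idx in range(len(image)):
        for value in candidates:
            if value == image[idx]:
                continue  # not a modification
            trial = list(image)
            trial[idx] = value
            if classify(weights, bias, trial) != original:
                return idx, value
    return None

# Hypothetical weights for a flattened 3x3 image; the classifier leans
# heavily on the center pixel (index 4).
weights = [0.2, -0.1, 0.3, 0.1, -2.0, 0.2, -0.2, 0.1, 0.1]
bias = 0.1
image = [0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.5]

before = classify(weights, bias, image)
attack = one_pixel_attack(weights, bias, image)
print("original label:", before)
print("label-flipping change (index, value):", attack)
```

In this contrived case the classifier depends so strongly on one input that changing it alone reverses the decision. Real attacks on deep networks exploit subtler sensitivities, but the principle of a small input change crossing a decision boundary is the same.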

“Trained neural networks can be tricked to focus on patterns in images that are barely noticeable by humans into a situation where they completely misinterpret the contents,” says Chiyuan Zhang, a researcher working at the Center for Brains, Minds and Machines based at the Massachusetts Institute of Technology (MIT). “This leads to security concerns. It could potentially be used to implant backdoors in neural network models in ways that are hard to identify.”

Mathematical interpretations of how deep neural networks learn offer one path to understanding why they generalize so effectively, and may provide mechanisms for them to avoid training on the wrong types of feature. Researchers regard the layering used by deep learning as one vital attribute. The layers make it possible to pull identifying marks out of images no matter where they are within the sample.

However, that is only part of the problem.

Tomaso Poggio, principal investigator at the McGovern Institute for Brain Research based at MIT, says, “It is important to understand there is much more work to be done [in deep learning]. Our hope is that if we understand better how they work we will understand better how they fail and, by doing that, improve them.”

One strand of math-oriented research focuses on information theory. Naftali Tishby of the Hebrew University of Jerusalem believes the training processes in neural networks illustrate a branch of information theory that he helped develop two decades ago. He coined the term “information bottleneck” to describe the most efficient way that a system can find relationships between only the pieces of data that matter for a particular task and treat everything else within the sample as irrelevant noise.
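In the information bottleneck literature, this trade-off is usually written as an optimization problem. The symbols below follow the common convention and do not come from this article: X is the input, Y the label, T the compressed internal representation, I(·;·) mutual information, and β a positive trade-off parameter.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

The first term rewards discarding information about the input; the second rewards keeping whatever remains predictive of the label. A larger β favors prediction over compression.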

Figure. A simple neural network has up to two layers hidden between the input and output layers; more than that, and it becomes a Deep Learning Neural Network, which can model complex non-linear relationships.

Tishby’s hunch was that neural networks provide examples of the information bottleneck at work. He worked with colleague Ravid Shwartz-Ziv to build a simpler form of neural network able to demonstrate how the process works. First the network finds important connections by adjusting the weights that neurons use to determine which of their peers in the network should have the greatest influence. Then, the network optimizes during what Tishby calls the compression phase. Through this process, neurons adjust weights to disregard irrelevant inputs. These inputs might represent the backgrounds of images of animals presented to a network trained to classify breeds using visual features.

However, an attempt last autumn by an independent team to replicate the results obtained by Tishby and Shwartz-Ziv using techniques employed by production neural networks failed to yield the expected compression phase consistently. Often, a neural network will achieve peak performance some time before it moves into the phase that Tishby refers to as compression, or may simply not follow the same pattern. Yet, these networks exhibit the generalization capability that the information bottleneck concept predicts. “I think the information bottleneck may be wrong or, in any case, unable to explain the puzzles of deep nets,” Poggio says.

Poggio and colleagues approach the problem of understanding deep learning by treating it as a process of iterative optimization. In learning what is important from the training data, the network arranges itself to minimize an error function, an operation common to optimization problems. In practice, the error functions for neural networks for a given set of training data seem to exhibit multiple “degenerate” minima, which seem to make it easier to find good solutions that generalize well. However, away from these wide valleys that lie toward the bottom of the error function’s landscape, there are countless local minima that could trap an optimizer in a poor solution.

The secret to deep learning’s success in avoiding the traps of poor local minima may lie in a decision taken primarily to reduce computation time. After each pass through the training set, the backpropagation algorithm that tunes the weights used by each neuron for the next test should analyze all of the data. Instead, stochastic gradient descent (SGD) uses a much smaller random sample that is far easier to compute. The simplification causes the process to follow a more random path towards the global minimum than full gradient descent. A result of this seems to be that SGD can often skip over poor local minima.
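The difference between the two update rules can be sketched on a toy problem. The example assumes a one-variable linear regression with invented data (y = 3x + 1 plus noise), and the batch size, learning rate, and step count are arbitrary choices. Because this problem is convex, the sketch only contrasts the mechanics of full-batch and mini-batch updates; it does not demonstrate escaping poor local minima.

```python
import random

def make_data(n, rng):
    # Hypothetical toy data: y = 3x + 1 plus small Gaussian noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]

def gradient(w, b, batch):
    """Gradient of the mean squared error of y = w*x + b over the batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw / len(batch), gb / len(batch)

def train(data, batch_size, steps, lr, rng):
    """Gradient descent; a batch_size below len(data) makes it stochastic."""
    w = b = 0.0
    for _ in range(steps):
        if batch_size >= len(data):
            batch = data                           # full gradient descent
        else:
            batch = rng.sample(data, batch_size)   # small random sample (SGD)
        gw, gb = gradient(w, b, batch)
        w -= lr * gw
        b -= lr * gb
    return w, b

rng = random.Random(0)
data = make_data(200, rng)

w_full, b_full = train(data, batch_size=len(data), steps=500, lr=0.1, rng=rng)
w_sgd, b_sgd = train(data, batch_size=8, steps=500, lr=0.1, rng=rng)

print(f"full batch: w={w_full:.3f}, b={b_full:.3f}")
print(f"mini-batch: w={w_sgd:.3f}, b={b_sgd:.3f}")
```

Each mini-batch step touches 8 examples instead of 200, so it is far cheaper, but its gradient is only a noisy estimate of the full one. That noise is what makes the path toward the minimum more random.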

“We are looking for a minimum that is most tolerant to perturbation in parameters or inputs,” says Poggio. “I don’t know if SGD is the best we can do now, but I find almost magical that it finds these degenerate solutions that work.”

For Soatto and his UCLA colleague Alessandro Achille, more clues as to how to make neural networks work better will come through studies that use information bottleneck theory to look at the interactions between different network architectures and the training data.

Says Soatto, “We believe [Tishby’s] ideas are substantially correct, but there are a few technical details that have to be worked out. The fact that we converged to similar ideas is remarkable because we started from completely independent premises.”

Achille and Soatto used ideas from the information bottleneck to develop training optimizations that help smaller networks tune out noise. Timing also appears to be important, they believe. One 2017 experiment performed with Matteo Rovere of the Ann Romney Center for Neurologic Diseases in Boston, MA, indicated there is a critical phase early in training that proves crucial when the network weights are easily changed and the relationships between neurons quite plastic. The early phase has similarities to that proposed by Tishby and Shwartz-Ziv. Once this phase takes place, it seems to bias the network toward finding good minima as optimization proceeds.

Although the work on the information bottleneck and on optimization theory is beginning to lead to a better understanding of how deep learning works, Soatto says, “Most of the field is still in the ‘let all the flowers bloom’ phase, where people propose different architectures and folks adopt them, or not. It is a gruesome trial-and-error process, also known as ‘graduate student descent’, or GSD for short. Together with SGD, these are the two battle-horses of modern deep learning.”

Further Reading

Shwartz-Ziv, R., and Tishby, N.
Opening the Black Box of Deep Neural Networks via Information. arXiv: https://arxiv.org/abs/1703.00810

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O.
Understanding Deep Learning Requires Rethinking Generalization. arXiv: https://arxiv.org/abs/1611.03530

Poggio, T., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., and Mhaskar, H.
Theory of Deep Learning III: Explaining the Non-Overfitting Puzzle. CBMM Memo 073 (2017). https://cbmm.mit.edu/sites/default/files/publications/CBMM-Memo-073.pdf

Achille, A., Rovere, M., and Soatto, S.
Critical Learning Periods in Deep Neural Networks. UCLA-TR-170017. arXiv: https://arxiv.org/abs/1711.08856