Despite the rapid advances it has made over the past decade, deep learning presents many industrial users with problems when they try to implement the technology, issues that the Internet giants have worked around through brute force.
"The challenge that today's systems face is the amount of data they need for training," says Tim Ensor, head of artificial intelligence (AI) at U.K.-based technology company Cambridge Consultants. "On top of that, it needs to be structured data."
Most of the commercial applications and algorithm benchmarks used to test deep neural networks (DNNs) consume copious quantities of labeled data; for example, images or pieces of text that have already been tagged in some way by a human to indicate what the sample represents.
The Internet giants, who have collected the most data for use in training deep learning systems, have often resorted to crowdsourcing measures such as asking people to prove they are human during logins by identifying objects in a collection of images, or simply buying manual labor through services such as Amazon's Mechanical Turk. However, this is not an approach that works outside a few select domains, such as image recognition.
Holger Hoos, professor of machine learning at Leiden University in the Netherlands, says, "Often we don't know what the data is about. We have a lot of data that isn't labeled, and it can be very expensive to label. There is a long way to go before we can make good use of a lot of the data that we have."
To attack a wider range of applications beyond image classification and speech recognition and push deep learning into medicine, industrial control, and sensor analysis, users want to be able to use what Facebook's chief AI scientist Yann LeCun has tagged the "dark matter of AI": unlabeled data.
In parallel with those working in academia, technology companies such as Cambridge Consultants have investigated a number of approaches to the problem. Ensor sees the use of synthetic data as fruitful, citing as one example a system his company built to design bridges and control robot arms. It is trained using simulations of the real world, relying on calculations made by the modeling software to identify strong and weak structures as the DNN makes design choices.
Although simulation can create useful data for a back-end DNN to learn from with little input from manually labeled data, some researchers in the field believe these systems should do much better at handling unlabeled inputs without the help of synthetic data. Their hope is that by focusing on patterns in the core data, deep learning can approach problems more intuitively. Some see the current reliance on labeled data as even being counterproductive to the development of effective AI.
Alexei Efros, a professor in the Electrical Engineering and Computer Sciences department of the University of California, Berkeley, cites a problem with the current approach to handling images: they generally contain far more information than is implied by the relatively simple tags applied by humans for training purposes. "The problem I see now is that supervising with high-level concepts like 'door' or 'airplane' before the computer even knows what an object is simply invites disaster."
What Efros wants is for the AI systems to capture the information that humans remember after they have seen a picture. "Do we have a photographic memory? No, we don't. But we have a cool embedding that captures a lot of non-linguistic information," Efros says. "We need to get away from semantics and force the computer to represent more of what is actually in the image."
Leon Gatys and colleagues at the University of Tübingen, Germany, in 2017 showed an example of the problem with the way in which deep learning models are trained today. A DNN will label random patches of fur-like texture as a dog just as readily as it will a picture of the animal itself. "The networks are lazy," says Efros. "They do the minimum work required to get the reward or minimize the loss. Typically, local image statistics, the texture, are easier for the network to compute than long-range structure. We need to make our computers work harder to understand what they see."
Natural language processing (NLP) is an area where the performance of neural networks has been improved by forcing them to put much greater emphasis on the structure of the data they process and create embeddings similar to those used by humans to capture the information they learn.
For some time, NLP has relied on clustering, a form of machine learning in which the algorithm finds structure for itself, to try to determine which words have similar meanings. The result is that, rather than analyzing text as disconnected words, a neural network provided with additional information by the clustering receives a head start. The most common technique is to give each unique word in the dictionary a value in the form of a vector. Translation software takes advantage of this, thanks to the tendency of words in different languages to map to similar positions in the vector space.
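The idea of words as vectors can be illustrated with a minimal sketch. The three-dimensional vectors below are invented purely for illustration (real systems learn vectors with hundreds of dimensions from large text collections), and the helper names `cosine_similarity` and `nearest` are hypothetical. The sketch shows only how closeness in the vector space can stand in for similarity of meaning.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# Learned embeddings typically have hundreds of dimensions.
EMBEDDINGS = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "car":   [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector points most nearly the same way."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))
```

With these toy vectors, `nearest("cat")` returns `"dog"` and `nearest("truck")` returns `"car"`, because each pair points in almost the same direction.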
The problem with the simpler clustering methods is that words often have multiple meanings. 'Bank', for example, can act as a verb or a noun which, in turn, can refer to the bank of a river or a financial bank. Rather than use simple clustering, a new generation of systems that first appeared in 2018 employ DNNs to take much more contextual information into account.
Examples include the Bidirectional Encoder Representations from Transformers (BERT), published by Google, and Carnegie Mellon University's XLNet. Each uses subtly different algorithms to analyze the texts used for training. BERT, for example, randomly removes words from the input text and forces the model to predict the best candidate for each gap. XLNet's developers argue the data corruption implied by BERT's word masking degrades performance. They instead opted for a system that uses the relative positions of words in the training texts to determine how they relate to each other.
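The word-removal step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization in the usage note, and the 15% default masking rate are choices made for this sketch, and production systems work on subword tokens with additional replacement rules.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption of this sketch


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original word the model would be asked to predict.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    targets = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[position] = token
            masked[position] = MASK
    return masked, targets
```

Calling `mask_tokens("the boat drifted toward the bank".split())` yields the sentence with some words replaced by `[MASK]`, plus a dictionary of the hidden words. No human labels are needed, because the training targets come from the text itself.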
Though they are sophisticated DNNs in their own right, systems like BERT and XLNet are not standalone NLP engines. They simply provide richer information for use by downstream deep-learning systems, an approach that has paid off according to benchmarks such as GLUE that were designed to measure the capabilities of NLP algorithms. Late last year, the developers of that suite of tests revealed at the NeurIPS conference that they had created a battery of more stringent tests, which they tagged SuperGLUE, to handle AI systems that take advantage of BERT-like pretraining.
Researchers working on image, video, and audio recognition see DNNs that perform pretext tasks, following in the footsteps of those used in NLP, as important to finding inherent structure in the data their own systems analyze. They believe the pretext tasks should force machine-learning models to do a better job of deconstructing data.
Several years ago, inspired by the clustering used in NLP, researchers working with Efros had a DNN pipeline learn the spatial relationships between patches cut out of images. For example, given pictures of animals from which random patches were removed, the DNN might begin to learn how the ears are placed relative to the eyes, nose, and mouth by determining which arrangement of patches most closely matches the completed images.
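One way to picture a spatial pretext task of this kind is as a data generator that cuts two nearby patches from an unlabeled image and records where the second sat relative to the first. The sketch below is a simplified assumption, not the Berkeley team's pipeline: the eight-neighbor labeling, the names `sample_patch_pair` and `NEIGHBOR_OFFSETS`, and the representation of an image as a list of rows are all choices made for illustration.

```python
import random

# (row, column) offsets of the eight neighbors around a center patch,
# measured in whole patches. The index into this list is the training label.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def crop(image, top, left, size):
    """Cut a size x size patch out of an image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_patch_pair(image, patch_size, rng=None):
    """Return (center_patch, neighbor_patch, label) from one unlabeled image."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if height < 3 * patch_size or width < 3 * patch_size:
        raise ValueError("image must be at least three patches tall and wide")
    # Pick a center position that leaves room for a neighbor on every side.
    top = rng.randint(patch_size, height - 2 * patch_size)
    left = rng.randint(patch_size, width - 2 * patch_size)
    label = rng.randrange(len(NEIGHBOR_OFFSETS))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    center = crop(image, top, left, patch_size)
    neighbor = crop(image, top + d_row * patch_size,
                    left + d_col * patch_size, patch_size)
    return center, neighbor, label
```

A network trained to predict `label` from the two patches has to learn something about how parts of objects are arranged, yet the label costs nothing to produce.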
Even when trying to force it to recognize the layout of objects within an image, the Berkeley team ran across the DNN's propensity to cheat. Rather than focus on high-level features, they found the network homed in on color aberrations, undetectable to the naked eye, introduced by the lenses used to take the photographs. This forced them to process the images to remove some of the color information. "It stumped us for a long time before we figured out what was going on," Efros says.
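A minimal sketch of what removing some of the color information could look like follows. It is an illustrative assumption, not a description of the team's actual preprocessing: the function name `keep_one_channel` and the choice to copy one randomly selected channel into all three are hypothetical. The point is only that, once the channels are identical, differences between them can no longer serve as a shortcut.

```python
import random


def keep_one_channel(image, rng=None):
    """Copy one randomly chosen color channel into all three.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The input is left unchanged; a new image is returned.
    """
    rng = rng or random.Random()
    channel = rng.randrange(3)  # 0 = red, 1 = green, 2 = blue
    return [[(pixel[channel],) * 3 for pixel in row] for row in image]
```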
Work on the kind of patch embedding performed by Efros' team has led to networks that are able to add color to monochrome images effectively and to fill in blank spaces in images using material the DNN has inferred from similar pictures. Similar work on pretext training at Facebook AI Research in France used a combination of clustering and training on rotated images to improve their image classifier's ability to work on very large collections of images. They demonstrated their DeeperCluster engine using a massive archive of unlabeled photographs on the Flickr website. After processing using the pretext task, a second phase trains using more conventional techniques with the help of labeled data so the engine can categorize the images in ways humans understand.
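Training on rotated images is often framed as asking the network which way up a picture is. The sketch below generates such training examples from unlabeled images. It is a simplified assumption rather than the DeeperCluster implementation: the four-way rotation label and the names `rotate90` and `make_rotation_example` are hypothetical, and the clustering half of the method is not shown.

```python
import random


def rotate90(image):
    """Rotate an image (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_example(image, rng=None):
    """Return (rotated_image, label) where label counts quarter turns (0-3)."""
    rng = rng or random.Random()
    label = rng.randrange(4)  # 0, 1, 2, 3 -> 0, 90, 180, 270 degrees
    rotated = [list(row) for row in image]  # copy so the input is untouched
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label
```

As with the other pretext tasks, the label is produced automatically. Human-labeled data is needed only in the second, conventional training phase.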
A lingering issue for systems in image recognition and non-NLP tasks is how to develop pretext tasks that are truly effective at forcing the DNNs to understand the short- and long-range structure within the data. Efros says pretext task construction remains more an art than a science at the moment.
Hoos says approaches based on synthetic data and pretraining can help systems better understand, for example, the relationships between objects in images. But he questions whether deep learning will hold all the answers. In image analysis, he says, it is hard to find a technology remotely competitive with deep learning. "But I would not bet everything on this approach."
Further work, Hoos believes, may show that alternatives to deep learning will be more fruitful in areas such as automation and robot control, situations where understanding the connections between objects in the field of view is critical. "If deep learning is used on its own in the way it exists today, you can't do cutting-edge robotics. Also, we often want explainability in the models, which is often difficult with deep learning. We need to understand what the systems are doing and how they made a decision. We should keep pushing on semi-supervised and deep learning, but at the same time, we should not neglect other methods."
Gatys, L., Ecker, A.S., Bethge, M.
Texture and Art with Deep Neural Networks. Current Opinion in Neurobiology, Volume 46, pages 178–186 (2017)
Doersch, C., Gupta, A., Efros, A.A.
Unsupervised Visual Representation Learning by Context Prediction. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), pages 1422–1430, https://arxiv.org/abs/1505.05192
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.
XLNet: Generalized Autoregressive Pretraining for Language Understanding. Advances in Neural Information Processing Systems 32 (NeurIPS 2019), https://arxiv.org/abs/1906.08237
Caron, M., Bojanowski, P., Mairal, J., Joulin, A.
Unsupervised Pre-Training of Image Features on Non-Curated Data. Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV 2019), pages 1422–1430
©2020 ACM 0001-0782/20/6