One of the main reasons behind the quantitative and data-driven revolution that took artificial intelligence (AI) by storm in the early 1990s was the brittleness of symbolic (logical) systems and their never-ending need for carefully crafted rules. The rationale was that there is a knowledge acquisition bottleneck in the quest to build intelligent systems. The new cliché? Let the system 'discover' the logic/rules by crunching as much data as you can possibly get your hands on. With powerful machine learning techniques, the system will 'discover' an approximation of the underlying probability distribution, will 'learn' what the data is and what it means, and will be ready for any new input thereafter. It all sounded good. Too good to be true, in fact.
Notwithstanding the philosophical problems with this paradigm (for one thing, that induction is not a sound inference methodology—outside of mathematical induction, that is), in practice it seems that avoiding the knowledge acquisition bottleneck has not resulted in any net gain. In fact, in the world of data science it seems that data scientists are spending more than half of their time not on the science (models, algorithms, inferences, etc.) but on preparing, cleaning, and massaging the data, making sure it is ready to be pushed into the data analysis machinery—whether that machinery is an SVM, a deep neural network, or what have you. Some studies indicate that data scientists spend almost 80% of their time on preparing data, and even after that tedious and time-consuming process is done, the data 'scientist' usually blames unexpected results on the inadequacy of the data, and another long iteration of data collection, cleaning, transformation, and massaging begins. Given that data scientists are some of the most highly paid professionals in the IT industry today, shouldn't their spending 80% of their time cleaning and preparing the data to enter the inferno raise some flags—or, at least, some eyebrows?
I have not yet mentioned that such models will, even after the long and tedious process of data cleaning and data preparation, still be vulnerable. These models can be easily fooled (or 'attacked') by inputs that are nearly identical to ones they classify correctly, yet cause them to classify erroneously. This problem of adversarial data is getting considerable attention, without a solution in sight. In fact, it has been shown that virtually any machine learning model can be attacked with adversarial data (whether the input is an image, an audio signal, or text) and made to output whatever classification the attacker wants, often by changing just one pixel, one character, or one audio sample—changes that would otherwise be unnoticeable to a human.
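To make the point concrete, here is a minimal sketch of the idea behind such attacks, in the style of the fast gradient sign method. The classifier, its weights, and the input below are all made up for illustration; real attacks target trained deep networks, but the mechanism is the same: nudge every feature a tiny step in the direction that most increases the model's score.

```python
import numpy as np

# Toy illustration of an adversarial perturbation (FGSM-style) against a
# hypothetical linear classifier. Weights and input are fabricated purely
# for demonstration.

def predict(w, b, x):
    """Return class 1 if the linear score is positive, else class 0."""
    return int(w @ x + b > 0)

rng = np.random.default_rng(0)
w = rng.normal(size=100)       # 'learned' weights (hypothetical)
b = 0.0
x = -0.1 * np.sign(w)          # an input the model places in class 0

# Perturb each feature by a tiny epsilon along the sign of the gradient
# of the score with respect to the input (for a linear model, sign(w)).
epsilon = 0.2
x_adv = x + epsilon * np.sign(w)

print(predict(w, b, x))             # original input: class 0
print(predict(w, b, x_adv))         # perturbed input: class 1
print(np.max(np.abs(x_adv - x)))    # no feature moved by more than 0.2
```

No single feature changes by more than 0.2, yet the classification flips completely; in high dimensions, many imperceptible nudges add up to a large change in the model's score.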
Maybe not everything we want is in some data distribution? Maybe we are in a (data) frenzy? Maybe we went a bit too far in our reaction to the knowledge acquisition bottleneck?
Walid Saba is Principal AI Scientist at Astound.ai, where he works on Conversational Agents technology. Before Astound, he co-founded and was the CTO of Klangoo. He has published over 35 articles in AI and NLP, including an award-winning paper at KI-2008.