Researchers at Columbia University and the University of California, Berkeley have developed software that takes humans out of the most error-prone steps of cleaning big data.
The research team says the software, called ActiveClean, analyzes a user's prediction model to decide which mistakes to fix first, updating the model as it works.
The researchers note the system uses machine learning to analyze a model's structure and determine which kinds of errors will throw the model off most. ActiveClean targets that data first, in decreasing order of priority, and cleans just enough data to assure users their model will be reasonably accurate, they say. With each pass, users see their model improve.
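The loop described above (rank dirty records by how much they are likely to affect the model, clean the top few, update the model, repeat) can be sketched roughly as follows. This is an illustrative sketch only, not ActiveClean's actual algorithm or code: it assumes a logistic regression model, uses the per-record gradient norm as a stand-in priority score, and every name in it (`gradient_norms`, `fit`, `prioritized_clean`, `clean_fn`) is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_norms(w, X, y):
    """Per-record norm of the logistic-loss gradient: |p - y| * ||x||.
    Used here as an assumed proxy for how much a record moves the model."""
    residual = np.abs(sigmoid(X @ w) - y)
    return residual * np.linalg.norm(X, axis=1)

def fit(w, X, y, lr=0.1, steps=200):
    """Plain batch gradient descent on the logistic loss, starting from w."""
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w = w - lr * grad
    return w

def prioritized_clean(X_dirty, y_dirty, clean_fn, batch_size, rounds):
    """Clean the highest-priority records first, updating the model each pass.

    clean_fn(indices) is a hypothetical oracle (for example, a human analyst)
    that returns corrected (features, labels) for the requested records.
    """
    X, y = X_dirty.copy(), y_dirty.copy()
    is_clean = np.zeros(len(y), dtype=bool)
    w = fit(np.zeros(X.shape[1]), X, y)            # initial model on dirty data
    for _ in range(rounds):
        scores = gradient_norms(w, X, y)
        scores[is_clean] = -np.inf                 # never re-clean a record
        picked = np.argsort(scores)[::-1][:batch_size]
        picked = picked[~is_clean[picked]]
        if picked.size == 0:
            break
        X[picked], y[picked] = clean_fn(picked)    # replace with cleaned values
        is_clean[picked] = True
        w = fit(w, X, y)                           # update model after each pass
    return w, is_clean

# Synthetic demo: 10% of labels are flipped to simulate dirty records.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 2))
y_true = (X_true[:, 0] + X_true[:, 1] > 0).astype(float)
y_noisy = y_true.copy()
flipped = rng.choice(200, size=20, replace=False)
y_noisy[flipped] = 1.0 - y_noisy[flipped]

w, cleaned = prioritized_clean(
    X_true, y_noisy,
    clean_fn=lambda idx: (X_true[idx], y_true[idx]),
    batch_size=10, rounds=3,
)
print("records cleaned:", int(cleaned.sum()))
```

Ranking by gradient norm is just one plausible way to express "errors that throw the model off most"; the researchers' system may score and sample records differently.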
With no data cleaning, a model trained on the Dollars for Docs dataset could predict an improper donation just 66% of the time. However, ActiveClean raised the detection rate to 90% by cleaning just 5,000 records, according to the researchers. By comparison, they say an active learning method required 10 times as much data, or 50,000 records, to reach a comparable detection rate.
"Dirty data is pervasive and prevents people from doing useful things," says Eugene Wu, a member of Columbia's Data Science Institute. "This is our first step towards automating the data-cleaning process."
From Columbia University
Abstracts Copyright © 2016 Information Inc., Bethesda, Maryland, USA