Researchers at Drexel University, the University of Maryland, the University of Goettingen, and Princeton University have developed a code stylometry technique that uses natural language processing and machine learning to identify the authors of source code from their coding style.
The researchers say the technology could be applicable to a wide range of situations in which ascertaining the originating coder is important, such as to help identify the author of malicious source code.
The researchers say they built a syntactic feature set from abstract syntax trees, which are derived from language-specific syntax and keywords; the feature set "was created to capture properties of coding style that are completely independent from writing style." They tested the code stylometry on publicly available data from Google's Code Jam, using each coder's solutions to the same set of problems as a training dataset from which to learn that coder's style. The researchers then took solutions the same coders wrote to another problem, with the authors' identities withheld, and tried to identify the author of each.
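The train-then-attribute procedure described above can be illustrated with a minimal sketch. It is not the researchers' implementation: it assumes Python source and Python's standard `ast` module for convenience, uses plain counts of syntax-tree node types as a stand-in for their syntactic feature set, and uses cosine similarity against a per-author profile as a stand-in for their machine learning classifier. The author names and code samples are hypothetical.

```python
import ast
import math
from collections import Counter


def syntactic_features(source: str) -> Counter:
    """Count how often each AST node type occurs in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: dict) -> dict:
    """Build one profile per author by summing features over their known solutions."""
    profiles = {}
    for author, sources in samples.items():
        profile = Counter()
        for src in sources:
            profile.update(syntactic_features(src))
        profiles[author] = profile
    return profiles


def predict(profiles: dict, source: str) -> str:
    """Attribute unlabeled source to the author whose profile is most similar."""
    features = syntactic_features(source)
    return max(profiles, key=lambda author: cosine(profiles[author], features))


# Hypothetical training data: two authors solving the same two problems.
TRAINING = {
    "alice": [
        "def solve(xs):\n    return sum([x * x for x in xs])\n",
        "def solve(xs):\n    return max([abs(x) for x in xs])\n",
    ],
    "bob": [
        "def solve(xs):\n    total = 0\n    i = 0\n    while i < len(xs):\n"
        "        total = total + xs[i] * xs[i]\n        i = i + 1\n    return total\n",
        "def solve(xs):\n    best = 0\n    i = 0\n    while i < len(xs):\n"
        "        if xs[i] > best:\n            best = xs[i]\n        i = i + 1\n    return best\n",
    ],
}

# A solution to a different problem, with the author withheld.
UNKNOWN = "def solve(xs):\n    return min([x + 1 for x in xs])\n"

if __name__ == "__main__":
    profiles = train(TRAINING)
    print(predict(profiles, UNKNOWN))
```

In this toy setup the comprehension-heavy unknown solution is attributed to the hypothetical author "alice", because its mix of node types resembles her profile more than the loop-and-index profile of "bob". Node-type counts alone are a crude signal; the sketch is meant only to show the shape of the workflow, not to suggest it would reach the accuracy reported for the researchers' method.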
The code stylometry achieved 95-percent accuracy in identifying the author of anonymous code.
In addition, the researchers found that coding style is more distinctive in solutions to harder problems. "This might indicate that as programmers become more advanced, they build a stronger coding style compared to newbies," according to the researchers.
Abstracts Copyright © 2015 Information Inc., Bethesda, Maryland, USA