
A New 'Model for Models' in Software Development Effort Estimation

By Gregory Goth

July 17, 2015



Researchers at the University of Otago in New Zealand have created a software development effort estimation model they would like to see used as a baseline "model for models" of effort prediction quality for all future model comparisons in the field.

The corresponding author of the researchers’ paper on the topic, Peter Whigham, says that desire is not meant to be self-aggrandizing. In fact, he says, the term "baseline" should be taken to mean just that.

"The motivation for the model is to be a baseline model, not necessarily a good model," he says. "It's trying to address the fact that there are highly variable results in a number of fields within machine learning, but software effort estimation seems to have a serious problem right now, with people producing results and saying 'Model A is better than Model B,' and somebody else saying, 'no, Model B is better than Model A,' and it goes on and on like that. There's no consistency in the way things are compared.

"What it's meant to say is if you can't be better than this model, then you shouldn't be promoting your method as being good. It would be a bit like if you were thinking about being an international competitor in a sport, you have to at least be as good as a qualifying time."

The researchers’ paper, "A Baseline Model for Software Effort Estimation," published in the May 2015 issue of ACM Transactions on Software Engineering and Methodology, introduces an automatically transformed linear model (ATLM) as a suitable baseline for comparison against software effort estimation (SEE) methods.

More precisely, Whigham says, ATLM is not actually a single model, since for different project and business environments there is different data, and therefore a different model. "We are really promoting ATLM as a model form that can be used with all types of data and (that) will allow a clear baseline sanity check if an academic wants to argue that their really complicated model has very low error on unseen predictions for a particular set of data."
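The paper's reference implementation is written in R; as a rough illustration of the idea only, a simplified sketch in Python might look like the following. The transformation rule (log-transform strongly skewed columns and the response, fit ordinary least squares, back-transform predictions) and all function names here are assumptions for illustration, not the authors' actual code.

```python
import numpy as np

def atlm_style_fit_predict(X_train, y_train, X_test):
    """Sketch of an ATLM-style baseline: automatically log-transform
    skewed numeric columns and the response, fit ordinary least
    squares, then back-transform predictions. Illustrative only;
    the published model form differs in its details."""
    def maybe_log(col):
        # Crude automatic transformation rule: log1p a column whose
        # sample skewness is large (threshold is an assumption).
        skew = np.mean(((col - col.mean()) / (col.std() + 1e-12)) ** 3)
        return (np.log1p(col), True) if abs(skew) > 1.0 else (col, False)

    cols, flags = [], []
    for j in range(X_train.shape[1]):
        c, f = maybe_log(X_train[:, j])
        cols.append(c)
        flags.append(f)
    Xt = np.column_stack(cols)
    yt = np.log1p(y_train)  # effort data is typically right-skewed

    # Fit with an intercept via least squares.
    A = np.column_stack([np.ones(len(Xt)), Xt])
    beta, *_ = np.linalg.lstsq(A, yt, rcond=None)

    # Apply the same column transformations to the test data.
    test_cols = [np.log1p(X_test[:, j]) if flags[j] else X_test[:, j]
                 for j in range(X_test.shape[1])]
    At = np.column_stack([np.ones(len(X_test))] + test_cols)
    return np.expm1(At @ beta)  # back-transform to the effort scale
```

Because the fit is a deterministic least-squares solve with no tuned parameters, repeated runs on the same data give identical results, which is the replication property the authors emphasize.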

Within academic research circles, the paper may indeed provide not only the coding foundation, but also a conceptual one, for discussing the idea of a suitable SEE baseline model. For instance, the paper supplies a list of seven characteristics such a model should contain and, Whigham says, "surprisingly enough, we spent some time trying to find a reference that would have had that list, and couldn't find anyone who had written out a list of what characteristics a baseline model should have."

"ATLM is simple, yet performs well over a range of different project types," Whigham and co-authors Caitlin Owen and Stephen MacDonell wrote. "In addition, ATLM may be used with mixed numeric and categorical data and requires no parameter tuning. It is also deterministic, meaning that results obtained are amenable to replication."

Whigham says the principles of replication and reproducibility weighed significantly as he and his colleagues worked on their research. They created their model in the open source R environment and provided a reference implementation in the paper's appendix to make the model accessible.

"We supplied an implementation of the baseline model so people don't have an excuse to say 'I couldn't build it,' and as part of that we also supplied software that you can hand a dataset and have it do a 10-way cross-validation 41 times, or whatever it might be," he says, explaining this is meant to encourage rigor in model comparison. One of the two previously published papers the team used to benchmark their model based its conclusions on a single 10-fold cross-validation.

"If you look around at machine learning methods or modeling techniques, often they will do a single 10-way cross-validation," Whigham says, "and people often say that's more than adequate to give an estimate of the quality of the model, and for datasets that are very well behaved, that may be true."
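The repeated cross-validation Whigham describes can be sketched as follows. This is a minimal Python illustration, not the team's R software; the function names and the simple mean-effort baseline predictor are assumptions for the example.

```python
import numpy as np

def repeated_kfold_mae(X, y, fit_predict, k=10, repeats=41, seed=0):
    """Run k-fold cross-validation `repeats` times, reshuffling the
    data each time, and return the mean absolute error of every fold
    so the spread of scores (not just one number) can be inspected."""
    rng = np.random.default_rng(seed)
    idx_all = np.arange(len(y))
    scores = []
    for _ in range(repeats):
        idx = rng.permutation(idx_all)
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            preds = fit_predict(X[train], y[train], X[fold])
            scores.append(np.mean(np.abs(preds - y[fold])))
    return np.array(scores)

def mean_baseline(X_train, y_train, X_test):
    # Simplest possible predictor: the mean training effort.
    return np.full(len(X_test), y_train.mean())
```

A single 10-fold run yields only 10 fold scores; 41 repetitions yield 410, and on a skewed dataset the spread of those scores shows how much a single run's estimate of model quality can mislead.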

However, Whigham contends, in fields such as software effort estimation, datasets are highly skewed, so a single cross-validation run can give unreliable results; repeating the run many times, as rigor demands, can in turn be computationally expensive if the model is complex.

"The issue is that doing something once for these types of datasets is not enough to actually get a true measure of the quality of the model," he says. "People have to be very wary of the way they measure the quality of their models, because the dataset itself is not well-behaved. And the unfortunate thing is that of course, when you have a very complex model, it may take a long time to run and, even worse, it may not even be a deterministic model; it may be stochastic. For example, if it's an evolutionary algorithm trying to converge, it won't always behave the same, so then how many times do you have to run it? And you have to run it many, many more times than single 10-fold cross-validation."

Unknown Crossover Potential

Of course, once estimation models leave the academy and enter the realm of practical application, results are uneven at best – or, in the words of one expert in project management, perhaps too even, to the point of consensus mediocrity.

"I've applied a lot of these models to so many different projects, and I haven't seen one yet that can really fit the bill," says Linda Esker, a senior applied technology engineer at the Fraunhofer Center for Experimental Software Engineering at the University of Maryland. "The reason why they don't fit the bill is that a project is not just a project. You don't just start one and get it done; there are too many things that happen during a project that perturb everything: the government doesn't quite have funding this year, or we have a change in management, or we have a change in priorities."

Barbara Kitchenham, professor of quantitative software engineering at Keele University in the U.K., reviewed the Otago group's paper before publication, and says it should be considered in its context as an attempt to bring greater rigor to academic research. "The issue of real-world traction seems a bit irrelevant to me," she says. "At the moment, I would not advise anyone in industry to use current research, unless they can reproduce the results themselves."

"When you're on a project, you don't have time," Esker says. "That's really what it comes down to. It's not that it's good or bad, it's 'do I have time to prove it myself, or let the academics prove it?' And I don’t mean that to be demeaning."

Given the Otago team's efforts to provide the reference implementation, perhaps the ATLM could help resolve this enduring chicken-and-egg quandary in evaluating new research by supplying a sanity check that separates the wheat from increasingly complex chaff, Whigham says. "These models will only be useful if they can help with effort estimation, but the way they are currently constructed, this is almost definitely not going to be the case," he says. "This issue won’t be helped by building better prediction models, but rather by considering how decision making under uncertainty is performed for SEE."

Ultimately, he says, ATLM may well give academics hunkered down in the trees of creating overly granular models a view of the ever-uncertain landscape of the software development forest.

"It may well help academics to start thinking about the bigger picture of SEE by putting less effort into building 'better prediction models,'" Whigham says, "and more time into understanding why SEE is hard, and what are the real uncertainties in this problem."

Gregory Goth is an Oakville, CT-based writer who specializes in science and technology.
