One of the success stories of software engineering research in recent decades has been the rise of empirical studies. Visionaries such as Vic Basili, Marvin Zelkowitz and Walter Tichy advocated empirical techniques early [1, 2, 3]; what enabled the field to take off was the availability of software repositories for such long-running projects as Apache, Linux and Eclipse , which researchers started mining using modern data analysis techniques.
These studies have yielded many insights including surprises. More experienced developers can produce more buggy code (Schröter, Zimmermann, Premraj, Zeller). To predict whether a module has bugs, intrinsic properties such as complexity seem to matter less than how many changes it went through (Moser, Pedrycz, Succi). Automatic analysis of user reports seems much better at identifying bugs than spotting feature requests (Panichella, Di Sorbo, Guzman, Visaggio, Canfora, Gall). More extensively tested modules tend to have more bugs (Mockus, Nagappan, Dinh-Trong). Eiffel programmers do use contracts (Estler, Furia, Nordio, Piccioni and me). Geographical distance between team members negatively affects the amount of communication in projects (Nordio, Estler, Tschannen, Ghezzi, Di Nitto and me). And so on.
The basic observation behind empirical software engineering is simple: if software products and processes are worthy of discussion, they must be worthy of quantitative discussion just like any natural artifact or human process. Usually at that point the advocacy cites Lord Kelvin:"If you cannot measure it, you cannot improve it" .
Not that advocacy is much needed today, at least for publishing research in software engineering and in computer science education. The need for empirical backing of conceptual proposals has achieved consensus. The so-called a "Marco Polo paper"  (I traveled far and saw wonderful things, thank you very much for your attention) no longer suffices for program committees; today they want numbers (and also, thankfully, a "threats to validity" section which protects you against suspicions that the numbers are bogus by stating why they might be). Some think this practice of demanding empirical backing for anything you propose has gone too far; see Jeff Ullman's complaint , pertaining to database research rather than software engineering, but reflecting some of the same discussions. Here we can counter Kelvin with another quote (for more effect attributed to Einstein, albeit falsely): not everything that can be counted counts, and not everything that counts can be counted.
There can indeed be too much of a good thing. Still, no one would seriously deny the fundamental role that empirical research has gained in modern software engineering. Which does not prevent us from considering the limits of what it has achieved; not in a spirit of criticism for its own sake, but to help researchers define an effective agenda for the next steps. There are in my opinion two principal limitations of current empirical results in software engineering.
The first has to do with the distinction introduced above between the two kinds of possible targets for empirical assessment: products (artifacts) versus processes.
Both aspects are important, but one is much easier to investigate than the other. For software products, the material of study is available in the form of repositories mentioned above, with their wealth of information about lines of code, control and data structures, commits, editing changes, bug reports and bug fixes. Processes are harder to grasp. You gain some information on processes from the repositories (for example, patterns and delays of bug fixing), but processes deserve studies of their own. For example, agile teams practice iterations (sprints) of widely different durations, from a few days to a few weeks; what is the ideal length? A good empirical answer would help many practitioners. But this example illustrates how difficult empirical studies of processes can be: you would need to try many variations with teams of professional programmers (not students) in different projects, different application areas, different companies; for the results to be believable the projects should be real ones with business results at stake, there should be enough samples in each category to ensure statistical significance, and the companies should agree to publication of some form, possibly anonymized, of the outcomes. The difficulties are formidable.
This issue of how to obtain project-oriented metrics is related to the second principal limitation of some of the initial empirical software engineering work: the risk of indulging in lamppost research. The term refers to the well-known joke about the drunkard who, in the dark of the night, searches for his lost keys next to the lamppost, not because he has lost them there but because it is the only place where one can see anything. To a certain extent all research is lamppost research: by definition, if you succeed in studying something, it will be because it can be studied. But the risk is to choose to work on a problem only, or principally, because it is easy to set up an empirical study -- regardless of its actual importance. To cite an example that I have used elsewhere, one may suspect that the reason there are so many studies of pair programming is not that it's of momentous relevance but that it is not hard to set up an experiment.
As long as empirical software engineering was a young, fledgling discipline, it made good sense to start with problems that naturally lended themselves to empirical investigation. But now that the field has matured, it may be time to reverse the perspective and start from the consumer's perspective: for practitioners of software engineering, what problems, not yet satisfactorily answered by software engineering theory, could benefit, in the search for answers, from empirical studies?
In the second part of this article, now available, I introduce a few examples, solicit readers for more, and propose a meeting next Fall to discuss these matters and advance the state of the art.
These reflections originated in a keynote that I gave at ESEM in Bolzano in 2010 (I am grateful to Barbara Russo and Giancarlo Succi for the invitation). I never wrote up the talk but I dug up the slides  since they might contain a few relevant observations. I used some of these ideas in a short panel statement at ESEC/FSE 2013 in Saint Petersburg, and I am grateful to Moshe Vardi for suggesting I should write them up for Communications of the ACM, which I never did (until this blog article).
References and notes
 Victor R. Basili: The role of experimentation in software engineering: past, present and future, in 18th ICSE (International Conference on Software Engineering), 1996, see here.
 Marvin V. Zelkowitz and Dolores Wallace: Experimental validation in software engineering, International Conference on Empirical Assessment and Evaluation in Software Engineering, March 1997, see here.
 Walter F. Tichy: Should computer scientists experiment more?, in IEEE Computer, vol. 31, no. 5, pages 32-40, May 1998, see here.
 And EiffelStudio, whose repository goes back to the early 90s and has provided a fertile ground for numerous empirical studies, some of which appear in my publication list.
 This compact sentence is how the Kelvin statement is usually abridged, but his thinking was more subtle.
 Raymond Lister: After the Gold Rush: Toward Sustainable Scholarship in Computing, Proceedings of 10th conference on Australasian Computing Education Conference, pages 3-17, see here.
 Jeffrey D. Ullman: Experiments as research validation: have we gone too far?, in Communications of the ACM, vol. 58, no. 9, pages 37-39, 2015, see here.
 Bertrand Meyer, slides of a talk at ESEM (Empirical Software Engineering and Measurement), Bozen/Bolzano, 2010, available here. (Provided as background material only, they are not a paper but just slide support for a 45-minute talk, and from several years ago.)