Doctoral Program Rankings For U.S. Computing Programs

Why do we care about rankings of graduate programs? Beyond the ability to cheer “We’re Number One!” there are very practical reasons. For example, resource allocation is often based on using rankings as synonyms for quality indicators. An institution recently decided it would become a “top 25 institution” by ensuring that each of its graduate programs was ranked within the top 25% of all the graduate programs in the corresponding fields. And it was going to accomplish this by simply eliminating any program that was not—mission accomplished! Besides resource allocation, prospective graduate students and faculty candidates look to rankings when deciding where to apply, so the rankings for U.S. institutions considered in this Viewpoint are of considerable interest both within the U.S. and internationally. Funders look at rankings when considering ability to perform the proposed research. Alumni look to rankings when making donation decisions. Despite all their acknowledged warts, rankings do matter.

In principle, generating rankings is straightforward mathematically:

Develop a list of the n most important metrics (amount of external research funding, number of quality publications, impact of publications, awards, entering student GRE scores, placement success of graduates, plus any other factor deemed important by the community).
Plot the value of these n metrics for each institution in an n-dimensional space in which the axes are the metrics.
Develop a mapping to turn this n-dimensional space into a one-dimensional ordered set of integers. (That this cannot really be done in a principled or defensible manner is one of the fundamental problems with rankings.)
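
As a minimal sketch of the three steps above, assume the mapping is a simple weighted sum over min-max normalized metrics. That is one common choice among many, not a method prescribed by any ranking body. The function name `rank_programs`, the three metrics, and the program data are all hypothetical.

```python
def rank_programs(metrics, weights):
    """Return program names ordered best-first by weighted score."""
    names = list(metrics)
    n = len(weights)
    # Step 2: treat each program as a point in n-dimensional metric space.
    columns = [[metrics[name][j] for name in names] for j in range(n)]
    normalized = {name: [] for name in names}
    for column in columns:
        # Min-max normalize each axis so no metric dominates purely by its units.
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
        for name, value in zip(names, column):
            normalized[name].append((value - lo) / span)
    # Step 3: collapse the n dimensions to one score via the (assumed) weights.
    scores = {
        name: sum(w * v for w, v in zip(weights, normalized[name]))
        for name in names
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Hypothetical data: (funding in $M, publications per faculty, citations per paper)
programs = {
    "A": (10.0, 4.0, 8.0),
    "B": (25.0, 2.0, 5.0),
    "C": (5.0, 3.0, 12.0),
}
funding_heavy = rank_programs(programs, (0.8, 0.1, 0.1))
citation_heavy = rank_programs(programs, (0.1, 0.1, 0.8))
print(funding_heavy)   # ['B', 'A', 'C']
print(citation_heavy)  # ['C', 'A', 'B']
```

The same data produce opposite orderings under the two weightings, which is why the choice of weights is the crux of the problem.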

Of course the practical difficulties are enormous. Among them:

What metrics should you be including?
How do you get (accurate) values for these metrics?
The mapping requires weighting these metrics, so how do you determine the weights? In effect, how should one weigh publication counts vs. citation counts vs. external grants vs. faculty awards vs. entering student GRE scores vs. any other factors?

So there are ample reasons why rankings based upon a transparent comprehensive analysis are not done frequently. Nevertheless, the U.S. National Research Council (NRC), through its Committee on an Assessment of Research Doctorate Programs, bravely tackled this thorny problem for U.S. institutions. (The NRC is the operating arm of the U.S. National Academies of Science and Engineering and the Institute of Medicine, honorific academies with a mission to improve government decision making and public policy, increase public education and understanding, and promote the acquisition and dissemination of knowledge in matters involving science, engineering, technology, and health. In many respects, the academies and NRC represent the “gold standard” of technical policy advice in the U.S. Because of the prestige of the academies and the NRC, their methodologies and reports have considerable international impact as well.)

The NRC last ranked doctoral programs in the mid-1990s and these rankings are clearly out of date. Further, the earlier rankings depended heavily on “reputation” as determined by respondents and this is often an inexact and lagging indicator. This time around the NRC sought to focus on a purely quantitative approach.

In this Viewpoint we describe how this process has played out for computing. While these comments clearly apply directly only to the NRC rankings effort, they are relevant to other similar efforts.

The NRC Ranking Process

The specifics of the NRC process were the following. The NRC developed a single set of metrics for all 62 disciplines being analyzed, covering disciplines in science, engineering, humanities, social sciences, and others. It then collected the data for these metrics via questionnaires administered to institutions, programs, faculty, and Ph.D. students, plus submitted faculty CVs. The weights were determined via two related approaches: asking a set of participants how much various metrics mattered in their perception of department rankings, and running a linear regression of a set of rankings against these metrics. Because these two approaches yielded substantively different results, the NRC established two sets of rankings (Survey and Regression rankings) and reported these probabilistically. Specifically, they ran a set of samples using weights derived from these acquired distributions, and then reported the range of rankings corresponding to a 95th percentile, meaning that with 95% probability, an institution’s rank would lie within the designated range. In other words, as an example, the NRC states that with 95% probability Georgia Tech ranks somewhere between 14th and 57th using the Survey weights and somewhere between 7th and 28th using the Regression weights.
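
To make the probabilistic reporting concrete, here is a rough sketch of how a rank range can arise from sampled weights. It is not the NRC's actual procedure. It assumes each weight is drawn independently from a Gaussian and clamped at zero, and that metric values are already normalized. The function `rank_ranges`, the distribution parameters, and the three programs are hypothetical.

```python
import random


def rank_ranges(scores, weight_means, weight_sds, trials=1000, coverage=0.95, seed=0):
    """Return {program: (low_rank, high_rank)} covering `coverage` of sampled ranks."""
    rng = random.Random(seed)
    names = list(scores)
    observed = {name: [] for name in names}
    for _ in range(trials):
        # Draw one weight vector (assumed Gaussian); clamp at zero so that
        # no metric ever counts against a program.
        weights = [max(0.0, rng.gauss(m, s)) for m, s in zip(weight_means, weight_sds)]
        totals = {
            name: sum(w * v for w, v in zip(weights, scores[name]))
            for name in names
        }
        ordered = sorted(names, key=lambda name: totals[name], reverse=True)
        for position, name in enumerate(ordered, start=1):
            observed[name].append(position)
    # Keep the central `coverage` share of each program's sampled ranks.
    tail = (1.0 - coverage) / 2.0
    ranges = {}
    for name in names:
        ranks = sorted(observed[name])
        low = ranks[int(tail * (trials - 1))]
        high = ranks[int(round((1.0 - tail) * (trials - 1)))]
        ranges[name] = (low, high)
    return ranges


# Hypothetical, already-normalized metric values for three programs.
example_scores = {"A": (0.9, 0.9), "B": (0.8, 0.2), "C": (0.3, 0.7)}
ranges = rank_ranges(example_scores, weight_means=(0.5, 0.5), weight_sds=(0.3, 0.3))
for name, (low, high) in ranges.items():
    print(f"{name}: rank between {low} and {high}")
```

In this toy data, program A leads on both metrics, so its range collapses to a single rank. B and C trade places depending on which metric the sampled weights favor, so each gets a range instead of a rank. The wider the spread of plausible weights, the wider the reported ranges become.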

The first issue is that this range, arising out of the probabilistic analysis, is difficult to interpret. What does a rank between 14th and 57th mean? How does one reconcile differences between the two ranking systems: the Survey weights, which measure what respondents claim is important, and the Regression weights, which measure these claims against departmental reputations? Of how much value is a range if a 95th percentile span is being used?

Even if the rankings were not as impactful as in prior NRC studies, a rigorous data collection process could have yielded valuable data, which departments could use to assess their standing relative to peers. Unfortunately, there were a number of issues with the quality of the data:

Data collection took place in 2006 but the ultimate release of data and rankings was in 2010 (with corrections well into 2011). For some metrics, small changes over such an interval can have large consequences. For example, given our low numbers of female faculty, the addition of a single woman would have a large percentage impact on the diversity metric, and the departure or arrival of a single highly productive faculty member would similarly have a large impact on the scholarly productivity metric.
The metrics to be used are not discipline specific—exactly the same information was collected for physics, for English, for computing, and for every other discipline. But we know that publication practices in particular vary significantly across disciplines: the humanities rely heavily on book publications; the computing fields rely heavily on conferences. However, the NRC decided that the metric to use for scholarly publication was going to be journals (possibly primarily because they had low-cost access to the ISI database of citations for journal publications). We know that this will not provide accurate results for computer science, as it misses almost all of the conference publications (and corresponding citation data), and the specific choice of this database also means that many journal publications are missed.
The descriptions of data to be provided were often ambiguous, leading different institutions to respond differently. Thus the data being compared was often not measuring the same parameters across departments—this was especially true when gathering lists of faculty to be included in the data gathering, a factor that has impact on many of the data categories since parameters were often measured per faculty member.
Measurements of scholarly quality are not equivalent to measurements of scholarly quantity, that is, the most impactful publication is not necessarily the one with the most citations nor is the most impactful professor necessarily the one with the most publications. There is considerable literature on resolving this issue; for example, by measuring publication quality via the impact factor of the journal.
The NRC did not measure scholarly productivity other than publications and grants. For example, software artifacts and patents were not considered.
The NRC did not get CVs from all faculty so they simply scaled results by the number of faculty in a given department. This approach is easily gamed by having only the most productive faculty provide CVs.
The list of faculty awards was seriously incomplete—computer science was not even listed as a distinct category. The ACM A.M. Turing Award was not considered “Highly Prestigious”; no awards from organizations other than ACM and IEEE were included; and many other gaps in awards were apparent.
The NRC chose to invent data when it could not obtain them. For example, for entering student GRE scores the NRC used the national average when an institution did not collect or provide these scores.
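
The gaming risk in scaling by faculty count, noted in the list above, is easy to see with a little arithmetic. The sketch below assumes the extrapolation simply multiplies the submitted total by the ratio of faculty size to CVs received. The source does not give the exact NRC formula, so the function `scaled_publication_estimate` and the publication counts are hypothetical.

```python
def scaled_publication_estimate(counts_from_submitted_cvs, faculty_size):
    """Extrapolate a department total from only the CVs that were submitted."""
    submitted_total = sum(counts_from_submitted_cvs)
    # Assumed scaling rule: total * (faculty size / number of CVs received).
    return submitted_total * faculty_size / len(counts_from_submitted_cvs)


# Hypothetical department of six faculty with their true publication counts.
true_counts = [12, 9, 3, 2, 1, 1]

# Everyone submits a CV: the estimate equals the true total.
honest = scaled_publication_estimate(true_counts, len(true_counts))

# Only the two most productive faculty submit CVs.
gamed = scaled_publication_estimate(true_counts[:2], len(true_counts))

print(honest)  # 28.0
print(gamed)   # 63.0
```

With only the top two CVs submitted, the extrapolated total more than doubles, although nothing about the department has changed.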

The second issue noted here has gained the most attention from our community. CRA and ACM provided testimony to the NRC in 2002 when the study was just beginning, pointing out the importance of conferences to our field. Unfortunately, this advice was simply ignored by the NRC, a fact we did not discover until February 2010. We immediately notified the NRC, urging it to include conference publications, both for measuring publication productivity and for measuring citation impact. The NRC ultimately agreed to do so after extensive discussion at various levels. CRA worked with its member societies to provide a list of quality conferences; due to the tight deadline we know that this list is not 100% complete or accurate. The NRC took this list and then searched all vitae provided by CS faculty (which we also know to be incomplete) to generate conference publication counts. Since citations for conference publications were not available via the ISI database used by the NRC, citation data was not used at all for computer science as alternatives were not acceptable to the NRC. Based upon the NRC’s analysis, a typical department had one conference publication per faculty member per year. In our view, this is not credible. Further, the NRC claims that more computing publications appear in journals than in conferences, which is very difficult to reconcile with what we see in practice.

Similarly, CRA worked with its member societies to put together lists of the awards that should be included and to correctly categorize them as “Highly Prestigious” or “Prestigious.” This is not a trivial process; for example, does one include the many SIG awards? Again, the deadline to provide the list was tight and we are unable to verify that our list was applied. Thus, it is not clear that the NRC even now has a meaningful method for measuring faculty awards.

Just as troubling, various member departments have been unable to verify the data that the NRC presents. That is, using the same vitae and publication and awards listings, they simply cannot reproduce the numbers that the NRC provides for their departments. The NRC process used temporary workers trained by the NRC staff. Perhaps they were unable to deal with the multiple possible titles of publications (Commun. ACM = CACM = Communications) self-reported by faculty on their CVs; the conference publication numbers certainly do not inspire confidence that they could.
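
The title-matching problem can be illustrated with a small normalization sketch. It assumes a hand-curated alias table, and we do not know how the NRC actually matched venues. The table `VENUE_ALIASES` and the function `canonical_venue` are hypothetical, and a real table would need far more entries and review by the relevant societies.

```python
import re

# Hypothetical alias table keyed by lowercased, punctuation-free strings.
# Note that a bare "Communications" is ambiguous in general; it is mapped
# here only because it appears as an example variant in the text.
VENUE_ALIASES = {
    "commun acm": "Communications of the ACM",
    "cacm": "Communications of the ACM",
    "communications": "Communications of the ACM",
    "communications of the acm": "Communications of the ACM",
}


def canonical_venue(raw_title):
    """Map a self-reported venue string to a canonical name, or None if unknown."""
    # Lowercase, drop punctuation, and collapse runs of whitespace.
    key = re.sub(r"[^a-z0-9 ]", "", raw_title.lower())
    key = " ".join(key.split())
    return VENUE_ALIASES.get(key)


for variant in ["Commun. ACM", "CACM", "Communications", "Unknown Workshop"]:
    print(variant, "->", canonical_venue(variant))
```

Returning `None` for unrecognized strings, instead of guessing, keeps unmatched entries visible for manual review so they are not silently miscounted.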

One might suggest that the central problem is that computer science is unusual in its practices, and that our field is simply an outlier. This does not appear to be the case. The Council of the American Sociological Association recently passed a resolution condemning the NRC rankings and saying that they should not be used for program evaluation. Input from colleagues suggests that other fields, such as aeronautics/astronautics and chemical engineering, are uncomfortable with the NRC process, for many of the reasons we have raised in this Viewpoint.

So we have a situation in which incorrect data are provided for invalid metrics and rankings are calculated using weights that are not readily understood. It would be easy to dismiss the entire process except that institutions are using the results to make programmatic decisions, including closing programs. At a recent symposium, many university administrators expressed considerable support for continuing the data collection effort, and for generating rankings if that can be accomplished in a meaningful way.

Conclusion

So how should the process work? Here are our suggestions:

Work with the relevant societies in order to generate metrics that matter to their constituents.
Realize that reputation does matter and include it in the metrics. There is an interesting feedback loop between rankings and reputation, of course. But this also means reputation has some validity as a measure of rank, so incorporate it.
Explore making the rankings subdiscipline-dependent. It is clear that different departments have different strengths. Thus, enabling a finer-grained assessment would allow a department with strength in a sub-field, but perhaps not the same across-the-board strength, to gain appropriate visibility. This may be particularly valuable for students deciding where to apply.
Use data mining to generate scholarly productivity data to replace commercially collected citation data that is incomplete and expensive.
Have institutions collect the remaining data under clear guidelines.
Provide a time period during which departments can correct errors in the data collected. The NRC did allow institutions to correct some errors of fact, but the allowable corrections did not include publication counts and other information. And the NRC apparently refused to remove data it invented, such as substituting national GRE average scores for institutions that do not record such information.
Provide sample weights but allow individuals to develop their own weights and apply them to the collected data so that they can generate rankings of interest to them. We realize this does not satisfy the desire for single overarching rankings. However, it does provide a tool of potential value for individual departments seeking to compare themselves against peers.

We do not claim that this strategy will eliminate all of the many issues with rankings, but it will provide a consistent set of fundamental data that administrators, faculty, students and others can use to understand departmental strengths and weaknesses in a way that matters to them.