Calculating Error Rates For Filtering Software

Establishing a blueprint for conducting and reporting tests of filter effectiveness.

Surveys in the U.S. have found that 95% of schools [4], 43% of public libraries [5], and 33% of teenagers’ parents [8] employ filtering software to block access to pornography and other inappropriate content. Many products are also now available to filter out spam email.

Filtering software, however, cannot perfectly discriminate between allowed and forbidden content, resulting in two types of errors. First, under-blocking occurs when content that should be restricted is not blocked. Second, over-blocking occurs when content that should not be restricted is blocked. Steps can be taken to reduce the frequency of errors and to reduce their costs (for example, by providing easy appeals processes, quick overrides, and corrections), but some errors are inevitable.

The frequency of errors is an empirical question of great importance. For example, in 2000, the U.S. Congress passed the Children’s Internet Protection Act (CIPA), mandating that schools and libraries install content-filtering software in order to be eligible for some forms of federal funding. A district court struck down the requirement for libraries on the grounds that it violates the First Amendment. Much of that court’s findings of fact was devoted to analyses of error rates [1], and some of the arguments made on appeal to the U.S. Supreme Court also hinged on such analyses.

Most empirical studies of error rates have suffered from methodological flaws in sample selection, classification procedures, or implementation of blocking tests. Results have also been interpreted inappropriately, in part because there are two independent measures of over-blocking that are sometimes confused, and likewise for under-blocking. This article presents a framework to guide the design and interpretation of evaluation studies. While the framework applies with only minor modifications to the evaluation of spam filters, the examples and discussion here focus on pornography filters.


A Framework for Testing Filtering Software

The process of testing filter effectiveness is graphically outlined in Figure 1. A test set of items is generated (step 1). These items are tested to see whether they are actually blocked by the filters (step 2a) and classified to see whether they should be blocked (step 2b). For each item, then, there are four possible outcomes: it may be correctly blocked, incorrectly blocked (which we refer to as an over-block), correctly not blocked, or incorrectly not blocked (which we refer to as an under-block). Finally, in step 3, the rates of over- and under-blocking are calculated.
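For concreteness, the mapping from a classification and a blocking result to one of the four outcomes can be written in a few lines of code. The sketch below (in Python) is purely illustrative; the names `should_block` and `was_blocked` are our own and do not come from any particular product or study.

```python
def outcome(should_block: bool, was_blocked: bool) -> str:
    """Map one test item to one of the four possible outcomes."""
    if should_block and was_blocked:
        return "correct block"
    if should_block and not was_blocked:
        return "under-block"        # bad content that slipped through
    if not should_block and was_blocked:
        return "over-block"         # OK content that was restricted
    return "correct non-block"
```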

Step 1: Create a Test Set. The first major step in the process is to create a test set of Web sites or other Internet content on which the performance of the filters will be judged. One approach is to collect a set of accessed items, as a way of evaluating filters’ impact on users. For example, for the CIPA case, Finnell selected sites from the proxy server access logs of three public libraries [1]. Simulations can also be conducted, to approximate what users might access. For example, for a study of filtering error rates on health information, we entered search strings on 24 health topics into six search engines and collected the first 40 results from each search [7].
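A simulated-search collection process of this kind is straightforward to automate. The sketch below is a rough illustration, not the procedure used in [7]; in particular, `search()` is a hypothetical helper, since each search engine exposes its results differently, and the topic and engine lists are stand-ins for the 24 topics and six engines.

```python
# Stand-ins only; the real study used 24 health topics and 6 search engines.
HEALTH_TOPICS = ["condom", "depression", "breast cancer"]
ENGINES = ["engine_a", "engine_b"]

def build_test_set(search, topics=HEALTH_TOPICS, engines=ENGINES, per_search=40):
    """Collect the first `per_search` result URLs for every topic/engine pair.

    `search(engine, query, max_results)` is a hypothetical helper whose
    implementation (official API, scraping, etc.) depends on the engine.
    """
    test_set = set()
    for engine in engines:
        for topic in topics:
            for url in search(engine, topic, max_results=per_search):
                test_set.add(url)
    return sorted(test_set)
```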

A second approach is to collect a set of accessible items, as a way of evaluating the impact of filters on publishers. There is no good way to sample from all the available Web pages (even search engines index only a fraction of pages they encounter). Instead, some well-defined subset of Internet content must be chosen, such as the health listings from certain portal sites or all the Web pages served by particular Web servers.

Test sets are only representative of the larger collection from which they were drawn. For different purposes it is appropriate to estimate error rates for different subsets. For example, even within the overall domain of health sites, our study found quite different error rates from searches on the terms “condom” and “gay” than for searches on “depression” and “breast cancer” [7].

The collection process should satisfy three properties. First, it should be objective and repeatable. Many studies have relied on tester judgment to select interesting or relevant items [2, 3, 10], possibly introducing bias. Second, the collection process should be independent of the filters to be tested. The sample used by Finnell reflected patrons’ access patterns when filters were installed, not what their access patterns would have been without filters. Third, large test sets should be assembled. Some studies have relied on small test sets. Others with large test sets covered so many categories of content that there was not enough statistical power to evaluate the effectiveness of the filters for particular categories [2, 3, 10].

Step 2a: Blocking Test. Each selected URL is tested against the various filters to see whether access to the site is blocked. This is best performed through automated processes that can quickly test a large number of URLs against the filters. Automated tests must take into account the possibility that a site may redirect browsers to another site through HTTP headers, HTML, or JavaScript code. A Web browser would attempt to access the original URL and then the destination URL. Thus, in an automated test, a filter should be tested against both URLs, and the site should be considered blocked if either one is blocked.
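A minimal sketch of such an automated blocking test appears below. It follows only HTTP-level redirects (via the widely used `requests` library); detecting HTML meta-refresh or JavaScript redirects would require parsing page content, which is omitted. The `is_blocked()` predicate is a placeholder, since how a URL is checked against a filter depends on the product being tested (for example, fetching through a local filtering proxy).

```python
import requests

def test_url_against_filter(url: str, is_blocked) -> bool:
    """Return True if the filter blocks the original URL or any URL it redirects to.

    `is_blocked(url)` is a placeholder for the product-specific check.
    Only HTTP-level redirects are detected here; HTML meta-refresh and
    JavaScript redirects would require parsing the page content.
    """
    urls_to_check = {url}
    try:
        response = requests.get(url, allow_redirects=True, timeout=10)
        urls_to_check.add(response.url)                          # final destination
        urls_to_check.update(r.url for r in response.history)    # intermediate hops
    except requests.RequestException:
        pass  # an unreachable site can still be checked against its original URL
    return any(is_blocked(u) for u in urls_to_check)
```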


Vendors regularly update the contents of their blocking lists and rules. In order to maintain comparability between vendors, therefore, all products being compared should be updated just before the tests are run. In addition, all tests should be run simultaneously or nearly so, to allow for a fair comparison. If the test set reflects the results of simulated searches, the blocking tests should be conducted as soon as possible after the searches are run, so that the results reflect what would have been accessible to a user from the search.

Product configuration choices can have a large impact on rates of over-blocking and under-blocking. For example, nearly all products offer a variety of settings or categories that can be chosen. These categories range from pornography to gambling to hobbies and rarely match up perfectly across products, making comparisons across products difficult. An informal survey of 20 school systems and libraries confirmed wide variability in their configurations and that none were using a vendor’s default setting [7]. Thus, tests should be run against a range of configurations.

Step 2b: Classification of Sites. Each URL in the test set is classified to determine whether it should have been blocked or not. The definition of what should be blocked will depend on the purpose of the test. For example, in order to test the over- and under-blocking of pornographic material it would be necessary to classify each site as containing or not containing pornographic material. In order to test whether filtering software implements the CIPA standard, or the legal definition of obscenity, sites would have to be classified according to those criteria. And if the goal were simply to test whether filtering software correctly implements the vendor’s advertised classification criteria, the sites would be independently classified according to those criteria.

Ideally, the classification process should satisfy three properties [6]. First, it should have face validity, meaning there is an obvious connection to the underlying definition of what should be blocked. Second, the procedure should be reliable, meaning that the process is sufficiently documented to be repeatable and that multiple ratings of items would be in substantial agreement. Third, there should be construct and criterion validity, meaning the classifications should be in substantial agreement with those produced by other processes that have reliability and face validity.
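One common way to check the agreement part of reliability is to have two raters independently classify an overlapping subset of items and compute a chance-corrected agreement statistic such as Cohen’s kappa. The sketch below is a generic illustration of that statistic; it is not a procedure prescribed by [6] or used in any particular filtering study.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items.

    `ratings_a` and `ratings_b` are equal-length sequences of category labels
    (e.g., "bad" / "OK") assigned to the same items by two independent raters.
    """
    assert len(ratings_a) == len(ratings_b) and len(ratings_a) > 0
    n = len(ratings_a)
    # Observed agreement: fraction of items the raters label identically.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both raters always use the same single label
    return (p_observed - p_expected) / (1 - p_expected)

# Example usage on a small overlap sample (kappa ~= 0.62).
print(cohens_kappa(["bad", "OK", "OK", "bad", "OK"],
                   ["bad", "OK", "bad", "bad", "OK"]))
```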

Because site content can change over time, sites should ideally be classified according to their state at the time the blocking tests were run. Caching the contents of sites when the blocking tests are run makes it acceptable to delay the actual classification. The cache can also be made public, so that others can scrutinize the classification decisions made by the raters in the study or classify the sites independently according to different criteria.
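A minimal sketch of such a cache, written at blocking-test time and keyed by a hash of the URL, is shown below; the file layout and field names are our own choices, not those of any published study.

```python
import hashlib
import json
from pathlib import Path

import requests

CACHE_DIR = Path("site_cache")  # directory name is an arbitrary choice

def cache_site(url: str) -> Path:
    """Fetch `url` and store its content alongside the URL for later classification."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    response = requests.get(url, timeout=10)
    path.write_text(json.dumps({
        "url": url,
        "fetched_status": response.status_code,
        "content": response.text,
    }))
    return path
```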

Step 3: Over- and Under-Blocking Reporting. For any product configuration and set of URLs tested, there are four results from the testing and classification, as shown in the top part of Figure 2: (a) the number of correct blocks, (b) the number of under-blocks, (c) the number of over-blocks, and (d) the number of correct non-blocks. For brevity, we will refer to sites as “bad” if they should be blocked and as “OK” if they should not be blocked according to the classification that was done: no value judgment is intended.

Two-by-two outcome tables arise when evaluating all sorts of binary decisions, from radar operators detecting the presence or absence of enemies to medical diagnostic tests to information-retrieval techniques that select documents from a large corpus. The most useful summaries of filtering test outcomes describe under-blocking and over-blocking error rates (percentages). There are two natural ways to calculate each error rate, each providing different information. Figure 2 summarizes how to calculate the error rates and their relation to measures usually reported in information science and medical research.

Consider, first, the amount of over-blocking. One measure, which we call the OK-sites over-block rate, is the fraction of acceptable sites that are blocked. This measure is related to what medical researchers would call the specificity of a diagnostic test. It is useful in answering the question of how frequently a user who is trying to access OK (non-pornographic) sites will be blocked. This is the number that a school or library or parent should consider when deciding whether a filter is overly broad in restricting access to information that should be available.

This error rate could also be relevant to a U.S. court performing an “intermediate scrutiny” or “reasonableness” analysis. To be reasonable, restrictions must not interfere substantially with the legitimate uses of a forum. One interpretation is that over-blocks must be few in relation to correct non-blocks of OK sites: in other words, the OK-sites over-block rate must be low.

A second measure of over-blocking, which we call the blocked-sites over-block rate, is the fraction of all blocked sites that are OK (not pornographic). This measure is related to what information scientists would call precision and medical researchers would call positive predictive value. It might be useful to a school or library or parent when deciding whether to monitor for blocking as evidence of violation of acceptable use policies. For example, if a high proportion of blocked sites are in fact OK, then the mere fact that a user tries to access a blocked site would not be a reason to suspect that user of trying to access pornography.

This error rate could also be relevant to a U.S. court performing a “strict scrutiny” analysis. To satisfy strict scrutiny, restrictions must be “narrowly tailored” to serve a compelling government interest. One interpretation is that over-blocks must be few in relation to correct blocks of bad sites: in other words, the blocked-sites over-block rate must be low.

Note that the two measures of over-blocking are independent, as illustrated in Tables 1a and 1b, which give results from hypothetical tests of two filters on the same set of sites. In both tables, the fictitious filters have a blocked-sites over-block rate of 50%: they are equally imprecise. They differ in the OK-sites over-block rate, however. In Table 1a, 99% of the OK sites are blocked, but in Table 1b only 1% are blocked.
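To make the independence concrete, the following illustrative counts (chosen to match the percentages described above, not copied from Table 1) produce identical blocked-sites over-block rates but very different OK-sites over-block rates:

```python
# Illustrative counts only.  a = correct blocks, c = over-blocks,
# d = correct non-blocks (see Figure 2).

# A filter like the one in Table 1a: it blocks nearly everything.
a, c, d = 9_900, 9_900, 100
print(c / (a + c))   # blocked-sites over-block rate = 0.50
print(c / (c + d))   # OK-sites over-block rate      = 0.99

# A filter like the one in Table 1b: it blocks very little.
a, c, d = 100, 100, 9_900
print(c / (a + c))   # blocked-sites over-block rate = 0.50 (same as above)
print(c / (c + d))   # OK-sites over-block rate      = 0.01 (very different)
```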

Any estimate of the blocked-sites over-block rate is sensitive to the prevalence of OK sites in the test set. Table 1d differs from Table 1c only in having a higher concentration of OK sites. The error rates of the filter on bad and OK sites are both 1% in both tables. The blocked-sites over-block rate, however, goes from 1% to 50%.

Consider, for example, Edelman’s selection of 6,777 blocked sites as presented in the CIPA case [1]. Janes’ classification process, as also reported in the court’s decision, estimated that about two-thirds of those were over-blocks. But since the sampling process drew from a set deliberately designed to have a very high concentration of OK items, it should be expected that a large percentage of the blocked items would also be OK. An even more fundamental problem occurred in studies presented by Hunter [1] and Lemmons [1, 10] that employed separate samples of OK and bad sites. Any estimate of the blocked-sites over-block rate from such tests is arbitrary: selecting a larger or smaller sample of OK sites, while holding everything else constant, would yield different estimates of the blocked-sites over-block rate.

If a study selects only blocked items for a test set, it cannot calculate the OK-sites over-block rate. To do that, one would need additional information about the proportion of blocked to unblocked sites and the proportion of unblocked sites that were OK. For example, Edelman tested more than 500,000 URLs in order to select the 6,777 blocked items. If, as seems likely, the vast majority of the 500,000+ unblocked sites were acceptable, then the OK-sites over-block rate may have been under 1%. However, one cannot be sure since the study was designed only to identify blocking errors, not their frequency among all OK sites.
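The back-of-the-envelope arithmetic behind that “under 1%” figure can be made explicit. The calculation below takes the two-thirds over-block estimate quoted above at face value and assumes, as the text cautions is plausible but unverified, that essentially all of the unblocked sites were OK:

```python
tested = 500_000                    # approximate number of URLs Edelman tested
blocked = 6_777                     # blocked sites selected for the CIPA case
over_blocks = blocked * 2 / 3       # roughly two-thirds judged to be over-blocks
unblocked = tested - blocked

# If essentially all unblocked sites were OK (an unverified assumption),
# the OK sites are the over-blocks plus the unblocked sites.
ok_sites = over_blocks + unblocked
ok_sites_over_block_rate = over_blocks / ok_sites
print(f"{ok_sites_over_block_rate:.2%}")   # about 0.91%, i.e., under 1%
```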

Now consider the rate of under-blocking. One measure, which we call the bad-sites under-block rate, is the percentage of all unacceptable sites that were not blocked. This measure is related to recall in information science and sensitivity in medical research. It is the number that a school or library or parent or judge should consider when deciding whether blocking software is effective at preventing children from accessing pornography or other undesirable materials.

Another measure, the unblocked-sites under-block rate, is the percentage of all unblocked sites that should have been blocked. This measure could be useful in determining whether an honor code is needed in addition to any installation of filters. For example, if this error rate is high, then the fact that a site was not blocked does not necessarily mean that it is non-pornographic, and it might be necessary to inform students that they are still responsible for not visiting pornographic sites even if the filters do not block their access. Again, the two measures of the under-blocking rate are independent: one may be high without the other being high. In Tables 1a and 1b, the unblocked-sites under-block rates are both 50%, but the bad-sites under-block rates are 1% and 99%, respectively.
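Putting the four measures together, step 3 reduces to simple arithmetic on the four counts from Figure 2. The sketch below uses the cell labels a–d (correct blocks, under-blocks, over-blocks, correct non-blocks) defined above and assumes all denominators are nonzero:

```python
def error_rates(correct_blocks, under_blocks, over_blocks, correct_non_blocks):
    """Compute the four error rates from the 2x2 outcome counts (a, b, c, d)."""
    a, b, c, d = correct_blocks, under_blocks, over_blocks, correct_non_blocks
    return {
        # Fraction of OK sites that are blocked (the complement of specificity).
        "OK-sites over-block rate": c / (c + d),
        # Fraction of blocked sites that are OK (the complement of precision,
        # or positive predictive value).
        "blocked-sites over-block rate": c / (a + c),
        # Fraction of bad sites that are not blocked (the complement of recall,
        # or sensitivity).
        "bad-sites under-block rate": b / (a + b),
        # Fraction of unblocked sites that are bad.
        "unblocked-sites under-block rate": b / (b + d),
    }
```

For example, `error_rates(9_900, 100, 9_900, 100)` reproduces the Table 1a-style illustrative figures used earlier: a 99% OK-sites over-block rate, a 50% blocked-sites over-block rate, a 1% bad-sites under-block rate, and a 50% unblocked-sites under-block rate.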


Conclusion

There have been numerous studies that report the over- and under-blocking rates of filtering software products. The methodology of such studies has improved substantially in recent years, but significant concerns still remain. Table 2 summarizes desirable methods.

There is no easy answer to the question of how to best protect children from inappropriate material on the Internet [9], or even whether any protection is needed. Certainly, filtering software is not a silver bullet—there are other approaches available, including student education, privacy screens, honor codes, and adult monitoring. However, the amount of attention and public concern about whether filters are helpful or harmful suggests an ongoing need for careful empirical investigation. Objective and methodologically sound research must inform the debate.

Values, however, will be the ultimate determining factor. How much over-blocking or under-blocking is too much? When we reported our findings of error rates in blocking health information [7], few questioned our methods or findings, but both supporters and opponents of filtering claimed the results supported their positions. People simply differ in their assessments of the benefits of blocking bad sites and the costs of blocking OK sites. Methodologically sound research is needed to redirect attention away from meaningless debates comparing misleading study results toward meaningful debates about values.


Figures

Figure 1. Summary of the process for testing filter effectiveness.

Figure 2. Calculating error rates.


Tables

Table 1a–d. Measuring blocked-sites rates.

Table 2. Methodology checklist.

References

    1. American Library Association, Inc. v. United States, E.D. Pa., 2002.

    2. Brunessaux, S., Isidoro, O., Kahl, S., Ferlias, G., and Soares, A.L.R. Report on Currently Available COTS Filtering Tools. MATRA Systemes and Information, 2001; www.net-protect.org/en/results3.htm.

    3. Greenfield, P., Rickwood, P., and Tran, H.C. Effectiveness of Internet Filtering Software Products. CSIRO Mathematical and Information Sciences, 2001; www.aba.gov.au/internet/research/filtering/.

    4. Kleiner, A. and Lewis, L. Internet Access in U.S. Public Schools and Classrooms: 1994–2002. U.S. Department of Education, National Center for Education Statistics, NCES 2004-011; nces.ed.gov/pubs2004/2004011.pdf.

    5. Oder, N. The new wariness. Library Journal 127, 1 (2002), 55–57.

    6. Pedhazur, E.J. and Schmelkin, L.P. Measurement, Design, and Analysis: An Integrated Approach. Lawrence Erlbaum and Associates, Hillsdale, NJ, 1991, 819.

    7. Richardson, C., Resnick, P., Hansen, D., and Rideout, V. Does pornography-blocking software block access to health information on the Internet? Journal of the American Medical Association 288, 22 (2002).

    8. Rideout, V. Generation Rx.com: How Young People Use the Internet for Health Information. Henry J. Kaiser Family Foundation: Menlo Park, CA, 2001.

    9. Thornburgh, R. and Lin, H., Eds. Youth, Pornography, and the Internet. National Academy Press, Washington, DC, 2002.

    10. U.S. Department of Justice. Web Content Filtering Software Comparison. eTesting Labs, Morrisville, NC, 2001; www.veritest.com/clients/reports/usdoj/usdoj.pdf.

    Research for material appearing in this article was supported under a contract with the Kaiser Family Foundation.
