Text and Data Mining of In-Copyright Works – Communications of the ACM

Text and data mining (TDM) uses statistical analysis tools to extract new knowledge from large quantities of text or data by finding patterns, discovering relationships, and analyzing semantics. It is used in a wide variety of fields, from biomedical research to digital humanities. Copyright poses no obstacle to TDM research as long as the corpus of text and data being analyzed consists solely of public domain works.^a Copyright may, however, be a barrier to TDM research on the vast array of in-copyright works created in the past century.

This is because copyright regulates the making of copies of protected works, and TDM requires researchers to make several types of copies at different stages of the process: scanning analog works, formatting the texts and data, preparing them for processing, extracting useful information from the vast quantities being searched, and storing the data after mining is completed.

Governments that aspire for their industries to become global leaders in artificial intelligence (AI) fields are beginning to realize their knowledge economies are more likely to thrive if they allow researchers to make copies of in-copyright works for TDM purposes. U.S. appellate courts have enabled this by ruling that TDM copying of in-copyright works is not infringement. Japan has enacted laws to allow TDM research copying. The E.U.'s 2019 Directive on Copyright and Related Rights in the Digital Single Market (CDSM) requires member states to adopt copyright exceptions for TDM research purposes.

U.S. Fair Use TDM Decisions

Two U.S. appellate court decisions—Authors Guild v. Google and Authors Guild v. HathiTrust—have ruled that copying of in-copyright texts for TDM research purposes was fair use, not infringement. These lawsuits grew out of the Google Book Search Project (GBS).

GBS is a corpus of millions of digital books that Google developed to improve its search technologies after making a deal with the University of Michigan in 2004 to scan all eight million books in its library's collections. In return, Michigan received from Google digital copies of the scanned books. Google struck similar deals with several other state-related universities. The HathiTrust digital library was formed to host the digital copies Google provided to its state-related library partners.

By 2005, Google had digitally scanned millions of books from research library collections, the overwhelming majority of which were in-copyright. Later that year, the Authors Guild and three of its members brought a class action lawsuit charging Google with copyright infringement for making these digital copies.

From the Guild's perspective, Google's systematic copying of the entire contents of millions of in-copyright books of all types for commercial purposes was completely unjustifiable. The main norm underlying copyright ownership is that people who want to make copies of authorial works must ask for and get permission to do so, which Google did not do.

Google defended itself by arguing that its copying of the books was fair use because its purpose in scanning them was socially beneficial. Copying the entire contents was necessary to index the books, serve up snippets in response to user search queries, and enable Google to engage in non-consumptive research (for example, creating the Ngram Viewer, which lets users see trends in word and phrase usage over time, and improving its translation tools).
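To make "non-consumptive" concrete, here is a minimal Python sketch of the kind of computation behind a word-trend chart. It is an illustration under stated assumptions, not Google's implementation: the function name `ngram_trend`, the whitespace tokenization, and the two-entry toy corpus are all invented for this example. The point it shows is that the full text must be read to do the counting, yet only aggregate frequencies come out.

```python
from collections import Counter


def ngram_trend(corpus, phrase):
    """Return {year: relative frequency of `phrase`} for a toy corpus.

    `corpus` is an iterable of (year, text) pairs. Only aggregate
    counts leave this function; none of the source text is reproduced.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()    # year -> occurrences of the phrase
    totals = Counter()  # year -> total number of n-grams seen
    for year, text in corpus:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            totals[year] += 1
            if tuple(tokens[i:i + n]) == target:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical corpus, invented purely for illustration.
toy_corpus = [
    (1900, "the steam engine changed the steam age"),
    (2000, "the search engine changed the information age"),
]
trend = ngram_trend(toy_corpus, "steam")
```

Here `trend` maps each year to a single number, which is all a user of such a tool would see.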

Google also asserted the snippets it served up were fair use because they were too few in number and too short in length to have harmful impacts on markets for the books. People do not use GBS to consume book contents. GBS searchers are generally looking for facts books may contain (for example, 'How many buffalos are there in Yellowstone National Park?') and copyright does not protect facts. Indeed, because Google provided links to sites at which users could purchase books responsive to user search queries, it was more likely GBS would benefit the market for books, not harm it.

An appellate court found Google's arguments more persuasive than the Authors Guild's claims. It observed GBS had enabled new kinds of research to be undertaken, specifically mentioning TDM as an example. Research and scholarship are two of the statutorily favored fair uses, so this too supported Google's defense.

The HathiTrust decision more directly addressed TDM research issues. HathiTrust allows researchers from consortium member institutions to conduct searches across its corpus of millions of books (now totaling approximately 17 million volumes) to identify every book mentioning the person, place, or phenomenon for which they are looking.

HathiTrust provides researchers from partner institutions with bibliographic information about the specific books in which a search term appears, and even the page numbers on which it can be found. The court considered this beneficial research purpose to strongly favor HathiTrust's fair use defense.
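A search service of this kind can be pictured as an inverted index: a table from each word to the places it occurs. The Python sketch below is a simplified illustration, not a description of HathiTrust's actual system; the names `build_index` and `search`, the punctuation handling, and the two toy "books" are assumptions made for the example. Building the index requires reading every page, but a query returns only book identifiers and page numbers, never the text itself.

```python
from collections import defaultdict


def build_index(books):
    """Map each lowercased word to sorted (book_id, page_number) pairs.

    `books` maps a book id to a list of page texts (page 1 first).
    """
    index = defaultdict(set)
    for book_id, pages in books.items():
        for page_number, page_text in enumerate(pages, start=1):
            for raw_word in page_text.lower().split():
                word = raw_word.strip(".,;:!?\"'()")
                if word:
                    index[word].add((book_id, page_number))
    return {word: sorted(locations) for word, locations in index.items()}


def search(index, term):
    """Return the locations of `term`, or an empty list if it never occurs."""
    return index.get(term.lower(), [])


# Hypothetical books, invented purely for illustration.
toy_books = {
    "book-A": ["Bison roam the park.", "The park is in Wyoming."],
    "book-B": ["A history of Wyoming."],
}
index = build_index(toy_books)
results = search(index, "Wyoming")
```

In this toy run, `results` lists page 2 of `book-A` and page 1 of `book-B`, which is the bibliographic pointer a researcher would then follow up through lawful access to the books.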

Japan's Special TDM Exception

Recognizing how important TDM is to achieving success in AI fields, the Japanese legislature adopted a special exception to copyright rules to enable TDM research in 2009. Japan was the first nation in the world to enact such a law. Yet AI researchers complained this exception did not fully address the needs of TDM and AI researchers, so in 2018 Japan amended its copyright law to respond to those concerns.

Under the amended law, users are allowed to analyze in-copyright works for machine learning purposes. As long as TDM researchers do not exploit the protected expression in the works, but only process the data to extract knowledge, they do no harm to the legitimate interests of copyright owners, whose rights extend only to controlling exploitation of the expressive aspects of their works. It is thus fair game to feed in-copyright works as raw data into computers and process them for deep learning purposes.

The amended law also permits researchers to make incidental digital copies of works for TDM purposes. This recognizes that incidental copies are necessary to carry out machine learning activities. This too causes no harm to copyright owners' legitimate interests.

An additional provision of the amended law allows TDM researchers to use digital copies of in-copyright works for data verification purposes. The legislature recognized this kind of use is important to enable researchers to ensure their results and insights from TDM research are sound. This activity too is not detrimental to the legitimate interests of copyright owners.

TDM Exceptions in the CDSM

An early draft of the European Commission's proposed CDSM directive would have required member states of the E.U. to adopt a new copyright exception to allow researchers at nonprofit scientific organizations to engage in TDM research as long as they had lawful access to the databases on which they conduct their work. This new exception was to be mandatory as well as non-waivable by contract.

The final Directive, whose TDM exceptions E.U. member states were supposed to have implemented in national laws by June 2021 (although not all have done so), extends the TDM research exception to nonprofit cultural heritage researchers as well as to scientific researchers.

In response to concerns that limiting the TDM exception to nonprofit researchers would undermine the E.U.'s aspirations for its industries to build AI systems that could compete in the global marketplace, the Commission was persuaded to add a second mandatory TDM exception for other researchers, including those engaged in commercial TDM research. However, this exception can be overridden by contract by owners of the databases on which these researchers want to engage in TDM analysis.

Some scholars have expressed concerns that the CDSM's TDM exceptions, while steps in the right direction, will prove too narrow and uncertain in scope to fully address the needs of TDM researchers. Japan's more capacious TDM-enabling rules would be more responsive to researchers' needs.

TDM on Sci-Hub's Corpus?

Sci-Hub is a well-known repository of vast quantities of the world's scientific journal literature, much of which is usually kept behind proprietary paywalls. Publishers such as Elsevier have sued Sci-Hub and its founder for copyright infringement. Courts have held that this database contains much infringing material and have forced its founder to shut it down. However, Sci-Hub's corpus has reemerged as a resource for scientists and can still be easily found on the Internet.

Many researchers would like to use it for TDM purposes, but is this legal?

The desire to use Sci-Hub for TDM research arises in part because the institutional database subscriptions that numerous proprietary publishers of scientific journals offer to universities and other research institutions are not cross-platform interoperable. Researchers consequently cannot run searches across the various proprietary databases. Cross-publisher collaborations are rare.

Moreover, the license terms on which proprietary databases are available may impair researchers' ability to make the full use of TDM tools. Publishers and some collecting societies are promoting licensing of TDM as a value-added service for which research institutions should pay. Some licenses are more restrictive than TDM researchers would want.

Even scientific researchers who work at institutions that subscribe to proprietary databases want to use Sci-Hub to do TDM research. That database is easier to use than some of the publisher repositories. The Sci-Hub database is far more comprehensive than any of the proprietary databases. And there are no license restrictions to limit researcher freedom to investigate with TDM tools to their hearts' content.

Downloading Sci-Hub would be a risky strategy for TDM researchers who do not want to be sued for copyright infringement. But running TDM searches on the Sci-Hub collection hosted elsewhere involves only the kind of transient copying that the U.S. courts have found too evanescent to be an infringing "copy" of copyrighted television programming. The results extracted in the course of doing TDM research on Sci-Hub would be unprotectable facts.¹

Consequently, it is conceivable TDM researchers would not infringe U.S. copyright law if they used Sci-Hub for TDM research purposes. However, the E.U. exceptions allowing TDM research are predicated on researchers having lawful access to the text and data they mine.

Conclusion

Only a few countries in the world have flexible fair use or fair use-like exceptions to copyright rules that researchers could invoke to justify TDM research copying. Hence, legislation will be necessary to allow TDM researchers to take full advantage of this new suite of tools and expand the horizons of what can be known from digital explorations of large corpora of data and text.

Footnotes

a. In the U.S., any work published before 1926 is reliably in the public domain. In other countries, copyright terms that last for the life of the author plus 50 or 70 years make it more difficult to determine whether works are in the public domain.

Text and Data Mining of In-Copyright Works: Is It Legal?

November 2021 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
