Plenty of Proteins

UCSD virtual reality environment — An interactive visualization of proteins from the Protein Data Bank in a virtual reality environment in the California Institute for Telecommunications and Information Technology at the University of California, San Diego.

In biology, structure is important. The proteins that build our bones and skin, digest our food, replicate our DNA, and perform all manner of other tasks are built from amino acids, and fold themselves into shapes that dictate their behavior. For decades, structural biologists have been measuring and collecting protein structures and depositing them and their chemical formulas in a computer archive known as the Protein Data Bank.

What began in 1971 as a group of seven protein structures has ballooned to a collection of more than 107,000 biological macromolecule structures, with about 10,000 added annually. Many are much larger and contain much more information than the original seven, which creates new challenges for the computer scientists in charge of maintaining the archive. “Because the structures are so complex and the information is rich, the bigger structures now are so big that we had to move to a new format for archiving them,” says Sameer Velankar, team leader of the content and integration group of the Protein Data Bank in Europe.

The PDB, as it is known, consists of four data centers: at the European Molecular Biology Laboratory in Hinxton, UK; Osaka University in Japan; Rutgers University in New Jersey; and the supercomputing center at the University of California, San Diego. Those may be joined, in a few years’ time, by partners in China, India, and Brazil. The PDB provides data to all users free of charge, and has become such a central resource that many scientific journals require protein models to be deposited in the PDB as a condition of publication. Pharmaceutical companies often turn to the archive when they are searching for molecular targets for potential new drugs.

Part of the challenge grows out of innovations in the technology used for studying protein structures. The standard technique for determining protein structure, dating back to the 1930s and still accounting for more than 90% of the structures in the data bank, is X-ray crystallography. Scientists crystallize a protein in which they are interested, and then scan it with an X-ray beam; the planes and edges of the crystalline faces scatter the X-rays, producing diffraction patterns that provide a precise description of the structure of the protein.

Not every protein, however, lends itself to easy crystallization. Biologists are also studying larger, more dynamic structures, “molecular machines” such as the ribosome, the factory within each cell that reads the RNA and uses that information to build new proteins. Ribosomes and other machines contain a variety of individual proteins and change their behavior, and thus their structure, over time. “It’s hard to grow crystals from mixtures of different states,” explains Stephen Burley, a professor of chemistry and chemical biology at Rutgers and director of the PDB data center there. “We’re tackling machines that are bigger and bigger, more and more complicated, with more and more moving parts.”

Biologists have turned to other imaging techniques. Nuclear magnetic resonance (NMR), for instance, is nondestructive, and allows scientists to watch molecular machines in action. Cryo-electron microscopy requires freezing samples at liquid nitrogen temperatures (–346°F to –320.44°F), but keeps them in a more natural condition than crystallization does. The PDB contains only experimentally derived structures. For computer-generated structures, biologists can turn to the National Institutes of Health’s Protein Model Portal. Nor does the PDB address folding dynamics, about which not much is known, Burley says.

With this variety of new techniques, the archive can provide a richer variety of information, but not always the same information from one model to the next. “All give little bits and pieces of the puzzle, but not with the level of detail that the traditional methods do,” says Gerard Kleywegt, head of the European PDB. “For these new models with these techniques that have relatively low information content, maybe all you can track is the rough shape of the whole molecule.”

That puts the burden on the archive to make it clear just how the models were developed, what assumptions went into making them, and what research questions they can or cannot answer, so they can correlate data derived through different methods. “We have to make it clear to our users what the structures are good for,” Kleywegt says. To achieve that, the PDB has developed standards for validating the models, measuring how well they fit with experimental data so that fit can be expressed in numbers. A team of biocurators examines the data as it comes in, validating and annotating it to make it usable by biologists. They represent on a color bar how well a given model scores on a number of criteria—redder for a low score, bluer for a better one; that way, a user knows how much confidence to place in a model, even without a deep understanding of the criteria that went into the scoring.

One approach to combining data derived from different techniques is software called the Integrative Modeling Platform (IMP), developed by Andrej Sali, a professor of bioinformatics and computational structural biology at the University of California, San Francisco. He and his colleagues created open access software designed to be flexible and general enough to build models using data from the various imaging methods, and to allow scientists to share and add to each other’s models. Traditional modeling programs, Sali says, have lacked the necessary flexibility.

IMP treats a model as a collection of particles, with each particle representing some piece of the system. If the researcher has data on the arrangement of the atoms in a crystal, each particle could represent an atom, producing a high-resolution model. With lower-resolution data—say, measurements of distance between different components of a protein—the particles could represent those components. Different parts of the model might have different resolutions depending on the source of the data. Each particle has attributes associated with it, such as its radius, mass, or location.
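The particle idea can be illustrated with a small Python sketch. It is not IMP's actual API; the `Particle` class, its field names, the units, and the sample values are all assumptions made up for illustration. It shows one model that mixes atom-level particles with a single coarse particle standing in for a whole component.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Particle:
    """One piece of the system (hypothetical class, not IMP's own)."""
    name: str
    location: Tuple[float, float, float]  # x, y, z; angstroms assumed
    radius: float                         # angstroms assumed
    mass: float                           # daltons assumed


# High-resolution region: one particle per atom (illustrative values).
atoms = [
    Particle("N", (0.0, 0.0, 0.0), 1.55, 14.007),
    Particle("CA", (1.46, 0.0, 0.0), 1.70, 12.011),
]

# Low-resolution region: one particle for an entire component.
component = Particle("component_B", (30.0, 5.0, 0.0), 20.0, 25000.0)

# A single model can mix both resolutions.
model = atoms + [component]

for p in model:
    print(f"{p.name:12s} radius={p.radius:6.2f} mass={p.mass:10.3f}")
```

Keeping every particle in the same structure, whatever it represents, is what lets different parts of one model carry different resolutions.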

The software uses this information to create a series of models that could fit the data. It then applies a scoring function to check the models, looking to see how well they match with information known about the particular structure being studied, as well as with structures in general. “You may find some problematic data that way,” Sali says. “You can then go back and maybe collect some more data, or check the accuracy of the data.”
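A toy version of that generate-and-score loop might look like the following Python sketch. The scoring function here, a sum of squared deviations from measured distances, is an assumption chosen for simplicity and is not claimed to be the function IMP uses; the component names, coordinates, and target distances are invented. It also leaves out the comparison against structures in general.

```python
import math

# Measured distances between named components: (a, b, target distance).
# All values are invented for illustration.
restraints = [("A", "B", 10.0), ("B", "C", 5.0)]

# Candidate models: each maps a component name to x, y, z coordinates.
candidates = [
    {"A": (0.0, 0.0, 0.0), "B": (10.0, 0.0, 0.0), "C": (10.0, 5.0, 0.0)},
    {"A": (0.0, 0.0, 0.0), "B": (8.0, 0.0, 0.0), "C": (8.0, 9.0, 0.0)},
]


def score(model, restraints):
    """Sum of squared deviations from the measured distances (lower is better)."""
    return sum(
        (math.dist(model[a], model[b]) - target) ** 2
        for a, b, target in restraints
    )


# Rank the candidates so the best-fitting model comes first.
ranked = sorted(candidates, key=lambda m: score(m, restraints))
best = ranked[0]

for m in ranked:
    print(round(score(m, restraints), 3), m)
```

A restraint that no candidate can satisfy would keep every score high, which is one way problematic data of the kind Sali mentions could show up.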

Sali’s hope is that researchers contributing to the PDB can easily look at other groups’ models and check them against their own data, as well as speed up their own work by relying on models developed by others. “Why should anyone have to reinvent the wheel for any part of a bigger problem they’re trying to solve?” he asks. “They should just be able to borrow it.”

Sharing models between research groups and computing with data derived from multiple sources are issues faced by a growing collection of databases, says Manish Parashar, a professor of computer science and head of the Rutgers Discovery Informatics Institute. The Institute runs large-scale data analytics and simulations of data from a wide variety of sources, including the PDB. For instance, it models the growth of tumors and how they react to drugs for the Cancer Institute of New Jersey, combining data from sequencing the tumor’s genome with NMR or magnetic resonance imaging (MRI) images.

The curators of any set of data have to figure out what metadata to associate with it to make it useful and searchable by the people who will base their research on it, asking, “What makes sense at the science level? How do you encode that into an algorithm?” Parashar says. For example, if the researchers will be asking questions about the shape of a protein, the data has to be labeled with commonly accepted terms describing shape so the researcher can run queries on those terms.

Scientists operating the PDB continue to work on ways to make data shareable and usable. Last October, they held a meeting in Hinxton to discuss how to deal with hybrid models and the expected growth of their archive. They are optimistic they can meet the challenge. “It was started early enough that we can be ready when these big structures start coming in,” Velankar says.

The scientists are aware of the problems that can arise when data standards become outmoded; PDB files were originally written in Fortran, created on punch cards, and sent through the mail. The file structure was based on the cards’ 80-column format and could not accommodate structures with more than 99,999 atoms. As larger structures were discovered, those files had to be broken into separate chunks. “The ribosomes were so big that they actually broke the old data format system,” Burley says. The archive has since changed to a new format, and updated its software to work with it.
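The atom limit follows directly from the fixed-width layout: in the legacy format the atom serial number sits in a five-character field (columns 7 to 11 of an ATOM record), so the largest value that fits is 99,999. The Python sketch below illustrates the overflow; the helper `format_serial` is hypothetical and is not taken from any PDB software.

```python
SERIAL_WIDTH = 5                     # columns 7-11 of a legacy ATOM record
MAX_SERIAL = 10**SERIAL_WIDTH - 1    # largest number that fits: 99999


def format_serial(serial: int) -> str:
    """Right-justify an atom serial number in its fixed-width field.

    Raises ValueError when the number needs more than SERIAL_WIDTH digits.
    """
    field = f"{serial:>{SERIAL_WIDTH}d}"
    if len(field) > SERIAL_WIDTH:
        raise ValueError(f"serial {serial} does not fit in {SERIAL_WIDTH} columns")
    return field


# Start of an ATOM record: record name in columns 1-6, serial in 7-11.
print("ATOM  " + format_serial(MAX_SERIAL))

try:
    format_serial(MAX_SERIAL + 1)
except ValueError as err:
    print("overflow:", err)
```

Any structure with a hundred thousand atoms or more therefore could not be written as a single file without breaking the column layout, which is why such entries were split into chunks.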

One non-scientific challenge is the issue of how to pay for an archive that is expected to keep growing, and that will require more storage space and more biocurators to process the incoming data. The government grants that fund the project have been flat for the past several years. “It’s a continuous struggle to try and find money,” Kleywegt says. Storage is not much of an issue; the archive contains approximately 546GB of data. Downloading the data, especially the full 3D images of large macromolecular machines, can cause “significant capacity issues,” Burley says. On average, users download a million individual structures every day, which means the archive must deliver approximately 7TB each month.
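Those figures can be put in proportion with a few lines of arithmetic. The Python sketch below assumes decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes) and a 30-day month, neither of which is stated in the figures themselves, so the results are rough estimates only.

```python
downloads_per_day = 1_000_000        # "a million individual structures every day"
bytes_per_month = 7 * 10**12         # about 7TB delivered each month (decimal TB assumed)
archive_bytes = 546 * 10**9          # about 546GB stored (decimal GB assumed)
days_per_month = 30                  # assumption

# Average size of one downloaded structure, in bytes.
avg_bytes_per_download = bytes_per_month / (downloads_per_day * days_per_month)

# How many times the archive's own volume is shipped out each month.
monthly_turnover = bytes_per_month / archive_bytes

print(f"average download: about {avg_bytes_per_download / 1000:.0f} kB")
print(f"monthly traffic: about {monthly_turnover:.1f} times the archive size")
```

Under these assumptions the archive ships out roughly 13 times its own size every month, which is consistent with delivery, rather than storage, being the capacity concern.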

“We’re projecting five years from now the archive will be at least 50% larger and the complexity will be considerably higher,” Burley says.

Despite the challenges, Burley finds the accelerating pace of structural biology and the adoption of new data collection methods that are driving the growth very exciting. “It’s a good problem to have,” he says. “The science is moving very fast.”
