Every fall I teach one of our theme-based introductory CS courses, "Taming Big Data." The amount of data the students deal with is larger than they might see in other courses, but still relatively small: files containing fewer than 10,000 lines of information. I attempt to broaden their sense of how "big" big data can be by providing examples of new developments that involve the use or analysis of large amounts of data. I also provide examples where computer-based analysis yields results that would be impossible to achieve any other way (even for not-so-big data sets). There are lots of examples in the financial world, and the students (at least the many economics majors) are likely to stumble on some of those themselves. There are also interesting examples in the political world, such as Prof. Mark Newman's cartograms of the 2004, 2008, and 2012 U.S. presidential elections, based on county-by-county voting data (http://www-personal.umich.edu/~mejn/election/2008/). Newman is in the Department of Physics and the Center for the Study of Complex Systems at the University of Michigan.
Particularly exciting are developments in the medical arena that can bring real change to people's lives. I found two examples in recent weeks. One recent result is based on The Cancer Genome Atlas (TCGA). The idea behind TCGA, funded by the National Cancer Institute, the National Human Genome Research Institute, and the National Institutes of Health, is that teams doing cancer research could make their data available to others. Data for cancer research often involves time-consuming and costly collection of tissue samples, followed by genomic analysis. Access to existing data would facilitate both the validation and the development of new discoveries related to cancer understanding and treatment. The TCGA now contains tissue samples from 24 different cancer types, including "clinical information, genomic characterization data, and high-throughput sequencing analysis of the tumor genomes." This data was used recently to study subgroups of breast cancer, confirming that there are four primary subtypes of breast cancer, one of which is actually genomically similar to a form of ovarian cancer. This means that treatments typically used for the ovarian cancer (serous type) could also be effective for the related form of breast cancer (basal-like). In addition, analysis of TCGA data for other breast cancer tumors has been used to examine the role of mutated genes in cancer development. The large amount of genomic data available makes it possible for researchers to uncover patterns that would otherwise not be found. This provides researchers with promising new avenues for treatment: the underlying genomic form of the cancer can be taken into consideration when deciding on a treatment strategy, in addition to or instead of its location in the body.
Another exciting application of big data to medicine comes out of the University of California, Riverside (UCR). A collaborative project, carried out by computer scientists at UCR and a doctor at Children's Hospital Los Angeles, involves mining data collected from sensors attached to children in pediatric intensive care units. So that all but the most recent data need not be discarded, a research group led by Eamonn Keogh at UCR has come up with a technique to search extremely large datasets (more than one trillion objects). This allows them to mine extensive amounts of archived medical data, identifying previously unknown patterns that will help both with diagnosis and with prediction of upcoming medical episodes. The paper describing the new search technique received the ACM SIGKDD best paper award. You can find more information at the UCR website, and the paper itself is in the ACM Digital Library.