Data Mining Uncovers New Connections Between Health Problems

Patient health network — A network depicting patients' health problems (colored dots) reveals overlapping conditions, including such known connections as diabetes and hypertension.

Researchers in Denmark are using data mining techniques to uncover connections between health problems as seemingly unrelated as migraines and hair loss.

In addition, the scientists determined that the gluten allergy known as celiac disease is associated with hair loss and migraines, and also is linked to schizophrenia.

Some 800 pairs of health problems turned up more than twice as often as expected by chance, and 93 of those pairs were then flagged by a doctor as being "especially intriguing."
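To make the "more than twice as often as expected by chance" criterion concrete, here is a minimal sketch of one common way to score such pairs: the observed number of patients with both conditions divided by the number expected if the two conditions occurred independently. The article does not describe the team's actual statistic, so the function name, the threshold handling, and the toy patient records below are all assumptions invented for illustration; the ICD-10 codes are used only as labels.

```python
from collections import Counter
from itertools import combinations


def comorbidity_ratios(patients):
    """Return observed/expected co-occurrence ratios for each pair of codes.

    patients: list of sets of disease codes, one set per patient.
    Expected count under independence is n * P(a) * P(b) = count(a) * count(b) / n,
    so the ratio is observed * n / (count(a) * count(b)).
    """
    n = len(patients)
    single = Counter()  # number of patients having each code
    pair = Counter()    # number of patients having both codes of a pair
    for codes in patients:
        single.update(codes)
        pair.update(combinations(sorted(codes), 2))
    return {
        (a, b): observed * n / (single[a] * single[b])
        for (a, b), observed in pair.items()
    }


# Toy, invented records (NOT data from the study).
# E11 = type 2 diabetes, I10 = hypertension, G43 = migraine,
# L65 = nonscarring hair loss, F20 = schizophrenia.
patients = [
    {"E11", "I10"}, {"E11", "I10"}, {"E11", "I10"}, {"I10"},
    {"G43"}, {"G43", "L65"}, {"L65"},
    {"F20"}, {"F20"}, {"F20"},
]

ratios = comorbidity_ratios(patients)
# Keep only pairs seen more than twice as often as independence predicts.
flagged = {p: r for p, r in ratios.items() if r > 2.0}
print(flagged)
```

On real data a ratio like this would normally be paired with a significance test and a correction for multiple comparisons, since thousands of code pairs are examined at once.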

In their article published last month, co-author Søren Brunak and his team describe patients’ electronic health records (EHRs) as "an unexplored but potentially rich data source for discovering correlations between diseases."

Brunak is both the director of the Center for Biological Sequence Analysis at the Technical University of Denmark and the head of the Disease Systems Biology Department at the Center for Protein Research, University of Copenhagen. The project is a collaboration between the two institutions.

Data mining techniques were used on clinicians’ notes within the EHRs, collected over a 10-year period from 5,543 patients at Denmark’s largest psychiatric hospital, to automatically extract clinically relevant terms and map these to approximately 22,000 disease codes in the World Health Organization’s International Classification of Diseases ontology (ICD-10).

Besides generating new leads about the molecular workings of disease, the approach is also revealing a much richer portrait of each patient.

"Using the text-mining approach, we can produce a much more fine-grained patient characterization, going far beyond the assigned codes," says Brunak. "This aspect also has the implication of potentially improving conventional epidemiology research as the registries typically only contain terms which the doctors put into the structured fields—which is about 10% of what we find."

The team’s biggest challenge involved the medical records themselves, which Brunak described as "typically dirty, full of misspelling and other errors" that are rarely corrected.

"As long as doctors understand what they or others wrote, nobody cares," he says. "In that sense, the EHRs are different from other databases which gradually are cleaned up and error corrected. Our variational dictionary took care of this difficult task, and we demonstrated by extensive, manual benchmarking that the quality of the text mining was very high."
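The article does not say how the team's variational dictionary works. As a rough illustration of the general idea, mapping misspelled terms in free-text notes to disease codes, the sketch below substitutes plain fuzzy string matching from Python's standard `difflib` module. The tiny term table, the `extract_codes` helper, the similarity cutoff, and the sample note are all hypothetical; a real system would need multi-word terms, negation handling, and a far larger vocabulary.

```python
import difflib
import re

# Hypothetical, heavily abbreviated term-to-code table (ICD-10 categories).
TERM_TO_CODE = {
    "migraine": "G43",
    "hypertension": "I10",
    "schizophrenia": "F20",
}


def extract_codes(note, cutoff=0.8):
    """Return the set of codes whose terms approximately appear in the note.

    Each lowercase word in the note is compared against the known terms;
    the best match is accepted if its similarity ratio is at least `cutoff`.
    """
    codes = set()
    terms = list(TERM_TO_CODE)
    for token in re.findall(r"[a-z]+", note.lower()):
        match = difflib.get_close_matches(token, terms, n=1, cutoff=cutoff)
        if match:
            codes.add(TERM_TO_CODE[match[0]])
    return codes


# Invented example note with typical misspellings.
note = "Pt reports migrane and hypertention; hx of schizofrenia."
codes = extract_codes(note)
print(sorted(codes))
```

The cutoff trades recall against false matches: a lower value catches more misspellings but risks mapping unrelated words to a disease term, which is why manual benchmarking of the kind Brunak describes matters.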

While Brunak and his team continue to try to establish whether certain proteins and genes contain changes that could potentially explain disease-disease correlations, they have yet to draw major conclusions about the implicated proteins and mechanisms.

Paul Hyman was editor-in-chief of several hi-tech publications at CMP Media, including Electronic Buyers’ News.
