At the end of last year, I presented a webinar to the American Medical Informatics Association on clinical text mining and text engineering – applying semantic annotation and text mining to medical records. There is a rapid growth in the extraction of meaning – or semantics – from medical records, and it exposes issues and problems that we need to be aware of.
Medical records typically consist of a structured part, and an unstructured, textual part. A lot of effort has gone into structuring as much of the medical record as possible, but despite this, most of the information is still in the free text. Some estimates put the information content of the free text of the medical record as high as 70% of the total. There are many reasons for this: the failure of user interfaces to meet the needs of clinicians; a cultural affinity for text amongst clinicians; an aversion to change; and the flexibility of language. Clinicians can easily record complex, nuanced ideas in text, in a way that they cannot in the structured part of the record. If we want to reuse the medical record for clinical care, and for research, then we must tackle this text.
The language of medical records reflects the complexity of medicine. The type of text in the record is very varied. There are very chatty, discursive letters written from one clinician to another. There are also reports written in a semi-structured way, with the same headings and phrases used time and again. At the other extreme, there are terse, telegraphic notes recorded for a clinician’s own use in a free text form entry with a limited size. And to complicate issues further, the style of these texts can vary from one medical unit to another. Looking within the individual documents, we find that medicine is rich in technical terminology: one common medical terminology resource contains over one million terms. Many of these terms are ambiguous: for example, “cold” can refer to a viral illness, or to the temperature of something. New terms are invented all the time. Medical terminology is also compositional: it is possible to build new terms by combining existing ones, as in “Zika virus disease”.
Complexity is not limited to the term itself. The context in which a medical term appears can have a huge impact on its meaning. Negation, for example, is widely used, as in “without evidence of infection”, and “infection has been ruled out”. Then there is the phenomenon of hedging, where a clinician modifies the likelihood of a term. My favourite, from a real patient record, is “probably a possible tumour”. The temporal context of a term is critical to its understanding. We need to know if a symptom was apparent last week, or last year. When a clinician says “a few days ago”, what does this mean?
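To make the negation examples above a little more concrete, here is a minimal sketch of a trigger-based negation check in Python. It is only an illustration of the general idea, not the method used in any particular system: the trigger lists, the `is_negated` function and the five-word window are all assumptions made for this example, and a real system would need far larger, curated trigger lists as well as handling for hedging and temporality.

```python
import re

# Illustrative trigger phrases only (assumed for this sketch).
# Triggers that appear BEFORE the term they negate.
PRE_NEGATION = ["without evidence of", "no evidence of", "no sign of", "denies"]
# Triggers that appear AFTER the term they negate.
POST_NEGATION = ["has been ruled out", "was ruled out", "is ruled out"]


def is_negated(sentence: str, term: str, window: int = 5) -> bool:
    """Return True if the first occurrence of `term` in `sentence` has a
    negation trigger within `window` words before or after it."""
    text = sentence.lower()
    match = re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    if match is None:
        return False  # term not mentioned at all

    # Take up to `window` words on each side of the term.
    before = " ".join(text[:match.start()].split()[-window:])
    after = " ".join(text[match.end():].split()[:window])

    if any(trigger in before for trigger in PRE_NEGATION):
        return True
    if any(trigger in after for trigger in POST_NEGATION):
        return True
    return False


if __name__ == "__main__":
    for s in [
        "Chest clear, without evidence of infection.",
        "Infection has been ruled out.",
        "Patient has a chest infection.",
    ]:
        print(s, "->", is_negated(s, "infection"))
```

Even this toy version shows why context matters: the same term, “infection”, is treated as present in one sentence and absent in the other two, purely on the basis of the words around it.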
Given the variety of text types and the complexity of the language, we have to deploy many different text mining technologies. Technologies for handling large, complex terminologies and ontologies are important, as are technologies for matching the terms in text to their correct meaning. We also need algorithms that go beyond the terms themselves, and that consider their context and temporality.
We also need to consider infrastructural problems. Take scale, for example. A large hospital might have 15 million or more free text records, whilst a colleague at the US Veterans Health Administration recently told me that their 152 health centres and hospitals have over one billion free text records combined. Confidentiality is also a major concern, as many people do not want to share their health records, and many countries have specific legislation about this. De-identification technologies are available, but these can never be 100% effective. Scalable, secure systems and institutional access control are therefore the norm.
KConnect consortium members have been involved in several medical record text mining systems. One of the largest is at the South London and Maudsley hospitals (SLAM), a large mental health care provider. SLAM have had a GATE-based text mining system installed in their Biomedical Research Centre for seven years. This system is used to generate data for research. For example, a widely used measure of cognitive ability, the Mini Mental State Examination (MMSE), has a dedicated field in the structured medical record. Despite this, three times as many MMSEs are recorded in free text as in the structured record. Extracting these, together with their values and dates, allows SLAM BRC epidemiologists to carry out research into dementia that would not otherwise have been possible. Other projects at SLAM BRC that use text mining include projects on the effect of drugs used in psychosis, on pregnancy and schizophrenia, ethnicity and health, prediction of suicide attempts, the relationship between social media disinformation and mental health, and many others.
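As a rough illustration of what extracting MMSE scores “together with their values and dates” might involve, here is a small Python sketch using a regular expression. This is not the GATE-based SLAM system, which I am not reproducing here. The pattern, the `MMSEResult` type and the `extract_mmse` function are hypothetical, and the sketch assumes scores are written as “N/30” and dates as numeric day/month/year, which real records will often violate.

```python
import re
from typing import List, NamedTuple, Optional


class MMSEResult(NamedTuple):
    score: int            # MMSE score, 0 to 30
    date: Optional[str]   # date string as written in the text, if one follows


# Assumed surface forms: "MMSE" or "Mini Mental State Exam(ination)",
# followed within a few characters by "<score>/30", optionally followed
# by a numeric date such as 12/03/2015.
MMSE_PATTERN = re.compile(
    r"\b(?:MMSE|mini[- ]mental state exam(?:ination)?)\b"
    r"\D{0,30}?"
    r"(?P<score>\d{1,2})\s*/\s*30\b"
    r"(?:\D{0,20}?(?P<date>\d{1,2}/\d{1,2}/\d{4}))?",
    re.IGNORECASE,
)


def extract_mmse(text: str) -> List[MMSEResult]:
    """Find MMSE mentions in free text and return their scores and dates."""
    results = []
    for m in MMSE_PATTERN.finditer(text):
        score = int(m.group("score"))
        if 0 <= score <= 30:  # discard impossible scores
            results.append(MMSEResult(score, m.group("date")))
    return results


if __name__ == "__main__":
    note = "MMSE score was 24/30 on 12/03/2015. Later mmse 19/30."
    for result in extract_mmse(note):
        print(result)
```

A pattern like this only covers the easy cases. Scores written without a denominator, dates given relative to the visit, and negated or historical mentions all need the kind of contextual processing described earlier.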
Medical records provide a good example of how semantic annotation and text mining can make a difference in the real world. They also provide a good example of some of the problems encountered, and of the solutions that we need to deploy.
(An earlier version of this blog post appeared on the OpenMinTeD project website in April 2016.)