There have been few studies of large corpora of narrative notes collected from clinicians working at the point of care. This chapter describes the principal issues in analysing a corpus of 44 million words of clinical notes drawn from the Intensive Care Service of a Sydney hospital. The study identifies many of the processing difficulties that arise with written materials which, in contrast to formally published material, are highly informal, are written by authors under significant time pressure, and contain a large technical lexicon. Recommendations are provided on the processing tasks needed to turn such materials into a more usable form. The chapter argues that these problems require a return to issues that computational linguists largely solved 30 years ago but that need to be revisited for this entirely new genre of material. By returning to the past and studying the contents of these materials in retrospective studies, we can plan for a future that provides technologies that better support clinicians. Clinicians need to produce texts of higher lexical and grammatical quality, which can then be leveraged successfully for advanced translational research, thereby bolstering its momentum.
The task of performing natural language processing over clinical notes goes back to 1983, with the work of Chi, Sager, Tick and Lyman (Chi, E. C., Sager, N., Tick, L. J., & Lyman, M. S., 1983; Gabrieli, E. R., & Speth, D. J., 1986), and activity has increased only gradually since then. In 2008, however, the first conference specifically targeted at the “Text and Data Mining of Clinical Documents” was organized by the Turku Centre for Computer Science in Finland (Karsten, H., Back, B., Salakoski, T., Salanterä, S., & Suominen, H., 2008). Much of the work prior to the 1990s has been superseded by later increases in processing power and by new ideas on software development for this task. The review of the literature in this paper is therefore restricted to later topics that are particularly relevant to automated processing of the language of clinical notes.
In 2001, Taira and Soderland published their approach to information extraction from clinical notes in the radiology domain. They proposed a general structure for such applications, which was later used by many other research groups (Huang, Y., Lowe, H., Klein, D., & Cucina, R., 2005; Arnold, C. W., Bui, A. A. T., Morioka, C., El-Saden, S., & Kangarloo, H., 2007; Thomas, B. J., Ouellette, H., Halpern, E. F., & Rosenthal, D. I., 2005; Sinha, U., & Kangarloo, H., 2002). Their proposal had five processing steps for a complete data retrieval system: structural analyzer, lexical analyzer, parser, semantic interpreter and frame constructor. They evaluated this structure and reported obstacles in five areas: achieving a deep understanding of the domain, dealing with ungrammatical writing styles, handling shorthand and telegraphic writing, resolving knowledge assumed between writer and reader, and handling a large vocabulary. Following their work, other studies have extended the proposed system to other clinical domains or addressed the reported issues (Sun, J. Y., & Sun, Y., 2006), and in some cases have reported new obstacles.
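To make the five processing steps concrete, the following is a minimal illustrative sketch of how such stages might be chained. It is not Taira and Soderland's implementation: the function names, the tiny abbreviation lexicon, the list of findings, the negation cues and the example note are all hypothetical, and each stage is reduced to a few lines of string handling purely to show the flow of data from raw note to frame.

```python
import re

# Hypothetical toy resources; a real system would need a large clinical lexicon.
ABBREVIATIONS = {"pt": "patient", "c/o": "complains of",
                 "sob": "shortness of breath", "cp": "chest pain"}
FINDINGS = ("shortness of breath", "chest pain")
NEGATIONS = ("no", "denies")


def structural_analyzer(note):
    """Step 1: split a note into sections keyed by a 'HEADER:' prefix."""
    sections = {}
    for line in note.splitlines():
        header, sep, body = line.partition(":")
        if sep:
            sections[header.strip().lower()] = body.strip()
    return sections


def lexical_analyzer(text):
    """Step 2: tokenize and expand shorthand into full words."""
    tokens = re.findall(r"[a-z]+(?:/[a-z]+)?|,", text.lower())
    words = []
    for token in tokens:
        words.extend(ABBREVIATIONS.get(token, token).split())
    return words


def parser(words):
    """Step 3: a crude stand-in for parsing, splitting clauses at ',' or 'and'."""
    clauses, current = [], []
    for word in words:
        if word in (",", "and"):
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(word)
    if current:
        clauses.append(current)
    return clauses


def semantic_interpreter(clause):
    """Step 4: label known findings in a clause as present or absent."""
    text = " ".join(clause)
    negated = any(word in NEGATIONS for word in clause)
    return [(finding, "absent" if negated else "present")
            for finding in FINDINGS if finding in text]


def frame_constructor(section, finding, status):
    """Step 5: package one interpreted finding as a frame (a plain dict here)."""
    return {"section": section, "finding": finding, "status": status}


def extract_frames(note):
    """Run all five steps over a note and collect the resulting frames."""
    frames = []
    for section, body in structural_analyzer(note).items():
        for clause in parser(lexical_analyzer(body)):
            for finding, status in semantic_interpreter(clause):
                frames.append(frame_constructor(section, finding, status))
    return frames


if __name__ == "__main__":
    for frame in extract_frames("HISTORY: pt c/o SOB, denies CP\nPLAN: monitor"):
        print(frame)
```

Even this toy version touches the obstacles reported above: the shorthand in the example note is only handled because every abbreviation happens to be in the lexicon, and the clause splitting would fail on less regular, telegraphic text.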