Information Analysis in Digital Library Environments: Lessons Learned in Pharma

Lauren Harrison (Roche Pharmaceutical Corp., USA)
This chapter addresses the question of how the analysis of results retrieved from online bibliographic information systems has changed over the last 32 years as digital libraries have evolved. It demonstrates that digital libraries of the future will enable knowledge discovery by providing direct access to the semantic content of documents through the implementation of text-mining tools. To achieve this, research with IR systems and text-mining tools such as Pipeline Pilot (Bandy et al., 2009), I2E (Vellay, 2009), and BioText will need to be conducted by experts in information retrieval, not just by scientific subject specialists.
History of Post-Processing of Retrieval Results

In the beginning there was manual “cut and paste.” The database records contained the text of a bibliographic citation: the title, author, journal, and keywords of the referenced article. A Boolean query matched specific words to the words in the fields of the database record. The hardware capacity of the mainframes of this period determined the functionality of the IR system. The corpus was a collection of database records rather than a collection of full-text articles. The output device was a paper teletypewriter. The network was a telephone line with a coupler. These systems were best suited to generating a bibliography, that is, an exhaustive printout of all articles that contained the keywords on a subject such as diabetes and insulin. The collections of records that online systems handled became known as bibliographic databases. These bibliographic databases (e.g., Medline in Biology and Medicine and STN in Chemistry) grew over time to include searchable abstracts of the articles as a result of the growth of the indexing and abstracting industry.
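The Boolean matching described above can be illustrated with a minimal sketch. It assumes a record is a simple mapping of field names to text and that a query is a plain AND over single words; the function names, field names, and sample records are hypothetical and do not reproduce any particular online system of the period.

```python
import re

# Fields of a citation record that the query is matched against (assumed names).
SEARCH_FIELDS = ("title", "author", "journal", "keywords")


def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def record_terms(record):
    """Collect every word that occurs in the searchable fields of one record."""
    terms = set()
    for field in SEARCH_FIELDS:
        terms |= tokenize(record.get(field, ""))
    return terms


def boolean_and(records, *query_words):
    """Return the records that contain ALL of the query words."""
    wanted = {word.lower() for word in query_words}
    return [r for r in records if wanted <= record_terms(r)]


# Made-up sample records for illustration only.
sample_records = [
    {"title": "Insulin therapy in type 2 diabetes", "author": "Smith J",
     "journal": "Journal A", "keywords": "insulin; diabetes"},
    {"title": "Insulin receptor signalling", "author": "Lee K",
     "journal": "Journal B", "keywords": "insulin; receptor"},
    {"title": "Diet and diabetes", "author": "Adams R",
     "journal": "Journal C", "keywords": "diabetes; nutrition"},
]

hits = boolean_and(sample_records, "diabetes", "insulin")
```

Because the match is made on citation fields only, an article that discusses both topics in its body but not in its title or keywords would not be retrieved, which is one reason the later addition of searchable abstracts mattered.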

Harrison and Lacerna (1992) documented the process of producing annotated bibliographies in Pharma prior to 1989. A considerable amount of production time was spent on content analysis and review of the abstracts or original articles to determine the various topics relevant to the drug profile. The process by which bibliographies were created required manual manipulation of the exported in-house database records. A printout of all records on a substance was retrieved. Each record was then manually coded or indexed according to these topic categories.

If a record was indexed under two or more categories, two or more photocopies were made. References were then cut, pasted, and sorted manually into alphabetical lists per category. The resulting report listed records alphabetically by author within each category. This process had to be repeated for each category, and the categories were input in the logical order of appearance desired. An author index indicating the location of authors' works within the subject section could be automatically generated. This process was tedious, laborious, and costly in terms of man-hours spent in compilation.
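To make the effort involved concrete, the manual steps above (copying a record into every category it was coded with, sorting each category by author, and building an author index) can be expressed as a short sketch. This is an illustration under stated assumptions, not the procedure or software used at the time: the record layout, function names, category names, and sample data are all hypothetical.

```python
from collections import defaultdict


def build_bibliography(records, category_order):
    """Group records by category, then sort each group by author.

    A record coded with several categories appears once in each of them,
    mirroring the photocopy-per-category step of the manual process.
    """
    sections = defaultdict(list)
    for record in records:
        for category in record["categories"]:
            sections[category].append(record)
    return [
        (category, sorted(sections[category], key=lambda r: r["author"].lower()))
        for category in category_order  # categories appear in the desired order
        if category in sections
    ]


def build_author_index(bibliography):
    """Map each author to (category, position) pairs in the subject section."""
    index = defaultdict(list)
    for category, entries in bibliography:
        for position, record in enumerate(entries, start=1):
            index[record["author"]].append((category, position))
    return dict(sorted(index.items()))


# Made-up sample records for illustration only.
coded_records = [
    {"author": "Smith J", "title": "Study 1",
     "categories": ["Pharmacology", "Toxicology"]},
    {"author": "Adams R", "title": "Study 2", "categories": ["Toxicology"]},
    {"author": "Lee K", "title": "Study 3", "categories": ["Pharmacology"]},
]

bibliography = build_bibliography(coded_records, ["Pharmacology", "Toxicology"])
author_index = build_author_index(bibliography)
```

In this sketch the category coding is still supplied by hand in the `categories` field; only the duplication, sorting, and indexing steps are mechanical, which are the steps the manual workflow spent the most time on.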

In 1987, research on an automated method of bibliography generation using Sci-Mate Manager was initiated. This initiative went on to review all bibliographic manipulation software available at the time. Pro-Cite was ultimately chosen as the best tool to facilitate bibliography generation because, on a daily basis, it eliminated the need to edit downloaded data for search report generation. It provided an efficient means of electronically combining references from several sources and facilitated the review of data by providing duplicate detection assistance. It greatly reduced the amount of post-processing required to produce the subject section of the bibliography. Most importantly, at the time, the application was programmer-independent, and the users (information scientists) had total control over the appearance of the output.

