Galloway, 'Heinz Electronic Library Interactive Online System (HELIOS): Building a Digital Archive Using Imaging, OCR, and Natural Language Processing Technologies', Public Access Computer Systems Review v6n04 URL = http://hegel.lib.ncsu.edu/stacks/serials/pacsr/pr-v6n04-galloway-heinz + Page 6 + ----------------------------------------------------------------- Galloway, Edward A., and Gabrielle V. Michalek. "The Heinz Electronic Library Interactive Online System (HELIOS): Building a Digital Archive Using Imaging, OCR, and Natural Language Processing Technologies." The Public-Access Computer Systems Review 6, no. 4 (1995): 6-18. ----------------------------------------------------------------- 1.0 Introduction In February 1994, Carnegie Mellon University (CMU) embarked on an ambitious project to convert one million pages of the congressional papers of Senator John Heinz (R-PA) into digital format and to provide access to these papers through innovative information retrieval software developed at CMU. Named in memory of the late Senator, the Heinz Electronic Library Interactive Online System (HELIOS) supports full-page digital images and it utilizes natural language processing (NLP) technology to search large quantities of unstructured text. HELIOS will allow researchers to access the Heinz papers through the campus network as well as through the Internet. Over one million dollars was donated by the Heinz Family Foundation, Heinz Company Foundation, and Heinz Endowments to support the establishment of the H. John Heinz III Archives and the digitization project. Heinz assistance has made it possible to advance the principles of digital preservation and access for archival collections. In addition to the Heinz gift, CMU has committed an additional $450,000 in matching resources to the project. These resources primarily come in the form of permanent full-time staff salaries, archival equipment, and rental of a processing facility. Our goal is to develop a digital archive that will serve as a model for the archival profession. We expect to create an archival information technology environment that dramatically increases the depth of indexing and the quality of retrieval beyond what archiving resources have traditionally allowed. To create the HELIOS database, documents are scanned, converted to ASCII form via OCR, verified and organized, and indexed using the CLARIT natural language processing software. The project will develop three graphical user interfaces in a Microsoft Windows environment: a scanning interface, an archivist/verification interface, and an end-user interface. + Page 7 + HELIOS represents a significant breakthrough technology that has the potential to transform the work of archivists by helping them to overcome the significant challenges they face, including an inability to: 1. create good finding aids and indexes for paper archives that provide deep access to collections, 2. provide effective retrieval from paper archives due to the inherent diversity and size of these one-of-a-kind files, and 3. offer broad public access to archives because they represent resources that the researcher must visit in order to use effectively. Archivists have resisted the use of information technology because they lack appropriate tools to automatically process large amounts of text for retrieval. HELIOS will offer such a tool. Clearly, there are many problems yet to be solved in the management and preservation of digital archives, but it is CMU's intention to work with the larger archival and library community to help establish standard practices for digitizing paper archives and to develop the information management tools to give scholars and students state-of-the-art access to them. 2.0 H. John Heinz III Congressional Collection Shortly after the tragic death of Senator Heinz in 1991, the family placed the congressional papers at Carnegie Mellon University to serve as the research centerpiece for the Heinz Graduate School of Public Policy and Management. CMU spent $70,000 to prepare an archival facility near campus in anticipation of receiving the collection. Upon completion of the space, the collection was transferred from its storage facility in Harmarville, Pennsylvania to CMU. In addition to documenting Heinz' tenure as a three-term member of the U.S. House of Representatives, the papers focus on his fifteen-year Senate career. Senator Heinz earned a national reputation based on his work on retirement and aging concerns, health care, international trade and finance, human development, and environmental issues. The Heinz papers present a rich and valuable source of information about the professional life of John Heinz in the U.S. Congress and the social and political concerns of the nation during the Senator's tenure. The Heinz Archives will aid scholars in understanding the Senator's contributions to national policy and allow current public policy makers to build upon his accomplishments and unfinished work. + Page 8 + The H. John Heinz III Archives will provide both traditional and electronic access to the papers. The Heinz Archives staff uses conventional processing methods to arrange and describe the papers while applying fundamental preservation techniques to the original material to ensure longevity. The material is housed in proper environmental conditions suitable for long-term preservation. To date, over 500 of the original 1,200 cubic feet of material has been processed. At the completion of processing, the collection should comprise approximately 650 to 700 cubic feet of material. The HELIOS project will provide electronic access to the most important series and subseries in the collection. 3.0 HELIOS Team Members The HELIOS project is comprised of three umbrella units representing several disciplines. While bringing its own expertise to the creative process, each unit is making major contributions to the design, creation, and implementation of HELIOS. 3.1 Laboratory for Computational Linguistics CMU's Laboratory for Computational Linguistics (LCL) focuses its research efforts on information retrieval issues. LCL researchers developed efficient methods to analyze and extract language using computers, and this NLP research is the basis of the CLARIT software. 3.2 CLARITECH Corporation The CLARITECH Corporation, a CMU spin-off company, has improved and marketed LCL's NLP technology, dubbing it CLARIT. CLARITECH's primary contribution to the HELIOS project is system design. It is responsible for incorporating elements of the CLARIT system designed in the LCL into HELIOS, for creating three graphical user interfaces for the system, and for supporting the development of new CLARIT tools for HELIOS throughout the duration of the project. 3.3 Carnegie Mellon University Libraries Three different units within the Carnegie Mellon University Libraries play a major role in the interdisciplinary functioning of the project. + Page 9 + The Library Administration is responsible for providing the leadership function for HELIOS as well as the fiscal management of the project. The Department of Library Information Technology is responsible for developing the HELIOS client/server system, maintaining the system, training users, and documenting the system. The University Archives facilitates the interdisciplinary teamwork of the project. The Heinz Archives, a unit within the University Archives, is responsible for establishing control over the collection, appraising and processing the original Heinz papers, creating a finding aid to the collection, providing quality reference service, disseminating and cataloging the collection via OCLC, and preserving the original collection in perpetuity. Working together, the University and Heinz Archives are responsible for developing three interface specifications, testing the interfaces before release, scanning the original material into electronic format, verifying the quality of the images, performing additional organizational tasks, creating annotations and links to other parts of the electronic collection, conducting user protocol testing, and training other library staff to use the system. 4.0 HELIOS Document Processing In the HELIOS project, the original paper documents are processed based on the typical arrangement scheme of a congressional collection. After this is completed, document pages from complete series and subseries are scanned, resulting in 400 dpi image files. The image files are then converted to ASCII text files using the TextBridge OCR package. The images and text are verified and additional notes and organization added. The text is indexed by the CLARIT natural language processing software, resulting in the searchable HELIOS database. 4.1 Scanning Once an entire series has been processed, the documents are transferred into electronic format. With the use of a 66 MHz 486 DX2 PC with a 2 GB hard drive, 20" monochrome monitor (1600 x 1200 pixels), and a high-end Fujitsu scanner, we are creating 400 dpi bitonal TIFF images. The images are compressed using CCITT Group IV, an international compression standard, and backed-up onto 4 mm digital data storage tapes. + Page 10 + Because the scanning procedure represents the most crucial aspect of the project, we designed a scanning interface to facilitate the rapid scanning of documents while capturing essential contextual information for the end-user. The scanning interface imitates the standard archival collection arrangement, organizing documents into subgroups, series, subseries, and smaller units. Operators select the appropriate level using drop-down menus. The operator then enters the box and folder number as well as the folder title and date. The scanning interface was also designed to capture "bundles"; that is, groups of documents originally fastened together by paper clips, staples, or rubber bands. These bundles, which often reflect inherent meaning, are more difficult to depict to an online user; however, doing so is important because it gives the user the same context as if he or she were physically examining the material. The document feature allows the operator to choose from a prepared list of document types, such as correspondence, memoranda, speeches, and notes, and to assign a corresponding date. There are two reasons for doing this. First, tagging this kind of data will enable a user to restrict a search to a specified document type. Second, most archival documents do not have distinct titles. To overcome this problem and generate a useful description of the retrieved document for the user, the document type and date can be offered as the title. Providing this kind of "fielded" information is vital for access to the material and to maintain contextual accuracy. In addition to capturing the contextual information, this interface was developed to take into account the unique characteristics of archival documents. Prior to scanning a document, the operator must specify page size; brightness and contrast quality levels; whether the document is single- or double-sided; orientation of the page; and the scanner source (flatbed or feeder). When a page is scanned, it appears in an image viewer, allowing the operator to determine the success of the scan and to rescan if necessary. Each scanning session is logged to monitor quality control and record scanning performance. + Page 11 + 4.2 Optical Character Recognition (OCR) In order for the system to provide innovative searching capabilities, the images must be converted to machine-readable format. This text recognition process, commonly referred to as optical character recognition (OCR), produces a standard ASCII text file. An off-the-shelf package called TextBridge, a Xerox Imaging Systems product, is used for OCR conversion. The HELIOS system designers have tweaked TextBridge to run the OCR process in batch mode at night to economize staff time and computer usage. They have also implemented a CLARIT tool to perform post-OCR correction, thereby boosting the accuracy of the OCR process to an even greater degree. 4.3 Verification Once each page image has a corresponding text file, the archives staff will utilize an archivist/verification interface that is presently being designed to support image and text verification, annotation, and organization. To perform the majority of the verification tasks, operators will use a Sun SparcStation 20 workstation with an 18 GB external hard drive and 20" color monitor (1280 x 1024 pixels). This workstation will also serve as the HELIOS search engine server and file server. Two Sun SparcStation 5 workstations will be used to perform additional verification. The archivist/verification interface will display each page image and its associated ASCII text to the operator, and it will: o Enable the operator to verify the quality of the page image against the original page itself. o Schedule pages for rescanning. o Check and correct the attributes associated with each page. o Evaluate the quality of the OCR conversion for each page. o Perform minimal editing of the converted ASCII text, perhaps keying in sections that did not respond to the OCR process, such as handwritten notes. + Page 12 + o Mark pages with serious OCR conversion problems so that they can be keyboarded by a typist at a later date. o Add notations. o Create links to other groups of documents. o Perform other organizational tasks, including the reordering of pages or folders. The Heinz Archives Assistant and graduate students will perform the majority of the verification tasks. For documents not scanned in their entirety, such as government publications, operators will note the availability of the complete report in a regional repository or provide some other explanation. For poor or skewed images, operators will indicate that the original page possessed these characteristics. The Heinz Archivist will provide additional organizing, indexing, and annotations. Descriptive or structural notes will be added to various levels of the electronic collection. The Heinz Archivist will be able to organize the electronic archives along several different dimensions apart from its processed arrangement. For example, a taxonomy of related terms describing any level of the collection could be constructed. Each series will be linked to its appropriate inventory and description, and cross-reference notes will be established to link records to similar groups of material. 4.4 NLP Search Engine The HELIOS search engine utilizes natural language processing (NLP) technology. NLP stems from work done in the fields of computer science, artificial intelligence, and linguistics. Natural language is simply common, everyday language we use to speak and write. Natural language processing allows users to interact with a computer system, describing topics of interest using their own language as opposed to reacting to menus and prompts or using keyword and Boolean searching techniques. Consequently, they can make better use of the database with only a general knowledge of its contents. + Page 13 + As the HELIOS search engine, CLARIT supports more accurate, sensitive, and robust content-based indexing and retrieval than is possible with traditional "word-based" information retrieval technologies. Its indexing and retrieval capabilities are not based on locating individual words, but rather on extracting concepts that accurately characterize the content of documents. Combined with specialized statistical methods, CLARIT analyzes a query linguistically, comparing it with a similar linguistic analysis of the actual documents in the database. We have applied CLARIT to the problem of managing compound documents (text and images) and the special requirements of archival material. Why use NLP? Concrete disciplines, such as the medical and legal professions, often communicate and express ideas in rigorous terminology. But historians and other scholars, who use archives and historical material, approach their discipline with more imprecise language. This is why NLP technology has such promise for robust retrieval of archival material. The Text Retrieval Conference (TREC) studies sponsored by the National Institute of Science and Technology (NIST) and the DOD's Advanced Research Projects Agency (ARPA) have now demonstrated that CLARIT has a compelling advantage over traditional keyword and Boolean searching and retrieval. [1] Studies of keyword and Boolean retrieval systems have shown that sometimes they provide good precision and sometimes good recall, but never both together, and often neither. [2] The non-expert searcher (i.e., the average library user) has even less success. In addition, Boolean logic operators or special devices like adjacency and nesting are usually ignored by the general user who opts for single-term searches in hopes of getting the greatest number of retrieved items. They know from experience that they will do better by manually sifting the results and selecting relevant documents. Efforts to enhance online records have improved recall at the expense of precision. Unless we find new tools, moving to full-text electronic access will only make matters worse. CMU believes that CLARIT is the "better mousetrap"--one that will be especially useful for accessing archival material. + Page 14 + 4.5 Prototype End-User Interface One year ago, approximately 20,000 pages of archival material related to the work of Michael Lockerby, a Legislative Assistant (1977-1981) specializing in environmental issues and legislation, were used as a testbed for initial scanning and CLARIT retrieval. This scanning project allowed us to assess the physical demands of scanning archival documents and to determine the strengths and weaknesses of off-the-shelf technology used in HELIOS. The resultant document database was very useful in conducting focus groups on the prototype HELIOS end-user interface, and it was also used for HELIOS demonstrations at other sites. The prototype end-user interface has four major windows. 1. Query Window. The user enters a natural language search in this window, such as "the economic impact of environmental regulations and policies on the steel industry." 2. CLARIT Results Window. This window shows the list of retrieved pages. The pages are ranked in order of their estimated relevance to the query. The Change column indicates the upward or downward movement of pages based on a prior query search (e.g., +12 or -8), and an asterisk indicates that the system retrieved a new page. The Document Type and Date fields are used to generate a useful description of the retrieved page, such as "Memoranda--May 5, 1990" or "Speech--November 12, 1990." 3. Document Window. This window provides an ASCII version of the page with highlighting of retrieval terms associated with the query. 4. Image Viewer. This window displays the bitmapped image of the original page retrieved for the search. The end-user will have the ability to move forward and backward through any level of the collection (e.g., move to the next or previous page, the next or previous document, or the next or previous folder). Since document types will be tagged, the user could restrict a search to a specified date and document type, such as correspondence, memoranda, or speeches. + Page 15 + CLARIT offers several tools to improve query results. One feature can extract related terminology from the actual documents. These terms are generated "on-the-fly" by CLARIT. This feature allows the documents to describe themselves and eliminates the need for pre-existing indices. A second feature allows the researcher to use an existing page as an example query to locate more documents like it. A third tool incorporates a hypertext feature to link other relevant portions of the collection together. Unless specified, CLARIT searches the entire collection regardless of the arrangement and contextual framework imposed by the archivist. Therefore, it is crucial for the user to understand the context in which the retrieved documents were created. The HELIOS end-user interface will present the name of the folder from which a retrieved page originated, and it will allow the user to browse the inventory of any series as well as read the series descriptions. Therefore, the user interface will incorporate the traditional methods of performing archival research that maintain the context in which documents were created. Much has already been learned from initial focus groups about what modifications are needed to make the end-user interface more user friendly. Users commented that some terminology was not easily understandable. For instance, they were not sure that "score" correctly expressed the meaning of the rank order or even needed to be shown. Terms like "augment" and "parse," common though they are in linguistics, are not clear to end-users. With respect to the structure of the interface, users want to save queries and reuse them. Users also commented on what should be displayed in the list of retrieved titles. The initial focus groups discovered that users are confounded by a system that can retrieve a variable number of documents based on relevance scores. They tend to want "all" of the retrieved documents; however, this is probably an artifact of past experience in keyword and Boolean retrieval systems. Similarly, they expect the use of "not" to be Boolean in its effects, though this effect is not achieved when retrieval is only on nouns and noun phrases. + Page 16 + Upon completion of the initial version of the end-user interface, CMU will begin conducting formal user protocol testing to provide concrete data about how researchers actually approach the user interface, and to make changes as needed. In focus groups, users will often describe what they think they need, but in protocol testing they actually want something else. Additional tools and features will be built and introduced as users become more experienced with the system. 5.0 Potential HELIOS Benefits The HELIOS project team anticipates that the system will have a number of potential benefits: o It will allow users to find archival information quickly and efficiently. Because of the overwhelming amount of material that is often present in congressional archives, research is often a result of an extremely time-consuming manual "hit or miss" research methods. Using HELIOS, users will eliminate the need to wade through pages and pages of less significant material in search of those "golden nuggets." Scholars will be able to focus their efforts more on exploring new ideas, comparing and contrasting new relationships, and drawing conclusions, rather than on performing endless hours of manual research. o It will provide uniform and consistent access to the collection in a way that is superior to the access provided by traditional finding aids. o It will provide subject access across the entire record group, series, subseries, and folders, making the collection accessible in a variety of ways. o New series, which in the past received little research attention due to unmanageable bulk or perceived relevance of folder titles, will be easily accessible. o Many archives' users simply do not have the time or money to travel to distant repositories to conduct research. Remote users will be able to access both HELIOS and a finding aid via the Internet using World- Wide Web browsers, such as Mosaic. Consequently, the archives' location and operating hours will no longer be a concern. + Page 17 + o Many potential users of archives avoid them because of poor finding aids, excessive bulk, and time constraints, instead turning to secondary sources of information. HELIOS will encourage these traditional users to conduct more archival research, and it will attract new types of users. 6.0 Conclusion By effectively utilizing imaging, OCR, and natural language processing technologies, the HELIOS project promises to dramatically transform the Heinz Archives' services by providing researchers with state-of-the-art electronic access to archival source materials. The HELIOS project is building a prototype of the digital archive of the future. Hopefully, it will be one of many similar projects that will make archival information instantly available to users across the globe, offering them advanced information retrieval capabilities that significantly enhance their research activities. Notes 1. Donna Harmon, ed., The Second Text REtrieval Conference (TREC- 2) (Washington, DC: Government Printing Office, 1994). 2. D. C. Blair and M. E. Maron, "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System," Communications of the ACM 28 (March 1985): 289-299. About the Authors Edward A. Galloway, Heinz Archivist, Carnegie Mellon University, 5000 Forbes Avenue, Hamburg Hall, Room 1506, Pittsburgh, PA 15213-3890. Internet: eg2d@andrew.cmu.edu. Gabrielle V. Michalek, University Archivist, Carnegie Mellon University, 4825 Frew Street, Pittsburgh, PA 15213-3890. Internet: gm1l@andrew.cmu.edu. About the Journal The World-Wide Web home page for The Public-Access Computer Systems Review provides detailed information about the journal and access to all article files: http://info.lib.uh.edu/pacsrev.html + Page 18 + Copyright This article is Copyright (C) 1995 by Edward A. Galloway and Gabrielle V. Michalek. All Rights Reserved. The Public-Access Computer Systems Review is Copyright (C) 1995 by the University Libraries, University of Houston. All Rights Reserved. Copying is permitted for noncommercial, educational use by academic computer centers, individual scholars, and libraries. This message must appear on all copied material. All commercial use requires permission.