Price-Wilkin, 'Gateway Between the World-Wide Web and PAT: Exploiting SGML Through the Web', Public Access Computer Systems Review v5n07
URL = http://hegel.lib.ncsu.edu/stacks/serials/pacsr/pr-v5n07-price-wilkin-gateway
-----------------------------------------------------------------
Price-Wilkin, John. "A Gateway Between the World-Wide Web and
PAT: Exploiting SGML Through the Web." The Public-Access
Computer Systems Review 5, no. 7 (1994): 5-27. To retrieve this
file, use the following URL: gopher://info.lib.uh.edu:70/
00/articles/e-journals/uhlibrary/pacsreview/v5/n7/pricewil.5n7.
Or, send the following e-mail message to listserv@uhupvm1.uh.edu:
GET PRICEWIL PRV5N7 F=MAIL.
-----------------------------------------------------------------
1.0 Introduction
The HyperText Markup Language (HTML) used by the World-Wide Web
has limited markup and structure recognition capabilities. Only
a small set of text characteristics can be represented, and few
of these have any functional value beyond display capabilities.
The HTML ANCHOR element supports hypertext links; however, it
cannot retrieve components of a linked document, such as a single
glossary entry from a collection of several thousand entries,
without resorting to programs external to HTML and the Web
server. In spite of these limitations, HTML and the Web are key
technologies for libraries.
The Standard Generalized Markup Language (SGML) is a full-
featured, standard markup language. HTML is actually an SGML
Document Type Definition. Ideally, it would be possible to
retrieve text documents marked up with the richer SGML tag set
via the World-Wide Web.
This technical paper discusses how the Web can be linked to
the PAT system, Open Text's search engine that supports access to
SGML-encoded documents. This Web-to-PAT Gateway utilizes the
Web's Common Gateway Interface (CGI) capability and SGML-to-HTML
filter programs.
After a brief overview of key technical concepts, the paper
explains the operation of the Web-to-PAT Gateway, using several
examples of how it is employed at the University of Virginia
Libraries, including access to text files such as a Middle
English collection, the Oxford English Dictionary, and the Text
Encoding Initiative's Guidelines for Electronic Text Encoding and
Interchange.
2.0 Key Concepts
This approach to using the Web to provide access to complex
textual resources involves many tools and concepts that may be
unfamiliar to the reader. This section provides a very brief
overview of these topics and describes their
interrelationships.
2.1 SGML and HTML
Standards and open systems must be an essential part of library
efforts to provide large-scale, wide-area access to textual
resources. Textual resources must be reusable. Because of the
cost of creating texts, it must be possible to use the texts in a
variety of settings with a variety of tools. To that end, a
standards-based encoding scheme must be the foundation of text
creation.
The Standard Generalized Markup Language (SGML), an
international standard, is such an encoding scheme, and it has
proven extremely valuable in effecting an open systems approach
with text. [1] This paper is not the place to present a detailed
argument for using SGML, especially when this has been done so
effectively elsewhere. [2] However, in addition to its value as
an internationally approved standard, SGML is ideally suited to
supporting text retrieval because it is a descriptive rather than
a procedural markup language. SGML is a language designed to
reflect the structure or function of text, rather than simply its
layout or typography. In a text retrieval system, portions of an
SGML document can be searched and retrieved, and functionally
different textual elements can be displayed in accordance with
their function.
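As a rough illustration of retrieval by function, the following Python sketch pulls a single entry out of a small marked-up glossary. The tag names (glossary, entry, term, def) are hypothetical and do not come from any real DTD, and the fragment is written as well-formed XML-style markup only because Python's standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary fragment; tag names are illustrative only.
GLOSSARY = """
<glossary>
  <entry id="g1"><term>anchor</term><def>An element that marks a hypertext link.</def></entry>
  <entry id="g2"><term>filter</term><def>A program that converts one markup into another.</def></entry>
</glossary>
"""

def find_entry(source, term):
    """Return the definition text of the entry whose <term> matches, or None."""
    root = ET.fromstring(source.strip())
    for entry in root.iter("entry"):
        # The markup says what each piece of text *is*, so the search can
        # target terms and return definitions without guessing from layout.
        if entry.findtext("term") == term:
            return entry.findtext("def")
    return None
```

Because the markup names the role of each element, the same fragment could just as easily be searched by definition text or displayed with terms in bold, without changing the source.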
The difficulty of designing an implementation of SGML to
meet a broad range of text processing needs in the humanities has
been met by the Text Encoding Initiative (TEI) in its Guidelines
for Electronic Text Encoding and Interchange. [3] The
application of SGML using the TEI Guidelines will play a central
role in ensuring that textual resources--particularly those
important to textual studies--are produced in a way that makes
them flexible and of continuing value. The TEI itself is a
collaborative project of the Association for Computers and the
Humanities, the Association for Computational Linguistics, and
the Association for Literary and Linguistic Computing. Its
purpose is the promulgation of guidelines for the markup of
electronic text for a variety of disciplines involved in the
study of text. In mid-1994, a comprehensive and detailed
two-volume set of guidelines was published. The print version of
the TEI Guidelines is an absolutely essential acquisition for
libraries; an electronic version has been made available by the
author. [4]
A central feature of SGML is the DTD (Document Type
Definition). The DTD is a codification of the possible textual
characteristics in a given document or set of documents. SGML
expresses the organization of a document without necessarily
using the file system paradigm (i.e., discrete files representing
the organizational components of a document). It expresses
textual features (e.g., footnotes, tables, and headings) and the
building blocks of content (e.g., paragraphs) using a descriptive
language focusing on the role of the element, rather than some
presumed display value. SGML is not a tag set, but a grammar,
with the "vocabulary"--or tags--of an individual document being
articulated in its DTD. Using this rigorous grammar, SGML can
both declare information about the document in a way that can be
transported with the document and can enforce rules in the
application of markup by aiding in "parsing" the document.
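The idea of a DTD as a grammar can be sketched very loosely in Python. The example below assumes two hypothetical element declarations, which in DTD notation might read `<!ELEMENT glossary - - (entry+)>` and `<!ELEMENT entry - - (term, def+)>`, and models each content model as a regular expression over child element names. A real SGML parser does far more (attributes, entities, tag minimization); this is only a toy check of element order.

```python
import re

# Hypothetical content models, written as regular expressions over
# space-terminated child element names. These are not from any real DTD.
CONTENT_MODELS = {
    "glossary": r"(entry )+",      # one or more entries
    "entry": r"(term )(def )+",    # a term followed by one or more definitions
}

def conforms(element, children):
    """Return True if the sequence of child names satisfies the element's model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(model, sequence) is not None
```

For instance, `conforms("entry", ["term", "def", "def"])` is true, while an entry whose definition precedes its term is rejected.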
The HyperText Markup Language (HTML), which is used with the
Web, is a form of SGML expressed by its own unique DTD. The
shape of the HTML DTD has changed significantly since first
articulated by researchers at CERN, and it continues to change
with the demands of the Web. [5] HTML was designed to facilitate
making documents available on the Web, and it expresses a variety
of features such as textual characteristics and hypertext links.
These hypertext links are HTML's most useful capability because
they allow authors to link documents to other resources
throughout the Internet, effectively making the Internet into a
large hypertext document.
2.2 CGI and FORM Use
The Web is far more than a server protocol for the transfer of
HTML documents. Among the many resources it offers in
facilitating sophisticated retrieval of information is the Common
Gateway Interface (CGI). Like HTML, CGI is in transition.
However, in its current state, it offers capabilities that allow
the Web to support much more complex documents and retrievals
than HTML alone supports. The Common Gateway Interface is a set
of specifications for external gateway programs to speak to the
Web's server protocol, HTTP. It allows the administrator to run
external programs from the Web server in such a way that requests
from the server return a desired document to the user or, more
typically, generate a document on the fly. This capability makes
it possible to provide uniform access to data structures or
servers that are completely independent of HTTP, including
structures such as those represented in SGML documents or Z39.50
servers. The CGI specification is available on the NCSA
documentation Web server. [6]
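A gateway program of this kind can be very small. The Python sketch below is not the Gateway's own code; it only shows the general shape of a CGI program that reads the standard QUERY_STRING environment variable and generates a document on the fly. The field name "query" follows the variable name used later in this paper; the page content is invented for illustration.

```python
import html
import os
import sys
from urllib.parse import parse_qs

def handle_request(environ):
    """Build a CGI response (header, blank line, body) from the request environment."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    term = params.get("query", [""])[0]
    # Escape user input before echoing it into the generated page.
    body = "<html><body><p>You searched for: %s</p></body></html>" % html.escape(term)
    # A CGI program writes a header block, a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # When run by a Web server, the request arrives in the process environment.
    sys.stdout.write(handle_request(os.environ))
```

Keeping the response logic in a function that takes the environment as an argument makes it possible to exercise the program without a running server.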
Closely associated with the CGI is the FORM specification,
which was first introduced with NCSA's Mosaic Web client. This
feature is a client-independent mechanism to submit complex
queries, usually through a graphical user interface.
FORM-compliant interfaces such as Mosaic, Lynx (a UNIX VT100
client), and OmniWeb (a NeXTStep client) use fill-out forms,
check boxes, and lists to mediate queries between the user and
CGI resources. Users respond by making selections that qualify
submissions to the server (e.g., checking a box to indicate that
a search is an author search) thereby making a complex
command-line syntax unnecessary. [7]
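What the client sends when a FORM is submitted is a set of name=value pairs. The Python sketch below shows that encoding and its decoding on the server side. The field name "type" and its value "author" are assumptions chosen to mirror the check-box example above; only "query" is a name used in this paper.

```python
from urllib.parse import parse_qs, urlencode

def encode_submission(fields):
    """Encode form selections the way a client submits them (name=value pairs)."""
    return urlencode(fields)

def decode_submission(encoded):
    """Recover a simple name -> value mapping from an encoded submission."""
    # parse_qs returns a list per name; this sketch keeps only the first value.
    return {name: values[0] for name, values in parse_qs(encoded).items()}

# A text box plus a checked "author search" box (hypothetical field names).
submission = encode_submission({"query": "wyf of bathe", "type": "author"})
```

The user never sees the encoded string; the FORM's text boxes, check boxes, and lists produce it, which is what removes the need for a command-line syntax.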
2.3 Computer Languages and CGI
CGI programs can be written in a variety of languages, including
UNIX shell scripts, C programs, and Perl. In fact, there are few
limitations on the type of language that can be used. Perl is
foremost among the options available to most Web administrators.
Largely the work of Larry Wall, Perl can be used to create
extremely fast and flexible programs with no practical limits on
the size of the material it can treat. Perl also has outstanding
support for the UNIX "regular expression," making it ideal for
text systems where one form of markup must be translated to
another. [8]
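The kind of regular-expression translation described here can be sketched in a few lines. The example is in Python rather than Perl, and the three rules are assumptions made for illustration: they are not the Gateway's actual filter rules, and a production filter would need to handle nesting, attributes, and entities more carefully.

```python
import re

# Illustrative rules only: each pair maps one SGML element to an HTML rendering.
RULES = [
    (re.compile(r'<hi rend="?italic"?>(.*?)</hi>', re.I | re.S), r"<i>\1</i>"),
    (re.compile(r"<head>(.*?)</head>", re.I | re.S), r"<h2>\1</h2>"),
    (re.compile(r"<l>(.*?)</l>", re.I | re.S), r"\1<br>"),
]

def sgml_to_html(text):
    """Apply each substitution rule in turn to translate markup."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Holding the rules in a table separate from the loop means a new element can be supported by adding one line, which suits a filter that is expected to keep changing.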
3.0 The Web-to-PAT Gateway
The modular approach taken in the Web-to-PAT Gateway separates
the operations of retrieval to allow one component (e.g., an
SGML-to-HTML filter) to be upgraded without affecting other
components. It should be emphasized that this separation of
operations grew out of local needs and that other approaches,
including an approach that combines all operations in a single
program, are possible. The four steps are:
1. FORM Handling
Users, with the aid of the FORM, submit a query.
2. CGI Query Handling
The query is received and translated to a PAT search.
3. PAT Result Handling
Information returned from PAT is transformed into lists
or entries that can be selected.
4. SGML-to-HTML Filtering
The richer SGML is transformed into HTML.
This multi-stage approach has many advantages. For example, it
is possible to use different programming languages or other
software tools for each processing stage, selecting them based on
their utility for particular functions or their ability to comply
with local requirements. In the approach documented here, HTML
FORMs, shell programs, C programs, and Perl programs have been
used for the four operations. Separating the functions also
allows persons with different responsibilities, skills, or
interests to manage the different processes. For example, a
system administrator might manage the second and third stages,
while someone responsible for more aesthetic issues in the
delivery might manage parts of the first and the fourth stages.
At the University of Virginia Library, SGML-to-HTML filters
continue to be enhanced by staff from the Library's Electronic
Text Center in a process completely separate from the development
of other parts of the interface.
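The separation of the four operations can be sketched as four small functions chained together. This Python example is a simplification under stated assumptions: the search stage is a stand-in that scans a two-line in-memory list instead of calling PAT, and all function names are hypothetical.

```python
import re
from urllib.parse import parse_qs

# Stand-in corpus: a real deployment would query PAT; this list is hypothetical.
CORPUS = [
    "<l>Whan that Aprille with his shoures soote</l>",
    "<l>The droghte of March hath perced to the roote</l>",
]

def handle_form(encoded):
    """Stage 1: extract the user's query from the submitted FORM data."""
    return parse_qs(encoded).get("query", [""])[0]

def run_search(term):
    """Stage 2: translate the query into a search (here, a simple scan)."""
    return [line for line in CORPUS if term.lower() in line.lower()]

def list_results(hits, size=100):
    """Stage 3: turn raw results into a bounded list of entries."""
    return hits[:size]

def filter_to_html(hits):
    """Stage 4: convert the SGML in each result to HTML."""
    items = [re.sub(r"</?l>", "", hit) for hit in hits]
    return "<ul>" + "".join("<li>%s</li>" % item for item in items) + "</ul>"

def gateway(encoded):
    """Chain the four stages; any one can be replaced independently."""
    return filter_to_html(list_results(run_search(handle_form(encoded))))
```

Because each stage only depends on the output of the one before it, the filter in stage 4 could be rewritten, or moved to another language, without touching the other three.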
3.1 HTML FORM for Query Submission
The use of an HTML FORM to handle query submission may be simple
or complex. The three examples given here demonstrate that
range: the Middle English FORM supports word and phrase searches;
the Oxford English Dictionary search provides a great deal of
information about the areas to be searched and information to be
retrieved; and the TEI Guidelines FORM allows users to browse the
document in a variety of ways, such as by chapter or other
section. (The Middle English and TEI Guidelines resources are
encoded in SGML.)
3.1.1 Middle English Query
The FORM created for Middle English materials was deliberately
made simple to allow users to retrieve keywords-in-context (KWIC)
without knowing commands such as those needed to view search
results. [9]
A search term is requested from the user and registered as
the variable "query." So that neither the user nor the system is
overwhelmed by large result sets, the size of result sets is
limited to 100 items, and an additional FORM option (registering
the variable "size") is included to help the user subsequently
move through the results 100 items at a time or to sample 100
items from the entire result set.
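The two ways of limiting a result set, paging and sampling, can be sketched as follows. This Python example is an assumption about how such limits might work: the paper does not say how the Gateway chooses its sample, so evenly spaced selection is used here purely for illustration.

```python
def page(results, start=0, size=100):
    """Return one window of results, beginning at position start."""
    return results[start:start + size]

def sample(results, size=100):
    """Return up to size results spread evenly across the whole set."""
    if len(results) <= size:
        return list(results)
    step = len(results) / size
    # Pick the result at each evenly spaced position.
    return [results[int(i * step)] for i in range(size)]
```

With 250 results, `page(results, 200)` returns the final 50 items, while `sample(results)` returns 100 items drawn from across all 250.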
3.1.2 OED Query
The richness of the Oxford English Dictionary (OED) is often
overwhelming even for sophisticated users. Most users do not
want keyword-in-context results and would prefer simple look up
capabilities. The OED is a complex product designed to
facilitate a broad array of activities. Consequently, even
simple searches require elaborate query structures.
The OED FORM assists users in submitting many of the most
commonly performed searches, including dictionary entry retrieval
with simple look ups and truncated term look ups (e.g., "photo"
for all words beginning with this stem). [10] It is also
possible to retrieve quotations by the quote's author (e.g.,
retrieval of all quotations authored by Chaucer). This process
includes the following:
1. In the FORM, the user submits a search term which is
captured as the variable "query."
2. The user selects the type of search. Many types of
searches are possible, including traditional look ups,
alphabetic browses, full-text searches, and quotation
retrieval.
3. Several other elements are used to limit the size of
results. As in the Middle English search FORM, a
default of no more than 100 results at a time may be
viewed from each search.
4. In addition, a variable called "period" is offered to
allow users to limit quotation searches by century.
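Two of the behaviors listed above, truncated term look up and limiting by century, can be illustrated in Python. This sketch does not reproduce PAT query syntax or the OED's data: the headword list is invented, and representing "period" as a century number is an assumption, since the paper does not say how the FORM encodes it.

```python
from bisect import bisect_left

# Hypothetical, sorted headword list standing in for the dictionary index.
HEADWORDS = sorted(["phosphor", "photo", "photograph", "photon", "phrase"])

def truncated_lookup(stem, headwords=HEADWORDS, limit=100):
    """Return headwords beginning with stem, at most limit of them."""
    matches = []
    # In a sorted list, all words sharing a prefix are adjacent.
    for word in headwords[bisect_left(headwords, stem):]:
        if not word.startswith(stem):
            break
        matches.append(word)
        if len(matches) == limit:
            break
    return matches

def in_period(year, century):
    """Return True if year falls in the given century (14 means 1301-1400)."""
    return (century - 1) * 100 < year <= century * 100
```

A quotation search limited to the fourteenth century would then keep only quotations whose date satisfies `in_period(year, 14)`.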
3.1.3 TEI Guidelines Query
The structured browsing of the TEI Guidelines adds another
important feature for mediating access to large or complex
collections. Users of the TEI Guidelines are as likely to want
to read a chapter or section as they are to want to search the
contents.
To facilitate this sort of browsing, an initial HTML page is
created containing the titles of the major SGML hierarchical
structures of the TEI Guidelines (e.g., the DIV0 element), and
each of these structures is linked to an HTML page containing the
titles of subsidiary structures (e.g., the DIV1 through DIV4
elements). [11]
For example, the top-level HTML page is linked to the
secondary HTML page for Part I of the TEI Guidelines. [12]
Figure 1 shows the top-level HTML page.
-----------------------------------------------------------------
Figure 1. Top-level HTML Page for the TEI Guidelines
-----------------------------------------------------------------
TEI Guidelines for Electronic Text Encoding and Interchange (P3)
You may also browse the Guidelines.
* Bibliographic header of the TEI Guidelines
* Preface
* Acknowledgments
* Changes from TEI P1 to TEI P3
* Part 1: Introduction
* Part 2: Core Tags and General Rules
* Part 3: Base Tag Sets
* Part 4: Additional Tag Sets
* Part 5: Auxiliary Document Types
* Part 6: Technical Topics
* Part 7: Alphabetical Reference List of Tags and
  Attributes
* Part 8: Reference Material
-----------------------------------------------------------------
Figure 2 presents the beginning of the HTML page for Part I of
the TEI Guidelines.
-----------------------------------------------------------------
Figure 2. Beginning of the HTML Page for Part I of the TEI
Guidelines
-----------------------------------------------------------------
Part I: Introduction
* 1: About these Guidelines
   + 1.1: Structure and Notational Conventions of this
     Document
      o 1.1.1: Structure
      o 1.1.2: Notational Conventions
   + 1.2: Underlying Principles and Intended Use
      o 1.2.1: Design Principles of the TEI Scheme
      o 1.2.2: Intended Use
         # 1.2.2.1: Use in Text Capture and Text
           Creation
         # 1.2.2.2: Use for Interchange
         # 1.2.2.3: Use for Local Processing
   + 1.3: Historical Background
      o 1.3.1: Origin and Development of the TEI
      o 1.3.2: Future Developments
* 2: A Gentle Introduction to SGML
   + 2.1: What's Special about SGML?
      o 2.1.1: Descriptive Markup
-----------------------------------------------------------------
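A page like the one in Figure 2 could be generated from a flat list of section records. The Python sketch below assumes each record carries a DIV level, an ID, and a title, and that each list item links to the component extraction program tei-tocs. The path "/cgi-bin/tei-tocs", the parameter names "div" and "id", and every ID other than "struct" are hypothetical; the actual URL format used at Virginia is not reproduced here.

```python
from html import escape
from urllib.parse import urlencode

# (level, id, title) records; IDs other than "struct" are invented.
ENTRIES = [
    (1, "ch1", "1: About these Guidelines"),
    (2, "struct", "1.1: Structure and Notational Conventions of this Document"),
    (3, "s111", "1.1.1: Structure"),
    (2, "princ", "1.2: Underlying Principles and Intended Use"),
]

def link(level, ident, title, program="/cgi-bin/tei-tocs"):
    """Build an anchor whose URL names the DIV level and ID to extract."""
    url = program + "?" + urlencode({"div": "DIV%d" % level, "id": ident})
    return '<a href="%s">%s</a>' % (escape(url), escape(title))

def render_toc(entries):
    """Emit nested lists, opening or closing a list as the level changes."""
    out, depth = [], 0
    for level, ident, title in entries:
        while depth < level:
            out.append("<ul>")
            depth += 1
        while depth > level:
            out.append("</ul>")
            depth -= 1
        out.append("<li>%s</li>" % link(level, ident, title))
    out.extend("</ul>" for _ in range(depth))
    # For brevity, sublists are nested directly inside their parent list.
    return "\n".join(out)
```

Generating the page from records means the table of contents can be rebuilt whenever the underlying document changes, rather than being maintained by hand.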
The URL for each list item in the Part I page contains the
information necessary to conduct a search and retrieve the
structural component being selected. For example, to retrieve
the section "Structure and Notational Conventions of this
Document," the first subsection of the first chapter in Part I,
the URL points to the component extraction program tei-tocs, and
it specifies that this is an ID "struct" at level DIV2 (i.e., the
section is bounded by the tags