CHAPTER II:

DOCUMENT RETRIEVAL SYSTEMS

2.0 Introduction

This chapter examines the basic concepts of document retrieval systems and defines major retrieval effectiveness measures such as precision and recall. It also discusses relevance feedback and clustering techniques which are used to enhance the effectiveness of document retrieval systems.

2.1 Overview of a Document Retrieval System

The principal function of a document retrieval system is to retrieve all relevant documents from a store of documents, while rejecting all others. A perfect document retrieval system would retrieve all and only relevant documents. In reality, the ideal document retrieval system does not exist. Document retrieval systems do not retrieve all and only relevant documents, and users may be satisfied with systems that rapidly retrieve a few relevant documents.

Maron (1984) provides a more detailed description of the document retrieval problem and depicts the logical organization of a document retrieval system (see Figure 2.1).

[Figure 2.1 The logical organization of a document retrieval system (Maron, 1984)]

As Fig. 2.1 suggests, the basic characteristics of each incoming document (e.g., author, title, and subject) are identified during the indexing process. Indexers may consult thesauri or dictionaries (controlled vocabularies) in order to assign acceptable index terms to each document. Consequently, an index record is constructed for each document for subsequent retrieval purposes.

A user can identify proper search terms by consulting these index tools during the query formulation process. After checking the validity of initial terms and identifying new ones, the user determines the most promising query terms (from the retrieval point of view) to submit to the system as the formal query. However, most users do not know about the tools they can use to express their information needs, which can result in search failures because of a mismatch between the user's vocabulary and the system's vocabulary.

In order for a document retrieval system to retrieve some documents from the database, two conditions must be satisfied. First, indexers must assign appropriate index terms to the documents. Second, users must correctly guess what the assigned index terms are and enter their search queries accordingly. Maron (1984) describes the search process as follows:

the actual search and retrieval takes place by matching the index records with the formal search query. The matching follows a rule, called "Retrieval Rule," which can be described as follows: For any given formal query, retrieve all and only those index records which are in the subset of records that is specified by that search query (p.155).

Thus, a document retrieval system consists of (1) a store of documents (or, representations thereof); (2) a user interface to allow users to interact with the system; (3) a retrieval rule which compares the representation of each user's query with the representations of all the documents in the store so as to identify the relevant documents in the store. It goes without saying that there should be a population of users each of whom makes use of the system to satisfy their information needs.
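This exhaustive matching of a query against every index record can be sketched for the simplest case, where both the formal query and each index record are plain sets of terms and a record is retrieved only if it contains every query term. The function and sample records below are illustrative, not drawn from any particular system:

```python
# Minimal sketch of a retrieval rule: retrieve every index record whose
# set of index terms contains all of the query terms. (Illustrative only.)

def retrieve(query_terms, index_records):
    """Return the ids of records whose index terms include every query term."""
    query = set(query_terms)
    return [doc_id for doc_id, terms in index_records.items()
            if query <= set(terms)]

# A toy store of index records (document id -> assigned index terms).
index_records = {
    "doc1": ["information", "retrieval", "systems"],
    "doc2": ["library", "catalogs"],
    "doc3": ["information", "needs", "users"],
}

print(retrieve(["information", "retrieval"], index_records))  # → ['doc1']
```

Note that the rule only decides membership; it says nothing about how useful a retrieved record will be to the user.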

The major components of an online document retrieval system are reviewed in more detail below.

2.2 Documents Database

The existence of a database of documents or document representations is a prerequisite for any document retrieval system. The term "document" is used here in its broadest sense and can be anything (books, tapes, electronic files, etc.) that carries information. The database can contain the full texts of documents as well as their "surrogates" (i.e., representations).

2.3 Indexing Documents

In order to create a database of documents or document representations, the properties of each document need to be identified and recorded. This process, which is called indexing, can be done either intellectually or automatically. In an environment where intellectual indexing is involved, professional indexers identify descriptive and topical characteristics of the documents and create a record (representation) for each document.

As Fig. 2.1 suggests, indexers can consult standard tools such as thesauri, dictionaries and controlled vocabulary lists. The Anglo-American Cataloguing Rules (AACR2) and the Library of Congress Subject Headings are used, among other tools, for the descriptive and topical analysis of documents, respectively. Indexers then record the document properties and assign subject headings to each document. The recorded descriptive and topical information constitutes the representation of the document, which will later provide access points for retrieval purposes.

Automatic indexing, wherein a machine is instructed to recognize and record the properties of documents, has also been used to create index records for retrieval purposes. For topical analysis, automatic indexing relies heavily on terms and keywords used in the full texts (or abstracts) of documents. Words that are useless for retrieval purposes, such as "the," "of" and "on," are ignored. Keywords are usually stemmed to their root forms in order to reduce the size of the dictionary of retrieval-worthy terms. The stemming process also enables the system to retrieve documents bearing variant forms of keywords.
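A minimal sketch of this kind of automatic indexing is shown below, with an illustrative stop word list and a deliberately crude suffix-stripping stemmer; a real system would use a proper algorithm such as Porter's stemmer.

```python
# Hypothetical automatic indexing: tokenize, drop stop words, and crudely
# stem keywords to root forms. Stop words and suffixes are illustrative.

STOP_WORDS = {"the", "of", "on", "a", "an", "and", "in", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping a minimal root length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:len(word) - len(suffix)]
    return word

def index_terms(text):
    """Produce a sorted set of stemmed, stop-word-free index terms."""
    tokens = [w.strip(".,;:").lower() for w in text.split()]
    return sorted({stem(w) for w in tokens if w and w not in STOP_WORDS})

print(index_terms("The clustering of documents in retrieval systems"))
# → ['cluster', 'document', 'retrieval', 'system']
```

Because "clustering" and "clusters" both stem to "cluster", a query on either variant would match the same index term.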

Once the index records are created, the document database will be ready for interrogation by users. The raison d'être of designing a document retrieval system by creating a database of index records is, of course, to serve the information needs of its potential users. We now turn our attention to users' queries and review how the users approach document retrieval systems.

2.4 Query Formulation Process

The query formulation process involves the articulation and formulation of a search query, which is by no means a trivial task. Well-articulated search statements require some knowledge on the user's part. Yet users may not be knowledgeable enough to articulate what they are looking for. Hjerppe considers this the fundamental paradox of information retrieval: "The need to describe that which you do not know in order to find it" (Hjerppe, 1986; cited in Larson, 1991a, p.147).

First-time users of document retrieval systems usually act cautiously and tend to enter relatively broad search queries. As the database characteristics (e.g., the number of records and the collection concentration) are not known in the beginning, they try to reconcile their mental models of the system with reality. Sometimes the reverse may be the case: users may come up with very specific search queries, thinking that the catalog should answer all types of search queries no matter how specific or how broad they happen to be.

As can be seen from Fig. 2.1, dictionaries, thesauri, printed manuals and subject headings lists can be consulted in the course of query formulation process. In addition, some systems offer online help and on-screen instructions to facilitate the query formulation process.

2.5 Formal Query

Once the user's information need is articulated using natural language, a "formal" query statement should be submitted to the system. The syntax of the formal query statement may vary from system to system. In most cases, strict syntactic rules of the command and query languages must be observed in order to enter a formal search statement. Few systems, on the other hand, accept search statements entered in natural language.

Constructing formal query statements is not an easy task. Users must be aware of the existence of a command language and the required commands. In addition, they ought to have some intellectual understanding of how the search query is constructed according to the specifications of the query language. For instance, constructing relatively complex formal query statements using Boolean logic troubles most users.
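The Boolean constructions that trouble users can be illustrated with a small query evaluator. The nested-tuple query representation below is an assumption made for illustration; it is not the command language of any particular system:

```python
# Illustrative evaluator for simple Boolean queries (AND/OR/NOT) against
# a set of index terms. Queries are nested tuples: (operator, operands...).

def matches(query, terms):
    """Return True if the set of index terms satisfies the Boolean query."""
    op, operands = query[0], query[1:]
    if op == "TERM":
        return operands[0] in terms
    if op == "AND":
        return all(matches(q, terms) for q in operands)
    if op == "OR":
        return any(matches(q, terms) for q in operands)
    if op == "NOT":
        return not matches(operands[0], terms)
    raise ValueError(f"unknown operator: {op}")

# information AND retrieval AND (NOT catalogs)
query = ("AND", ("TERM", "information"), ("TERM", "retrieval"),
         ("NOT", ("TERM", "catalogs")))

print(matches(query, {"information", "retrieval", "systems"}))  # → True
print(matches(query, {"information", "catalogs"}))              # → False
```

Even this toy example hints at the difficulty: the user must understand operator nesting and the non-intuitive semantics of NOT before the query behaves as intended.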

2.6 The User Interface

Each system is equipped with a user interface which accepts user-entered formal search statements and converts them to a form that will be "understood" by the search and retrieval system. In other words, communication between the system and its users takes place by means of the user interface.

More specifically, the functions of a user interface can be summarized as follows: a) allowing users to enter search queries using either the natural language or the query language provided; b) evaluating the user's query (e.g., parsing, stemming); c) converting it to a form which will be understood by the document retrieval system and submitting the search query to the system; d) displaying the retrieval results; e) gathering feedback from the user as to the relevance of records and reevaluating the original query; and, f) dispensing helpful information (about the system, the usage, the database, and so on).

There are several ways in which users can express their search queries and activate the system (Shneiderman, 1986; Bates, 1989a). The types of user interfaces range from voice input to touch-sensitive screens, from command languages to graphical user interfaces (GUIs), and from menu systems to fill-in-the-blank-type user interfaces. Although the use of voice as input in current document retrieval systems is still in its infancy, other types of user interfaces have been in use for a while. Some are more commonly used than the others. Yet whatever the type of interface used, there is always a "learning curve" involved. To put it differently, users have to master the mechanics of interfaces before they can successfully communicate with the document retrieval systems, submit their search queries and get retrieval results.

Note that an interface is a conduit to the wealth of information that is available in the document database. As far as users are concerned, this conduit should allow every one to tap into the resources regardless of their background and expertise, the amount of information they want, the complexity of the database or the query language, and so on. Mooers' law is also applicable to user interfaces:

An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it (Mooers, 1960, p.ii, original emphasis).

It is, perhaps, not too much to suggest that "document retrieval systems will tend not to be used whenever it is more painful and troublesome for patrons to use a poorly designed user interface than not to use it."

2.7 Retrieval Rules

The decisive point in the overall document retrieval process is the interpretation of user's query terms for retrieval purposes. Representation of the formal search requests are matched against that of documents in the database so as to retrieve the record(s) that are likely to satisfy the users' information needs. Thus, the quality of the search outcome hinges very much on the retrieval rule(s) applied in this matching process. Retrieval rules determine which records are to be retrieved and which ones are not.

2.7.1 The Use of Clustering in Document Retrieval Systems

It is important, however, to examine a technique that comes before the application of retrieval rules: document clustering.

During earlier document retrieval experiments it was suggested that it would be more effective to cluster/classify documents before retrieval. If it is at all possible to cluster similar documents together, it was thought, then it would be sufficient to compare the query representation with only the cluster representations in order to find all the relevant documents in the collection. In other words, comparison of the query representation with the representations of each and every document in the collection would no longer be necessary. Undoubtedly, faster retrieval of information with less processing seemed attractive.

Van Rijsbergen (1979) emphasizes the underlying assumption behind clustering, which he calls the "cluster hypothesis," as follows: "closely associated documents tend to be relevant to the same requests" (p.45, original emphasis). The cluster hypothesis has been validated: it was shown empirically that the retrieval effectiveness of a document retrieval system can be improved by grouping similar documents together with the aid of document clustering methods (Van Rijsbergen, 1979). In addition to increasing the number of documents retrieved for a given query, document clustering methods proved to be cost-effective as well. Once clustered, documents are no longer dealt with individually but as groups for retrieval purposes, thereby cutting down processing costs and time. Van Rijsbergen (1979) and Salton (1971b) provide detailed accounts of the use of clustering in document retrieval systems.

"Cluster" here means a group of similar documents. The number of documents in a typical cluster depends on the characteristics of the collection in question as well as the clustering algorithm used. Collections consisting of documents in a wide variety of subjects tend to produce many smaller clusters whereas collections in a single field may generate relatively fewer but larger clusters. The clustering algorithm in use can also influence the number and size of the clusters. For instance, some 8,400 clusters have been created for a collection of more than 30,000 documents in Library and Information Studies (Larson, 1989).

Document clustering is based on a measure of similarity between the documents to be clustered. Several clustering algorithms, built on different similarity measures such as the Cosine, Dice, and Jaccard coefficients, have been developed in the past (Salton & McGill, 1983; Van Rijsbergen, 1979). Keywords in the titles, subject headings, and full texts of the documents are the "objects" most commonly used to cluster closely associated documents together. In other words, if two documents have the same keywords in their titles and/or were assigned similar subject heading(s), a clustering algorithm will bring them together.
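The three coefficients named above can be computed over binary term sets as follows; the two document representations are illustrative:

```python
# Cosine, Dice, and Jaccard coefficients over binary (set-valued) document
# representations. Sample documents are illustrative.

import math

def cosine(a, b):
    """|a ∩ b| / sqrt(|a| * |b|) for binary term sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def dice(a, b):
    """2|a ∩ b| / (|a| + |b|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

d1 = {"information", "retrieval", "systems"}
d2 = {"information", "retrieval", "evaluation"}

print(round(cosine(d1, d2), 3))   # → 0.667
print(round(dice(d1, d2), 3))     # → 0.667
print(round(jaccard(d1, d2), 3))  # → 0.5
```

A clustering algorithm would compute such a coefficient for document pairs and group together those whose similarity exceeds some threshold.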

More recently, Larson (1991a) has successfully used classification numbers to cluster similar documents together. He argues that the use of classification for searching in document retrieval systems has been limited. The class number assigned to a document is generally seen as another keyword. Documents with identical class numbers are treated individually during the searching process. Yet, documents that were assigned the same or similar class numbers will most likely be relevant for the same queries. Like subject headings, "classification provides a topical context and perspective on a work not explicit in term assignments" (Larson, 1991a, p.152; see also Chan, 1986c, 1989; Svenonius, 1983; Shepherd, 1981, 1983). The searching behavior of the users as they search through the book shelves seems to support the above idea and suggests that more clever use of classification information should be implemented in the existing online library catalogs (Hancock-Beaulieu, 1987, 1990).

"Classification clustering method" can improve retrieval effectiveness during the retrieval process. Based on the presence of classification numbers, documents with the same classification number can be brought together along with the most frequently used subject headings in a particular cluster. Thus, these documents will be retrieved as a single group whenever a search query matches the representation of documents in that cluster.

2.7.2 Review of Retrieval Rules

Several retrieval rules are used to determine whether there is a match between search query terms and index terms. Blair (1990) lists no fewer than 12 different retrieval rules (which he calls "models") and discusses each in turn in considerable detail. Table 2.1 provides a brief summary of the retrieval rules discussed in Blair (1990).

 

Table 2.1 Summary of Retrieval Rules
Source: Compiled from Blair (1990), Chapter II.
Model 1
  Search request: A single query term
  Documents: Assigned one or more index terms
  Retrieval rule: If the term in the request is a member of the terms assigned to a document, then the document is retrieved

Model 2
  Search request: A set of query terms
  Documents: A set of index terms
  Retrieval rule: The document is retrieved if all the terms in the request are in the index record of the document

Model 3
  Search request: A set of query terms plus a "cut-off" value
  Documents: A set of one or more index terms
  Retrieval rule: The document is retrieved if it shares a number of terms with the request that exceeds the cut-off value

Model 4
  Search request: Same as Model 3
  Documents: Same as Model 3
  Retrieval rule: Documents sharing more than the specified number of terms with the request are ranked in order of decreasing overlap

Model 5 (Weighted Requests)
  Search request: A set of query terms, each with an associated positive number (weight)
  Documents: Same as Model 3
  Retrieval rule: Documents are ranked in decreasing order of the sum of the weights of terms common to the request and the index record

Model 6 (Weighted Indexing)
  Search request: A set of query terms
  Documents: A set of index terms, each with an assigned positive number (weight)
  Retrieval rule: Documents are ranked in decreasing order of the sum of the weights of terms common to the request and the index record

Model 7 (Weighted Requests and Indexing)
  Search request: Same as Model 5
  Documents: Same as Model 6
  Retrieval rule: Documents are ranked by the sum of products, each obtained by multiplying the weight of a term in the request by the weight of the same term in the index record

Model 8 (Cosine Rule)
  Search request: Same as Model 5
  Documents: Same as Model 6
  Retrieval rule: The weights of the terms common to the request and an index record are treated as vectors; the value of a retrieved document is the cosine of the angle between the vectors

Model 9 (Boolean Requests)
  Search request: Any Boolean combination of query terms with AND, OR, and NOT
  Documents: A set of one or more index terms
  Retrieval rule: (i) AND: retrieve only documents that match all terms in the request; (ii) OR: retrieve documents that match any term in the request; (iii) NOT: retrieve all documents that do not match any term in the request

Model 10 (Full Text Retrieval)
  Search request: Same as Model 9
  Documents: The entire text of the documents is searchable (except stop words)
  Retrieval rule: Same as Model 9, with adjacency operators

Model 11 (Simple Thesaurus)
  Search request: Single terms
  Documents: A set of one or more index terms
  Retrieval rule: The request term is looked up in an online thesaurus and semantically related terms are added to the request term

Model 12 (Weighted Thesaurus)
  Search request: Single terms
  Documents: A set of one or more index terms
  Retrieval rule: The request term is looked up in an online thesaurus and semantically related terms above a given cut-off value (weight) are added (disjunctively) to the request term; the cut-off value may be given by the inquirer

 

Retrieval rules listed in Table 2.1 can be categorized under three broad groups: (1) exact matches between query term(s) and index terms, along with Boolean retrieval rules (Models 1-4, 9-12); (2) probabilistic retrieval rules (Models 5-7); and (3) the vector space model (Model 8).

In group 1, indexing and query terms are binary: i.e., a term is either assigned to a document (or included in a search query) or not. Each term is equally important for retrieval purposes. Cut-off values can be introduced for multi-term search requests (Models 3 and 4). Search terms can be expanded by adding related terms from a thesaurus (Models 11 and 12). Retrieved records can be weakly ordered (retrieved or not) (Models 1-3, 12), or they can be ranked on the basis of the number of matching terms in the search query and index record (Model 4). Relationships between search terms can be defined using Boolean logic (e.g., retrieve only those documents whose index records contain both search terms A and B) (Models 9 and 10). The Boolean search model is believed to be "the most popular retrieval design for computerized document retrieval systems" (Blair, 1990, p.44).

Retrieval rules under group 2 call for weighted search terms (Model 5), weighted index terms (Model 6), or both weighted search and index terms (Model 7). In other words, the significance of a given term for retrieval purposes can be specified by the user. Retrieved records are ranked on the basis of the strength of the match between search and index terms. Retrieval rules in this category are known as probabilistic retrieval models.

The vector space model (Model 8) in group 3 is, in a way, similar to Model 7 in that both search and index terms are weighted and the retrieved records are ranked. However, search and index terms in the vector space model are treated as vectors in an n-dimensional space, and the strength of the match (i.e., the ranking) is determined by calculating the cosine of the angle between the search and index vectors. Document retrieval systems utilizing the vector space model, notably SMART, have been in use since the early 1960s.
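A minimal sketch of the cosine ranking rule (Model 8), with illustrative term weights:

```python
# Vector space ranking sketch: queries and index records are dictionaries
# mapping terms to positive weights; documents are ranked by the cosine of
# the angle between query and document vectors. Weights are illustrative.

import math

def cosine_rank(query, index):
    """Rank document ids in the index by cosine similarity to the query."""
    def cos(doc):
        dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
        norm = (math.sqrt(sum(w * w for w in query.values())) *
                math.sqrt(sum(w * w for w in doc.values())))
        return dot / norm if norm else 0.0
    return sorted(index, key=lambda doc_id: cos(index[doc_id]), reverse=True)

query = {"information": 1.0, "retrieval": 1.0}
index = {
    "d1": {"information": 2.0, "retrieval": 1.0},
    "d2": {"library": 1.0, "catalogs": 2.0},
    "d3": {"retrieval": 1.0, "evaluation": 1.0},
}

print(cosine_rank(query, index))  # → ['d1', 'd3', 'd2']
```

Unlike Boolean retrieval, every document receives a similarity score, so partially matching documents such as d3 are ranked rather than rejected outright.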

So far, the major components of a conventional document retrieval system have been reviewed from the following points of view: the document database, query formulation, and retrieval rules. The ultimate objective of a document retrieval system, regardless of which retrieval rule is used, is to retrieve records that best match the user's information needs. Hence, what matters most to the user is the retrieval results (i.e., retrieval effectiveness). The primary measures of retrieval effectiveness are reviewed below.

2.8 Measures of Retrieval Effectiveness

Several different measures are used to evaluate the retrieval effectiveness of document retrieval systems. A few measures that are widely used in the study of search failures such as precision and recall are discussed below. Other retrieval effectiveness measures suggested in the literature are not reviewed here as they are seldom, if ever, used in the analysis of search failures.

Online document retrieval systems often retrieve some non-relevant documents while missing, at the same time, some relevant ones. Blair (1990) summarizes the retrieval process as follows:

Because information retrieval is essentially a trial and error process, almost any search for documents on an information retrieval system can be expected to retrieve not only useful (or relevant) documents, but also a varying proportion of useless (non-relevant) documents. This uncertainty in the searching process has another consequence: even when useful documents are retrieved from a data base, more useful documents may remain unretrieved despite the inquirer's most persistent efforts. As a result, after any given search the documents in the database can be classified in any four different ways:

Retrieved and relevant (useful)

Retrieved and not relevant (useless)

Not retrieved and relevant [missed]

Not retrieved and not relevant (p.73-74).

He provides a figure representing these four classes of documents:

Figure 2.2 A Representation of the Output
(Source: Blair (1990, p.76).)
                   RELEVANT        NOT RELEVANT
  RETRIEVED            x                u          Total retrieved = n1
  NOT RETRIEVED        v                y
                Total relevant = n2

Based on the above figure, the following retrieval effectiveness measures can be defined:

Precision = x/n1

Recall = x/n2

Fallout = u/(u+y)

where

x = number of relevant documents retrieved,

n1 = number of documents retrieved (x+u in Fig. 2.2),

n2 = total number of relevant documents in the collection (x+v in Fig. 2.2),

u = number of non-relevant documents retrieved,

y = number of non-relevant documents not retrieved.

Precision and recall are generally used in tandem in evaluating retrieval effectiveness in document retrieval systems. "Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved" (Van Rijsbergen, 1979, p.10, original emphasis). For instance, if, for a particular search query, the system retrieves two documents (n1) and the user finds one of them relevant (x), then the precision ratio for this search would be 50% (x/n1).

Recall is considerably more difficult to calculate than precision because it requires finding relevant documents that will not be retrieved during users' initial searches (Blair & Maron, 1985, p.291). "Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents (both retrieved and not retrieved)" in the collection (Van Rijsbergen, 1979, p.10, original emphasis). Take the above example. The user judged one of the two retrieved documents to be relevant. Suppose that later three more relevant documents (v) that the original search query failed to retrieve were found in the collection. The system retrieved only one (x) out of the four (n2) relevant documents from the database. The recall ratio would then be equal to 25% for this particular search (x/n2).

Blair and Maron (1985) point out that "Recall measures how well a system retrieves all the relevant documents, and Precision, how well the system retrieves only the relevant documents" (p.290).

Fallout is another measure of retrieval effectiveness. Fallout can be defined as the ratio of non-relevant documents retrieved (u) over all the non-relevant documents in the collection (u+y). Fallout "measures how well a system rejects non-relevant documents" (Blair, 1990, p.116). The earlier example can also be used to illustrate fallout. The user judged one of the two retrieved documents as relevant, and, later, three more relevant documents that the original query missed were identified. Further suppose that there are nine documents in the collection altogether (four relevant plus five non-relevant documents). Since the user retrieved one non-relevant document (u) out of a total of five non-relevant ones (u+y) in the collection, the fallout ratio would be 20% for this search (u/(u+y)).
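The three measures can be computed directly from the counts of the worked example above (one relevant and one non-relevant document retrieved, three relevant documents missed, four non-relevant documents not retrieved):

```python
# Precision, recall, and fallout from the four counts in Fig. 2.2.

def effectiveness(x, u, v, y):
    """x: retrieved & relevant; u: retrieved & not relevant;
    v: relevant but not retrieved; y: not relevant & not retrieved."""
    precision = x / (x + u)   # x / n1
    recall = x / (x + v)      # x / n2
    fallout = u / (u + y)
    return precision, recall, fallout

p, r, f = effectiveness(x=1, u=1, v=3, y=4)
print(p, r, f)  # → 0.5 0.25 0.2
```

The output matches the 50% precision, 25% recall, and 20% fallout ratios derived in the text.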

2.9 Relevance Feedback Concepts

It was mentioned earlier (section 2.1) that a document retrieval system should have some kind of user interface which allows users to interact with the system. Furthermore, the functions of a user interface were given (section 2.6) and it was stated that one of the functions of the user interface is to make various forms of feedback possible between the user and the document retrieval system.

As users scarcely find what they want in a single try, the feedback function deserves further explication. Retrieval rules, in and of themselves, do not guarantee that retrieved records will be of importance to the user. The user interface may prompt users as to what to do next or suggest alternative strategies by way of system-generated feedback messages (i.e., help screens, status of search, actions to take). More importantly, the system may allow users to modify their search queries in light of a sample retrieval so that search success can be improved in subsequent retrieval runs (Van Rijsbergen, 1979). Some systems may automatically modify the original search query after the user has made relevance judgments on the documents retrieved in the first try. This is known as "relevance feedback," and it is the relevance feedback process that concerns us here.

Swanson (1977) examined some well-known information retrieval experiments and the measures used therein. He suggested that the design of document retrieval systems "should facilitate the trial-and-error process itself, as a means of enhancing the correctability of the request" (p.142).

Van Rijsbergen (1979) shared the same view when he pointed out that: "a user confronted with an automatic retrieval system is unlikely to be able to express his information need in one go. He is more likely to want to indulge in a trial-and-error process in which he formulates his query in the light of what the system can tell him about his query" (p.105).

Van Rijsbergen (1979) also lists the kind of information that could be of help to users when reformulating their queries such as the occurrence of users' search terms in the database, the number of documents likely to be retrieved by a particular query with a small sample, and alternative and related search terms that can be used for more effective search results.

Relevance feedback is one of the tools that facilitates the trial-and-error process by allowing the user to interactively modify his or her query based on search results obtained during the initial run. The following quotation summarizes the relevance feedback process very well:

It is well known that the original query formulation process is not transparent to most information system users. In particular, without detailed knowledge of the collection make-up, and of the retrieval environment, most users find it difficult to formulate information queries that are well designed for retrieval purposes. This suggests that the first retrieval operation should be conducted with a tentative, initial query formulation, and should be treated as a trial run only, designed to retrieve a few useful items from a given collection. These initially retrieved items could then be examined for relevance, and new improved query formulations could be constructed in the hope of retrieving additional useful items during subsequent search operations (Salton & Buckley, 1990, p.288).

Relevance feedback was first introduced over 20 years ago during the SMART information retrieval experiments (Salton, 1971b). Earlier relevance feedback experiments were performed on small collections (e.g., 200 documents) where the retrieval performance was unusually high (Rocchio, 1971a; Salton, 1971a; Ide, 1971). (For the use of relevance feedback technique in online catalogs, see, for instance, Porter, 1988; Walker, S. & de Gere, 1990; Larson, 1989, 1991a; Walker, S. & Hancock-Beaulieu, 1991.)

It was shown that relevance feedback markedly improved retrieval performance. Recently Salton and Buckley (1990) examined and evaluated twelve different feedback methods "by using six document collections in various subject areas for experimental purposes." The collection sizes they used varied from 1,400 to 12,600 documents. The relevance feedback methods produced improvements in retrieval performance ranging from 47% to 160%.

The relevance feedback process offers the following main advantages:

1. It shields the user from the details of the query formulation process, and permits the construction of useful search statements without intimate knowledge of collection make-up and search environment.

2. It breaks down the search operation into a sequence of small search steps, designed to approach the wanted subject area gradually.

3. It provides a controlled query alteration process designed to emphasize some terms and to deemphasize the others, as required in particular search environments (Salton & Buckley, 1990, p.288).

The relevance feedback process helps in refining the original query and finding more relevant materials in the subsequent runs. The true advantage gained through the relevance feedback process can be measured in two different ways:

1) By changing the ranking of documents and moving the documents judged relevant by the user up in the ranking. With this method, documents that have already been seen (and judged relevant) by the user will still be retrieved in the second try, although they are ranked somewhat higher this time. "This occurs because the feedback query has been constructed so as to resemble the previously obtained relevant items" (Salton & Buckley, 1990, p.292). This effect is called the "ranking effect" (Ide, 1971), and it is difficult to distinguish this artificial ranking effect from the true feedback effect (Salton & Buckley, 1990). Note that users may not want to see these documents a second time because they have already seen them during the initial retrieval.

2) By eliminating the documents that have already been seen by the user in the first retrieval and "freezing" the document collection at this point for the second retrieval. In other words, documents that were judged as being relevant (or nonrelevant) during the initial retrieval will be excluded in the second retrieval, and the search will be repeated only on the frozen part of the collection (i.e., the rest of the collection from which user has seen no documents yet). This is called "residual collection" method and it ". . . depresses the absolute performance level in terms of recall and precision, but maintains a correct relative difference between initial and feedback runs" (Salton & Buckley, 1990, p.292).

The different relevance feedback formulae are based on the variations of these two methods. More detailed information on relevance feedback formulae can be found in Salton and Buckley (1990). For mathematical explications of relevance feedback process, see Rocchio (1971a), Ide (1971), and, Salton and Buckley (1990).
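One widely cited form of query modification, in the spirit of Rocchio (1971a), moves the query vector toward the centroid of documents judged relevant and away from the centroid of those judged non-relevant. The coefficient values and term weights below are illustrative only:

```python
# Rocchio-style query modification sketch: queries and documents are
# dictionaries mapping terms to weights. Coefficients are illustrative.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector:
    alpha*query + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)  # negative weights are usually dropped
    return new_query

q = {"retrieval": 1.0}
rel = [{"retrieval": 1.0, "feedback": 1.0}]
nonrel = [{"retrieval": 1.0, "catalogs": 1.0}]
print(rocchio(q, rel, nonrel))
```

Terms from relevant documents ("feedback") enter the new query with positive weight, while terms appearing only in non-relevant documents ("catalogs") are suppressed, so the next retrieval run favors documents resembling those already judged relevant.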

The relevance feedback process works in practice as follows: a user submits a search query to the system with relevance feedback capabilities and retrieves some documents. When bibliographic records of retrieved documents are displayed one by one to the user, he or she is asked to judge each retrieved document as being relevant or nonrelevant. The user proceeds by making relevance judgments for each displayed record. These relevance judgments will be used to improve the search results should the user decide to perform a relevance feedback search. The system revises and modifies the original query based on the documents judged as being relevant during the first retrieval. In other words, the relevance feedback process enables the system to "understand" the user's query better: the documents that are similar to the query are rewarded by being assigned higher ranks, while dissimilar documents are pushed farther down in the ranking. As a result, the system comes up with potentially more relevant documents.

The relevance feedback search can be iterated as many times as the user desires, until the user is satisfied with the search results. However, the relevance feedback technique requires more work from the user, who is known to be willing to invest only minimal effort.

2.10 Summary

The major components of a document retrieval system are examined in this chapter. The importance of indexing and query formulation processes are discussed along with the roles of user interfaces and retrieval rules. Some of the more advanced information retrieval techniques such as relevance feedback and clustering are also briefly addressed. A critical review of the major studies related to the present study is given in Chapter III.
