THE ANALYSIS OF SEARCH FAILURES IN ONLINE LIBRARY CATALOGS
by
Yasar Tonta
A research proposal
(Draft 3.0)
22 September 1991
Berkeley
CONTENTS
1. Introduction
2. Background to the Research Problem
3. Overview of a Document Retrieval System
4. Relevance Feedback Concepts
5. The Use of Clustering in Document Retrieval Systems
5.1 Classification Clustering Method
6. Failure Analysis in Document Retrieval Systems
6.1 Retrieval Effectiveness Measures
6.2 What is Search Failure?
6.3 Review of Failure Analysis Studies
6.4 Use of the Critical Incident Technique in Failure Analysis Studies
7. The Present Study
7.1 Objectives of the Study
7.2 Hypotheses
8. The Experiment
8.1 The Environment
8.1.1 The System
8.1.2 Search Queries
8.1.3 Relevance Judgments
8.1.4 Evaluation of Retrieval Effectiveness
8.2 Subjects
8.3 Data Gathering Tools
8.4 Data Analysis and Evaluation Methodology
8.5 Data Gathering, Analysis and Evaluation Procedure
8.6 Expected Results
9. Select Bibliography
10. Attachments
1. Introduction
This study will investigate the causes of search failures in online library catalogs. It is particularly concerned with the evaluation of retrieval performance in online library catalogs from the users' perspective. The research will make new contributions to the study of online catalogs. The critical incident technique will be used for the first time to examine search failures in online library catalogs. User-designated ineffective searches in an experimental online catalog will be compared with transaction log records in order to identify the possible causes of search failures. We will investigate the mismatch between users' vocabulary and the vocabulary used in online library catalogs to find out its role in search failures and retrieval effectiveness. A taxonomy of search failures in online library catalogs will be developed.
This study will evaluate the retrieval performance of an experimental online catalog by: (1) using precision/recall measures; (2) identifying user-designated ineffective searches; and (3) comparing user-designated ineffective searches with the precision/recall ratios for the corresponding searches.
This research will also discuss the roles of index languages, natural language query interfaces and retrieval rules in online library catalogs.
The findings of this study can be used in designing better online library catalogs. Designers equipped with information about search failures should be able to develop more robust and "fail-proof" online catalogs. Search failures due to vocabulary problems can be minimized by strengthening the existing indexing languages and/or by developing "entry vocabulary systems" to relate users' terms to systems' terms. The taxonomy to be developed can be used in other studies of search failures in online library catalogs. From the methodological point of view, using the critical incident technique may prove invaluable in studying search failures and evaluating retrieval performance in online library catalogs.
2. Background to the Research Problem
A perfect document retrieval system should retrieve all and only relevant documents. Yet online document retrieval systems are not perfect. They retrieve some non-relevant documents while missing, at the same time, some relevant ones.
The observation just mentioned, which is based on the results of several information retrieval experiments, also summarizes well the two general types of search failures that frequently occur in online document retrieval systems: (1) non-relevant documents retrieved, and (2) relevant documents not retrieved. The former type of failure is known as precision failure: the system fails to retrieve only relevant documents. The latter is known as recall failure: the system fails to retrieve all relevant documents.
The two concepts, precision and recall, come from information retrieval experiments and are the most widely used measures for evaluating retrieval effectiveness in online document retrieval systems. Precision is defined as the proportion of retrieved documents which are relevant, whereas recall is defined as the proportion of relevant documents retrieved (Van Rijsbergen, 1979).
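In symbols, letting A denote the set of relevant documents in the collection and B the set of documents retrieved for a given query, these standard definitions can be written (in LaTeX notation) as:

    \text{precision} = \frac{|A \cap B|}{|B|}, \qquad
    \text{recall} = \frac{|A \cap B|}{|A|}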
In order for a document retrieval system to retrieve some documents from the database two conditions must be satisfied. First, documents must be assigned appropriate index terms by indexers. Second, users must correctly guess what the assigned index terms are and enter their search queries accordingly (Maron, 1984). These two conditions also explain the main causes of search failures in document retrieval systems; namely, the problems encountered during indexing and query formulation processes.
Recall failures occur mainly because certain index terms do not get assigned to certain documents even though some users would look under those index terms in order to retrieve the kinds of documents they want. If this is the case, then the system will fail to retrieve all the relevant documents in the database. Recall failures are difficult to detect, especially in large-scale document retrieval systems.
Precision failures, on the other hand, are more complicated than recall failures, although they are much simpler to detect. Precision failures occur when the user finds some retrieved documents non-relevant even though those documents were assigned the very index terms the user asked for in his/her search query. In other words, users do not necessarily agree with indexers as to the relevance of certain documents just because indexers happened to have assigned the terms selected by users. Relevance is defined as a "relationship between a document and a person in search of information" and it is a function of a large number of variables concerning both the document (i.e., what it is about, its currency, language, date) and the person (i.e., the person's education, beliefs, etc.) (Robertson, Maron and Cooper, 1982).
Other factors such as ineffective user-system interfaces, index languages used, and retrieval rules can also cause search failures in document retrieval systems. In a landmark study, Lancaster (1968) provided a detailed account of search failures that occurred in MEDLARS (Medical Literature Analysis and Retrieval System) along with the status of MEDLARS' retrieval effectiveness. More recently, Blair and Maron (1985), having conducted a retrieval effectiveness test on a full text document retrieval system, explicated the probable causes of recall failures attained in their study.
Although the causes of precision and recall failures can be explained relatively straightforwardly, detailed intellectual analysis of the reasons why these two kinds of failures occur in document retrieval systems is rarely conducted.
Users' knowledge (or lack thereof) of controlled vocabularies and query languages also causes a great many search failures and much frustration. Most users are not aware of the role of controlled vocabularies in document retrieval systems. They do not seem to understand (why should they?) the structure of rigid indexing and query languages. Consequently, users' natural language-based search query terms often fail to match the titles and subject headings of the documents, thereby causing some search failures. "Brittle" query languages based on Boolean logic tend to exacerbate this situation further, especially for complicated search queries requiring the use of Boolean operators.
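As a minimal illustration of this brittleness (a Python sketch; the catalog titles and query terms below are invented for the example), a strict Boolean AND match returns nothing as soon as one of the user's natural-language terms fails to match the catalog's wording:

    # Hypothetical illustration of a "brittle" Boolean AND match.
    # Titles and query terms are invented for the example.
    records = {
        1: "dolphins and porpoises: studies in cetacean intelligence",
        2: "the behavior of marine mammals",
    }

    def boolean_and_search(query_terms, records):
        """Return ids of records whose title contains every query term."""
        return [rid for rid, title in records.items()
                if all(term in title.split() for term in query_terms)]

    # The user's phrasing ("smart dolphins") does not match the catalog's
    # wording ("cetacean intelligence"), so the AND query retrieves nothing.
    print(boolean_and_search(["smart", "dolphins"], records))          # []
    print(boolean_and_search(["cetacean", "intelligence"], records))   # [1]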
Notwithstanding the circumstantial evidence gathered through various online catalog studies in the past, studies examining the match between users' vocabulary and that of online document retrieval systems are scarce. Moreover, the probable effects of such a mismatch on search failures are yet to be fully explored.
Natural language query interfaces are believed to improve search success in document retrieval systems, as users are able to formulate search queries in their own terms. Search terms chosen from natural language are more likely to match the titles of the documents in the database. Nevertheless, the role of natural language-based query interfaces in reducing search failures in document retrieval systems needs to be thoroughly studied.
There appears to be some kind of relationship between users' perception of retrieval effectiveness and that which is obtained through precision and recall measures. That is to say, users might think that they retrieved most of the relevant documents even though document retrieval systems tend to miss a great many relevant ones (i.e., recall failures). For instance, Blair and Maron (1985) observed that users involved in their retrieval effectiveness study believed that "they were retrieving 75 percent of the relevant documents when, in fact, they were only retrieving 20 percent" (p.295). As mentioned above, users might also disagree with indexers as to the relevance of a document with user-selected index terms assigned thereto (i.e., precision failures). It seems, then, that the relationship between "user-designated" ineffective searches and ineffective searches identified by retrieval effectiveness measures deserves further investigation.
3. Overview of a Document Retrieval System
What follows is a brief overview of a document retrieval system and its major components.
The principal function of a document retrieval system is to retrieve, for each search request, all and only relevant documents from a store of documents. In other words, the system should be capable of retrieving all relevant documents while rejecting all the others.
Maron (1984) provides a more detailed description of the document retrieval problem and depicts the logical organization of a document retrieval system as in Fig. 1.
[Fig. 1. Logical Organization of a Conventional Document Retrieval System. Source: Maron (1984), p.155. The diagram shows incoming documents passing through document identification (indexing), with the aid of a thesaurus/dictionary, to produce index records; the inquiring patron passes through query formulation, with the same aids, to produce a formal query; a retrieval rule matches index records against the formal query.]
As Fig. 1 suggests, the basic characteristics of each incoming document (e.g., author, title, subject) are identified during the indexing process. Indexers may consult thesauri or dictionaries (controlled vocabularies) in order to assign acceptable index terms for each document. Consequently, an index record is constructed for each document for subsequent retrieval purposes.
Likewise, users can identify their information needs by consulting the same index tools during the query formulation process. That is to say that a user can check to see if the terms he/she intends to use in his/her formal query are also recognized by the document retrieval system. Ultimately, he/she comes up with the most promising query terms (from the retrieval point of view) that he/she can submit to the system as his/her formal query.
As mentioned before, most users do not know about the tools that they can utilize to express their information needs, which results in search failures owing to a possible mismatch between users' vocabulary and the system's vocabulary. As Maron (1984) points out, "the process of query-formulation is a very complex process, because it requires that the searcher predict (i.e., guess) which properties a relevant document might have" (p.155). Finally, "the actual search and retrieval takes place by matching the index records with the formal search query. The matching follows a rule, called 'Retrieval Rule,' which can be described as follows: For any given formal query, retrieve all and only those index records which are in the subset of records that is specified by that search query" (Maron, 1984, p.155).
It follows, then, that a document retrieval system consists of (a) a store of documents (or, representations thereof); (b) a population of users each of whom makes use of the system to satisfy their information needs; and (c) a retrieval rule which compares representation of each user's query with the representations of all the documents in the store so as to identify the relevant documents in the store.
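For illustration only, the retrieval rule just quoted can be sketched in a few lines of Python (the index records below are invented, and each record is reduced to a bare set of index terms; real systems operate on far richer representations):

    # Simplified sketch of the retrieval rule: retrieve all and only those
    # index records that fall within the subset specified by the formal query.
    index_records = {
        "doc1": {"dolphins", "intelligence", "animal behavior"},
        "doc2": {"refuse disposal", "leaching"},
        "doc3": {"intellectual history", "philosophy"},
    }

    def retrieve(formal_query, index_records):
        """Return ids of all records that contain every term of the query."""
        return [doc_id for doc_id, terms in index_records.items()
                if formal_query <= terms]  # subset test

    print(retrieve({"dolphins", "intelligence"}, index_records))  # ['doc1']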
In addition, there should be some kind of user interface which allows users to interact with the system. A user interface mechanism has several functions: (1) to accept users' query formulations (in natural language or otherwise); (2) to transmit queries to the system for processing; (3) to bring the results back to users for evaluation; (4) to make various forms of feedback possible between the user and the document retrieval system.
The feedback function deserves further explanation. On the one hand, the system may prompt users as to what to do next or suggest alternative ways by way of system-generated feedback messages (i.e., help screens, status of search, actions to take). On the other hand, users should be able to modify or change their search queries in the light of a sample retrieval so as to improve search success in subsequent retrieval runs (Van Rijsbergen, 1979). Moreover, some systems may automatically modify the original search query after the user has made relevance judgments on the documents which were retrieved in the first try. This is known as "relevance feedback" and it is the relevance feedback process that concerns us here.
4. Relevance Feedback Concepts
Swanson (1977) examined some well-known information retrieval experiments and the measures used therein. He suggested that the design of document retrieval systems "should facilitate the trial-and-error process itself, as a means of enhancing the correctability of the request" (p.142).
Van Rijsbergen (1979) shares the same view when he points out that: "a user confronted with an automatic retrieval system is unlikely to be able to express his information need in one go. He is more likely to want to indulge in a trial-and-error process in which he formulates his query in the light of what the system can tell him about his query" (p.105).
Van Rijsbergen (1979) also lists the kind of information that could be of help to users when reformulating their queries such as the occurrence of users' search terms in the database, the number of documents likely to be retrieved by a particular query with a small sample, and alternative and related search terms that can be used for more effective search results.
"Relevance feedback" is one of the tools that facilitates the trial-and-error process by allowing the user to interactively modify his/her query based on the search results obtained during the initial run. The following quotation summarizes the relevance feedback process very well:
"It is well known that the original query formulation process is not transparent to most information system users. In particular, without detailed knowledge of the collection make-up, and of the retrieval environment, most users find it difficult to formulate information queries that are well designed for retrieval purposes. This suggests that the first retrieval operation should be conducted with a tentative, initial query formulation, and should be treated as a trial run only, designed to retrieve a few useful items from a given collection. These initially retrieved items could then be examined for relevance, and new improved query formulations could be constructed in the hope of retrieving additional useful items during subsequent search operations" (Salton and Buckley, 1990, p.288).
Relevance feedback was first introduced over 20 years ago during the SMART information retrieval experiments (Salton, 1971b). Early relevance feedback experiments were performed on small collections (e.g., 200 documents), where retrieval performance was unusually high (Rocchio, 1971a; Salton, 1971a; Ide, 1971).
It was shown that relevance feedback markedly improved retrieval performance. Recently, Salton and Buckley (1990) examined and evaluated twelve different feedback methods "by using six document collections in various subject areas for experimental purposes." The collection sizes they used varied from 1,400 to 12,600 documents. The relevance feedback methods produced improvements in retrieval performance ranging from 47% to 160%.
The relevance feedback process offers the following main advantages:
"- It shields the user from the details of the query formulation process, and permits the construction of useful search statements without intimate knowledge of collection make-up and search environment.
- It breaks down the search operation into a sequence of small search steps, designed to approach the wanted subject area gradually.
- It provides a controlled query alteration process designed to emphasize some terms and to deemphasize the others, as required in particular search environments" (Salton and Buckley, 1990, p.288).
The relevance feedback process helps in refining the original query and finding more relevant materials in the subsequent runs. The true advantage gained through the relevance feedback process can be measured in two different ways:
1) By changing the ranking of documents and moving the documents that are judged by the user as being relevant up in the ranking. With this method, documents that have already been seen (and judged as being relevant) by the user will still be retrieved in the second try, although they are ranked somewhat higher this time. "This occurs because the feedback query has been constructed so as to resemble the previously obtained relevant items" (Salton and Buckley, 1990, p.292). This effect is called the "ranking effect" (Ide, 1971), and it is difficult to distinguish this artificial ranking effect from the true feedback effect (Salton and Buckley, 1990). Note that the user may not want to see these documents a second time because he/she has already seen them during the initial retrieval.
2) By eliminating the documents that have already been seen by the user in the first retrieval and "freezing" the document collection at this point for the second retrieval. In other words, documents that were judged as being relevant (or nonrelevant) during the initial retrieval will be excluded in the second retrieval, and the search will be repeated only on the frozen part of the collection (i.e., the rest of the collection, from which the user has seen no documents yet). This is called the "residual collection" method and it "depresses the absolute performance level in terms of recall and precision, but maintains a correct relative difference between initial and feedback runs" (Salton and Buckley, 1990, p.292).
The different relevance feedback formulae are based on variations of these two methods. More detailed information on relevance feedback formulae can be found in Salton and Buckley (1990). For mathematical explications of the relevance feedback process, see Rocchio (1971a); Ide (1971); and Salton and Buckley (1990).
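For reference, the best-known of these formulae, due to Rocchio (1971a), modifies the original query vector q0 using the set R of documents judged relevant and the set N of documents judged nonrelevant; in its commonly cited form (in LaTeX notation, with weighting constants alpha, beta and gamma chosen by the experimenter):

    q_1 = \alpha\, q_0 \;+\; \frac{\beta}{|R|} \sum_{d \in R} d \;-\; \frac{\gamma}{|N|} \sum_{d \in N} d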
Let's now look at how relevance feedback process works in practice. Suppose that we have a document retrieval system with relevance feedback capabilities and a user submits a search query to the system and retrieves some documents. When bibliographic records of retrieved documents are displayed one by one to the user, he/she is asked to judge each retrieved document as being relevant or nonrelevant by pressing certain function keys on the keyboard. The user proceeds by making relevance judgments for each displayed record.
When the user decides to quit because he/she is either satisfied or frustrated with the documents he/she has seen in the course of the retrieval, the system asks the user if he/she wants to perform a relevance feedback search. If the user decides to perform a relevance feedback search, then the system revises and modifies the original query based on the documents the user has already judged as relevant or nonrelevant in the first retrieval. The system incorporates the user's relevance judgments and modifies the original query in the following manner.
Suppose that the original query was "intellectual rubbish." Further suppose that the system retrieved several documents including Russell's book on intellectual rubbish and another book on "rubbish leaching." Obviously, the user would judge the former as being relevant but not the latter. Based on this feedback, the system revises the query and re-weights the terms in the original query. It could be that the term "intellectual" will be heavily weighted. Furthermore, documents with the term "intellectual" will be upgraded in rank while those with the term "leaching" will be suppressed in the relevance feedback stage. In other words, the system will try to find documents that "look like" Russell's book in content and retrieve them before the dissimilar ones. Some other terms taken from the titles and subject headings of relevant documents and classification numbers can also be added to the original query at this stage. The weight to be attached to each term or classification number depends very much on the weighting scheme as well as on the strength of similarity between the document in question and the search query.
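For illustration only, this kind of term re-weighting might be sketched as follows (a simplified, Rocchio-style update in Python; the term weights, document vectors and constants are all invented and are not the actual formula used by any particular system):

    # Hypothetical sketch of query re-weighting for the "intellectual
    # rubbish" example; all weights and constants are invented.
    from collections import defaultdict

    def feedback(query, relevant_docs, nonrelevant_docs,
                 alpha=1.0, beta=0.75, gamma=0.25):
        """Return a re-weighted query vector (term -> weight)."""
        new_query = defaultdict(float)
        for term, weight in query.items():
            new_query[term] += alpha * weight
        for doc in relevant_docs:
            for term, weight in doc.items():
                new_query[term] += beta * weight / len(relevant_docs)
        for doc in nonrelevant_docs:
            for term, weight in doc.items():
                new_query[term] -= gamma * weight / len(nonrelevant_docs)
        return dict(new_query)

    original = {"intellectual": 1.0, "rubbish": 1.0}
    russell = {"intellectual": 0.9, "rubbish": 0.8, "skepticism": 0.5}
    leaching = {"rubbish": 0.7, "leaching": 0.9}

    # "intellectual" is boosted; "leaching" receives a negative weight,
    # so documents about rubbish leaching are pushed down in the ranking.
    print(feedback(original, [russell], [leaching]))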
The relevance feedback step, then, enables the system to "understand" the user's query better: the documents that are similar to the query are rewarded by being assigned higher ranks, while dissimilar documents are pushed farther down in the ranking. As a result of the relevance feedback process, the system comes up with more documents (that is, records).
Once again, the searcher would see the new documents, one after another, that were retrieved from the database as the result of the relevance feedback process, and judge them as being relevant or nonrelevant to his/her query. Relevance judgments, again, are automatically recorded for each record that the user scans.
The relevance feedback search can be iterated as many times as the user desires, until the user is satisfied with the search results. It should be noted, however, that the relevance feedback technique requires more work from the user, who is known to be willing to invest only minimal effort.
5. The Use of Clustering in Document Retrieval Systems
It was pointed out earlier that one of the components of a document retrieval system is the retrieval rule. The function of the retrieval rule is, simply stated, to retrieve all documents relevant to a given query by matching the query representation with the representation of each document in the database.
During earlier document retrieval experiments it was suggested that it would be more effective to cluster/classify documents before retrieval. If it is at all possible to cluster similar documents together, it was thought, then it would be sufficient to compare the query representation with only the cluster representations in order to find all the relevant documents in the collection. In other words, comparison of the query representation with the representations of each and every document in the collection would no longer be necessary. Undoubtedly, faster retrieval with less processing seemed attractive.
Van Rijsbergen (1979) emphasizes the underlying assumption behind clustering, which he calls "cluster hypothesis," as follows: "closely associated documents tend to be relevant to the same requests" (p.45, original emphasis). What is meant by this is that documents similar to one another in content can be used to answer the same search queries.
The assumption turned out to be valid. It was empirically proved that retrieval effectiveness of a document retrieval system can be improved by grouping similar documents together with the aid of document clustering methods (Van Rijsbergen, 1979). In addition to increasing the number of documents retrieved for a certain query (i.e., enhanced recall), document clustering methods proved to be cost-effective as well. Once clustered, documents are no longer dealt with individually but as groups for retrieval purposes, thereby cutting down the processing costs and time. Van Rijsbergen (1979) and Salton (1971b) provide a detailed account of the use of clustering in document retrieval systems.
"Cluster" here means a group of similar documents. The number of documents in a typical cluster depends on the characteristics of the collection in question as well as the clustering algorithm used. Collections consisting of documents in a wide variety of subjects tend to produce many smaller clusters whereas collections in a single field may generate relatively fewer but larger clusters. The clustering algorithm in use can also influence the number and size of the clusters. (See Van Rijsbergen (1979) for different clustering algorithms.) For instance, some 8,400 clusters have been created for a collection of more than 30,000 documents in Library and Information Studies (Larson, 1989).
Document clustering is based on a measure of similarity between the documents to be clustered. Several document clustering algorithms, which are built on different similarity measures, have been developed in the past. Keywords in the titles and subject headings of the documents are the most commonly used "objects" for clustering closely associated documents together. In other words, if two documents have the same keywords in their titles and/or they were assigned similar subject heading(s), a document clustering algorithm will bring these two documents together.
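For illustration only, a single-pass clustering routine based on keyword overlap might proceed along the following lines (a Python sketch; the keyword sets, the Dice similarity measure and the threshold are arbitrary choices for the example, not the algorithm of any particular system):

    # Hypothetical single-pass clustering: documents whose title/subject
    # keywords overlap sufficiently are placed in the same cluster.
    def dice(a, b):
        """Dice similarity between two keyword sets."""
        return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

    def cluster(documents, threshold=0.4):
        clusters = []  # each cluster: [representative keyword set, doc ids]
        for doc_id, keywords in documents.items():
            for rep, members in clusters:
                if dice(keywords, rep) >= threshold:
                    members.append(doc_id)
                    rep |= keywords   # fold keywords into the representative
                    break
            else:
                clusters.append([set(keywords), [doc_id]])
        return clusters

    docs = {
        "d1": {"dolphins", "intelligence"},
        "d2": {"dolphins", "communication"},
        "d3": {"refuse", "leaching"},
    }
    print(cluster(docs))  # d1 and d2 share a cluster; d3 forms its own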
5.1 Classification Clustering Method
Larson (1991a) has recently used classification numbers successfully to cluster similar documents together. What follows is a brief overview of the use of the "classification clustering method" in document retrieval systems.
Larson (1991a) argues that the use of classification for searching in document retrieval systems has been limited. The class number assigned to a document is generally seen as another keyword. Documents with identical class numbers are treated individually during the searching process. Yet, documents that were assigned the same or similar class numbers will most likely be relevant for the same queries. Like subject headings, "classification provides a topical context and perspective on a work not explicit in term assignments" (Larson, 1991a, p.152; Chan, 1986, 1989; Svenonius, 1983; Shepherd, 1981). The searching behavior of the users on the shelves seems to support the above idea and suggests that more clever use of classification information should be implemented in the existing online library catalogs (Hancock-Beaulieu, 1987; 1990).
The classification clustering method can improve retrieval effectiveness during the retrieval process. Documents with the same classification number can be brought together, along with the most frequently used subject headings in a particular cluster. Thus, these documents will be retrieved as a single group whenever a search query matches the representation of the documents in that cluster. Larson (1991a, 1991c) provides a more detailed description and a more formal presentation of the classification clustering method. Fig. 2 illustrates the classification clustering procedure diagrammatically.
Let's now look briefly at how the classification clustering method can be used to improve retrieval effectiveness during the document retrieval process.
Suppose that a collection of documents has already been clustered using a particular classification clustering algorithm. Let's further suppose that a user has come to the document retrieval system and issued a specific search query (e.g., "intelligence in dolphins"). First, a retrieval function within the system analyzes the query, eliminates the "buzz" words (using a stop list), processes the query using the stemming and indexing routines and weights the terms in the query to produce a vector representation of the query. Second, the system compares the query representation with each and every document cluster representation in order "to retrieve and rank the cluster records by their probabilistic "score" based on the term weights stored in the inverted file...The ranked clusters are then displayed to the user in the form of a textual description of the classification area (derived from the LC classification summary schedule) along with several of the most frequently assigned subject headings within the cluster..." (Larson, 1991a, p.158). Once the system finds the "would-be" relevant clusters, the user will then be able to judge some of the clusters as being relevant by simply identifying the relevant clusters on the screen and pushing a function key. "After one or more clusters have been selected, the system reformulates the user's query to include class numbers for the selected clusters and retrieves and ranks the individual MARC records based on this expanded query" (Larson, 1991a, p.159).
Larson (1991a) describes how this tentative relevance information for the selected clusters can be utilized for ranking the individual records:
"In the second stage of retrieval..., we still have no information about the relevance of individual documents, only the tentative relevance information provided by cluster selection. In this search, the class numbers assigned to the selected clusters are added to the other terms used in the first-stage query. The individual documents are ranked in decreasing order of document relevance weight calculated, using both the original query terms and the selected class numbers, and their associated MARC records are retrieved, formatted, and displayed in this rank order... In general, documents from the selected classes will tend to be promoted over all others in the ranking. However, a document with very high index term weights that is not from one of the selected classes can appear in the rankings ahead of documents from that class that have fewer terms in common with the query" (p.159-60).
Although the identification of relevant clusters can properly be considered a type of relevance feedback, we prefer to regard it as some sort of system help before the user's query is run on the entire database.
After all of the above re-weighting and ranking processes, which are based on the user's original query and the selection of relevant clusters, are done, individual records are displayed to the user. This time the user is able to judge each individual record (rather than the cluster) that is retrieved as being relevant or nonrelevant, again by simply pushing the appropriate function key. He/she can examine several records, making relevance judgments along the way for each record, until he/she decides there is no point in continuing to display records as the probability of relevance gets smaller and smaller.
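The two-stage procedure described above might be sketched, in a highly simplified and hypothetical form (a Python sketch; the toy matching score, the cluster record and the class numbers are stand-ins, not Larson's probabilistic implementation), as follows:

    # Hypothetical sketch of two-stage retrieval with classification clusters.
    def score(query_terms, terms):
        """Toy matching score: the number of query terms present."""
        return len(query_terms & terms)

    def first_stage(query_terms, clusters):
        """Rank cluster records by their match with the query."""
        return sorted(clusters.items(),
                      key=lambda c: score(query_terms, c[1]),
                      reverse=True)

    def second_stage(query_terms, selected_classes, documents):
        """Add the selected class numbers to the query and re-rank records;
        documents from the selected classes tend to be promoted."""
        expanded = query_terms | selected_classes
        return sorted(documents.items(),
                      key=lambda d: score(expanded, d[1]["terms"] | {d[1]["class"]}),
                      reverse=True)

    clusters = {"QL737": {"dolphins", "whales", "intelligence"}}
    documents = {
        "rec1": {"class": "QL737", "terms": {"dolphins", "intelligence"}},
        "rec2": {"class": "TD795", "terms": {"refuse", "leaching"}},
    }
    print(first_stage({"dolphins", "intelligence"}, clusters))
    print(second_stage({"dolphins", "intelligence"}, {"QL737"}, documents))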
To sum up, the classification clustering method brings similar documents together by checking the class number assigned to each document. It also allows users to improve their search queries by displaying some retrieved clusters for the original query. At this point users are given a chance to judge retrieved clusters as being relevant or nonrelevant to their queries. Users' relevance judgments are then incorporated into the original search queries, thereby making the original queries more precise and shifting them in the "right direction" to increase retrieval effectiveness.
6. Failure Analysis in Document Retrieval Systems
Before we review concepts of failure analysis and major studies in this area, let's look first at how to measure and evaluate the performance of a document retrieval system. One would like to know how well the retrieval system performs its functions in retrieving relevant documents for each search query while holding back the nonrelevant ones. After all, one cannot analyze search failures if one does not recognize them.
6.1 Retrieval Effectiveness Measures
Precision and recall are the most commonly used retrieval effectiveness measures in information retrieval research. As defined earlier, precision is the proportion of retrieved documents that are relevant; recall is the proportion of relevant documents that are retrieved.
Precision can be taken as the ratio of the number of documents that are judged relevant for a particular query over the total number of documents retrieved. For instance, if, for a particular search query, the system retrieves, say, 2 documents and the user finds one of them relevant, then the precision ratio for this search would be 50%.
Recall is considerably more difficult to calculate than precision since it requires finding relevant documents that will not be retrieved in the course of users' initial searches (Blair and Maron, 1985). Recall can be taken as the ratio of the number of relevant documents retrieved over the total number of relevant documents in the collection. Take the above example, for instance. The user judged one of the (two) retrieved documents relevant. Suppose that we later found 3 more relevant documents in the collection which the original search query failed to retrieve. That is to say that the system retrieved only one out of the four relevant documents from the database. The recall ratio would then be equal to 25% for this particular search.
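The two worked examples above amount to the following small computation (a Python sketch; the document identifiers are invented):

    # Precision and recall for the examples in the text: 2 documents
    # retrieved, 1 of them relevant; 4 relevant documents in the collection.
    retrieved = {"doc_a", "doc_b"}
    relevant = {"doc_a", "doc_c", "doc_d", "doc_e"}

    precision = len(retrieved & relevant) / len(retrieved)   # 1/2
    recall = len(retrieved & relevant) / len(relevant)       # 1/4
    print(f"precision = {precision:.0%}, recall = {recall:.0%}")  # 50%, 25%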
Less precise retrieval effectiveness measures than precision and recall are also in use. For example, the users of a document retrieval system can be asked if they are satisfied with the documents that they found for their search queries. This would tell us something about the performance of the document retrieval system in question, albeit in much less quantifiable terms.
However, the multiplicity of retrieval effectiveness measures complicates matters. Since no single performance measure can be used in evaluating all document retrieval systems, it also becomes difficult to come up with a single definition of "search failure."
6.2 What is Search Failure?
If one is to accept precision and recall as performance measures with their given definitions, it instantly becomes clear that "performance" can no longer be defined as a dichotomous concept. As precision and recall are defined as percentages, "degrees" of failure (or success, for that matter) can be conceived of. In fact, such a view would probably best reflect the different performance levels attained by current document retrieval systems. It is practically impossible to come across a perfect document retrieval system; totally catastrophic ones are equally rare. Yet we have systems better (or worse) than one another. A system failing to retrieve relevant documents half of the time is certainly much better than one which fails to retrieve anything most of the time.
Precision and recall are two different quantitative measures that aggregate search failures. Search failures analyzed on the basis of precision and recall can be called precision failures and recall failures, even though precision and recall do not themselves represent types of failures. As briefly mentioned before, precision failures occur when the user finds a retrieved document nonrelevant. For a given search query, the more nonrelevant documents the system retrieves, the higher the degree of precision failure gets. If no retrieved document happens to be relevant, then the precision ratio becomes zero due to severe precision failures.
Similarly, recall failures occur when the system fails to retrieve some or all of the relevant documents from the database. For a given search query, the more relevant documents the system misses, the higher the degree of recall failure gets. If the system retrieves nothing but nonrelevant documents from the collection, then the recall ratio becomes zero due to severe recall failures.
Mainly, then, two types of errors constitute "search failures": (1) missing relevant documents (recall failures), and (2) retrieving nonrelevant documents (precision failures). Failure analysis aims to find out the causes of both types of failures in document retrieval systems so that existing systems can be improved in a variety of ways. As suggested earlier, search failures might occur due to an array of reasons such as, among others, deficiencies in indexing and query analysis, retrieval rules and user interfaces.
Lancaster (1968) and Blair and Maron (1985), among others, have studied search failures on the basis of precision and recall. Yet others have defined search failure in somewhat different terms and studied it accordingly. For instance, a search was counted as a failure "if no relevant record appears in the first ten which are displayed" during the evaluation of the Okapi online catalog (Walker and Jones, 1987, p.139; Jones, 1986). Such a definition of search failure is quite different from one based on precision and recall. It is dichotomous and assumes that users would scan at least ten records before they quit. This assumption might be true for some searches (e.g., subject searches) and some users, but not all. It also somewhat downplays the importance of search failures: searches retrieving at least one relevant record in ten are considered "successful" even though the precision ratio for such searches is quite low (10%).
Markey and Demeyer (1986, p.181) took a slightly different approach when they analyzed the subject search failures that occurred in the Dewey Decimal Classification (DDC) Online Project. They singled out searches that failed to retrieve relevant documents (as judged by the users) and identified the reasons why the searches failed. Apparently, however, they did not count "zero retrievals" (i.e., those searches that retrieved nothing) as search failures in their analysis.
Two observations are due here in regard to Markey and Demeyer's method of studying search failures. First, zero retrievals in subject searching are quite common: somewhere between 35% and 50% of the subject searches in online catalogs retrieve nothing (Markey, 1984). Second, they also excluded partial search failures, in which at least some relevant documents were retrieved. Presumably, that is why the number of search failures they analyzed was relatively small.
Some researchers formulated yet another practical definition of search failure. Dickson (1984), Peters (1989) and Hunter (1991) identified zero hits from the transaction logs of some online catalogs and looked into the reasons for search failures. Kirby and Miller (1986) and Walker et al. (1991) employed the same method when they studied search failures in MEDLINE. Needless to say, defining search failure as zero retrieval is incomplete, as it does not include partial search failures. More importantly, there is no reason to believe that all "non-zero retrieval" searches were successful ones; such an assumption would mean that no precision failures occurred in the systems under investigation! Furthermore, "not all zero hits represent failures for the patrons...It is possible that the patron is satisfied knowing that the information sought is not in the database, in which case the zero-hit search is successful" (Hunter, 1991, p.401).
The findings of major failure analysis studies are briefly summarized below. Not surprisingly, however, the results are not directly comparable, given the different definitions of "search failure" employed in different studies.
6.3 Review of Failure Analysis Studies
Various studies have shown that users experience several problems when doing searches in online document retrieval systems and they often fail to retrieve relevant documents (Lancaster, 1968). The problems users frequently encounter when searching especially in online library catalogs are well documented in the literature (Bates, 1986; Borgman, 1986; Cochrane and Markey, 1983; Hildreth, 1989; Kaske, 1983; Kern-Simirenko, 1983; Larson, 1986, 1991b; Lawrence, Graham and Presley, 1984; Markey, 1984, 1986; Matthews, 1982; Matthews, Lawrence and Ferguson, 1983; Pease and Gouke, 1982). Few researchers, however, studied search failures directly (Lancaster, 1968, 1969; Dickson, 1984; Jones, 1986; Markey and Demeyer, 1986; Kirby and Miller, 1986; Walker and Jones, 1987; Peters, 1989; Hunter, 1991; Walker et al., 1991). What follows is a brief overview of major studies of search failures in document retrieval systems.
Investigators have employed different methods in order to study search failures occurring in document retrieval systems. Lancaster (1968), for instance, studied recall failures by finding some relevant documents using sources other than the document retrieval system under investigation (MEDLARS) and then checking to see if the relevant sources identified beforehand had also been retrieved during the experiment. If some relevant documents that were identified in advance were missed during the experiment, this was considered a recall failure and measured quantitatively. Precision failures were easier to detect, as users were asked to judge the retrieved documents as being relevant or nonrelevant. If the user assessed some of the retrieved documents as being nonrelevant, this was considered a precision failure and measured accordingly. Yet identifying the causes of precision failures proved to be much more difficult, since the user might have judged a document as being nonrelevant due to, among other things, indexing, searching, or document characteristics, as well as his/her background and previous experience with that document.
Lancaster's study (1968) remains the most detailed account of the causes of search failures ever attempted. As Lancaster (1969) points out:
"The "hindsight" analysis of a search failure is the most challenging aspect of the evaluation process. It involves, for each "failure," an examination of the full text of the document; the indexing record for this document (i.e., the index terms assigned...); the request statement; the search formulation upon which the search was conducted; the requester's completed assessment forms, particularly the reasons for articles being judged "of no value"; and any other information supplied by the requester. On the basis of all of these records, a decision is made as to the prime cause or causes of the particular failure under review" (p.123).
As a result of his analyses, Lancaster (1969) found that recall failures (i.e., relevant documents not retrieved) occurred in 238 out of 302 searches, while precision failures (i.e., documents that were retrieved but not relevant) occurred in 278 out of 302 searches. More specifically, some 797 relevant documents were not retrieved, and more than 3,000 retrieved documents were judged nonrelevant by the requesters. Lancaster's original research report (1968) contains statistics about search failures along with detailed explanations of their causes.
Lancaster found that almost all of the failures could be attributed to indexing, searching, the index language and the user-system interface. For instance, the indexing subsystem in his research "contributed to 37% of the recall failures and...to...13% of the precision failures" (Lancaster, 1969, p.127). The searching subsystem, on the other hand, was "the greatest contributor to all the MEDLARS failures, being at least partly responsible for 35% of the recall failures and 32% of the precision failures" (Lancaster, 1969, p.131).
Blair and Maron (1985) studied recall failures in a somewhat different manner. They developed "sample frames consisting of subsets of the unretrieved database" that they believed to be "rich in relevant documents" and took random samples from these subsets. Taking samples from subsets of the database rather than the entire database was more advantageous from the methodological point of view "because, for most queries, the percentage of relevant documents in the database was less than 2 percent, making it almost impossible to have both manageable sample sizes and a high level of confidence in the resulting Recall estimates" (Blair and Maron, 1985, p.291-293).
Blair and Maron (1985) found that recall failures occurred much more frequently than one would expect: the system failed to retrieve, on the average, four out of five relevant documents in the database. They illustrated quite convincingly that such high recall failures should come as no surprise because one can express one's information needs using the natural language in a variety of different ways. However, such a diverse use of the language may not necessarily serve the users' best interests in retrieving documents from, especially full-text, document retrieval systems.
Markey and Demeyer (1986) analyzed a total of 680 subject searches as part of the DDC Online project and found that 34 out of 680 subject searches (5%) failed. Two major reasons for subject search failures were identified as follows: the topic was marginal (35%), and the users' vocabulary did not match subject headings (24%) (p.182). It should be emphasized that the authors' report gives a much more detailed account (p.173-291) of the failure analysis of different subject searching options in an online catalog enhanced with a classification system (DDC).
The availability of transaction logs documenting users' interaction with the document retrieval systems provided further opportunities to study and monitor search failures. Transaction logs were "designed to permit detailed analysis of individual user transactions and system performance characteristics. The individual transaction records provide enough information for analysts to reconstruct the events of any user session, including all searches, displays, help requests, and errors, and the system responses" (Larson, 1991b, p.7).
Dickson (1984), Jones (1986), Walker and Jones (1987), Peters (1989), and Hunter (1991) have all used transaction logs to study search failures in online catalogs. Dickson studied a sample of "zero-hit" author and title searches (i.e., those search queries that retrieved nothing) from the transaction log of Northwestern University Library's online catalog and analyzed why the searches failed. She found that about 23% of author searches and 37% of title searches retrieved nothing. Misspellings and mistakes in search formulation were the major causes of zero-hit searches.
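A hedged sketch of how such zero-hit searches might be flagged in a transaction log follows (in Python; the record layout and the sample entries are invented for illustration, since actual logs differ from system to system):

    # Hypothetical transaction log records: (search type, query, hits).
    log = [
        ("title", "war and peace", 12),
        ("author", "tolsoy leo", 0),              # misspelling -> zero hits
        ("subject", "intelligence in dolphins", 0),
        ("title", "intellectual rubbish", 1),
    ]

    zero_hits = [(stype, query) for stype, query, hits in log if hits == 0]
    rate = len(zero_hits) / len(log)
    print(f"{rate:.0%} of searches retrieved nothing:", zero_hits)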
Jones (1986) examined the findings of transaction log analyses in the Okapi online catalog and identified several unsatisfactory areas in the operation of Okapi due to, among other things, spelling errors, failures in subject searching and the user-system interface. In a more recent study it was found that 14 out of 122 sessions in Okapi failed (excluding failures due to the collection). The major causes of failures were: (a) that users' vocabulary did not match that of the catalog (50%); (b) that the topics expressed were too specific (29%); and (c) that the searches did not describe users' needs (14%) (Walker and Jones, 1987, p.149).
Peters (1989) analyzed the transaction logs of a union online catalog (Libraries of the University of Missouri Information Network) and found that 40% of the searches in that catalog produced zero-hits. He classified the causes of search failures under 14 different groups, among which are typographical and spelling errors (10.9% and 9.9%, respectively) and the search system itself (9.7%). Approximately 40% of the failures were collection failures (i.e., the item sought was not in the database).
Hunter (1991) analyzed thirteen hours of transaction logs, amounting to some 3,700 searches performed on a large academic library online catalog. She used the same classification schema as Peters (1989) and categorized the causes of search failures under 18 different groups. The overall search failure rate in Hunter's study was found to be 54.2%. The major causes of search failures were identified as the controlled vocabulary in subject searching (29%), the system itself (18%), and typographical errors (15%). It is not explained in detail, however, what sorts of controlled vocabulary failures occurred and what their specific causes were.
Kirby and Miller (1986) analyzed search failures encountered by end-users of MEDLINE using Colleague search software. They examined 31 "incomplete" searches and found that the search strategy was the major cause of search failures (67.7%). The rest of the search failures were due to system mechanics and database selection (22.6% and 9.7%, respectively).
Walker et al. (1991) obtained similar results when they studied the problems encountered by clinical end-users of MEDLINE and GRATEFUL MED. They defined search failure as "unproductive search" (i.e., zero-hit) and analyzed 172 unsuccessful searches. They found that 48% of the search failures occurred because of some flaw in the search strategy. The software in use was responsible for 41% of the search failures. System failures constituted some 11% of all search failures.
The studies summarized above have benefited from transaction monitoring to the extent that "zero-hit" searches could be identified from transaction logs. Zero-hit searches were later examined in order to find out why a particular search query failed to retrieve anything in the database. (Jones (1986) and Walker and Jones (1987) should be exempted from this, as their analyses are not based on zero-hit searches.) Unlike Lancaster (1968), however, these investigators did not attempt to identify the causes of recall and precision failures.
Although transaction monitoring offers unprecedented opportunities to study search failures in document retrieval systems, it is not clear what constitutes a "search failure" in transaction logs. As mentioned earlier, defining all "zero-hit" searches as search failures has some serious flaws. Furthermore, transaction logs have very little to offer when studying recall failures in document retrieval systems. Recall failures can only be unearthed by using different methods, such as analyzing search statements, indexing records and retrieved documents. In addition, further relevant documents that were not retrieved in the first place can be found by performing successive searches in the database.
Several studies which were not necessarily concerned with the causes of search failures directly but which nevertheless addressed the issues in this area are summarized below.
Hildreth (1989) considers the "vocabulary" problem as the major retrieval problem in today's online catalogs and asserts that "no other issue is as central to retrieval performance and user satisfaction" (p.69). As suggested earlier, this may be due to the fact that controlled vocabularies are far more complicated than users can easily grasp in a short period of time. In fact, several researchers have found that the lack of knowledge concerning the Library of Congress Subject Headings (LCSH) is one of the most important reasons why users fail in online catalogs (see, for instance, Bates, 1986; Borgman, 1986; Gerhan, 1989; Lewis, 1987; Markey, 1986). Larson (1986) found that almost half of all subject searches on MELVYL (University of California Library System) retrieved nothing. More recently, Larson (1991b) analyzed the use of MELVYL over a longer period of time (6 years) and found that there is a significant positive correlation between the failure rate and the percentage of subject searching. This confirms the findings of an earlier formal analysis of factors contributing to success and satisfaction: "problems with subject searching were the most important deterrents to user satisfaction" (University, 1983, p.97).
Larson (1991a) has reviewed the literature on subject search failures in online catalogs along with remedies offered to reduce subject search problems (p.136-144). Assigning more subject headings to bibliographic records and providing keyword searching and classification enhancements are among the proposals which, if and when implemented, can reduce subject search failures in conventional online catalogs.
Carlyle (1989) studied the match between users' vocabulary and LCSH and found that "single LCSH headings match user expressions exactly about 47% of the time" (p.37). The study conducted by Van Pulis and Ludy (1988) showed that 53% of user-entered terms matched subject headings used in the online catalog. Such findings suggest that some search failures can be attributed to the controlled vocabularies in current online catalogs.
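Match rates of this kind could be estimated along the following (hypothetical) lines, comparing user-entered expressions against a list of controlled subject headings (a Python sketch with invented data):

    # Hypothetical sketch: proportion of user-entered expressions that
    # exactly match a controlled subject heading after simple normalization.
    subject_headings = {"dolphins", "intellect", "refuse and refuse disposal"}
    user_terms = ["Dolphins", "smart animals", "garbage", "intellect"]

    def normalize(term):
        return term.strip().lower()

    matches = [t for t in user_terms if normalize(t) in subject_headings]
    match_rate = len(matches) / len(user_terms)
    print(f"exact-match rate: {match_rate:.0%}")   # 2 of 4 -> 50%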
From the users' viewpoint it is certainly preferable to be able to express their information needs in their own natural language terms. However, most, if not all, online catalogs today cannot accommodate search requests submitted in natural language form. Yet it is believed that natural language query interfaces may reduce search failures in online catalogs by improving the match between users' search terms and the system's controlled vocabulary. Nevertheless, the role of natural language query interfaces in search success in online catalogs is yet to be thoroughly investigated.
Markey (1984) discusses several different data gathering methods that were used in online catalog use studies such as questionnaires, interviews, controlled experiments and transaction monitoring. Cochrane and Markey (1983) point out that different data gathering methods have different strengths. For instance, questionnaires and interviews can provide insight on the user's attitude toward the online document retrieval system while transaction log analysis can reveal the actual user behavior at online catalogs (Tolle, 1983).
Although the different methods discussed above are most useful for gathering data on online catalog use and search failures, they do not necessarily help fully explain the causes of the search failures that occur in online catalogs and document retrieval systems. Transaction logs, for instance, can document search failure occurrences but cannot explain why a particular search failed. A variety of reasons may cause search failures in online catalogs: simple typographical errors, a mismatch between the user's search terms and the vocabulary used in the catalog, the database (i.e., the requested item is not in the system), the user interface, and the search and retrieval algorithms, to name but a few. In order to find out why a particular search failed, one needs further information about the user's needs and intentions, which, obviously, are not recorded in transaction logs.
Data about user needs and intentions can be gathered through a technique known as "critical incident technique." This technique is briefly discussed below.
6.4 Use of the Critical Incident Technique in Failure Analysis Studies
Flanagan (1954) describes the critical incident technique as follows:
"The critical incident technique consists of a set of procedures for collecting direct observations of human behavior in such a way as to facilitate their potential usefulness in solving practical problems and developing broad psychological principles. The critical incident technique outlines procedures for collecting observed incidents having special significance and meeting systematically defined criteria.
"By an incident is meant any observable human activity that is sufficiently complete in itself to permit inferences and predictions to be made about the person performing the act" (p.327).
The critical incident technique was first used during World War II in the analysis of the specific reasons for failure of pilot candidates in learning to fly. Since then, this technique has been widely used not only in aviation but also in defining the critical requirements of and measuring typical performance in the health professions. Flanagan (1954) provides a more detailed account of the uses of the critical incident technique in a variety of fields.
The major advantage of this technique is that it obtains "a record of specific behaviors from those in the best position to make the necessary observations and evaluations" (Flanagan, 1954, p.355). In other words, it is observed behavior that counts in the critical incident technique, not opinions, hunches and estimates.
The critical incident technique essentially consists of two steps: collection and classification of detailed incident reports, and inferences that are based on the observed incidents.
Wilson, Starr-Schneidkraut and Cooper (1989) summarize these two steps as follows:
"The collection and careful analysis of a sufficient number of detailed reports of such observations of effective and ineffective behaviors results in comprehensive definition of the behaviors that are required for success in the activity in question under a wide range of conditions. These organized lists of critical requirements (generally termed performance "taxonomies") can then be used for a variety of practical purposes such as the evaluation of performance, the selection of individuals with the greatest likelihood of success in the activity, or the development of training programs or other aids to increase the effectiveness of individuals" (p.2).
The critical incident technique can also be used to gather data "on observations previously made which are reported from memory." Flanagan (1954) claims that collecting data about incidents which happened in the recent past is usually satisfactory. However, the accuracy of reporting depends on what the incident reports contain: the more detailed and precise the incident reports are the more accurate, it is assumed, the information contained therein.
Recently, the critical incident technique has been used to assess "the effectiveness of the retrieval and use of biomedical information by health professionals" (Wilson, Starr-Schneidkraut and Cooper, 1989, p.2). The researchers first devised a sampling strategy and developed an interview protocol to elicit the desired information from the subjects. They then developed three "frames of reference" to analyze the data gathered through the interviews: "These were (1) "Why was the information needed?", (2) "How did the information obtained impact the decision-making of the individual who needed the information?", and (3) "How did the information obtained impact the outcome of the clinical or other situation that occasioned the search?"" (p.5). After the qualitative analysis of the critical incident reports, these frames of reference were used to create three corresponding taxonomies.
In the same study, the critical incident technique was also used to analyze and evaluate search failures in MEDLINE. Users were asked to comment on the effectiveness of online searches which they had performed on the MEDLINE database. The reasons users gave for why a particular search failed (or succeeded) were recorded on a questionnaire used during the interviews. These "incident reports" were later matched against the MEDLINE transaction log records corresponding to each search in order to find out the actual reasons for search failures (and successes). It is these incident reports that provide the much sought after data concerning user needs and intentions, and they put each transaction record in context, so that transaction logs are no longer "anonymous."
Some 26 user designated ineffective incident reports were examined so as to "characterize the nature of the ineffective searches, analyze the relationship between what the user said and what the transaction log said happened during the search, and ascertain, by performing an analogous MEDLINE search, whether a search could have been performed which would have met the user's objective" (p.81). Most ineffective searches (23 out of 26) were identified as such because the users "could not find what they were looking for and/or could not find relevant materials." An appendix summarizing the analysis of each ineffective search is included in the research report.
Upon studying the transcript of the interviews and the transaction logs for ineffective searches extensively, researchers concluded that "many users who reported ineffective searches do not seem to understand:
1. How to do subject searching.
2. How MeSH [Medical Subject Headings] works.
3. How they can apply that understanding to map their search requests into a vocabulary that is likely to retrieve considerably more relevant materials" (Wilson, Starr-Schneidkraut and Cooper, 1989, p.83-84).
Some observations were derived from the analysis of the ineffective searches concerning indexing, the MEDLINE database, the search software and the users.
It appears that the critical incident technique can be used successfully in the analysis of search failures in online catalogs as well. Matching incident reports against transaction logs is especially promising. Since the analyst will, through incident reports, gather contextual data for each search query, more informed relevance judgments can be made during the evaluation of retrieval effectiveness. Furthermore, this technique can also be used to compare user designated search effectiveness with the effectiveness obtained through traditional retrieval measures.
7. The Present Study
The present study will attempt to investigate the probable causes of search failures in a "third generation" experimental online catalog system. The rigorous analysis of retrieval effectiveness and search failures will be based on transaction log records and critical incident technique. The former method allows one to study the users' search behaviors unobtrusively while the latter helps gather information about user intentions and needs for each query submitted to the system.
The findings to be obtained through this study will shed some light on the probable causes of search failures in online library catalog systems. The results will help improve our understanding of the role of natural query languages and indexing in online catalogs. Furthermore, the findings might provide invaluable insight that can be incorporated in future retrieval effectiveness and relevance feedback studies.
7.1 Objectives of the Study
The purpose of the present study is to:
1. analyze the search failures in online catalogs so as to identify their probable causes and to improve the retrieval effectiveness;
2. ascertain the extent to which users' natural language-based queries match the titles of the documents and the Library of Congress Subject Headings (LCSH) attached to them;
3. compare user designated ineffective searches with the effectiveness results obtained through precision and recall measures;
4. measure the retrieval effectiveness in an experimental online catalog in terms of precision and recall;
5. identify the role of relevance feedback in improving the retrieval effectiveness in online catalogs;
6. identify the role of natural query languages in improving the match between users' vocabulary and the system's vocabulary, along with their retrieval effectiveness scores in online catalogs; and
7. develop a taxonomy of search failures in online library catalogs.
7.2 Hypotheses
The main hypotheses of this study are as follows:
1. Search failures occur in online catalog systems;
2. The match between users' vocabulary and titles of, and LCSH assigned to, documents will help reduce the search failures and improve the retrieval effectiveness in online catalogs;
3. The relevance feedback process will reduce the search failures and enhance the retrieval effectiveness in online catalogs;
4. User designated ineffective searches in online catalogs do not necessarily coincide with system designated ineffective searches.
8. The Experiment
An experiment will be conducted in order to test the hypotheses of this study and address the research questions raised above. Data will be gathered on the use of an experimental online catalog for a specified period of time (13 September - 29 November 1991). The kinds of data to be collected during this experiment include users' actual search queries submitted to the online catalog, the records retrieved and displayed to the users, users' relevance judgments for each record displayed, and the records retrieved and displayed after the relevance feedback process. This data will then be analyzed in order to determine the retrieval effectiveness attained in the experimental online catalog. Search failures will be documented and their causes investigated in detail. Further data will be collected from the users, using the critical incident technique, about their information needs and intentions when they performed their searches in the online catalog. As pointed out earlier, a detailed analysis will be performed to find out whether there is corroboration between user designated ineffective searches and the search failures identified by the system.
8.1 The Environment
8.1.1 The System
The experiment will be conducted on CHESHIRE (California Hybrid Extended SMART for Hypertext and Information Retrieval Experimentation), an experimental online library catalog system "designed to accommodate information retrieval techniques that go beyond simple keyword matching and Boolean retrieval to incorporate methods derived from information retrieval research and hypertext experiments" (Larson, 1989, p.130). The test database for the CHESHIRE system consists of some 30,000 MARC records representing the holdings of the Library of the School of Library and Information Studies at the University of California at Berkeley. CHESHIRE uses a modified version of Salton's SMART system for indexing and retrieval purposes (Salton, 1971; Buckley, 1987) and runs on a Sun workstation with 320 megabytes of disk storage. Larson (1989) provides more detailed information about CHESHIRE and the characteristics of the collection. (For the theoretical basis of, and the probabilistic retrieval models used in, the CHESHIRE online catalog system, see Larson (1991c).)
CHESHIRE accommodates queries in natural language form. The user describes his/her information need using words taken from natural language and submits this statement to the system. The statement is then "parsed" and analyzed to create a vector representation of the search query. Finally, the query is submitted to the system for the retrieval of individual documents from the database.
CHESHIRE has a set of both vector space (e.g., cosine matching, term frequency-inverse document frequency (TFIDF) matching) and probabilistic retrieval models available for experimental purposes. Formal presentations of these models can be found elsewhere (e.g., Larson, 1991c). Suffice it to say here that cosine matching measures the similarity between document and query vectors and "ranks the documents in the collection in decreasing order of their similarity to the query." TFIDF matching is similar to cosine matching; however, TFIDF takes term frequencies into account and attaches more weight to terms occurring "frequently in a given document but relatively infrequently in the collection as a whole" (Larson, 1991c). Probabilistic models (Model 1, Model 2, Model 3), on the other hand, approach the "document retrieval problem" probabilistically and assume that probability of relevance is a relationship between the searcher and the document, not between the terms used in indexing documents and the terms used in expressing search queries (Maron, 1984).
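To make the vector space notions just described concrete, the following is a minimal sketch of TFIDF weighting and cosine ranking; it is purely illustrative and does not reproduce the actual SMART/CHESHIRE implementation, and the weighting details (raw term frequency multiplied by a simple inverse document frequency) are assumptions made only for the example.

import math
from collections import Counter

def tfidf_vector(terms, doc_freq, num_docs):
    # Weight each term by tf * idf: heavy weight for terms frequent in the
    # document but relatively infrequent in the collection as a whole.
    tf = Counter(terms)
    return {t: tf[t] * math.log(num_docs / doc_freq.get(t, 1)) for t in tf}

def cosine(q_vec, d_vec):
    # Cosine similarity between a query vector and a document vector.
    common = set(q_vec) & set(d_vec)
    dot = sum(q_vec[t] * d_vec[t] for t in common)
    norm = math.sqrt(sum(w * w for w in q_vec.values())) * \
           math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / norm if norm else 0.0

def rank(query_terms, docs, doc_freq, num_docs):
    # Rank documents in decreasing order of their similarity to the query.
    q = tfidf_vector(query_terms, doc_freq, num_docs)
    scored = [(cosine(q, tfidf_vector(d, doc_freq, num_docs)), i)
              for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)

docs = [["online", "catalogs", "design"], ["indexing", "theory"]]
doc_freq = {"online": 1, "catalogs": 1, "design": 1, "indexing": 1, "theory": 1}
print(rank(["online", "catalogs"], docs, doc_freq, num_docs=2))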
CHESHIRE also has relevance feedback capabilities to improve retrieval effectiveness. Upon retrieval of documents from the database, the user is asked to judge whether each retrieved document is relevant. Based on the user's relevance judgments on the retrieved documents, the original search query is modified and a new set of, presumably more relevant, documents is retrieved for the same query. Users can repeat the relevance feedback process in CHESHIRE as many times as they wish.
Probabilistic retrieval techniques along with classification clustering, which is used for query expansion in CHESHIRE, will be used for evaluation purposes in this experiment. The feedback weight for an individual query term i will be computed according to the following probabilistic relevance feedback formula:
log( p_i (1 - q_i) / ( q_i (1 - p_i) ) )

where

p_i = ( rel_ret + freq/num_doc ) / ( num_rel + 1.0 )

q_i = ( freq - rel_ret + freq/num_doc ) / ( num_doc - num_rel + 1.0 )

and where

freq is the frequency of term i in the entire collection;
rel_ret is the number of relevant documents term i occurs in;
num_rel is the number of relevant documents that are retrieved;
num_doc is the number of documents.
This formula takes into account only the "feedback effect," not the artificial "ranking effect" (i.e., documents retrieved in the first run are not included in the second run).
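The short sketch below simply transcribes the feedback weight formula above into code; the function name and the example values are illustrative assumptions, while the variable names mirror the definitions given.

import math

def feedback_weight(freq, rel_ret, num_rel, num_doc):
    # p_i and q_i exactly as defined in the formula above.
    p_i = (rel_ret + freq / num_doc) / (num_rel + 1.0)
    q_i = (freq - rel_ret + freq / num_doc) / (num_doc - num_rel + 1.0)
    return math.log((p_i * (1.0 - q_i)) / (q_i * (1.0 - p_i)))

# Illustrative values only: a term occurring in 50 of 30,000 documents,
# found in 3 of the 5 relevant documents retrieved for the query.
print(feedback_weight(freq=50, rel_ret=3, num_rel=5, num_doc=30000))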
8.1.2 Search Queries
Search queries arising from genuine information needs of the users (see Section 8.2) will be gathered for this experiment. The number of search queries to be collected is expected to be around 200.
This figure is thought to be appropriate for evaluation purposes, as most information retrieval experiments in the past were conducted with a comparable or much smaller number of queries. For instance, some 221 search queries were used in the Cranfield II tests, one of the earliest information retrieval experiments. The search queries were "obtained by asking the authors of selected published papers (`base documents') to reconstruct the questions which originally gave rise to these papers" (Robertson, 1981, p.20). Similarly, 302 genuine search queries were used in the MEDLARS study. The search queries used in the MEDLARS tests originated from the real information needs of the system's users (Lancaster, 1968). More recently, Blair and Maron (1985) used some 51 real search queries, obtained from two lawyers, to test the retrieval effectiveness of the STAIRS system. Tague (1981) observes that "the number of queries in information retrieval tests seems to vary from 15 to 300, with values in the range 50 to 100 being most common" (p.82).
8.1.3 Relevance Judgments
Relevance judgments for each retrieved document for a given query will be recorded for further analysis and for computation of the precision and recall ratios. Similarly, relevance judgments for documents retrieved in subsequent runs will also be recorded for the same purposes.
The procedure for recording relevance judgments is as follows. For each record retrieved in response to the user's search request, the user is required to take some action. If the record he/she scans is relevant to his/her query, he/she simply presses the "relevant" key on the keyboard. If the record retrieved is not relevant, the user simply presses the "return" key, which tells CHESHIRE that the record is not relevant. Records thus identified as relevant or non-relevant by the user will be taken into account should he/she wish to perform a relevance feedback search later.
Note that relevance assessments will be based on retrieved references with full bibliographic information including subject headings, not the full text of documents. Relevance judgments will be done by the users themselves for search queries stemming from their real information needs.
For the purpose of further testing the retrieval effectiveness of CHESHIRE, some search queries will be repeated on the system. Relevance judgments at this stage will be made by the researcher. However, this will be done only after the data concerning user needs and intentions have been gathered through the critical incident technique. It is believed that, based on the contextual feedback gained from the users for each query, the researcher can make objective relevance judgments that reflect the actual users' decision-making processes as closely as possible.
8.1.4 Evaluation of Retrieval Effectiveness
Precision and recall ratios will be used to measure the initial retrieval performance in CHESHIRE. These measures will be calculated in this experiment as follows.
Precision will be taken as the ratio of the number of documents that a user judged relevant (by pressing the "relevant" key) for a particular query to the total number of records he/she had scanned when he/she either decided to quit or to do a relevance feedback search. Note that there is a slight difference between the original definition of precision and the one that will be used in this experiment: instead of taking the total number of records retrieved in response to a particular query, we will take the total number of records scanned by the user, no matter how many records the system retrieves for that query. For instance, if the user stops after scanning 2 records and judges one of them relevant, then the precision ratio will be 50%.
Precision ratios for retrievals during the relevance feedback process will be calculated in the same way.
Recall is considerably more difficult to calculate than precision since it requires finding relevant documents that will not be retrieved in the course of users' initial searches (Blair and Maron, 1985). In this experiment, recall will be calculated for each search on the basis of previously identified relevant documents, which will be retrieved using various techniques such as taking samples from rich subsets of the database (Blair and Maron, 1985; Larson, 1991c). Familiarity with the database (i.e., records mainly about library and information science) is thought to facilitate the researcher's task in this respect.
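As a concrete illustration of the two definitions just given, the sketch below computes precision over the records the user actually scanned and recall against a set of previously identified relevant documents; the record identifiers and the form of the input are assumptions made for the example only.

def precision(scanned, judged_relevant):
    # Records judged relevant over the records the user actually scanned.
    return len(judged_relevant) / len(scanned) if scanned else 0.0

def recall(judged_relevant, relevant_in_database):
    # Relevant records displayed over all relevant records previously
    # identified in the database for this query.
    if not relevant_in_database:
        return 0.0
    return len(set(judged_relevant) & set(relevant_in_database)) / len(relevant_in_database)

# The worked example in the text: the user scans 2 records and judges 1 relevant.
print(precision(scanned=["rec1", "rec2"], judged_relevant=["rec1"]))   # 0.5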
It is worth repeating that the relevance judgments used in calculating recall will be made by the researcher based on the data obtained from the users through the critical incident technique. As mentioned earlier, incident reports gathered by means of the critical incident technique will provide feedback about users' information needs and put each search statement in perspective, which will further facilitate relevance judgments.
In addition to finding out retrieval effectiveness through precision and recall measures, retrieval effectiveness will also be evaluated by gathering data from the users. In other words, users will be consulted as to what they think about the effectiveness of specific searches that they performed on CHESHIRE. Although it is not possible to quantify user designated retrieval effectiveness in mathematical terms, it will nonetheless be interesting to compare user designated ineffective searches with precision and recall ratios for corresponding search queries.
8.2 Subjects
Doctoral and entering master's students (Fall 1991) in the School of Library and Information Studies at UC Berkeley have been approached and their agreement has been sought for data collection, analysis and evaluation purposes in conducting this experiment.
Some 24 doctoral students have been invited by the researcher to participate in the experiment (see Attachments: Letter to Doctoral Students). Along with the invitation letter they also received a handout entitled "Background Information About CHESHIRE and Guidelines for CHESHIRE Searches" and detailed instructions to get access to the CHESHIRE system (see Attachments).
In order to have master's students participate in this study, the cooperation of instructors who taught the course L210 (Organization of Information) has been requested. All entering master's students (ca. 85) in the Fall semester of 1991 were required to take L210. This course deals with, inter alia, the organization of information, namely cataloging and classification. The researcher has been granted 20 minutes of class time by the instructors to introduce his research to MLIS students. (There were four sections of L210.) The purpose of the research has been explained briefly to the students in each section and their participation has been requested. An invitation letter has been handed out along with the detailed instructions to get access to CHESHIRE (see Attachments). In addition, the researcher walked through the instructions with students and presented the necessary steps to get access to CHESHIRE via overhead transparencies. Altogether some 85 entering MLIS students received documentation about the research and their participation has been sought.
The CHESHIRE online catalog has been made accessible to doctoral and entering master's students for online searches throughout the Fall 1991 semester. It is believed that the CHESHIRE catalog will be utilized as its database contains the holdings of the Library of the School of Library and Information Studies along with full bibliographic information for each document including call numbers.
Each subject who agreed to participate in the study has been issued a password to access the CHESHIRE online catalog. Passwords are used to identify subjects for data gathering purposes and to trigger the transaction log programs to record each subject's entire session on CHESHIRE. (It is possible to provide access to CHESHIRE without passwords (e.g., by creating a list of users who already have passwords and opening CHESHIRE to them via rlogin). In fact, CHESHIRE is also available on the Internet without passwords.)
Some information about the subjects such as their familiarity with the online catalogs and basic application programs (e.g., word processing, database management systems) has also been collected.
8.3 Data Gathering Tools
A wide variety of data gathering tools have been used throughout the experiment, the most important being the CHESHIRE experimental online library catalog (see Section 8.1.1), transaction logs, and interviews to collect critical incident reports about search failures. Each "tool" is briefly explained below.
Transaction logs were used to capture data about the entire session for each search conducted on CHESHIRE. A number of data elements have been recorded in the transaction logs. The following elements represent the kind of data captured for each search request in this research (a sketch of how such a record might be represented for analysis follows the list):
- user's password;
- logon time and date (to the nearest second);
- the search statement(s);
- records retrieved and displayed to the user;
- number of records displayed for each search;
- user's relevance judgment on each record displayed;
- relevance feedback requests;
- number of times user requests relevance feedback search for the same query;
- the total time user spent on the system, etc.
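The sketch below shows one way such a log record might be represented for analysis; the field names are illustrative assumptions and do not describe the actual CHESHIRE log format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchTransaction:
    # One logged CHESHIRE session; field names are illustrative only.
    password: str                                                   # identifies the subject
    logon_time: str                                                 # logon date and time, to the nearest second
    search_statements: List[str] = field(default_factory=list)     # query text entered
    records_displayed: List[str] = field(default_factory=list)     # record IDs shown to the user
    relevance_judgments: List[bool] = field(default_factory=list)  # one judgment per record shown
    feedback_iterations: int = 0                                    # relevance feedback requests
    session_seconds: float = 0.0                                    # total time spent on the system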
A number of programs that can be used to record transaction logs are available on CHESHIRE. At present, search statements can be stored as text files. The existing programs to capture transaction data have been modified by the researcher with the help of Professor Ray Larson. The programs to capture transaction log data were in place by the beginning of the fall semester (1991).
A letter inviting doctoral students to participate in the experiment has been handed out to all students during the third week of the fall semester (1991). The experiment has been briefly explained in this letter and the cooperation of students has been sought. Permission to review their transactions has also been obtained from participating students (see Attachments).
A similar letter has been handed out to all entering master's students taking the course L210. In addition, the researcher has explained the experiment briefly in the classroom during a 20-minute presentation supported with overhead transparencies. The written consent of the master's students to review their transactions has also been obtained.
A brief handout about CHESHIRE and guidelines for CHESHIRE searches have been distributed during the class along with more detailed instructions to get access to CHESHIRE. The handout and detailed instructions explain such issues as how to get access to CHESHIRE, how to formulate natural language-based search queries and how to retrieve and display bibliographic records in CHESHIRE. The researcher has made himself available throughout the semester for further consultation.
Students have been asked to answer a few questions about their prior catalog and computer experience and to return their answers to the researcher with their consent forms.
Students will be encouraged to use the system as frequently as they desire. The number of search queries submitted to the system will be monitored throughout the experiment. Reminders will be mailed to students throughout the semester asking them to continue to use CHESHIRE.
Further data will be gathered through the critical incident technique. Two types of critical incident report forms were devised (modified from Wilson, Starr-Schneidkraut and Cooper (1989)): one for reporting effective searches and the other for ineffective searches (see Attachments: "Effective Incident Report Form," "Ineffective Incident Report Form"). Interviews with the subjects will be recorded on these incident report forms. Conversations with the subjects will also be tape-recorded (with permission) for further analysis. Incident reports will contain data about users' recent searches in CHESHIRE along with the information needs that triggered those particular searches. The search statements users typed during the search and the effectiveness of the relevance feedback results (if applicable) will also be recorded. The incident report forms also ask for users' own assessments of the effectiveness of their searches. A brief questionnaire has been developed (see Attachments: "Questionnaire") which will be used during the structured interviews with the subjects concerning their experience and search success in CHESHIRE. Each searcher will be asked to fill out a questionnaire for each incident he/she reports. The questionnaire aims to measure, in more precise terms, users' perceived search success in CHESHIRE. In a way, the questionnaire complements the critical incident reports in that it contains similar questions. The questionnaire will also be used to corroborate the findings obtained from the critical incident reports.
8.4 Data Analysis and Evaluation Methodology
After gathering raw data from the users by means of transaction logs, structured interviews and questionnaires, a comprehensive analysis and evaluation will be conducted on this data.
The analysis of transaction logs will reveal quantitative data about the use of the CHESHIRE online library catalog system during the period of the experiment. For instance, such statistical data as the number of searches conducted, the number of different users, the number of records displayed and judged relevant, the average number of terms in search statements, the average number of matching terms between search statements and the titles and subject headings of documents, and system usage statistics can easily be gathered.
The critical incident reports will be analyzed next. These reports will reveal the users' specific information needs and intentions that resulted in performing a search in CHESHIRE. Information contained in the incident reports will be tabulated so that it can be compared with the corresponding transaction logs and questionnaires. Each incident report and corresponding questionnaire will be given an identification number. The resultant incident reports and questionnaires will later be "matched" with the transaction log records so as to determine the relationship between the features as reported by the user and those reflected in the transaction log (Wilson, Starr-Schneidkraut and Cooper, 1989, pp.14-15).
The session(s) belonging to each user can be identified in advance from the transaction log by means of the password entered by the user (the password consists of the first and last names of the participating student). This will facilitate the matching process. In fact, advance identification of sessions may well be very helpful during the interviews: the interviewer can help the interviewees by leading them to their most recent searches.
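Identifying each user's sessions amounts to grouping the logged records by login ID. The minimal sketch below assumes each logged record is a dictionary carrying a "login_id" key; this is an assumption made for the example, not the actual CHESHIRE log representation.

from collections import defaultdict

def sessions_by_login(log_records):
    # Collect all logged searches performed by each user, keyed by login ID.
    by_user = defaultdict(list)
    for record in log_records:
        by_user[record["login_id"]].append(record)
    return dict(by_user)

logs = [{"login_id": "jdoe", "query": "online catalogs"},
        {"login_id": "asmith", "query": "indexing"},
        {"login_id": "jdoe", "query": "relevance feedback"}]
print(sessions_by_login(logs)["jdoe"])   # the two searches performed by "jdoe"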
Transaction logs will later be analyzed for qualitative purposes. An attempt will be made to identify search failures along with their causes by making use of a variety of methods: analyzing search statements, comparing the match between search terms and titles and subject headings, analyzing the user supplied incident report, and analyzing the records retrieved and displayed.
Critical incident reports and questionnaires belonging to ineffective searches will be analyzed further. The search terms in the queries will be compared with the titles of the documents and the LCSH assigned to them so as to find out the degree of match between the users' vocabulary and that of the system. Such a comparison may furnish further evidence to help explain search failures. The results can be tabulated for each query to see if there is a correlation between the success rates obtained for matching and non-matching search queries.
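One simple way of operationalizing this comparison is sketched below: the fraction of a user's (non-stopword) query terms that also appear in the title or the assigned subject headings of a record. The tokenization and the stop word list are assumptions made only for the example.

import re

STOPWORDS = {"the", "of", "and", "in", "a", "an", "for", "on", "to"}   # assumed list

def terms(text):
    # Lowercase word tokens with stop words removed (illustrative tokenization).
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def vocabulary_match(query, title, subject_headings):
    # Fraction of the user's query terms found in the record's title or LCSH.
    query_terms = terms(query)
    system_vocab = terms(title)
    for heading in subject_headings:
        system_vocab |= terms(heading)
    return len(query_terms & system_vocab) / len(query_terms) if query_terms else 0.0

print(vocabulary_match("failure analysis of online catalogs",
                       "Online catalogs: design and use",
                       ["Catalogs, On-line", "Information retrieval"]))   # 0.5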
System-assigned term weights may sometimes cause search failures as well. It is conceivable that some terms are weighted more heavily during the analysis of the search statement because of collection characteristics and probabilistic retrieval rules. However, the terms weighted most heavily by the system may not necessarily be the most important ones from the user's point of view. For instance, Larson (1991c) found that the within-document term frequency used in probabilistic weighting is "not very helpful in determining a ranking for documents" in CHESHIRE (p.21). Nevertheless, it is possible to go back, examine the assigned weights and determine whether a search failure occurred because of system-assigned term weights.
In order to identify the role of natural language-based user interfaces in retrieval effectiveness, some randomly selected queries can be searched on MELVYL using detailed search tactics. Although the results will not be directly comparable to those obtained in CHESHIRE, the individual records can be compared so as to see if additional records are retrieved by either of the systems.
As mentioned above, a comprehensive analysis of search failures will be conducted. Let us now look in more detail at how the causes of search failures in CHESHIRE will be studied.
First, the precision value for each search query will be obtained from the transaction logs. Queries that failed or produced poor precision values will be identified for further investigation.
Second, the ineffective critical incident reports and questionnaires will be analyzed separately. Searches identified by users as ineffective will be checked for clues as to why the search failed or was deemed ineffective by the user. The causes reported by users will be tabulated.
Since login IDs will be recorded in the transaction log data, it will be relatively easy to scan the transaction logs automatically and find all occurrences of the searches performed by each user. Upon matching an ineffective incident report with the corresponding transaction log record, the user's description of search performance will be compared with that in the transaction log.
If the two accounts match and the search is deemed to be ineffective by the user, the transaction log record will be examined to verify this. If the search is ineffective, it will be classified as such and studied further.
The two accounts of search performance could sometimes be dissimilar. That is to say, the results for a given search query might be deemed ineffective by the user while the precision value for the same query recorded in the transaction log would suggest otherwise, or vice versa. If this is the case, the source of the discrepancy will be investigated. The user's account will play an important role in discovering the discrepancy between the two accounts. The query statement entered into the system will be examined along with the user's stated information needs and intentions recorded in the corresponding incident report. The search terms used and recorded in the transaction log will also be examined to see if there is any discrepancy between the two versions. The retrieved records (titles, subject headings, etc.) and the user's relevance judgments for these records will be analyzed further to find the source of disagreement. Depending on the nature of the disagreement, the search will either be classified as ineffective and studied further, or it will be eliminated from the sample because of the conflicting reports. This does not mean, however, that ineffective searches will be eliminated simply because the user's story does not corroborate the transaction log record; it means only that full corroboration between incident reports and transaction log records may not be achieved owing to, among other things, a lack of detail in the user's story or an inability to locate the search in the transaction logs. The incidents eliminated for this reason will be recorded.
Having thus identified all ineffective searches, a more comprehensive analysis will be conducted to determine the causes of search failures. Information obtained through the ineffective incident reports will help the researcher understand the research question better. The retrieved records will be analyzed further: the terms in the titles and subject headings of the records will be noted, and the relevance judgments of the user will be examined. It is expected that, especially with the help of the incident reports, it will be relatively easy to identify the cause(s) of failure for most searches (e.g., the indexing language; vague, overly specific or overly general search statements; the types of records retrieved). The cause(s) of each search failure will be tabulated for further statistical analysis.
As CHESHIRE has relevance feedback capabilities, the effect of relevance feedback on search failures will be examined using similar methods.
Recording the causes of search failures for each query will help us develop a taxonomy of search failures in addition to facilitating the interpretation of the results of failure analysis.
The detailed examination of critical incident reports will also help the researcher better understand users' information needs and intentions, thereby facilitating the evaluation of retrieval effectiveness in CHESHIRE. Having reviewed the critical incident report for each search query, the researcher will be able to make more accurate relevance judgments on the retrieved documents. More importantly, this information will be most useful when calculating recall values based on the documents to be found by means of exhaustive search tactics.
Evaluation of retrieval effectiveness will be based on precision and recall measures, which will be calculated for each search query. The same measures will be used to evaluate the retrieval effectiveness of the relevance feedback process as well.
The precision value for each search query will be obtained from the transaction logs. As pointed out earlier, dividing the number of documents judged relevant by the user by the total number of documents the user has scanned gives the precision value.
The recall value for each search query will be calculated in a somewhat different manner. Using a wide variety of search tactics, the researcher will attempt to find, as far as possible, all the relevant documents in the database for each search query. The number of relevant documents displayed divided by the total number of relevant documents in the database gives the recall value.
Precision and recall values will be averaged over all search queries in order to find the average precision/recall ratio for CHESHIRE. The "macro evaluation" method will be used to calculate average precision and recall values. It provides adequate comparisons for test purposes and meets the need of indicating a user-oriented view of the results (Rocchio, 1971b). The macro evaluation method uses the average of ratios, not the average of raw counts. (The latter is called "micro evaluation.") For instance, suppose that we have two search queries. In the first case the user displays 25 documents and finds 10 of them relevant; in the second case the user displays 10 documents and finds only one relevant document. The average precision value for these two queries will be 0.25 using the macro evaluation method (((10/25) + (1/10)) / 2 = 0.25). (The micro evaluation method, on the other hand, gives approximately 0.31 for the same queries ((10 + 1) / (25 + 10) = 0.31).) As Rocchio (1971b) points out, the macro evaluation method is query-oriented while the micro evaluation method is document-oriented. The former "represents an estimate of the worth of the system to the average user" while the latter tends to give undue weight to search queries that have many relevant documents (i.e., document-oriented) (Rocchio, 1971b; cf. Tague, 1981).
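The worked example above can be written out as follows; the helper functions are illustrative, and the two (relevant, displayed) pairs are taken directly from the text.

def macro_precision(per_query):
    # Average of per-query precision ratios (query-oriented).
    return sum(rel / shown for rel, shown in per_query) / len(per_query)

def micro_precision(per_query):
    # Pooled counts: total relevant displayed over total displayed (document-oriented).
    return sum(rel for rel, _ in per_query) / sum(shown for _, shown in per_query)

queries = [(10, 25), (1, 10)]        # 10 of 25, then 1 of 10 displayed records relevant
print(macro_precision(queries))       # (0.40 + 0.10) / 2 = 0.25
print(micro_precision(queries))       # 11 / 35, approximately 0.31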
A recall/precision graph will then be plotted. In order to construct the graph, precision values will be interpolated at the standard recall values (i.e., 0, 0.1, 0.2, ..., 0.9, 1.0). The interpolated precision values will then be averaged over all queries. This process determines the points representing precision at the standard recall values in the recall-precision graph (Tague, 1981). The precision/recall graph from the MEDLARS study is provided below as an example.
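The sketch below illustrates the interpolation step for a single query, using the common convention (assumed here) that the interpolated precision at a standard recall level is the highest precision observed at that level or beyond; the (recall, precision) points are invented for the example. Averaging the interpolated values at each level over all queries then yields the points of the recall-precision graph.

def interpolated_precision(points, levels=None):
    # points: observed (recall, precision) pairs for one query.
    # At each standard recall level, take the highest precision observed
    # at any recall greater than or equal to that level.
    if levels is None:
        levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return {level: max((p for r, p in points if r >= level), default=0.0)
            for level in levels}

points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.33)]
print(interpolated_precision(points))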
Precision/recall graphs will illustrate the retrieval effectiveness that users attained in CHESHIRE. The improvement in precision/recall ratios, should there be any, due to the relevance feedback effect can also be observed from such graphs.
"Normalized" precision and recall values would be easier to calculate, as was done in some studies (Salton, 1971). However, normalized recall does not take into account of all relevant documents in the database. Whenever the user stops, the recall value at that point is assumed as 100% even though there might be more relevant documents in the database for the same query which the user has not yet seen. The recall figures to be obtained through normalized recall may not reflect the actual performance levels. Yet we believe that, after analyzing critical incidence reports which will contain much helpful information about users' information needs and intentions, we will obtain more reliable recall values based on the comprehensive searches to be conducted in CHESHIRE in order to find out the causes of search failures.
Note that users tend to be more concerned with precision. In other words, what counts most of the time for the user is whether he/she can retrieve some relevant documents from the database that are not too diluted with non-relevant ones. As long as the user is able to find some relevant documents among those retrieved, he/she may not give much thought to the fact that the system might be missing some further relevant documents. Recall values, on the other hand, are of greater concern to system designers, indexers and collection developers than to users. Recall failures tend to generate much needed feedback for improving retrieval effectiveness in present document retrieval systems, although they are more difficult and time-consuming to detect and analyze.
The relationship between user designated retrieval effectiveness and precision/recall measures will be studied. In order to make user designated retrieval effectiveness more explicit for the purpose of comparison, questions have been added to the critical incident report forms and the questionnaire asking users to rate the retrieval effectiveness of their recent searches more precisely. The results can be compared with the precision/recall ratios found for the corresponding search queries recorded in the transaction logs.
Although determining the exact role of relevance feedback in improving the retrieval effectiveness in CHESHIRE is difficult, Larson (1989) points out that "experience with the CHESHIRE system has indicated that the ranking mechanism is working quite well, and the top ranked clusters provide the largest numbers of relevant items" (p.133).
Quantitative data obtained through the analysis of transaction logs, critical incident reports and questionnaires will be entered into a statistical package so that further analyses and comparisons using the same data will be expedited.
Various statistical tests will be used to measure the significance of the average difference in retrieval effectiveness values between the initial retrieval and the relevance feedback retrieval. Significance tests measure the probability that "the two sets of values obtained from two separate runs are actually drawn from samples which have the same characteristics" (Ide, 1971, p.344). The t test and the Wilcoxon test will be used for the evaluation of the findings as appropriate.
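Purely as an illustration of how such tests might be run (the choice of the SciPy library and the numbers below are assumptions for the example, not part of the proposal), the sketch applies a paired t test and a Wilcoxon signed-rank test to per-query precision values before and after relevance feedback.

from scipy import stats

# Per-query precision before and after relevance feedback (placeholder values).
initial  = [0.40, 0.25, 0.58, 0.10, 0.50, 0.33]
feedback = [0.55, 0.30, 0.60, 0.20, 0.45, 0.50]

t_stat, t_p = stats.ttest_rel(feedback, initial)   # paired t test on the per-query differences
w_stat, w_p = stats.wilcoxon(feedback, initial)    # non-parametric alternative for the same pairing

print("t test:", t_stat, t_p)
print("Wilcoxon:", w_stat, w_p)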
The correlation between search failures and the match between users' natural language query terms and the titles of documents and LCSH will also be examined. It is expected that the more closely the users' vocabulary matches that of the system, the fewer search failures there will be.
8.5 Data Gathering, Analysis and Evaluation Procedure
Various methods to gather, analyze and evaluate data have been discussed in the previous sections. This section summarizes, in chronological order, the major activities that will take place during the data gathering, analysis and evaluation process. The study began with a detailed review of the CHESHIRE experimental online library catalog, since it is important that the researcher learn the inner workings of the computer programs running CHESHIRE. The source code of the SMART system as modified by Buckley, which is the basis of document retrieval and indexing in CHESHIRE, was studied by the researcher during coursework. Theoretical issues, such as the retrieval and relevance feedback formulae to be used during the experiment, have been discussed with Professor Larson.
Modifying the existing programs to capture transaction log data in CHESHIRE was the first step. The code was written and tested in consultation with, and with the help of, Professor Larson over the summer of 1991. An example of the transaction log data for one of the users is attached to this report (see Attachments).
The preparation of instructions for the users, the demonstration handouts and questionnaires was the next step. Instructions, questionnaire and critical incident forms have been tested during the summer of 1991.
An invitation letter along with instructions has been sent to doctoral students inviting them to participate in the experiment. Students agreeing to participate were asked to fill out a brief questionnaire, which was part of the invitation letter and consent form, about their computer and catalog search experience. The use of the system has been explained and demonstrated to each doctoral student as and when appropriate.
As explained in section 8.2, a similar arrangement has been made for entering master's students as well. Master's students were briefed about the purpose of research and their participation was requested. In addition, a short demo supported with overhead transparencies took place in one of the courses that all entering master's students have to take.
Once the preparations for data gathering were complete, the transaction log programs simply had to "wait" for the very first search query. Suppose that a legitimate user, say a doctoral student, decides to use CHESHIRE. The date and time, the login ID, the search statement he/she enters into CHESHIRE, the records he/she retrieves and displays, the relevance judgment for each record, any request for a relevance feedback search and the records consequently retrieved and displayed are all recorded by the transaction log programs. The following parameters can be derived from the transaction log data: the number of records displayed and judged relevant by the user, the number of terms in the search statement (excluding stop words), the match between search terms and title words and subject headings, the number of relevance feedback iterations, etc. Such data will be tabulated throughout the experiment for each search query submitted to the system.
After the experiment ends, the researcher will interview each user to find out about their experiences with CHESHIRE. The critical incident report form will be filled out for each search reported by the user (see Attachments: Effective and Ineffective Critical Incident Forms). If agreed, users' responses to the questions in the critical incident form will be tape-recorded for further analysis. The user will also be asked to complete a brief questionnaire for each query (see Attachments: "Questionnaire").
The researcher will have the searches performed by each interviewee ready during the interviews so that the user can be helped to remember what his/her search was about. Transaction(s) for each user will be found in the transaction logs by checking the login id which will be entered during the search.
Based on the user's answers to the questions in the critical incident form, each search will be examined and classified as either an ineffective or an effective search. An attempt will be made to identify all reported searches in the transaction logs. (However, a 100% match may not be achieved, for the reasons explained in Section 8.4.)
Each ineffective search will later be examined. The search terms will be checked. The match between user's vocabulary and titles and subject headings will be analyzed. The precision value will be obtained for the same search. The cause of search failure will finally be identified and recorded along with needed statistics.
A similar procedure will also be applied to effective searches, except that no cause of search failure will be identified.
The types of search failures will be classified in order to develop a taxonomy for search failures in online library catalogs.
For each search query matching the transaction log records, an exhaustive search will be performed on CHESHIRE in order to find further relevant documents in the database. These searches will also be recorded in transaction logs for further analysis.
Based on the relevant documents found in this stage, the precision/recall ratios will be calculated (see Section 8.4). The results will be averaged for all search queries. A precision/recall graph illustrating the retrieval effectiveness in CHESHIRE will be plotted.
Various statistics collected throughout the experiment will be entered into a statistical package and analyzed further. (Examples of such analyses are given in Section 8.4.) Statistical tests will be performed on the results.
The results will be evaluated and the research report will consequently be written.
8.6 Expected Results
First and foremost, the causes of search failures in online catalogs will be identified. The detailed analysis of search failures will help improve the training programs for online library catalogs.
Based on the results, the design of CHESHIRE and other online catalogs can be improved so as to accommodate user preferences. For instance, if we find that users rely more on subject searching in online catalogs, more weight can be assigned to query terms matching the LCSH assigned to the records.
It is expected that users will find it easier to use online catalogs with natural query languages than online catalogs with controlled languages. Similarly, it is expected that users will find online catalogs with natural query languages easier to use than the ones based on Boolean logic. Based on the findings, it is hoped that more helpful online catalog user interfaces can be designed.
It is also expected that the relevance feedback process will improve retrieval effectiveness in online catalogs, and that users will find the relevance feedback technique useful and will use it.
Retrieval effectiveness values found for "third generation" online catalogs with relevance feedback and natural language query-based user interfaces, such as CHESHIRE, are expected to be comparable to those found for "second generation" online catalogs.
A pool of search queries stemming from real information needs will be gathered for CHESHIRE. This will allow further testing and comparison of advanced retrieval techniques in CHESHIRE.
The critical incident technique will be used for the first time in studying search failures in online catalogs. If proved useful and practical, the technique can be utilized in other online catalog studies as well. It is expected that the critical incident technique will add value to the data gathered through transaction logs.
9. Select Bibliography
Alzofon, Sammy R. and Noelle Van Pulis. 1984. "Patterns of Searching and Success Rates in an Online Public Access Catalog," College & Research Libraries 45(2): 110-115, March 1984.
Bates, Marcia J. 1972. "Factors Affecting Subject Catalog Search Success," Unpublished Doctoral Dissertation. University of California, Berkeley.
___________. 1977a. "Factors Affecting Subject Catalog Search Success," Journal of the American Society for Information Science 28(3): 161-169.
___________. 1977b. "System Meets User: Problems in Matching Subject Search Terms," Information Processing and Management 13: 367-375.
__________. 1986. "Subject Access in Online Catalogs: a Design Model," Journal of American Society for Information Science 37(6): 357-376.
___________. 1989a. "The Design of Browsing and Berrypicking Techniques for the Online Search Interface," Online Review 13(5): 407-424.
__________. 1989b. "Rethinking Subject Cataloging in the Online Environment," Library Resources and Technical Services 33(4): 400-412.
Besant, Larry. 1982. "Early Survey Findings: Users of Public Online Catalogs Want Sophisticated Subject Access," American Libraries 13: 160.
Blair, David C. and M.E. Maron. 1985. "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System," Communications of the ACM 28(3): 289-299, March 1985.
Blazek, Ron and Dania Bilal. 1988. "Problems with OPAC: a Case Study of an Academic Research Library," RQ 28:169-178.
Borgman, Christine L. 1986. "Why are Online Catalogs Hard to Use? Lessons Learned from Information-Retrieval Studies," Journal of American Society for Information Science 37(6): 387-400.
Borgman, Christine L. "End User Behavior on an Online Information Retrieval System: A Computer Monitoring Study," in: International Conference on Research and Development in Information Retrieval. 6th Annual International ACM SIGIR Conference. Edited by Jennifer J. Kuehn. New York: ACM, 1983. pp.162-176.
Buckley, Chris. (1987). Implementation of the SMART Information Retrieval System. Ithaca, N.Y.: Cornell University, Department of Computer Science.
Byrne, Alex and Mary Micco. 1988. "Improving OPAC Subject Access: The ADFA Experiment," College & Research Libraries 49(5): 432-441.
Campbell, Robert L. 1990. "Developmental Scenario Analysis of Smalltalk Programming," in Empowering People: CHI '90 Conference Proceedings, Seattle, Washington, April 1-5, 1990. Edited by Jane Carrasco Chew and John Whiteside. New York: ACM, 1990, pp.269-276.
Carlyle, Allyson. 1989. "Matching LCSH and User Vocabulary in the Library Catalog," Cataloging & Classification Quarterly 10(1/2): 37-63, 1989.
Chan, Lois Mai. 1986a. Library of Congress Subject Headings: Principles and Application. 2nd edition. Littleton, CO: Libraries Unlimited, Inc.
_________. 1986b. Improving LCSH for Use in Online Catalogs. Littleton, CO: Libraries Unlimited, Inc.
_________. 1986c. "Library of Congress Classification as an Online Retrieval Tool: Potentials and Limitations," Information Technology and Libraries 5(3): 181-192, September 1986.
_________. 1989. "Library of Congress Class Numbers in Online Catalog Searching," RQ 28: 530-536, Summer 1989.
Cochrane, Pauline A. and Karen Markey. 1983. "Catalog Use Studies -Since the Introduction of Online Interactive Catalogs: Impact on Design for Subject Access," Library and Information Science Research 5(4): 337-363.
Cooper, Michael D. 1991. "Failure Time Analysis of Office System Use," Journal of American Society for Information Science (to appear in 1991).
Cooper, Michael D. and Cristina Campbell. 1989. "An Analysis of User Designated Ineffective MEDLINE Searches," Berkeley, CA: University of California at Berkeley, 1989.
Dale, Doris Cruger. 1989. "Subject Access in Online Catalogs: An Overview Bibliography," Cataloging & Classification Quarterly 10(1/2): 225-251, 1989.
Dickson, J. 1984. "Analysis of User Errors in Searching an Online Catalog," Cataloging & Classification Quarterly 4: 19-38, 1984.
Doszkocs, T.E. 1983. "CITE NLM: Natural Language Searching in an Online Catalog," Information Technology and Libraries 2(4): 364-380, 1983.
Flanagan, John C. 1954. "The Critical Incident Technique," Psychological Bulletin 51(4): 327-358, July 1954.
Frost, Carolyn O. 1987a. "Faculty Use of Subject Searching in Card and Online Catalogs," Journal of Academic Librarianship 13(2): 86-92.
Frost, Carolyn O. 1989. "Title Words as Entry Vocabulary to LCSH: Correlation between Assigned LCSH Terms and Derived Terms From Titles in Bibliographic Records with Implications for Subject Access in Online Catalogs," Cataloging & Classification Quarterly 10(1/2): 165-179, 1989.
Frost, Carolyn O. and Bonnie A. Dede, 1988. "Subject Heading Compatibility between LCSH and Catalog Files of a Large Research Library: a Suggested Model for Analysis," Information Technology and Libraries 7: 292-299, September 1988.
___________. 1987b. "Subject Searching in an Online Catalog," Information Technology and Libraries 6: 61-63.
Gerhan, David R. 1989. "LCSH in vivo: Subject Searching Performance and Strategy in the OPAC Era," Journal of Academic Librarianship 15(2): 83-89.
Hancock-Beaulieu, Micheline. 1987. "Subject Searching Behaviour at the Library Catalogue and at the Shelves: Implications for Online Interactive Catalogues," Journal of Documentation 43(4): 303-321.
____________. 1990. "Evaluating the Impact of an Online Library Catalogue on Subject Searching Behaviour at the Catalogue and at the Shelves," Journal of Documentation 46(4): 318-338, December 1990.
Hartley, R.J. 1988. "Research in Subject Access: Anticipating the User," Catalogue and Index (88): 1, 3-7.
Hays, W.L. and R.L. Winkler. 1970. Statistics: Probability, Inference and Decision. Vol. II. New York: Holt, Rinehart and Winston, 1970. (pp.236-8 for Wilcoxon sign tests in IR research.)
Henty, M. 1986. "The User at the Online Catalogue: a Record of Unsuccessful Keyword Searches," LASIE 17(2): 47-52, 1986.
Hildreth, Charles R. 1989. Intelligent Interfaces and Retrieval Methods for Subject Searching in Bibliographic Retrieval Systems. Washington, DC: Cataloging Distribution Service, Library of Congress.
Holley, Robert P. 1989. "Subject Access in the Online Catalog," Cataloging & Classification Quarterly 10(1/2): 3-8, 1989.
Hunter, Rhonda N. 1991. "Successes and Failures of Patrons Searching the Online Catalog at a Large Academic Library: a Transaction Log Analysis," RQ 30(3): 395-402, Spring 1991.
Ide, E. (1971). "New Experiments in Relevance Feedback." in Salton, Gerard, ed. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice-Hall. pp. 337-354.
Janosky, B., Smith, P.J. and Hildreth, C. 1983. Online Library Catalog Systems: An Analysis of User Errors. Columbus, OH: The Ohio State University Department of Industrial and Systems Engineering. MS Thesis submitted for publication.
Jones, R. 1986. "Improving Okapi: Transaction Log Analysis of Failed Searches in an Online Catalogue," Vine (62): 3-13, 1986.
Kaske, Neal N. 1988a. "A Comparative Study of Subject Searching in an OPAC Among Branch Libraries of a University Library System," Information Technology and Libraries 7: 359-372.
___________. 1988b. "The Variability and Intensity over Time of Subject Searching in an Online Public Access Catalog," Information Technology and Libraries 7: 273-287.
Kaske, Neal K. and Sanders, Nancy P. 1980. "Online Subject Access: the Human Side of the Problem," RQ 20(1): 52-58.
__________. 1983. A Comprehensive Study of Online Public Access Catalogs: an Overview and Application of Findings. Dublin, OH: OCLC. (OCLC Research Report # OCLC/OPR/RR-83-4)
Keen, E. M. 1971. "Evaluation Parameters," in Salton, Gerard, ed. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice-Hall. pp.74-111.
Kern-Simirenko, Cheryl. 1983. "OPAC User Logs: Implications for Bibliographic Instruction," Library Hi Tech 1: 27-35, Winter 1983.
Kinsella, Janet and Philip Bryant. 1987. "Online Public Access Catalog Research in the United Kingdom: An Overview," Library Trends 35: 619-629, 1987.
Kirby, Martha and Naomi Miller. 1986. "MEDLINE Searching on Colleague: Reasons for Failure or Success of Untrained End Users," Medical Reference Services Quarterly 5(3): 17-34, Fall 1986.
Klugman, Simone. 1989. "Failures in Subject Retrieval," Cataloging & Classification Quarterly 10(1/2): 9-35, 1989.
Kretzschmar, J.G. 1987. "Two Examples of Partly Failing Information Systems," in: Wise, John A. and Anthony Debons, eds. Information Systems: Failure Analysis. Berlin: Springer Verlag, 1987.
Lancaster, F.W. 1968. Evaluation of the MEDLARS Demand Search Service. Washington, DC: US Department of Health, Education and Welfare, 1968.
Lancaster, F.W. 1969. "MEDLARS: Report on the Evaluation of Its Operating Efficiency," American Documentation 20(2): 119-142, April 1969.
Larson, Ray R. 1986. "Workload Characteristics and Computer System Utilization in Online Library Catalogs." Doctoral Dissertation, University of California at Berkeley, 1986. (University Microfilms No. 8624828)
___________. 1989. "Managing Information Overload in Online Catalog Subject Searching," In: ASIS '89: Proceedings of the 52nd ASIS Annual Meeting, Washington, DC, October 30-November 2, 1989. Ed. by Jeffrey Katzer et al. Medford, NJ: Learned Information. pp. 129-135.
___________. 1991a. "Classification Clustering, Probabilistic Information Retrieval and the Online Catalog," Library Quarterly 61(2): 133-173, April 1991.
___________. 1991b. "The Decline of Subject Searching: Long Term Trends and Patterns of Index Use in an Online Catalog," Journal of American Society for Information Science 42(3): 197-215, April 1991.
___________. 1991c. "Evaluation of Advanced Information Retrieval Techniques in an Experimental Online Catalog," Journal of American Society for Information Science (Submitted for publication) [1991].
Larson, Ray R. and V. Graham. 1983. "Monitoring and Evaluating MELVYL," Information Technology and Libraries 2: 93-104.
Lawrence, Gary S. 1985. "System Features for Subject Access in the Online Catalog," Library Resources and Technical Services 29(1): 16-33.
Lawrence, Gary S., V. Graham and H. Presley. 1984. "University of California Users Look at MELVYL: Results of a Survey of Users of the University of California Prototype Online Union Catalog," Advances in Library Administration 3: 85-208.
Lewis, David. 1987. "Research on the Use of Online Catalogs and Its Implications for Library Practice," Journal of Academic Librarianship 13(3): 152-157.
Markey, Karen. 1980. Analytical Review of Catalog Use Studies. Dublin, OH: OCLC, 1980. (OCLC Research Report # OCLC/OPR/RR-80/2.)
_________. 1983. The Process of Subject Searching in the Library Catalog: Final Report of the Subject Access Research Project. Dublin, OH: OCLC.
_________. 1984. Subject Searching in Library Catalogs: Before and After the Introduction of Online Catalogs. Dublin, OH: OCLC.
_________. 1985. "Subject Searching Experiences and Needs of Online Catalog Users: Implications for Library Classification," Library Resources and Technical Services 29: 34-51.
_________. 1986. "Users and the Online Catalog: Subject Access Problems," in Matthews, J.R. (ed.) The Impact of Online Catalogs pp.35-69. New York: Neal-Schuman, 1986.
_________. 1988. "Integrating the Machine-Readable LCSH into Online Catalogs," Information Technology and Libraries 7: 299-312.
Markey, Karen and Anh N. Demeyer. 1986. Dewey Decimal Classification Online Project: Evaluation of a Library Schedule and Index Integrated into the Subject Searching Capabilities of an Online Catalog. (Report Number: OCLC/OPR/RR-86-1) Dublin, OH: OCLC, 1986.
Maron, M.E. 1984. "Probabilistic Retrieval Models," in: Dervin, Brenda and Melvin J. Voigt, (eds.). Progress in Communication Sciences Vol. V. Norwood, NJ: Ablex, 1984, pp.145-176.
Matthews, Joseph R. 1982. A Study of Six Public Access Catalogs: a Final Report Submitted to the Council on Library Resources, Inc. Grass Valley, CA: J. Matthews and Assoc., Inc.
Matthews, Joseph, Gary S. Lawrence and Douglas Ferguson (eds.) 1983. Using Online Catalogs: a Nationwide Survey. New York: Neal-Schuman.
Mitev, Nathalie Nadia, Gillian M. Venner and Stephen Walker. 1985. Designing an Online Public Access Catalogue: Okapi, a Catalogue on a Local Area Network. (Library and Information Research Report 39) London: British Library, 1985.
Naharoni, A. 1980. "An Investigation of W.T. Grant as Information System Failure," Ph.D. Dissertation, University of Pittsburgh, Pittsburgh, PA, 1980.
Nielsen, Brian. 1986. "What They Say They Do and What They Do: Assessing Online Catalog Use Instruction Through Transaction Monitoring," Information Technology and Libraries 5: 28-34, March 1986.
Norman, D.A. 1980. Errors in Human Performance. San Diego, CA: University of California, 1980.
Norman, D.A. 1983. "Some Observations on Mental Models," in: Stevens, A.L. and D. Gentner, eds. Mental Models. Hillsdale, NJ: Erlbaum, 1983.
Pease, Sue and Gouke, Mary Noel. 1982. "Patterns of Use in an Online Catalog and a Card Catalog," College and Research Libraries 43(4): 279-291.
Penniman, W. David. 1975. "A Stochastic Process Analysis of On-line User Behavior," Information Revolution: Proceedings of the 38th ASIS Annual Meeting, Boston, Massachusetts, October 26-30, 1975. Volume 12. Washington, DC: ASIS, 1975. pp.147-148.
Penniman, W. David. 1975. "Rhythms of Dialogue in Human-Computer Conversation." Ph.D. Dissertation, The Ohio State University, 1975.
Penniman, W.D. and W.D. Dominic. 1980. "Monitoring and Evaluation of On-line Information System Usage," Information Processing & Management 16(1): 17-35, 1980.
Peters, Thomas A. 1989. "When Smart People Fail: An Analysis of the Transaction Log of an Online Public Access Catalog," Journal of Academic Librarianship 15(5): 267-273, November 1989.
Porter, Martin and Valerie Galpin. 1988. "Relevance Feedback in a Public Access Catalogue for a Research Library: Muscat at the Scott Polar Research Institute," Program 22(1): 1-20, January 1988.
Puttapithakporn, Somporn. 1990. "Interface Design and User Problems and Errors: A Case Study of Novice Searchers," RQ 30(2): 195-204, Winter 1990.
Reason, J. and K. Mycielska. 1982. Absent-Minded? The Psychology of Mental Lapses and Everyday Errors. Englewood Cliffs, NJ: Prentice Hall, 1982.
Robertson, Stephen E. 1981. "The Methodology of Information Retrieval Experiment," in: Sparck Jones, Karen, ed. Information Retrieval Experiment. London: Butterworths, 1981. pp. 9-31.
Rocchio, Jr., J.J. 1971a. "Relevance Feedback in Information Retrieval," in Salton, Gerard, ed. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall. pp. 313-323.
___________. 1971b. "Evaluation Viewpoints in Document Retrieval," in Salton, Gerard, ed. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall. pp. 68-73.
Salton, G. 1971a. "Relevance Feedback and the Optimization of Retrieval Effectiveness," in Salton, Gerard, ed. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall. pp. 324-336.
Salton, Gerard, ed. 1971b. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall.
Salton, Gerard and Chris Buckley. 1990. "Improving Retrieval Performance by Relevance Feedback," Journal of the American Society for Information Science 41(4): 288-297.
Shepherd, Michael A. 1981. "Text Passage Retrieval Based on Colon Classification: Retrieval Performance," Journal of Documentation 37(1): 25-35, March 1981.
Shepherd, Michael A. 1983. "Text Retrieval Based on Colon Classification: Failure Analysis," Canadian Journal of Information Science 8: 75-82, June 1983.
Sparck Jones, Karen, ed. 1981. Information Retrieval Experiment. London: Butterworths, 1981.
Sparck Jones, Karen. 1981. "Retrieval System Tests 1958-1978," in: Sparck Jones, Karen, ed. Information Retrieval Experiment. London: Butterworths, 1981. pp. 213-255.
Svenonius, Elaine. 1983. "Use of Classification in Online Retrieval," Library Resources and Technical Services 27(1): 76-80, January-March 1983.
_________. 1986. "Unanswered Questions in the Design of Controlled Vocabularies," Journal of the American Society for Information Science 37(5): 331-340, 1986.
Svenonius, Elaine and H. P. Schmierer. 1977. "Current Issues in the Subject Control of Information," Library Quarterly 47: 326-346.
Swanson, Don R. 1977. "Information Retrieval as a Trial-and-Error Process," Library Quarterly 47(2): 128-148.
Tague, Jean M. 1981. "The Pragmatics of Information Retrieval Experimentation," in: Sparck Jones, Karen. (ed.) Information Retrieval Experiment. London: Butterworths, 1981. pp. 59-102.
Tague, J. and J. Farradane. 1978. "Estimation and Reliability of Retrieval Effectiveness Measures," Information Processing and Management 14: 1-16, 1978.
Tolle, John E. 1983. Current Utilization of Online Catalogs: Transaction Log Analysis. Dublin, OH: OCLC, 1983.
Tolle, John E. 1983. "Transaction Log Analysis: Online Catalogs," in: International Conference on Research and Development in Information Retrieval. 6th Annual International ACM SIGIR Conference. Edited by Jennifer J. Kuehn. New York: ACM, 1983. pp.147-160.
Users Look at Online Catalogs: Results of a National Survey of Users and Non-Users of Online Public Access Catalogs. 1982. Berkeley, CA: The University of California.
University of California Users Look at MELVYL: Results of a Survey of Users of the University of California Prototype Online Union Catalog. 1983. Berkeley, CA: The University of California, 1983.
Van der Veer, Gerrit C. 1987. "Mental Models and Failures in Human-Machine Systems," in: Wise, John A. and Anthony Debons, eds. Information Systems: Failure Analysis. Berlin: Springer Verlag, 1987.
Van Pulis, N. and L.E. Ludy. 1988. "Subject Searching in an Online Catalog with Authority Control," College & Research Libraries 49: 523-533, 1988.
Van Rijsbergen, C.J. 1979. Information Retrieval. 2nd ed. London: Butterworths.
Walker, Cynthia J.; K. Ann McKibbon; R. Brian Haynes and Michael F. Ramsden. 1991. "Problems Encountered by Clinical End Users of MEDLINE and GRATEFUL MED," Bulletin of the Medical Library Association 79(1): 67-69, January 1991.
Walker, Stephen. 1988. "Improving Subject Access Painlessly: Recent Work on the Okapi Online Catalogue Projects," Program 22(1): 21-31, January 1988.
Walker, Stephen and Richard M. Jones. 1987. Improving Subject Retrieval in Online Catalogues. 1: Stemming, Automatic Spelling Correction and Cross-Reference Tables. (British Library Research Paper 24) London: The British Library, 1987.
Walker, Stephen and R. de Vere. 1990. Improving Subject Retrieval in Online Catalogues. 2: Relevance Feedback and Query Expansion. (British Library Research Paper, no. 72) London: British Library, 1990.
Wilson, Patrick. 1983. "The Catalog as Access Mechanism: Background and Concepts," Library Resources and Technical Services 27(1): 4-17.
Wilson, Sandra R., Norma Starr-Schneidkraut and Michael D. Cooper. 1989. Use of the Critical Incident Technique to Evaluate the Impact of MEDLINE. (Final Report) September 30, 1989. Contract No. N01-LM-8-3529. Bethesda, MD: National Library of Medicine.
Wise, John A. and Anthony Debons, eds. 1987. Information Systems: Failure Analysis. Berlin: Springer Verlag, 1987.
Yannakoudakis, E.J. 1983. "Expert Spelling Error Analysis and Correction," in: Jones, P.K., ed. Informatics 7: Intelligent Information Retrieval: Proceedings of a Conference held by the Aslib Informatics Group and the Information Retrieval Group of the British Computer Society, Cambridge, 22-23 March 1983. London: Aslib, 1983. pp.????????
Zink, Steven D. 1991. "Monitoring User Search Success through Transaction Log Analysis: the WolfPac Example," Reference Services Review 19(1): 49-56, 1991.
ATTACHMENTS
1. Invitation Letter and Consent Form for Doctoral Students
2. Invitation Letter and Consent Form for MLIS Students
3. Background Information on CHESHIRE and Guidelines for CHESHIRE Searches
4. Access to CHESHIRE: An Experimental Online Catalog (Instructions)
5. Questionnaire
6. Effective Incident Form
7. Ineffective Incident Form
8. Reminder Letter (to be prepared)
9. Example of Transaction Log Data