CHAPTER III
FAILURE ANALYSIS IN DOCUMENT RETRIEVAL SYSTEMS:
A CRITICAL REVIEW OF STUDIES
3.0 Introduction
In Chapter II an overview of a document retrieval system was given along with the definitions of retrieval effectiveness measures such as precision, recall, and fallout. Relevance feedback, classification, and clustering techniques were briefly explained. The uses of such techniques in enhancing the effectiveness of document retrieval systems were discussed.
This chapter will examine the concepts of failure analysis in document retrieval systems and review the literature on failure analysis studies. A critical review of various methods of analyzing search failures is given in Section 3.2. A brief overview of major studies of search failures in document retrieval systems is given in Section 3.3.
3.1 Analysis of Search Failures
Online document retrieval systems often fail to retrieve some relevant documents. More often than not they also retrieve non-relevant documents. Such search failures may occur due to a variety of reasons, including problems with user-system interfaces, retrieval rules, and indexing languages.
Studying search failures presents extremely complicated problems. For instance, it is not clear exactly what constitutes a "search failure." While some researchers study search failures using retrieval effectiveness measures such as precision and recall, others prefer using "user satisfaction" as a criterion in deciding whether a search has failed or not. Before reviewing major failure analysis studies, it is helpful to examine some approaches used in studying search failures in document retrieval systems and to discuss the various (mostly implied) definitions of "search failure" used by researchers.
3.2 Methods of Analyzing Search Failures
This section discusses the analysis of search failures using retrieval effectiveness methods (e.g., precision and recall), user satisfaction measures, transaction logs, and the critical incident technique.
3.2.1 Analysis of Search Failures Utilizing Retrieval Effectiveness Measures
A detailed discussion of retrieval effectiveness measures such as precision and recall was given in Chapter II. As pointed out earlier, precision is defined as the proportion of retrieved documents which are relevant, whereas recall is defined as the proportion of relevant documents retrieved (Van Rijsbergen, 1979, p.10).
If precision and recall are seen as performance measures with the given definitions, it becomes clear that "performance" can no longer be defined as a dichotomous concept. When precision and recall are expressed as percentages, we can think of "degrees" of search failure or success. This view best reflects the different performance levels attained by current document retrieval systems. No document retrieval system is perfect; in practice, systems are merely better or worse than one another.
Performance measures such as precision and recall can be used in the analysis of search failures. In the example calculation of a precision value in Section 2.8 of Chapter II, only 50% of the documents retrieved were relevant, resulting in a precision of 50%. If each nonrelevant document that the system retrieves for a given query represents a search failure, then it is also possible to think of precision as a measure of search failure: failure to retrieve only relevant documents. The more nonrelevant documents the system retrieves for a given query, the higher the degree of precision failure. If no retrieved document happens to be relevant, then the precision ratio becomes zero due to severe precision failures.
In the recall example, the recall ratio was 25%, implying that the system missed 75% of the relevant documents in the collection. If each missed relevant document represents a search failure, then it is possible to think of recall as a measure of search failure: failure to retrieve all relevant documents in the collection. The more relevant documents the system misses, the higher the degree of recall failure. If the system fails to retrieve any relevant documents from the collection, then the recall ratio becomes zero due to severe recall failures.
Precision and recall are thus two different quantitative measures for aggregating search failures. For convenience, search failures analyzed using precision and recall are called precision failures and recall failures, respectively.
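For illustration, both failure measures can be computed directly from a query's retrieved set and the set of relevant documents in the collection. The following minimal sketch (in Python, with hypothetical document identifiers chosen to reproduce the 50% precision and 25% recall of the Chapter II example) is one way the degrees of precision and recall failure could be quantified for a single query.

```python
def failure_degrees(retrieved, relevant):
    """Return (precision_failure, recall_failure) as proportions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision failure: share of retrieved documents that are nonrelevant.
    precision_failure = (1 - len(hits) / len(retrieved)) if retrieved else 0.0
    # Recall failure: share of relevant documents in the collection that were missed.
    recall_failure = (1 - len(hits) / len(relevant)) if relevant else 0.0
    return precision_failure, recall_failure

# Hypothetical query: 10 documents retrieved, 5 of them relevant,
# 20 relevant documents in the whole collection.
retrieved = [f"d{i}" for i in range(1, 11)]
relevant = [f"d{i}" for i in range(1, 6)] + [f"r{i}" for i in range(1, 16)]
print(failure_degrees(retrieved, relevant))   # (0.5, 0.75): 50% precision, 25% recall
```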
Precision failures can easily be detected. They occur when the user finds some retrieved documents nonrelevant, even if those documents are assigned the index terms that the user initially asked for in the search query. Users may feel that index terms have been incorrectly assigned to documents that are not really relevant to those subjects.
Note that "relevance" is defined as a relationship ". . . between a document and a person in search of information" and it is a function of a large number of variables concerning both the document (e.g., what it is about, its currency, language, and date) and the person (e.g, person's education and beliefs) (Robertson et al., 1982, p.1).
Recall failures mainly occur because index terms that users would normally utilize to retrieve documents about particular subjects do not get assigned to documents that are relevant to those subjects. As stated earlier, detecting recall failures, especially in large-scale document retrieval systems, is much more difficult. Researchers have therefore used somewhat different approximations to calculate recall figures in their experiments.
Although information retrieval textbooks mention "fallout" as a measure of retrieval effectiveness, we are not aware of any experiment where the fallout ratio has been successfully calculated. Fallout is the proportion of nonrelevant documents retrieved over all the nonrelevant documents in the collection. Calculating the fallout ratio in large collections is as difficult as, if not more difficult than, calculating the recall ratio. To calculate the fallout ratio, all nonrelevant documents retrieved during the search must be identified, all nonrelevant documents in the overall collection must be found, and the size of the collection must be established.
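Once those three quantities are known, the arithmetic itself is trivial; the difficulty lies entirely in obtaining them. The sketch below (with a hypothetical collection size and document identifiers) simply shows the calculation and illustrates why fallout values tend to be minute in large collections.

```python
def fallout(retrieved, relevant, collection_size):
    """Proportion of nonrelevant documents in the collection that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    nonrelevant_retrieved = len(retrieved - relevant)
    nonrelevant_in_collection = collection_size - len(relevant)
    return nonrelevant_retrieved / nonrelevant_in_collection if nonrelevant_in_collection else 0.0

# Hypothetical case: 5 nonrelevant documents retrieved out of 99,980 nonrelevant
# documents in a collection of 100,000 -> fallout of about 0.005%,
# even though precision for the same search is only 50%.
retrieved = [f"d{i}" for i in range(1, 11)]
relevant = [f"d{i}" for i in range(1, 6)] + [f"r{i}" for i in range(1, 16)]
print(fallout(retrieved, relevant, 100_000))
```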
It is tempting to say that documents that are not retrieved are probably not relevant; however, since recall failures do occur in document retrieval systems, this is not the case. If all of the unretrieved documents in a collection were scanned, some of them would turn out to be relevant, and the fallout ratio could then be calculated. In practice, this method can only be used for specific queries where the number of relevant documents in the whole collection is known to be small.
"Fallout failures" do occur constantly in document retrieval systems even if it is impractical to quantify them. Whenever the system retrieves too many nonrelevant records, users feel the consequences of fallout failure. Either they must scan long lists of useless records (hence "fallout") or abandon the search.
Fallout failures can also be seen as severe precision failures. Fallout failure has not been adequately studied; however, it is known that users tend to resist scanning through screens of retrieved items. For instance, Larson (1991c, p.188) found that in a large online catalog the average number of records retrieved was 77.5, but users scanned an average of fewer than 10 records per search. (See also: Wiberley & Dougherty, 1988.) It is not clear why the users stopped scanning after a few records. Some may have been satisfied with the results. Others might have abandoned their searches out of frustration because the system retrieved too many unpromising, nonrelevant records. It would be interesting to study what percentage of searches in online catalogs are abandoned because of user frustration with fallout failures.
Mainly, then, retrieval effectiveness measures are used to determine and study three types of search failures: (1) retrieving nonrelevant documents (precision failures); (2) missing relevant documents (recall failures); and (3) retrieving too many unpromising, nonrelevant documents (fallout failures). Failure analysis aims to find out the causes of these failures so that existing systems can be improved in a variety of ways.
So far, we have looked at a few of the measures of retrieval effectiveness and the ways in which they are used in the study of search failures. We noted that document retrieval systems are not perfect and that we cannot expect them to achieve, or even approximate, the impossible ideal of retrieving all and only the relevant documents in the collection. Furthermore, users would like to find some relevant documents, but not necessarily all of them, unless (in rare cases such as patent searching) all are wanted. They prefer high precision to high recall. They wish to retrieve "some good references without having to examine too many bad ones" (Wages, 1989, p.80). Consequently, it is more important for a document retrieval system to "distinguish between wanted and unwanted items" quickly than to retrieve all relevant items in the collection.
Not everyone is satisfied with the most commonly used retrieval effectiveness measures (precision and recall), however. For instance, William Cooper has questioned the use of recall as a performance measure because it takes into account not only retrieved documents, but also unretrieved documents. In his view, this is wasted effort since the relevance of unretrieved documents has little bearing on the notion of subjective user satisfaction (Cooper, W., 1973; cf. Soergel, 1976). He maintains that "an ideal evaluation methodology must somehow measure the ultimate worth of a retrieval system to its users in terms of an appropriate unit of utility" (Cooper, W., 1973, p.88).
3.2.2 Analysis of Search Failures Utilizing User Satisfaction Measures
Some failure analysis studies are based on user satisfaction measures, rather than on retrieval effectiveness measures. Although it may at first seem straightforward, analyzing search failures utilizing user satisfaction measures is a complex process that provides interesting challenges.
First, defining user satisfaction is difficult. Several authors have tried to address this issue. Tessier et al. (1977) discussed such factors as the search output, the intermediary, the service policies, and the "library as a whole" as the main determinants of user satisfaction. Bates (1972, 1977a, 1977b) examined the effects of "subject familiarity" and "catalog familiarity" on search success and found that the former has a slight detrimental effect, while the latter has a very significant beneficial effect on search success. Tessier (1981) used factor analysis and multiple regression techniques to study the influence of various variables on overall search satisfaction. She found that "the strongest predictors of satisfaction were the precision of search, the amount of time saved, and the perceived quality of the database as a source of information" (Tessier, 1981; cited in Kinnucan, 1992, p.73). Hilchey and Hurych (1985, p.455) found "a strong positive relationship between perceived relevance of citations and search value" when they performed a statistical analysis on the online reference questionnaire forms returned by the users in a university library.
Second, user satisfaction relies heavily on users' judgments about search failures or successes; however, users' judgments may be inconsistent for various reasons. For example, Tagliacozzo (1977) found that "MEDLINE was perceived as `helpful' by respondents who, in other parts of the questionnaire [used in the author's research], showed that they had not found it particularly useful" (p. 248, original emphasis). Tagliacozzo warns us that ". . . caution should therefore be used in taking the users' judgments at face value, and in inferring from single responses that their information needs were, or were not, satisfied by the service."
It follows that it is not usually sufficient to obtain a binary "Yes/No" response from the user about being satisfied or not satisfied with the results. Ankeny (1991, p.356) found that the use of a two-point (yes-no) scale ". . . appeared to result in inflated success ratings." When pressed, users are likely to come up with further explanations. For example, a user might say: "Yes, in a way my search was successful even though I couldn't find what I wanted." A second user might say that a given search was not successful because "it did not retrieve anything new."
A researcher getting such answers would have a hard time classifying them. The data-gathering tools that the researcher employs to elicit information from users should be sensitive enough to handle such answers by asking more detailed questions. After all, a decision has to be made as to whether a search was successful or not. Further conditions have been introduced in some studies to facilitate this decision-making process. In Ankeny's study, for example, a successful search has three characteristics:
the patron must indicate that s/he found exactly what was wanted, that s/he was fully satisfied with the search, and that s/he marked none of the 10 listed reasons for dissatisfaction where the reasons for dissatisfaction ranged from `system problems' to `too much information,' from `information not relevant enough' to `need different viewpoint' (Ankeny, 1991, p.354, original emphasis; see also Auster & Lawton, 1984).
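Expressed as a decision rule, Ankeny's criterion might look like the following sketch (the field names are hypothetical, not Ankeny's): a search counts as fully successful only if all three conditions hold.

```python
def ankeny_success(found_exactly_what_was_wanted, fully_satisfied, dissatisfaction_reasons):
    """dissatisfaction_reasons: list of reasons the patron marked (empty if none)."""
    return (found_exactly_what_was_wanted
            and fully_satisfied
            and len(dissatisfaction_reasons) == 0)

print(ankeny_success(True, True, []))                        # True: complete success
print(ankeny_success(True, True, ["too much information"]))  # False: less than successful
```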
Nevertheless, a given search may still be a failure even if the answers given by a user meet all three of these conditions. It was noted earlier that users tend to abandon some searches that retrieve too many items, and many users may prefer to retrieve a few relevant documents quickly. Such searches would not be considered "failures" by these users even if the system has missed some relevant documents (i.e., recall failures).
User satisfaction measures are influenced by both the type of user and the search goal. For example, an undergraduate student writing a term paper may be satisfied if a search retrieves a few relevant textbooks. The situation is quite different for a health professional, who may want to know everything about a certain case because missing relevant information may have serious consequences. In one reported search, for example, a health professional investigating a medical procedure in the MEDLINE database found only records showing the procedure to be safe and did not find records that indicated fatalities associated with the procedure (Wilson et al., 1989).
The above examples show that some caution is needed when interpreting users' indication of satisfaction. There are some published studies that show that "in many cases high levels of reported end-user `satisfaction' . . . may not reflect true success rates" (Ankeny, 1991, p.356). Furthermore, as Cheney (1991, p.155) notes, we do not "know what end users expect of their search results, because no study has examined end users' expectations of database searching. Neither has any study examined the actual quality of end-user search results measured in terms of precision and recall."
So far, the discussion has concentrated on the analysis of search failures that were based on retrieval effectiveness or "user satisfaction." As part of a carefully designed and conducted experiment under "as real-life a situation as possible," Saracevic and Kantor (1988) studied, among other things, the relationship between user satisfaction and precision and recall.
Their experiment involved 40 users who each submitted a query that reflected a real information need. Thirty-nine professional searchers performed online searches of Dialog databases for these queries. Each query was searched by nine different professionals, and the results were combined for evaluation purposes. The precision ratio for a given search was estimated as the number of relevant items retrieved by the search divided by the total number of items retrieved by the search. Similarly, the recall ratio was estimated as the number of relevant items retrieved by the search divided by the total number of relevant items in the union of items retrieved by all searchers for that question (Saracevic et al., 1988). Five utility measures were used: (1) whether the user's participation and the resultant information were worth it (on a five-point scale); (2) time spent; (3) perceived (by the users) dollar value of the items; (4) whether the information contributed to the resolution of the research problem (on a five-point scale); and (5) whether the user was satisfied with the results (on a five-point scale).
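The pooled estimation can be illustrated with a short sketch (hypothetical item identifiers; this is not the authors' own procedure, only the arithmetic it implies): each searcher's precision is computed from his or her own output, while recall is computed against the union of relevant items retrieved by any searcher for the question.

```python
def pooled_precision_recall(searcher_results, relevant):
    """searcher_results: dict mapping searcher id -> set of retrieved item ids.
    relevant: set of item ids judged relevant by the user."""
    # Pool of relevant items found by any searcher for this question.
    pool = relevant & set().union(*searcher_results.values())
    scores = {}
    for searcher, retrieved in searcher_results.items():
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(pool) if pool else 0.0
        scores[searcher] = (precision, recall)
    return scores

results = {"s1": {"a", "b", "c"}, "s2": {"b", "d"}, "s3": {"c", "e", "f"}}
print(pooled_precision_recall(results, relevant={"a", "c", "d"}))
```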
They found that "searchers in questions where users indicated high overall satisfaction with results . . . were 2.49 times more likely to have higher precision" (Saracevic & Kantor, 1988, p.193). They interpreted their findings pertaining to the relationship between utility measures and retrieval effectiveness measures as follows:
In general, retrieved sets with high precision increased the chance that users assessed that the results were `worth more of their time than it took,' were `high in dollar value,' contributed `considerably to their problem resolution,' and `were highly satisfactory.' On the other hand, high recall did not significantly affect the odds for any of those measures. . . . These are interesting findings in another respect. They indicate that utility of results (or user satisfaction) may be associated with high precision, while recall does not play a role that is even closely as significant. For users, precision seems to be the king and they indicated so in the type of searches desired. In a way this points out to the elusive nature of recall: this measure is based on the assumption that something may be missing. Users cannot tell what is missing any more than searchers or systems can. However, users can certainly tell what is in their hand, and how much is not relevant (Saracevic & Kantor, 1988, p.193, original emphasis).
3.2.3 Analysis of Search Failures Utilizing Transaction Logs
The availability of transaction logs, which record users' interaction with the document retrieval systems, provides the opportunity to study and monitor search failures unobtrusively (Tolle, 1983a, 1983b; Borgman, 1983; Simpson, 1989). Larson (1991b, p.198) states: "Transaction monitoring, in its simplest form, involves the recording of user interactions with an online system. More complete transaction monitoring also will record the system responses and performance data (such as response time for searches), providing enough information to reconstruct all of the user's interactions with the system." This includes search queries entered, records displayed, help requests, errors, and the system responses.
Since transaction logs also contain invaluable information about failed searches, researchers have been interested in scanning them in order to identify such searches. Several researchers identified "zero hits" from the transaction logs of selected online catalogs and looked into the reasons for search failures (see, for instance, Dickson, 1984; Peters, 1989; Hunter, 1991; Zink, 1991; Cherry, 1992). A few others employed the same method when they studied search failures in MEDLINE (Kirby & Miller, 1986; Walker, C.J. et al., 1991). These researchers used a rather practical definition of search failure when scanning transaction logs: a search was treated as a failure if it retrieved no records.
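As a rough illustration, the zero-hit criterion amounts to nothing more than the following sketch; the log format (a CSV file with "query" and "hits" fields) is hypothetical, since real transaction logs vary widely from system to system.

```python
import csv

def zero_hit_searches(log_path):
    """Yield the queries whose recorded hit count is zero."""
    with open(log_path, newline="") as log:
        for row in csv.DictReader(log):
            if int(row["hits"]) == 0:
                yield row["query"]

# Example use: failed = list(zero_hit_searches("transaction_log.csv"))
```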
Needless to say, the definition of search failure as zero hits is incomplete since it does not include partial search failures. More importantly, there is no reason to believe that all "non-zero hit" searches were successful ones. Such an assumption would mean that no precision failures occurred in the systems under investigation! Furthermore, "not all zero hits represent failures for the patrons . . . It is possible that the patron is satisfied knowing that the information sought is not in the database, in which case the zero-hit search is successful" (Hunter, 1991, p.401). Precedent searching in litigation is an example of a zero-hit search that can be successful.
Some newer document retrieval systems such as Okapi and CHESHIRE can accommodate relevance feedback techniques and incorporate users' relevance judgments in order to improve retrieval effectiveness in subsequent iterations (Walker, S. & Hancock-Beaulieu, 1991; Larson, 1991a). Transaction logs of such online catalogs also record the user's relevance judgment for each record that is displayed. Using these logs, the researcher is able to determine whether the user found a given record to be relevant or not.
The availability of relevance judgments in transaction logs has opened up new avenues for studying search failures in online library catalogs. Researchers are now able to study not only zero-hit searches, but also failed searches that retrieve nonrelevant records. Obviously, the rendering of relevance judgments makes it easier to identify precision failures, but there still needs to be some kind of mechanism to identify recall failures.
What constitutes a search failure when the relevance judgment for each retrieved document is recorded in the transaction log? Walker, S. and Jones (1987) introduced yet another practical definition of search failure during the analysis and evaluation of an experimental online catalog (Okapi) where they recorded users' relevance judgments in transaction logs. They considered a search query as a failure "if no relevant record appears in the first ten which are displayed" (Walker, S. & Jones, 1987; see also Jones, 1986). This definition of search failure is quite different from one based on precision and recall. It is dichotomous, and it assumes that users will scan at least ten records before quitting. This assumption might be true for some searches and for some users, but not for all searches and users. It also downplays the importance of search failures. Searches retrieving at least one relevant record in ten are considered "successful" even though the precision rate for such searches is quite low (10%).
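In code form, the Okapi-style rule is a simple dichotomous test over the ordered relevance judgments recorded in the log (the list of judgments below is hypothetical):

```python
def failed_first_ten(judgments):
    """True if none of the first ten displayed records was judged relevant."""
    return not any(judgments[:10])

print(failed_first_ten([False] * 9 + [True]))   # False: one relevant record in the first ten
print(failed_first_ten([False] * 12))           # True: no relevant record among those displayed
```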
Although transaction monitoring offers unprecedented opportunities to study search failures in document retrieval systems and provides "highly detailed information about how users actually interact with an online system, . . . it cannot reveal their intentions or whether they are satisfied with the results" (Larson, 1991b, p.198).
Some of the shortcomings of transaction monitoring in studying search failures are as follows.
First, it is not clear what constitutes a "search failure" in transaction logs. As mentioned earlier, defining all zero-hit searches as search failures has some serious flaws.
Second, transaction logs have very little to offer when studying recall failures in document retrieval systems. Recall failures can only be determined by using other methods, such as analysis of search statements, indexing records, and retrieved documents. Relevant documents that were not retrieved in the first place can also be found by performing successive searches of the database.
Third, transaction logs can document search failure occurrences, but they cannot explain why a particular failure occurred. Search failures in online catalogs occur for a variety of reasons, including simple typographical errors, mismatches between users' search terms and the vocabulary used in the catalog, collection failures (i.e., requested item is not in the system), user interface problems, and the way search and retrieval algorithms function. Further information is needed about users' needs and intentions in order to find out why a particular search failed.
Finally, since the users usually remain anonymous in transaction logs, analysis of these logs "prevents correlation of results with user characteristics" (Seymour, 1991, p.97).
3.2.4 Analysis of Search Failures Utilizing the Critical Incident Technique
Based on their empirical investigation of tools, techniques, and methods for the evaluation of online catalogs, Hancock-Beaulieu et al. (1991, p.532) found that "transaction logs can only be used as an effective evaluative method with the support of other means of eliciting information from users." One of the techniques to elicit information from users about their needs and intentions is known as the "critical incident technique." Data gathered through this technique, which is briefly discussed below, facilitates the study of search failures in document retrieval systems. When it is used in conjunction with the analysis of transaction log data, the critical incident technique permits search failures to be correlated with user characteristics.
The critical incident technique was first used during World War II to analyze the reasons that pilot candidates failed to learn to fly. Since then, this technique has been widely used, not only in aviation, but also in defining the critical requirements of and measuring typical performance in the health professions. Flanagan (1954, p.327) describes it as follows:
The critical incident technique consists of a set of procedures for collecting direct observations of human behavior in such a way as to facilitate their potential usefulness in solving practical problems and developing broad psychological principles. The critical incident technique outlines procedures for collecting observed incidents having special significance and meeting systematically defined criteria.
By an incident is meant any observable human activity that is sufficiently complete in itself to permit inferences and predictions to be made about the person performing the act.
The major advantage of this technique is that it obtains "a record of specific behaviors from those in the best position to make the necessary observations and evaluations" (Flanagan, 1954, p.355). In other words, it is observed behavior that counts in the critical incident technique, not opinions, hunches, and estimates.
The critical incident technique consists of two steps: (1) collecting and classifying detailed incident reports, and (2) making inferences that are based on the observed incidents. Wilson et al. (1989, p.2) summarize these two steps as follows:
The collection and careful analysis of a sufficient number of detailed reports of such observations of effective and ineffective behaviors results in comprehensive definition of the behaviors that are required for success in the activity in question under a wide range of conditions. These organized lists of critical requirements (generally termed performance `taxonomies') can then be used for a variety of practical purposes such as the evaluation of performance, the selection of individuals with the greatest likelihood of success in the activity, or the development of training programs or other aids to increase the effectiveness of individuals.
The critical incident technique can also be used to gather data "on observations previously made which are reported from memory." Flanagan (1954) claims that collecting data about incidents which happened in the recent past is usually satisfactory. However, the accuracy of reporting depends on what the incident reports contain: the more detailed and precise the incident reports are, the more accurate, it is assumed, is the information they contain.
Recently, the critical incident technique has been used to assess "the effectiveness of the retrieval and use of biomedical information by health professionals" (Wilson et al., 1989, p.2). In the same study, researchers have used this technique to analyze and evaluate search failures in MEDLINE. Using a structured interview process that included administering a questionnaire, they asked users to comment on the effectiveness of online searches that they performed on the MEDLINE database. Each report obtained through structured interviews was called an "incident report." Researchers matched these incident reports against MEDLINE transaction log records corresponding to each search in order to find out the actual reasons for search success or failure. These incident reports provided much sought after information about user needs and intentions, and they put each transaction log record in context by linking search data to the searcher.
Although the critical incident technique enables the researcher to gather information about user needs and intentions so that he or she can better explain the causes of search failures, it also has some shortcomings. Information gathered through the critical incident technique has to be corroborated with transaction log data. The verification of user satisfaction or dissatisfaction via transaction log data may provide further clues as to why searches succeed or fail. However, the researcher may not be able to confirm each and every user's account of his or her search from the transaction logs. As the users are generally not identified in the transaction logs, it is sometimes difficult to find the search in question in the logs.
There are a variety of reasons for this problem. First, the user's permission has to be sought in advance in order to examine his or her search(es) in the transaction logs. Second, users may not be able to recall the details of their searches after the fact. Third, the logs may not contain enough data about the search: the items displayed and users' relevance judgments are not recorded in most transaction logs.
The lack of sufficient data in transaction logs also limits the effectiveness of the critical incident technique. The researcher has to rely a great deal on what the user says about the search. For instance, if the items displayed by the user, along with relevance judgments, are not recorded in the transaction logs, the researcher will not be able to calculate the precision ratio. Furthermore, the critical incident technique per se does not tell us much about the documents that the user may have missed during the search: we still have to find out about recall failures using other methods.
3.2.5 Summary
This section discussed various methods of analyzing search failures in document retrieval systems. It emphasized that the issue of search failure is complex and demonstrated that no single method of analysis is sufficient by itself to characterize all the causes of search failures. The next section will review the findings of major studies in this area.
3.3 Review of Studies Analyzing Search Failures
Numerous studies have shown that users experience a variety of problems when they search document retrieval systems and they often fail to retrieve relevant documents. The problems users frequently encounter when searching, especially in online catalogs, are well documented in the literature (Alzofon & Van Pulis, 1984; Bates, 1986; Blazek & Bilal, 1988; Borgman, 1986; Cochrane & Markey, 1983; Gouke & Pease, 1982; Hartley, 1988; Henty, 1986; Hildreth, 1982, 1985, 1989; Janosky et al., 1986; Kaske, 1983; Kern-Simirenko, 1983; Kinsella & Bryant, 1987; Larson, 1986, 1991c; Lawrence et al., 1984; Markey, 1980, 1984, 1985, 1986; Matthews, 1982; Matthews et al., 1983; Mitev et al., 1985; Nielsen, 1986; Wang, 1985). However, few researchers have studied search failures directly (Cleverdon, 1962; Cleverdon et al., 1966; Cleverdon & Keen, 1966; Lancaster, 1968, 1969; Dickson, 1984; Blair & Maron, 1985; Jones, 1986; Markey & Demeyer, 1986; Walker, S. & Jones, 1987; Wilson et al., 1989; Klugman, 1989; Peters, 1989; Ankeny, 1991; Hunter, 1991; Walker, C.J. et al., 1991; Zink, 1991; Cherry, 1992). What follows is a brief overview of major studies of search failures in document retrieval systems. Not surprisingly, the results of these studies are not directly comparable because they use different definitions and methods of analysis.
3.3.1 Studies Utilizing Precision and Recall Measures
Several major studies have employed precision and recall measures to analyze search failures.
3.3.1.1 The Cranfield Studies
Cyril Cleverdon, who was Librarian of the College of Aeronautics at Cranfield, England, and his colleagues conducted a series of studies in the late 1950s and early 1960s to investigate the performance of indexing systems (Cleverdon, 1962; Cleverdon et al., 1966; Cleverdon & Keen, 1966). They also studied the causes of search failures in document retrieval systems. Findings pertaining to search failures are reviewed here.
In the first study (Cranfield I), Cleverdon (1962, p.1) compared the retrieval effectiveness of four indexing systems: the Universal Decimal Classification, an alphabetical subject index, a special facet classification, and the Uniterm system of co-ordinate indexing. Some 18,000 research reports and periodical articles in the field of aeronautics were indexed using these four indexing systems, and 1,200 queries were used in the tests.
The main purpose of the Cranfield I experiment was to test the ability of each indexing system to retrieve the "source document" upon which each query was based. Researchers knew beforehand that "there was at least one document which would be relevant to each question" (pp.8-9). The recall ratio was calculated based on the retrieval of source documents. However, this recall ratio should be regarded as a type of "constrained" recall since the objective was just to find source documents in the collection. Cranfield I tests have shown that "the general working level of I.R. systems appears to be in the general area of 60%-90% recall and 10%-25% of relevance [i.e., precision]" (pp.8-9).
During the tests, each search was "carried on to the stage where the source document was retrieved or alternatively the searcher was unable to devise any further reasonable search programmes" (p.11). Each query was judged to be a success or failure: a search was a success if the source document was retrieved, a failure if it was not. Swanson (1965, p.5) states: "The decision to measure retrieval success solely in terms of the source document was prompted by an understandable, though unfortunate, desire to determine whether any given document was or was not relevant to the question." Relevant documents other than source documents, which would have been retrieved during the search, were not taken into account.
The success rate for all searches was found to be 78%; source documents were successfully retrieved for most search queries.
Cleverdon's analysis of search failures was based on 329 documents and queries. The total number of search failures was 495. He classified the causes of search failures under four main headings: (1) question, (2) indexing, (3) searching, and (4) system. Each heading included further subdivisions to specify the exact cause(s) of each search failure. For example, questions could be "too detailed," "too general," "misleading" or just plain "incorrect." Likewise, insufficient, incorrect, or careless indexing; insufficient number of entries; and lack of cross references caused further search failures. Included under searching were "lack of understanding," "failure to use all concepts," "failure to search systematically," and "incorrect" or "insufficient searching." The lack of some features in indexing systems, such as synonymity and inability to combine particular concepts, also caused search failures.
The number of failed searches under each subdivision is given in several tables. The reasons for failures in searches carried out by the project staff were as follows: questions, 17%; indexing process, 60%; searching, 17%; and indexing system, 6%. The percentage of failures in searches performed by the technical staff (i.e., the end-users) was somewhat higher for searching (37%).
It appears that well over half of the failures in this study were caused by the indexing process. Cleverdon (1962, p.88) summarizes the results of the analysis of search failures as follows:
The analysis of failures . . . shows most decisively that the failures were, for more than all other reasons together, due to mistakes by the indexers or searchers, and that a third of the failures could have been avoided if the project staff had indexed consistently, as well as they were capable of doing. Put another way, this means that in every hundred documents, the indexers failed to index adequately five documents, the failure usually consisting of the omission of some particular concept.
The second study (Cranfield II) conducted by Cleverdon and his colleagues was an attempt to investigate the performance of indexing systems based on such factors as the exhaustivity of indexing and the level of specificity of the terms in the index language. The test collection consisted of some 1,400 research reports and periodical articles on the subject of aerodynamics and aircraft structures. Some 221 queries (all single-theme queries) were obtained from the authors of selected published papers. However, most tests were based on 42 queries and 200 documents (Cleverdon et al., 1966; Cleverdon & Keen, 1966).
Precision and recall were used to determine the retrieval effectiveness of indexing systems. It is difficult to cite a single performance figure because the Cranfield II experiment involved a number of different index languages with a large number of variables. It was found that there exists an inverse relationship between recall and precision and that "the two factors which appear most likely to affect performance are the level of exhaustivity of indexing and the level of specificity of the terms in the index language" (Cleverdon & Keen, 1966, p.i). As noted in the preface to volume two of the report, a detailed intellectual analysis of the reasons for search failures was not carried out.
3.3.1.2 Lancaster's MEDLARS Studies
The Cranfield projects tested retrieval effectiveness in a laboratory setting, and the size of the test collection was small (1,400 documents). By contrast, Lancaster (1968) studied the retrieval effectiveness of a large biomedical reference retrieval system (MEDLARS, the Medical Literature Analysis and Retrieval System) in operation. The MEDLARS database contained some 700,000 records at that time. Some 300 "real life" queries were obtained from researchers and were used in the tests.
The retrieval effectiveness of the MEDLARS search service was measured using precision and recall. The precision ratio was calculated according to the definition given in Chapter II. However, it would have been extremely difficult to calculate a true recall figure in a file of 700,000 records because this would have meant having the requester examine and judge each and every document in the collection. Lancaster explains how the recall figure was obtained:
We therefore estimated the MEDLARS recall figure on the basis of retrieval performance in relation to a number of documents, judged relevant by the requester, but found by means outside MEDLARS. These documents could be, for example,
1. documents known to the requester at the time of his request,
2. documents found by his local librarian in non-NLM [National Library of Medicine] generated tools,
3. documents found by NLM in non-NLM-generated tools,
4. documents found by some other information center, or
5. documents known by authors of papers referred to by the requester (pp.16, 19, original emphasis).
Relevant documents identified by the requester for each query made up the "recall base" upon which the calculation of the recall figure was based. An example illustrates how recall was calculated. The recall base consists of six documents that are known to the requester to be relevant before the search. Under these circumstances, if "only 4 are retrieved, we can say that the recall ratio for this search is 66%" (pp.19-20).
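In other words, the estimate reduces to a simple ratio over the recall base; the sketch below (with hypothetical document identifiers) reproduces the example just given.

```python
def estimated_recall(retrieved, recall_base):
    """Share of the externally identified relevant documents that the search retrieved."""
    return len(set(retrieved) & set(recall_base)) / len(set(recall_base))

recall_base = {"m1", "m2", "m3", "m4", "m5", "m6"}   # 6 relevant documents found outside MEDLARS
retrieved = {"m1", "m2", "m4", "m6", "x21", "x22"}   # the search retrieved 4 of them
print(round(estimated_recall(retrieved, recall_base), 2))   # 0.67, i.e. roughly 66%
```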
Based on the results of 299 test searches, Lancaster found that the MEDLARS Search Service was operating with an average performance of 58% recall and 50% precision.
Lancaster also studied search failures using precision and recall. He investigated recall failures by finding some relevant documents using sources other than MEDLARS and then checking to see whether those relevant documents had also been retrieved during the experiment. If some relevant documents were missed, this was considered a recall failure and measured quantitatively. Precision failures were easier to detect since users were asked to judge the retrieved documents as being relevant or nonrelevant. If the user decided that some documents were nonrelevant, this was considered to be a precision failure and measured accordingly. However, identifying the causes of precision failures proved to be much more difficult because the user might have judged a document to be nonrelevant due to indexing, searching, or document characteristics as well as the user's background and previous experience with the document.
To date, Lancaster's study is the most detailed account of the causes of search failures that has been attempted. As Lancaster (1969, p.123) points out:
The `hindsight' analysis of a search failure is the most challenging aspect of the evaluation process. It involves, for each `failure,' an examination of the full text of the document; the indexing record for this document (i.e., the index terms assigned . . . ); the request statement; the search formulation upon which the search was conducted; the requester's completed assessment forms, particularly the reasons for articles being judged `of no value'; and any other information supplied by the requester. On the basis of all these records, a decision is made as to the prime cause or causes of the particular failure under review.
Lancaster found that recall failures occurred in 238 out of 302 searches, while precision failures occurred in 278 out of 302 searches. More specifically, some 797 relevant documents were not retrieved. More than 3,000 documents that were retrieved were judged nonrelevant by the requesters. Lancaster's original research report contains statistics about search failures along with detailed explanations of their causes.
Lancaster discovered that almost all of the failures could be attributed to problems with indexing, searching, the index language, and the user-system interface. For instance, the indexing subsystem in his research "contributed to 37% of the recall failures and . . . 13% of the precision failures" (Lancaster, 1969, p.127). The searching subsystem, on the other hand, was "the greatest contributor to all the MEDLARS failures, being at least partly responsible for 35% of the recall failures and 32% of the precision failures" (Lancaster, 1969, p.131).
3.3.1.3 Blair and Maron's Full-Text Retrieval System Study
More recently, Blair and Maron (1985) conducted a retrieval effectiveness test on a full-text document retrieval system. They utilized a database that "consisted of just under 40,000 documents, representing roughly 350,000 pages of hard-copy text, which were to be used in the defense of a large corporate law suit" (pp.290-291). The tests were based on some 51 queries obtained from two lawyers.
Precision and recall were used as performance measures in the Blair and Maron study. The precision ratio was straightforward to calculate (by dividing the total number of relevant documents retrieved by the total number of documents retrieved). Blair and Maron used a different method to calculate the recall ratio. The way they found unretrieved relevant documents (and thus studied recall failures) was as follows. They developed "sample frames consisting of subsets of the unretrieved database" that they believed to be "rich in relevant documents" and took random samples from these subsets. Taking samples from subsets of the database rather than the entire database was more advantageous from the methodological point of view "because, for most queries, the percentage of relevant documents in the database was less than 2 percent, making it almost impossible to have both manageable sample sizes and a high level of confidence in the resulting Recall estimates" (pp.291-293).
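The logic of the estimate can be sketched as follows (the numbers and the relevance "judge" are invented for illustration and are not Blair and Maron's actual data): relevant documents found in the random samples are extrapolated to their subsets to estimate how many relevant documents were missed, and recall is computed from that estimate.

```python
import random

def estimate_recall(relevant_retrieved, unretrieved_subsets, sample_size, is_relevant):
    """Estimate recall by sampling unretrieved subsets and extrapolating.
    unretrieved_subsets: lists of unretrieved document ids believed rich in relevant items.
    is_relevant: function returning True if a sampled document is judged relevant."""
    estimated_missed = 0.0
    for subset in unretrieved_subsets:
        sample = random.sample(subset, min(sample_size, len(subset)))
        rate = sum(is_relevant(d) for d in sample) / len(sample)
        estimated_missed += rate * len(subset)      # extrapolate to the whole subset
    return relevant_retrieved / (relevant_retrieved + estimated_missed)

# Toy example: 40 relevant documents retrieved; one unretrieved subset of 2,000
# documents of which about 8% are (unknown to the searchers) relevant,
# giving an estimated recall of roughly 0.2.
random.seed(0)
subset = [f"u{i}" for i in range(2000)]
hidden_relevant = set(subset[:160])
print(round(estimate_recall(40, [subset], 100, hidden_relevant.__contains__), 2))
```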
The results of Blair and Maron's tests showed that the mean precision ratio was 79% and the mean recall ratio was 20% (p.293).
Blair and Maron found that recall failures occurred much more frequently than one would expect: the system failed to retrieve, on the average, four out of five relevant documents in the database. They showed quite convincingly that high recall failures can result from free-text queries, where the user's terminology and that of the system do not match. They also observed that users involved in their retrieval effectiveness study believed that "they were retrieving 75 percent of the relevant documents when, in fact, they were only retrieving 20 percent" (p.295).
3.3.1.4 Markey and Demeyer's Dewey Decimal Classification Online Project
Markey and Demeyer (1986) studied the Dewey Decimal Classification (DDC) system "as an online searcher's tool for subject access, browsing, and display in an online catalog" (p.1). Two online catalogs were employed in the study: "(1) DOC, or Dewey Online Catalog, in which the DDC had been implemented as an online searcher's tool for subject access, browsing, and display; and (2) SOC, or Subject Online Catalog, in which the DDC had not been implemented" (p.109).
They also conducted online retrieval performance tests using recall and precision measures to reveal problems with online catalogs and to identify their inadequacies. Precision was defined in their study as the proportion of unique relevant items among those retrieved and displayed. This definition of precision differs from the one given in Chapter II in that it takes into account only retrieved and displayed items (instead of all retrieved items) in the calculation of the precision ratio. The researchers made no attempt to have users display and make relevance assessments about all the retrieved items in order to calculate the absolute precision ratio (p.162).
Their estimated recall scores were also based on retrieved and displayed items only, not on all the relevant items in the collection. Understandably, they found it impractical to scan the entire database for every query to find all the relevant items in the collection. They used an estimated recall formula "that combined the relevant items retrieved and displayed in the SOC search for a query and the relevant items retrieved and displayed in the DOC search for the same query" (p.144). In order to find the estimated recall ratio for each search, the number of unique relevant items retrieved and displayed in one catalog was divided by the total number of unique relevant items retrieved and displayed for the same query in both catalogs. No attempt was made to find other potentially relevant items in the database.
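The estimated recall computation for a single query can be sketched as follows (item identifiers are hypothetical): each catalog's score is its share of the pooled set of unique relevant items retrieved and displayed in both catalogs.

```python
def estimated_recall_pair(doc_relevant_displayed, soc_relevant_displayed):
    """Return (DOC recall, SOC recall) relative to the union of unique relevant items."""
    pool = doc_relevant_displayed | soc_relevant_displayed
    return (len(doc_relevant_displayed) / len(pool),
            len(soc_relevant_displayed) / len(pool))

doc_items = {"i1", "i2", "i3"}             # relevant items retrieved and displayed in DOC
soc_items = {"i2", "i3", "i4", "i5"}       # relevant items retrieved and displayed in SOC
print(estimated_recall_pair(doc_items, soc_items))   # (0.6, 0.8)
```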
The estimated recall scores in the study ranged from a low of 44% to a high of 75%. They found that "searches were likely to retrieve and display a large proportion of relevant items that were unique . . . for the same topic in SOC and DOC" even though DOC's estimated recall was lower than that of SOC (p.146). They also asked users if they were satisfied with the search results, and "the majority of patrons expressed satisfaction with the search in the system yielding higher estimated recall" (p.149). The average precision scores ranged from a low of 26% to a high of 65% (p.165, Table 42). Considering that only a fraction of items retrieved in the searches were actually displayed, the authors noted that precision was affected by the order in which retrieved items were displayed. They found precision to be a less reliable criterion with which to measure the performance of an online catalog (p.162).
They asked users which system gave more satisfactory results for their searches and compared users' responses with the precision scores. They concluded that "there was no relationship between patrons' search satisfaction and the precision of their online searches" (p.166; cf. Tessier, 1981).
Markey and Demeyer also analyzed a total of 680 subject searches as part of the DDC Online Project and found that 34 out of 680 subject searches (5%) failed. Two major reasons for subject search failures were identified as follows: (1) the topic was marginal (35%), and (2) the users' vocabulary did not match subject headings (24%) (p.182). Their research report gives a detailed account of the failure analysis of different subject searching options in an online catalog enhanced with a classification system (DDC) (p.182).
Markey and Demeyer apparently did not count "zero retrievals" as search failures. Nor did they include in their analysis partial search failures that retrieved at least some relevant documents. Presumably, this is why the number of search failures they analyzed was relatively low.
3.3.2 Studies Utilizing User Satisfaction Measures
It was noted earlier (Section 3.2.2) that analyzing search failures utilizing user satisfaction measures is extremely complicated. Few researchers have attempted to look at search failures in light of user satisfaction.
Hilchey and Hurych (1985) analyzed 153 online search evaluation forms returned by users in a university library. Almost half of the respondents (47%) found the search results "most relevant." An additional 32% of the respondents graded the results as "half relevant." Only 6% found all search results relevant. In short, 85% of the respondents felt that search results were at least half relevant. The return rate in this study was about 10%. Although the authors claim that the return rate was "unprejudiced in any way," the returned questionnaire forms may have come primarily from satisfied users.
Ankeny (1991, pp.352-354) reviewed the studies reporting user satisfaction in end-user search services such as MEDLINE and BRS/After Dark and also reported the results of two studies he conducted. In the first study, he surveyed 190 end-users and found that 78% of the users located what they wanted in two business databases (DIALOG Business Connection and Dow Jones News/Retrieval). More than 81% of the users rated the services favorably by giving "an overall rating of 4 or 5 on the five-point scale" (p.354).
In the second study, he surveyed some 600 end-users. He used a stricter measure of search success (with a reliability coefficient of .90) in the second study in which a search query was considered as successful when the user: a) was fully satisfied with the search; b) found exactly what was desired; and c) was not dissatisfied in any way. He found that "[o]f the 600 searches in the sample, 233 met all three criteria for complete success and 367 were less than successful, yielding an overall success rate of 38.8 percent" (p.354). Reported reasons for dissatisfaction in 367 "less-than-successful" searches were as follows: system problems; amount, relevancy, or level of the information retrieved; lack of better printed instructions; and lack of more informed and accommodating staff.
Kirby and Miller (1986) analyzed search failures encountered by MEDLINE end-users employing the Colleague search software. In order to find the search successes and failures, end-users compared their search results with the mediated follow-up search results. "Successful" and "incomplete" end-user searches were identified as follows:
`Successful' Colleague searches were those for which the follow-up search added nothing important, as indicated by one of two questionnaire responses: `My search gave satisfactory results, and nothing essential was added by the second search' . . . or `Neither search provided satisfactory results.' Both responses were regarded as `successful' in that the end user was no less successful in meeting the information need than the trained search analyst. `Incomplete' Colleague searches were those which had missed important articles, according to end user questionnaire responses after reviewing the follow-up search results (p.20, original emphasis).
However, end-users were not asked to judge each record retrieved by either search. Rather, "the comparison was based on search terms and combinations recorded on the follow-up search form, and on the number of citations printed in the follow-up search" (p.20).
Kirby and Miller examined 52 searches. Of the 52 searches, 31 were "incomplete." The major cause of search failures (67.7%) was the search strategy. The rest of the search failures were due to system mechanics and database selection (22.6% and 9.7%, respectively).
3.3.3 Studies Utilizing Transaction Logs
Several researchers have used transaction logs to study search failures in online catalogs. Dickson (1984, p.26) studied a sample of "zero-hit" author and title searches using the transaction log of Northwestern University Library's online catalog and analyzed why the searches failed. She found that about 23% of author searches and 37% of title searches retrieved nothing. Misspellings and mistakes in the search formulation were the major causes of zero-hit searches.
Jones (1986) examined transaction logs of the Okapi online catalog and found several unsatisfactory areas in its operation due to, among other things, spelling errors, failures in subject searching, and user-system interface problems. He analyzed some 300 subject searches performed on Okapi and found that 25% of them failed: "Using relevance assessments based on a display of the first ten records, the experimenter decided that 62.4% of searches were almost certainly successful, 13% may have been successful, 4.5% were collection failures and 25% failed absolutely" (pp.7-8).
In a follow-up study, it was found that 17 out of 122 sessions (or 13.9%) failed in Okapi (including two sessions that failed because the collection did not contain relevant items). (Most sessions contained more than one search.) In seven sessions, the users' vocabulary did not match that of the catalog (e.g., "sociology of shopping"). Another four sessions failed because the topics expressed by the users were too specific (e.g., "textile industry input-output tables"). Two searches failed because the search statements did not describe the users' needs (e.g., one user entered his query simply as "sterling" although the interviewer found out he was actually looking for "economics--sterling shares and gold") (Walker, S. & Jones, 1987, pp.117-119).
The most recent Okapi report states that "the proportion of (non-aborted) searches which failed to retrieve any records is very low indeed (3.9% overall)" (Walker, S. & Hancock-Beaulieu, 1991, p.30). The authors of the report claim that the improvement is primarily due to: (1) Okapi's "best match" search, and (2) stemming and automatic cross-referencing (p.31).
Peters (1989) analyzed the transaction logs of a union online catalog (the University of Missouri Information Network) and found that 40% of the searches in that catalog produced zero hits. He classified the causes of search failures under 14 different groups, including typographical and spelling errors (10.9% and 9.9%, respectively) and the search system itself (9.7%). Approximately 40% of the failures were collection failures (i.e., the item sought was not in the database). However, Peters' study was not based on a rigorous analysis of zero-hit searches by re-entering queries to determine the exact causes of failures. Rather, "the analyzers made intelligent guesses . . . of the probable causes" (p.270).
Hunter (1991) analyzed thirteen hours of transaction logs, amounting to some 3,700 searches performed in a large academic library online catalog. She used the same classification schema as Peters (1989) and categorized the causes of search failures under 18 different groups. The overall search failure rate in Hunter's study was 54.2%. The major causes of search failures were identified as the controlled vocabulary in subject searching (29%), the system itself (18%), and typographical errors (15%). However, the study did not explain in detail what sorts of controlled vocabulary failures occurred or what their specific causes were.
C.J. Walker and her colleagues (1991) obtained similar results when they studied the problems encountered by clinical end-users of MEDLINE and GRATEFUL MED. They defined search failure, which they called "unproductive search," as "one that did not retrieve any citations," and they analyzed 172 such searches (p.68). They found that 48% of the search failures occurred because of some flaw in the search strategy. The software in use was responsible for 41% of the search failures. System failures constituted some 11% of all search failures.
Zink (1991) analyzed transaction logs of 6,118 searches that took place on the WolfPAC online catalog at the University of Nevada. He found that:
more than one of every four (27.81 percent or 1,702) failed to retrieve at least one bibliographical record. Subject searches yielded 667 unsuccessful searches, or 39.19 percent of the total number of unsuccessful searches. Author searches resulted in 250 unsuccessful searches (14.69 percent of the total). Searches by all other criteria accounted for 300 unsuccessful searches (17.63 percent of the total) (p.51).
Collection failures (57.60%), misspellings (18%), and placing the first name "improperly" before the last name (15.20%) caused most of the author search failures. Similar failure rates were observed for title searches (collection failures, 61.86%, and misspellings, 14.23%). In 111 unsuccessful title searches (22.89%), searchers appeared to be looking for subject or author information. Sixty-three percent of the subject searches failed because the user-entered subject words were not "legitimate" Library of Congress subject headings. Misspellings and collection failures accounted for 23.24% and 10.64% of all subject search failures, respectively.
Most of the studies summarized above benefited from transaction monitoring to the extent that "zero-hit" searches were identified from transaction logs. Researchers examined the zero-hit searches in order to find out why a particular search query failed to retrieve anything in the database. Unlike Lancaster (1968), they did not attempt to identify the causes of recall and precision failures.
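For illustration, the sketch below shows how zero-hit searches might be extracted from a transaction log and tallied by search type. The log format (a CSV file with search type, query, and hit count columns) and the file name are assumptions made for the example; actual online catalog logs vary widely in structure.

import csv
from collections import Counter

def zero_hit_searches(log_path):
    """Yield (search_type, query) pairs for every logged search with zero hits."""
    with open(log_path, newline="") as logfile:
        for row in csv.DictReader(logfile):
            if int(row["hits"]) == 0:
                yield row["search_type"], row["query"]

# Tally zero-hit searches by search type, as several of the studies above do.
counts = Counter(stype for stype, _ in zero_hit_searches("transaction_log.csv"))
for search_type, n in counts.most_common():
    print(f"{search_type}: {n} zero-hit searches")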
3.3.4 Studies Utilizing the Critical Incident Technique
It was mentioned earlier (Section 3.2.4) that Wilson et al. (1989) studied searching in MEDLINE using the critical incident technique. The researchers first devised a sampling strategy and developed an interview protocol to elicit the desired information from the subjects. They then developed three "frames of reference" to analyze the interview data: "(1) 'Why was the information needed?,' (2) 'How did the information obtained impact the decision-making of the individual who needed the information?,' and (3) 'How did the information obtained impact the outcome of the clinical or other situation that occasioned the search?'" (p.5). After a qualitative analysis of the critical incident reports, the frames of reference were used to create three similar taxonomies.
In the same study, they asked users to explain what they needed the information for and whether they were satisfied with the search outcome. They used incident forms to record the user's account of why a particular search failed or succeeded and, with permission, they tape-recorded the user's comments. They later tried to match these "incident reports" against MEDLINE transaction log records for each search in order to find out the actual reasons for search failures and successes.
They examined some 26 user-designated ineffective incident reports in order to "characterize the nature of the ineffective searches, analyze the relationship between what the user said and what the transaction log said happened during the search, and ascertain, by performing an analogous MEDLINE search, whether a search could have been performed which would have met the user's objective" (p.81). Most ineffective searches (23 out of 26) were identified as such because the users "could not find what they were looking for and/or could not find relevant materials." An appendix summarizing the analysis of each ineffective search accompanied their research report.
After extensive examination of interview transcripts and transaction logs for ineffective searches, the researchers concluded that users did not appear to comprehend:
1. How to do subject searching.
2. How MeSH [Medical Subject Headings] works.
3. How they can apply that understanding to map their search requests into a vocabulary that is likely to retrieve considerably more relevant materials (pp.83-84).
It appears that the critical incident technique can also be used successfully in the analysis of search failures in online catalogs. Matching incident reports against transaction logs is especially promising. Since the analyst can, through incident reports, gather contextual data for each search query, more informed relevance judgments can be made. Furthermore, this technique can also be used to compare user-designated search effectiveness with that obtained through traditional retrieval effectiveness measures.
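As a rough illustration of how such matching might proceed, the Python sketch below pairs incident reports with transaction log records from the same user within a fixed time window. The record layouts, the matching rule, and the sample data are assumptions made for the example; they do not reproduce the procedure used by Wilson et al.

from datetime import datetime, timedelta

def match_incidents(incidents, log_records, window=timedelta(hours=1)):
    """Pair each incident report with log records from the same user close in time."""
    matches = {}
    for inc in incidents:
        matches[inc["incident_id"]] = [
            rec for rec in log_records
            if rec["user_id"] == inc["user_id"]
            and abs(rec["timestamp"] - inc["reported_time"]) <= window
        ]
    return matches

# Hypothetical incident report and transaction log entry.
incidents = [{"incident_id": 1, "user_id": "u42",
              "reported_time": datetime(1993, 5, 4, 14, 30),
              "outcome": "ineffective"}]
log_records = [{"user_id": "u42", "timestamp": datetime(1993, 5, 4, 14, 47),
                "query": "sterling", "hits": 0}]

print(match_incidents(incidents, log_records))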
3.3.5 Other Search Failure Studies
Some experimental studies looked into strict matching failures that occurred when users attempted catalog searches.
Gouke and Pease (1982) analyzed users' success rates in matching titles and found that the success rate for "nonproblem" titles was 82%, whereas the rate for "problem" titles was 48%. Almost half of the users failed to match simple titles in the online catalog for various reasons (e.g., titles appearing as subject, hyphenated words, words on the stoplist, foreign titles, and abbreviations) (p.139).
Alzofon and Van Pulis (1984) surveyed 430 users of the LCS online catalog of the Ohio State University Libraries to identify the patterns of searching. They also studied the success rates for known-item and subject searches. They replicated the users' searches on the catalog and found that the author-title search had a success rate of 85% compared with 77% for author searches and 68% for subject searches (p.113).
Janosky et al. (1986) studied the errors that users made in performing searches in the LCS online catalog of the Ohio State University Libraries. They recruited 30 volunteer students who had no prior experience with the online catalog under investigation. Each student searched four queries in the catalog (the queries were the same for all students): one subject search and three known-item searches. The authors summarize the procedure and results as follows:
They [users] were asked to search until they either found the item(s) in question or believed that the item(s) was not present in the library system. They were told that it was possible that the item in question was not contained in the library. While searching, subjects were asked to think aloud. . . . A success rate was computed for each search. Since all search items were actually in the library system (subjects were not told this fact), `success' is defined as correctly locating the information requested about an item. . . . For the four searches, the success rate ranged from a high of 58% to a low of 0% (p.576).
It appears that users experienced serious problems with the mechanical aspects of searching in this catalog, which in turn lowered the success rate considerably. For instance, "HELP-AUTHOR" was the "correct" help command, and users who entered "HELP AUTHOR" failed to get any help about author searches (note the hyphen between the two words). On-screen and offline instructions advising users to type in commands "exactly as listed" did not seem to help users recover from such search failures. A more forgiving user interface could easily have prevented similar failures from occurring in the first place. The authors concluded: "It is not sufficient to simply tell users that they have made an error. Failure to deal with the causes of an error often snowballed into a whole string of misinterpretations, resulting in complete failures to solve the problem of using LCS" (p.591).
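A minimal sketch of such a forgiving interface is given below: user input is normalized for case, spacing, and hyphenation before being compared with the list of known commands, so that "HELP AUTHOR" resolves to "HELP-AUTHOR." The command list is illustrative and does not reproduce LCS itself.

# Normalize near-miss command forms before matching them against known commands.
KNOWN_COMMANDS = {"HELP-AUTHOR", "HELP-TITLE", "HELP-SUBJECT"}

def normalize(command):
    """Collapse case, spaces, and hyphens so near-miss forms compare equal."""
    return "".join(ch for ch in command.upper() if ch.isalnum())

NORMALIZED = {normalize(c): c for c in KNOWN_COMMANDS}

def resolve_command(user_input):
    """Return the canonical command for a user's near-miss input, if any."""
    return NORMALIZED.get(normalize(user_input))

print(resolve_command("help author"))   # HELP-AUTHOR
print(resolve_command("HELP-AUTHOR"))   # HELP-AUTHOR
print(resolve_command("HLEP AUTHOR"))   # None -- spelling errors need other techniques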
Cherry (1992) studied some 100 search sessions in the University of Toronto Libraries' online catalog (FELIX). Among other things, she analyzed a small number (42) of zero-hit subject searches "in an effort to identify conversions that would improve recall" (p.97). Each zero-hit subject search was re-entered as a title, keyword title, or keyword subject search (among other search types) to see whether it would retrieve any documents. She found that:
keyword subject, keyword title, or title searches using the original query from the user's zero-hit subject search were as fruitful or more fruitful than new searches constructed from cross-references provided by LCSH. Thus, it is suggested that educating users in the use of LCSH or providing OPAC [online public access catalog] software to automatically provide LCSH cross-references will not solve the problems with the majority of zero-hit subject searches (p.99).
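Cherry's conversion procedure can be sketched as follows: when a subject search retrieves nothing, the same query string is re-entered under other search types and any resulting hits are reported. The catalog_search function below is a stub standing in for an actual online catalog interface, and the sample data are hypothetical.

FALLBACK_TYPES = ["title", "keyword title", "keyword subject"]

def catalog_search(query, search_type):
    """Stub standing in for an OPAC search call; returns a list of record IDs."""
    sample_index = {("keyword title", "solar energy"): ["rec-101", "rec-205"]}
    return sample_index.get((search_type, query), [])

def convert_zero_hit_subject_search(query):
    """Re-enter a failed subject query under other search types, reporting any hits."""
    results = {}
    for search_type in FALLBACK_TYPES:
        hits = catalog_search(query, search_type)
        if hits:
            results[search_type] = hits
    return results

print(convert_zero_hit_subject_search("solar energy"))
# {'keyword title': ['rec-101', 'rec-205']}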
Seaman (1992) examined interlibrary loan borrowing requests made by users for items that were already listed in the Ohio State University online catalog. Approximately 9% of the requests were for such items. The author reasoned that each interlibrary loan borrowing request for a known item already in the online catalog represents "either a failure of the user to search the system correctly or a failure of the catalog to retrieve the required record" (p.113). He took a sample of 226 interlibrary loan borrowing requests and identified user errors (such as spelling errors and incorrect author or title) and catalog errors (such as punctuation or corporate word order). Approximately half of the failures in the sample were due to user errors, while catalog failures accounted for the other half.
3.3.6 Related Studies
A few studies that were not directly concerned with the causes of search failures but nevertheless addressed relevant issues are summarized below.
Hildreth (1989) considers the "vocabulary" problem the major retrieval problem in today's online catalogs and asserts that "no other issue is as central to retrieval performance and user satisfaction" (p.69). This may be because controlled vocabularies are far more complicated than users can easily grasp in a short period of time. Several researchers have found that the lack of knowledge concerning the Library of Congress Subject Headings (LCSH) is one of the most important reasons why searches fail in online catalogs (Bates, 1986; Borgman, 1986; Byrne & Micco, 1988; Dale, 1989; Frost, 1987a, 1987b, 1989; Frost & Dede, 1988; Gerhan, 1989; Holley, 1989; Kaske, 1988a, 1988b; Kaske & Sanders, 1980; Lawrence, 1985; Lewis, 1987; Markey, 1983, 1984, 1985, 1986, 1988; Mischo, 1981; Svenonius, 1986; Svenonius & Schmierer, 1977; Wang, 1985). Larson (1991c, p.181) found that almost half of all subject searches in the MELVYL® online catalog retrieved nothing. More recently, Larson (1991b) analyzed the use of MELVYL over a longer period of time (six years) and found a significant positive correlation between the failure rate, defined as the proportion of search queries that retrieved nothing, and the percentage of subject searching (p.208). This result confirms the findings of an earlier formal analysis of factors contributing to success and satisfaction: "problems with subject searching were the most important deterrents to user satisfaction" (University of California Users Look at MELVYL, 1983, p.97).
Larson (1991a, 1991c) reviewed the literature on subject search failures in online catalogs along with remedies offered to reduce subject search problems. Subject retrieval failures in online catalogs could be reduced in a number of ways, including assigning more subject headings to bibliographic records, providing keyword searching, and enhancing classification retrieval.
Carlyle (1989) studied the match between users' vocabulary and LCSH using transaction logs and found that "single LCSH headings match user expressions exactly about 47% of the time" (p.37). A study conducted by Van Pulis and Ludy (1988) showed that 53% of the users' terms matched subject headings in the online catalog (pp.528-529). Vizine-Goetz and Markey Drabenstott (1991) extracted queries from transaction logs of three online catalogs (SULIRS, ORION, and LS/2000) and analyzed them "both by computer and manually to determine the extent to which they matched subject headings" (p.157). They found that less than half of the subject query terms exactly matched the Library of Congress subject headings. The findings suggest that some search failures can be attributed to controlled vocabularies in online catalogs. However, as the authors note, "such analyses . . . reveal little about whether matching terms satisfactorily represent users' topics of interest" (p.161).
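The kind of exact-match analysis reported in these studies can be sketched as follows. The heading list and query samples below are illustrative stand-ins for the full LCSH file and the real transaction log queries used by the cited researchers.

# Compute the share of user subject queries that exactly match an authorized heading.
LCSH_HEADINGS = {"solar energy", "textile industry", "shopping centers"}

def exact_match_rate(queries, headings):
    """Return the proportion of queries that exactly match an authorized heading."""
    normalized = {h.lower() for h in headings}
    matched = sum(1 for q in queries if q.lower().strip() in normalized)
    return matched / len(queries) if queries else 0.0

user_queries = ["Solar energy", "sociology of shopping", "textile industry", "gold shares"]
print(f"{exact_match_rate(user_queries, LCSH_HEADINGS):.0%} of queries matched exactly")
# 50% of queries matched exactly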
3.4 Conclusion
There is no agreed-upon definition of what constitutes search failure in document retrieval systems. In part, this is due to the multiplicity of data gathering tools and techniques used in the analysis of search failures (e.g., the critical incident technique, controlled experiments, interviews, questionnaires, talk-aloud techniques, and transaction monitoring). Different data gathering methods have different strengths and weaknesses.
Many of the studies reviewed in this chapter examined search failures based on zero retrievals in online catalogs. Partial search failures have been studied much less frequently. Experiments that investigate the relationship between search failures and user needs or characteristics are even scarcer. This is not surprising because identifying zero retrievals from transaction logs is relatively easy and inexpensive. By contrast, analyzing search failures using precision and recall measures is more expensive and time-consuming. So is the investigation of user needs and interests, which could help researchers make more informed judgments about search failures identified through other means. No single method or technique is sufficient by itself to analyze all search failures in document retrieval systems and to interpret the findings.
As for the causes of search failures, transaction logs of searches that retrieved no records in online catalogs reveal that users have numerous mechanical problems, such as improperly keying commands and misspelling words. Such problems can be alleviated to a certain extent by designing more intuitive user interfaces that not only take into account user expertise and task complexity but also give advice and simplify the user's task (Buckland & Florian, 1991). Newer online catalogs are dealing with these problems by incorporating more sophisticated stemming algorithms and Soundex-type techniques to correct misspellings.
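As an illustration of the latter approach, the sketch below implements a simplified version of the classic Soundex code, under which many common misspellings map to the same four-character key as the intended word. It is a textbook simplification for illustration, not the algorithm of any particular online catalog.

def soundex(word):
    """Return a simplified Soundex code: the first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word or not word[0].isalpha():
        return ""
    first_letter = word[0].upper()
    digits = []
    previous = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != previous:
            digits.append(code)
        if ch not in "hw":              # 'h' and 'w' do not separate equal codes
            previous = code
    return (first_letter + "".join(digits) + "000")[:4]

# A common misspelling maps to the same code as the intended word,
# so the catalog can suggest (or silently search) the correct form.
print(soundex("management"), soundex("managment"))   # M525 M525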
Transaction log analysis also reveals that users' lack of knowledge of controlled vocabularies and query languages causes many search failures and, in turn, user frustration. Most users are not aware of the role of controlled vocabularies in document retrieval systems, and they do not seem to understand the structure of rigid indexing and query languages. Consequently, their search query terms, expressed in their own words, often fail to match the titles and subject headings of the documents, causing search failures. "Brittle" query languages based on Boolean logic tend to exacerbate this situation, especially for complicated search queries.
Transaction monitoring is the most appropriate technique for studying search failures when their causes are obvious (e.g., zero retrievals due to misspellings or collection failures). However, transaction monitoring seems to be less efficient in dealing with more complicated failures. Partial failures, for example, are best studied with the help of the user. After all, the user is the key person in the analysis of search failures: it is the user who can explain what he or she was trying to do and whether it was successful. Such input puts each search into perspective and provides much needed contextual information. However, users are not identified in most transaction log studies. Without user feedback, researchers are faced with the unenviable task of coming up with a rational explanation as to why a particular search failed.
Notwithstanding the circumstantial evidence gathered through various online catalog studies in the past, studies examining the match between users' vocabulary and that of online document retrieval systems are scarce. Moreover, the probable effects of mismatching on search failures are yet to be fully explored.
Users prefer to express their information needs in natural language, but most contemporary online catalogs cannot accommodate search requests submitted in natural language form. It is believed, however, that natural language query interfaces may reduce search failures in document retrieval systems, since natural language search terms are more likely to match the titles of the documents in the database. Consequently, the role of natural language interfaces in reducing search failures in document retrieval systems needs to be thoroughly studied.
User input should be sought when analyzing search failures with retrieval effectiveness measures such as precision and recall. The same can be said for failure analysis studies based on user satisfaction measures. We should strive for full-scale user involvement in every stage of the analysis of search failures. Even with user participation in the evaluation process, search failures in document retrieval systems are unlikely to be eliminated altogether. However, only through user participation will we find the real causes of search failures and, consequently, build better document retrieval systems.