CHAPTER V
THE EXPERIMENT
5.0 Introduction
The theoretical foundations of the present study were presented in the preceding three chapters. Chapter II gave an overview of document retrieval systems. Chapter III examined the methods used in the study of search failures and reviewed the major works in the field. Chapter IV presented a conceptual model of search failures in online catalogs. This chapter gives a detailed description of the experiment conducted for the present study.
5.1 The Experiment
The purposes of this study are, among others, to analyze search failures in an experimental online catalog with advanced information retrieval capabilities; to measure the retrieval performance in terms of precision and recall ratios and user designated retrieval effectiveness; and to develop a conceptual model to categorize search failures that occur in online catalogs. An experiment was conducted in order to test the hypotheses and to address the research questions presented in Chapter I.
The hypotheses were as follows:
1. Users' assessments of retrieval effectiveness may differ from retrieval performance as measured by precision and recall;
2. Increasing the match between users' vocabulary and the system's vocabulary (e.g., titles and subject headings assigned to documents) will help reduce search failures and improve retrieval effectiveness in online catalogs;
3. The relevance feedback process will reduce search failures and enhance retrieval effectiveness in online catalogs.
Data on the use of the catalog was gathered from September to December 1991. The data collected during this experiment included users' actual search queries submitted to the catalog, the records retrieved and displayed to the users, users' relevance judgments for each record displayed, and the records retrieved and displayed after the relevance feedback process. Further data was collected from the users, by means of the critical incident technique, about their information needs and intentions when they performed their searches in the online catalog. This data was then analyzed in order to find out the retrieval effectiveness attained in the experimental online catalog. Search failures were documented and their causes investigated in detail.
5.2 The Experimental Environment
5.2.1 The System
The research was carried out at the School of Library and Information Studies, University of California at Berkeley on the CHESHIRE system. CHESHIRE (California Hybrid Extended SMART for Hypertext and Information Retrieval Experimentation) is an experimental online library catalog system "designed to accommodate information retrieval techniques that go beyond simple keyword matching and Boolean retrieval to incorporate methods derived from information retrieval research and hypertext experiments" (Larson, 1989, p.130). It uses a modified version of Salton's SMART system for indexing and retrieval purposes (Salton, 1971b; Buckley, 1987) and currently runs on a DECStation 5000/240 with about one gigabyte of disk space and 64 megabytes of memory. Larson (1989) provides more detailed information about CHESHIRE.
CHESHIRE accommodates queries in natural language form. It currently supports subject searching only. The user describes his or her information need or interest(s) using words taken from natural language and submits this statement to the system. This statement is then "parsed" and analyzed to create a vector representation of the search query. The query is submitted to the system for the retrieval of the classification clusters from the collection that best match the user's query. Each cluster record contains the most common title keywords, subject headings and the normalized classification number for the records represented in that cluster. Upon the user's selection of one or more clusters, the query is further enriched with the terms that appear in the relevant clusters before it is submitted to the system for the retrieval of individual documents from the database.
The classification clustering technique that Larson developed and implemented in CHESHIRE is used for query expansion (Larson, 1989, 1991a). The technique is briefly described in Chapter II (section 2.7.1). The method used to retrieve and rank classification clusters is based on probabilistic retrieval models (Larson, 1992). What follows is a more detailed overview of the use of the "classification clustering method" in CHESHIRE.
Fig. 5.1 illustrates the classification clustering procedure diagrammatically.
Larson (1991a, 1992) provides a more formal presentation of the classification clustering method he developed. He (1992) states:
[the method] involves merging the topical descriptive elements (title keywords and subject headings) for all MARC records in a given Library of Congress classification. The individual records are clustered based on a normalized version of their class number, and each such classification cluster is treated as a single `document' with the combined access points of all the individual documents in the cluster.... The clusters can be characterized as an automatically generated pseudothesaurus, where the terms from titles and subject headings provide a lead-in vocabulary to the concept, or topic, represented by the classification number (p.39).
The classification clustering method improves retrieval effectiveness during the document retrieval process as follows:
Suppose that a collection of documents has already been clustered using a particular classification clustering algorithm. Suppose further that a user has come to the document retrieval system and issued a specific search query (e.g., "intelligence in dolphins"). First, a retrieval function within the system analyzes the query, eliminates the useless words (using a stop list), processes the query using the stemming and indexing routines, and weights the terms in the query to produce a vector representation of the query. Second, the system compares the query representation with each and every document cluster representation in order "to retrieve and rank the cluster records by their probabilistic 'score' based on the term weights stored in the inverted file. . . . The ranked clusters are then displayed to the user in the form of a textual description of the classification area (derived from the LC classification summary schedule) along with several of the most frequently assigned subject headings within the cluster" (Larson, 1991a, p.158).
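To make this first stage concrete, the following sketch (in Python, and purely illustrative rather than CHESHIRE's actual code) shows a simplified version of the steps just described: stop-word removal, stemming, term weighting, and ranking of cluster records against the query vector. The stop list, the crude stemmer, the idf values, and the cluster names are hypothetical stand-ins for the SMART routines and the real cluster file.

    from collections import Counter

    STOP_WORDS = {"in", "of", "the", "a", "and"}

    def stem(word):
        # Crude stand-in for a real stemming algorithm (e.g., Porter).
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def query_vector(text, idf):
        # Parse a natural-language query into a weighted term vector.
        terms = [stem(w) for w in text.lower().split() if w not in STOP_WORDS]
        tf = Counter(terms)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def rank_clusters(qvec, clusters):
        # Score each cluster record (a term -> weight dict) against the query.
        def score(cvec):
            return sum(w * cvec.get(t, 0.0) for t, w in qvec.items())
        return sorted(clusters.items(), key=lambda kv: score(kv[1]), reverse=True)

    # Toy example: rank two hypothetical cluster records for the sample query.
    idf = {"dolphin": 2.3, "intelligence": 1.7}
    clusters = {"QL737": {"dolphin": 0.9, "mammal": 0.4},
                "BF431": {"intelligence": 0.8, "test": 0.5}}
    ranked = rank_clusters(query_vector("intelligence in dolphins", idf), clusters)

In CHESHIRE itself the cluster scores come from a probabilistic model rather than this simple dot product, and the ranked clusters are displayed with their LC class descriptions and most frequently assigned subject headings.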
Once the system finds the potentially relevant clusters, the user can judge some of the clusters as being relevant by simply identifying the relevant clusters on the screen and pushing a function key. "After one or more clusters have been selected, the system reformulates the user's query to include class numbers for the selected clusters and retrieves and ranks the individual MARC records based on this expanded query" (Larson, 1991a, p.159).
Larson (1991a) describes how it is that this tentative relevance information for the selected clusters can be utilized for ranking the individual records:
In the second stage of retrieval . . . , we still have no information about the relevance of individual documents, only the tentative relevance information provided by cluster selection. In this search, the class numbers assigned to the selected clusters are added to the other terms used in the first-stage query. The individual documents are ranked in decreasing order of document relevance weight calculated, using both the original query terms and the selected class numbers, and their associated MARC records are retrieved, formatted, and displayed in this rank order . . . In general, documents from the selected classes will tend to be promoted over all others in the ranking. However, a document with very high index term weights that is not from one of the selected classes can appear in the rankings ahead of documents from that class that have fewer terms in common with the query (pp.159-60).
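The second-stage search can be sketched in the same illustrative style (again a simplified stand-in under the same assumptions, not CHESHIRE's actual code): the normalized class numbers of the selected clusters are added to the original query terms before the individual MARC records are ranked.

    def expand_query(qvec, selected_class_numbers, class_weight=1.0):
        # Add the selected clusters' class numbers to the query as extra "terms".
        expanded = dict(qvec)
        for cls in selected_class_numbers:
            expanded[cls] = expanded.get(cls, 0.0) + class_weight
        return expanded

    def rank_records(expanded_qvec, records):
        # records: record id -> dict of term/class-number weights from the MARC data.
        def score(rvec):
            return sum(w * rvec.get(t, 0.0) for t, w in expanded_qvec.items())
        return sorted(records.items(), key=lambda kv: score(kv[1]), reverse=True)

Because the class numbers of the selected clusters carry weight in the expanded query, records from those classes tend to be promoted, while a record outside them can still rank highly if its original query terms are heavily weighted, which is exactly the behavior Larson describes above.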
Although the identification of relevant clusters can properly be considered a type of relevance feedback, we prefer to regard it as a form of system help offered before the user's query is run on the entire database.
Once all of the above re-weighting and ranking processes, which are based on the user's original query and the selection of relevant clusters, are complete, individual records are displayed to the user. This time the user is able to judge each individual record (rather than the cluster) that is retrieved as being relevant or nonrelevant, again by simply pushing the appropriate function key. The user can examine several records, making relevance judgments along the way for each record, until he or she decides that there is no point in continuing to display records as the probability of relevance gets smaller and smaller.
To sum up, the classification clustering method brings similar documents together by checking the class number assigned to each document. It also allows users to improve their search queries by displaying some of the clusters retrieved for the original query. At this point users are given a chance to judge the retrieved clusters as being relevant or nonrelevant to their queries. Users' relevance judgments are then incorporated into the original search queries, thereby making the original queries more precise and shifting them in the "right direction" to increase retrieval effectiveness.
CHESHIRE has a set of both vector space (e.g., cosine matching, term frequency-inverse document frequency (TFIDF) matching) and probabilistic retrieval models available for experimental purposes. Formal presentations of these models can be found elsewhere (e.g., Larson, 1992). In essence, cosine matching measures the similarity between document and query vectors and "ranks the documents in the collection in decreasing order of their similarity to the query." TFIDF matching is similar to cosine matching. However, TFIDF takes the term frequencies into account and attaches "the highest weights to terms that occur frequently in a given document, but relatively infrequently in the collection as a whole, and low weights to terms that occur infrequently in a given document, but are very common throughout the collection" (Larson, 1992, p.37). Probabilistic models (Model 1, Model 2, Model 3), on the other hand, approach the "document retrieval problem" probabilistically and assume that the probability of relevance is a relationship between the searcher and the document, not between the terms used in indexing documents and the terms used in expressing search queries (Maron, 1984).
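For readers unfamiliar with these measures, the following sketch gives textbook definitions of TFIDF weighting and cosine matching; it is illustrative only, and SMART's actual weighting and normalization variants differ in detail.

    import math

    def tfidf(term_freqs, doc_freqs, num_docs):
        # Highest weights go to terms frequent in this document but rare overall.
        return {t: tf * math.log(num_docs / doc_freqs[t])
                for t, tf in term_freqs.items() if doc_freqs.get(t)}

    def cosine(query_vec, doc_vec):
        # Similarity between query and document vectors; documents are ranked
        # in decreasing order of this value.
        dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        qnorm = math.sqrt(sum(w * w for w in query_vec.values()))
        dnorm = math.sqrt(sum(w * w for w in doc_vec.values()))
        return dot / (qnorm * dnorm) if qnorm and dnorm else 0.0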
CHESHIRE also has relevance feedback capabilities to improve retrieval effectiveness. Upon retrieval of documents from the database, the user is asked to judge if the retrieved document is relevant or not. Based on users' relevance judgments on retrieved documents, the original search queries are modified and a new set of, presumably more relevant, documents is retrieved for the same query. Users can repeat the relevance feedback process in CHESHIRE as many times as they want.
Probabilistic retrieval techniques, along with classification clustering and relevance feedback capabilities, have been used for evaluation purposes in this experiment. The feedback weight for an individual query term i was computed according to the following probabilistic relevance feedback formula:
where
freq is the frequency of term i in the entire collection;
relret is the number of relevant documents term i is in;
numrel is the number of relevant documents that are retrieved;
numdoc is the number of documents.
This formula takes into account only the "feedback effect," not the artificial "ranking effect" (i.e., documents retrieved in the first run are not included in the second run) (see Chapter II, section 2.9).
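For illustration, a probabilistic relevance feedback weight of the widely used Robertson-Sparck Jones form can be written in terms of the quantities defined above; the 0.5 smoothing constants are a conventional choice assumed here for the sketch, not necessarily those used in CHESHIRE:

    w_i = \log \frac{(\mathit{relret} + 0.5)\,/\,(\mathit{numrel} - \mathit{relret} + 0.5)}
                    {(\mathit{freq} - \mathit{relret} + 0.5)\,/\,(\mathit{numdoc} - \mathit{freq} - \mathit{numrel} + \mathit{relret} + 0.5)}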
5.2.2 Test Collection
The test collection used for this experiment was that of the bibliographic records of the Library of the School of Library and Information Studies (LSL) of the University of California at Berkeley. LSL has a specialized collection concentrated in library and information sciences, publishing and the book arts, management of libraries and information services, bibliographic organization, censorship and copyright, children's literature, printing and publishing, information policy, information retrieval, systems analysis and automation of libraries, archives and records management, office information systems, and the use of computers in libraries and information services.
The test database for the CHESHIRE system consists of 30,471 MARC records representing the machine-readable holdings of the LSL up to February 1989. Using the test database, Larson (1989, 1991a) created a bibliographic file containing the titles, subject headings and classification numbers from the MARC records. He then generated a cluster file from the bibliographic file using the classification clustering technique. Due to the nature of the LSL's highly specialized collection, more than 80% of the records in the test database fall into LC main class Z. MARC records in the database had some 57,000 Library of Congress subject headings (LCSH) assigned to them, which amounts to about two subject headings per record (Larson, 1991a, p.162). Table 5.1 provides some collection statistics for the test database.
Table 5.1 MARC Test Collection Statistics (Source: Larson 1992, p.40)
                           | Cluster File | Bibliographic File
No. of document vectors    | 8,435        | 33,371
No. of distinct terms      | 33,883       | 33,891
Total term occurrences     | 221,042      | 397,790
Avg. terms per document    | 26.21        | 11.92
Avg. term freq. in vectors | 2.03         | 1.14
Avg. documents per term    | 6.52         | 11.74
Max. documents per term    | 2,754        | 11,999
Avg. documents per cluster | 3.95         | -
Larson (1992) interprets the data in the table as follows:
the bibliographic database generated only 8345 clusters, giving an average of just under four bibliographic records per cluster (the standard deviation was 19.50). The majority of records in the database fall into LC main class Z, and in that class the average is about 4.8 records per cluster with a standard deviation of about 23.29 records. As the large standard deviations would suggest, the distribution of bibliographic records to classification clusters is very uneven, with many clusters (67%) consisting of a single record, and some (1.1%) with more than 40 records. The large number of single record clusters is primarily due to the enumerative nature of the LC classification, where Cutter numbers are used to order items alphabetically within broad classes (e.g., the `By Name A-Z' direction in the schedules). It should be noted also that these single record clusters represent only about 16.9% of the input records. The number of document vectors generated for the test database is larger (33,371) than the number of input MARC records [30,471] due to variant record forms generated for MARC records having more than one class number.
The searchable terms in both of the files consist of keyword stems extracted from titles and subject headings, and the normalized class number. Most common words (e.g., `and,' `of,' `the,' `a,' etc.) are included in a stop list and ignored during indexing and retrieval. All other keywords are reduced to word stems using a stemming algorithm. . . . Terms from subject headings and titles are treated separately and considered to be different terms, even if they are based on the same word.
. . .there were an average of about 12 terms per document (standard deviation 5.92) in the bibliographic file and about 26 terms per cluster (standard deviation 65.21) in the clustered file. . . . These statistics indicate that the classification clustering process is having the desired effect of grouping similar bibliographic records together, but the enumerative nature of the classification scheme prevents some records from clustering (pp.40-41).
5.2.3 Subjects
Forty-five entering master's and continuing doctoral students (hereafter "users") in the School of Library and Information Studies of the University of California at Berkeley voluntarily participated in the experiment. They were not compensated for their participation in the study.
5.2.4 Queries
Users performed a total of 228 catalog searches on CHESHIRE during the fall semester of 1991. The topics of search queries were determined by the users, not by the researcher. Most, if not all, search queries originated from users' real information needs.
The number of queries users searched on CHESHIRE is thought to be appropriate for evaluation purposes, as most information retrieval experiments in the past were conducted with either a comparable or a much smaller number of queries. For instance, some 221 search queries were used in the Cranfield II tests, one of the earliest information retrieval experiments. The search queries were "obtained by asking the authors of selected published papers (`base documents') to reconstruct the questions which originally gave rise to these papers" (Robertson, 1981, p.20). Similarly, 302 genuine search queries were used in the MEDLARS study. Search queries used in the MEDLARS tests originated from the real information needs of the system's users (Lancaster, 1968). More recently, Blair and Maron (1985) used some fifty-one real search queries, obtained from two lawyers, to test the retrieval effectiveness of the STAIRS system. Tague (1981) observes that "the number of queries in information retrieval tests seems to vary from 15 to 300, with values in the range 50 to 100 being most common" (p.82).
5.3 Preparation for the Experiment
The experiment was carried out on the CHESHIRE online catalog. The users' complete interaction with the online catalog was captured in transaction logs. A self-administered questionnaire for each search was filled out by the users. In addition, a post-search structured interview was conducted with the users.
5.3.1 Preparation of Instructions for Users
A two-page handout and a booklet were prepared for instructional purposes. The handout contained background information about CHESHIRE as well as guidelines for CHESHIRE searches (see Appendix A). The booklet demonstrated, with step-by-step instructions, how to get access to CHESHIRE, how to log on, enter a search query, display clusters and bibliographic records, make relevance judgments, and how to perform relevance feedback searches (see Appendix B). Both the handout and booklet were pilot-tested on two users who were unfamiliar with the system.
5.3.2 Preparation of the Data Gathering Tools
A comprehensive analysis of search failures in an online catalog requires the use of a number of data gathering tools, the most important ones being transaction logs, questionnaires and critical incident forms used during the structured interviews to collect critical incident reports about search failures.
Transaction logs were used to record relevant data about the entire session for each search conducted on CHESHIRE. The transaction record for each search consists of the user's password, logon and logoff times and dates (to the nearest second), the full search statement entered by the user, the stemmed roots of search terms and their weights, the cluster and bibliographic records retrieved along with their id numbers and ranks, and the user's relevance judgments on displayed records. Relevance feedback data, if applicable, was also captured in the transaction record. The types of data recorded for each search in the transaction logs are illustrated in Appendix C.
A questionnaire and critical incident report forms were designed to record the user's experience for each search carried out on CHESHIRE. Both the questionnaire and the critical incident report forms were pretested, and suggestions obtained from users (e.g., slight changes in the wording of some questions) were incorporated into the final versions of the forms.
The questionnaire aims to measure, in more precise terms, users' perceived search success for each query submitted to CHESHIRE. It included questions about the type of the user; how long ago the search was performed and whether the user was successful on the first try; if not, what the reason for the search failure was; what percentage of the sources the user found especially useful (precision); whether relevance feedback was performed or not; if so, what its impact on the search results was; and the most helpful and most confusing features of CHESHIRE (see Appendix D for a copy of the questionnaire form).
Two types of critical incident report forms were devised (modified from Wilson et al. (1989)): one for reporting "effective searches" and the other for "ineffective searches" (see Appendices E and F for the effective and ineffective incident report forms, respectively). The critical incident report form aims to gather, for each search query submitted to CHESHIRE, data on the effectiveness or ineffectiveness of the search query, the user's information needs that triggered the search, the types of sources retrieved and whether they were helpful or not, the relevance feedback process, whether CHESHIRE retrieved most of the useful sources or not (recall), and whether the sources retrieved were useful or not (precision). Incident reports also include users' own assessments of the effectiveness of their searches. Note that the critical incident report form was intended to be used as a structured interview form during the interview with the user. (Interviews were audiotaped (with permission) for further analysis.)
The critical incident report form and the questionnaire form consist of similar questions. The questionnaire form was designed to complement the critical incident reports and to corroborate the findings to be obtained from the critical incident reports.
5.3.3 Recruitment of Users to Participate in the Experiment
Potential participants (all entering master's and continuing doctoral students) were invited to take part in the experiment (see Appendices G and H). The guidelines and detailed instructions were sent to doctoral students (see Appendices A and B) along with the invitation letter. A live demo introducing logon and the search procedures in CHESHIRE was offered to the interested doctoral students. Permission to review their transactions was obtained from participating doctoral students.
Entering master's students were handed the invitation letter during their scheduled class times for a course offered in the School of Library and Information Studies, LIS 210: Organization of Information (Fall 1991). This was followed by a 20-minute presentation in which an example search session on CHESHIRE was demonstrated. Students were told that, should they decide to participate, the system would be open to their use throughout the semester. They were encouraged to use the system as often as they desired. The written consent of participating master's students to review their transactions was obtained after the presentation.
After the presentations, the transaction log file was monitored daily to see if any searches had been performed. "Thank you" messages were sent to first time users. Later in the semester, students were reminded periodically that they could continue to perform searches on CHESHIRE.
5.4 Data Gathering
As was indicated earlier, users' full interaction with CHESHIRE (user names, search statements, records displayed, relevance judgments, and so on) was recorded in the transaction log file. Users carried out a total of 228 search queries on CHESHIRE. By the time the data collection period ended (mid-December 1991), more than 200,000 lines of data were gathered through transaction monitoring.
The transaction log file was scanned, using data reduction techniques, to extract information about users, search statements, and the outcomes of their searches. This information proved to be the foundation of the data gathering process through questionnaires and structured interviews.
Participating users were interviewed throughout December 1991 and the spring semester of 1992. Users filled out a questionnaire form for each search they performed (see Appendix D). Afterwards, a structured interview, which was audiotaped, was carried out with users for each search query. If a given search was judged as being "effective" by the user, questions in the Effective Incident Report Form (see Appendix E) were asked. If not, questions in the Ineffective Incident Report Form (see Appendix F) were asked. In addition to audiotaping, users' answers were also recorded on critical incident report forms. More than sixteen hours' worth of user comments were audiotaped. These tapes were later transcribed in order to facilitate the analysis process.
All searches submitted to CHESHIRE during the data gathering period were repeated on MELVYL®, the nine-campus University of California online catalog, using its title and subject keyword options. The results were recorded in script files. In addition, searches that retrieved nothing (zero retrievals) on CHESHIRE were redone just to make sure that that was the case. The results of both MELVYL and CHESHIRE searches were later used to calculate the recall ratio for each query.
The limited resources available for this study prevented an experimental design in which the participants would be divided into a control and experimental group so as to compare the results obtained from each of the two groups. In addition, it was not possible to have more than one evaluator to examine the search results or critical incident report forms.
To sum up, then, four methods were used to gather data for each search query performed on CHESHIRE: 1) The outcome of the full search process (query statement, clusters and records retrieved, relevance judgments, etc.) was recorded in the transaction log file; 2) A questionnaire form was filled out for each search query; 3) A structured interview was conducted, which was both audiotaped and recorded on critical incident report forms; and 4) Search queries submitted to CHESHIRE were repeated on MELVYL and the results were recorded in script files. Table 5.2 summarizes the data types and the methods of data collection and analysis.
Table 5.2 Summary of Data Types, Methods of Data Gathering and Analysis
Data types
  1. Quantitative data from transaction logs, questionnaires, and critical incident report forms
  2. Qualitative data from transaction logs, structured interviews, and searches on MELVYL

Data collection methods
  1. Transaction logs
  2. Questionnaire forms
  3. Critical incident report forms and audiotaped structured interviews
  4. Repetition of searches on MELVYL

Data analysis methods
  1. Statistical analysis of quantitative data
  2. Qualitative analysis of search sessions recorded on transaction logs, audiotapes, and MELVYL script files
  3. Comparison of results from both analyses
5.5 Data Analysis and Evaluation Methodology
A comprehensive quantitative and qualitative analysis and evaluation was carried out on the raw data gathered by means of transaction logs, questionnaires, and critical incident reports.
5.5.1 Quantitative Analysis and Evaluation
5.5.1.1 Analysis of Transaction Logs
The quantitative analysis of transaction logs revealed a wealth of data about the use of the CHESHIRE catalog during the experimental period. For instance, such statistical data as the number of searches conducted, the number of searches that retrieved no records, the number of different users who participated in the experiment, the number of records displayed and judged relevant (i.e., precision), and the average number of terms in search statements were easily computed. Figures obtained from the quantitative analysis of transaction logs were entered into a spreadsheet package for further evaluation.
Searches that retrieved nothing (zero retrievals) as well as searches wherein users selected no clusters as being relevant were identified from the transaction logs. As discussed in Chapter IV, zero retrievals may occur due to, among other things, collection failures, misspellings, and vocabulary mismatch. A search on CHESHIRE may also fail to retrieve any bibliographic records even if the search query terms match the terms in titles and subject headings of the items in the database. The way CHESHIRE works at present is that it first retrieves some classification clusters if there is a match between the query term(s) and titles and subject headings. If there is a match, CHESHIRE displays up to twenty clusters for the user's relevance judgment. The user has to select at least one cluster as relevant in order for CHESHIRE to continue the search and retrieve individual bibliographic records from the database. The implicit assumption here is that if the user finds no cluster to be relevant, then it is highly unlikely that the document collection has any relevant records to offer to the user.
5.5.1.2 Calculating Precision and Recall Ratios
In addition to a comprehensive analysis of search failures and zero retrievals, retrieval effectiveness of the CHESHIRE experimental online catalog was studied using precision and recall measures.
As the user's relevance judgment for each record displayed was recorded in the transaction log file, it was possible to calculate the precision ratio for each search that retrieved some records. If the record scanned was relevant, the user was simply asked to press the "relevant" key. If it was nonrelevant, hitting the carriage return key would display the next record. Thus for each and every record displayed, there was a piece of relevance judgment data attached to it in the transaction log file (see Appendix C).
Note that relevance assessments were based on retrieved references with full bibliographic information including subject headings, not the full text of documents. Relevance judgments were done by the users themselves who submitted search queries to satisfy their real information needs.
The precision value for a given search query was taken as the ratio of the number of documents judged relevant by the user to the total number of records scanned before the user either decided to quit or to do a relevance feedback search. There is a slight difference between the original definition of the precision formula (given in Chapter II) and the one used in this experiment: instead of taking the total number of records retrieved in response to a particular query, we took the total number of records scanned by the user, no matter how many records the system retrieved for that query. For instance, if the user stopped after scanning two records and judged one of them as being relevant, then the precision ratio was 50%. Precision ratios for retrievals during the relevance feedback process were calculated in the same way.
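In code form, the precision measure used here reduces to the following minimal sketch, assuming one relevance judgment is logged per record scanned:

    def precision(judgments):
        # judgments: one True/False per record the user actually scanned.
        return sum(judgments) / len(judgments) if judgments else 0.0

    precision([True, False])   # the two-record example above: 1/2 = 50%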
Precision ratios were calculated from the transaction logs without much difficulty. Calculating the recall ratio for each search query proved to be the most challenging task, as it required finding relevant documents that were not retrieved in the course of the user's initial search (Blair & Maron, 1985). The procedure went as follows:
The approximate `recall base' for each search query performed on CHESHIRE was found by repeating all search queries on both CHESHIRE and the UC online catalog MELVYL. (The database used in CHESHIRE is a subset of the MELVYL database.) This was done for a variety of reasons. First, it was believed that repeating the same searches on CHESHIRE would somewhat facilitate the task, as the researcher is familiar with both the database (i.e., records mainly about Library and Information Science) and the search system (CHESHIRE). Second, CHESHIRE and MELVYL have completely different retrieval rules. The CHESHIRE experimental online catalog utilizes probabilistic retrieval techniques along with a classification clustering mechanism, whereas MELVYL uses the Boolean operators AND, OR, and NOT to retrieve records from the database. It was thought that searching two different systems for the same queries would expand the recall base by retrieving different records.
Although the database used in CHESHIRE is a subset of MELVYL, calculating the recall base for each search query proved to be a formidable task. On MELVYL, it was not possible to restrict the retrievals to the holdings of the Library of the School of Library and Information Studies (LSL) only. Each and every record retrieved by MELVYL was checked to identify the ones located at LSL. The publication date of each retrieved record was also checked as the CHESHIRE database contains records up to the beginning of 1989. Records with publication dates 1989 or later were deleted from the MELVYL retrievals. Unique records retrieved by each system (CHESHIRE and MELVYL) were identified. The total number of unique records retrieved by both systems constituted the `recall base' for a given search.
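A minimal sketch of this filtering and merging step is given below, with hypothetical field names standing in for the manual checks that were actually performed by hand:

    def build_recall_base(cheshire_records, melvyl_records):
        # Keep only MELVYL records held at LSL and published before 1989,
        # then merge with the CHESHIRE retrievals and de-duplicate.
        eligible_melvyl = {r["id"] for r in melvyl_records
                           if r["location"] == "LSL" and r["year"] < 1989}
        cheshire_ids = {r["id"] for r in cheshire_records}
        return cheshire_ids | eligible_melvyl        # the `recall base'

    # Recall for a query = relevant records retrieved by the user divided by
    # relevant records in the recall base (relevance judged by the researcher).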
Next, the clusters and bibliographic records retrieved by the user and judged as being relevant were reviewed. The user's search statement, the questionnaire form and the script of the structured interview belonging to the query were examined in order to determine the user needs and intentions that generated the query. Given the way the user judged the retrieved documents, and given the needs and intentions that generated the search query, the question asked was: "In addition to the records the user selected as being relevant in the original CHESHIRE search, which records would he or she have selected as being relevant had he or she seen all the records retrieved by CHESHIRE, MELVYL, or both?"
In order to answer this question, each record in the recall base was reviewed and judged as being relevant or not relevant. Thus, the total number of relevant records in the recall base for a given query was identified. Relevant records retrieved by the user were then compared with all the relevant records in the recall base. The number of relevant records retrieved by the user was divided by the total number of relevant records in the recall base to find the recall ratio for a given search query. The following example illustrates how the recall base was determined for a given query and how recall ratio was calculated.
The search query "human-computer interaction" (query # 211) was submitted to the CHESHIRE system and two records were displayed. One was marked as being relevant. The precision ratio was computed to be 50% (1/2) from the transaction log. During the interview the user said she was looking for "anything on the topic of human-computer interaction."
Next, we searched the UC online catalog MELVYL under the title keyword and subject keyword indexes (with truncation) using several synonyms. A title keyword search under "human-computer interaction" retrieved two unique items (i.e., retrieved by MELVYL, but not retrieved by CHESHIRE). Similarly, a title keyword search under "user interface" retrieved seven more unique items. A subject search under "human computer interaction" retrieved two more unique items, whereas a subject search under "user interfaces" retrieved no unique items. Title words and subject headings in the retrieved items were examined. It was found that "man-machine interaction" had also been used in relevant records. A title keyword search under "man-machine interaction" retrieved two more unique items. One of these items was cataloged under the general LC subject heading Information storage and retrieval systems. A subject search under "man-machine systems" retrieved seven more unique items. A subject search under "human engineering" retrieved one more unique item. A subject and title keyword search under "interactive computer systems" retrieved three more unique items. All in all, 24 unique items located in the LSL collection were retrieved using MELVYL. So the recall base for this search query was 24. (Searches under "man-machine communication" and "system engineering" retrieved no unique items.) Table 5.3 gives the search query terms used and the types of searches conducted in order to retrieve those 24 unique items:
Table 5.3 Searches Conducted to Find the Records Constituting the Recall Base for Query #211
Search query | Type of Search | N |
human-computer interaction | title keyword | 2 |
user interface | title keyword | 7 |
human-computer interaction | subject keyword | 2 |
user interfaces | subject keyword | 0 |
man-machine interaction | title keyword | 2 |
man-machine systems | subject keyword | 7 |
human engineering | subject keyword | 1 |
interactive computer system | title/subject | 3 |
TOTAL | | 24 |
Note: N represents the number of "unique" records retrieved at each step.
Some of the terms used in the titles of books about human-computer interaction are as follows: "human-computer interaction," "user interface," "user/computer interface," "computer interfaces for user access," "interactive computer systems," "human-computer environment," "man-machine interaction," "machine-human interface," "interactive computer environment," "person-computer interaction," "man-computer dialogue," "human-machine interaction," "patron interface," etc. These items were indexed under the following LC subject headings: Human-computer interaction, User interfaces (Computer systems), Computer interfaces, Man-machine systems, Interactive computer systems, Human engineering, and Information storage and retrieval systems.
After finding the number of records that made up the recall base (24), the percentage of records retrieved by CHESHIRE was calculated. CHESHIRE retrieved 13 out of 24 records that were in the recall base. As indicated earlier, CHESHIRE ranks the retrieved records in the order of their similarity to the search query and presents the top 20 records in the output list to the user. Assuming that all 20 records CHESHIRE retrieved could have been relevant yet only 13 of them actually were, the recall ratio was calculated as 65% (13/20) for this query. The rest of the records retrieved by CHESHIRE were about online communities, computer output microfilm, development and testing of computer-assisted instruction, and so on.
It is worth repeating that the relevance judgments used in calculating recall were made by the researcher, not by the user. Relevance judgments for the recall calculations were based on the analysis of the user's query statement, the records retrieved and judged relevant by the user, and the user's needs and intentions as expressed in the structured interviews and questionnaire forms. The contextual feedback gained from users for each query and the review of retrieved records facilitated, to a certain extent, making relevance judgments for recall calculation purposes. It was assumed that, with all this feedback, objective relevance judgments reflecting actual users' decision-making processes could be made by the researcher. Nevertheless, recall ratios obtained in this study should be taken as approximate, not absolute, figures.
Once precision and recall ratios for queries retrieving some records were calculated, recall/precision graphs were plotted. Precision/recall graphs illustrate the retrieval effectiveness that users attained in CHESHIRE.
Precision and recall values were averaged over all search queries in order to find the average precision/recall ratios for CHESHIRE. The "macro evaluation" method was used to calculate average precision and recall values. This method both provides adequate comparisons for test purposes and meets the need for a user-oriented view of the results (Rocchio, 1971b). It uses the average of the per-query ratios, not the ratio of the totals. (The latter is called "micro evaluation.") For instance, suppose that we have two search queries. The user displays 25 documents and finds 10 of them relevant in the first case. In the second case, the user displays 10 documents and finds only one relevant document. The average precision value for these two queries will be equal to 0.25 using the macro evaluation method (((10/25)+(1/10))/2 = 0.25). (The micro evaluation method, on the other hand, will give the result of 0.31 for the same queries ((10+1)/(25+10) = 0.31).) As Rocchio (1971b) points out, the macro evaluation method is query-oriented while the micro evaluation method is document-oriented. The former "represents an estimate of the worth of the system to the average user" while the latter tends to give undue weight to search queries that have many relevant documents (i.e., document-oriented) (Rocchio, 1971b; cf. Tague, 1981).
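The difference between the two averaging methods is easy to see in a small sketch using the two-query example above:

    def macro_precision(per_query):
        # Average of the per-query ratios (query-oriented).
        return sum(rel / scanned for rel, scanned in per_query) / len(per_query)

    def micro_precision(per_query):
        # Ratio of the totals (document-oriented).
        return sum(r for r, _ in per_query) / sum(s for _, s in per_query)

    queries = [(10, 25), (1, 10)]
    macro_precision(queries)   # (0.40 + 0.10) / 2 = 0.25
    micro_precision(queries)   # 11 / 35 = 0.31 (approximately)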
"Normalized" precision and recall values would have been easier to calculate, as was done in some studies (Salton, 1971). However, normalized recall does not take into account of all relevant documents in the database. Whenever the user stops scanning records, the recall value at that point is assumed as 100% even though there might be more relevant documents in the database for the same query which the user has not yet seen. The recall figures to be obtained through normalized recall may not reflect the actual performance levels. It is believed that more reliable recall values were obtained in this study. For, the comprehensive analysis of transaction logs and other records retrieved through exhaustive searches on CHESHIRE and MELVYL established the basis for the calculation of recall ratios. In addition, review of questionnaire forms and critical incidence reports provided much helpful information about users' information needs and intentions.
Users tend to be more concerned with precision values. They seem to value highly systems that can retrieve some relevant documents from the database that are not too diluted with nonrelevant ones. As long as they are able to find some relevant documents among those retrieved, they may not necessarily think of the fact that the system might be missing some more relevant documents. Recall values, on the other hand, are of greater concern to system designers, indexers and collection developers than to users. Recall failures tend to generate much needed feedback to improve retrieval effectiveness in present document retrieval systems, although they are more difficult and time-consuming to detect and analyze.
5.5.1.3 Analysis of Questionnaire Forms and Critical Incident Report Forms
Questionnaire forms were analyzed to identify the effective and ineffective searches and to tabulate the user-designated reasons for search failures. Most useful and confusing features of the CHESHIRE experimental online catalog were also noted.
The questionnaire form included a question about the search success in terms of precision (Question #5: ". . . what percent of the sources you found were especially useful?"). This was an attempt to quantify users' perception of search success in terms of precision and to compare it with that obtained from the transaction logs.
As indicated earlier, the questionnaire form and the critical incident report forms used during the structured interviews contain similar questions. Some answers from the questionnaire forms were compared with the answers given in the critical incident report forms. For instance, both the questionnaire and the incident report forms included questions designed to determine what the users thought of the effect of the relevance feedback technique on the overall retrieval effectiveness in CHESHIRE. It is difficult to determine the exact role of relevance feedback in improving the retrieval effectiveness in CHESHIRE. Larson (1989, p.133) points out that "experience with the CHESHIRE system has indicated that the ranking mechanism is working quite well, and the top ranked clusters provide the largest numbers of relevant items."
Scripts of structured interviews were also analyzed and compared with results that were obtained from both the questionnaire forms and the transaction logs. The relationship between user-designated retrieval effectiveness and precision/recall measures was studied. The results were compared with the precision/recall ratios found for corresponding search queries recorded in transaction logs. This three-way comparison for some questions (e.g., search effectiveness) enabled us to investigate the causes of search failures more carefully.
5.5.2 Qualitative Analysis and Evaluation
The main objective of this study is to find out the causes of search failures in an experimental online catalog with sophisticated information retrieval capabilities. Therefore a comprehensive qualitative analysis and evaluation of the available data from transaction logs, questionnaires, and structured interview scripts was essential.
A wide variety of strategies were used to identify search failures that occurred in CHESHIRE. First, searches that retrieved no records were easily identified from the transaction logs. Analysis of the causes of zero retrieval searches showed that some searches retrieved nothing due to collection failures and misspellings whereas some others retrieved nothing because they were personal author or known-item searches, which are not supported by CHESHIRE. Yet some others failed to retrieve any records because they were out of domain search queries.
Second, search queries that retrieved some clusters but nevertheless were not pursued by the users to the end were identified. As indicated earlier, the user had to select at least one cluster record as relevant in order for the search query to retrieve bibliographic records from the database. If no cluster records were selected, then the search ended there in failure. Not all such failures were due to collection failures. Some occurred because the cluster records did not seem relevant, while others were abandoned because of user interface problems. False drops and the stemming algorithm were also responsible for some of the cluster failures.
Analysis of search failures that occurred because no clusters were chosen as relevant required some additional work. The cluster records for such searches were not recorded in the transaction log file. These searches were redone on CHESHIRE just to record the cluster records so that the reason why the user selected no clusters could be understood.
Third, ineffective search queries were identified from the critical incident forms. Ineffective search queries were those for which users retrieved some bibliographic records but nevertheless thought that the retrieved records were not satisfactory. Precision and recall ratios for such searches were identified. Search statements, clusters, and bibliographic records were examined from the transaction logs to determine what caused the search query to fail in the user's eyes.
Once search failures were identified, the analysis concentrated on their causes. Again, a wide variety of methods were used: analysis of search queries (broad vs. specific), users' information needs, cluster records, bibliographic records and the subject headings attached to them, false drops, collection failures, and precision and recall ratios, to name but a few. Out-of-domain search queries, where the user entered a search query that could not be answered using the CHESHIRE database, were examined, as were personal author, known-item and call number searches.
Questionnaire forms were examined to determine what the users thought of the system's effectiveness along with user-designated reasons for failures and users' perception of search success.
Finally, the scripts of structured interviews (i.e., incident reports) were studied. The detailed examination of critical incident reports proved useful in understanding users' information needs and intentions better, which facilitated the evaluation of retrieval effectiveness in CHESHIRE. Other observable data about the characteristics of users and search queries were also noted.
Based on the comprehensive analysis presented above, types of failures were recorded and classified along with the cause(s) of each search failure.
5.6 Summary
The overall experiment was summarized in this chapter. Features of the system and the document collection database were explained in detail. Data gathering tools were introduced along with instructional materials that were used to recruit users to participate in the experiment. Finally, quantitative and qualitative data analysis and evaluation methodologies were explicated.