CHAPTER VIII
CONCLUSION
8.0 Summary
The hypothesis of this dissertation was that online catalog users often fail in their attempts to retrieve relevant items from document collections using existing online library catalogs. A conceptual model was developed to examine and categorize search failures that occur in these catalogs. To test the model, an experiment was designed in which we recorded in transaction logs the complete interactions of 45 users performing 228 queries. A questionnaire was administered and participating users were interviewed after the completion of their searches. One of the main objectives of this dissertation has been to analyze search failures comprehensively, not only by employing precision and recall measures but also by identifying user-designated ineffective searches and comparing them with the precision and recall measures for the corresponding queries.
Using a regression model, we tested the hypothesis that (1) users' assessments of retrieval effectiveness differ from retrieval performance as measured by precision and recall ratios, and (2) increasing the match between the users' vocabulary and that of the system by means of clustering and relevance feedback techniques will improve retrieval effectiveness and help reduce search failures in online catalogs.
8.1 Conclusions
Retrieval performance of the system as measured by precision and recall ratios was such that users judged half the retrieved records as being relevant before relevance feedback searches while the system retrieved less than a quarter of the relevant sources in the database. As users proceeded with relevance feedback searches, they found the retrieved sources less and less helpful although the system retrieved additional relevant sources from the database. In other words, as should be expected, precision ratios dropped sharply after relevance feedback searches while recall ratios almost doubled.
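The two ratios discussed above can be illustrated with a minimal sketch. The function names and the toy document identifiers below are hypothetical and are not taken from the experimental system or its data; the numbers are chosen only to mirror the qualitative pattern described (precision falling while recall rises after relevance feedback). Returning 0.0 for an empty retrieved or relevant set is an assumed convention.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved records that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumed convention for a zero-retrieval search
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant records in the database that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing in the database is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical toy query: 8 relevant records exist in the database.
relevant_records = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# Initial search: 4 records retrieved, 2 of them relevant.
initial_search = ["d1", "d2", "x1", "x2"]

# After relevance feedback: 16 records retrieved, 4 of them relevant.
feedback_search = ["d1", "d2", "d3", "d4"] + [f"x{i}" for i in range(1, 13)]

for label, retrieved in [("initial", initial_search), ("feedback", feedback_search)]:
    print(label, precision(retrieved, relevant_records), recall(retrieved, relevant_records))
```

In this toy case the feedback search retrieves more of the relevant records (higher recall) but also many more non-relevant ones (lower precision), which is the trade-off reported above.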
Although users selected fewer than four records as relevant in more than 75% of the search queries and, in their view, two-thirds of the searches contained less than 25% of the useful sources, they judged two-thirds of the search queries as being effective. In other words, low precision rates do not necessarily mean that users found their search results ineffective. Furthermore, they indicated that the relevance feedback mechanism was helpful and that they retrieved additional relevant sources during the relevance feedback searches, although precision ratios were much lower in those cases.
These seemingly conflicting findings obtained from transaction logs, questionnaires and critical incident reports were confirmed by the results of a multiple linear regression analysis. No strong correlation was found between retrieval performance as measured by precision and recall ratios and users' assessments of search effectiveness (i.e., whether they judged their search as being effective or not, or whether they found what they wanted). Nor was there a strong correlation between precision and recall ratios and user characteristics such as the frequency of online catalog use and knowledge of online searching. These findings also confirmed the main hypothesis of this dissertation, which was that retrieval performance as measured by precision and recall ratios differs from users' assessments of retrieval effectiveness and that variables that define user characteristics do not explain the variability in performance measures.
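As a rough illustration of what "no strong correlation" between a performance ratio and a user judgment means, the sketch below computes a plain Pearson correlation coefficient between per-query precision and a binary effectiveness judgment. This is a simpler stand-in for, not a reproduction of, the multiple linear regression analysis used in the study. The function name and all data values are hypothetical and invented for illustration; they are not results from the experiment.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-query values (NOT data from the study):
# precision ratio of each query, and whether the user judged it effective (1) or not (0).
precision_per_query = [0.10, 0.50, 0.20, 0.80, 0.30, 0.60]
judged_effective = [1, 0, 1, 1, 0, 1]

r = pearson_r(precision_per_query, judged_effective)
print(f"r = {r:.2f}")  # values near 0 indicate a weak linear relationship
```

A coefficient close to zero for such paired values would correspond to the situation described above, in which users' judgments cannot be predicted well from the precision ratio alone.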
The relationship (or lack thereof) between traditional performance measures such as precision and recall, on the one hand, and user characteristics and users' assessments of retrieval effectiveness, on the other, shows once again that measuring retrieval performance is a complex task. It also shows that it is difficult to explain retrieval effectiveness in online catalogs on the basis of variables that define user characteristics and traditional performance measures. No meaningful pattern has emerged as to how the user judges the retrieval results for a given query based on retrieval performance and other variables. Although not directly examined in this dissertation, the findings also indicate that it is extremely difficult to study the search behavior of users when searching online catalogs.
Quantitative findings also suggest that measuring retrieval performance solely on the basis of precision and recall ratios may not satisfactorily explain the causes of all types of search failures that occur in online catalogs. Each search query is unique in the sense that success or failure depends very much on individual circumstances. This observation was confirmed by the qualitative analysis of search failures that occurred during the experiment.
Search queries in the experiment failed predominantly because of collection and user interface failures, which together caused more than half of the search failures. Other causes included search statements, users' unawareness of the capabilities of an experimental online catalog, lack of specific subject headings, and cluster failures, among others. Users experienced some difficulties in adapting to an experimental online catalog with advanced retrieval techniques such as classification clustering and relevance feedback, and sometimes they could not figure out how to continue their searches.
Some users also experienced problems with the natural language user interface because they expected more from it than a natural language interface can deliver. For instance, they expected such interfaces to be capable not only of interpreting Boolean operators and qualifiers but also, in some cases, of providing in-depth or factual answers to research questions. To put it in somewhat different terms, users seemed to have transferred some of the search tactics they developed on Boolean systems over to a probabilistic system and, at the same time, wished to benefit from whatever probabilistic retrieval systems may have to offer (e.g., "best match" techniques, natural language interfaces). This suggests that if probabilistic retrieval systems are to be alternatives to existing online catalogs, they should have the capabilities of existing online catalogs in addition to more advanced search features such as clustering and relevance feedback techniques. For example, the functionality of probabilistic online catalogs can be further increased by utilizing some of the information that is already in place in a MARC record (e.g., author, title, language, publication date).
Users tend to issue longer and sometimes rather specific search queries in probabilistic online catalogs, presumably because they are not constrained by the limitations of the command language and Boolean logic. This, however, complicates the query parsing process, as longer search requests are more likely to contain words that are useless from the retrieval point of view. Presumably, online catalogs will in the future be equipped with a multitude of user interfaces, and users will be able to select the interface type they prefer, be it the command language or the natural language user interface. The coexistence of several user interfaces in an online catalog will facilitate the use of the system by all types of users. Thus, it will be possible to perform a search in a probabilistic online catalog using a command language.
One hypothesis tested in the dissertation was that certain types of search failures will occur less frequently in probabilistic online catalogs than in second generation online catalogs. This hypothesis was confirmed in that zero retrievals and failures due to vocabulary mismatch occurred much less frequently during the experiment. Despite the fact that users submitted several very specific search requests to the system, failures due to zero retrievals constituted less than 8% of all the queries, a far better rate than that in second generation online catalogs. It appears that probabilistic online catalogs are less "brittle" than online catalogs with Boolean searching capabilities regarding zero retrievals. Similarly, very few queries completely failed as a direct consequence of users' not matching their search terms with the system's vocabulary (i.e., titles and subject headings assigned to the documents). Thus, the classification clustering and relevance feedback techniques that are available in the experimental online catalog helped decrease these types of search failures because search terms are matched against both titles and subject headings, thereby increasing the chances of a potential match.
The qualitative analysis of search failures showed that the conceptual model to examine and categorize search failures was comprehensive enough to encompass most, if not all, the types of search failures that occurred during the experiment.
8.2 Further Research
As mentioned earlier, we found that there was no strong correlation between traditional retrieval performance measures and variables that defined users' characteristics and users' assessment of search effectiveness. Similar findings have also been reported elsewhere. However, more research is needed to validate the findings obtained in this study over larger populations of search queries.
It would also be useful to see if the conceptual model developed in this study can be employed to examine and categorize search failures in other studies. Moreover, the model can be refined and used as the starting point to create an even more detailed taxonomy of search failures in online catalogs.
Although retrieval performance in probabilistic online catalogs has been studied using precision and recall measures, search failures that occur in such catalogs have not been fully examined. The present study is the first such attempt and should be replicated on other probabilistic online catalogs and catalogs with natural language interfaces.