CHAPTER VI

FINDINGS

6.0 Introduction

This chapter presents the findings obtained from the experiment described in Chapter V. The first part provides descriptive statistics about users, searches, and search statements captured through transaction logs, questionnaires, and critical incident report forms. Qualitative analysis and evaluation of successful and unsuccessful searches are presented in the second part.

As discussed in Chapter V, an experiment was carried out in the School of Library and Information Studies of the University of California at Berkeley involving master's and doctoral students. They were given access to an experimental online catalog (CHESHIRE) for one semester (Fall 1991), and their complete interactions with the catalog were recorded in transaction logs. The purpose of the experiment was to identify search failures occurring in this experimental online catalog with a view to explicating their causes. The data analyzed below came from a variety of sources, including transaction logs, questionnaire forms, and critical incident report forms.

6.1 Users

Users who agreed to participate in the experiment were asked to fill out a pre-search questionnaire and to sign consent forms (see Appendix G and Appendix H). A total of 43 users participated in the experiment: 30 entering master's-level (MLIS) students (69.8%) and 13 Ph.D. students (30.2%) (Table 6.1). Fifty-eight percent of the participating users indicated through the questionnaire that they searched online catalogs daily, whereas 37% used them weekly (Table 6.2). Two users (4.7%) indicated that they used online catalogs four times a year.

Table 6.1 Users Participating in the Experiment (N=43)

User Type      N       %
MLIS          30    69.8
Ph.D.         13    30.2
TOTAL         43   100.0

Table 6.2 Online Catalog Use by Participants (N=43)

Catalog Use          N       %
Daily               25    58.1
Weekly              16    37.2
Four times a year    2     4.7
TOTAL               43   100.0

A large majority of participating users stated that they knew how to use several types of application software, such as word processors, database management systems (DBMSs), and spreadsheets (Table 6.3). More than 80% of the users knew how to perform online searching. Almost 63% could use at least one computer programming language (e.g., BASIC, C, Pascal). Similarly, 65% of the users were familiar with electronic mail and bulletin board systems (BBSs).

Table 6.3 Users' Knowledge of Computer Software Applications (N=43)

Knowledge of application                     N       %
Word-processing                             43   100.0
Online searching                            35    81.4
Database Management Systems                 32    74.1
Spreadsheets                                31    72.1
Electronic mail & bulletin board systems    28    65.1
Programming languages                       27    62.8
Other (e.g., SPSS)                           3     7.0

These users performed a total of 228 search queries on the CHESHIRE online catalog. Of the 228 search queries conducted throughout the experiment, 175 (76.8%) were carried out by MLIS students and 53 (23.2%) by Ph.D. students (Table 6.4). A more detailed description and analysis of the searches, based on transaction logs, is presented in the next section.

Table 6.4 The Number of CHESHIRE Search Queries Conducted by User Type (N=228)

User Type      N       %
MLIS         175    76.8
Ph.D.         53    23.2
TOTAL        228   100.0

6.2 Description and Analysis of Data Obtained From Transaction Logs

6.2.1 Description and Analysis of Searches and Sessions

The average number of search queries performed per user was 5.3. The number of searches performed by each user varied a great deal, with a mode of 4 searches, a minimum of one, and a maximum of 21. Almost 80% of the users issued between 1 and 6 search queries, which together represented almost half of all queries. Two users alone issued a total of 41 search queries, 18% of all search queries submitted to CHESHIRE during the experiment. Table 6.5 gives the distribution of search queries issued by all participating users.

Table 6.5 Distribution of Search Queries by Users

No. of search queries    No. of users           Total no. of queries    % distribution
issued per user          performing searches    (col. 1 x col. 2)       of total searches
 1                        4                       4                       1.8
 2                        8                      16                       7.0
 3                        6                      18                       7.9
 4                       11                      44                      19.3
 5                        3                      15                       6.6
 6                        2                      12                       5.3
 7                        2                      14                       6.1
 8                        4                      32                      14.0
10                        2                      20                       8.8
12                        1                      12                       5.3
20                        1                      20                       8.8
21                        1                      21                       9.2
TOTAL                    45                     228                     100.1

Note: Percentage totals do not always equal 100% due to rounding.

Using the definition that "[a] session is defined as one continuous period of time during which one user with a single user logon performs a search" or a number of searches (Tremain & Cooper, 1983, p. 67), it was found that the 228 search queries were issued in 106 search sessions, or just over two searches per session. Almost half of the search sessions (52) consisted of a single search query, and twenty-two sessions consisted of two search queries. Table 6.6 provides the session information along with the number of search queries performed; the fourth column gives the percentage of all search queries that were performed in sessions of each size.
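To make the session grouping concrete, the sketch below derives sessions from a transaction log by grouping consecutive queries issued under the same user logon. The log records and field names are hypothetical, not CHESHIRE's actual log format.

```python
# A minimal sketch of sessionizing transaction-log records: consecutive
# queries issued under a single user logon form one session.
from itertools import groupby

log = [  # hypothetical records; CHESHIRE's real log layout is not shown here
    {"logon": "u1-0901", "query": "information policy"},
    {"logon": "u1-0901", "query": "national information policy"},
    {"logon": "u2-0903", "query": "children's literature"},
    {"logon": "u1-0915", "query": "indexing"},
]

# groupby() groups *consecutive* records with the same logon value.
sessions = [list(rows) for _, rows in groupby(log, key=lambda r: r["logon"])]

print(len(sessions))               # 3 sessions
print([len(s) for s in sessions])  # [2, 1, 1] queries per session
```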

Table 6.6 Distribution of Search Queries by Session

No. of search queries    No. of      Total queries          % distribution
issued per session       sessions    (col. 1 x col. 2)      of total searches
1                         52          52                      22.8
2                         22          44                      19.3
3                         11          33                      14.5
4                         10          40                      17.5
5                          5          25                      11.0
6                          2          12                       5.3
7                          2          14                       6.1
8                          1           8                       3.5
TOTAL                    106         228                     100.0

Two-thirds of the participating users (29) performed only one or two sessions. The search queries (99) carried out in those sessions constituted 43% of all search queries. Nine users performed three sessions each, which made up almost 20% of all searches (44 searches). The highest number of sessions performed by any user was 10; this single user performed 9% of all searches. Table 6.7 gives the distribution of search sessions by number of users.

Table 6.7 Distribution of Search Sessions by Users

No. of      No. of users performing    Total no. of      % distribution
sessions    that many sessions         queries issued    of total searches
 1           17                         49                 21.5
 2           12                         50                 21.9
 3            9                         44                 19.3
 4            4                         43                 18.9
 5            1                         10                  4.4
 6            1                         12                  5.3
10            1                         20                  8.8
TOTAL        45                        228                100.1

Note: Percentage totals do not always equal 100% due to rounding.

Users spent almost 23 hours in total searching on CHESHIRE. The average search query took just under 6 minutes to complete. However, completion times varied a great deal. More than one-third of the search queries (36%) took less than one minute to complete (Table 6.8); these appear to be primarily the ones that retrieved nothing or were discontinued by users. Ninety search queries (39.5%) took between one and eight minutes to complete, and forty (17.5%) took between nine and 16 minutes. Few searches (7%) took more than 17 minutes; the longest took 35 minutes to complete.

Table 6.8 Distribution of Search Queries by Completion Time

Time to complete a search    No. of search    % distribution
(in minutes)                 queries          of total searches
Less than 1                    82               36.0
1-5                            40               17.6
5-9                            50               21.9
9-13                           26               11.4
13-17                          14                6.1
17-21                           8                3.5
More than 21                    8                3.5
TOTAL                         228              100.0

 

6.2.2 Description and Analysis of Search Statements

The full list of all search queries submitted to CHESHIRE during the experiment is given in Appendix I. The total number of search terms (excluding stop words) contained in 228 search queries was 802, which represents an average of 3.5 search terms per query (mode 2, median 3). The average number of stop words per search query was 1.3.

One- and two-term search queries represented 40% of all search queries (15% and 25%, respectively) (Table 6.9). Twenty-two percent of all search queries consisted of three terms. Four- and five-term search queries represented more than a quarter of all search queries (17.5% and 8.8%, respectively). Queries with six or more search terms constituted 11.4% of all search queries. The highest number of search terms in a single query was 24 (two instances), which was followed by a query with 19 search terms.

Table 6.9 The Number of Search Terms (excluding stop words) Included in Search Queries (N=802)

Number of         Search queries          Search terms
search terms        N        %              N        %
0                   1       .4              0      0.0
1                  34     14.9             34      4.2
2                  57     25.0            114     14.2
3                  50     21.9            150     18.7
4                  40     17.5            160     19.9
5                  20      8.8            100     12.5
6                  11      4.8             66      8.2
7 or more          15      6.6            178     22.2
TOTAL             228     99.9            802     99.9

Note: Percentage totals do not always equal 100% due to rounding.

 

There were a total of 85 search terms that were taken into account during the retrieval process even though they were not retrieval-worthy. That is to say, some of the search terms users entered should not have been evaluated as part of the search query. For instance, search queries such as "I want information on . . ." or "I want books on . . ." contain terms such as "information" and "books" that have nothing to do with the user's query. However, CHESHIRE cannot identify such terms and exclude them from the query; doing so would require natural language understanding capabilities on the system's part.

Whether such terms are retrieval-worthy or not depends on the context. For instance, "information" and "books" in the previous example are not retrieval-worthy. Yet in a query like "find books on information policy," the term "information" is crucial for retrieval purposes whereas "books" is not.

The most frequently used non-retrieval-worthy search terms (in context) were "books" (13 times), "find" (7 times), "information" and "want" (5 times each), "subject" and "search" (4 times each), "materials" (3 times), and "library" and "studies" (2 times each). Inclusion of these terms (except "find" and "want") during the retrieval process may be especially undesirable for some search queries in a library and information studies database like that of CHESHIRE, for there are many sources in the CHESHIRE database with these terms in their titles and subject headings, which may cause false drops when they match users' search terms.

The CHESHIRE system allows only subject searching and does not support qualification or Boolean operators. Nevertheless, users entered queries in which they asked the system to limit their searches by period (10 times), title (8 times), author (5 times), subject and language (4 times each), and form (3 times) qualifiers. Similarly, one search query contained a negation ("not dissertations"), in which the term "not" was treated simply as a stop word. Two search queries contained truncated search terms ("librar" and "bibliograph#").

A search concept can be described by one or more terms or phrases. The analysis of search statements shows that search queries contained a total of 384 search concepts, an average of 1.7 concepts per search query. Although CHESHIRE does not support Boolean searching, users made (sometimes implicit) use of Boolean operators to describe their search queries: Boolean AND appeared in 133 search queries, Boolean OR in 41, and Boolean NOT in only one.

There were a total of 20 misspelled or mistyped terms in all search queries. In other words, a mere 2.5% of all search terms (20/802) entered by the users contained spelling or typographical errors. Table 6.10 lists the misspelled and miskeyed search terms.

Table 6.10 Spelling and Typographical Errors (N=20)

Search term Search term Search term Search term
Abut englich marchant seamenship
Acookery fin managment suess (2)
alred Finalnd profesiions systenm
basball hitchock policyand vctorian
childrens infor salors  

 

There were 295 search terms that were treated as stop words (e.g., "in," "of," "on") and thus not retrieval-worthy. Some of the most frequently used stop words in search queries were: "of" (47 times), "and" (43 times), "the" (38 times), "in" (28 times), "on" (23 times) and "or" and "for" (12 times each). Table 6.11 gives the ranked list of stop words that were used five or more times in all search queries.
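As an illustration of how a query is split into retrieval-worthy terms and stop words, the following sketch uses a small assumed stop word subset; CHESHIRE's actual stop word list is not reproduced here.

```python
# A minimal sketch of separating search terms from stop words in a query.
# STOP_WORDS is a small hypothetical subset, not CHESHIRE's actual list.
STOP_WORDS = {"i'm", "i'd", "of", "and", "the", "in", "on", "or",
              "for", "to", "all", "about", "like"}

def split_terms(query: str):
    """Return (search terms, stop words) for a query, ignoring case."""
    terms, stops = [], []
    for token in query.lower().split():
        (stops if token in STOP_WORDS else terms).append(token)
    return terms, stops

terms, stops = split_terms("I'm looking for books on the history of libraries")
print(terms)  # ['looking', 'books', 'history', 'libraries']
print(stops)  # ["i'm", 'for', 'on', 'the', 'of']
```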

Table 6.11 Ranked List of Stop Words Used in Search Queries (N=295)

Stop word     N     Stop word     N     Stop word        N
of           47     or           12     to               7
and          43     for          12     all              5
the          38     about         8     like             5
in           28     I'm           8     (all others)    51
on           23     I'd           7

 

6.2.3 Analysis of Search Outcomes

The analysis of transaction logs showed that 18 of the 228 search queries (7.9%) retrieved nothing. Users selected no cluster record as relevant in 61 (26.8%) search queries. Users proceeded to view bibliographic records in 149 search queries (65.3%); that is, they displayed records and selected some as being relevant. Users performed relevance feedback searches in 91 of these 149 search queries (61.1%). In other words, almost two-thirds of such searches were followed up by relevance feedback searches.

Relevance feedback searches were performed more than once for some search queries: users performed relevance feedback searches once for 91 search queries, twice for 28 search queries, three times for 6 queries, and four times for two search queries.

The number of records users displayed and selected as relevant was recorded in transaction logs. Table 6.12 provides descriptive data about precision ratios obtained in the original search and relevance feedback (RF) iterations. Search queries that retrieved nothing due to collection failures were also included in the precision ratio calculations.

Table 6.12 Descriptive Statistics on Number of Records Seen and Selected, and Precision Ratios (Source: Transaction Logs)

("Selected" means "selected as being relevant.")

Retrieval    Total no. of records      Average no. of records     Precision
stage         Seen      Selected        Seen       Selected       ratio (%)
Original      1928        369           12.6        2.4            21.49
RF (1)        1173        156           12.4        1.6            13.21
RF (2)         352         58           11.3        1.2             8.44
RF (3)          82          0           11.7        0.0             0.0
RF (4)          26          1            8.6        0.3             2.5
TOTAL         3561        584                AVERAGE PRECISION     15.84

Notes: The macro evaluation method was used in the calculation of precision ratios. "RF" refers to relevance feedback cycles.
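The macro evaluation noted above averages the per-query precision values rather than dividing the grand totals (which is why, for example, the original-retrieval ratio of 21.49% differs from 369/1,928 = 19.1%). A minimal sketch, using hypothetical per-query counts rather than the study data:

```python
# Macro vs. micro evaluation of precision: a sketch with hypothetical
# (records seen, records selected as relevant) counts per query.
queries = [(20, 8), (10, 0), (5, 4)]

# Macro evaluation (as in Table 6.12): average the per-query ratios.
macro = sum(sel / seen for seen, sel in queries) / len(queries)

# Micro evaluation: divide the grand totals instead.
micro = sum(sel for _, sel in queries) / sum(seen for seen, _ in queries)

print(f"macro precision: {macro:.2%}")  # 40.00%
print(f"micro precision: {micro:.2%}")  # 34.29%
```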

As the table shows, users displayed a total of 3,561 records and selected 584 of them as being relevant. The precision ratio was just over 20% during the original retrieval; in other words, users selected, on average, one in five records as being relevant. It is interesting to note that as users continued their searches with relevance feedback iterations, precision ratios went down sharply: from 21.49% during the original retrieval (i.e., before any relevance feedback cycle) to 13.21% in the first relevance feedback cycle and 8.44% in the second. It became 0% by the third relevance feedback cycle.

It is not possible to explain why precision ratios went down by looking at the precision figures obtained from transaction log data alone. It can only be conjectured that users who performed relevance feedback searches might have been more demanding, or that retrieved records might have become less and less promising as the user proceeded. It has also been suggested that records retrieved during relevance feedback searches often contain a high proportion of false drops because too many nonrelevant terms are used in the feedback process (Walker & Hancock-Beaulieu, 1991, p. 62).

Table 6.12 records the total and average number of records displayed and selected as being relevant in each step (i.e., the original retrieval and the relevance feedback iterations). However, the number of records displayed and selected varied a great deal from search to search. Tables 6.13 and 6.14 provide the distributions of the number of records displayed and selected as being relevant during the original retrieval and the first two relevance feedback cycles.

Table 6.13 The Number of Records Displayed in Search Queries

No. of records    Original retrieval    Relevance feedback    Relevance feedback
displayed                               cycle 1               cycle 2
                    N         %           N         %           N         %
1-5                29       25.0         17      23.0           5      20.8
6-10               10        8.6          8      10.9           4      16.7
11-15              13       11.2         11      14.9           3      12.5
16-20              64       55.2         38      51.3          12      50.0
TOTAL             116      100.0         74     100.1          24     100.0

Note: Percentage totals do not always equal 100% due to rounding.

 

In about a quarter of search queries, users displayed between one and five records, which seems to indicate that they were looking for a few relevant records. More importantly, users displayed between 16 and 20 records in more than half of the search queries. In such search queries, users were evidently either performing exhaustive searches or did not find what they wanted and therefore continued to display subsequent records.

Table 6.14 shows that users selected no records as being relevant in more than a quarter of searches during the original retrieval. The number of searches in which users selected no records went up considerably during the relevance feedback cycles (to about 50%). A majority of users (56.9%) selected between one and six records as being relevant. Very few users selected seven or more records as relevant.

Table 6.14 The Number of Records Selected as Relevant

No. of records    Original retrieval    Relevance feedback    Relevance feedback
selected                                cycle 1               cycle 2
                    N         %           N         %           N         %
0                  33       28.4         33      44.6          13      54.2
1-2                40       34.5         23      31.1           7      29.2
3-4                14       12.1          6       7.9           0       0.0
5-6                12       10.3          9      12.2           2       8.3
7 or more          17       14.6          3       4.1           2       8.3
TOTAL             116       99.9         74      99.9          24     100.0

Note: Percentage totals do not always equal 100% due to rounding.

6.3 Description and Analysis of Data Obtained From Questionnaires

In addition to having their complete interactions with CHESHIRE recorded in transaction logs, users were asked to fill out a post-search questionnaire form for the searches they conducted. This section summarizes the data obtained through the questionnaire.

Questionnaire forms were completed only for those searches that retrieved some records; no forms were filled out for zero-retrieval or out-of-domain search queries.

The self-administered questionnaire contained 10 questions (Appendix D). In addition to factual questions, it elicited data about users' experience with CHESHIRE: for instance, whether they found what they wanted, along with user-perceived search success rates (precision). Questions about the CHESHIRE system itself were also included.

Altogether 92 questionnaire forms were filled out by the users, 62 (67.4%) by MLIS students and 30 (32.6%) by Ph.D. students.

As the post-search questionnaire was administered at the end of the data collection period, there was a time lag of between one week and 16 weeks between the time the users performed their searches and the time they answered the questionnaire. In fact, 74% of questionnaire forms were filled out at least one month after the searches were conducted.

Users were asked (question #3) whether they found what they wanted on their first try when they performed their search queries (Table 6.15). Close to 34% said they did, while the remainder were not as positive. When the negative categories ("no" and "not quite what I wanted") are collapsed, users did not find what they wanted in 58 searches (63%).

Table 6.15 Answers to Question #3: "Did you find what you wanted in your first try?" (N=92)

Answer                      N       %
Yes                        31    33.7
No                         33    35.9
Not quite what I wanted    25    27.2
Don't remember              4     3.3
TOTAL                      92   100.1

Note: Percentage totals do not always equal 100% due to rounding.

The major reasons users did not find what they wanted were that they were looking for something more specific (41.1%) and that the sources retrieved did not look helpful (30.4%) (Table 6.16).

 

Table 6.16 Answers to Question #4: Why Users Did Not Find What They Wanted (N=56)

Reasons                                          N       %
Sources didn't look helpful                     17    30.4
Looking for more specific sources               23    41.1
Looking for more general sources                 1     1.8
Had to wade through a lot of useless sources    11    19.6
Had problems with CHESHIRE                       3     5.4
Other                                            1     1.8
TOTAL                                           56   100.1

Note: Percentage totals do not always equal 100% due to rounding.

Users were asked their perception of search success in terms of precision (Table 6.17). In close to 14% of the cases users found none of the sources useful, and in about 27% of the cases they found less than 10% of the retrieved sources useful. In almost two-thirds of the cases (56, or 63.6%) the percentage of useful sources was less than 50%. In only 20 cases (22.8%) did users find more than 50% of the retrieved sources useful.

Table 6.17 Percentage of Retrieved Sources Users Found Useful (N=88)

Percent Useful     N       %
0                 12    13.6
Less than 10      24    27.3
Less than 25      17    19.3
Less than 50      15    17.0
More than 50      13    14.8
More than 75       4     4.5
More than 90       2     2.3
100                1     1.1
TOTAL             88    99.9

Note: Percentage totals do not always equal 100% due to rounding.

The figures in Table 6.17 suggest that the precision ratios as perceived by the users were quite low. Their perception of low precision corresponds somewhat to how they actually judged the retrieved sources (relevant or nonrelevant): as shown earlier, the average precision ratio based on users' relevance judgments recorded in transaction logs was less than 20%.

Users said they performed relevance feedback searches for close to 54% of the queries, and no relevance feedback searches for about 14% of the queries. Of those who performed relevance feedback searches, more than 50% said the relevance feedback search improved the search results. More than 20% said the sources retrieved during the relevance feedback search were similar to the original retrievals. In almost 20% of the cases the relevance feedback results were either less helpful or not helpful at all (Table 6.18).

Table 6.18 Answers to Question #7: "Did relevance feedback improve the search results?" (N=48)

Relevance Feedback Results     N       %
More useful                   16    33.3
Better                         9    18.8
Similar                       11    22.9
Less helpful                   5    10.4
Not helpful at all             4     8.3
Missing                        3     6.3
TOTAL                         48   100.0

Users who performed relevance feedback searches were asked what percentage of the retrieved sources, including the ones retrieved during relevance feedback searches, they found especially useful (Table 6.19). Twelve and one-half percent of the users said none of the retrieved sources were useful. Almost 17% said they found less than 10% of the retrieved sources useful. Approximately one-third of the users thought that less than 25% of the retrieved sources were useful, and a further 19% indicated that less than 50% were useful. More than 20% said the retrieved sources contained more than 50% useful sources.

It is interesting to note that although users thought relevance feedback searches improved the results and retrieved additional relevant sources in more than 50% of the cases (Table 6.18), their perceptions of precision ratios for retrievals obtained after the relevance feedback searches were quite low (Table 6.19). In other words, they thought that retrievals after the relevance feedback searches contained too many nonrelevant documents, which directly corresponds to how users judged the records retrieved after the relevance feedback searches, as recorded in transaction logs (see Table 6.12). As discussed earlier, the transaction log data show that precision ratios for queries with relevance feedback searches deteriorated quickly and became zero by the third relevance feedback iteration.

Table 6.19 Percentage of Retrieved Sources Users Found Useful After Relevance Feedback Searches (N=48)

Percent Useful     N       %
0                  6    12.5
Less than 10       8    16.7
Less than 25      15    31.2
Less than 50       9    18.7
More than 50       8    16.7
More than 75       1     2.1
More than 90       1     2.1
TOTAL             48   100.0

 

The last two questions in the questionnaire concerned users' experience with the CHESHIRE experimental online catalog: they were asked to indicate what it was that they found most useful and most confusing in CHESHIRE.

6.4 Description and Analysis of Data Obtained From Critical Incident Reports

Critical incident report forms were used to gather both qualitative and quantitative information (see Appendix E and Appendix F). Users were asked to evaluate their searches from a number of different perspectives: the overall effectiveness of the search, their information needs, types of sources they were looking for, whether they carried out relevance feedback search, and so on. No critical incident report forms were filled out for searches which retrieved no clusters (zero retrievals and out-of-domain search queries). Similarly, search queries for which users selected no clusters as relevant were also excluded. The quantitative data obtained through critical incident forms are presented below.

A total of 114 critical incident report forms were filled out. Users judged their search queries as being effective in almost 70% of the cases (Table 6.20); the search outcome was found ineffective in the remaining cases (30.7%).

Table 6.20 User-Designated Search Success (N=114) (Source: Critical Incident Report Forms)

Search Outcome      N       %
Effective          79    69.3
Ineffective        35    30.7
TOTAL             114   100.0

 

Of those users who judged their searches as being effective, about 42% said the system retrieved most of the useful sources that they needed for the search (i.e., the perceived recall ratio was greater than 50%), whereas 15% thought otherwise. Similarly, about 37% of the respondents said more than half the sources they found using the system were useful, whereas about 18% thought otherwise.

It is interesting to compare the data obtained through the transaction logs, the questionnaire, and the critical incident report forms at this point. As we pointed out at the close of the previous section, users' perceptions of low precision were confirmed by the transaction logs. Yet, as Table 6.20 indicates, the users we interviewed found the search results effective for the majority of search queries. This suggests that there is very little correspondence between retrieval performance as measured by precision and the way users evaluate the outcome of search queries as a whole. To put it differently, a user may find the search results effective even if the precision ratio for a given search query, as judged by that same user, happens to be low.

Of those users who judged the search results as being ineffective, about 83% said the system failed to retrieve most of the useful sources (i.e., the perceived recall ratio was less than 50%), whereas about 14%, despite judging the search outcome ineffective, did not think the system had failed. All the respondents who judged their search results as being ineffective indicated that more than half the sources they found using the system were useless.

This finding suggests that some users judged the search outcome as being ineffective when the majority of the useful records were not retrieved. That is to say, they were more concerned about retrieving most, if not all, relevant records in the database (i.e., high recall) and they attributed a considerable weight to this fact when they judged the overall outcome of the search query.

6.5 Descriptive and Comparative Analysis of Data Gathered Through All Three Data Collection Methods

In section 6.2 above, the results of search queries users performed on CHESHIRE were given. Descriptive data about searches, search sessions, and search statements were delineated, and the precision ratios recorded in transaction logs for 149 search queries were presented in tables (section 6.2.3). The precision ratio by itself is not sufficient to determine retrieval effectiveness in document retrieval systems. In the following analysis, recall ratios for each search query are also calculated. (For a detailed explanation of the calculation of recall ratios, see Chapter V, section 5.5.1.2.) Precision and recall ratios obtained before relevance feedback searches are given in Tables 6.21 and 6.22, respectively. Figure 6.1 plots the precision and recall ratios for each search query on the same graph.
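The per-query measures can be stated compactly as follows. This is a minimal sketch in which the number of relevant records in the database for a query is taken as given (its estimation is described in Chapter V, section 5.5.1.2), and the example values are hypothetical.

```python
# Per-query precision and recall as used in this analysis:
# precision = records judged relevant among those displayed / records displayed
# recall    = relevant records retrieved / all relevant records in the database

def precision(displayed: int, selected: int) -> float:
    """Proportion of displayed records the user judged relevant."""
    return selected / displayed if displayed else 0.0

def recall(selected: int, total_relevant: int) -> float:
    """Proportion of all relevant records in the database that were found."""
    return selected / total_relevant if total_relevant else 0.0

# Hypothetical query: 15 records displayed, 6 judged relevant,
# 20 relevant records estimated to exist in the database.
print(f"P = {precision(15, 6):.0%}, R = {recall(6, 20):.0%}")  # P = 40%, R = 30%
```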

Table 6.21 Precision Ratios Before Relevance Feedback Searches (N=118)

Ranges of          No. of searches having
precision ratio    this precision value        %
0 - 10%                   24                  20.3
11 - 20                    2                   1.7
21 - 30                   15                  12.7
31 - 40                    8                   6.8
41 - 50                   12                  10.2
51 - 60                    6                   5.1
61 - 70                    9                   7.6
71 - 80                    7                   5.9
81 - 90                   15                  12.7
91 - 100                  20                  16.9

Average Precision Ratio Before Relevance Feedback Searches = 50.1%

Table 6.22 Recall Ratios Before Relevance Feedback Searches (N=118)

Ranges of       No. of searches having
recall ratio    this recall value           %
0 - 10%                45                 38.1
11 - 20                15                 12.7
21 - 30                19                 16.1
31 - 40                 8                  6.8
41 - 50                15                 12.7
51 - 60                 6                  5.1
61 - 70                 5                  4.2
71 - 80                 2                  1.7
81 - 90                 2                  1.7
91 - 100                1                   .8

Average Recall Ratio Before Relevance Feedback Searches = 23.6%

 

Figure 6.1 Retrieval Performance in CHESHIRE Before Relevance Feedback Searches (N=118)

[Scatter diagram: precision ratio (vertical axis, 0-100%) plotted against recall ratio (horizontal axis, 0-100%) for each search query. "X" marks the average precision (50.1%) and average recall (23.6%) ratios.]

 

Precision ratios obtained before relevance feedback searches (Table 6.21) show a great deal of variation. The average precision ratio for the 118 queries was 50.1%; in other words, half of the retrieved sources were judged as being relevant by the users. Recall ratios, on the other hand, are concentrated in the lower end of the spectrum, indicating that the majority of the searches retrieved less than half of the relevant sources in the database (Table 6.22). In fact, the recall ratio was about 25% or less for almost 80% of the search queries. The average recall ratio was 23.6%. (Precision and recall ratios for all search queries are given in Appendix J.) The figure shows that there is no strong correlation between precision and recall ratios obtained before the relevance feedback searches, and a correlation analysis confirms this (Pearson's r=.20, p=.033).
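A correlation analysis of this kind can be reproduced as sketched below; the per-query ratios here are hypothetical stand-ins for the values in Appendix J, and the scipy library is one assumed way to compute Pearson's r.

```python
# A minimal sketch of correlating per-query precision and recall ratios.
from scipy.stats import pearsonr

precision_ratios = [0.05, 0.25, 0.50, 0.90, 1.00, 0.30, 0.80, 0.10]  # hypothetical
recall_ratios    = [0.02, 0.10, 0.30, 0.20, 0.05, 0.45, 0.15, 0.60]  # hypothetical

r, p = pearsonr(precision_ratios, recall_ratios)
print(f"Pearson's r = {r:.2f}, p = {p:.3f}")
```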

The precision and recall ratios presented in Table 6.21, Table 6.22, and Figure 6.1 reveal some interesting findings. Several studies in the past have reported an inverse relationship between precision and recall, whereas no clear pattern emerged in this study as to the relationship between precision and recall ratios obtained before relevance feedback searches. The discrepancy may be due to two factors: 1) the number of observations in this study was relatively small, so the findings regarding precision and recall ratios may not be definitive; and, more importantly, 2) the method of calculating retrieval performance measures in this study differs from that of other studies. For instance, precision ratios reported in the past were usually based on all the retrieved records for a given query, whereas in this study they were based only on the retrieved records that were actually displayed. The precision ratio was calculated as the proportion of displayed records judged as being relevant to all displayed records, which disregards the fact that there may have been more relevant records among the retrieved ones that the user chose not to display. In fact, this is one of the reasons why precision ratios for individual search queries varied a great deal in this study: some users displayed only a few records while others displayed many.

It is also conceivable that some users may have been browsing and thus did not necessarily wish to make relevance judgments on the retrieved records, which may have suppressed the precision ratios to a certain extent.

Table 6.21, Table 6.22, and the scatter diagram (Fig. 6.1) presented above represent the precision and recall ratios obtained before the relevance feedback process. As mentioned before (section 6.2.3), users continued their searches with relevance feedback iterations in 91 search queries. Tables 6.23 and 6.24 provide the precision and recall ratios obtained after relevance feedback searches, along with the corresponding scatter diagram (Fig. 6.2). The average precision and recall ratios given in these figures represent the averages of the ratios obtained both before and after relevance feedback searches. That is to say, if the user continued his or her search after the original retrievals and performed a relevance feedback search, the average of both results is taken. For instance, if, for a given search query, the precision ratio is 40% before the relevance feedback search and it increases to 60% after the relevance feedback search, the average precision ratio for the full search will be the average of both ratios (i.e., 50%).

Table 6.23 Precision Ratios After Relevance Feedback Searches (N=116)

Ranges of          No. of searches having
precision ratio    this precision value        %
0 - 10%                   47                  40.5
11 - 20                   24                  20.7
21 - 30                   23                  19.8
31 - 40                    5                   4.3
41 - 50                   10                   8.6
51 - 60                    1                    .9
61 - 70                    2                   1.7
71 - 80                    1                    .9
81 - 90                    3                   2.6
91 - 100                   0                    .0

Average Precision Ratio After Relevance Feedback Searches = 18.3%
Table 6.24 Recall Ratios After Relevance Feedback Searches (N=116)

Ranges of       No. of searches having
recall ratio    this recall value           %
0 - 10%                23                 19.8
11 - 20                 5                  4.3
21 - 30                14                 12.1
31 - 40                 9                  7.8
41 - 50                19                 16.4
51 - 60                 7                  6.0
61 - 70                11                  9.5
71 - 80                 2                  1.7
81 - 90                12                 10.3
91 - 100               14                 12.1

Average Recall Ratio After Relevance Feedback Searches = 45.4%

     

Figure 6.2 Retrieval Performance in CHESHIRE After Relevance Feedback Searches (N=116)

[Scatter diagram: precision ratio (vertical axis, 0-100%) plotted against recall ratio (horizontal axis, 0-100%) for each search query. "X" marks the average precision (18.3%) and average recall (45.4%) ratios.]

As should be expected, as users proceeded with their searches through relevance feedback iterations, precision ratios decreased whereas recall ratios increased. That is to say, CHESHIRE managed to retrieve additional relevant records during the relevance feedback searches that were not retrieved in the original searches. On the other hand, as the number of retrieved records increased with relevance feedback searches, so did the proportion of nonrelevant records among the retrieved ones. The average recall ratio almost doubled, from 23.6% to 45.4%, whereas the average precision ratio went down from 50% to less than 20%. (See Appendix J for complete precision and recall ratios for all search queries.) Again, there is no strong correlation between precision and recall ratios obtained after relevance feedback searches (Pearson's r=-.13, p=.165).

The above figures show that the relevance feedback technique used in CHESHIRE improved the search results by retrieving additional relevant records from the database. However, there is no strong correlation between the precision ratios obtained before the relevance feedback searches and those obtained after them (Pearson's r=-.09, p=.327). Similarly, there is no strong correlation between the recall ratios obtained before the relevance feedback searches and those obtained after them (Pearson's r=.17, p=.072). On the other hand, there was a fairly high correlation between precision ratios obtained before relevance feedback searches and recall ratios obtained after the relevance feedback searches (Pearson's r=.86, p=.0005).

Notice that no observations were recorded in the upper left-hand corner of the scatter diagram, which represents search queries with higher precision (i.e., greater than 50%) and lower recall (i.e., less than 50%) ratios. This was due to two factors. First, the precision ratios reported in Fig. 6.2 are the averages of precision ratios obtained both before and after relevance feedback searches. For example, if the precision ratio for a given query is 60% before the relevance feedback search and 20% after, the average precision ratio will be 40% ((60+20)/2).

Second, the absence of observations in the upper left-hand corner of the scatter diagram indicates that it is difficult, if not impossible, to score consistently high precision and high recall ratios in online catalogs. More often than not, users have to make compromises (i.e., high recall or high precision, but not both). This finding is also consistent with the probabilistic nature of the document retrieval process.

T-tests were performed to determine whether there was any difference in average precision and recall ratios between the MLIS and Ph.D. students. MLIS students obtained slightly higher precision and recall ratios (before relevance feedback searches) than Ph.D. students did (51% vs. 47% for precision and 25% vs. 19% for recall, respectively). However, the differences between the two groups are not statistically significant (for precision, t=.50, p=.65; for recall, t=1.41, p=.16). Similarly, there appears to be no difference between the precision and recall ratios (after relevance feedback searches) obtained by MLIS and Ph.D. students (20% vs. 14% for precision and 45% vs. 46% for recall, respectively), and the results of the t-tests were not statistically significant (t=.50, p=.65 for precision; t=1.41, p=.16 for recall).
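A two-sample t-test of this kind can be sketched as follows; the per-query values are hypothetical, and scipy's independent-samples t-test is one assumed way to carry it out (the dissertation does not name the software used).

```python
# A minimal sketch of comparing two groups' precision ratios with a t-test.
from scipy.stats import ttest_ind

mlis_precision = [0.60, 0.45, 0.70, 0.30, 0.55]  # hypothetical per-query values
phd_precision  = [0.50, 0.40, 0.65, 0.25]        # hypothetical per-query values

t, p = ttest_ind(mlis_precision, phd_precision)
print(f"t = {t:.2f}, p = {p:.2f}")
```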

T-tests were also carried out to determine whether there was any difference in average precision and recall ratios between effective and ineffective searches. As should be expected, precision and recall ratios for effective searches were higher; average precision and recall ratios for effective searches were sometimes as much as two times higher than those for ineffective ones (Table 6.25).

Table 6.25 Descriptive Statistics for Effective and Ineffective Searches

Precision (P) and recall (R)      Effective searches    Ineffective searches    Total
ratios before and after            N    Avg    SD        N    Avg    SD          N    Avg    SD
relevance feedback (RF) searches
P before RF                       71    .64    .29      47    .29    .35       118    .51    .36
R before RF                       71    .28    .25      47    .17    .20       118    .24    .24
P after RF                        71    .21    .21      45    .14    .17       116    .18    .20
R after RF                        71    .56    .27      45    .28    .33       116    .45    .33

     

The results of the t-tests indicate that the differences in precision and recall ratios between effective and ineffective searches are all statistically significant. The average precision ratio for effective searches (before relevance feedback) was 64%, as opposed to 29% for ineffective searches (t=5.93, p=.0005), whereas the average recall ratio was 28% for effective searches compared with 17% for ineffective ones (t=2.47, p=.015). Similarly, the average precision ratio for effective searches (after relevance feedback) was 21%, as opposed to 14% for ineffective searches (t=2.01, p=.047), while the average recall ratio was 56% for effective searches compared with 28% for ineffective ones (t=4.84, p=.0005).

The results of a χ2 test show that there was a strong relationship between user type (MLIS vs. Ph.D.) and users' finding what they wanted (χ2=6.82, df=1, p=.009), indicating that Ph.D. students were more likely to find what they wanted in their online catalog searches than MLIS students were. MLIS students found what they wanted in only a quarter of the searches they performed, whereas Ph.D. students found what they wanted in more than half of theirs.
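The chi-square test of independence behind this comparison can be sketched as below; the 2x2 contingency counts are hypothetical, not the study's actual cross-tabulation.

```python
# A minimal sketch of a chi-square test between user type and
# finding what was wanted, with hypothetical counts.
from scipy.stats import chi2_contingency

#          found   not found
table = [[  15,      45],    # MLIS (hypothetical)
         [  14,      12]]    # Ph.D. (hypothetical)

chi2, p, df, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
```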

6.6 Multiple Linear Regression Analysis Results

A model was developed to examine the relationship between the performance of the system, as measured by precision and recall, and variables that defined user characteristics and users' assessments of search performance. The models were of the form

Y = a + b1 x UTYPE + b2 x CATUSE + b3 x ONSRCH + b4 x PLANG + b5 x EI + b6 x FINDIT + b7 x RFPERF

where Y is the dependent variable and UTYPE, CATUSE, ONSRCH, PLANG, EI, FINDIT, and RFPERF are the independent variables (a sketch of fitting such a model follows the variable definitions below). Four dependent variables were used. They are:

1) ORPREC: Precision ratio obtained before relevance feedback searches
2) ORRCLL: Recall ratio obtained before relevance feedback searches
3) AVPREC: Precision ratio obtained after relevance feedback searches, and
4) AVRCLL: Recall ratio obtained after relevance feedback searches.

The seven independent variables are defined below:

1) UTYPE: User type (MLIS vs. Ph.D. students)
2) CATUSE: The frequency of online catalog use (i.e., daily, weekly)
3) ONSRCH: Knowledge of online searching
4) PLANG: Knowledge of programming languages
5) EI: Search effectiveness (i.e., whether the user found his or her search to be effective or not)
6) FINDIT: Finding what is wanted (i.e., whether the user found what he or she was looking for), and
7) RFPERF: Relevance feedback search (i.e., whether the user performed a relevance feedback search).
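A minimal sketch of fitting such a model by ordinary least squares follows; the data are randomly generated placeholders, and the statsmodels package is an assumed choice (the dissertation does not say what software performed the regression).

```python
# Fitting Y = a + b1*UTYPE + ... + b7*RFPERF on hypothetical data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 73
# Seven independent variables coded 1/2, as in Table 6.26.
X = np.column_stack([rng.integers(1, 3, n) for _ in range(7)])
X = sm.add_constant(X)            # adds the intercept term a
orprec = rng.uniform(0, 1, n)     # hypothetical precision ratios (0-1)

model = sm.OLS(orprec, X).fit()
print(model.rsquared)             # r2, cf. Table 6.31
print(model.f_pvalue)             # significance of F
```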

Descriptive statistics about the independent variables are summarized in Table 6.26.

Table 6.26 Descriptive Statistics About Independent Variables

                                                       Frequency distribution
Independent variable                                        1        2
User type (1: MLIS  2: Ph.D.)                              88       30
Frequency of catalog use (1: Daily  2: Weekly)             69       39
Knowledge of online searching (1: Yes  2: No)              93       25
Knowledge of programming (1: Yes  2: No)                   77       41
Search effectiveness (1: Effective  2: Ineffective)        71       47
User finding what he or she wanted (1: Yes  2: No)         29       47
Performed relevance feedback search (1: Yes  2: No)        40        9

Multiple linear regression analysis was used to evaluate relationships between precision and recall ratios and the seven independent variables. Table 6.27 shows the correlations between precision ratios obtained before relevance feedback searches and the seven independent variables. As can be seen from the correlation coefficients, there was no strong correlation between precision and any of the independent variables. However, two independent variables had some slight correlation with the dependent variable: users' perception of search effectiveness (r=-.41) and whether they found what they were looking for in the online catalog (r=-.22).

Table 6.27 Relationships of Measures That Are Correlated With ORPREC (Precision Ratio Before Relevance Feedback Searches) (N=73)

          UTYPE   CATUSE   ONSRCH   PLANG   EI      FINDIT   RFPERF
ORPREC    -.01    .09      -.02     -.10    -.41*   -.22*    .08
UTYPE             .15      .38*     .46*    .05     -.19     -.19*
CATUSE                     .39*     -.04    .07     .02      .16
ONSRCH                              .29*    -.24*   -.17     -.19
PLANG                                       -.07    -.20*    -.21*
EI                                                  .52*     .11*
FINDIT                                                       -.03

*Statistically significant at or below the .05 level.

     

     

Similarly, there was no strong correlation between recall ratios obtained before relevance feedback searches and any of the independent variables (Table 6.28). However, two independent variables had some slight correlation with the dependent variable: the frequency of catalog use (r=.26) and search effectiveness (r=-.22).

Table 6.28 Relationships of Measures That Are Correlated With ORRCLL (Recall Ratio Before Relevance Feedback Searches) (N=73)

          UTYPE   CATUSE   ONSRCH   PLANG   EI      FINDIT   RFPERF
ORRCLL    -.09    .26*     .06      .19     -.22*   -.11     .02
UTYPE             .15      .38*     .46*    .05     -.19     -.19*
CATUSE                     .39*     -.04    .07     .02      .16
ONSRCH                              .29*    -.24*   -.17     -.19*
PLANG                                       -.07    -.20*    -.21*
EI                                                  .52*     .11
FINDIT                                                       -.03

*Statistically significant at or below the .05 level.

     

There was no strong correlation between precision and recall ratios obtained after the relevance feedback searches and any of the independent variables (Tables 6.29 and 6.30, respectively). Knowledge of programming was slightly correlated (r=.26) with the dependent variable AVPREC, the precision ratio obtained after the relevance feedback searches (Table 6.29). Search effectiveness had some slight correlation (r=-.30) with the dependent variable AVRCLL, the recall ratio obtained after the relevance feedback searches (Table 6.30).

Table 6.29 Relationships of Measures That Are Correlated With AVPREC (Precision Ratio After Relevance Feedback Searches) (N=71)

          UTYPE   CATUSE   ONSRCH   PLANG   EI      FINDIT   RFPERF
AVPREC    -.10    .14      -.01     .26*    -.19    -.10     -.17
UTYPE             .19      .40*     .43*    .01     -.22*    -.17
CATUSE                     .39*     -.01    .02     .01      .15
ONSRCH                              .32*    -.24*   .17      -.20*
PLANG                                       -.14    -.25*    -.18
EI                                                  .51*     .15
FINDIT                                                       -.01

*Statistically significant at or below the .05 level.

Table 6.30 Relationships of Measures That Are Correlated With AVRCLL (Recall Ratio After Relevance Feedback Searches) (N=71)

          UTYPE   CATUSE   ONSRCH   PLANG   EI      FINDIT   RFPERF
AVRCLL    .08     .03      -.02     .00     -.30*   -.11     -.02
UTYPE             .19      .40*     .43*    .01     -.22*    -.17
CATUSE                     .39*     -.01    .10     .04      .15
ONSRCH                              .32*    -.24*   -.17     -.20*
PLANG                                       -.14    -.25*    -.18
EI                                                  .51*     .15
FINDIT                                                       -.01

*Statistically significant at or below the .05 level.

No strong intercorrelations were observed among the independent variables, either. However, search effectiveness was moderately intercorrelated in all cases with whether the user found what he or she wanted in the online catalog search, indicating that users who found what they wanted were more likely to judge their searches as being effective.

These findings suggest that users' judgments of the effectiveness of their searches turned out to be the most significant factor in predicting precision and recall ratios. Search effectiveness was negatively correlated, although not strongly, with all but one of the dependent variables (precision obtained after relevance feedback searches), indicating that those who judged their searches as being ineffective were less likely to obtain higher precision and recall values.

Nonetheless, it should be emphasized that the correlations between the dependent and independent variables were not strong. As can be seen from the multiple linear regression analysis results (Table 6.31), all seven independent variables combined explain only about 25% of the observed variability in precision and recall ratios. Almost 75% of the observed variability remains unexplained.

Table 6.31 Summary of Multiple Linear Regression Analysis

Dependent variable name                                  N     r2     F      Significance of F
Precision ratio before rel. fdbck. search (ORPREC)      73    .25    3.03     .008
Recall ratio before rel. fdbck. search (ORRCLL)         73    .24    2.91     .010
Precision ratio after rel. fdbck. search (AVPREC)       71    .26    3.21     .006
Recall ratio after rel. fdbck. search (AVRCLL)          71    .14    1.46     .196

The results of the multiple linear regression analysis may not be definitive, as the sample size was small. Nonetheless, the results indicate that user characteristics (i.e., frequency of online catalog use, knowledge of online searching and programming languages) and users' own assessments of search performance (i.e., search effectiveness, finding what is wanted) are not adequate measures for predicting system performance as measured by precision and recall ratios. To put it somewhat differently, as a considerable percentage of the observed variability in precision and recall ratios remains unexplained, the regression model developed above cannot reliably explain the relationship between precision and recall ratios and the measures studied here. It is therefore difficult to use this model to examine the relationship between system performance as measured by precision and recall and variables defining user characteristics and users' judgments of search effectiveness.

6.7 Summary

Quantitative data collected by means of transaction logs, questionnaires, and critical incident report forms were summarized in this chapter. Descriptive statistics were given on the participating users, their search queries, search outcomes in terms of the number of records seen by users and selected as being relevant, and users' assessments of search results. The retrieval performance of CHESHIRE as measured by precision and recall was also discussed, along with the results of a regression model.

The quantitative analysis of the retrieval performance of the system was based on a total of 228 queries submitted by MLIS and doctoral students of the School of Library and Information Studies at the University of California at Berkeley. An average search query took just under six minutes to complete, although one-third of the queries took less than one minute, mainly due to search failures (i.e., zero retrievals) and discontinued searches. On average, a search statement contained 3.5 terms, which is relatively higher than what has been reported for second-generation online catalogs. This suggests that users may have felt less constrained in describing their requests to an online catalog with a natural language user interface. Misspellings and typographical errors were relatively few; only 2.5% of all search terms contained such errors. Some queries also contained terms that were useless from the retrieval point of view ("I want information on . . .", "please find some books on . . .").

Although users displayed between 16 and 20 records in more than half the searches, they selected only between 0 and 4 records as relevant in more than 75% of all search queries. In the users' view, about two-thirds of the searches retrieved less than 25% useful sources. The main reasons for this were that users were looking for more specific sources and that the retrieved sources did not look helpful. The number of records selected declined further as users performed relevance feedback searches. Yet users felt that they retrieved additional useful sources during relevance feedback searches in more than 50% of the cases. Although the precision ratios obtained from transaction logs were low, users who were interviewed judged two-thirds of the search queries as being effective. This finding suggests that precision was not the only criterion in their assessments of search effectiveness.

The average precision ratio before relevance feedback searches was about 50%, whereas the average recall ratio was about 24%. In other words, one out of every two records retrieved was judged as being relevant by the users, yet the system retrieved only one out of every four relevant documents in the database. The average precision ratio after relevance feedback searches went down to less than 20%, whereas the average recall ratio rose to 45%. In other words, an almost two-fold increase was observed in the recall ratios after relevance feedback searches, while the average precision ratio declined from 50% to 18%. Although the relevance feedback technique helped retrieve additional relevant documents from the database after each iteration, thereby raising the average recall ratio to 45%, the average precision ratio dropped drastically after each relevance feedback cycle.

T-tests showed that MLIS students obtained slightly higher precision and recall ratios than Ph.D. students, although the difference was not statistically significant. Yet a χ2 test indicated that Ph.D. students were more likely to find what they wanted in the online catalog than MLIS students, and this difference was statistically significant. MLIS students found what they wanted in less than a quarter of their searches, whereas Ph.D. students found what they wanted in more than half of theirs. As should be expected, precision and recall ratios for effective searches were significantly higher than those for ineffective ones.

Finally, a multiple linear regression analysis, which aimed to examine the relationship between CHESHIRE's retrieval performance as measured by precision and recall ratios and users' judgments of the system's search performance, found that users' assessments of the effectiveness of their searches were the most significant factor in explaining precision and recall ratios. However, there was no strong correlation between the precision and recall measures and user characteristics or users' assessments of retrieval performance. It was concluded that the regression model developed here cannot be used to examine the relationship between these measures, as all seven independent variables combined explained only a quarter of the observed variability in precision and recall ratios.

It must be stated that these results were obtained without an experimental design with control and experimental groups (see Chapter V), and thus the results may be biased. Nevertheless, the findings we obtained and the conclusions we reached regarding the relationship between performance measures and users' assessments of search effectiveness are commensurate with findings obtained in other studies. For instance, although she did not study recall, Su (1992) found that precision is not correlated with search success. However, more research is needed to validate the findings obtained in this study over larger populations of search queries.
