Once upon a time there was a Center for Contemporary Middle East studies at Odense University in a country called Denmark by its inhabitants.1 That center wanted an Arabic book collection badly, and they did in fact have one: It was stored in a dark room in a big box. But the books wanted to be on the nice airy shelves in the University library among all the other colorful and beautiful books, and the faculty felt so sorry for the lonesome Arabic books that they decided to send the books to the library for cataloging.
One week passed and everybody thought that the books were happy on their shelves -- but it was not so. The books were still very unhappy and still stored in the box in another dark room. The faculty felt sorry for the books again and complained to the Library. And six months later (this is the academic world...) the box came back to the center accompanied by a pale and shaky librarian saying something like -- "This is too much! This is Arabic -- what are we going to do with this?"
The faculty now felt so sorry for the books that they decided to hire someone to take care of them and to help them get into the library so that they could take their rightful place amongst the shining books about chemistry, archaeology, management, etc. But the books were not quite happy yet: They also wanted to be retrieved. To be retrieved just as effectively as the other gleaming books.
To romanize or not to romanize...
The primary task was to decide whether the bibliographic records of the Arabic books should be romanized or cataloged in the vernacular script. It was decided to use the Arabic script for cataloging. This is not the easiest solution in terms of programming. In addition, Arabic is not very similar to English, Danish, German, etc. It is actually quite different.
These differences could lead to the assumption that Arabic will work differently from European languages for retrieval purposes. The question of how effective the Arabic language is for retrieval purposes should therefore be raised. Put differently, can Arabic script records be added to existing Latin script-based implementations without introducing system modifications that take language differences into account? So far, however, nobody has investigated the retrieval effectiveness of Arabic in bibliographic databases.
So in order to measure how effectively or ineffectively Arabic functions in bibliographic databases, it was necessary to identify a bibliographic database with a sufficiently large scale implementation of Arabic script, a well documented system supported by an apparently expert and informed staff. Containing more than 30,000 Arabic script records, the Research Libraries Information Network (RLIN) database of the Research Libraries Group (RLG) fulfills these requirements. After some negotiation between RLG and Odense University, RLG decided to welcome and support this research project. The project became in effect a joint venture in which both institutions share the same interest -- the need to answer the questions asked above.
The Methodology
Testing the retrieval effectiveness of the Arabic language has not yet been undertaken. Therefore, it was necessary to construct a theoretical framework for such a test, to operationalize the experimental set-up and, naturally, to carry out the experiment itself.
The first step taken in developing the methodology was to map the differences of the Arabic language that might affect retrieval effectiveness. These can be seen simply as three features of Arabic. The Arabic language uses prefixes, infixes and suffixes.
The second step was to investigate how the effect of these features could be measured. As the formulation of the problem relates to the term "differences" the methodology had to incorporate a comparison. It was decided to compare the retrieval effectiveness of Arabic to the retrieval effectiveness of English. Such a comparative analysis of two languages for retrieval purposes had not been carried out before. However, several studies relate to the retrieval effectiveness of English.
Since the 1960s, information retrieval (IR) effectiveness has been measured by the twin measures of recall and precision:
$$\text{Recall} = \frac{a}{a + c}, \qquad \text{Precision} = \frac{a}{a + b}$$
In the above formula, a represents the retrieved relevant documents, b the retrieved non-relevant documents and c the non-retrieved relevant documents. These two measures are, however, problematic in three respects.
1. If a recall and precision analysis is done on one database, the results cannot be compared with those of other analyses unless the studies employ very similar databases.2 The comparison is invalid because the figures vary, especially with the number of records. And, unfortunately, no one has so far discovered the exact ratio of variance, so that corrections can be made for the number of records. Therefore, if comparison is what one wants, it is necessary to carry out two experiments: one purely English and one purely Arabic, using the same structure of records, the same retrieval facilities and the same number of records.
2. It is apparent that the recall and precision analysis necessitates a relevance judgment. IR research has so far used several definitions, none of them being very clear. How to define relevance depends on the basic set-up of the analysis.
3. For the basic set-up of the experiment, the researcher carrying out a recall and precision analysis has to choose between controlling the experimental variables (the laboratory experiment) and reflecting a real-life situation (the operational investigation). A recall and precision analysis can be carried out in both ways. If very little or nothing is known about performance before the experiment, the approach closest at hand is that of the laboratory experiment. As the retrieval effectiveness of Arabic has not been investigated before, and as no prior studies relate to comparison of performance between two languages, the basic methodology for this study became that of the laboratory experiment.
The laboratory experiment necessitates very strict definitions, both conceptual and operational. It allows for highly artificial definitions and artificial raw material. So it was possible to define relevance in an artificial way. For this study relevance has been defined conceptually as:
IR subdivides aboutness of documents into author aboutness, indexer aboutness, user aboutness and request aboutness. This subdivision is, as can be seen, based on how various groups of people consider what a document is about. And the various aboutnesses of a document might not be the same. Thus, dealing with a laboratory experiment where all the components are isolated and static means that these groups of people could not be asked. However, as all records of the RLIN database contain descriptions of what the documents are about as conceived by indexers, that is, indexing terms, the obvious choice was to prefer the indexer aboutness as the basis of the relevance judgment.
Finally, what is needed for a recall and precision analysis is a basic set of raw materials, including a system, some records and some queries.
For the system, as the laboratory experiment was chosen, the RLIN production database itself could not be used. The laboratory experiment should keep the experimental environment static and not allow anything from the outside of this environment to disturb the results. Thus, a clone of RLIN was built for the purpose.
Comparison can only take place if the two things compared share similarities. As the experiment aimed at comparing retrieval effectiveness of Arabic with retrieval effectiveness of English, it was necessary to collect the remainder of the raw material in such a way that it shared as many similarities as possible.
It was necessary to determine the number of records, as well as their subject area coverage. The number of records can be determined relatively easily through the use of standard statistics, and 1,100 randomly selected records for each sample proved sufficient. For this experiment, three systematic random samples needed to be identified: One in English, one in Arabic, and one bi-lingual for control purposes.
Previous research, however, indicated that recall and precision differ from subject area to subject area. Previous results more or less indicated that recall and precision are higher for "hard sciences" than for "soft sciences". In order to assure that these differences were considered, it was decided to aim at a subject area coverage that was as similar as possible across the samples. The obvious solution to this problem was to construct the random samples as stratified random samples in terms of subject area coverage. It was decided to use the first letter of the Library of Congress Classification Number (LCCN) as the indicator of the subject area of the records. However, not all the groups in LCCN can be expected to be covered equally by the Arabic records. Thus, it was decided to use only some of the subject area groups of LCCN. In order for an LCCN group to become a candidate for the samples, it had to contain more than 1% of the Arabic records in RLIN.
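The 1% rule for candidate groups can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `candidate_groups`, the toy classification numbers and their counts are all hypothetical, and the actual count was made on one year of RLIN accessions, not on data like this.

```python
from collections import Counter

def candidate_groups(class_numbers, threshold=0.01):
    """Return the first letters whose share of all records exceeds threshold."""
    # Use the first letter of each classification number as the subject group
    letters = [cn[0].upper() for cn in class_numbers if cn]
    counts = Counter(letters)
    total = len(letters)
    return sorted(letter for letter, n in counts.items() if n / total > threshold)

# Hypothetical toy data: 200 classification numbers in four groups
toy_class_numbers = ["PJ6001"] * 120 + ["BP130"] * 60 + ["DS36"] * 19 + ["QA76"] * 1

print(candidate_groups(toy_class_numbers))  # ['B', 'D', 'P']
```

Group Q holds 1 of 200 records (0.5%) and is therefore excluded, while B, D and P pass the threshold.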
The queries are more problematic. Every librarian probably knows that a user-stated information request might change significantly from the moment when a user recognizes an information need until it can be processed against a database. As the aim was to test the effectiveness of the Arabic language -- not the database or its capabilities, the catalogers' skills, or the intelligence of the users -- it was decided to select the query terms from the records themselves, using what might be called artificial queries. For this particular experiment there was no reason to construct complex search statements. Thus, it was decided to use single terms from the natural language statements of the records (title information) for the queries. The number of query terms could therefore also easily be decided: as many as possible.
Operationalizing the experiment
In carrying out the experiment described above, the first task was to investigate the subject area distribution of the Arabic records in RLIN, that is, to find out which of the LCCN groups contained more than 1% of the Arabic records. This was done by counting one year of accession. The groups proved to be B, D, H, K and P. Thus, each of the three samples should contain 220 records from each of these five groups. For the bilingual sample, however, this requirement could not be met, as RLIN at the time of constructing the samples contained fewer than 900 bilingual (Arabic/English) records.
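A stratified draw of 220 records from each of the five groups might look like the sketch below. The record layout (a dict with a `class_number` field), the function name `stratified_sample` and the use of a seeded simple random draw are assumptions made for illustration; the article does not describe how the samples were actually drawn from RLIN.

```python
import random
from collections import defaultdict

def stratified_sample(records, groups, per_group, seed=0):
    """Draw up to per_group records at random from each classification group."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    by_group = defaultdict(list)
    for rec in records:
        letter = rec["class_number"][:1].upper()
        if letter in groups:
            by_group[letter].append(rec)
    sample = []
    for letter in groups:
        pool = by_group[letter]
        # Take the whole pool when a group is smaller than the quota
        sample.extend(rng.sample(pool, min(per_group, len(pool))))
    return sample

# Hypothetical toy data: 300 records for each of seven first letters
toy_pool = [
    {"id": f"{letter}-{i}", "class_number": f"{letter}{i}"}
    for letter in "ABDHKPQ"
    for i in range(300)
]

sample_groups = ["B", "D", "H", "K", "P"]
toy_sample = stratified_sample(toy_pool, sample_groups, per_group=220)
print(len(toy_sample))  # 1100
```

When a group holds fewer records than the quota, as happened with the bilingual records, the sketch simply takes what is available, so the sample ends up smaller than 1,100.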
The next step was to collect the query terms. For the recall and precision analysis, it was necessary to ascertain which records were relevant to which queries. As the indexing terms reflect the indexer aboutness of the documents, and are thus related to the relevance judgment, query terms were identified by requiring an exact match between the query term and the indexing term.
For example, the title of a document might be "The Arabic tongue" (RLIN search term: title word, or TW), and the indexing terms (Library of Congress Subject Heading (LCSH)/RLIN search term: subject word, or SW) assigned by the indexer might be "Arabic Language". In this example, an exact match exists between Arabic in the title and in the indexing terms. Thus, a search for "Arabic" in the title will result in X (including the document in the example) records. Of these, Y records will be relevant. The relevant records will have the indexing term "Arabic" assigned to them, as the indexer must have considered the aboutness of the document to be "Arabic".
In order to collect all such exact matches between indexing terms and title words, all 3070 records of the experiment were scanned one by one. Whenever an exact match could be identified, the term was entered into a separate file containing a query ID, the TW and the SW and the various levels of truncations.
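The scan for exact matches was done by hand, but its core logic can be sketched as follows. The record layout, the function names and the query ID format are hypothetical, and the sketch applies no inclusion or exclusion rules: it treats every word shared by a title and a subject heading of the same record as a query term.

```python
import re

def field_words(text):
    """Return the set of lowercased words in a text field."""
    return set(re.findall(r"\w+", text.lower()))

def exact_matches(records):
    """Collect terms that occur both as title word and as subject word in one record."""
    terms = set()
    for rec in records:
        subject_words = set()
        for heading in rec["subjects"]:
            subject_words |= field_words(heading)
        terms |= field_words(rec["title"]) & subject_words
    # Assign a running query ID to each distinct term
    return [(f"Q{n:04d}", term) for n, term in enumerate(sorted(terms), start=1)]

# Hypothetical toy records
toy_catalog = [
    {"title": "The Arabic tongue", "subjects": ["Arabic language"]},
    {"title": "Islamic law in practice", "subjects": ["Islamic law"]},
]

for query_id, term in exact_matches(toy_catalog):
    print(query_id, term)
# Q0001 arabic
# Q0002 islamic
# Q0003 law
```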
For the English query terms, collecting exact matches was a simple process. Identifying exact matches between SW's and TW's proved easy. And as only one level of truncation was used, each TW had two versions: One without truncation (i.e., the form that the TW had in the record(s)) and one with a hard truncation (i.e., a truncation that stripped as much as possible of the ending of the word without distorting the meaning).
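The difference between the two versions of an English query term can be illustrated with a small sketch. It assumes that a truncated term matches every title word beginning with it; the function name, the toy titles and the stem "librar" are hypothetical and do not reproduce RLIN's search syntax.

```python
def title_word_search(records, term, truncated=False):
    """Return the ids of records whose title contains the term.

    With truncated=True the term is treated as a stem and matches any
    title word that begins with it.
    """
    term = term.lower()
    hits = []
    for rec_id, title in records:
        title_words = title.lower().split()
        if truncated:
            found = any(word.startswith(term) for word in title_words)
        else:
            found = term in title_words
        if found:
            hits.append(rec_id)
    return hits

# Hypothetical toy titles
toy_titles = [
    (1, "Libraries of Cairo"),
    (2, "The library catalog"),
    (3, "Librarians at work"),
    (4, "Liberal thought"),
]

print(title_word_search(toy_titles, "libraries"))               # [1]
print(title_word_search(toy_titles, "librar", truncated=True))  # [1, 2, 3]
```

The truncated version retrieves more records, which is why each version of a term has to be searched and measured separately.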
Identifying the Arabic query terms proved more complicated, primarily because LCSH is based on the English language. Therefore, it was necessary to establish a basic set of rules for inclusion and exclusion of query terms. In addition, truncation should give the same advantages in Arabic as it does in English, and testing the hypotheses made 11 different levels of truncation necessary for a large number of Arabic query terms.
The next step was to prepare the query terms for searching. Thus, a way of obtaining the values of a, b, and c had to be found. The figure below is an illustration of how a sample of records might look.
| Record | TW | SW |
| --- | --- | --- |
| a | Arabic | Arabic |
| b | | Arabic |
| c | Arabic | Arabic |
| d | Arabic | |
| e | | Arabic |
| f | Arabic | Arabic |
| g | | Arabic |
| h | | |
| i | Arabic | |
| j | | |
| k | | Arabic |
| l | Arabic | Arabic |
What needs to be identified is:
a = Retrieved and relevant records
b = Retrieved non-relevant records
c = Non-retrieved relevant records
In the example above, a search for "Arabic" as both TW and SW retrieves records a, c, f and l. The value of a for the recall and precision analysis is therefore 4.
As can be seen, some of the records contain the word "Arabic" as TW but not as SW. The indexer did not consider these records to have the indexer aboutness of "Arabic". In the illustration above these are records d and i. Thus, records d and i are to be considered retrieved non-relevant documents, and the value of b is 2.
Finally, for records b, e, g, and k, the indexer considered their aboutness to be "Arabic"; however, a search on "Arabic" as TW will not retrieve these records. These four records therefore make up the value of c, which in this example is equal to 4.
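Formulating this as search statements would lead to the following steps:
1. A search for the query term as TW combined with the same term as SW gives the value of a.
2. A search for the query term as TW, excluding records that have the term as SW, gives the value of b.
3. A search for the query term as SW, excluding records that have the term as TW, gives the value of c.
This means that the values of a, b and c for the recall and precision analysis in this experiment can be obtained by issuing these three types of search statements for each query term at each level of truncation. For the English query terms this means two versions of each query term. For Arabic this means 12 versions of each query term, in order to test the effect of prefix, infix and suffix.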
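The logic behind the three values can be checked with a short sketch that uses the twelve records of the illustration. The sets below are read off the example as described in the text (which records carry "Arabic" as TW and which as SW); the variable names are hypothetical, and set operations stand in here for the search statements, whose RLIN syntax is not reproduced.

```python
# Records of the illustration that carry "Arabic" as title word (TW)
tw_arabic = set("acdfil")
# Records of the illustration that carry "Arabic" as subject word (SW)
sw_arabic = set("abcefgkl")

retrieved_relevant = tw_arabic & sw_arabic      # value a: TW and SW
retrieved_nonrelevant = tw_arabic - sw_arabic   # value b: TW but not SW
missed_relevant = sw_arabic - tw_arabic         # value c: SW but not TW

a = len(retrieved_relevant)
b = len(retrieved_nonrelevant)
c = len(missed_relevant)

recall = a / (a + c)
precision = a / (a + b)

print(sorted(retrieved_relevant))     # ['a', 'c', 'f', 'l']
print(sorted(retrieved_nonrelevant))  # ['d', 'i']
print(sorted(missed_relevant))        # ['b', 'e', 'g', 'k']
print(recall, round(precision, 3))    # 0.5 0.667
```

For this single query term, recall is 4/8 and precision is 4/6. In the experiment the same calculation would be repeated for every query term and every level of truncation.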
Conclusion
I hope that the description of this experiment has stimulated interest or curiosity concerning the results. The experimental model contains approximately 3,000 records and 20,000 search statements. At this time, the samples have been constructed, and the data files containing the query terms have been built and are waiting to be processed. The results should show whether Arabic language and script in RLIN perform in a manner identical to English, and whether any of the three features (the use of prefix, the use of infix, and the use of suffix) affects retrieval effectiveness more than another at any level. They should therefore also indicate how to improve retrieval effectiveness for Arabic script and language records in catalog databases like RLIN. The results are expected to be ready for publication in early spring 1998. Until then, the Arabic books at Odense University Library still will not know for sure if anybody can find them.
1 This article is based on a presentation given at the RLIN user group meeting held in Providence on the 21st of November 1996. The article itself is not intended as a scientific article, but merely as a presentation of an ongoing research project.
2 They must have the same record structure, the same number of records and the same retrieval facilities.