[XML4Lib] simple sru client [spelling, etc.]

Eric Lease Morgan emorgan at nd.edu
Tue Aug 9 11:24:32 EDT 2005


On Aug 9, 2005, at 9:10 AM, Walter Lewis wrote:

>> Specifically she has written a Search/Retrieve via URL (SRU)
>> client and server. The server supports spelling suggestions,
>> synonym suggestions, and keyword suggestions based on statistical
>> analysis.
>>
>>  http://mylibrary.ockham.org/simple/
>
> Fascinating.  So what's the spelling engine?  Is it constraining  
> results based on actual strings in the swish-e index or just  
> working through an external list?
>
> Is the server side dependent on the MyLibrary data structure?

Thank you for your interest.


Spelling

The spelling technique is one I learned from Bill Moseley of swish-e
fame. This is how it works:

   * Index your data.

   * Dump all the words in your index to a white-space
     delimited file. (Using swish-e you use the -T switch. In
     Plucene you need to write a report against an intermediary
     file produced by a script in the Plucene distribution.)

   * Feed the dump to a dictionary program. We use ASPELL.

   * In your server, parse the query, send each term in the
     query to your dictionary requesting alternative spellings.
     In other words run spell-check against the query.

   * Return the suggested spellings to the client.

What is really great about this technique is that the spell checker  
will only recommend words that are in the dictionary, and the  
dictionary is only built from words in your index. Consequently,  
every single suggested word should have at least one record  
associated with it.
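
The steps above can be sketched in a few lines of Python. This is
only an illustration, not the code running on the server: the
standard library's difflib stands in for Aspell, and the function
names build_dictionary and suggest_spellings are made up for the
example. The word dump is assumed to be the white-space delimited
text described above.

```python
import difflib


def build_dictionary(index_dump):
    """Turn a white-space delimited dump of index words into a word set."""
    return {word.lower() for word in index_dump.split()}


def suggest_spellings(query, dictionary, limit=3):
    """Return {query term: [suggestions]} for terms missing from the index.

    difflib is a stand-in for Aspell here (an assumption for this sketch).
    Suggestions are drawn only from the dictionary, so each one is a word
    that occurs somewhere in the index.
    """
    candidates = sorted(dictionary)
    suggestions = {}
    for term in query.lower().split():
        if term in dictionary:
            continue  # spelled the way the index spells it; nothing to suggest
        matches = difflib.get_close_matches(term, candidates, n=limit, cutoff=0.7)
        if matches:
            suggestions[term] = matches
    return suggestions


if __name__ == "__main__":
    words = build_dictionary("library catalog metadata harvest")
    print(suggest_spellings("libary catalog", words))
```

Because the candidate list is nothing but the index dump, the
sketch keeps the property described above: it cannot suggest a word
that is absent from the index.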


Thesaurus

The thesaurus works in a similar way:

   * The query is parsed.

   * Words from the query are applied to a thesaurus, in this
     case WordNet.

   * Synonyms are then returned to the client.
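
As a rough sketch of those three steps, assuming a small hand-made
synonym table in place of WordNet (the THESAURUS data and the
suggest_synonyms name are hypothetical, chosen only for this
example):

```python
# Hypothetical stand-in for WordNet: a tiny hand-made synonym table.
THESAURUS = {
    "book": ["volume", "tome"],
    "library": ["athenaeum"],
}


def suggest_synonyms(query, thesaurus=THESAURUS):
    """Parse the query and return {term: [synonyms]} for known terms."""
    suggestions = {}
    for term in query.lower().split():
        synonyms = thesaurus.get(term)
        if synonyms:
            suggestions[term] = list(synonyms)
    return suggestions


if __name__ == "__main__":
    print(suggest_synonyms("Library science"))
```

Unlike the spelling suggestions, nothing here ties a synonym to the
index, so a suggested synonym may return no records.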

In technical literature, such as the literature of theses and  
dissertations, this technique is not optimal because the WordNet  
thesaurus is more or less limited to general English, but I can't  
wait to apply the thesaurus against the full text of my Alex  
Catalogue of Electronic Texts. Whenever I get around to this I expect  
the full text searching to be quite fun.


Statistical analysis

The statistically recommended search terms of the full-blown SRU
client are based on a logarithmic calculation.

The calculation looks at each word in our index and compares the
number of times the word is found in a particular document with the
number of times the word is found in the entire index. We then take
the five most significant words from each record and save them to the
underlying (MyLibrary) database. When records are displayed so are
the statistically calculated words. Finally, we create a canned
search representing the Boolean intersection of the calculated terms.
When selected, this search will return at least one record (the one
the terms came from), and hopefully others. In this way we provide a
Find More Like This One feature against the index.
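
A minimal sketch of this kind of calculation follows. The exact
formula is not given above, so the weighting used here (count in the
document times the log of total index words over the word's count in
the index) is an assumption in the spirit of TF-IDF. The names
significant_terms and canned_search, and the "AND" operator
spelling, are likewise made up for the example.

```python
import math
from collections import Counter


def significant_terms(docs, top_n=5):
    """Return {record key: top_n highest-scoring words} for each record.

    docs maps a record key to its text. The score is an assumed formula:
    (count in document) * log(total words in index / count in index).
    """
    doc_counts = {key: Counter(text.lower().split()) for key, text in docs.items()}
    index_counts = Counter()
    for counts in doc_counts.values():
        index_counts.update(counts)
    total = sum(index_counts.values())

    result = {}
    for key, counts in doc_counts.items():
        scores = {w: c * math.log(total / index_counts[w]) for w, c in counts.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[key] = ranked[:top_n]
    return result


def canned_search(terms):
    """Build a query that is the Boolean intersection of the terms."""
    return " AND ".join(terms)


if __name__ == "__main__":
    docs = {"a": "cats cats dogs the", "b": "the the birds"}
    for key, terms in significant_terms(docs, top_n=2).items():
        print(key, canned_search(terms))
```

Since every term in a canned search came from one record, that
record always satisfies the intersection, which is why the search
returns at least one hit.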


MyLibrary

In this particular SRU implementation, MyLibrary is used as a cache.

We harvest OAI-accessible data. The data is organized using Dublin  
Core, as defined by the OAI protocol. MyLibrary includes Perl methods
for saving metadata, and it too uses Dublin Core to describe  
information resources. Therefore, there is a possible one-to-one  
correspondence between OAI records and MyLibrary records. As the data  
gets saved to the underlying database, each record is associated with  
one or more facet/term combinations such as: Formats/Articles,  
Subjects/Library science, or Audiences/Teachers. Once the data is  
saved to MyLibrary, we write reports against the database for  
indexing; we create reports based on the facet/term combinations.  
Searches applied against the resulting index simply return keys.  
These keys are then used to extract associated metadata from the  
MyLibrary database, using the MyLibrary API, of course.
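
The flow can be sketched as follows. This is not the MyLibrary API
(which is Perl): an in-memory SQLite database stands in for the
MyLibrary cache, a plain dictionary stands in for the swish-e index,
and every table, column, and function name is hypothetical.

```python
import sqlite3


def build_cache(records):
    """Save harvested records and their facet/term pairs to a database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, title TEXT, creator TEXT)")
    db.execute("CREATE TABLE facet_terms (resource_id INTEGER, facet TEXT, term TEXT)")
    for rec in records:
        db.execute("INSERT INTO resources VALUES (?, ?, ?)",
                   (rec["id"], rec["title"], rec["creator"]))
        db.executemany("INSERT INTO facet_terms VALUES (?, ?, ?)",
                       [(rec["id"], facet, term) for facet, term in rec["facets"]])
    return db


def build_index(db, facet, term):
    """Report against one facet/term combination and index the title words."""
    rows = db.execute(
        "SELECT r.id, r.title FROM resources r "
        "JOIN facet_terms f ON f.resource_id = r.id "
        "WHERE f.facet = ? AND f.term = ?", (facet, term))
    index = {}
    for key, title in rows:
        for word in title.lower().split():
            index.setdefault(word, set()).add(key)
    return index


def search(index, word):
    """Searching the index returns only keys."""
    return sorted(index.get(word.lower(), ()))


def fetch(db, keys):
    """Use the keys to pull the associated metadata back out of the cache."""
    found = []
    for key in keys:
        row = db.execute("SELECT title, creator FROM resources WHERE id = ?",
                         (key,)).fetchone()
        if row:
            found.append({"id": key, "title": row[0], "creator": row[1]})
    return found


if __name__ == "__main__":
    db = build_cache([
        {"id": 1, "title": "Cataloging basics", "creator": "A. Author",
         "facets": [("Formats", "Articles")]},
        {"id": 2, "title": "Teaching cataloging", "creator": "B. Writer",
         "facets": [("Audiences", "Teachers")]},
    ])
    index = build_index(db, "Formats", "Articles")
    print(fetch(db, search(index, "cataloging")))
```

Keeping only keys in the index means the index stays small and the
database remains the single place where metadata is stored.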

Fun!

-- 
Eric Lease Morgan
University Libraries of Notre Dame







