[XML4Lib] simple sru client [spelling, etc.]
Eric Lease Morgan
emorgan at nd.edu
Tue Aug 9 11:24:32 EDT 2005
On Aug 9, 2005, at 9:10 AM, Walter Lewis wrote:
>> Specifically she has written a Search/Retrieve via URL (SRU)
>> client and server. The server supports spelling suggestions,
>> synonym suggestions, and keyword suggestions based on statistical
>> analysis.
>>
>> http://mylibrary.ockham.org/simple/
>
> Fascinating. So what's the spelling engine? Is it constraining
> results based on actual strings in the swish-e index or just
> working through an external list?
>
> Is the server side dependent on the MyLibrary data structure?
Thank you for your interest.
Spelling
The spelling technique is one I learned from Bill Moseley of swish-e
fame. This is how it works:
* Index your data.
* Dump all the words in your index to a white-space
delimited file. (Using swish-e you use the -T switch. In
Plucene you need to write a report against an intermediary
file produced by a script in the Plucene distribution.)
* Feed the dump to a dictionary program. We use ASPELL.
* In your server, parse the query, send each term in the
query to your dictionary requesting alternative spellings.
In other words run spell-check against the query.
* Return the suggested spellings back to the client.
What is really great about this technique is that the spell checker
will only recommend words that are in the dictionary, and the
dictionary is only built from words in your index. Consequently,
every single suggested word should have at least one record
associated with it.
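The steps above can be sketched roughly as follows. This is only an illustration, not our server code: it uses Python's standard difflib module as a stand-in for ASPELL, and the function names, the sample dump, and the 0.7 similarity cutoff are all assumptions made for the example.

```python
import difflib
import re


def build_dictionary(index_dump):
    """Turn a white-space delimited word dump into a sorted, de-duplicated word list."""
    return sorted(set(index_dump.lower().split()))


def suggest_spellings(query, dictionary, limit=3):
    """Map each query term that is missing from the dictionary to close matches.

    difflib stands in for a real spell checker here (assumption); the key
    property is that suggestions can only come from the index vocabulary.
    """
    known = set(dictionary)
    suggestions = {}
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        if term in known:
            continue  # the term is already in the index, nothing to suggest
        matches = difflib.get_close_matches(term, dictionary, n=limit, cutoff=0.7)
        if matches:
            suggestions[term] = matches
    return suggestions


# Hypothetical dump of index words, as a word-dump switch might produce.
dump = "library catalog catalogue metadata harvest index"
words = build_dictionary(dump)
result = suggest_spellings("libary catalg", words)
print(result)
```

Because the word list is built only from the index dump, every suggestion is a word that occurs in at least one record.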
Thesaurus
The thesaurus works in a similar way:
* The query is parsed.
* Words from the query are applied to a thesaurus, in this
case WordNet.
* Synonyms are then returned to the client.
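A minimal sketch of those three steps, assuming a tiny in-memory dictionary in place of WordNet. The THESAURUS contents and the suggest_synonyms name are hypothetical and exist only for illustration.

```python
# Hypothetical miniature thesaurus standing in for WordNet (assumption).
THESAURUS = {
    "book": ["volume", "tome"],
    "teacher": ["instructor", "educator"],
}


def suggest_synonyms(query, thesaurus):
    """Parse the query into terms and look each one up in the thesaurus."""
    suggestions = {}
    for term in query.lower().split():
        synonyms = thesaurus.get(term, [])
        if synonyms:
            suggestions[term] = synonyms
    return suggestions


synonyms = suggest_synonyms("teacher handbook", THESAURUS)
print(synonyms)
```

Unlike the spelling suggestions, nothing here ties the synonyms to the index, so a suggested synonym may match no records at all.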
In technical literature, such as the literature of theses and
dissertations, this technique is not optimal because the WordNet
thesaurus is more or less limited to general English, but I can't
wait to apply the thesaurus against the full text of my Alex
Catalogue of Electronic Texts. Whenever I get around to this I expect
the full text searching to be quite fun.
Statistical analysis
The statistically recommended search terms of the full-blown SRU
client are based on a logarithmic calculation.
For each word in our index, the calculation compares the number of
times the word is found in a particular document with the number of
times the word is found in the entire index. We then take the most
significant five words from each record and save them to the
underlying (MyLibrary) database. When records are displayed so are
the statistically calculated words. Finally, we create a canned
search representing the Boolean intersection of each of the
calculated terms. When selected, this search will return at least one
record, but hopefully others. In this way we provide a Find More Like
This One feature against the index.
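Here is a rough sketch of the idea. The exact formula is not spelled out above, so the scoring used here (count in the document multiplied by the logarithm of total index words over the word's count in the index) is an assumption, as are the function names and the sample records.

```python
import math
from collections import Counter


def significant_words(doc_words, index_counts, top_n=5):
    """Rank a document's words by an assumed log-weighted score and keep the top few."""
    total = sum(index_counts.values())
    doc_counts = Counter(doc_words)
    scores = {
        word: count * math.log(total / index_counts[word])
        for word, count in doc_counts.items()
        if index_counts.get(word)
    }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


def more_like_this_query(words):
    """Build a canned search as the Boolean intersection of the words."""
    return " and ".join(words)


# Hypothetical records; in practice the words come from the index.
docs = {
    "rec1": "whale ship whale sea captain whale voyage harpoon".split(),
    "rec2": "library catalog sea books library".split(),
    "rec3": "sea ship sea voyage".split(),
}
index_counts = Counter(word for words in docs.values() for word in words)

top = significant_words(docs["rec1"], index_counts)
query = more_like_this_query(top)
print(top)
print(query)
```

Since all of the terms come from a single record, the intersection is guaranteed to match at least that record.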
MyLibrary
In this particular SRU implementation, MyLibrary is used as a cache.
We harvest OAI-accessible data. The data is organized using Dublin
Core, as defined by the OAI protocol. MyLibrary includes Perl methods
for saving metadata, and it too uses Dublin Core to describe
information resources. Therefore, there is a possible one-to-one
correspondence between OAI records and MyLibrary records. As the data
gets saved to the underlying database, each record is associated with
one or more facet/term combinations such as: Formats/Articles,
Subjects/Library science, or Audiences/Teachers. Once the data is
saved to MyLibrary, we write reports against the database for
indexing; we create reports based on the facet/term combinations.
Searches applied against the resulting index simply return keys.
These keys are then used to extract associated metadata from the
MyLibrary database, using the MyLibrary API, of course.
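The search-then-lookup flow might be sketched like this. A plain dictionary stands in for the MyLibrary database, and none of the names below are part of the MyLibrary API; the records, keys, and functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the MyLibrary database: key -> metadata and facet/term pairs.
RECORDS = {
    101: {
        "title": "Indexing harvested metadata",
        "facets": [("Formats", "Articles"), ("Subjects", "Library science")],
    },
    102: {
        "title": "Teaching with digital libraries",
        "facets": [("Audiences", "Teachers")],
    },
}


def build_index(records):
    """Build a word -> set-of-keys index from each record's title."""
    index = defaultdict(set)
    for key, record in records.items():
        for word in record["title"].lower().split():
            index[word].add(key)
    return index


def search(index, query):
    """Return only the keys of records containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


def fetch_metadata(records, keys):
    """Use the keys from the index to pull full records from the cache."""
    return [records[key] for key in keys]


index = build_index(RECORDS)
keys = search(index, "harvested metadata")
found = fetch_metadata(RECORDS, keys)
print(keys, [record["title"] for record in found])
```

Keeping only keys in the index means the metadata lives in one place, and the index can be rebuilt from the database at any time.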
Fun!
--
Eric Lease Morgan
University Libraries of Notre Dame