[XML4Lib] Web search engines: full-text searching of our digital
holdings
Eric Lease Morgan
emorgan at nd.edu
Wed Apr 4 07:45:53 EDT 2007
On Apr 3, 2007, at 9:53 PM, Deridder, Jody L wrote:
> For those of you who have already traveled this path, would you
> be willing to tell me what you did, and the pros, cons and
> outcome of your choices?
>
> The primary methods I'm considering at the moment are creating
> sitemaps versus making static copies of each
> dynamically-generated web page of interest. If you know of other
> methods besides these, I'd be grateful if you would share them
> with me.
Yep, creating a "sitemap" of my content is exactly what I did.
I am maintaining a list of just less than 14,000 electronic texts --
ebooks. Each item is "cataloged" with an author name, title, and a
small set of statistically significant keywords. All of this
information, sans the full-text itself, is managed in a relational
database. To provide access to the collection I created three
browsable lists (author, title, and file system) plus a full-text
index. Each access mechanism and individual record includes links to
related content through the remaining three access mechanisms.
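
As a rough illustration of the browsable lists described above, here is a minimal Python sketch. It is not the code behind my collection; the table name `texts`, its columns, the sample rows, and the `/texts/...` paths are all made up for the example, and SQLite merely stands in for whatever relational database you use.

```python
import sqlite3
from html import escape

# Allowed sort orders. Only these fixed strings ever reach the SQL text.
ORDERINGS = {"author": "author, title", "title": "title, author"}


def demo_connection():
    """Create an in-memory database with a hypothetical 'texts' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE texts (author TEXT, title TEXT, path TEXT)")
    conn.executemany(
        "INSERT INTO texts VALUES (?, ?, ?)",
        [
            ("Twain, Mark", "Adventures of Huckleberry Finn", "/texts/twain-huck.html"),
            ("Austen, Jane", "Emma", "/texts/austen-emma.html"),
        ],
    )
    return conn


def build_browsable_list(conn, order_by):
    """Return an HTML list linking to every text, sorted by author or title."""
    if order_by not in ORDERINGS:
        raise ValueError("order_by must be 'author' or 'title'")
    rows = conn.execute(
        f"SELECT author, title, path FROM texts ORDER BY {ORDERINGS[order_by]}"
    ).fetchall()
    items = [
        f'<li><a href="{escape(path)}">{escape(author)}: {escape(title)}</a></li>'
        for author, title, path in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Calling `build_browsable_list(demo_connection(), "author")` and saving the result as a plain HTML page gives a crawler an ordinary set of links to follow, which is all a "sitemap" in this sense needs to be.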
The system gets about 3,200 hits/day, but only a small (tiny) fraction
of the hits are generated from search results against my local
full-text index. Instead, most hits come from Google, which has also
indexed the full text. See:
http://infomotions.com/alex/
I suggest you save versions of your EAD and TEI documents as static
HTML/XML files on your file system and allow Google et al. to index
them as well as your deep web. Disk space is cheap and the resulting
URLs are more human-readable. Moreover, access to the content will
not be limited through your search engine/CGI indexer. If your search
engine/CGI indexer breaks, then you will still have access to your
content. Yet another advantage of this approach is preservation. You
can use backup mechanisms to archive your digital content. This is much
better than archiving a database or index that dynamically generates
your content. Static files are more stable than dynamically created
ones.
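
To make the static-file suggestion a bit more concrete, here is a hedged Python sketch of one way it could be done. The record layout (`author`, `title`, `html`), the `slugify` helper, and the `example.org` base URL are assumptions for illustration only. The second function writes a list of the resulting URLs in the sitemaps.org 0.9 XML format, which is one common way to tell crawlers where the files live.

```python
import re
import tempfile
from pathlib import Path
from xml.etree import ElementTree as ET


def slugify(text):
    """Lowercase the text and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def export_static(records, out_dir):
    """Write one HTML file per record and return the relative paths written."""
    out_dir = Path(out_dir)
    written = []
    for rec in records:
        # Human-readable location: <author>/<title>.html
        rel = Path(slugify(rec["author"])) / (slugify(rec["title"]) + ".html")
        target = out_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(rec["html"], encoding="utf-8")
        written.append(rel.as_posix())
    return written


def build_sitemap(base_url, rel_paths):
    """Return a sitemaps.org-style XML document listing each static file."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for rel in rel_paths:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + rel
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record; in practice the HTML would come from your
    # EAD/TEI transformation.
    sample = [{"author": "Twain, Mark", "title": "Adventures of Huckleberry Finn",
               "html": "<html><body>...</body></html>"}]
    with tempfile.TemporaryDirectory() as tmp:
        paths = export_static(sample, tmp)
        print(build_sitemap("http://example.org/texts/", paths))
```

Because the output is just a directory tree of files, the same tree can be handed to any ordinary backup tool.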
--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
University Libraries of Notre Dame
(574) 631-8604