[XML4Lib] Web search engines: full-text searching of our digital holdings

Eric Lease Morgan emorgan at nd.edu
Wed Apr 4 07:45:53 EDT 2007


On Apr 3, 2007, at 9:53 PM, Deridder, Jody L wrote:

> For those of you who have already traveled this path, would you
> be willing to tell me what you did, and the pros, cons and
> outcome of your choices?
>
> The primary methods I'm considering at the moment are creating
> sitemaps versus making static copies of each
> dynamically-generated web page of interest.  If you know of other
> methods besides these, I'd be grateful if you would share them
> with me.


Yep, creating a "sitemap" of my content is exactly what I did.

I am maintaining a list of just under 14,000 electronic texts --
ebooks. Each item is "cataloged" with an author name, a title, and a
small set of statistically significant keywords. All of this
information, sans the full text itself, is managed in a relational
database. To provide access to the collection I created three
browsable lists (author, title, and file system) plus a full-text
index. Each access mechanism and individual record includes links to
related content through the remaining three access mechanisms.
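
For illustration only, here is a minimal Python sketch of how one such
browsable list could be generated from catalog records. It is not the
code behind my site; the record fields ("author", "title", "url") and
the function name build_author_index are assumptions made up for the
example, and the records would in practice come from the database.

```python
from collections import defaultdict
from html import escape


def build_author_index(records):
    """Return an HTML page listing every title under its author.

    `records` is an iterable of dicts with the hypothetical keys
    "author", "title", and "url".
    """
    # Group the records by author name.
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec["author"]].append(rec)

    lines = ["<html><body><h1>Browse by author</h1>"]
    for author in sorted(by_author):
        lines.append(f"<h2>{escape(author)}</h2><ul>")
        # Within one author, list the titles alphabetically.
        for rec in sorted(by_author[author], key=lambda r: r["title"]):
            href = escape(rec["url"], quote=True)
            lines.append(f'<li><a href="{href}">{escape(rec["title"])}</a></li>')
        lines.append("</ul>")
    lines.append("</body></html>")
    return "\n".join(lines)
```

Because the page is nothing but plain links, a crawler that finds it
can follow every link to the individual texts.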

The system gets about 3,200 hits/day, but only a small (tiny) fraction
of the hits are generated from search results against my local
full-text index. Instead, most hits come from Google, which has also
indexed the full text. See:

   http://infomotions.com/alex/

I suggest you save versions of your EAD and TEI documents as static
HTML/XML files on your file system and allow Google et al. to index
them as well as your deep web. Disk space is cheap and the resulting
URLs are more human readable. Moreover, access to the content will
not be limited to your search engine/CGI indexer. If your search
engine/CGI indexer breaks, then you will still have access to your
content. Yet another advantage of this approach is preservation. You
can use ordinary backup mechanisms to archive your digital content,
which is much better than archiving a database or index that
dynamically generates your content. Static files are more stable than
dynamically created ones.
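
As a rough sketch of the idea, and not a description of any particular
system, the following Python writes one static HTML file per record
into a directory tree with readable names. The record fields ("author",
"title", "body") and the helpers slugify and export_static are
hypothetical; in your case the body would be whatever HTML you already
generate from an EAD or TEI file.

```python
import re
from html import escape
from pathlib import Path


def slugify(text):
    """Turn arbitrary text into a lowercase, hyphenated file name part."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "untitled"


def export_static(records, out_dir):
    """Write one HTML file per record; return the paths written.

    Files land at out_dir/<author>/<title>.html. The "body" value is
    written as-is, so it is assumed to be HTML already.
    """
    written = []
    for rec in records:
        target = (Path(out_dir) / slugify(rec["author"])
                  / (slugify(rec["title"]) + ".html"))
        target.parent.mkdir(parents=True, exist_ok=True)
        page = (
            "<html><head><title>" + escape(rec["title"]) + "</title></head>"
            "<body>" + rec["body"] + "</body></html>"
        )
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

Run from a scheduled job whenever the records change, something like
this leaves a plain directory of files that a Web server can serve
directly and that any backup tool can copy.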

-- 
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
University Libraries of Notre Dame

(574) 631-8604



