[XML4Lib] mods: the new marc?
Eric Lease Morgan
emorgan at nd.edu
Sun Dec 16 16:34:17 EST 2007
Is MODS the new MARC?
As you may or may not know, I advocate "catalogs" include content
beyond the things a library owns or licenses. Moreover, I advocate
libraries take a more active role in collecting and providing
services against information resources no matter where they reside on
a network. Don't get me wrong, I don't advocate "cataloging" the
entire Internet, but I do advocate actively collecting materials
apropos to the needs of a particular library's patrons.
In an effort to demonstrate such an idea I would like to collect and
provide services against a number of different types of data/
information freely available on the 'Net. Some of these things
include, but are not limited to, the following, listed in no priority order:
electronic books/texts (Project Gutenberg, University of Michigan
MBooks, Open Content Alliance, etc.), electronic journals from DOAJ,
electronic journal articles from DOAJ Articles, pre-prints and post-
prints from various OAI repositories, mailing list messages, selected
blog postings, theses & dissertations from NDLTD, etc.
Each of the things above can be systematically harvested through the
use of OAI, simple Web crawling, or the retrieval of data sets. Once
harvested, the data could be stored in a database and/or indexed,
providing the means for discovery and services. The storage of this
content in a database raises questions regarding tables, records, and
fields. What might they be? Similarly, unless the index is going to
be 100% free text, the harvested content/metadata will need to be mapped
to fields. Again, what fields?
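To make the mapping question concrete, here is a minimal sketch in Python,
assuming the harvested content arrives as OAI-PMH ListRecords responses
carrying unqualified Dublin Core. The sample response, the identifier
oai:example.org:1, and the FIELD_MAP names are made up for illustration;
they are not a proposed schema.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up ListRecords response standing in for a real harvest.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2007-12-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Pre-print</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
          <dc:date>2007</dc:date>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

# Hypothetical mapping from local field names to Dublin Core elements.
FIELD_MAP = {
    "title": "dc:title",
    "creator": "dc:creator",
    "date": "dc:date",
    "type": "dc:type",
}


def map_records(xml_text):
    """Turn a ListRecords response into a list of field dictionaries."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        row = {
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            )
        }
        # Every field is kept as a list because Dublin Core elements repeat.
        for field, path in FIELD_MAP.items():
            row[field] = [
                el.text.strip()
                for el in record.iterfind(".//" + path, NS)
                if el.text
            ]
        rows.append(row)
    return rows


rows = map_records(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

Anything the map does not name is silently dropped, which is one reason to
keep the raw harvest around as well.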
I'm not so naive as to believe there is such a thing as the perfect
database structure for this "catalog", nor do I believe free text
indexing is the answer either. So, what sort of data structure should
I use? Not MARC. MODS? Some incarnation of RDF?
If I go this route I see the following plan:
0. Articulate a collection policy.
1. Acquire/harvest content in its raw form.
2. Convert the raw content into MODS, RDF, or
something else.
3. Save/archive the raw data because things get lost
in translation.
4. Save the MODS or RDF to a (XML) database.
5. Parse the MODS or RDF and save it to a
(relational) database.
6. Run scripts against the database to create things
like browsable lists, create new relationships
between items, or simply enhance the records.
7. Index the MODS or RDF, or write a report against
the database intended for indexing.
8. Provide access to the index (via SRU, OpenSearch,
or Z39.50).
9. Provide services against the search results such
as Get It, Review It, Buy It, Bookmark It, Compare It
To Other things, etc.
10. Go to Step #1.
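Steps 2, 4, 5, and 6 of the plan above might be sketched like this in
Python, assuming MODS is the canonical format and SQLite stands in for the
relational database. The table layout, the to_mods and load functions, and
the sample row are hypothetical; keeping the serialized MODS in a text
column is only a stand-in for a real XML database, and only a handful of
MODS elements are used.

```python
import sqlite3
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Hypothetical table layout; not a recommendation.
SCHEMA = """
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    identifier TEXT UNIQUE,
    title TEXT,
    date_issued TEXT,
    mods TEXT
);
CREATE TABLE names (
    item_id INTEGER REFERENCES items(id),
    name TEXT
);
"""


def q(tag):
    """Return a tag name qualified with the MODS namespace."""
    return "{%s}%s" % (MODS_NS, tag)


def to_mods(row):
    """Step 2: convert one harvested row (a plain dict) into MODS XML."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = row["title"]
    for creator in row.get("creators", []):
        name = ET.SubElement(mods, q("name"))
        ET.SubElement(name, q("namePart")).text = creator
    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("dateIssued")).text = row["date"]
    ET.SubElement(mods, q("identifier")).text = row["identifier"]
    return ET.tostring(mods, encoding="unicode")


def load(conn, mods_xml):
    """Steps 4 and 5: keep the MODS text and parse it into tables."""
    ns = {"m": MODS_NS}
    mods = ET.fromstring(mods_xml)
    cur = conn.execute(
        "INSERT INTO items (identifier, title, date_issued, mods) "
        "VALUES (?, ?, ?, ?)",
        (
            mods.findtext("m:identifier", namespaces=ns),
            mods.findtext("m:titleInfo/m:title", namespaces=ns),
            mods.findtext("m:originInfo/m:dateIssued", namespaces=ns),
            mods_xml,
        ),
    )
    item_id = cur.lastrowid
    for part in mods.iterfind("m:name/m:namePart", ns):
        conn.execute(
            "INSERT INTO names (item_id, name) VALUES (?, ?)",
            (item_id, part.text),
        )
    return item_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A made-up harvested row.
row = {
    "identifier": "oai:example.org:1",
    "title": "An Example Pre-print",
    "creators": ["Roe, Richard", "Doe, Jane"],
    "date": "2007",
}
mods_xml = to_mods(row)
load(conn, mods_xml)

# Step 6: a browsable list of names, built by a query against the tables.
authors = [
    r[0] for r in conn.execute("SELECT DISTINCT name FROM names ORDER BY name")
]
print(authors)
```

The same shape would apply if RDF were chosen instead; only to_mods and the
parsing in load would change.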
Assuming there is no single database structure for such an idea, what
flavor of XML would you use as your canonical data format? MODS? RDF?
Something else?
--
Eric Lease Morgan
University Libraries of Notre Dame