[XML4Lib] seeking good HTML to XML converter

Jon Gorman jonathan.gorman at gmail.com
Wed Aug 26 08:53:26 EDT 2009


2009/8/26 John Fitzgibbon <jfitzgibbon at galwaylibrary.ie>:
> Hi,
>
>
>
> I am seeking a converter that will automatically convert a HTML file to XML
> (not XHTML). If it can operate in batch mode (convert a batch of HTML files
> to XML) that would be ideal.
>

If you don't want XHTML, what XML standard do you want? XHTML is HTML
following the XML standard.  We need more information.  What type of
XML are you picturing?  XML is a "meta" standard, a standard used in
creating other standards.  The only library analogy I can imagine
would be "I need someone to catalog this book and use MARC.  But don't
use AACR2, use a standard for constructing catalogs."

My first general thrust on this would be to run the HTML through tidy
or tagsoup to clean it up.  Then the question becomes whether you are
trying to extract certain information that

Way back in the day I'd say OpenJade or some other DSSSL-type tool
might be an option, but HTML is rarely anywhere near valid enough to
make SGML tools really worth the learning curve.  Thankfully better
HTML parsers exist now.

If you want to extract certain information from HTML files and put it
into a certain XML format, you'll probably have to do some
programming.  I'm not aware of any tool that will do that
automatically, although perhaps Oxygen or similar might have something
that might work.  I'd be interested if there are any tools out there
that do this; it would be useful.
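
To give a rough idea of what "some programming" could look like, here
is a minimal sketch in Python using only the standard library.  It is
an illustration under assumptions, not a recommendation of a specific
tool: the output vocabulary (record, title, link) is made up for the
example, and the function names html_to_record and convert_batch are
hypothetical.  It pulls the page title and the links out of sloppy
HTML and writes them as well-formed XML, one file per input file.

```python
from html.parser import HTMLParser
from pathlib import Path
import xml.etree.ElementTree as ET


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> from lenient HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.links = []          # list of (href, link text) pairs
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._text = []          # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def html_to_record(html_text):
    """Return an XML string in a made-up 'record' vocabulary."""
    parser = TitleAndLinks()
    parser.feed(html_text)
    parser.close()
    record = ET.Element("record")
    ET.SubElement(record, "title").text = parser.title.strip()
    for href, text in parser.links:
        link = ET.SubElement(record, "link", href=href)
        link.text = text
    # ElementTree escapes text and attributes, so the output is well-formed.
    return ET.tostring(record, encoding="unicode")


def convert_batch(src_dir, dst_dir):
    """Convert every *.html file in src_dir to a .xml file in dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.html")):
        # Assumption: files are roughly UTF-8; bad bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        out = dst / (path.stem + ".xml")
        out.write_text(html_to_record(text), encoding="utf-8")
```

Something like convert_batch("html_in", "xml_out") would cover the
batch case.  The parser here tolerates unclosed tags, but for really
messy input it may still be worth running tidy or tagsoup first.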

Jon Gorman



