[XML4Lib] seeking good HTML to XML converter

Walker, David dwalker at calstate.edu
Wed Aug 26 09:09:41 EDT 2009


PHP (and, I suspect, some other languages) has a DOM class, DOMDocument, that can accept non-well-formed HTML via its loadHTML() method.  I assume it 'corrects' the HTML the way browsers do?  See:

  http://us3.php.net/manual/en/domdocument.loadhtml.php

Anyway, that would provide a pretty painless path to an XSLT transformation, which, I agree with Jon, is probably the best way to go:

   http://us3.php.net/manual/en/book.xsl.php
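
For anyone not on PHP, the same "read sloppy HTML, write well-formed XML" idea can be roughed out with Python's standard library alone. This is only a sketch under stated assumptions: SoupToXml, html_to_xml and the synthetic <document> wrapper element are names invented for illustration, and the recovery rules are far cruder than a browser's (an unclosed element is simply closed when its parent closes, and a new <p> does not end the previous one).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SoupToXml(HTMLParser):
    """Hypothetical helper: builds an ElementTree from messy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Synthetic wrapper so the output always has exactly one root.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # A bare attribute such as <input disabled> gets its name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # a stray end tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text following a child element is that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    """Return a well-formed XML string for possibly malformed HTML."""
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<p>One<br>Two<p>Three <b>bold &amp; <i>nested</b>"))
```

Comments and the DOCTYPE are dropped, and nothing here checks that tag or attribute names are legal XML names, so real-world input may still need one of the dedicated tools. Calling html_to_xml() on each file in a directory would cover the batch case.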

--Dave

==================
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu
________________________________________
From: xml4lib-bounces at webjunction.org [xml4lib-bounces at webjunction.org] On Behalf Of Jon [jpstroop at gmail.com]
Sent: Wednesday, August 26, 2009 6:00 AM
To: xml4lib at webjunction.org
Subject: Re: [XML4Lib] seeking good HTML to XML converter

John,
As far as I know, unless you can guarantee that your HTML document(s)
are well-formed XML, you'll need to start with getting them into XHTML
format.  There are at least two tools available for doing this: Tidy[1]
and TagSoup[2]. I've only really used the former; it does the best it
can, if that makes sense.

Once you know you have well-formed documents you can use XSLT (1.0 or
2.0) to reformat your (X)HTML to another XML format.
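
To make the second step concrete, here is a rough sketch of reshaping well-formed XHTML into a different XML vocabulary. It is not XSLT: Python's standard library has no XSLT processor, so ElementTree stands in for the stylesheet. The target element names (record, heading, para) and the function name xhtml_to_record are made up for illustration, and the input is assumed to be already well-formed with no undeclared entities such as &nbsp;.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def xhtml_to_record(xhtml):
    """Map headings and paragraphs of well-formed XHTML to a flat <record>."""
    source = ET.fromstring(xhtml)
    record = ET.Element("record")  # hypothetical target vocabulary
    for element in source.iter():
        # Drop the namespace part of '{http://www.w3.org/1999/xhtml}p'.
        local = element.tag.rsplit("}", 1)[-1]
        if local in HEADINGS:
            out = ET.SubElement(record, "heading", level=local[1])
        elif local == "p":
            out = ET.SubElement(record, "para")
        else:
            continue
        # Flatten inline markup and normalise whitespace.
        out.text = " ".join("".join(element.itertext()).split())
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        "<h1>Title</h1><p>Some <em>marked up</em>\n text.</p>"
        "</body></html>"
    )
    print(xhtml_to_record(sample))
```

An XSLT stylesheet would express the same mapping declaratively with one template per source element, which tends to scale better once the target format grows.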

Also, I ran across Elliotte Rusty Harold's newish book, Refactoring
HTML[3] recently.  It may have some material that can help you.

Hope this helps you get started,
-Jon

1. http://www.w3.org/People/Raggett/tidy/
2. http://home.ccil.org/~cowan/XML/tagsoup/
3. http://books.google.com/books?id=I2MKuc4iZDYC

Jon Stroop
Metadata Analyst
C-17-D2 Firestone Library
Princeton University
Princeton, NJ 08544

Email: jstroop at princeton.edu
Phone: (609)258-0059
Fax: (609)258-0441

http://diglib.princeton.edu
http://diglib.princeton.edu/ead



John Fitzgibbon wrote:
>
> Hi,
>
> I am seeking a converter that will automatically convert an HTML file
> to XML (not XHTML). If it can operate in batch mode (convert a batch
> of HTML files to XML) that would be ideal.
>
> I would appreciate any advice.
>
> Regards
>
> John
>
> John Fitzgibbon
>
> _______________________________________________
> XML4Lib mailing list
> XML4Lib at webjunction.org
> http://lists.webjunction.org/mailman/listinfo/xml4lib
>





