[XML4Lib] seeking good HTML to XML converter

Jon jpstroop at gmail.com
Wed Aug 26 09:00:33 EDT 2009


John,
As far as I know, unless you can guarantee that your HTML documents
are already well-formed XML, you'll need to start by converting them to
XHTML.  There are at least two tools available for doing this: Tidy[1]
and TagSoup[2]. I've only really used the former; it repairs what it
can (unclosed tags, bad nesting and the like), so it is worth checking
the output for badly broken pages.
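
To illustrate what "getting them into well-formed XML" involves, here is a
minimal sketch in Python using only the standard library. It is not how Tidy
or TagSoup work internally and is far less thorough than either; the class
name `SoupToXml`, the function `html_to_xml`, and the synthetic `document`
wrapper element are all assumptions made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class SoupToXml(HTMLParser):
    """Hypothetical helper: builds a well-formed tree from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Synthetic wrapper so the result always has exactly one root.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "checked") get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        el = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest matching open element and everything inside it;
        # a stray closing tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    """Return a well-formed XML string for the given HTML text."""
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<p>Hello<br>world <b>bold</p>"))
```

For batch use, the same function could be called in a loop over something like
`pathlib.Path("some_dir").glob("*.html")` (the directory name is a
placeholder). Comments and doctypes are dropped, and odd tag or attribute
names are passed through unchecked, so a real tool is still the safer choice.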

Once you know you have well-formed documents, you can use XSLT (1.0 or
2.0) to transform your (X)HTML into another XML format.
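
As a rough picture of that second step, the sketch below renames XHTML
elements into a different vocabulary. Note that this is not XSLT: it is a
stand-in written with Python's ElementTree that mimics what a very simple set
of XSLT templates would do. The target element names (`article`, `title`,
`para`, `emph`) and the function name `transform` are invented for the
example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XHTML element names to a made-up target format.
RENAME = {"html": "article", "h1": "title", "p": "para",
          "b": "emph", "i": "emph"}


def transform(xhtml: str) -> str:
    """Rename elements of a well-formed XHTML string per RENAME."""
    root = ET.fromstring(xhtml)
    for el in root.iter():
        # Drop the "{namespace}" prefix ElementTree puts on qualified tags.
        local = el.tag.rsplit("}", 1)[-1]
        el.tag = RENAME.get(local, local)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
              '<h1>T</h1><p>x <b>y</b></p></body></html>')
    print(transform(sample))
```

A real XSLT stylesheet can also restructure, filter and reorder content, which
this one-to-one renaming does not attempt.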

Also, I ran across Elliotte Rusty Harold's newish book, Refactoring
HTML[3] recently.  It may have some material that can help you.

Hope this helps you get started,
-Jon

1. http://www.w3.org/People/Raggett/tidy/
2. http://home.ccil.org/~cowan/XML/tagsoup/
3. http://books.google.com/books?id=I2MKuc4iZDYC

Jon Stroop
Metadata Analyst
C-17-D2 Firestone Library
Princeton University
Princeton, NJ 08544

Email: jstroop at princeton.edu
Phone: (609)258-0059
Fax: (609)258-0441

http://diglib.princeton.edu
http://diglib.princeton.edu/ead



John Fitzgibbon wrote:
> Hi,
>
> I am seeking a converter that will automatically convert a HTML file
> to XML (not XHTML). If it can operate in batch mode (convert a batch
> of HTML files to XML) that would be ideal.
>
> I would appreciate any advice.
>
> Regards
>
> John
>
> John Fitzgibbon




