[XML4Lib] batch conversion of HTML files to XML

Houghton,Andrew houghtoa at oclc.org
Tue Jul 15 09:25:19 EDT 2008


No, you are not searching for the holy grail.  Several tools do what you are asking for; Tidy [1] and TagSoup [2] come to mind.
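
As a rough sketch of the batch step, something like the following could run Tidy over every file in a folder. It assumes the tidy executable is installed and on the PATH; the function names tidy_command and convert_folder, the "*.html" pattern, and the choice to write a .xml file next to each source file are illustrative assumptions, not part of either tool.

```python
import subprocess
from pathlib import Path


def tidy_command(source, target):
    """Build a Tidy command line that writes `source` to `target` as well-formed XHTML."""
    # -asxml: convert HTML to well-formed XHTML
    # -n:     emit numeric rather than named character entities
    # -q:     suppress non-essential output
    # -o:     write the result to the given file
    return ["tidy", "-asxml", "-n", "-q", "-o", str(target), str(source)]


def convert_folder(folder, pattern="*.html"):
    """Run Tidy on every matching file in `folder`; return the paths that were written."""
    converted = []
    for source in sorted(Path(folder).glob(pattern)):
        target = source.with_suffix(".xml")  # hypothetical naming choice
        result = subprocess.run(tidy_command(source, target))
        # Tidy exits with 0 on success, 1 if there were warnings, 2 if there were errors.
        if result.returncode < 2:
            converted.append(target)
    return converted
```

Calling convert_folder("site") would then process site/*.html in one pass. Files that Tidy reports errors for are left out of the returned list so they can be checked by hand.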

 

Andy.

 

[1] http://tidy.sourceforge.net/

[2] http://ccil.org/~cowan/XML/tagsoup/

 

 

From: xml4lib-bounces at webjunction.org [mailto:xml4lib-bounces at webjunction.org] On Behalf Of John Fitzgibbon
Sent: Tuesday, July 15, 2008 4:47 AM
To: xml4lib
Subject: [XML4Lib] batch conversion of HTML files to XML

 

Hi,

 

Is it possible to convert a folder of HTML files to XML without having to edit each file in a text editor that supports regular expressions? That is how I have done it in the past, but I am hoping there is an easier way.

 

The process would have to change tags like <br> to <br/>. Input tags in forms would also have to be closed.

 

It may have to close tags like <p> and <li>.

 

Finally, attribute values are not necessarily enclosed in quotes. For example, width=200 will have to become width="200".
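
To make the three requirements above concrete, here is a minimal sketch using only Python's standard library. It is an illustration under simplifying assumptions, not a replacement for a dedicated tool: the names XmlWriter and html_to_xml are hypothetical, an open <p> or <li> is closed only when it is the innermost open element, and comments, doctypes and processing instructions are dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content and must be written as <tag/>.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}
# Opening one of these closes an innermost open element of the same name.
AUTO_CLOSE = {"p", "li"}


class XmlWriter(HTMLParser):
    """Re-serialize HTML events as well-formed XML (simplified)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in AUTO_CLOSE and self.stack and self.stack[-1] == tag:
            self.out.append("</%s>" % self.stack.pop())
        # quoteattr adds the quotes; a valueless attribute gets its own name as value.
        rendered = "".join(
            " %s=%s" % (name, quoteattr(name if value is None else value))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, rendered))
        else:
            self.out.append("<%s%s>" % (tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything down to `tag`.
        if tag in self.stack:
            while True:
                open_tag = self.stack.pop()
                self.out.append("</%s>" % open_tag)
                if open_tag == tag:
                    break

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_xml(html):
    """Return `html` re-serialized with closed tags and quoted attributes."""
    writer = XmlWriter()
    writer.feed(html)
    writer.close()
    # Close whatever is still open at the end of the input.
    while writer.stack:
        writer.out.append("</%s>" % writer.stack.pop())
    return "".join(writer.out)
```

For example, html_to_xml('<p>one<br><p>two<input type=text width=200>') closes the first paragraph, self-closes the br and input tags, and quotes both attribute values.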

 

Am I searching for a holy grail?

 

Any advice would be much appreciated.

 

Regards

Jon

 

w: www.galwaylibrary.ie

e: info at galwaylibrary.ie

p: 00 353 91 562471

f: 00 353 91 565039

 

