Do you mind if the HTML is a pile of dung?
One way to tackle the first question is to write a macro that takes each Word document in a directory, opens it, and saves as HTML.

Figuring out what to do with the HTML after that is up to you.
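
For illustration, the batch idea above could be sketched in Python instead of a Word macro. This is only a sketch under assumptions: it presumes Windows, an installed copy of Word, and the pywin32 package. The names `plan_conversions` and `convert_directory` are made up for this example, and 10 is Word's filtered-HTML save format.

```python
from pathlib import Path

WD_FORMAT_FILTERED_HTML = 10  # Word's wdFormatFilteredHTML constant


def plan_conversions(directory):
    """Return sorted (source, target) path pairs for each .doc file in directory."""
    pairs = []
    for source in sorted(Path(directory).iterdir()):
        # Skip Word's "~$" lock files and anything that is not a .doc file.
        if (source.is_file()
                and source.suffix.lower() == ".doc"
                and not source.name.startswith("~$")):
            pairs.append((source, source.with_suffix(".htm")))
    return pairs


def convert_directory(directory):
    """Open each Word document through COM and save it as filtered HTML.

    Assumes Windows, an installed Word, and pywin32.
    """
    import win32com.client  # imported lazily so the planning step runs anywhere

    app = win32com.client.Dispatch("Word.Application")
    try:
        for source, target in plan_conversions(directory):
            doc = app.Documents.Open(str(source.resolve()))
            doc.SaveAs(str(target.resolve()), WD_FORMAT_FILTERED_HTML)
            doc.Close(0)  # 0 == wdDoNotSaveChanges
    finally:
        app.Quit()
```

Calling `plan_conversions` on its own first shows which files would be written before Word is ever started.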

I'd be shocked if you couldn't find open source tools that convert Word format to HTML and produce better output. They'd be less likely to lay things out exactly as Word did, though. Your users might or might not care.

Cheers,
Ben
To deny the indirect purchaser, who in this case is the ultimate purchaser, the right to seek relief from unlawful conduct, would essentially remove the word consumer from the Consumer Protection Act
- [link|http://www.techworld.com/opsys/news/index.cfm?NewsID=1246&Page=1&pagePos=20|Nebraska Supreme Court]
Re: Do you mind if the HTML is a pile of dung?
I'm not too bothered about the HTML initially.

I know what you mean; the HTML produced by Word needs substantial demoronisation before it's useful in general terms.

One tool I've seen (but not tried) is wv.


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
I wrote a Python Word->HTML converter in Feb
...for a larger application. Hack as needed/desired. The "SaveAs type 10" is one of the more important bits--"filtered" html.

import win32com.client
import os
import re

class Converter(object):
    """Convert plain text documents to Junct Topics."""
    def __init__(self, fileName):
        self.fileName = fileName

    def toTopic(self):
        return u''.join([unicode(line, "windows-1252", "replace")
                         for line in file(self.fileName, 'rU')])


class WordDocument(Converter):
    """Convert Microsoft Word documents to Junct Topics."""

    def toTopic(self):
        htmlFile = self.fileName.split(u'.')
        htmlFile = u'.'.join(htmlFile[:-1] + ['htm'])

        # Convert the doc to filtered html
        app = win32com.client.Dispatch('Word.Application')
        doc = app.Documents.Add(self.fileName)
        doc.SaveAs(htmlFile, 10)    # 10 == HTML-Filtered
        doc.Close(0)                #  0 == don't save changes?
        app.Quit()

        # Read in the new html file.
        content = u''.join([unicode(line, "windows-1252", "replace")
                            for line in file(htmlFile, 'rU')])

        # Grab the body element and strip out HTML cruft.
        content = re.sub(r"\r\n", r'\n', content)
        content = re.sub(r'(?s)^.*<body[^>]*>(.*)</body>.*$', r'\1', content)
        content = re.sub(r"style='[^']*'", r'', content)
        content = re.sub(r"class=Mso[^>]*", r'', content)
        content = re.sub(r"<div [^>]*>", r'', content)
        content = re.sub(r"</div>", r'', content)
        content = re.sub(r"\n", r' ', content)

        # Delete the intermediate file.
        try:
            os.remove(htmlFile)
        except OSError:
            pass

        return content
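
For anyone who only wants the cleanup step, here is a rough standalone take on the same regex pass, written for Python 3 and working on a string rather than a file. It is a sketch, not the code from the larger application. The name `strip_word_cruft` is made up, and two patterns are deliberately narrower or wider than the ones above: the `class=Mso` rule removes only the class attribute rather than everything up to the closing `>`, and bare `<div>` tags are stripped too.

```python
import re


def strip_word_cruft(html):
    """Reduce Word's filtered HTML to the cleaned contents of its body element."""
    html = html.replace("\r\n", "\n")
    # Keep only what sits between <body ...> and </body>, if a body is present.
    match = re.search(r"(?is)<body[^>]*>(.*)</body>", html)
    if match:
        html = match.group(1)
    html = re.sub(r"\s*style='[^']*'", "", html)   # inline styles
    html = re.sub(r"\s*class=Mso\w*", "", html)    # Word's Mso* classes only
    html = re.sub(r"(?i)</?div[^>]*>", "", html)   # div wrappers, open and close
    # Collapse line breaks (and the whitespace around them) into single spaces.
    return re.sub(r"\s*\n\s*", " ", html).strip()
```

Because it takes and returns plain strings, it can be tried on a saved .htm file without Word installed.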

BTW, the larger app is available (wiki-like)
Seconded
We employed a summer intern last year to convert the Y2K documentation from Word to HTML. First step was "Save As". Then clean up the HTML document. It took all summer. If you're doing it yourself... find a better way. As always, YMMV.
     From Word To Eternity - (pwhysall) - (12)
         Do you mind if the HTML is a pile of dung? - (ben_tilly) - (4)
             Re: Do you mind if the HTML is a pile of dung? - (pwhysall) - (2)
                 I wrote a Python Word->HTML converter in Feb - (FuManChu)
                 BTW, the larger app is available (wiki-like) -NT - (FuManChu)
             Seconded - (jbrabeck)
         different question - (boxley) - (3)
             That's certainly one approach. - (pwhysall) - (2)
                 That's what a wiki is for, I wot. -NT - (admin) - (1)
                     I'd vote against a wiki myself. - (Another Scott)
         It's a problem no matter how you slice it. - (Another Scott)
         Sharepoint? (ducks, runs) -NT - (altmann) - (1)
             That's not funny! - (Silverlock)
