IWETHEY v. 0.3.0 | TODO
1,095 registered users | 1 active user | 0 LpH | Statistics
Login | Create New User
IWETHEY Banner

Welcome to IWETHEY!

New Re: Do you mind if the HTML is a pile of dung?
I'm not too bothered about the HTML initially.

I know what you mean; the HTML produced by Word needs substantial demoronisation before it's useful in general terms.

One tool I've seen (but not tried) is wv.


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
New I wrote a Python Word->HTML converter in Feb
...for a larger application. Hack as needed/desired. The "SaveAs type 10" is one of the more important bits--"filtered" html.

\nimport win32com.client\nimport os\nimport re\n\nclass Converter(object):\n    """Convert plain text documents to Junct Topics."""\n    def __init__(self, fileName):\n        self.fileName = fileName\n    \n    def toTopic(self):\n        return u''.join([unicode(line, "windows-1252", "replace")\n                         for line in file(self.fileName, 'rU')])\n\n\nclass WordDocument(Converter):\n    """Convert Microsoft Word documents to Junct Topics."""\n    \n    def toTopic(self):\n        htmlFile = self.fileName.split(u'.')\n        htmlFile = u'.'.join(htmlFile[:-1] + ['htm'])\n        \n        # Convert the doc to filtered html\n        app = win32com.client.Dispatch('Word.Application')\n        doc = app.Documents.Add(self.fileName)\n        doc.SaveAs(htmlFile, 10)    # 10 == HTML-Filtered\n        doc.Close(0)                #  0 == don't save changes?\n        app.Quit()\n        \n        # Read in the new html file.\n        content = u''.join([unicode(line, "windows-1252", "replace")\n                            for line in file(htmlFile, 'rU')])\n        \n        # Grab the body element and strip out HTML cruft.\n        content = re.sub(r"\\r\\n", r'\\n', content)\n        content = re.sub(r'(?s)^.*<body[^>]*>(.*)</body>.*$', r'\\1', content)\n        content = re.sub(r"style='[^']*'", r'', content)\n        content = re.sub(r"class=Mso[^>]*", r'', content)\n        content = re.sub(r"<div [^>]*>", r'', content)\n        content = re.sub(r"</div>", r'', content)\n        content = re.sub(r"\\n", r' ', content)\n        \n        # Delete the intermediate file.\n        try:\n            os.remove(htmlFile)\n        except OSError:\n            pass\n        \n        return content\n

New BTW, the larger app is available (wiki-like)
     From Word To Eternity - (pwhysall) - (12)
         Do you mind if the HTML is a pile of dung? - (ben_tilly) - (4)
             Re: Do you mind if the HTML is a pile of dung? - (pwhysall) - (2)
                 I wrote a Python Word->HTML converter in Feb - (FuManChu)
                 BTW, the larger app is available (wiki-like) -NT - (FuManChu)
             Seconded - (jbrabeck)
         different question - (boxley) - (3)
             That's certainly one approach. - (pwhysall) - (2)
                 That's what a wiki is for, I wot. -NT - (admin) - (1)
                     I'd vote against a wiki myself. - (Another Scott)
         It's a problem no matter how you slice it. - (Another Scott)
         Sharepoint? (ducks, runs) -NT - (altmann) - (1)
             That's not funny! - (Silverlock)

My God... that pizza combo is full of stars...!
148 ms