
Welcome to IWETHEY!

New From Word To Eternity
Here's the deal.

I've got a pile of Word documents.

They're technical notes; they should fit into categories, and ideally they'd be searchable. As people write them, they call them TNxxx.DOC where xxx is the next available integer. The information in them is invaluable.

There's a master INDEX.DOC file that has a table that has TN numbers and a longer description.

Quite frankly, now that there are 200 of these things, it sucks trying to find the one you want.

There are two[0] parts to this question.
  1. What's the best process for getting from "pile of Word documents" to "pile of HTML documents"?

  2. What's the best way to present these documents and allow people to add to and modify them? Wiki? dwww? telnet + VMS EVE? netcat?

  3. I'm doing this on my own time and my own dime (a little initiative never killed anyone), so bear in mind I can't commit huge resources to it. Hell, even locating a machine to host this at work is going to be a bit of a battle - so cross-platform (Windows/Linux/VMS) solutions will be more favoured, as that might let me just park it in a quiet corner of an existing server.

All suggestions welcome. I'm not really looking for someone to solve the problem for me; however, if someone has been down this route before, war stories welcome :)

[0] For suitably large values of 2.


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
New Do you mind if the HTML is a pile of dung?
One way to tackle the first question is to write a macro that takes each Word document in a directory, opens it, and saves as HTML.

Figuring out what to do with the HTML after that is up to you.

I'd be shocked if you couldn't find open source tools that convert Word format to HTML and produce better output. They'd be less likely to lay out exactly as Word did, though. Your users might or might not care.
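
A minimal sketch of the "each document in a directory" loop, in Python 3 rather than a Word macro. The names `convert_directory` and `convert_one` are hypothetical; `convert_one` stands in for whatever actually performs the conversion (Word automation, a command-line tool), which this sketch does not implement.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def convert_directory(
    source_dir: Path,
    target_dir: Path,
    convert_one: Callable[[Path, Path], None],
) -> List[Tuple[Path, Path]]:
    """Call convert_one(doc, html) for every .doc file in source_dir.

    Returns the (doc, html) pairs that were processed, sorted by name.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for doc in sorted(source_dir.iterdir()):
        # Match .doc and .DOC alike; skip subdirectories and other files.
        if not doc.is_file() or doc.suffix.lower() != ".doc":
            continue
        # Assumption: output keeps the same base name with a .htm suffix.
        html = target_dir / (doc.stem + ".htm")
        convert_one(doc, html)
        converted.append((doc, html))
    return converted
```

Keeping the converter as a parameter means the loop can be tried out with a stub before any real conversion tool is wired in.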

Cheers,
Ben
To deny the indirect purchaser, who in this case is the ultimate purchaser, the right to seek relief from unlawful conduct, would essentially remove the word consumer from the Consumer Protection Act
- [link|http://www.techworld.com/opsys/news/index.cfm?NewsID=1246&Page=1&pagePos=20|Nebraska Supreme Court]
New Re: Do you mind if the HTML is a pile of dung?
I'm not too bothered about the HTML initially.

I know what you mean; the HTML produced by Word needs substantial demoronisation before it's useful in general terms.

One tool I've seen (but not tried) is wv.


Peter
New I wrote a Python Word->HTML converter in Feb
...for a larger application. Hack as needed/desired. The "SaveAs type 10" is one of the more important bits--"filtered" html.

import win32com.client
import os
import re

class Converter(object):
    """Convert plain text documents to Junct Topics."""
    def __init__(self, fileName):
        self.fileName = fileName
    
    def toTopic(self):
        return u''.join([unicode(line, "windows-1252", "replace")
                         for line in file(self.fileName, 'rU')])


class WordDocument(Converter):
    """Convert Microsoft Word documents to Junct Topics."""
    
    def toTopic(self):
        htmlFile = self.fileName.split(u'.')
        htmlFile = u'.'.join(htmlFile[:-1] + ['htm'])
        
        # Convert the doc to filtered html
        app = win32com.client.Dispatch('Word.Application')
        doc = app.Documents.Add(self.fileName)
        doc.SaveAs(htmlFile, 10)    # 10 == HTML-Filtered
        doc.Close(0)                #  0 == don't save changes?
        app.Quit()
        
        # Read in the new html file.
        content = u''.join([unicode(line, "windows-1252", "replace")
                            for line in file(htmlFile, 'rU')])
        
        # Grab the body element and strip out HTML cruft.
        content = re.sub(r"\r\n", r'\n', content)
        content = re.sub(r'(?s)^.*<body[^>]*>(.*)</body>.*$', r'\1', content)
        content = re.sub(r"style='[^']*'", r'', content)
        content = re.sub(r"class=Mso[^>]*", r'', content)
        content = re.sub(r"<div [^>]*>", r'', content)
        content = re.sub(r"</div>", r'', content)
        content = re.sub(r"\n", r' ', content)
        
        # Delete the intermediate file.
        try:
            os.remove(htmlFile)
        except OSError:
            pass
        
        return content
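
For anyone who only wants the cleanup step, here is a rough Python 3 take on the regex pass above, pulled out as a standalone function so it can be tried without Word installed. The name `strip_word_cruft` is hypothetical, and the patterns are slightly narrower than the originals (for example, only the `class=Mso...` attribute itself is removed instead of everything up to the end of the tag), so treat it as a sketch and not a drop-in replacement.

```python
import re


def strip_word_cruft(html: str) -> str:
    """Reduce Word's filtered HTML to the contents of <body>, minus styling."""
    text = html.replace("\r\n", "\n")
    # Keep only what sits between <body ...> and </body>, if present.
    match = re.search(r"(?is)<body[^>]*>(.*)</body>", text)
    if match:
        text = match.group(1)
    # Drop inline styles, Word's Mso* classes, and <div> wrappers.
    text = re.sub(r"\s*style='[^']*'", "", text)
    text = re.sub(r'\s*style="[^"]*"', "", text)
    text = re.sub(r"\s*class=Mso\w*", "", text)
    text = re.sub(r"(?i)</?div\b[^>]*>", "", text)
    # Collapse newlines to spaces so the result is a single line.
    return text.replace("\n", " ").strip()
```

Regexes are a blunt tool for HTML; this only works because Word's filtered output is fairly regular.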

New BTW, the larger app is available (wiki-like)
New Seconded
We employed a summer intern last year to convert the Y2K documentation from Word to html. First step was "Save As". Then clean up the html document. Took all summer. Doing it yourself... find a better way. As always YMMV
New different question
If these are Word docs, why not create an HTML index that usefully describes the contents, so that clicking a title pulls up the Word .doc? Assuming all the users currently have Word (or they couldn't read them anyway), I would consider this the fastest way to organize them. Putting them in HTML means that a user would edit them in Word and then put all the winblows crap back into the HTML when they save.
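
A minimal sketch of that kind of index page in Python 3. The function name `build_index` is hypothetical, and the zero-padded `TN001.DOC` file naming is an assumption about how the "TNxxx.DOC" numbers are written; the (number, description) pairs would come from the existing INDEX.DOC table.

```python
from html import escape
from typing import Iterable, Tuple


def build_index(entries: Iterable[Tuple[int, str]]) -> str:
    """Render (TN number, description) pairs as an HTML list of links."""
    items = []
    for number, description in sorted(entries):
        # Assumption: files are named TN001.DOC, TN002.DOC, and so on.
        filename = "TN%03d.DOC" % number
        items.append(
            '<li><a href="%s">%s</a>: %s</li>'
            % (escape(filename, quote=True), escape(filename), escape(description))
        )
    body = "\n".join(items)
    return (
        "<html><body><h1>Technical Notes</h1>\n<ul>\n%s\n</ul>\n</body></html>"
        % body
    )
```

The links point at the original Word files, so nothing has to be converted and the browser hands each document to Word.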
my 2 cents
thanx,
bill
attempting to explain profiling doesn't require one to take a position for or against it any more than attempting to explain gravity requires one to be for or against gravity. Walter Williams
questions, help? [link|mailto:pappas@catholic.org|email pappas at catholic.org]
New That's certainly one approach.
These are documents created for and by engineers. I don't see why they couldn't use some simple markup in Notepad :)

Alternatively, of course, if I pursue the Wiki route, they won't use Word at all to edit them.

General observation: these are not strictly controlled documents; they're informal technical notes that people write whenever a particularly tricky task is performed, the goal being to prevent unnecessary reinvention of wheels.


Peter
New That's what a wiki is for, I wot.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New I'd vote against a wiki myself.
Every one I've come across has been rather sluggish to display pages and not at all intuitive to modify.

YMMV.

I'd go with Boxley's suggestion myself, and I actually do something similar at work (have a very simple web page with pointers to sections of .DOC, .XLS, .PPT, .PDF, etc., documents).

Cheers,
Scott.
New It's a problem no matter how you slice it.
I think the best solution is to have the documents generated in a format that matches the final form necessary for the task. Otherwise, work will need to be done to convert them to HTML or PDF or whatever, and there will be concerns about which document is the latest version, etc. But that means taking the time to convince or train people to use the tool of interest for this task rather than what they're used to.

As I mentioned in my other reply, it seems to me to be less work to modify the documents as little as possible, but make it simple for users to find whatever document they need. You would still need to spend the time to generate indexes or lists of key words if you need them. I don't know of a good way to do that automatically. That might be an advantage of a Wiki approach, but I don't think a Wiki is suitable for casual users (though YMMV).

There are many tools out there to convert .DOC to HTML, but I don't know how good they are.

This is a problem that has been solved, but I imagine that it's in the context of internal documentation (perhaps SGML-like stuff). I would be surprised if there was a free, high-quality, intuitive, and easy to use solution. Perhaps Fu has cracked that nut. :-)

So, I'd vote for making a simple web page with descriptions of the existing documents and pointers to the original files. The client machines would automagically open them in the appropriate application. If your colleagues want something fancier, then I'd branch out from there. If the original documents are modified, perhaps a "simple" script can be run to update the web page based on changes in time stamps, etc.
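
The time-stamp check could look something like this in Python 3. It is only a sketch: `stale_documents` is a hypothetical name, and it assumes the documents and the generated index page are ordinary files whose modification times can be trusted.

```python
from pathlib import Path
from typing import List


def stale_documents(doc_dir: Path, index_file: Path) -> List[Path]:
    """Return documents modified more recently than the index page."""
    if index_file.exists():
        index_mtime = index_file.stat().st_mtime
    else:
        # No index yet: every document counts as changed.
        index_mtime = float("-inf")
    return sorted(
        p
        for p in doc_dir.iterdir()
        if p.is_file() and p != index_file and p.stat().st_mtime > index_mtime
    )
```

If the returned list is non-empty, the web page is out of date and can be regenerated.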

HTH a bit. Good luck!

Cheers,
Scott.
New Sharepoint? (ducks, runs)
--
Chris Altmann
New That's not funny!
Guess what we have to use for our document store.
-----------------------------------------
It is much harder to be a liberal than a conservative. Why?
Because it is easier to give someone the finger than it is to give them a helping hand.
Mike Royko
     From Word To Eternity - (pwhysall) - (12)
         Do you mind if the HTML is a pile of dung? - (ben_tilly) - (4)
             Re: Do you mind if the HTML is a pile of dung? - (pwhysall) - (2)
                 I wrote a Python Word->HTML converter in Feb - (FuManChu)
                 BTW, the larger app is available (wiki-like) -NT - (FuManChu)
             Seconded - (jbrabeck)
         different question - (boxley) - (3)
             That's certainly one approach. - (pwhysall) - (2)
                 That's what a wiki is for, I wot. -NT - (admin) - (1)
                     I'd vote against a wiki myself. - (Another Scott)
         It's a problem no matter how you slice it. - (Another Scott)
         Sharepoint? (ducks, runs) -NT - (altmann) - (1)
             That's not funny! - (Silverlock)
