
Indexing documents.
I'm not sure this is the proper forum, but since this is one piece of my larger program, I thought I'd ask here.

Got a bunch of documents being uploaded. They get stored in my database in binary format. I have a name and the binary document itself and a pointer reference to the task with which it is associated. Documents are in whatever format the end user used - which for this population usually means the big three of Word, Excel and PDF.

What I'd like is to somehow index the document contents for search purposes. It's not necessary to understand every document format that comes down the pike - every little bit helps. What I need is some software that accepts the document info and contents through a pipe (I don't really want these files in the file system, but I will write them temporarily to a file if that's what's needed). And I cannot allow the documents to be opened by the native program - macros, viruses and other nonsense would have to be dealt with.

Are there any generic open source utilities for performing document indexing for various document types?
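
For what it's worth, the pipe requirement could be sketched roughly like this: hand the stored bytes to an external text extractor on stdin, and fall back to a short-lived temp file only for tools that insist on a path. The function names and the converter command are hypothetical; no particular extractor is assumed, only that it writes plain text to stdout.

```python
import os
import subprocess
import tempfile


def extract_via_pipe(cmd, data, timeout=30):
    """Feed document bytes to `cmd` on stdin and return its stdout as text.

    `cmd` is a hypothetical converter command (a list of arguments) that
    reads a document on stdin and writes plain text to stdout.
    """
    result = subprocess.run(cmd, input=data, capture_output=True,
                            timeout=timeout, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def extract_via_tempfile(cmd, data, suffix="", timeout=30):
    """Same idea for converters that need a file path as their last argument."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Write the database blob to a private temp file.
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        result = subprocess.run(cmd + [path], capture_output=True,
                                timeout=timeout, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        # Remove the file even if the converter fails or times out.
        os.remove(path)
```

Since the converter is a separate text-extraction process rather than the native application, the document is never opened by Word or Excel itself, so macros are not run.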

I noticed PyLucene recently; haven't used it.
Looks like the way to go
The Lucene project looks interesting for the purposes of indexing. It seems to handle only text, though. Indexing is the trickier part - I guess I could front it with an extraction process for the various document formats and feed the text into it. And the Python front end to it might also come in handy.

Thanks for the pointer. Helps me get my bearings on the solution.
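
A rough sketch of that "extraction in front, text-only indexer behind" arrangement, assuming a per-extension registry of extractors. Everything here is hypothetical: `EXTRACTORS`, `extract_text` and `TinyIndex` are illustration names, and `TinyIndex` is a toy inverted index standing in for Lucene, not the Lucene or PyLucene API. Only a plain-text extractor is filled in; Word, Excel and PDF extractors would be registered the same way.

```python
import re
from collections import defaultdict

# Hypothetical registry: file extension -> function turning raw bytes into text.
EXTRACTORS = {}


def extractor(*extensions):
    """Register a bytes-to-text function for one or more file extensions."""
    def register(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return register


@extractor(".txt", ".csv")
def plain_text(data):
    return data.decode("utf-8", errors="replace")


def extract_text(name, data):
    """Return the text of a stored document, or None if the format is unknown."""
    ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
    func = EXTRACTORS.get(ext)
    return func(data) if func else None


class TinyIndex:
    """Toy inverted index standing in for a real engine such as Lucene."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        hits = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*hits)
```

Unknown formats simply come back as None and are skipped, which fits the "every little bit helps" goal: the document is still stored, it just isn't searchable by content.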
     Indexing documents. - (ChrisR) - (2)
         I noticed PyLucene recently; haven't used it. - (FuManChu) - (1)
             Looks like the way to go - (ChrisR)
