The Lucene project looks interesting for indexing, though it seems to handle only text. Indexing is the trickier part, so I guess I could put an extraction process in front of it for the various document formats and feed the extracted text into it. The Python front end to it might also come in handy.
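
Roughly what I have in mind, as a minimal sketch rather than anything final: pick an extractor by file extension, turn each document into plain text, and hand that text to the indexer. Everything named here (`EXTRACTORS`, `extract_text`, `index_documents`, `ToyIndex`) is made up for illustration. `ToyIndex` is a tiny in-memory stand-in for the indexer, not the Lucene or Python binding API, and only plain text and HTML are covered, so other formats would need their own extractors.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path
from typing import Dict, List, Optional


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_plain(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


def extract_html(raw: bytes) -> str:
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.parts)


# Hypothetical registry: one extractor per document format.
EXTRACTORS = {".txt": extract_plain, ".html": extract_html, ".htm": extract_html}


def extract_text(name: str, raw: bytes) -> Optional[str]:
    """Return plain text for a document, or None if the format is unsupported."""
    extractor = EXTRACTORS.get(Path(name).suffix.lower())
    return extractor(raw) if extractor else None


class ToyIndex:
    """Stand-in for the real indexer: maps each lowercased term to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, term: str) -> List[str]:
        return sorted(self.postings.get(term.lower(), set()))


def index_documents(docs: Dict[str, bytes], index: ToyIndex) -> List[str]:
    """Extract and index every document; return the names that were skipped."""
    skipped = []
    for name, raw in docs.items():
        text = extract_text(name, raw)
        if text is None:
            skipped.append(name)
        else:
            index.add(name, text)
    return skipped
```

The point of the split is that the indexer only ever sees text, so supporting a new format should just mean registering another extractor.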

Thanks for the pointer. Helps me get my bearings on the solution.