Automation comes down to identifying the major issues, then coding up a few lines of something to deal with each of them. That continues until tidy can handle the remaining breakage on its own. The breakage is largely unmatched or illegal (unrecognized) tags. After generating a list of affected files, I'll skim through them or filter via grep, identify a typical pattern to the breakage, and work out a good fix rule. There's a small risk of fouling things up, but you've got the sources to refer back to.
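Roughly, the cycle looks something like this (the directory, log names, and the "bogustag" fix rule below are placeholders for illustration, not the actual rules I used):

    # run tidy over each post, logging its complaints; tidy exits 2 on
    # hard errors, so collect those files for closer inspection
    for f in posts/*.html; do
        tidy -quiet -errors "$f" 2>> tidy-errors.log
        [ $? -gt 1 ] && echo "$f" >> broken.list
    done

    # strip the "line N column M -" prefix and see which complaints dominate
    sed 's/^line [0-9]* column [0-9]* - //' tidy-errors.log | sort | uniq -c | sort -rn | head

    # once a pattern is clear, apply a one-off fix rule to the broken files
    # (writing to a copy first, so the originals are still there to refer to)
    while read f; do
        sed 's|<bogustag>||g' "$f" > "$f.fixed" && mv "$f.fixed" "$f"
    done < broken.list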
I think I ran through a dozen or so different scripts on the small percentage (a few thousand, of the 120k) of IWE posts that were broken. Much faster than hand-editing the posts. And the results mostly seem to work -- I hunt through the archive periodically, and haven't found any pages that are grossly mangled. The EZ stuff is likely going to be more difficult, though I think the volume is smaller.
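The periodic hunt is equally low-tech; something along these lines, where the directory and the patterns are just examples of the kind of mangling I'm watching for:

    # scan the cleaned archive for tell-tale signs of a bad conversion,
    # e.g. escaped markup or doubled closing tags
    find archive -name '*.html' | xargs grep -l -E '&lt;/?(html|body)&gt;|</body>.*</body>' | head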
No, this isn't a generalized system; it's an assisted process -- hence my description of "semi-auto". There's still a fair bit of assess, diagnose, and apply-remedy involved.
WRT processing time, I think the main tidy run through the IWE archive was about 8 hours on my laptop (600 MHz, 128MB, 20GB IDE).
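A minimal sketch of that kind of batch run (directory and log names are placeholders; -m tells tidy to rewrite each file in place):

    # let tidy rewrite every post in place, collecting its complaints in a log;
    # on a machine like that it's disk- and CPU-bound, so expect it to take hours
    find iwe-archive -name '*.html' -print | while read f; do
        tidy -quiet -m "$f" 2>> tidy-run.log
    done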
For what you seem to be asking -- how do you find the crap in a busted page -- I'd dump the page to text and use an editor with a good search function (e.g., vim). For a more comprehensive site, snarf the pages (wget, snarf, etc.) and run a find foo | xargs grep bar to isolate the content. I've actually done the same. Pretty sad commentary on site research support.
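Concretely, something like this (URLs and the search term are placeholders, and lynx is just one way to dump a page to text):

    # single page: dump to text, then search it in an editor
    lynx -dump http://example.com/busted-page.html > page.txt
    vim page.txt

    # whole site: mirror it, then grep across the lot for the content you want
    wget -r -np -q http://example.com/docs/
    find example.com -name '*.html' | xargs grep -l 'search term'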