Automation

Automation comes down to a matter of identifying major issues, then coding up a few lines of something to deal with them. This continues to the point that tidy decides it wants to deal with the remaining breakage itself. Breakage is largley unmatched or illegal (unrecognized) tags. Generating a list of effected files, I'll skip through them or filter via grep and identify a typical pattern to the breakage and see what a good fix rule will be. There's a small risk of fouling things up, but you've got sources to refer back to.

I think I ran through a dozen or so different scripts on the small percentage (a few thousand, of the 120k) of IWE posts that were broken. Much faster than hand-editing the posts. And the results mostly seem to work -- I hunt through the archive periodically, and haven't found any pages that are grossly mangled. The EZ stuff is likely going to be more difficult, though I think the volume is smaller.

No, this isn't a generalized system, it's an assisted process. Hence my description of "semi-auto". There's still a large bit of assess, diagnose, and apply remedy involved.

WRT processing time, I think the main tidy run through the IWE archive was about 8 hours on my laptop (600 MHz, 128MB, 20GB IDE).

For what you seem to be asking -- how do you find crap in a busted page -- I'd dump page to text and use editing tools with a good search function (eg: vim). For a more comprehensive site, snarf the pages (wget, snarf, etc.), and run a find foo | xargs grep bar to isolate content. I've actually done same. Pretty sad commentary on site research support.

I now have a small arsenal of tools that, with some suitable and reasonably straightforward shell magic, basically do the job required of them across an arbitrary sized input space until done. The bulk of this message is much what Maggs asked for and I sent.

First of all, archives of my Static Page have been put together at [link|http://yceran.org/static/archived.html|http://yceran.org/s...rchived.html] so you can see what I was in the process of doing when the topic of backing up IWETHEY came up.

I started by saving the forum overview pages manually (there wasn't really that many). Then I was using [link|http://yceran.org/static/findthreads.icn|[link|http://yceran.org/static/findthreads.icn|http://yceran.org/s...dthreads.icn]] to read these to figure out the thread structure. Once that was done, I used [link|http://yceran.org/static/massfetch|[link|http://yceran.org/static/massfetch|http://yceran.org/static/massfetch]] to get the messages. This requires the forum types be Legacy BBS, since that returns one message at a time. So far I've been changing them by hand (that may change, but Icon's networking is not reliable, I suspect) and I haven't changed them back yet. Actually fetching all the content took quite a few hours.

The script to fish out the real content is [link|http://yceran.org/static/dissectmessage.icn|[link|http://yceran.org/static/dissectmessage.icn|http://yceran.org/s...tmessage.icn]]. It took maybe half an hour on a P100 to run over all the messages I had (I estimate between a third and a half of the whole board).

Now I've run it for all forums, the next step is to create some static pages for all of them. I'm toying with the idea of doing HTML cleanup on the content which is by far a bigger task than any I've done so far. Fortunately, I have some Icon code I can plunder for more general HTML dissection. The big ogre, of course, is EZBoard's <BR> tags, but there are other evils to watch out for, too.

The file layout I've been working with is in [link|http://yceran.org/static/forum|[link|http://yceran.org/static/forum|http://yceran.org/static/forum]]. The index files are the ones created from findthreads.icn and the forumname files are required for the newer forums that EZBoard gave different ids than names. :-/ As you can see, I haven't got any of the really large fora yet.

You will need Unicon [link|http://unicon.sourceforge.net|http://unicon.sourceforge.net] or Icon [link|http://www.cs.arizona.edu/icon|http://www.cs.arizona.edu/icon] if you want to compile the .icn files.

Wade.

Welcome to IWETHEY!