Welcome to IWETHEY!

General laundering Q, please
I almost understand the processes you mentioned. (Meaning only - you said it clearly enough that I can imagine the method.) Unclear is the degree of automation which is today practical re such an aim (?) I appreciate that what Wade is doing is a labor of love, and needs a thought-out scheme and lots of hand massaging..

Given the proliferation of Front Page? and other M$-spawned egregious sites, other causes of mangled HTML:

Does it seem possible (well.. likely then?) that a 'home laundry' app just might process an errant site, where you want to find something allegedly there? I mean, say - linked to such via a query ongoing, so you need to use the site somehow.

Natch I cannot guess just How-bad some of these are - but "post-processing", a la running Babelfish on an entire .doc or page - sometimes does elicit useful stuff. (This especially when I'm looking for tech info for European test equipment, find a link to someone's ~ problem)

Once.. in Czech! Friend translated..

Pipe dream?



Ashton
Automation
Automation comes down to identifying the major issues, then coding up a few lines of something to deal with each of them. This continues to the point where tidy is willing to deal with the remaining breakage itself. Breakage is largely unmatched or illegal (unrecognized) tags. After generating a list of affected files, I skip through them or filter via grep, identify a typical pattern to the breakage, and work out what a good fix rule will be. There's a small risk of fouling things up, but you've got the sources to refer back to.
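
A minimal sketch of the "list the affected files" step, in Python rather than whatever the original scripts were written in. The names `TagCounter`, `unbalanced_tags` and the `VOID_TAGS` list are made up for illustration, and counting open versus close tags is only one plausible way to spot unmatched tags; it says nothing about what tidy itself checks.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: elements that never take a closing tag, so they are not "unmatched".
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagCounter(HTMLParser):
    """Keeps a running open-minus-close count for every tag name seen."""

    def __init__(self):
        super().__init__()
        self.balance = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.balance[tag] += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.balance[tag] -= 1


def unbalanced_tags(html):
    """Return {tag: surplus} for tags whose opens and closes do not match.

    A positive surplus means unclosed opening tags; negative means stray closes.
    """
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {tag: n for tag, n in parser.balance.items() if n != 0}


if __name__ == "__main__":
    print(unbalanced_tags("<b>bold <i>both</b>"))
```

Running this over every saved post and tallying the results would give a rough picture of which breakage patterns are common enough to deserve their own fix rule.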

I think I ran through a dozen or so different scripts on the small percentage (a few thousand, of the 120k) of IWE posts that were broken. Much faster than hand-editing the posts. And the results mostly seem to work -- I hunt through the archive periodically, and haven't found any pages that are grossly mangled. The EZ stuff is likely going to be more difficult, though I think the volume is smaller.

No, this isn't a generalized system, it's an assisted process. Hence my description of "semi-auto". There's still a large bit of assess, diagnose, and apply remedy involved.

WRT processing time, I think the main tidy run through the IWE archive was about 8 hours on my laptop (600 MHz, 128MB, 20GB IDE).

For what you seem to be asking -- how do you find crap in a busted page -- I'd dump page to text and use editing tools with a good search function (eg: vim). For a more comprehensive site, snarf the pages (wget, snarf, etc.), and run a find foo | xargs grep bar to isolate content. I've actually done same. Pretty sad commentary on site research support.
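
For anyone without find, xargs and grep to hand, roughly the same idea can be sketched in Python: dump each saved page to text, then search the text. This assumes the pages have already been fetched into a local directory as `.html` files; `page_text` and `search_pages` are hypothetical names, not part of any tool mentioned above.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextDump(HTMLParser):
    """Collects only the text between tags, ignoring the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html):
    """Dump a (possibly broken) page to plain text with whitespace collapsed."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def search_pages(root, term):
    """Return the .html files under root whose text contains term (case-insensitive)."""
    hits = []
    for path in sorted(Path(root).rglob("*.html")):
        text = page_text(path.read_text(encoding="utf-8", errors="replace"))
        if term.lower() in text.lower():
            hits.append(path)
    return hits


if __name__ == "__main__":
    # Assumption: "mirror" is wherever the fetched pages were saved.
    for hit in search_pages("mirror", "bar"):
        print(hit)
```

Searching the dumped text rather than the raw HTML means a phrase split across tags can still match.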
--
Karsten M. Self [link|mailto:kmself@ix.netcom.com|kmself@ix.netcom.com]

What part of "gestalt" don't you understand?
Some links.
I now have a small arsenal of tools that, with some suitable and reasonably straightforward shell magic, basically do the job required of them across an arbitrary sized input space until done. The bulk of this message is much what Maggs asked for and I sent.

First of all, archives of my Static Page have been put together at [link|http://yceran.org/static/archived.html|http://yceran.org/s...rchived.html] so you can see what I was in the process of doing when the topic of backing up IWETHEY came up.

I started by saving the forum overview pages manually (there weren't really that many). Then I used [link|http://yceran.org/static/findthreads.icn|http://yceran.org/s...dthreads.icn] to read these and figure out the thread structure. Once that was done, I used [link|http://yceran.org/static/massfetch|http://yceran.org/static/massfetch] to get the messages. This requires the forum types be Legacy BBS, since that returns one message at a time. So far I've been changing them by hand (that may change, but Icon's networking is not reliable, I suspect) and I haven't changed them back yet. Actually fetching all the content took quite a few hours.

The script to fish out the real content is [link|http://yceran.org/static/dissectmessage.icn|http://yceran.org/s...tmessage.icn]. It took maybe half an hour on a P100 to run over all the messages I had (I estimate between a third and a half of the whole board).

Now I've run it for all forums, the next step is to create some static pages for all of them. I'm toying with the idea of doing HTML cleanup on the content which is by far a bigger task than any I've done so far. Fortunately, I have some Icon code I can plunder for more general HTML dissection. The big ogre, of course, is EZBoard's <BR> tags, but there are other evils to watch out for, too.
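
To give a flavour of what a <BR> cleanup rule might look like, here is a small Python sketch (not the Icon code mentioned above). It assumes the problem is that line and paragraph breaks are expressed purely as <BR> tags, which is a guess; `br_to_paragraphs` is a made-up name. Runs of two or more <BR> become paragraph boundaries and single ones are kept as line breaks.

```python
import re

# Matches <br>, <BR>, <br/> and <br /> in any letter case.
SINGLE_BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Two or more of the above in a row, allowing whitespace in between.
BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)


def br_to_paragraphs(fragment):
    """Wrap the pieces between runs of <BR> in <p>...</p>.

    Single <BR> tags inside a piece are normalised to lowercase <br>.
    Empty pieces (for example from leading or trailing runs) are dropped.
    """
    paragraphs = []
    for block in BR_RUN.split(fragment):
        block = SINGLE_BR.sub("<br>", block).strip()
        if block:
            paragraphs.append("<p>" + block + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(br_to_paragraphs("one<BR>two<BR><BR>three"))
```

A regex rule like this is blunt: it knows nothing about <BR> inside preformatted text or tables, so the output would still want checking against the sources.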

The file layout I've been working with is in [link|http://yceran.org/static/forum|http://yceran.org/static/forum]. The index files are the ones created from findthreads.icn and the forumname files are required for the newer forums that EZBoard gave different ids than names. :-/ As you can see, I haven't got any of the really large fora yet.

You will need Unicon [link|http://unicon.sourceforge.net|http://unicon.sourceforge.net] or Icon [link|http://www.cs.arizona.edu/icon|http://www.cs.arizona.edu/icon] if you want to compile the .icn files.

Wade.

"All around me are nothing but fakes
Come with me on the biggest fake of all!"

is demoronizer still useful?
It's been a while, but you used to run that perl script on Front Page stuff to clean it up.
thanx,
bill
Our bureaucracy and our laws have turned the world into a clean, safe work camp. We are raising a nation of slaves.
Chuck Palahniuk
Think last time I heard about that one..
I must have gotten the idea it was a diss of Le Moron
(who I note, still pops up in various places with his cheery little notes about the wonderfulness of all things with RunTogetherNames and a short half-life..) Hmm:

What's the half-life of a M$ product in milliseconds?

Might be fun to look up. Thanks,

A.
err RunTogetherNamesIsAnSNMPStandard and would make my life
AWholeLotSimplerAsDifferentOs'sHaveDifferentReservedWordsAndSymbolsWhileMs"$"
LikesToEmulateMacs,HavingSpacesOnANixPlatformIsANonoAsIFoundOutReplacingAnNtBox
WithLinuxForFtpServices<256 chars
thanx,
bill
Edited by boxley Aug. 20, 2001, 08:35:54 AM EDT
PleaseEditThatIntoACoupleShorterLines,Bill,AsItIsItMakesMySc
...reenScrollHorizontally.
Somewhat
Demoronizer is probably a precursor to tidy, though I haven't done an extensive comparison of the two. Demoronizer is largely aimed at broken Microsoft-generated HTML. I haven't checked whether tidy corrects the same problems, but it does address quite a few others.
--
Karsten M. Self [link|mailto:kmself@ix.netcom.com|kmself@ix.netcom.com]
     More EZ news - (wharris2) - (14)
         Maybe related (?) - (Ashton)
         Not just moz. - (inthane-chan)
         I'm surprised it works at all, really. - (static) - (10)
             Scary.. maybe inexorable.. so worse.___A solution? - (Ashton)
             Tidy - (kmself) - (8)
                 General laundering Q, please - (Ashton) - (7)
                     Automation - (kmself) - (1)
                         Some links. - (static)
                     is demoronizer still useful? - (boxley) - (4)
                         Think last time I heard about that one.. - (Ashton) - (2)
                             err RunTogetherNamesIsAnSNMPStandard and would make my life - (boxley) - (1)
                                 PleaseEditThatIntoACoupleShorterLines,Bill,AsItIsItMakesMySc - (CRConrad)
                         Somewhat - (kmself)
         Add NS 4.5 to EZ's blank-page intro.. Reload fixes, but -NT - (Ashton)