Post #5,732
8/18/01 7:47:48 AM

I'm surprised it works at all, really.
I've been dissecting EZBoard's HTML as part of the effort to back up EZBoard/IWETHEY. It emits some atrocious HTML - and I mean really, really awful output. Apparently they noticed someone (us?) "complaining" about the very hard to read HTML, and now there are lots of line-endings littered through it. Some of them are LFs. Some of them are CRs. Occasionally, you find both together. Big mess. Whatever it is, it upsets my upstream proxy enough that I can't use it - I get incomplete pages.
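A minimal sketch of taming the mixed line endings described above, assuming the page has already been fetched as raw bytes. The function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re


def count_line_endings(raw: bytes) -> dict:
    """Tally CRLF pairs, lone CRs and lone LFs in a raw page."""
    crlf = raw.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": raw.count(b"\r") - crlf,  # CRs not followed by an LF
        "LF": raw.count(b"\n") - crlf,  # LFs not preceded by a CR
    }


def normalize_newlines(raw: bytes) -> bytes:
    """Rewrite CRLF pairs and lone CRs as a single LF."""
    return re.sub(rb"\r\n?", b"\n", raw)
```

Counting first is only a diagnostic: it shows how mixed a given page really is before anything gets rewritten.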
I also found the Community Chest anchor isn't closed. There's an <A> tag, but it is often missing its </A> tag. Well, it was in the pages I dissected. Also, they still haven't managed or bothered to call their advertisers to heel. They still inflict an assorted mishmash of incorrect JavaScript and/or IFRAMES and so on.
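One way to spot anchors like that mechanically is to count opens against closes with Python's standard html.parser module. This is only a sketch: the class and function names are hypothetical, and it treats every <A> as needing a matching </A>, which is a simplification.

```python
from html.parser import HTMLParser


class AnchorBalance(HTMLParser):
    """Track <a> tags that never get a matching </a>."""

    def __init__(self):
        super().__init__()
        self.open_anchors = []  # (line, column) of each <a> still open
        self.stray_closes = []  # (line, column) of each </a> with no open <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> arrives here as "a".
        if tag == "a":
            self.open_anchors.append(self.getpos())

    def handle_endtag(self, tag):
        if tag == "a":
            if self.open_anchors:
                self.open_anchors.pop()
            else:
                self.stray_closes.append(self.getpos())


def unclosed_anchors(html: str) -> list:
    """Return the positions of <a> tags left open at the end of the page."""
    parser = AnchorBalance()
    parser.feed(html)
    parser.close()
    return parser.open_anchors
```

An empty list means every anchor was closed; anything else gives a line and column to go look at.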
Post #5,770
8/18/01 3:53:10 PM

Scary.. maybe inexorable.. so worse.___A solution?
You know my and Confucius's views on er mangling Language.. But human Language is indeed a complex, never perfectly specific, ever-growing nexus - we understand why translation is imperfect, is an art across the ages.
HTML OTOH - (somewhat like Sanskrit, its origin being intentional too, so as to enable some exchange of ineffable material a bit more successfully) - has *Standards* and deals exclusively in Boolean relationships: NOT ever - matters of 'opinion' and comfort with, "wouldn't it be nice if ___ were true, so it could accord with my personal hopes?"
Perhaps the atrocity known as This site best viewed with Web-Monopoly 5.77alpha is the stark indicator of the cost of intentionally- commercially-corrupted (mere HTML)-Language ? A red flag er Code Red, to coin a phrase.
Are we collectively (heh) Smart enough yet, to see the parallel of intentional corruption for Profit being congruent to, Goebbels's intentional corruption / named Propaganda / for the accretion of Power ??
I see these as perfectly cross-linked processes - much like Cantor's Alephs- 'nondenumerable infinities' (null and higher classes): 1:1 correlations across the board.
I suppose the next level of intentional incompatibilities shall have to run the course all the way to near inability to use a large chunk of 'the web' - before any concerted actually serious movement begins, to recreate and enforce actual Language standards.
Standards which demand compliance via - er BITE ('built-in-test-equipment', in electronics): if your header doesn't indicate passage of an imbedded quick-scan ~~ CRC test for Compliance within the body -- "[our browser] will not deal with your content and will send message to your server re Why". (Later on, send dup message to Org below..)
ARE we Smart enough. Yet?
A. Vanchau's boys aren't, but they have Lots of company already - no?
Web-Nazi Compliance © 2001 Skinheads for Harmony Treblinka Division
Post #5,771
8/18/01 5:30:30 PM

I ran tidy (W3C's HTML validator/cleaner) across their stuff a few times. While EZ's not quite as bad as, say, MSWord output (tidy gives up and dies), there's extensive handholding required.
My suggestion would be to snarf the pages raw, strip all LF/CR characters, run through tidy, clean up its errors, and then do a pass to trim and wrap the HTML. This is pretty close to what I did with the IWE forums, though their HTML was in far better shape (hard to believe, but I'll give them that).
Semi-automated cleanup via some sed and awk scripts (one-offs, aimed at the particular problems found) let me cut through the 125,000+ posts in about a day. Processing time itself was a significant factor. Idea was to run tidy on a few posts, identify problems, then code sed/awk to clean those up wherever found (natch: a find command would locate and dump affected posts). You're relying on tidy to spot problems initially, not to fix them for you.
You might also use fold to run a first-approximation of linefolding, though you want to make sure it doesn't split lines w/o any whitespace breaks.
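For comparison, a rough Python stand-in for the fold step, using the standard textwrap module with long-word breaking switched off so a token without whitespace is never split. The name fold_safely and the default width of 72 are assumptions, and this reflows a single paragraph of text; it knows nothing about <PRE> blocks or attribute values.

```python
import textwrap


def fold_safely(text: str, width: int = 72) -> str:
    """Wrap text at whitespace only; over-long tokens get a line to themselves."""
    return textwrap.fill(
        text,
        width=width,
        break_long_words=False,  # never cut a token that has no whitespace
        break_on_hyphens=False,  # do not split at hyphens inside a token
    )
```

A line may still end up longer than the width when one unbroken token is itself too long, which is the behaviour wanted here.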
Post #5,776
8/18/01 6:56:10 PM

General laundering Q, please
I almost understand the processes you mentioned. (Meaning only - you said it clearly enough that I can imagine the method) Unclear is the degree of automation which is today practical re such an aim (?) I appreciate that, what Wade is doing is a labor of love, needs a thought-out scheme and lots of hand massaging..
Given the proliferation of Front Page? and other M$-spawned egregious sites, other causes of mangled HTML:
Does it seem possible (well.. likely then?) that a 'home laundry' app just might process an errant site, where you want to find something allegedly there? I mean, say - linked to such via a query ongoing, so you need to use the site somehow.
Natch I cannot guess just How-bad some of these are - but "post-processing", a la running Babelfish on an entire .doc or page - sometimes does elicit useful stuff. (This especially when I'm looking for tech info for European test equipment, find a link to someone's ~ problem)
Once.. in Czech! :-) Friend translated..
Pipe dream?
Post #5,786
8/18/01 8:21:15 PM

Automation comes down to a matter of identifying major issues, then coding up a few lines of something to deal with them. This continues to the point that tidy decides it wants to deal with the remaining breakage itself. Breakage is largely unmatched or illegal (unrecognized) tags. Generating a list of affected files, I'll skip through them or filter via grep and identify a typical pattern to the breakage and see what a good fix rule will be. There's a small risk of fouling things up, but you've got sources to refer back to.
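As an illustration of the "identify a typical pattern" step, here is a small sketch that tallies tag names outside a known set, so the most common unrecognized tags surface first. KNOWN_TAGS is a deliberately tiny, illustrative subset rather than a real HTML vocabulary, and the regex is a crude tag spotter, not a parser. This is not the sed/awk approach described above, just the same idea in Python.

```python
import re
from collections import Counter

# Illustrative subset only; a real run would need a fuller tag list.
KNOWN_TAGS = {
    "html", "head", "title", "body", "p", "br", "a", "b", "i",
    "table", "tr", "td", "th", "font", "img", "hr", "pre",
    "ul", "ol", "li", "div", "span",
}

# Matches the name of any opening or closing tag, e.g. <FOO ...> or </foo>.
TAG_RX = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def unknown_tag_counts(html: str) -> Counter:
    """Count occurrences of tag names that are not in KNOWN_TAGS."""
    counts = Counter(m.group(1).lower() for m in TAG_RX.finditer(html))
    return Counter({tag: n for tag, n in counts.items() if tag not in KNOWN_TAGS})
```

Calling most_common() on the result gives a ranked list of candidates for a fix rule.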
I think I ran through a dozen or so different scripts on the small percentage (a few thousand, of the 120k) of IWE posts that were broken. Much faster than hand-editing the posts. And the results mostly seem to work -- I hunt through the archive periodically, and haven't found any pages that are grossly mangled. The EZ stuff is likely going to be more difficult, though I think the volume is smaller.
No, this isn't a generalized system, it's an assisted process. Hence my description of "semi-auto". There's still a large bit of assess, diagnose, and apply remedy involved.
WRT processing time, I think the main tidy run through the IWE archive was about 8 hours on my laptop (600 MHz, 128MB, 20GB IDE).
For what you seem to be asking -- how do you find crap in a busted page -- I'd dump page to text and use editing tools with a good search function (eg: vim). For a more comprehensive site, snarf the pages (wget, snarf, etc.), and run a find foo | xargs grep bar to isolate content. I've actually done same. Pretty sad commentary on site research support.
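The find foo | xargs grep bar idea can also be sketched in Python for anyone without those tools to hand. This assumes the pages were already saved under a local directory; the function name grep_tree, the "*.html" default and the latin-1 decoding (chosen because it never fails on stray bytes) are all assumptions for illustration.

```python
import re
from pathlib import Path


def grep_tree(root, pattern, glob="*.html"):
    """Return (path, line number, line) for every matching line under root."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        # latin-1 maps every byte to a character, so broken pages still load.
        text = path.read_text(encoding="latin-1")
        for lineno, line in enumerate(text.splitlines(), 1):
            if rx.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

For example, grep_tree("mirror", "Community Chest") would list every saved page under a hypothetical mirror directory that mentions the phrase.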
Post #5,932
8/19/01 11:55:25 PM

Some links.
I now have a small arsenal of tools that, with some suitable and reasonably straightforward shell magic, basically do the job required of them across an arbitrary sized input space until done. The bulk of this message is much what Maggs asked for and I sent.
First of all, archives of my Static Page have been put together at [link||] so you can see what I was in the process of doing when the topic of backing up IWETHEY came up.
I started by saving the forum overview pages manually (there weren't really that many). Then I was using [link||[link||]] to read these to figure out the thread structure. Once that was done, I used [link||[link||]] to get the messages. This requires the forum types be Legacy BBS, since that returns one message at a time. So far I've been changing them by hand (that may change, but Icon's networking is not reliable, I suspect) and I haven't changed them back yet. Actually fetching all the content took quite a few hours.
The script to fish out the real content is [link||[link||]]. It took maybe half an hour on a P100 to run over all the messages I had (I estimate between a third and a half of the whole board).
Now I've run it for all forums, the next step is to create some static pages for all of them. I'm toying with the idea of doing HTML cleanup on the content which is by far a bigger task than any I've done so far. Fortunately, I have some Icon code I can plunder for more general HTML dissection. The big ogre, of course, is EZBoard's <BR> tags, but there are other evils to watch out for, too.
The file layout I've been working with is in [link||[link||]]. The index files are the ones created from findthreads.icn and the forumname files are required for the newer forums that EZBoard gave different ids than names. :-/ As you can see, I haven't got any of the really large fora yet.
You will need Unicon [link||] or Icon [link||] if you want to compile the .icn files.
Post #5,791
8/18/01 10:34:19 PM

is demoronizer still useful?
been a while but you used to run that perl script on front page stuff to clean it up. thanx, bill
Post #5,799
8/18/01 11:19:29 PM

Think last time I heard about that one..
I must have gotten the idea it was a diss of Le Moron :-) (who I note, still pops up in various places with his cheery little notes about the wonderfulness of all things with RunTogetherNames and a short half-life..) Hmm:
What's the half-life of a M$ product in milliseconds?
Might be fun to look up. Thanks,
Post #5,801
8/19/01 12:18:08 AM
8/20/01 8:35:54 AM

err RunTogetherNamesIsAnSNMPStandard and would make my life
AWholeLotSimplerAsDifferentOs'sHaveDifferentReservedWordsAndSymbolsWhileMs"$" LikesToEmulateMacs,HavingSpacesOnANixPlatformIsANonoAsIFoundOutReplacingAnNtBox WithLinuxForFtpServices<256 chars thanx, bill
Edited by boxley
Aug. 20, 2001, 08:35:54 AM EDT
Post #5,850
8/19/01 3:59:36 PM

Post #5,951
8/20/01 6:17:57 AM

Demoronizer is probably a precursor to tidy, though I haven't done an extensive comparison of the two. Demoronizer is largely aimed at broken Microsoft-generated HTML. I haven't checked to see that tidy corrects the same problems, but it does address quite a few others.
