Lord give me strength
I've spent the last 5 days diagnosing a failure of a particular document. I use xml/fo to generate Word documents. Since the only code that exists to do this is in Java, I run a Tomcat server that takes an HTTP POST from the Seaside server and returns a Word document. It worked fine until this one case pushed it over the edge.

First, Tomcat has a 2 MB default limit on POST sizes (the connector's maxPostSize setting) that had to be configured around. Then I ran into decoding issues, so I pitched Tomcat and wrote a bog simple Java server based on a server skeleton I found. I can now echo back what I send, regardless of size.

However, the code that does the translation is the next problem. This is all I get (the tomcat environment was swallowing this for some reason).
org.xml.sax.SAXParseException: character not allowed
at com.jclark.xml.sax.SAX2Driver.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
at com.xmlmind.fo.converter.o.if(Unknown Source)
at com.xmlmind.fo.converter.Driver.B(Unknown Source)
at com.xmlmind.fo.converter.Driver.main(Unknown Source)

Lovely. So which character out of the 1.2 million of them I have in this document might be the problem? I try the parser in Squeak. It likes the document fine. I scan the document for control characters and illegal UTF-8 sequences, but I find nothing.
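
A scan like that can be sketched in a few lines of Python. This is only an illustration, not the code actually used here; the function names are hypothetical. It assumes the document is meant to be UTF-8 and checks two things: where strict decoding first fails, and which decoded characters fall outside the XML 1.0 Char production.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def first_bad_utf8(data: bytes):
    """Return (byte_offset, reason) for the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


def illegal_xml_chars(text: str):
    """Return (char_offset, 'U+XXXX') for every character XML 1.0 forbids."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in _XML_ILLEGAL.finditer(text)]
```

Note that Python's strict UTF-8 decoder rejects encoded surrogates (for example the bytes ED A0 80), which more lenient decoders may let through.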

So I set out to download the source to this xml parser and run it in a debugger to try to figure it out. But I can't find all the pieces. There are apparently 472 xml parsers in the Java world, along with 27 abstract interfaces to allow you to mix and match. Don't Java programmers have anything better to do than to write xml parsers? Why would anyone care enough to choose one over another? Each parser has about 6 versions, all mutually incompatible. Some require Java 1.5, some 1.4, some are fine back to 1.2. Yet it is only the Java parser that complains about the document; all the other parsers I have access to like it fine.

Keeeeyyyyrrrriiiiissssst on toast! Tips for where I can just download the source code to the offending xml parser would be appreciated.



[link|http://www.blackbagops.net|Black Bag Operations Log]

[link|http://www.objectiveclips.com|Artificial Intelligence]

[link|http://www.badpage.info/seaside/html|Scrutinizer]
Dunno.
It sounds very painful.

Please pardon my ignorance, but why did you have to go through all these contortions to generate a Word document from XML? Aren't there Python or similar tools that will do what you need? Are the capabilities of XSL/FO so much better than the other tools? (I realize that the tool was working fine until you hit this particular pothole.)

I suspect (as it seems you found) it's not a character in the file that's causing problems, but rather something in the parser.

Does [link|http://www.javaworld.com/javaworld/jw-05-2002/jw-0517-sax.html?|this] help? It talks about generating the source for a SAX parser.

HTH a bit. Good luck!

Cheers,
Scott.
XML/FO? What's that?
"eXtensible Markup Language / Fuck Off"?

Perhaps not the most fortunate name I've ever seen or heard of...


   [link|mailto:MyUserId@MyISP.CountryCode|Christian R. Conrad]
(I live in Finland, and my e-mail in-box is at the Saunalahti company.)
Ah, the Germans: Masters of Convoluted Simplification. — [link|http://www.thetruthaboutcars.com/?p=1603|Jehovah]
Formatting Objects
You describe rich text documents using the xml/fo dialect, and then there are renderers that will render a decent approximation in RTF (which users think is a MS turd document), PDF (the Apache FOP project), or HTML.




Re: Lord give me strength

Don't Java programmers have anything better to do than to write xml parsers? Why would anyone care enough to choose one over another?

Not really ;)

[link|http://www.cafeconleche.org/XOM/whatswrong/text0.html|Here's] one guy's slides of why he's inventing yet another XML API for Java, including some ranting about why he doesn't like any of the eight trillion ones written so far.

--
You cooin' with my bird?
Edited by ubernostrum Dec. 1, 2006, 12:06:13 PM EST
They're like the scorpion in the parable: It's their nature
Don't Java programmers have anything better to do than to write xml parsers?


No. It's what they do. C++ programmers write STL template instances... Java programmers write XML parsers. It's just their nature....
jb4
"When the final history is written in Iraq, [link|http://images.ucomics.com/comics/tmate/2006/tmate060926.gif|it'll look just like a comma.]"
George W. Bush, 24 Sep 06
Edited by jb4 Dec. 1, 2006, 12:45:04 PM EST
Solved it
The parser in question is the XP parser by Jim Clark of expat fame.

Why they are using it, I have no idea. The readme with the parser describes error reporting as "brutal" and this is true. But it is a smallish piece of code and it didn't take me too long to thoroughly instrument it to report exactly where the error is. My respect for Clark went up several notches when I looked up the offending character at unicode.org and found it to be in the unassigned high surrogates range. His parser was the only thing that detected it and cared about whether the characters were assigned. Java's String object, when initialized from the bytes, was happy with it, as were several other chunks of code that read UTF-8. Only Clark's parser cared.

Anyhow, now I have a tool for finding these things quickly.
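
A tool of that sort could look roughly like the Python sketch below. It is not the instrumented XP parser described above, just an assumed stand-in: it decodes with the `surrogatepass` error handler so encoded surrogates survive, then reports line, column, and code point for anything Unicode classifies as unassigned (Cn), surrogate (Cs), or private use (Co). The names `SUSPECT` and `suspect_chars` are hypothetical.

```python
import unicodedata

# General categories a strict parser is likely to object to (an assumption,
# not a list taken from the XP parser itself).
SUSPECT = {"Cn": "unassigned", "Cs": "surrogate", "Co": "private use"}


def suspect_chars(text: str):
    """Return (line, column, 'U+XXXX', kind) for each suspicious character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            kind = SUSPECT.get(unicodedata.category(ch))
            if kind:
                hits.append((line_no, col, f"U+{ord(ch):04X}", kind))
    return hits


if __name__ == "__main__":
    # "surrogatepass" keeps UTF-8-encoded surrogates instead of raising.
    raw = b"ok\nbad\xed\xa0\x80"
    for hit in suspect_chars(raw.decode("utf-8", errors="surrogatepass")):
        print(hit)
```

Which code points count as unassigned depends on the Unicode version bundled with the Python build, so results can differ slightly between releases.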

How did such a byte sequence get there in the first place? Copied from MS Word into Internet Explorer. It is apparently an MS-specific byte sequence that wasn't properly translated to Unicode on paste. You knew MS technology was at the bottom of this. Don't even get me started on "stupid quotes".



Excellent. Drop him a line if you get a chance.
     Lord give me strength - (tuberculosis) - (7)
         Dunno. - (Another Scott)
         XML/FO? What's that? - (CRConrad) - (1)
             Formatting Objects - (tuberculosis)
         Re: Lord give me strength - (ubernostrum)
         They're like the scorpion in the parable: It's their nature - (jb4)
         Solved it - (tuberculosis) - (1)
             Excellent. Drop him a line if you get a chance. -NT - (Another Scott)
