Solved it

The parser in question is the XP parser by James Clark of expat fame.

Why they are using it, I have no idea. The readme that comes with the parser describes its error reporting as "brutal", and that is true. But it is a smallish piece of code, and it didn't take me long to instrument it thoroughly enough to report exactly where the error is. My respect for Clark went up several notches when I looked up the offending character at unicode.org and found it to be in the unassigned high surrogates range. His parser was the only thing that detected it and cared whether the character was assigned. Java's String object, when initialized from the bytes, was happy with it, as were several other chunks of code that read UTF-8. Only Clark's parser cared.

Anyhow, now I have a tool for finding these things quickly.
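For anyone who wants something similar without instrumenting a parser, here is a rough sketch of the idea in Python. This is not the instrumented XP code, just an illustration that leans on Python's strict UTF-8 decoder, which rejects UTF-8-encoded surrogates. The function name find_bad_utf8 and the shape of the returned tuples are made up for this example.

```python
def find_bad_utf8(data: bytes):
    """Return (byte_offset, line_number, offending_bytes, note) for each
    sequence that Python's strict UTF-8 decoder rejects.

    Hypothetical helper for illustration only.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")  # strict error handling is the default
            break  # the rest of the input is clean
        except UnicodeDecodeError as err:
            start = pos + err.start
            # 1-based line number, counting newlines before the bad byte
            line = data.count(b"\n", 0, start) + 1
            chunk = data[start:start + 3]
            try:
                # "surrogatepass" accepts only encoded surrogates; any other
                # malformed sequence still raises UnicodeDecodeError here.
                ch = chunk.decode("utf-8", "surrogatepass")
                note = "encoded surrogate U+%04X" % ord(ch)
                end = start + 3
            except UnicodeDecodeError:
                note = err.reason
                end = pos + err.end
            problems.append((start, line, data[start:end], note))
            pos = end  # resume scanning after the offending bytes
    return problems


if __name__ == "__main__":
    sample = b"ok\n" + b"bad \xed\xa0\x80 here"
    for offset, line, raw, note in find_bad_utf8(sample):
        print("offset %d, line %d: %r (%s)" % (offset, line, raw, note))
```

It re-decodes the tail of the input after every hit, so it is fine for poking at a single document but is not tuned for large files full of garbage.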

How did such a byte sequence get there in the first place? It was copied from MS Word into Internet Explorer; it is apparently an MS-specific byte sequence that wasn't properly translated to Unicode on paste. You knew MS technology was at the bottom of this. Don't even get me started on "stupid quotes".

Welcome to IWETHEY!