Character encodings and font sets are very ugly in HTML, as the early versions were an English-centric de facto standard. XHTML cleans up most of these issues. Web browsers have to use some guesswork and follow some unwritten conventions to handle these issues in HTML; in many cases the browser has to peek at the page content and try to figure out what the encoding is. In practice, in HTML, iso-8859-1 is the default if the browser can't find anything else.
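To take the guesswork out of it, the encoding can be declared explicitly, either in the HTTP response header or in the markup itself; a minimal example of both:

  Content-Type: text/html; charset=iso-8859-1

  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

When both are present, the header takes precedence over the meta tag.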
Actually, XHTML makes it more complicated. If you serve XHTML without sending the charset parameter in your Content-Type header, then the MIME type of the document can determine the character encoding to be used for parsing. If you sent text/html or text/xml, a conforming parser must assume that the document is encoded in us-ascii, no matter what you've specified inside it; you have [link|http://www.ietf.org/rfc/rfc3023.txt|RFC 3023] and the legacy of text/* media types and transcoding proxies to thank for that.
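Concretely, that means a response like this illustrative one, where the charset parameter is missing:

  Content-Type: text/xml

  <?xml version="1.0" encoding="utf-8"?>
  <résumé>...</résumé>

must still be parsed as us-ascii per RFC 3023: the encoding declaration inside the document is ignored, and the non-ASCII bytes in the element name are no longer well-formed input.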
If you sent application/xhtml+xml or application/xml, then the receiving parser is allowed to read the XML prolog inside the file, but if that's not present then the character set must be assumed to be utf-8 or utf-16, depending on whether the document begins with a byte-order mark; you're not allowed to look at the meta tags for this information.
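So with the application/* types (and no charset parameter), the prolog is the place to declare a legacy encoding; a sketch:

  Content-Type: application/xml

  <?xml version="1.0" encoding="iso-8859-1"?>
  <doc>...</doc>

Drop the encoding declaration from the prolog and the parser falls back to utf-8, or utf-16 if the file opens with a byte-order mark.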
This is one of the reasons why Mark Pilgrim claimed [link|http://www.xml.com/pub/a/2004/07/21/dive.html|XML on the web has failed], and it's only the tip of the iceberg as far as character-encoding issues in HTML, XHTML and XML are concerned.