Text is any combination of the characters that we write with.

Unfortunately computers don't store those characters directly - they only store bits - so we need a way to represent them internally.

What we need to do is find a way to encode those characters into bits, and then another way to decode those bits back into an idea of what the characters are. We further need a way to draw those characters on an output device (screen, paper, etc).
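
In Perl, for instance, that round trip looks roughly like this (a minimal sketch using the core Encode module; UTF-8 is just one of many possible encodings):

    use Encode qw(encode decode);

    my $characters = "caf\x{e9}";                   # the abstract characters c, a, f, e-acute
    my $bits       = encode("UTF-8", $characters);  # characters -> bytes, to store or send
    my $back       = decode("UTF-8", $bits);        # bytes -> characters, to work with again
    print "round trip ok\n" if $back eq $characters;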

None of this is as simple as it seems. Ideally it should look simple: users just type what they want and see it appear, and they can send documents to other users who can read them. However programmers are likely to need to know something more about the details than that.

How much more? Well that article is Joel's attempt to say what he thinks programmers absolutely need to know. If you want to learn it, read Joel's article. :-)

Another attempt at talking about just what is involved in a string can be found at [link|http://www.sidhe.org/~dan/blog/archives/000255.html|http://www.sidhe.org...hives/000255.html].

Anyways, let me address your questions in no particular order.

  • Where can I learn the minimum that Joel recommends at [link|http://www.joelonsoftware.com/articles/Unicode.html|http://www.joelonsof...cles/Unicode.html]? You can learn it from [link|http://www.joelonsoftware.com/articles/Unicode.html|http://www.joelonsof...cles/Unicode.html].

  • I am typing text in this forum, who receives the keyboard event and translates it to the right character on the screen, is it the kernel or mozilla? I don't know what operating system you're using, so there is no way to give a full, definitive answer. But the preliminary one is that the operating system is responsible for receiving the keyboard input and deciding where it goes. Whatever receives it may delegate it further (on Linux, X will get it and decide which application gets it; if that application is Mozilla, then the application decides which part of the application - eg which text box - gets the input). The application receives that input and decides what it means internally. The application then decides to draw the information and makes library calls, some of which go back out to the operating system, which hopefully knows what to do with them. (I'm defining "operating system" loosely here. On Linux, for instance, the kernel doesn't actually take care of this. Instead various higher-level processes, such as a font server, decide the nitty-gritty details.)

  • Do all the users of this forum type in the same character set? No. We have some European types who undoubtedly use local character sets.

  • Does Unicode use some sort of trick to read ASCII? No, and sorta. The Unicode standard says nothing about how to cooperate with existing ASCII data. It just maps integers to text. But there are many ways of representing those integers in bits and bytes, and one of those ways (namely UTF-8) was designed so that the characters that you have in ASCII are all represented by the same bytes in that representation. Of course UTF-8 can also represent many characters that were never standardized in ASCII. (There is a short Perl sketch of this after the list.)

  • What the heck is ASCII? ASCII is an agreed-on standard for how to turn the numbers from 0-127 into characters. Those characters are good enough for anything that you want to do with English, but won't include, for instance, accented characters that you might find in French, German or Spanish. Most representations of text adhere to this standard.

  • Why is ASCII not an option in the Mozilla dropdown for character sets? Because ASCII would be a useless option. Many character sets agree on the part that ASCII specifies, and ASCII doesn't provide any of the extra characters that others do.

  • Why didn't my C programs worry about this? Because the ideas of character sets, languages, and so on are higher-level abstractions, and C tends to be a very low-level language. Furthermore you were probably working with programs that only had to deal with US input (or failing that, input on a system where someone else was worrying about locale), so your programs didn't have to think about what the text meant, just how many bytes were in the string. Higher-level languages generally try to provide higher-level abstractions, and programmers today are more likely to have to deal with strange text, so they need to deal with the complications that C ignores. (The character-versus-byte sketch after the list shows the distinction.)

  • Why don't programming books mention the sorting issues that Dan Sugalski mentions? There are programming books that do. But introductory ones oversimplify. Also there isn't a single generic way to handle this problem, and programming books like to bring up questions that they have good answers to.

  • Tell me about a language that understands different character sets! Many do. You just have to look. For instance for Perl see [link|http://perldoc.perldrunks.org/perllocale.html|http://perldoc.perld...g/perllocale.html] for locale issues, and [link|http://perldoc.perldrunks.org/perluniintro.html|http://perldoc.perld...perluniintro.html] + [link|http://perldoc.perldrunks.org/perlunicode.html|http://perldoc.perld.../perlunicode.html] for information on Perl's Unicode support. (There is a small taste of that after the list.)

  • Can I write a single text file, using different char sets? different encodings? No. The character set/encoding says what various bits and bytes will mean. The file just contains the bits and bytes. The file is data; the character set is metadata. You need to combine them to get meaning. If you're using Windows, there is a simple way to experience this first hand: GUI applications use a different code page than DOS ones. Cut-and-paste tries to keep the glyphs the same, even if it means changing the bits and bytes. But if you save to a file and read that file from a different program, the bits and bytes stay the same, and the characters can change. Try this. Paste é into Wordpad, save it, and then look at the file you saved in DOS. It won't be é any more. Try to paste it into a DOS-based editor, save it, then open it in Wordpad. It will be something else again. (The decode sketch after the list shows the same effect in miniature.)

  • What does it mean to say that Ruby doesn't do Unicode? Well, Ruby sort of does Unicode. You can, for instance, use the Jcode library in Ruby and you'll have Unicode support. You can read Unicode, write Unicode, search Unicode strings for Unicode substrings, do pattern matching against Unicode, and so on. However some things either won't work or will fall back to a stupid default. For instance Ruby doesn't do well when it comes time to sort Unicode. (Unicode does not address the sorting issue - indeed no single answer is possible, since different languages with the same extra characters sort them differently.)

  • And why utf-8 in particular, why would windows-1256 be okay but not utf-8? There's a generic answer that I'll give. The operating system makes windows-1256 trivial - there are standard libraries with the facilities to do basic things like tell you which string is greater than another. Just load the right library, tell it your locale, and pass it the strings. By contrast UTF-8 is much harder: it forces you to change certain things about how you handle strings (eg one character might no longer be one byte), and common operations that people want are not necessarily available (like saying how to order strings). (The collation sketch after the list shows both problems.)

  • can ruby, written in c, be able to handle char sets that c cant? Ruby offers abstractions that, by default, C does not. However obviously there is nothing that you can do with Ruby that you can't do in C using the right libraries. (But even with the right libraries, it might still be much easier to do in Ruby than C.)
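
To make the UTF-8 point concrete (the "Does Unicode use some sort of trick to read ASCII?" question), here is a minimal Perl sketch using the core Encode module; the byte values are the interesting part:

    use Encode qw(encode);

    # ASCII characters keep their byte values when encoded as UTF-8.
    my $ascii = "Hello";
    my $utf8  = encode("UTF-8", $ascii);
    print "ASCII text keeps the same bytes in UTF-8\n" if $utf8 eq $ascii;

    # Characters outside ASCII take more than one byte.
    my $eacute = encode("UTF-8", "\x{e9}");                         # e-acute
    printf "e-acute becomes %d bytes: %vX\n", length($eacute), $eacute;   # 2 bytes: C3.A9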
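
The character-versus-byte distinction from the C question, in miniature (another small Perl sketch; "café" is just an arbitrary example word):

    use utf8;                    # this source file itself is written in UTF-8
    use Encode qw(encode);

    my $word = "café";
    print length($word), "\n";                   # 4 - characters, as a high-level language sees it
    print length(encode("UTF-8", $word)), "\n";  # 5 - bytes, as C's strlen would count them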
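
And a small taste of what those Perl docs cover; a sketch assuming Perl 5.8 or later and a terminal that can display UTF-8 ("naïve café" is just a test string):

    use utf8;                               # the string literals below are UTF-8
    binmode(STDOUT, ":encoding(UTF-8)");    # and so is the output

    my $s = "naïve café";
    print uc($s), "\n";                     # NAÏVE CAFÉ - the case rules understand the accents
    print "matched\n" if $s =~ /caf\w/;     # \w matches the accented é too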
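
The decode sketch: the Wordpad/DOS experiment in miniature. The same byte on disk becomes two different characters depending on which character set you decode it with (cp437 is the old DOS code page, iso-8859-1 a common Western one; both are supported by the core Encode module):

    use Encode qw(decode);
    binmode(STDOUT, ":encoding(UTF-8)");       # so the terminal can display both results

    my $byte = "\xE9";                         # one and the same byte...
    print decode("iso-8859-1", $byte), "\n";   # ...is é under Latin-1
    print decode("cp437",      $byte), "\n";   # ...but Θ under the DOS code page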
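
Finally, the collation sketch. A plain sort compares code points, which puts accented characters after "z", and you need something like Unicode::Collate (in the Perl core since 5.8) to get a sensible order:

    use utf8;
    binmode(STDOUT, ":encoding(UTF-8)");
    use Unicode::Collate;

    my @words = ("église", "zebra", "apple");
    print join(" ", sort @words), "\n";                          # apple zebra église - é sorts after z
    print join(" ", Unicode::Collate->new->sort(@words)), "\n";  # apple église zebra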



Cheers,
Ben