New What the heck is text?
Okay, this is kind of premature
I know I could probably learn more from Google before asking.

But I've had this question on my mind for quite some time now, and
I just don't feel like googling, I want a more direct answer.

Some of the doubts/questions/worries might be mythical, I am not even sure

Okay, so let's stop excusing ourselves and let's start asking...

What the heck is text?
What the heck is the minimum that Joel means in this [link|http://www.joelonsoftware.com/articles/Unicode.html|article]?
Where can I learn it? I can't find a book that addresses this issue directly and solely.

The problem is, wherever I write text, I usually have no clue what character set is being used, that's why I am kind of unable to correctly solve character set related problems.

I want a direct, to the point, as technical as clearly possible explanation.

I am typing text in this forum. Who receives the keyboard event and translates it to the right character on the screen? Is it the kernel or Mozilla?

Do all the users of this forum type in the same character set? I don't think so, yet to view this forum we all (I am only guessing) tell our browsers to open it in the same character set.

Okay, probably not either... Some parts of this screen read the same no matter which character set is assumed, which implies (I think intentionally) that many character sets have a lot in common.

Okay, I won't lie, I read something like this: the first 128 characters (the first 7 bits) are common to many character sets, but the second 128 characters differ.

I think even Unicode uses some trick to read ASCII chars.

Okay surprise question?
What the heck is ASCII?

When I click View -> Character Coding -> in Mozilla, none of the available
options is called ASCII; they are all called UTF-* or ISO-* or IBM-* etc.

Does this mean that no one nowadays uses pure ASCII, and all of those are character sets that may or may not be ASCII compatible (same code for the first 128 chars)?

Another problem banging around in my head: when I used to write those silly ...
scanf, printf programs in C, it didn't seem that the compiler bothered
about the character set.

According to this link, [link|http://www.sidhe.org/~dan/blog/archives/000255.html|what the heck is a string],
each string must have an encoding, char set, and language attached to it. So why didn't printf in C ask me about all this? Why doesn't the printf variation in any programming language ask me about this? Even more puzzling are the string comparison procedures: they never seem to recognize the existence of different character sets. When I learned the string compare methods in many languages, they all seemed to imply that all strings are equal (of the same type), but Dan suggests differently: a Japanese string is of a different type than a Chinese string. We need to compare strings to sort a list alphabetically, and logically it is not feasible to sort a list of names from different languages, yet programming books never even mention such issues!

Do you know of a language that insists on asking? Please let me know, I would like to read its docs.

I am guessing defaults or, even worse, hard coded values are in place, so let me be the first to say it: "Default values are considered harmful"

Can I write a single text file, using different char sets? different encodings?

Again, this is very important: which program handles which character set the keyboard is sending or using? Who tells the keyboard ...
now when the user types 'A', send 1001111010101011 instead of 01011010 (for example)? And who tells a simple text editor like, say, Notepad to now use windows-1256 or now use utf-8? Or do things like Notepad use some default value?

I heard that Ruby doesn't do Unicode (regardless of the truth of this). What the heck does that mean? Will the string processing library in Ruby raise errors if I give it a string written in Unicode? Or would the Ruby interpreter complain if I send it a script written in utf-8?

And why utf-8 in particular, why would windows-1256 be okay but not utf-8?
Isn't Ruby written in C? Doesn't C know utf-8?

Can Ruby, written in C, handle char sets that C can't?

Does Linux have default char set values? Why?
Let's put it differently: does a system have a global char set? Why?
In the case of Linux, where?

I know I asked many incoherent questions.
I know I might be misusing this forum, but I needed to ask them out loud, so that I can eventually try to answer them.
Actually it is sometimes enough to state your problem out loud and you find yourself able to see the answer right away.

If you think I asked questions worth answering, please do ...
New It depends on the context.
The meaning of "text" depends on the context. For example, if you're doing SMS stuff - "text messaging" using the "Short Message Service" - on a mobile phone, you're limited to whatever character encoding is supported by the system and the phone. It may not be Unicode, but might be if you have a [link|http://people.netscape.com/ftang/paper/SMS_and_Unicode.html|GSM] phone.

Saying everyone needs to understand Unicode and encoding and such is simplistic because it assumes every "Software Developer" is programming Windows or Web stuff that needs multilanguage support.

The meaning depends on the context.

It's my understanding that Unicode includes ASCII as a subset.

A C program doesn't care about character sets; the compiler assumes ASCII. At least it did. Some discussion about extending gcc for 2-byte Unicode support is [link|http://mail.nl.linux.org/linux-utf8/2000-08/msg00101.html|here].

A text file in a computer context doesn't exist on its own. It's a set of bytes on some form of storage media. If the file is to present non-ASCII characters to a program or a person, then it has to have a way of representing non-ASCII characters to the program or person, so an encoding method must be indicated. But then I wouldn't call it "text" myself - without any qualification, I assume "text" means ASCII.

That's my take, anyway.

HTH.

Cheers,
Scott.
New Unicode and ASCII
It's my understanding that Unicode includes ASCII as a subset.
Nitpick - A particular variable-length encoding of Unicode (UTF-8, using 1 to 4 8-bit bytes as restricted by RFC 3629; the original design allowed up to 6) is compatible with the 7-bit encoding of ASCII when only characters from the 7-bit ASCII set are used.
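The compatibility claimed in this nitpick can be checked directly. A minimal sketch in Python 3 (my choice of language; the sample strings are made up for illustration):

```python
# Pure 7-bit ASCII text produces the same bytes under ASCII and UTF-8.
ascii_text = "What the heck is text?"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A character outside ASCII needs more than one byte in UTF-8,
# and every byte of that sequence has its high bit set, so it can
# never be mistaken for an ASCII character.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
e_acute_utf8 = e_acute.encode("utf-8")
high_bits_set = all(byte >= 0x80 for byte in e_acute_utf8)

print(same_bytes, e_acute_utf8, high_bits_set)
```

So a decoder that only understands ASCII reads the ASCII parts of a UTF-8 file correctly and sees the rest as bytes above 127.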
New Re: Unicode and ASCII - Nitpick II
ASCII and UNICODE each define a set of code points, numeric values assigned to characters. As it turns out, the ASCII code points are identical to the UNICODE code points for the characters represented by ASCII.

UTF-8, UTF-16 (both versions*), UTF-32, UCS-2, etc. are all encoding schemes; that is, mechanisms through which the code points can be represented. In ASCII, such things are not necessary because ASCII is defined to be fully representable in a single byte. UNICODE is not, and so we have come up with all sorts of ways to represent the 97,000+ characters that UNICODE currently represents (and more coming RSN!). The encoding schemes listed above (along with UCS-4) are specifically for UNICODE. So talking about representing ASCII as UTF-8 is (pedantically) meaningless. You can represent the ASCII subset of UNICODE using UTF-8, however (it's a "null translation"), but then you're really representing UNICODE.


* Both versions means big-endian and little-endian, but you already knew that...
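The split between a code point and its encoding schemes can be seen by encoding one character several ways. A minimal sketch in Python 3; the euro sign is an arbitrary example character, not one taken from the thread:

```python
# One code point, several byte representations.
euro = "\u20ac"          # EURO SIGN
code_point = ord(euro)   # the abstract number, 0x20AC

encoded = {
    name: euro.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
}

for name, raw in encoded.items():
    print(f"U+{code_point:04X} as {name}: {raw.hex()}")

# An ASCII character is one byte in UTF-8 but two in UTF-16.
a_utf8 = "A".encode("utf-8")
a_utf16 = "A".encode("utf-16-be")
```

The code point never changes; only the bytes on disk or on the wire do, which is why a file needs its encoding named before anyone can read it.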
jb4
shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT

New you are confusing text with display
text means a type of data that is represented as chars; display is how to present those chars on the screen in a format the recipient can understand, that is, by mapping the charset to a graphics set. Unicode is one method of doing so. Bitmap is another (see Adobe).
hope that helps
thanx,
bill
Any opinions expressed by me are mine alone, posted from my home computer, on my own time as a free american and do not reflect the opinions of any person or company that I have had professional relations with in the past 48 years. meep
questions, help? [link|mailto:pappas@catholic.org|email pappas at catholic.org]
New Uhhh..Not quite, Bill
First, [link|http://z.iwethey.org/forums/render/content/show?contentid=200487|see this post].

UNICODE has nothing to do with rendering; indeed in Arabic (for example) where there are positional forms, there are up to 4 glyphs (renderings) for a given code point. UNICODE does not in any way define the renderings, it simply defines a code point for the character. How that character is rendered is a function of a rendering engine (like Adobe) that knows about how a code point is supposed to be rendered.
jb4
New And that is one thing that sucks about Unicode
It is true that Unicode has nothing to do with rendering.

For instance there are different characters in different Asian languages that have been mapped to the same code point in Unicode. Which means that rendering engines have to play bad games about guessing what language they are currently working in to correctly render them on screen! (For instance the same character can have a Han Chinese, Traditional Chinese, Taiwanese, Japanese and Korean variant.) You know, the same kind of bad games that Unicode supposedly protects us from. :-(

Further complicating things is the fact that the same character may have multiple Unicode sequences that produce it; for instance code points U+0069 (Latin letter "i"), U+2139 (information source), U+2148 (imaginary unit), and U+2170 (Roman numeral i) are all likely to be written the same way.
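The four look-alike code points listed here (read as hexadecimal) can be inspected with Python's standard `unicodedata` module. A minimal sketch; folding them with NFKC compatibility normalization is my own addition, not something the post proposes:

```python
import unicodedata

# The four code points from the post, written as hexadecimal escapes.
lookalikes = ["\u0069", "\u2139", "\u2148", "\u2170"]

# Each one is a distinct character with its own official name.
names = [unicodedata.name(ch) for ch in lookalikes]

# NFKC compatibility normalization folds all four to plain "i".
folded = [unicodedata.normalize("NFKC", ch) for ch in lookalikes]

for ch, name, plain in zip(lookalikes, names, folded):
    print(f"U+{ord(ch):04X} {name} -> {plain!r}")
```

Normalization only helps for characters that Unicode itself marks as compatibility variants. Look-alikes from different scripts, such as a Cyrillic letter that resembles a Latin one, come through unchanged.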

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
New At least they're consistent
UNICODE doesn't consider the glyph, they only consider the underlying character. That the character can be represented by several glyphs, or that the same glyph can be used to render several different characters is not important to them.

Nor, I suspect, is it important to any other encoding scheme.
jb4
New But it is a problem
It means that you cannot conveniently have both Japanese and Chinese text in the same document, even though you're dealing with an encoding that is supposed to solve internationalization problems.

The "multiple code points" problem also creates complexity, and that complexity can lead to security holes. See [link|http://www.schneier.com/crypto-gram-0007.html#9|http://www.schneier....-gram-0007.html#9] for the kinds of security problems that could happen and [link|http://www.schneier.com/blog/archives/2005/02/unicode_url_hac_1.html|http://www.schneier....de_url_hac_1.html] for a concrete example of it being exploited in practice.

Cheers,
Ben
New Except for that full width/half width ascii thing
See, they're not even consistent - full width ascii is there specifically to support typography.



"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
New I dunno...
My understanding of "full-width ASCII" is that it exists to support romaji in Japanese, where such niceties as proportional spacing are just now appearing in the public marketplace. These code points (U+FF01 - U+FF5E) are supported primarily so that fonts that contain both the "standard" ASCII and the "full-width ASCII" (e.g. Monotype Andale, Arial Unicode, Mincho, etc.) can differentiate which glyph to use.

In looking up the code point range for the full-width ASCII, I also discovered that there really are code points for the various Arabic presentation forms, as well as Latin presentation forms and others. So, within UNICODE, you can explicitly define the correct glyph for presentation, even if you have a font engine capable of "fixing it up" for you. Of course that does complicate UNICODE even more: since you cannot know whether someone who is "using UNICODE" is using the presentation forms, the rendering engine must be capable of passing through the presentation forms, and "massaging" the un-presentation forms when necessary. Sheesh!
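The presentation forms mentioned here can be looked up with Python's `unicodedata` module. A minimal sketch using the Arabic letter beh as an arbitrary example; the specific code points are my pick, not ones named in the post:

```python
import unicodedata

beh = "\u0628"  # ARABIC LETTER BEH, the plain character

# Its four positional presentation forms: isolated, final, initial, medial.
presentation_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

form_names = [unicodedata.name(form) for form in presentation_forms]

# NFKC normalization maps every presentation form back to the plain letter.
normalized = [unicodedata.normalize("NFKC", form) for form in presentation_forms]

for form, name in zip(presentation_forms, form_names):
    print(f"U+{ord(form):04X} {name}")
print(all(ch == beh for ch in normalized))
```

This is the "massaging" in the other direction: a program that searches or compares text can fold the explicit glyph choices back to the underlying character.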

That's why I love IWETHEY...you learn something new even when you don't expect to....
jb4
New My personal take on it
is that supporting typography IS the job of schemes like this. For programming, 7 bit ASCII is fine, and it is historically justifiable that it should be based on the English language... but for the rest of it, typography of the languages in question is the raison d'être of text encoding schemes, and when people attempt to come up with standards for such, typographers should be the people consulted as there is up to several centuries' worth of experience for any given language to draw upon for what is needed underneath to support rendering a language completely.
--
-------------------------------------------------------------------
* Jack Troughton                            jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca]                   [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada               [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
New Perhaps, but it makes searching tricky
Finding a phone number in a database used to be annoying as you had to make sure that you normalized the full width ascii to regular ascii before storing and searching. Oracle didn't find a full width represented phone number when it was stored as regular ascii. Made web form programming kind of fiddly.

I think this is better now, but it points up some inconsistencies in the unicode standard. Clearly, "characters, no glyphs" is bogus.
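The normalize-before-storing-and-searching step described above might look like this. A minimal sketch in Python 3; `normalize_phone` is a hypothetical helper and the phone number is invented, and nothing here reflects how Oracle handles it:

```python
import unicodedata

ASCII_DIGITS = "0123456789"

def normalize_phone(raw):
    """Fold full-width characters to ASCII, then keep only the digits."""
    # NFKC maps full-width digits and punctuation to their ASCII forms.
    folded = unicodedata.normalize("NFKC", raw)
    return "".join(ch for ch in folded if ch in ASCII_DIGITS)

stored = "03-1234-5678"
# The same number typed with full-width digits and full-width hyphens.
typed = "\uff10\uff13\uff0d\uff11\uff12\uff13\uff14\uff0d\uff15\uff16\uff17\uff18"

naive_match = typed == stored
normalized_match = normalize_phone(typed) == normalize_phone(stored)
print(naive_match, normalized_match)
```

Applying the same function on the way into the database and on the way into the query is what keeps the two forms from missing each other.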
New Well, if it was an easy problem
there would've been a solution found a long time ago:)

Your point about searching is very well taken. A badly designed standard could make that nightmarish.
-- Jack Troughton
New ICLRPD (new thread)
Created as new thread #201145 titled [link|/forums/render/content/show?contentid=201145|ICLRPD]
jb4
New Have you all seen the HUGE unicode poster?
[link|http://www.ianalbert.com/misc/unichart.php|http://www.ianalbert...misc/unichart.php]
New close enough to debug a table entry :-)
yer right the rendering engine decides how to present the unicode.
thanx,
bill
All tribal myths are true, for a given value of "true" Terry Pratchett
[link|http://boxleys.blogspot.com/|http://boxleys.blogspot.com/]

New Text is not as simple as it seems
Text is any combination of the characters that we write with.

Unfortunately we don't write those characters on computers, so we need to represent them internally.

What we need to do is find a way to encode those characters into bits, and then another way to decode those bits back into an idea of what the characters are. We further need a way to draw those characters on an output device (screen, paper, etc).

None of this is as simple as it seems. Ideally it should look simple: users should just type what they want and see it appear, and be able to send documents to other users who can read them. However, programmers are likely to need to know something more about the details than that.

How much more? Well that article is Joel's attempt to say what he thinks programmers absolutely need to know. If you want to learn it, read Joel's article. :-)

Another attempt at talking about just what is involved in a string can be found at [link|http://www.sidhe.org/~dan/blog/archives/000255.html|http://www.sidhe.org...hives/000255.html].

Anyways, let me address your questions in no particular order.

  • Where can I learn the minimum that Joel recommends at [link|http://www.joelonsoftware.com/articles/Unicode.html|http://www.joelonsof...cles/Unicode.html]? You can learn it from [link|http://www.joelonsoftware.com/articles/Unicode.html|http://www.joelonsof...cles/Unicode.html].

  • I am typing text in this forum, who receives the keyboard event, and translates it to the right character on the screen, is it the kernel or mozilla? I don't know what operating system you're using, so there is no way to give a full, definitive answer. But the preliminary one is that the operating system is responsible for receiving the keyboard input, and deciding where it goes. And whatever receives that may delegate it further (in Linux X will get it and decide what application gets it, if that application is Mozilla, then the application decides which part of the application - eg which text box - gets the input). The application receives that input, and decides what it means internally. The application then decides to draw the information and makes library calls, some of which go back out to the operating system which hopefully knows what to do with those. (I'm defining "operating system" loosely here. On Linux, for instance, the kernel doesn't actually take care of this. Instead various higher level processes, such as a font server, decide the nitty gritty details.)

  • Do all the users of this forum type in the same character set? No. We have some European types who undoubtedly use local character sets.

  • Does Unicode use some sort of trick to read ASCII? No, and sorta. The Unicode standard says nothing about how to cooperate with existing ASCII data. It just maps integers to text. But there are many ways of representing an integer in bits and bytes, and one of those ways (namely UTF-8) was designed so that the characters that you have in ASCII are all represented by the same bytes in that representation. Of course UTF-8 can represent many characters that were not standardized in ASCII.

  • What the heck is ASCII? ASCII is an agreed on standard for how to turn the numbers from 0-127 into characters. Those characters are good enough for anything that you want to do with English, but won't include, for instance, accented characters that you might find in French, German or Spanish. Most representations of text adhere to this standard.

  • Why is ASCII not an option in the Mozilla dropdown for character sets? Because ASCII would be a useless option. Many character sets agree on the part that ASCII specifies, and ASCII doesn't provide any of the extra characters that others do.

  • Why didn't my C programs worry about this? Because the ideas of character sets, languages, and so on are higher order abstractions, and C tends to be a very low-level language. Furthermore you were probably working with programs that only had to deal with US input (or failing that, input on a system where someone else was worrying about locale) so your programs didn't have to think about what text meant, just how many bytes were in the string. Higher order languages generally try to provide higher level abstractions, and programmers today are more likely to have to deal with strange text, so they need to deal with the complications that C ignores.

  • Why don't programming books mention the sorting issues that Dan Sugalski mentions? There are programming books that do. But introductory ones oversimplify. Also there isn't a single generic way to handle this problem, and programming books like to bring up questions that they have good answers to.

  • Tell me about a language that understands different character sets! Many do. You just have to look. For instance for Perl see [link|http://perldoc.perldrunks.org/perllocale.html|http://perldoc.perld...g/perllocale.html] for locale issues, and [link|http://perldoc.perldrunks.org/perluniintro.html|http://perldoc.perld...perluniintro.html] + [link|http://perldoc.perldrunks.org/perlunicode.html|http://perldoc.perld.../perlunicode.html] for information on Perl's Unicode support.

  • Can I write a single text file, using different char sets? different encodings? No. The character set/encoding says what various bits and bytes will mean. The file just contains the bits and bytes. The file is data. The character set is metadata. You need to combine them to get meaning. If you're using Windows, there is a simple way to experience this first hand. In Windows, GUI applications use a different codeset than DOS ones. Cut-and-paste tries to keep the glyphs the same, even if it means changing bits and bytes. If you save to a file and read that from a different program, you'll keep the bits and bytes the same, but the characters can change. Try this. Paste é into Wordpad, save it, and then look at the file you saved it to in DOS. It won't be é any more. Try to paste it into a DOS-based editor, save it, then open it in Wordpad. It will be something else again.

  • What does it mean to say that Ruby doesn't do Unicode? Well Ruby sort of does Unicode. You can, for instance, use the Jcode library in Ruby and you'll have Unicode support. You can read Unicode, write Unicode, search Unicode strings for Unicode substrings, do pattern matching against Unicode, and so on. However some things either won't work or will go to a stupid default. For instance Ruby doesn't do well when it comes time to sort Unicode. (Unicode does not address the sorting issue - indeed it is impossible since different languages with the same extra characters sort them differently.)

  • And why utf-8 in particular, why would windows-1256 be okay but not utf-8? There's a generic answer that I'll give. The operating system makes windows-1256 trivial - there are standard libraries that have the facilities to do basic things like tell you which string is greater than another. Just load the right library, tell it your locale, and pass it the strings. By contrast utf-8 is much harder, it forces you to change certain things about how you handle strings (eg one character might not now be one byte) and common operations that people want are not necessarily available (like saying how to order strings).

  • Can Ruby, written in C, handle char sets that C can't? Ruby offers abstractions that, by default, C does not. However obviously there is nothing that you can do with Ruby that you can't do in C using the right libraries. (But even with the right libraries, it might still be much easier to do in Ruby than C.)
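The Wordpad/DOS experiment from the list above can be imitated inside one program. A minimal sketch in Python 3, assuming the Windows side uses code page 1252 and the DOS side uses code page 437; the post does not say which code pages are involved, so both are assumptions:

```python
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Same character, different bytes, depending on the character set.
windows_bytes = e_acute.encode("cp1252")  # assumed Windows GUI code page
dos_bytes = e_acute.encode("cp437")       # assumed DOS code page

# Same bytes, different character, when read with the other character set.
seen_in_dos = windows_bytes.decode("cp437")
seen_in_windows = dos_bytes.decode("cp1252")

# In UTF-8 one character is no longer one byte.
utf8_length = len(e_acute.encode("utf-8"))

print(windows_bytes, dos_bytes, seen_in_dos, seen_in_windows, utf8_length)
```

The file holds only the bytes. Which character they turn into is decided by whichever character set the reading program assumes, which is the data versus metadata point made in the list.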



Cheers,
Ben
New This is one thing that Java handles pretty well
In Java all characters are represented as 16 bit Unicode characters and Java (1.3+) provides transformations to any other character representation.

In 1.4+ the nio package provides a whole bunch of classes that do this kind of stuff, CharsetEncoder, CharsetDecoder, Charset, etc.
Edited by bluke March 13, 2005, 11:35:09 AM EST
New Rule #1 - Everything you think you know is wrong
The problem with the C standard library (and the C++ standard library) is that it was written by xenophobic white English speaking men. (The proof of this statement is the hijacking of char to mean byte.) It only works with 7 bit ASCII characters. So if you care about anybody outside the USA, give up on the standard C library.

So give up on the standard C and C++ libraries.

Here is your replacement: [link|http://www-306.ibm.com/software/globalization/icu/index.jsp|http://www-306.ibm.c...ion/icu/index.jsp]

It has all of the unicode handling capabilities you need, collation (not same as strcmp), number and date formatters/parsers, character testing (isAlpha, isDigit, etc).
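To see why collation is not the same as strcmp, here is a minimal sketch in Python 3 rather than ICU itself. `crude_key` is a hypothetical helper made up for this illustration; it strips accents and folds case, which is only a rough stand-in for the language-aware collation a library like ICU provides:

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair"]  # the last word starts with e-acute

# Plain sorted() compares code points, much as strcmp compares bytes.
by_code_point = sorted(words)

def crude_key(word):
    """Rough sort key: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

by_crude_key = sorted(words, key=crude_key)

print(by_code_point)
print(by_crude_key)
```

Code point order puts the accented word after "zebra" because é has a higher number than z. Real collation rules also differ from language to language, which a single key function like this cannot capture.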

ICU is now used in the Parrot VM as well - which means Perl v6.0 and Python will use it.

The Java standard library contains a port of ICU to Java. It's the same code though.

The big river book company is also using it for I18n feature implementation.

If you are working in C++, you just use the string object and the right things happen. In general, you should do OK to assume that all data in files is in UTF-8 format. Store all data in files in UTF-8 format. UTF-8 is the only unicode format that Oracle supports. All ascii files are already in UTF-8 format.

I hope this helps.
Edited by tuberculosis Aug. 21, 2007, 06:28:08 AM EDT
New Why xenophobic?
When I write PHP my variable names and function names are English words, or based on them. Does that make me xenophobic? When I write a function to split a name into first-name/last-name and make an assumption that the last piece is the family name and the first piece is the given name, am I disrespecting Asians?

I suspect the "xenophobic white English speaking men" you refer to are the same ones who decided two digits were enough to store a year, because they didn't expect their code to be used for that long. Isn't it possible that they simply didn't consider the fact that the entire world would someday be writing code to the standards they were creating?
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
New Because they didn't think...
...anyone would ever have to write any computer code in anything other than Latin1, that's why.


Peter
[link|http://www.ubuntulinux.org|Ubuntu Linux]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
New Because if they had spent any time at all
exploring things outside of their sphere of experience, we wouldn't be in this fix.
Edited by tuberculosis Aug. 21, 2007, 06:30:01 AM EDT
New Now how about addressing my example
Did they use two digits to store the year because they were xenophobic?
New The best explanation that I've seen of why 2 digits...
is at [link|http://www.perl.org/about/y2k.html|http://www.perl.org/about/y2k.html].

Just yesterday I was filling out paperwork and encountered 2 digit years. On paper.

Cheers,
Ben
New No, but they were xenophobic etc
because they expressly disregarded the potential desire and need of people to write software in a non-European language, and even in a lot of European ones.

A very little bit of time having a discussion with a competent typographer back in the fifties could have avoided the whole mess very easily, but the people making these decisions didn't think they needed to look outside their sphere of expertise when they were coming up with a lot of this stuff.

For example, how do you do even basic typesetting in French with only 7 bit ASCII? Simply put, you don't; it can't be done without coming up with extensions.
-- Jack Troughton
New xenophobic's probably the wrong word
[link|http://dictionary.reference.com/search?q=xenophobic|xenophobic]
having abnormal fear or hatred of the strange or foreign
I highly doubt they feared or hated people in other countries.

I bet it has more to do with the fact that people in the US don't often use other languages, so it wasn't anything they'd be concerned with when they designed ASCII - which of course stands for the American Standard Code for Information Interchange.
Darrell Spice, Jr.                      [link|http://spiceware.org/gallery/ArtisticOverpass|Artistic Overpass]
[link|http://www.spiceware.org/|SpiceWare] - We don't do Windows, it's too much of a chore
New Yeah, you're right
ignorant would be a better word.

That said, they should've known better; as I said, a five minute discussion of this standard with any competent typographer would have let them know why it was broken.
--
-------------------------------------------------------------------
* Jack Troughton                            jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca]                   [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada               [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
New How about "excessively humble"?
They used two digits for the year because they didn't think their code would still be in use that far in the future. They assumed that when someone 40 years in the future wanted to program a computer they'd write something for it.

They probably thought the same thing about language. At the time, they were writing for their specific piece of hardware. They assumed that when someone wanted to run their own new computer they'd write their own language and operating system to do it, just like everyone else always had.
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
New Look, the point about the two digits for a year is well
taken, but the point about ASCII is not. It was supposed to be a standard that could be used by anybody, but ended up only being really usable by people who speak and use only English. The idea that they only expected Americans to use it is easily debunked by reading any of the literature extant at the time... what they did expect is that only Americans would program it.
--
-------------------------------------------------------------------
* Jack Troughton                            jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca]                   [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada               [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
New Disagree
ASCII was supposed to be a standard for American Information Interchange (as pointed out above), hence the name. And being first it followed the now-increasingly-popular YAGNI rule.

Hey, I didn't notice that IBM (International Business Machines) fixed the problem with its highly vaunted and hopelessly idiotic EBCDIC, either.

When ASCII became problematic due to its limitations, it was extended (in the best style of Micros~1); now we have UNICODE, and a whole raft of ISO 8859's. And then there are the Asians: JIS, Shift-JIS, GB2312, GB18030, Big 5 et al. So now you can have your \ufffd in any one of several dozen encoding schemes. Ain't choice grand?
jb4
shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT

New Maybe...
shortsighted [link|http://dictionary.reference.com/search?q=ethnocentric|ethnocentrists].

After all Information Interchange implies you might want to exchange it with someone. Not those nobodies - only our kind of someones.

Whatever. Anyhow that collection of cheap hacks is way broken.



"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
Edited by tuberculosis Aug. 21, 2007, 06:31:43 AM EDT
New How about simply "provincial".
Solving their own problems and not the world's.

Do you forget that 6 computers were going to satisfy all the computational needs of the world?

And no one has yet mentioned IBM's EBCDIC - Extended Binary Coded Decimal Interchange Code which came out in 1964 or so. These came in "national" flavors.

Hey, 2nd generation computers used 6-bit codes for characters (related to punch card codes) - uppercase letters, numbers, and a few punctuation marks. [link|http://www.cs.uiowa.edu/~jones/cards/codes.html|Link].

It's evolution, guys!
Alex

The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt. -- Bertrand Russell
New The people who coded for teletypes and green terminals
should have been asking "a typographer"? They had no idea that what they were doing would be used in "typography" one day. Remember, it was called a "computer" back then. You don't need too many symbols to "compute".
--


And what are we doing when the two most powerful nations on earth -- America and Israel -- stomp on the elementary rights of human beings?

-- letter to the editor from W. Ostermeier, Liechtenstein

New Yes, a typographer
text layout on a green screen is still text layout. Hence the comment about "ignorant outside their area of specialty".

Another word that one might use could be uncultured. And, if they'd been a little less ignorant or a little more cultured then the current internationalisation mess (and it is a mess) could have been very easily avoided.
--
-------------------------------------------------------------------
* Jack Troughton                            jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca]                   [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada               [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
New Internationalization would not have been so easy
A lot of people wrote a lot of programs that were a lot simpler and more compact because they were able to assume that, for instance, one byte was one character. Back in the 70s it made little sense to waste precious time and space in processing text to take into account issues that would become critical decades later.

Furthermore you're acting as if i18n issues are something that is easily dealt with once you've chosen to deal with them. This seems to me to be absurdly wrong.

Getting i18n right takes a lot of knowledge, and speaking to a random competent typographer wouldn't magically solve the problem. Oh, it might solve your problem, sorta. You could just add some extra characters for some European languages. But you quickly get to too many characters for one byte, and don't handle people who use different alphabets. You've also made life harder for whoever wanted to solve the problem for real later. (Something like UTF-8 would have been impossible.) You could take a few to use for combining characters (letting people say something like `e and have it be one character), but you'd probably miss the issue of multiple combining characters. You could try to create an extensible system, and invariably you'd overdesign.
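To make the combining-character point concrete, here is a minimal Python sketch using the standard unicodedata module. It only illustrates the issue as Unicode later handled it, not anything that existed in the 70's: a base letter plus a combining accent and the precomposed letter are different code point sequences, and more than one combining mark can stack on a single base letter. The sample strings are made up for illustration.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: a single precomposed code point.
precomposed = "\u00e9"

# The raw sequences differ, so naive comparison and length checks disagree.
same_before = decomposed == precomposed
lengths = (len(decomposed), len(precomposed))

# NFC normalization composes the pair into the precomposed code point.
same_after = unicodedata.normalize("NFC", decomposed) == precomposed

# Multiple combining marks on one base letter: "a" + ring above + acute accent.
stacked = "a\u030a\u0301"
stacked_nfc = unicodedata.normalize("NFC", stacked)
```

Any code that assumes "one stored unit is one character" has to decide which of these forms it means, which is the kind of thing a quick fix would probably have missed.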

In fact this is an example where I'd accuse Todd of setting impossible standards for others. Suppose that they did try to solve i18n back in the 70's. Inevitably it would have caused computers to waste memory and run more slowly, and the design would have sucked (without experience trying to solve the problem, you're unlikely to come up with the right abstraction). The alternative is to try to solve the problem incrementally - solve the problem that you have now, now, in a way that can be extended later when you have a better idea how to do it. Normally Todd would be all over that approach, but not in this case. Because many different people tried to improve the system independently, each of them solved their own problem, and their solutions conflict.

So if they overabstracted then he'd blast them for overabstracting and coming up with a bad solution, while what they did gets them accused of being xenophobic and causing a long-term mess. Neither way could they win.

If you think that my summary is wrong, tell me what you think should have been done, within the needs and limitations of 70's technology, to solve i18n. Or did you only care about them finding a practical solution to your problems, leaving other people in the cold? (In which case you're no better than they were...)

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
New Text layout in 80 by 24 grid of monospaced font?
Gimme a break. Anyone involved in typesetting would laugh the questioner out of the door.
--


And what are we doing when the two most powerful nations on earth -- America and Israel -- stomp on the elementary rights of human beings?

-- letter to the editor from W. Ostermeier, Liechtenstein

New Phone books back then
were pretty much like that... and they used typesetters to figure out the best way to lay them out so they'd be easy to read and easy to use.

Well designed text layouts and badly designed text layouts for that sort of screen were prevalent, and were looked at as a design problem. That's about putting text into a space, and that is typesetting. Given the parameters of a problem, a competent typesetter would most definitely not laugh them out of the room... unless of course they'd been expected to work for free.
--
-------------------------------------------------------------------
* Jack Troughton                            jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca]                   [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada               [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
New Please don't use the letter "e" in your code.
After all, you don't need many symbols to compute.

That's what 7-bit ASCII does to the French language.


Peter
[link|http://www.ubuntulinux.org|Ubuntu Linux]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
New I certainly used to do without "e"
"E" works just fine, thank you very much. Another case in point: on the Russian version of the VT52 we had a button: you had a choice of capital/small English vs. capital English/capital Russian. That's the kind of resources people used to work with.
--


And what are we doing when the two most powerful nations on earth -- America and Israel -- stomp on the elementary rights of human beings?

-- letter to the editor from W. Ostermeier, Liechtenstein

New I couldn't use "e" either ...
That's what 7-bit ASCII does to the French language.

You had seven bits to work with? I was working with 6 bit CDC display code. You couldn't even do the letter "e" on that machine, you had to be content with "E". That's ok, we didn't have any printers that could print lower case anyways. (Hmmm ... I recall some terminals that didn't do lower case either).

On the other hand, the other machine I worked on encoded file names in RAD50 (Radix 50). 50 octal is 40 decimal. 40**3 = 64000 which is less than 2**16. Encoding characters in RAD50 meant that you could get 3 (yes, 3!) characters in a 16 bit word. A file name (6 character basename and 3 character extension) all fit into 3 words.
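The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 40-character table below is an assumption (the commonly cited PDP-11 ordering, with the unused slot shown as "%"), and the function names are made up for the example.

```python
# Assumed character table: 40 symbols, so each character is one base-40 digit.
# Position 29 was unused or machine-specific on real systems; "%" is a placeholder.
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"


def rad50_pack(chars):
    """Pack up to three characters into one 16-bit word (three base-40 digits)."""
    word = 0
    for ch in chars.upper().ljust(3)[:3]:
        # index() raises ValueError for characters outside the table.
        word = word * 40 + RAD50_CHARS.index(ch)
    return word


def rad50_unpack(word):
    """Recover the three characters stored in one packed word."""
    first, rest = divmod(word, 1600)
    second, third = divmod(rest, 40)
    return RAD50_CHARS[first] + RAD50_CHARS[second] + RAD50_CHARS[third]


def pack_filename(base, ext):
    """Pack a 6-character basename and 3-character extension into three words."""
    padded = base.upper().ljust(6)[:6] + ext.upper().ljust(3)[:3]
    return [rad50_pack(padded[i:i + 3]) for i in range(0, 9, 3)]


words = pack_filename("HELLO", "TXT")
largest = rad50_pack("999")  # highest digit in every position
```

The largest possible word is 40**3 - 1 = 63999, which is why three characters fit below 2**16 = 65536.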

I remember spec'ing out an early system that was going to use ASCII to represent the data. It was a big deal. One of the users of the system pulled me aside one day and wondered what all the fuss was with this "ASK TWO" stuff.

To go from fighting proprietary character encodings to a standard that could talk to the computer across the room was a big step forward. That they couldn't talk to computers across the ocean yet was not really on the horizon.
--
-- Jim Weirich jim@weirichhouse.org [link|http://onestepback.org|http://onestepback.org]
---------------------------------------------------------------------
"Beware of bugs in the above code; I have only proved it correct,
not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)
New Oh, come ON already
We know how you feel about any language whose name starts with the letter C, but let's get on about it, OK? (Especially when helping someone...)

C, written in the early '70s, far antedated even foreign computing. When the ANSI C standard was ratified in 1989, UNICODE was only a glimmer in the eyes of its designers. Nonetheless, C had still given it some thought, with the advent of locale-specific functions and so on. Remember that the state of the art then was Latin-n and JIS and Shift-JIS. (Who cared about China? It was so backwards, it'd never get on board with ubiquitous computing.)

Well, then came the internet, and damn if talking to and between non-Latin-character-writing folks became important. So did C++. C++ defines as part of its standard the ability to determine locales via facets, and defined the wchar_t type specifically for handling the (then) two-byte code points of the nascent UNICODE and other wide-character encoding schemes that were becoming more widely used. That's why, as you pointed out, "If you are working in C++, you just use the string object and the right things happen". That wouldn't be the case if those xenophobic white men had behaved in the way you described, now would it?

And no, they did not update the existing legacy C library to explicitly maintain backwards compatibility. Right decision? I dunno...but since there is a workaround, it seems like an OK choice at this point.


[Edit: Fixed typos so the 2nd paragraph made some sense]

jb4
shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT

Edited by jb4 March 25, 2005, 02:20:51 PM EST
New The C++ standard i18n library is awful
ICU is MUCH better and easier to use.

C++ designers have gone off the deep end of the complexity curve.



"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
New Don't know ICU
but I will agree that the C++ library complexity is...well, high.
jb4
shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT

New ICLRPD (new thread)
Created as new thread #200886 titled [link|/forums/render/content/show?contentid=200886|ICLRPD]
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
New You can find it here
[link|http://www-306.ibm.com/software/globalization/icu/index.jsp|http://www-306.ibm.c...ion/icu/index.jsp]

Incidentally, NextStep had a fully compliant unicode string object for Objective C from the beginning (1990-ish).

What is C++'s excuse? Particularly given that its creator was a European? Definitely short sighted tunnel vision.



"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
New Time line?
Would you give me the time line of NextStep and its evil twin Objective C vs that of C++?

I think you'll find the answer there....
jb4
shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT

New Released in 1988
I recall buying a copy of NextWorld #1 (from the MacWorld people) while working for a little Mac utilities company in Boulder in 1990 and showing it around the office.

Unicode from the beginning I'm told (I started working with them in 1997).




"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
New Actually, Algol 68 was designed from the ground up
to be localized. The formal grammar was written in a language-neutral way, and every keyword could be translated into many languages. It was so unwieldy that not many people used it, at least in Russia.
--


And what are we doing when the two most powerful nations on earth -- America and Israel -- stomp on the elementary rights of human beings?

-- letter to the editor from W. Ostermeier, Liechtenstein

New Re: What the heck is text?
I am typing text in this forum, who receives the keyboard event, and translate it to the right character on the screen, is it the kernel or mozilla?


The keyboard sends a set of signals, which vary a bit by keyboard model, when you press a key. The OS takes those signals and translates them into a consistent "keycode". If it's not a command that the OS processes itself, it is then passed on to the active application. The application can then do whatever processing it wants. When it wants to display text, it passes a string of characters on to the display layer, along with font and other display information. The display layer then builds the actual bitmap from that information.

In the simplest case, the application can take the keycode passed by the OS and tack it on to the string it passes to the display layer. But that isn't always true.

Are all the users of this forum type in the same character set? I dont think so, yet to view this forum we all (I am only guessing) tell our browser to open this forum site in the same character set.

Character encodings and font sets are very ugly in HTML, as the early versions were an English-centric de facto standard. XHTML cleans up most of these issues. Web browsers have to use some guesswork and follow some unwritten conventions for handling these issues in HTML. The browser has to peek at the web page and try to figure out what the encoding is in many cases. In practice, in HTML, ISO-8859-1 is the default if the browser can't find anything else.

On the upload side, the browser is responsible for encoding the text in a standard format before sending the form data to the server.

Okay I wont lie, I read something like this, the first 128 characters the first 7 bits are common in many character sets, bu the second 128 char sets are different

I think even unicode use some trick to read ASCII chars

Okay surprise question?
What the heck is ASCII?

ASCII is an old standard for 8-bit encoding. Many, but not all, character encodings follow the ASCII encodings for English letters and numbers. This allows many programs to work correctly with basic English even if they don't handle encoding correctly.
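A quick way to see the "many, but not all" part is to encode the same plain English text under a few codecs and compare the bytes. This is just a sketch using codecs that ship with Python; the sample text and variable names are arbitrary.

```python
text = "Hello, 123"

# Encodings that keep the ASCII values for English letters and digits.
ascii_compatible = ["ascii", "latin-1", "cp1252", "utf-8"]
encoded = {name: text.encode(name) for name in ascii_compatible}
all_match = len(set(encoded.values())) == 1

# An encoding that does not follow ASCII: EBCDIC (IBM code page 037).
ebcdic = text.encode("cp037")
ebcdic_differs = ebcdic != encoded["ascii"]

# Outside the shared range the encodings go their separate ways.
e_acute = {name: "\u00e9".encode(name) for name in ["latin-1", "utf-8"]}
```

Here all_match comes out True and ebcdic_differs comes out True, while the accented letter is one byte in latin-1 and two bytes in utf-8.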

Another problem banging my head, when I used to write those silly ...
scanf, printf programs in C, it didn't seem that the compiler bothered
about the character set

C is an old and low level language. It doesn't really deal with these issues. As far as C is concerned, a string is simply a sequence of bytes. C pretty much assumes that the keycodes passed by the OS = string codes = display codes = 8 bits. You can work in other encodings in C, but then you have to use functions that understand your encoding.
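The "string is simply a sequence of bytes" view can be shown without writing any C. The Python sketch below (sample word chosen arbitrarily) compares the character count with the byte count of the same text in UTF-8, and shows what happens when code cuts the bytes at a fixed offset the way a byte-oriented routine would.

```python
text = "caf\u00e9"                   # four characters, the last one non-ASCII
utf8_bytes = text.encode("utf-8")    # what a byte-oriented program actually sees

char_count = len(text)               # counts characters
byte_count = len(utf8_bytes)         # counts bytes, like C's strlen would

# Cutting after four bytes splits the two-byte encoding of the last character.
truncated = utf8_bytes[:4]
try:
    truncated.decode("utf-8")
    split_a_character = False
except UnicodeDecodeError:
    split_a_character = True
```

A program that never looks inside the bytes gets away with this; one that counts, truncates, or upper-cases them needs functions that understand the encoding.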

Does linux have default char set values, why?
Lets put it differently does a system have a global char set? why?

Not really. The OS does have to set some standard for communication between the OS and the applications, but that is independent of what is displayed or what is stored in files. Most OSs use ASCII for communication between the OS and applications. Windows NT and later can use a Unicode system for some interfaces, but I don't know the specifics.

Jay
New I must correct you - ASCII is a 7-bit encoding
The 8th bit is padding. Many mail gateways still strip the 8th bit. The worldwide email system is still only 7-bit safe. Everything else is passed through this ugly old pipe by representing non-ASCII data in ASCII via mechanisms like base64 and entities.

At least, this was true 5 years ago when I last investigated the problem in detail.
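For the base64 part, a minimal Python sketch (standard library only, sample text arbitrary) shows the idea: the UTF-8 bytes have the 8th bit set, the base64 form does not, and the original text comes back intact on the other side.

```python
import base64

# Text with non-ASCII characters, as raw UTF-8 bytes.
payload = "na\u00efve caf\u00e9".encode("utf-8")
has_high_bit = any(byte > 0x7F for byte in payload)

# base64 re-expresses the bytes using only 7-bit ASCII characters.
wire = base64.b64encode(payload)
seven_bit_safe = all(byte < 0x80 for byte in wire)

# The receiver reverses the transformation to get the original text back.
restored = base64.b64decode(wire).decode("utf-8")
```

A gateway that strips the 8th bit would corrupt payload but leave wire untouched.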



"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
Edited by tuberculosis Aug. 21, 2007, 06:30:04 AM EDT
New Whoa, there.

Character encodings and font sets are very ugly in HTML, as the early versions were an English-centric de facto standard. XHTML cleans up most of these issues. Web browsers have to use some guesswork and follow some unwritten conventions for handling these issues in HTML. The browser has to peek at the web page and try to figure out what the encoding is in many cases. In practice, in HTML, iso-8859-1 is the default if the browser can't find anything else.

Actually, XHTML makes it more complicated. If you serve XHTML without sending the charset parameter in your Content-Type header, then the MIME-type of the document can determine the character encoding to be used for parsing. If you sent text/html or text/xml, a conforming parser must assume that the document is encoded in us-ascii, no matter what you've specified inside it; you have [link|http://www.ietf.org/rfc/rfc3023.txt|RFC 3023] and the legacy of text/* media types and transcoding proxies to thank for that.

If you sent application/xhtml+xml or application/xml, then the receiving parser is allowed to read the XML prolog inside the file, but if that's not present then the character set must be assumed to be utf-8 or utf-16, depending on whether the document begins with a byte-order mark; you're not allowed to look at the meta tags for this information.
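As a rough sketch of the rule just described for application/xml-style documents, the Python function below checks for a byte-order mark, then for an encoding declared in the XML prolog, and otherwise falls back to utf-8. It is a simplification for illustration, not a conforming implementation of the XML or RFC 3023 detection rules, and the function name is made up.

```python
import codecs
import re

# Matches an encoding declared in the XML prolog, e.g. <?xml version="1.0" encoding="..."?>
PROLOG_ENCODING = re.compile(rb'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')


def guess_xml_encoding(data):
    """Guess the encoding of XML bytes served without a charset parameter."""
    # A UTF-16 byte-order mark settles the question immediately.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Skip a UTF-8 byte-order mark, if any, before looking for the prolog.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = PROLOG_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No prolog and no UTF-16 byte-order mark: assume utf-8.
    return "utf-8"
```

Note that nothing here looks at meta tags, which is the point: for these media types that information is off limits.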

This is one of the reasons why Mark Pilgrim claimed [link|http://www.xml.com/pub/a/2004/07/21/dive.html|XML on the web has failed], and it just barely represents the tip of the iceberg as far as character-encoding issues in HTML, XHTML and XML are concerned.

--
You cooin' with my bird?
[link|http://www.shtuff.us/|shtuff]
New You're right, mostly
I said that XHTML cleans up the issue, not that it made it simpler. And you are right, XHTML blew its chance by failing to specify a good solution to the problem. However, XHTML at least has a mandatory specification standard.

With HTML the browser really has to guess in many cases. The current method of reading the file till you find a content-type tag and then restarting the process of reading the file in the specified type is horribly ugly and depends on no non-ASCII characters being put at the top of the file.
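The read-then-restart approach looks roughly like this in Python. It is a toy sketch, not how any particular browser does it: the regular expression, the 1024-byte scan window, and the function name are all assumptions made for the example, and ISO-8859-1 is used as the fallback mentioned earlier in the thread.

```python
import re

# Looks for charset=... inside a <meta> tag, treating the bytes as ASCII.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def decode_html(raw, default="iso-8859-1", scan_limit=1024):
    """Sniff a meta charset from the start of the page, then decode from the top."""
    # First pass: scan only the start of the file. This only works if everything
    # before the meta tag is plain ASCII.
    match = META_CHARSET.search(raw[:scan_limit])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        # Second pass: restart and decode the whole page with the declared encoding.
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # The declared name is not a codec we know; fall back to the default.
        return raw.decode(default, errors="replace"), default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head>'
        b'<body>caf\xc3\xa9</body></html>')
text, used = decode_html(page)
```

If the page has no meta tag at all, the same bytes get decoded as ISO-8859-1, which is where the guessing starts to hurt.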

Jay
New Using a pencil, it's unambiguous.
bcnu,
Mikem

Eine Leute. Eine Welt. Ein Führer.
God Bless America.
New You haven't seen my handwriting....
New Uh-oh. I wouldn't confess that ;0)
I've been told that "handwriting is a mark of your character." And, in truth, I believe it. Imagine how valueless an original transcript of Tom Sawyer would be if it was just bits on a CD...
bcnu,
Mikem

Eine Leute. Eine Welt. Ein Führer.
God Bless America.
New My father's handwriting was so bad...
How bad was it!?

In high school, whenever I wanted to cut
a day, I had an easy out.

I'd be late on purpose the next day.

I'd have my father write a note, excusing the
lateness.

I would write the "translation" on the back,
excusing the previous day's absence and the
current day's lateness.

Worked every time.
     What the heck is text? - (systems) - (56)
         It depends on the context. - (Another Scott) - (2)
             Unicode and ASCII - (StevenYap) - (1)
                 Re: Unicode and ASCII - Nitpick II - (jb4)
         you are confusing text with display - (boxley) - (12)
             Uhhh..Not quite, Bill - (jb4) - (11)
                 And that is one thing that sucks about Unicode - (ben_tilly) - (9)
                     At least they're consistent - (jb4) - (8)
                         But it is a problem - (ben_tilly)
                         Except for that full width/half width ascii thing - (tuberculosis) - (5)
                             I dunno... - (jb4)
                             My personal take on it - (jake123) - (3)
                                 Perhaps, but it makes searching tricky - (tuberculosis) - (2)
                                     Well, if it was an easy problem - (jake123)
                                     ICLRPD (new thread) - (jb4)
                         Have you all seen the HUGE unicode poster? - (FuManChu)
                 close enough to debug a table entry :-) - (boxley)
         Text is not as simple as it seems - (ben_tilly)
         This is one thing that Java handles pretty well - (bluke)
         Rule #1 - Everything you think you know is wrong - (tuberculosis) - (29)
             Why xenophobic? - (drewk) - (28)
                 Because they didn't think... - (pwhysall)
                 Because if they had spent any time at all - (tuberculosis) - (25)
                     Now how about addressing my example - (drewk) - (17)
                         The best explanation that I've seen of why 2 digits... - (ben_tilly)
                         No, but they were xenophobic etc - (jake123) - (15)
                             xenophobic's probably the wrong word - (SpiceWare) - (14)
                                 Yeah, you're right - (jake123) - (13)
                                     How about "excessively humble"? - (drewk) - (4)
                                         Look, the point about the two digits for a year is well - (jake123) - (1)
                                             Disagree - (jb4)
                                         Maybe... - (tuberculosis) - (1)
                                             How about simply "provincial". - (a6l6e6x)
                                     The people who coded for teletypes and green terminals - (Arkadiy) - (7)
                                         Yes, a typographer - (jake123) - (3)
                                             Internationalization would not have been so easy - (ben_tilly)
                                             Text layout in 80 by 24 grid of monospaced font? - (Arkadiy) - (1)
                                                 Phone books back then - (jake123)
                                         Please don't use the letter "e" in your code. - (pwhysall) - (2)
                                             I certainly used to do without "e" - (Arkadiy)
                                             I couldn't use "e" either ... - (JimWeirich)
                     Oh, come ON already - (jb4) - (6)
                         The C++ standard i18n library is awful - (tuberculosis) - (5)
                             Don't know ICU - (jb4) - (4)
                                 ICLRPD (new thread) - (drewk)
                                 You can find it here - (tuberculosis) - (2)
                                     Time line? - (jb4) - (1)
                                         Released in 1988 - (tuberculosis)
                 Actually, Algol 68 was designed from the ground up - (Arkadiy)
         Re: What the heck is text? - (JayMehaffey) - (3)
             I must correct you - ASCII is a 7-bit encoding - (tuberculosis)
             Whoa, there. - (ubernostrum) - (1)
                 You're right, mostly - (JayMehaffey)
         Using a pencil, it's unambiguous. -NT - (mmoffitt) - (3)
             You haven't seen my handwriting.... -NT - (Another Scott) - (2)
                 Uh-oh. I wouldn't confess that ;0) - (mmoffitt) - (1)
                     My father's handwriting was so bad... - (broomberg)
