Post #198,432
3/14/05 3:36:10 AM
8/21/07 6:28:08 AM
|
Rule #1 - Everything you think you know is wrong
The problem with the C standard library (and the C++ standard library) is that it was written by xenophobic white English-speaking men. (The proof of this statement is the hijacking of char to mean byte.) It only works with 7-bit ASCII characters. So if you care about anybody outside the USA, give up on the standard C and C++ libraries.
Here is your replacement: [link|http://www-306.ibm.com/software/globalization/icu/index.jsp|http://www-306.ibm.c...ion/icu/index.jsp]
It has all of the Unicode handling capabilities you need: collation (not the same as strcmp), number and date formatters/parsers, and character testing (isAlpha, isDigit, etc.).
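To make the "collation is not strcmp" point concrete, here is a rough sketch in Python rather than C. It does not use ICU. `crude_key` is a hypothetical helper that only strips accents and case, an assumption made for demonstration and far short of real locale-aware collation.

```python
import unicodedata

# "\u00c9mile" is "Émile", written with an escape so the code point is unambiguous.
words = ["zebra", "\u00c9mile", "apple", "eagle"]

# strcmp-style ordering compares raw code points, so U+00C9 sorts after "z".
by_code_point = sorted(words)

def crude_key(word):
    # Hypothetical helper: decompose, drop combining accents, ignore case.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# With the crude key, the accented name sorts among the other "e" words.
by_crude_key = sorted(words, key=crude_key)
```

Sorting by code point puts the accented name after "zebra", while the crude key places it next to "eagle". A real collator also applies language-specific rules that this sketch ignores.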
ICU is now used in the Parrot VM as well, which means Perl 6 and Python will use it.
The Java standard library contains a port of ICU to Java. It's the same code though.
The big river book company is also using it for i18n feature implementation.
If you are working in C++, you just use the string object and the right things happen. In general, you should be OK assuming that all data in files is in UTF-8 format. Store all data in files in UTF-8 format. UTF-8 is the only Unicode format that Oracle supports. All ASCII files are already in UTF-8 format.
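A minimal sketch of the last claim, in Python for brevity: text that is pure ASCII encodes to the same bytes under UTF-8, while anything outside ASCII takes more than one byte per character. The sample strings are made up for illustration.

```python
# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Outside ASCII, one character is no longer one byte.
french = "caf\u00e9"                      # "café"
char_count = len(french)                  # counts code points
byte_count = len(french.encode("utf-8"))  # counts bytes; U+00E9 takes two
```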
I hope this helps.
"Whenever you find you are on the side of the majority, it is time to pause and reflect" --Mark Twain
"The significant problems we face cannot be solved at the same level of thinking we were at when we created them." --Albert Einstein
"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses." --George W. Bush
|
Post #198,470
3/14/05 10:44:16 AM
|
Why xenophobic?
When I write PHP my variable names and function names are English words, or based on them. Does that make me xenophobic? When I write a function to split a name into first-name/last-name and make an assumption that the last piece is the family name and the first piece is the given name, am I disrespecting Asians?
I suspect the "xenophobic white english speaking men" you refer to are the same ones who decided two digits were enough to store a year, because they didn't expect their code to be used for that long. Isn't it possible that they simply didn't consider the fact that the entire world would someday be writing code to the standards they were creating?
===
Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats]. [link|http://DocHope.com|http://DocHope.com]
|
Post #198,490
3/14/05 11:24:31 AM
|
Because they didn't think...
...anyone would ever have to write any computer code in anything other than Latin1, that's why.
Peter [link|http://www.ubuntulinux.org|Ubuntu Linux] [link|http://www.kuro5hin.org|There is no K5 Cabal] [link|http://guildenstern.dyndns.org|Home] Use P2P for legitimate purposes!
|
Post #198,514
3/14/05 12:08:42 PM
8/21/07 6:30:01 AM
|
Because if they had spent any time at all
exploring things outside of their sphere of experience, we wouldn't be in this fix.
|
Post #198,519
3/14/05 12:13:47 PM
|
Now how about addressing my example
Did they use two digits to store the year because they were xenophobic?
|
Post #198,531
3/14/05 12:46:32 PM
|
The best explanation that I've seen of why 2 digits...
is at [link|http://www.perl.org/about/y2k.html|http://www.perl.org/about/y2k.html].
Just yesterday I was filling out paperwork and encountered 2 digit years. On paper.
Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #198,577
3/14/05 2:50:01 PM
|
No, but they were xenophobic etc
because they expressly disregarded the potential desire and need of people to write software in a non-European language, and even in a lot of European ones.
A very little bit of time having a discussion with a competent typographer back in the fifties could have avoided the whole mess very easily, but the people making these decisions didn't think they needed to look outside their sphere of expertise when they were coming up with a lot of this stuff.
For example, how do you do even basic typesetting in French with only 7-bit ASCII? Simply put, you don't; it can't be done without coming up with extensions.
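As a small sketch of that limitation (Python used for brevity, sample word chosen for illustration): a French word with accented letters cannot be encoded as 7-bit ASCII, and dropping the letters that do not fit mangles the word.

```python
word = "\u00e9l\u00e8ve"  # "élève" (pupil)

# Strict ASCII encoding refuses the accented letters outright.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Silently discarding what does not fit leaves something unreadable.
lossy = word.encode("ascii", errors="ignore").decode("ascii")
```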
--
-------------------------------------------------------------------
* Jack Troughton jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca] [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
|
Post #198,588
3/14/05 4:12:18 PM
|
xenophobic's probably the wrong word
[link|http://dictionary.reference.com/search?q=xenophobic|xenophobic]: having abnormal fear or hatred of the strange or foreign.
I highly doubt they feared or hated people in other countries. I bet it has more to do with the fact that people in the US don't often use other languages, so it wasn't anything they'd be concerned with when they designed ASCII - which of course stands for the American Standard Code for Information Interchange.
Darrell Spice, Jr. [link|http://spiceware.org/gallery/ArtisticOverpass|Artistic Overpass]
[link|http://www.spiceware.org/|SpiceWare] - We don't do Windows, it's too much of a chore
|
Post #198,596
3/14/05 5:02:32 PM
|
Yeah, you're right
ignorant would be a better word.
That said, they should've known better; as I said, a five minute discussion of this standard with any competent typographer would have let them know why it was broken.
--
-------------------------------------------------------------------
* Jack Troughton jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca] [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
|
Post #198,614
3/14/05 5:35:25 PM
|
How about "excessively humble"?
They used two digits for the year because they didn't think their code would still be in use that far in the future. They assumed that when someone 40 years in the future wanted to program a computer they'd write something for it.
They probably thought the same thing about language. At the time, they were writing for their specific piece of hardware. They assumed that when someone wanted to run their own new computer they'd write their own language and operating system to do it, just like everyone else always had.
|
Post #198,621
3/14/05 6:09:09 PM
|
Look, the point about the two digits for a year is well
taken, but the point about ASCII is not. It was supposed to be a standard that could be used by anybody, but ended up only being really usable by people who speak and use only English. The idea that they only expected Americans to use it is easily debunked by reading any of the literature extant at the time... what they did expect is that only Americans would program it.
--
-------------------------------------------------------------------
* Jack Troughton jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca] [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
|
Post #200,492
3/25/05 2:15:41 PM
|
Disagree
ASCII was supposed to be a standard for American Information Interchange (as pointed out above), hence the name. And being first it followed the now-increasingly-popular YAGNI rule.
Hey, I didn't notice that IBM (International Business Machines) fixed the problem with its highly vaunted and hopelessly idiotic EBCDIC, either.
When ASCII became problematic due to its limitations, it was extended (in the best style of Micros~1); now we have Unicode, and a whole raft of ISO 8859's. And then there are the Asians: JIS, Shift-JIS, GB2312, GB18030, Big 5 et al. So now you can have your \ufffd in any one of several dozen encoding schemes. Ain't choice grand?
jb4 shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT
|
Post #198,640
3/14/05 7:27:23 PM
8/21/07 6:31:43 AM
|
Maybe...
shortsighted [link|http://dictionary.reference.com/search?q=ethnocentric|ethnocentrists].
After all Information Interchange implies you might want to exchange it with someone. Not those nobodies - only our kind of someones.
Whatever. Anyhow that collection of cheap hacks is way broken.
|
Post #198,652
3/14/05 8:02:42 PM
|
How about simply "provincial".
Solving their own problems and not the world's.
Do you forget that 6 computers were going to satisfy all the computational needs of the world?
And no one has yet mentioned IBM's EBCDIC - Extended Binary Coded Decimal Interchange Code which came out in 1964 or so. These came in "national" flavors.
Hey, 2nd generation computers used 6-bit codes for characters (related to punch card codes) - uppercase letters, numbers, and a few punctuation marks. [link|http://www.cs.uiowa.edu/~jones/cards/codes.html|Link].
It's evolution, guys!
Alex
The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt. -- Bertrand Russell
|
Post #198,615
3/14/05 5:37:55 PM
|
The people who coded for teletypes and green terminals
should have been asking "a typographer"? They had no idea that what they were doing would be used in "typography" one day. Remember, it was called a "computer" back then. You don't need too many symbols to "compute".
--
And what are we doing when the two most powerful nations on earth -- America and Israel -- stomp on the elementary rights of human beings?
-- letter to the editor from W. Ostermeier, Liechtenstein
|
Post #198,622
3/14/05 6:10:49 PM
|
Yes, a typographer
Text layout on a green screen is still text layout. Hence the comment about "ignorant outside their area of specialty".
Another word that one might use could be uncultured. And, if they'd been a little less ignorant or a little more cultured then the current internationalisation mess (and it is a mess) could have been very easily avoided.
--
-------------------------------------------------------------------
* Jack Troughton jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca] [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
|
Post #198,638
3/14/05 7:23:05 PM
|
Internationalization would not have been so easy
A lot of people wrote a lot of programs that were a lot simpler and more compact because they were able to assume that, for instance, one byte was one character. Back in the 70s it made little sense to waste precious time and space in processing text to take into account issues that would become critical decades later.
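To show what the "one byte is one character" assumption buys and what breaks without it, here is a small sketch in Python. The sample word is an assumption chosen for illustration. Cutting a UTF-8 byte string at a fixed byte offset can split a character in half, which is exactly the bookkeeping older byte-per-character code never had to do.

```python
text = "na\u00efve"          # "naïve": 5 characters
raw = text.encode("utf-8")   # 6 bytes, because U+00EF takes two

# Truncating at 3 bytes lands in the middle of the two-byte character.
truncated = raw[:3]
try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Truncating by characters is safe, but requires decoding first.
first_three_chars = text[:3]
```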
Furthermore you're acting as if i18n issues are something that are easily dealt with if you've chosen to deal with them. This seems to me to be absurdly wrong.
Getting i18n right takes a lot of knowledge, and speaking to a random competent typographer wouldn't magically solve the problem. Oh, it might solve your problem, sorta. You could just add some extra characters for some European languages. But you quickly get to too many characters for one byte, and don't handle people who use different alphabets. You've also made life harder for whoever wanted to solve the problem for real later. (Something like UTF-8 would have been impossible.) You could take a few to use for combining characters (letting people say something like `e and have it be one character), but you'd probably miss the issue of multiple combining characters. You could try to create an extensible system, and invariably you'd overdesign.
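The combining-character issue can be sketched with Python's standard `unicodedata` module (Unicode's eventual answer, not anything available in the 70s). The sample strings are assumptions chosen for illustration: one base letter with a single accent composes into one code point, but a stack of two marks on a letter with no precomposed form stays as three.

```python
import unicodedata

# "e" followed by COMBINING GRAVE ACCENT: two code points, one visible character.
decomposed = "e\u0300"
composed = unicodedata.normalize("NFC", decomposed)  # precomposed U+00E8

# Two combining marks (tilde, then acute) stacked on "q". No precomposed
# character exists for this, so NFC leaves all three code points in place.
stacked = unicodedata.normalize("NFC", "q\u0303\u0301")
```

A scheme that reserved a few code values for single accents would have handled the first case and missed the second.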
In fact this is an example where I'd accuse Todd of setting impossible standards for others. Suppose that they did try to solve i18n back in the 70's. Inevitably it would have caused computers to waste memory and run more slowly, and the design would have sucked (without experience trying to solve the problem, you're unlikely to come up with the right abstraction). The alternative is to try to solve the problem incrementally - solve the problem that you have now, now, in a way that can be extended later when you have a better idea how to do it. Normally Todd would be all over that approach, but not in this case. Because many different people tried to improve the system independently, each of them solved their own problem, and their solutions conflict.
So if they overabstracted then he'd blast them for overabstracting and coming up with a bad solution, while what they did gets them accused of being xenophobic and causing a long-term mess. Neither way could they win.
If you think that my summary is wrong, tell me what you think should have been done, within the needs and limitations of 70's technology, to solve i18n. Or did you only care about them finding a practical solution to your problems, leaving other people in the cold? (In which case you're no better than they were...)
Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #198,650
3/14/05 7:59:49 PM
|
Text layout in an 80 by 24 grid of monospaced font?
Gimme a break. Anyone involved in typesetting would laugh the questioner out of the door.
|
Post #198,752
3/15/05 11:18:10 AM
|
Phone books back then
were pretty much like that... and they used typesetters to figure out the best way to lay them out so they'd be easy to read and easy to use.
Well designed text layouts and badly designed text layouts for that sort of screen were prevalent, and were looked at as a design problem. That's about putting text into a space, and that is typesetting. Given the parameters of a problem, a competent typesetter would most definitely not laugh them out of the room... unless of course they'd been expected to work for free.
--
-------------------------------------------------------------------
* Jack Troughton jake at consultron.ca *
* [link|http://consultron.ca|http://consultron.ca] [link|irc://irc.ecomstation.ca|irc://irc.ecomstation.ca] *
* Kingston Ontario Canada [link|news://news.consultron.ca|news://news.consultron.ca] *
-------------------------------------------------------------------
|
Post #198,625
3/14/05 6:20:45 PM
|
Please don't use the letter "e" in your code.
After all, you don't need many symbols to compute.
That's what 7-bit ASCII does to the French language.
Peter [link|http://www.ubuntulinux.org|Ubuntu Linux] [link|http://www.kuro5hin.org|There is no K5 Cabal] [link|http://guildenstern.dyndns.org|Home] Use P2P for legitimate purposes!
|
Post #198,651
3/14/05 8:01:39 PM
|
I certainly used to do without "e"
"E" works just fine, thank you very much. Another case in point: on the Russian version of the VT52 we had a button: you had a choice of capital/small English vs. capital English/capital Russian. That's the kind of resources people used to work with.
|
Post #198,846
3/15/05 5:52:09 PM
|
I couldn't use "e" either ...
"That's what 7-bit ASCII does to the French language."
You had seven bits to work with? I was working with 6 bit CDC display code. You couldn't even do the letter "e" on that machine, you had to be content with "E". That's ok, we didn't have any printers that could print lower case anyways. (Hmmm ... I recall some terminals that didn't do lower case either).
On the other hand, the other machine I worked on encoded file names in RAD50 (Radix 50). 50 octal is 40 decimal. 40**3 = 64000, which is less than 2**16. Encoding characters in RAD50 meant that you could get 3 (yes, 3!) characters in a 16 bit word. A file name (6 character basename and 3 character extension) all fit into 3 words.
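The arithmetic above can be sketched in a few lines of Python. The 40-symbol table below is an assumption (a PDP-11 style ordering as commonly documented, with "%" standing in for the unused slot); the function names are hypothetical. What matters is that three base-40 digits always fit below 2**16.

```python
# Assumed 40-symbol table; a character's index is its code value.
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def pack3(text):
    """Pack up to three characters into one integer below 40**3."""
    padded = text.upper().ljust(3)
    if len(padded) != 3:
        raise ValueError("at most three characters per word")
    word = 0
    for ch in padded:
        word = word * 40 + RAD50_CHARS.index(ch)  # ValueError if not in table
    return word

def unpack3(word):
    """Recover the three characters from a packed word."""
    chars = []
    for _ in range(3):
        word, index = divmod(word, 40)
        chars.append(RAD50_CHARS[index])
    return "".join(reversed(chars))

def pack_filename(base, ext):
    """Pack a 6-character base name and 3-character extension into 3 words."""
    if len(base) > 6:
        raise ValueError("base name is limited to six characters")
    padded = base.upper().ljust(6)
    return [pack3(padded[:3]), pack3(padded[3:]), pack3(ext)]
```

For example, `pack_filename("README", "TXT")` yields three integers, each small enough for a 16-bit word, and `unpack3` reverses each one.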
I remember spec'ing out an early system that was going to use ASCII to represent the data. It was a big deal. One of the users of the system pulled me aside one day and wondered what all the fuss was with this "ASK TWO" stuff.
To go from fighting proprietary character encodings to a standard that could talk to the computer across the room was a big step forward. That they couldn't talk to computers across the ocean yet was not really on the horizon.
-- -- Jim Weirich jim@weirichhouse.org [link|http://onestepback.org|http://onestepback.org] --------------------------------------------------------------------- "Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)
|
Post #200,490
3/25/05 2:06:33 PM
3/25/05 2:20:51 PM
|
Oh, come ON already
We know how you feel about any language whose name starts with the letter C, but let's get on about it, OK? (Especially when helping someone...)
C, written in the early '70s, far antedated even foreign computing. When the ANSI C standard was ratified in 1989, Unicode was only a glimmer in the eyes of its designers. Nonetheless, C had still given it some thought, with the advent of locale-specific functions and so on. Remember that the state of the art then was Latin-n and JIS and Shift-JIS. (Who cared about China? It was so backwards, it'd never get on board with ubiquitous computing.)
Well, then came the internet, and damn if talking to and between non-Latin-character-writing folks became important. So did C++. C++ defines as part of its standard the ability to determine locales via facets, and defined the wchar_t type specifically for handling the (then) two-byte code points of the nascent Unicode and other wide-character encoding schemes that were becoming more widely used. That's why, as you pointed out, "If you are working in C++, you just use the string object and the right things happen". That wouldn't be the case if those xenophobic white men had behaved in the way you described, now would it?
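The "(then) two-byte" qualifier matters, because Unicode later grew past 16 bits. A short sketch in Python (sample characters chosen for illustration) shows a character that needs two 16-bit units in UTF-16, which is the case a fixed two-byte wide character cannot represent in a single unit.

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, beyond the 16-bit range
needs_more_than_16_bits = ord(clef) > 0xFFFF

# Count 16-bit code units: encode without a byte-order mark, two bytes per unit.
clef_units = len(clef.encode("utf-16-le")) // 2

# A character inside the original 16-bit range still takes a single unit.
e_acute_units = len("\u00e9".encode("utf-16-le")) // 2
```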
And no, they did not update the existing legacy C library, explicitly in order to maintain backwards compatibility. Right decision? I dunno...but since there is a workaround, it seems like an OK choice at this point.
[Edit: Fixed typos so the 2nd paragraph made some sense]
jb4 shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT
Edited by jb4
March 25, 2005, 02:20:51 PM EST
|
Post #200,809
3/27/05 9:56:44 PM
|
The C++ standard i18n library is awful
ICU is MUCH better and easier to use.
C++ designers have gone off the deep end of the complexity curve.
|
Post #200,879
3/28/05 12:13:20 PM
|
Don't know ICU
but I will agree that the C++ library complexity is...well, high.
jb4 shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT
|
Post #200,887
3/28/05 12:33:42 PM
|
ICLRPD (new thread)
Created as new thread #200886 titled [link|/forums/render/content/show?contentid=200886|ICLRPD]
|
Post #200,914
3/28/05 2:44:08 PM
|
You can find it here
[link|http://www-306.ibm.com/software/globalization/icu/index.jsp|http://www-306.ibm.c...ion/icu/index.jsp]
Incidentally, NextStep had a fully compliant Unicode string object for Objective-C from the beginning (1990-ish).
What is C++'s excuse? Particularly given that its creator was a European? Definitely short-sighted tunnel vision.
|
Post #201,147
3/29/05 7:59:43 PM
|
Time line?
Would you give me the time line of NextStep and its evil twin Objective-C vs that of C++?
I think you'll find the answer there....
jb4 shrub·bish (Am., from shrub + rubbish, after the derisive name for America's 43rd president; 2003) n. 1. a form of nonsensical political doubletalk wherein the speaker attempts to defend the indefensible by lying, obfuscation, or otherwise misstating the facts; GIBBERISH. 2. any of a collection of utterances from America's putative 43rd president. cf. BULLSHIT
|
Post #201,150
3/29/05 8:07:25 PM
|
Released in 1988
I recall buying a copy of NextWorld #1 (from the MacWorld people) while working for a little Mac utilities company in Boulder in 1990 and showing it around the office.
Unicode from the beginning I'm told (I started working with them in 1997).
|
Post #198,522
3/14/05 12:16:18 PM
|
Actually, Algol 68 was designed from the ground up
to be localized. The formal grammar was written in a language-neutral way, and every keyword could be translated into many languages. It was so unwieldy that not many people used it, at least in Russia.
|