IWETHEY v. 0.3.0 | TODO

Welcome to IWETHEY!

New Re: What the heck is text?
I am typing text in this forum. Who receives the keyboard event and translates it to the right character on the screen: is it the kernel or Mozilla?


The keyboard sends a set of signals, which vary a bit by keyboard model, when you press a key. The OS takes those signals and translates them into a consistent "keycode". If it's not a command that the OS processes itself, the keycode is then passed on to the active application. The application can then do whatever processing it wants. When it wants to display text, it passes a string of characters on to the display layer, along with font and other display information. The display layer then builds the actual bitmap from that information.

In the simplest case, the application can take the keycode passed by the OS and tack it on to the string it passes to the display layer. But that isn't always true.

Do all the users of this forum type in the same character set? I don't think so, yet to view this forum we all (I am only guessing) tell our browsers to open this forum site in the same character set.

Character encodings and font sets are very ugly in HTML, as the early versions were an English-centric de facto standard. XHTML cleans up most of these issues. Web browsers have to use some guesswork and follow some unwritten conventions for handling these issues in HTML. In many cases the browser has to peek at the web page and try to figure out what the encoding is. In practice, in HTML, ISO-8859-1 is the default if the browser can't find anything else.

On the upload side, the browser is responsible for encoding the text in a standard format before sending the form data to the server.
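
As a rough illustration of that upload step, here is a minimal Python sketch of percent-encoding one form field under two different encodings. The field name and value are made up for the example, and a real browser does considerably more than this.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field, used only for illustration.
form = {"comment": "caf\u00e9"}

# The same field percent-encoded under two different encodings.
as_utf8 = urlencode(form, encoding="utf-8")
as_latin1 = urlencode(form, encoding="iso-8859-1")

print(as_utf8)    # comment=caf%C3%A9
print(as_latin1)  # comment=caf%E9

# The server only recovers the text if it decodes with the same encoding.
decoded = parse_qs(as_latin1, encoding="iso-8859-1")
print(decoded)    # {'comment': ['café']}
```

The bytes on the wire differ for the same text, which is why the server has to know which encoding the browser used.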

Okay, I won't lie, I read something like this: the first 128 characters (the first 7 bits) are common to many character sets, but the second 128 characters are different.

I think even Unicode uses some trick to read ASCII chars.

Okay surprise question?
What the heck is ASCII?

ASCII is an old standard for 8-bit encoding. Many, but not all, character encodings follow the ASCII encodings for English letters and numbers. This allows many programs to work correctly with basic English even if they don't handle encodings correctly.
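
A small Python sketch of that compatibility, assuming the codec names shipped with the standard library. The list of encodings is just a sample, not an exhaustive survey.

```python
text = "Hello, 123"

# A sample of encodings that reuse the ASCII byte values for code points 0 to 127.
encodings = ["ascii", "iso-8859-1", "cp1252", "koi8_r", "utf-8"]
encoded = {name: text.encode(name) for name in encodings}
all_same = len(set(encoded.values())) == 1
print(all_same)  # True

# "Not all": UTF-16 stores the same letters as two bytes each.
utf16_differs = text.encode("utf-16-le") != text.encode("ascii")

# A byte above 127 means different things in different encodings.
high_byte = b"\xe9"
as_latin1 = high_byte.decode("iso-8859-1")  # e with an acute accent
as_koi8r = high_byte.decode("koi8_r")       # a Cyrillic letter
print(as_latin1 != as_koi8r)  # True
```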

Another problem banging my head, when I used to write those silly ...
scanf, printf programs in C, it didn't seem that the compiler bothered
about the character set

C is an old and low-level language. It doesn't really deal with these issues. As far as C is concerned, a string is simply a sequence of bytes. C pretty much assumes that keycodes passed by the OS = string codes = display codes = 8 bits. You can work in other encodings in C, but then you have to use functions that understand your encoding.
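
To show why "a string is a sequence of bytes" breaks down, here is a sketch in Python rather than C (Python makes the byte/character split easy to see). The sample word is arbitrary.

```python
word = "na\u00efve"             # five characters, one of them non-ASCII
as_bytes = word.encode("utf-8")

print(len(word))      # 5 characters
print(len(as_bytes))  # 6 bytes, because the accented letter takes two

# Cutting at a byte offset, as byte-oriented code would, can split a character.
try:
    as_bytes[:3].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)  # False
```

Code that counts or slices bytes is only correct as long as every character is exactly one byte.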

Does Linux have default char set values? Why?
Let's put it differently: does a system have a global char set? Why?

Not really. The OS does have to set some standard for communication between the OS and the applications, but that is independent of what is displayed or what is stored in files. Most OSes use ASCII for communication between the OS and applications. Windows NT and later can use a Unicode system for some interfaces, but I don't know the specifics.

Jay
New I must correct you - ASCII is a 7-bit encoding
The 8th bit is padding. Many mail gateways still strip the 8th bit. The worldwide email system is still only 7-bit safe. Everything else is passed through this ugly old pipe by representing non-ASCII data in ASCII, via mechanisms like base64 and entities.

At least, this was true 5 years ago when I last investigated the problem in detail.
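
A minimal Python sketch of the two mechanisms mentioned, assuming the text starts out as UTF-8. Real mail software wraps this in MIME headers, which are left out here.

```python
import base64

message = "na\u00efve caf\u00e9"
raw = message.encode("utf-8")

# The raw UTF-8 bytes use the 8th bit, so a 7-bit gateway could mangle them.
needs_eighth_bit = any(byte >= 128 for byte in raw)

# base64 re-expresses the same bytes using only ASCII characters.
b64 = base64.b64encode(raw)
seven_bit_safe = all(byte < 128 for byte in b64)

# Numeric character references ("entities") are another ASCII-only form.
entities = message.encode("ascii", "xmlcharrefreplace")
print(entities)  # b'na&#239;ve caf&#233;'

# base64 is lossless: decoding gives the original text back.
round_trip = base64.b64decode(b64).decode("utf-8")
```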



"Whenever you find you are on the side of the majority, it is time to pause and reflect"   --Mark Twain

"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."   --Albert Einstein

"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses."   --George W. Bush
Edited by tuberculosis Aug. 21, 2007, 06:30:04 AM EDT
New Whoa, there.

Character encodings and font sets are very ugly in HTML, as the early versions were an English-centric de facto standard. XHTML cleans up most of these issues. Web browsers have to use some guesswork and follow some unwritten conventions for handling these issues in HTML. In many cases the browser has to peek at the web page and try to figure out what the encoding is. In practice, in HTML, iso-8859-1 is the default if the browser can't find anything else.


Actually, XHTML makes it more complicated. If you serve XHTML without sending the charset parameter in your Content-Type header, then the MIME type of the document can determine the character encoding to be used for parsing. If you send text/html or text/xml, a conforming parser must assume that the document is encoded in us-ascii, no matter what you've specified inside it; you have [link|http://www.ietf.org/rfc/rfc3023.txt|RFC 3023] and the legacy of text/* media types and transcoding proxies to thank for that.


If you send application/xhtml+xml or application/xml, then the receiving parser is allowed to read the XML prolog inside the file, but if that's not present then the character set must be assumed to be utf-8 or utf-16, depending on whether the document begins with a byte-order mark; you're not allowed to look at the meta tags for this information.
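
The rules described above can be sketched roughly as follows in Python. The function name xml_charset and its parameters are hypothetical, and this mirrors only the reading of RFC 3023 given in this post, not the full specification.

```python
import re

# Matches an encoding declared in an XML prolog at the start of the document.
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def xml_charset(mime_type, charset_param=None, body=b""):
    """Pick an encoding from the MIME type, charset parameter and first bytes."""
    # An explicit charset parameter in Content-Type always wins.
    if charset_param:
        return charset_param.lower()
    # text/* types: assume us-ascii, ignoring anything inside the document.
    if mime_type in ("text/html", "text/xml"):
        return "us-ascii"
    # application/* types: a UTF-16 byte-order mark decides first.
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # Skip a UTF-8 byte-order mark before looking for the prolog.
    if body.startswith(b"\xef\xbb\xbf"):
        body = body[3:]
    match = XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # No prolog and no UTF-16 mark: fall back to utf-8. Meta tags are never read.
    return "utf-8"

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'
print(xml_charset("text/xml", body=doc))               # us-ascii
print(xml_charset("application/xhtml+xml", body=doc))  # iso-8859-1
```

The same document gets two different encodings depending only on the MIME type it was served with.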


This is one of the reasons why Mark Pilgrim claimed [link|http://www.xml.com/pub/a/2004/07/21/dive.html|XML on the web has failed], and it just barely represents the tip of the iceberg as far as character-encoding issues in HTML, XHTML and XML are concerned.

--
You cooin' with my bird?
[link|http://www.shtuff.us/|shtuff]
New Your right mostly
I said that XHTML cleans up the issue, not that it made it simpler. And you are right, XHTML blew its chance by failing to specify a good solution to the problem. However, XHTML at least has a mandatory specification standard.

With HTML the browser really has to guess in many cases. The current method of reading the file until you find a content-type tag, then starting over and reading the file again in the specified encoding, is horribly ugly and depends on no non-ASCII characters appearing at the top of the file.
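
A rough Python sketch of that read-then-restart approach. The names sniff_html_encoding and decode_html, the 1024-byte window and the regular expression are all assumptions for illustration; real browsers use a much more involved algorithm.

```python
import re

# Looks for charset=... inside a meta tag. This only works if everything
# before the tag is plain ASCII, which is the weakness described above.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)',
                          re.IGNORECASE)

def sniff_html_encoding(raw, default="iso-8859-1", window=1024):
    """First pass: scan the start of the file as raw bytes for a declaration."""
    match = META_CHARSET.search(raw[:window])
    if match:
        return match.group(1).decode("ascii").lower()
    return default

def decode_html(raw):
    """Second pass: start over and decode the whole file with that encoding.
    An unrecognised encoding name would raise LookupError here."""
    return raw.decode(sniff_html_encoding(raw), errors="replace")

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"><title>caf\xc3\xa9</title>')
print(sniff_html_encoding(page))                             # utf-8
print(sniff_html_encoding(b"<html><body>hi</body></html>"))  # iso-8859-1
```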

Jay
     What the heck is text? - (systems) - (56)
         It depends on the context. - (Another Scott) - (2)
             Unicode and ASCII - (StevenYap) - (1)
                 Re: Unicode and ASCII - Nitpick II - (jb4)
         you are confusing text with display - (boxley) - (12)
             Uhhh..Not quite, Bill - (jb4) - (11)
                 And that is one thing that sucks about Unicode - (ben_tilly) - (9)
                     At least they're consistent - (jb4) - (8)
                         But it is a problem - (ben_tilly)
                         Except for that full width/half width ascii thing - (tuberculosis) - (5)
                             I dunno... - (jb4)
                             My personal take on it - (jake123) - (3)
                                 Perhaps, but it makes searching tricky - (tuberculosis) - (2)
                                     Well, if it was an easy problem - (jake123)
                                     ICLRPD (new thread) - (jb4)
                         Have you all seen the HUGE unicode poster? - (FuManChu)
                 close enough to debug a table entry :-) - (boxley)
         Text is not as simple as it seems - (ben_tilly)
         This is one thing that Java handles pretty well - (bluke)
         Rule #1 - Everything you think you know is wrong - (tuberculosis) - (29)
             Why xenophobic? - (drewk) - (28)
                 Because they didn't think... - (pwhysall)
                 Because if they had spent any time at all - (tuberculosis) - (25)
                     Now how about addressing my example - (drewk) - (17)
                         The best explanation that I've seen of why 2 digits... - (ben_tilly)
                         No, but they were xenophobic etc - (jake123) - (15)
                             xenophobic's probably the wrong word - (SpiceWare) - (14)
                                 Yeah, you're right - (jake123) - (13)
                                     How about "escessively humble"? - (drewk) - (4)
                                         Look, the point about the two digits for a year is well - (jake123) - (1)
                                             Disagree - (jb4)
                                         Maybe... - (tuberculosis) - (1)
                                             How about simply "provincial". - (a6l6e6x)
                                     The people who coded for teletypes and green terminals - (Arkadiy) - (7)
                                         Yes, a typographer - (jake123) - (3)
                                             Internationalization would not have been so easy - (ben_tilly)
                                             Text layout in 80 by 24 grid of monspaced font? - (Arkadiy) - (1)
                                                 Phone books back then - (jake123)
                                         Please don't use the letter "e" in your code. - (pwhysall) - (2)
                                             I certainly used to do without "e" - (Arkadiy)
                                             I couldn't use "e" either ... - (JimWeirich)
                     Oh, come ON already - (jb4) - (6)
                         The C++ standard i18n library is awful - (tuberculosis) - (5)
                             Dont know ICU - (jb4) - (4)
                                 ICLRPD (new thread) - (drewk)
                                 You can find it here - (tuberculosis) - (2)
                                     Time line? - (jb4) - (1)
                                         Released in 1988 - (tuberculosis)
                 Actually, Algol 68 was designed from the ground up - (Arkadiy)
         Re: What the heck is text? - (JayMehaffey) - (3)
             I must correct you - ASCII is a 7-bit encoding - (tuberculosis)
             Whoa, there. - (ubernostrum) - (1)
                 Your right mostly - (JayMehaffey)
         Using a pencil, it's unambiguous. -NT - (mmoffitt) - (3)
             You haven't seen my handwriting.... -NT - (Another Scott) - (2)
                 Uh-oh. I wouldn't confess that ;0) - (mmoffitt) - (1)
                     My father's handwriting was so bad... - (broomberg)
