Monument ([info]marnanel) wrote,
@ 2005-01-17 22:56:00
All that &stuff;
(because someone was asking)

Computers are very good at dealing with numbers. But how do you represent letters and symbols in a computer? A long time ago people had the idea of using a code where a different number stood for each symbol.

Take the letter "A", for example. In the common code called ASCII, that has the number 65: whenever you get an A on your screen, it's most likely represented as the number 65 inside your computer. B is 66, C is 67, and so on.
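
If you have Python handy, you can check these numbers yourself. This is just a small sketch using Python's built-in `ord` and `chr`; the variable name `letters` is made up for the illustration.

```python
# ord() gives the code number for a character; chr() goes the other way.
print(ord("A"))   # the code number for A
print(ord("B"))
print(chr(67))    # the character whose code number is 67

# Collect the first few capital letters with their code numbers.
letters = {ch: ord(ch) for ch in "ABC"}
print(letters)
```

The same numbers work in Unicode, since it begins with the same symbols as ASCII.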

ASCII has only 128 code numbers in it, and fewer than a hundred of those are printable symbols. Soon, people began to want more, especially people who wanted to write languages other than U.S. English. Some people wanted accented vowels, Spanish speakers wanted the n-tilde, British people wanted the pound sign, and so on. A number of codes were proposed which could deal with some of these symbols, but eventually one code arose that could handle all of them: Unicode. Unicode is huge. It begins with the same symbols as ASCII, but its code numbers go on much higher: they can run as high as 1,114,111, and well over 90,000 of them are already assigned. You can use it to write pretty much any language, and they're still working on adding new ones.

So suppose you're writing an LJ post, and you want to write the sign for pounds sterling. Of course, you could just press the key if you have it on your keyboard, or copy and paste it from somewhere else. But there's another way: in Unicode it has the number 163. HTML lets you write

&#163;


to get £. You can insert any Unicode character this way, even the weird ones like pictures of telephones (which happens to be Unicode number 9743, so you'd write &#9743;). Of course, if your computer doesn't have a picture of a telephone in its font, it won't be able to draw it, so you'll probably just get a box or something. ☏.
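
Here is a rough sketch of the same idea in Python, in case you'd rather have the computer look up the number for you. The helper name `numeric_reference` is invented for this example; `html.unescape` is the standard library function that turns a reference back into its character.

```python
import html

def numeric_reference(ch: str) -> str:
    """Build the decimal HTML reference for a single character."""
    return f"&#{ord(ch)};"

pound_ref = numeric_reference("£")        # pound sterling
phone_ref = numeric_reference("\u260f")   # the white telephone
print(pound_ref, phone_ref)

# Going the other way: decode a reference back into the character.
decoded = html.unescape("&#163;")
print(decoded)
```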

A lot of the characters get used quite often, so HTML has a mnemonic system, too. If you want an e-acute (é), just type

&eacute;


and you won't have to look up the code number for the symbol. There's also &pound; for £, &cent; for ¢, and many others; there are some lists here.
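
Python happens to ship with a table of these names, so here is a small sketch of looking them up. It assumes the HTML 4 name tables in the standard `html.entities` module; the variable names are only for illustration.

```python
import html
from html.entities import codepoint2name, name2codepoint

# From a mnemonic name to its Unicode number.
eacute_number = name2codepoint["eacute"]
print(eacute_number)

# From a Unicode number back to its mnemonic name.
pound_name = codepoint2name[163]
print(f"&{pound_name};")

# Decode a string that uses several mnemonics at once.
decoded = html.unescape("&eacute; &pound; &cent;")
print(decoded)
```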




[info]malver
2005-01-18 04:34 am UTC (link)
And there's a nice page with lists of Unicode-to-symbol correspondences here, if one needs that, which I have. It's more or less useful depending on how many fonts one's computer has, I suppose.



[info]trochee
2005-01-18 04:43 am UTC (link)
a PDF of slides with more than you wanted to know that I did last year.

I give this talk to my lab roughly once per year.



[info]floatyfish
2005-01-18 05:30 am UTC (link)
Thank you for linking to this. I was one of the people who was asking Marn about all of this and now I have all kinds of information to learn. :)



[info]alex_rex
2005-01-18 04:58 am UTC (link)
Another table (see the links in the comments too)

It's in Russian, but I suppose you can understand what the symbols are.



[info]trochee
2005-01-18 05:12 am UTC (link)
One code to rule them all
one code to bind them
one code to bring them all
and in 32-bit bind them

I'm tempted to go on with something about the seven big-endians under the stone and the nine latin-encodings doomed to die, but I'll have to work on that bit later.



[info]marnanel
2005-01-18 05:15 am UTC (link)
:)

There Is No Code But Unicode And UTF-8 Is Its Encoding.



[info]floatyfish
2005-01-18 05:28 am UTC (link)
Thank you so much for doing this, Marn. I actually rather like it when you teach things..you do it well and simply, but with excitement.

And now..I have several questions from this..if you don't mind. They may seem rather simple.

1. Do you have any idea why they started with the number 65 when they did the ASCII codes for letters? It seems so random and thus makes me curious. :)

2. So..when you are writing the HTML code (the unicode ones) for things like the pound sterling and the like..you don't need <> or anything like that around it? How does it know to LookAtThis in a way? I really know nothing about coding and the like and it intrigues me. So..you could type that code out in the middle of anywhere and the computer would just read it as whatever the symbol was?

3. I know my computer does little boxes a lot when you and Fin use some of those little codes in your posts, so does this mean my computer doesn't have the proper font to read that code then? I could update my font and I would see what you were actually putting there?

4. The weird thing is that I saw the telephone in your post when you posted it, but when we were commenting back and forth and the like that day about code..I tried to do the telephone and I didn't get anything. I'll try it here just in case and see... ☏. Although I am on the laptop now, so maybe the laptop has the proper font for the telephone. I guess we'll see.

Thank you for the info and the links, Marn. I'm sorry if I seem rather sillyvirgin at it all, but I am fascinated.



[info]marnanel
2005-01-18 05:55 am UTC (link)
Those are all good questions. I'll take them in order.

1) Like good computer scientists, they were counting in blocks of sixteen rather than blocks of ten when they laid out ASCII. Sixteen is a much more convenient number for machines that count in binary to work with. So we have:

0 to 31: special control codes (like "end transmission")
32 (2x16) to 47: punctuation
48 (3x16) to 63: ten digits, plus some more punctuation to fill in the end
64 (4x16) to 95: 26 capital letters, plus some more punctuation to fill in the end
96 (6x16) to 127: 26 lowercase letters, plus some more punctuation to fill in the end (and one last control code, "delete", at 127)

The capital letters block actually begins with @ for 64 before going on to A for 65: I'm guessing, but I think they began with @ because they were counting from 0 and they wanted A to be number 1. The lowercase letters similarly have another character (`) at the beginning of their block, at 96, so that a is 97. The digits are the other way round: they start right at 48, so that 0 is number 0 in its block.

So that's where the 65 comes from.
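
If you'd like to see the blocks of sixteen for yourself, here is a small Python sketch that splits a few code numbers into "which block" and "position within the block". The function name `ascii_block` is just something made up for the example.

```python
def ascii_block(code: int) -> int:
    """Return which block of sixteen a code number falls in (0 to 7 for ASCII)."""
    return code // 16

# Show where a few characters sit: block number, offset in the block, and binary.
for ch in ["0", "9", "@", "A", "Z", "`", "a", "z"]:
    code = ord(ch)
    print(f"{ch!r}: {code:3d} = 16*{ascii_block(code)} + {code % 16}  binary {code:07b}")

# A is one step past @, and a is one step past `.
a_upper_offset = ord("A") - ord("@")
a_lower_offset = ord("a") - ord("`")
# Capital and lowercase letters are exactly two blocks of sixteen apart.
case_gap = ord("a") - ord("A")
```

In binary the pattern is easier to see: A and a differ in just one bit.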



[info]floatyfish
2005-01-18 06:40 am UTC (link)
Wonderful. Thank you...that makes sense now.

I love how they kept throwing in random punctuation to complete the sets. ;)



[info]marnanel
2005-01-18 06:00 am UTC (link)
2. No, you don't need < > around these. The computer knows to look at these codes because they begin with an & and end with a ;. Only character codes can look like that, so they're quite distinctive.

(If you wanted to ACTUALLY WRITE &something;, you'd have to write it out as &amp;something;, where &amp; is a code for the & symbol. Or you could use its Unicode number, of course. Either way, it wouldn't look like a character code.)
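
For the curious, here is a short sketch of that escaping rule using Python's standard `html` module. The variable names are only for illustration.

```python
import html

# What you want the reader to see, literally.
wanted = "&something;"

# What you have to type: the & itself is written as a character code.
written = html.escape(wanted)
print(written)

# What the browser shows after decoding it once.
shown = html.unescape(written)
print(shown)

# The same thing using the Unicode number for &, which is 38.
shown_numeric = html.unescape("&#38;something;")
print(shown_numeric)
```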

3. Yes and yes.

4. Did it work?



[info]floatyfish
2005-01-18 06:39 am UTC (link)
*hugs the Marn* I want you to teach me computer stuff..you're good at it. *smiles*

2. Ok, so the & and the ; are a bit like the < and the > in all truth, right?

4. It did actually, so the laptop must have that particular font that is needed, whereas our desktop computer does not.



[info]marnanel
2005-01-18 06:43 am UTC (link)
*hugs* yay, thanks.

2. yes, exactly!

4. yay!



[info]darkmoon
2005-01-18 06:21 am UTC (link)
http://www.lunamorena.net/amp.html

A few years ago, before you could count on computers reading unicode properly (or consistently), I decided I needed a way to remember how those characters were written with the names, which seemed to work more consistently across platforms.

I've recently updated it to about twice the size it was before; it now includes every unicode character I could find that has a name.

Exceedingly basic JavaScript makes that page possible... hover over one of the characters, and the little text box populates with the code to get the character.



[info]marnanel
2005-01-18 05:27 pm UTC (link)
Nice work-- I like this.



[info]darkmoon
2005-01-18 05:54 pm UTC (link)
Thanks. :) And if you think of any that I missed (ie, couldn't find), let me know so I can add it. :)



[info]dyddgu
2005-01-18 09:38 am UTC (link)
Thank you so much for that! [info]knirirr's keyboard is in American, and I never knew how to do a pound sign on LJ from home before :D *bounces*



[info]hatter
2005-01-18 11:24 am UTC (link)
Though using numerics for <256 is probably still a bad idea, as browsers and their surrounding pages might make all sorts of assertions or assumptions about which character set to use, and all those characters have names anyway.


the hatter



earthsister
2005-01-18 03:05 pm UTC (link)
You are truly a fountain. Thank you. :)



[info]naltrexone
2005-01-18 09:16 pm UTC (link)
Thanks! Some of the links helped me track down fonts for the unicode ranges that were unsupported on my machine.

Would you believe that some of the airline reservation systems still use EBCDIC under the hood? (Galileo / Apollo comes to mind.) They convert back and forth to ASCII in their API wrappers, so mostly you don't notice. But if you have to bypass their wrappers or write your own, you discover this very quickly...
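
To give a feel for what such a conversion looks like, here is a tiny Python sketch. It assumes the `cp037` code page, one common EBCDIC variant that Python's standard codecs include; which variant any particular reservation system uses is not something this example knows.

```python
text = "A"

# The same letter has a different code number in each code.
ascii_bytes = text.encode("ascii")    # one byte, value 65 (0x41)
ebcdic_bytes = text.encode("cp037")   # one byte, value 193 (0xC1)
print(ascii_bytes[0], ebcdic_bytes[0])

# Converting back and forth, roughly as an API wrapper might.
round_trip = ebcdic_bytes.decode("cp037")
print(round_trip)
```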



[info]marnanel
2005-01-19 03:42 am UTC (link)
Would you believe that some of the airline reservation systems still use EBCDIC under the hood?

From what I've heard about these systems' capacity for doing things in idiosyncratic ways, I'm somehow not awfully surprised :)


