python - encoding

  • Why does Python print unicode characters when the default encoding is ASCII?

Terminology

  • What’s the difference between an “encoding,” a “character set,” and a “code page”?
  • Character sets, maps and code pages

Character set

A term that should not be used.[1]

  • A “character set” is just what it says: a properly-specified list of distinct characters.
  • A “character set” in HTTP (and MIME) parlance is the same as a character encoding (but not the same as CCS).

Encoding

  • An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.
  • UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set.
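
A minimal Python 3 sketch of the distinction above: `ord` gives the Unicode code point (the character-set side), while `str.encode` gives the bytes under one particular encoding. The sample character and the three encodings are arbitrary choices for illustration.

```python
# A character's identity in Unicode is its code point, independent of any encoding.
ch = "\u00e9"                    # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)             # 0xE9

# The same code point has different byte representations in different encodings.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-le")
latin1_bytes = ch.encode("latin-1")

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_bytes)

# Decoding with the matching encoding recovers the original character.
round_trip = utf8_bytes.decode("utf-8")
```

The code point never changes; only the byte sequence does, which is why UTF-8 is described as an encoding rather than a character set.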

Code page

  • A code page is a table of values that describes the character set used for encoding a particular set of glyphs.[2]
  • Code page is another name for character encoding. It consists of a table of values that describes the character set for a particular language.[3]
  • Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows systems from the 1980s and 1990s.
  • In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
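
To make the "128 and on up" point concrete, here is a small sketch that decodes the same byte under three code pages using Python's built-in codecs. The choice of byte and of code pages is only illustrative.

```python
raw = b"\xe9"  # a single byte with a value of 128 or above

# Python codec names for a few code pages.
codecs_to_try = ("cp1252", "cp1251", "cp437")

# The same byte maps to a different character under each code page.
decoded = {codec: raw.decode(codec) for codec in codecs_to_try}
for codec, char in decoded.items():
    print(codec, char)

# Bytes below 128 decode identically under all three.
ascii_results = {b"A".decode(codec) for codec in codecs_to_try}
```

Under cp1252 the byte is a Latin letter, under cp1251 a Cyrillic letter, and under cp437 a Greek letter, while `b"A"` is "A" everywhere.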

ANSI

I have been misunderstanding the ANSI encoding.

  • The name “ANSI” is a misnomer, since it doesn’t correspond to any actual ANSI standard, but the name has stuck.[4]
  • There’s no one fixed ANSI encoding; there are lots of them. Usually when people say “ANSI” they mean “the default locale/code page for my system”, which is obtained via Encoding.Default (in .NET) and is often Windows-1252 but can be another code page.[5]
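
Since the quote above refers to a .NET API, here is a rough Python counterpart, offered as a sketch rather than an exact equivalent: `locale.getpreferredencoding` reports the encoding Python associates with the current locale, and on Windows CPython also exposes the active ANSI code page as the `mbcs` codec. The result depends entirely on the machine and on Python's settings (UTF-8 mode, for instance), so no particular value should be assumed.

```python
import locale
import sys

# Encoding associated with the current locale. On Western European Windows
# systems this is commonly "cp1252"; on Linux and macOS it is often "UTF-8".
preferred = locale.getpreferredencoding(False)
print("locale preferred encoding:", preferred)

# Windows only: the "mbcs" codec encodes with whatever the ANSI code page is.
# errors="replace" avoids an exception if that code page lacks the character.
if sys.platform == "win32":
    print("\u00e9".encode("mbcs", errors="replace"))
```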

UTF-8

The intuition behind UTF-8’s coding scheme.[6]

The basic rules are these:

  1. If a byte starts with a 0 bit, it’s a single byte value less than 128.
  2. If it starts with 11, it’s the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
  3. If it starts with 10, it’s a continuation byte.
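
The three rules can be sketched as a bit-mask check on a single byte. The function name and the label strings are made up for illustration, and the check looks only at the leading bits; it does not reject bytes that full UTF-8 validation would refuse (0xC0 or 0xC1, for example).

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0-255) by its leading bits."""
    if (b & 0b1000_0000) == 0:
        return "single"          # 0xxxxxxx: value below 128
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"    # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"          # 110xxxxx: first of two bytes
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"          # 1110xxxx: first of three bytes
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"          # 11110xxx: first of four bytes
    return "invalid"             # 11111xxx does not occur in UTF-8

# One character each of 1, 2, 3 and 4 bytes.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
labels = [classify_utf8_byte(b) for b in sample]
print(labels)
```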

This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.
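
A sketch of that backwards search, assuming the input is valid UTF-8; the function name is hypothetical.

```python
def code_point_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that contains data[i]."""
    # Step back while the current byte is a continuation byte (10xxxxxx).
    while i > 0 and (data[i] & 0b1100_0000) == 0b1000_0000:
        i -= 1
    return i

data = "a\u20acb".encode("utf-8")   # 'a', then a three-byte character, then 'b'
starts = [code_point_start(data, i) for i in range(len(data))]
print(starts)
```

Every index inside the three-byte character maps back to the same start index, while the single-byte characters map to themselves.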

Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.
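
The counting trick can be sketched as follows, again assuming valid UTF-8 input. The name `utf8_len` is made up for this example, and the result is a count of code points, which is not always the same as what a user perceives as characters.

```python
def utf8_len(data: bytes) -> int:
    """Count code points in UTF-8 data by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0b1100_0000) != 0b1000_0000)

text = "na\u00efve \u20ac\U0001F600"   # mixes 1-, 2-, 3- and 4-byte characters
encoded = text.encode("utf-8")
print(len(encoded), utf8_len(encoded), len(text))
```

In Python 3, `len(text)` on a `str` already counts code points, so it should agree with `utf8_len(encoded)`, while `len(encoded)` counts bytes.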

Links

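  • Character Encoding Notes: ASCII, Unicode and UTF-8 (字符编码笔记:ASCII,Unicode和UTF-8)
  • On Unicode Encoding: A Brief Explanation of UCS, UTF, BMP, BOM and Related Terms (谈谈Unicode编码,简要解释UCS、UTF、BMP、BOM等名词)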
  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
  • Unicode In Python, Completely Demystified
  • UTF-8 Everywhere
  • Programming with Unicode

  1. https://en.wikipedia.org/wiki/Category:Character_sets
  2. https://en.wikipedia.org/wiki/Code_page
  3. https://en.wikipedia.org/wiki/Code_page
  4. What is ANSI format?
  5. Unicode, UTF, ASCII, ANSI format differences
  6. UTF-8 Continuation bytes
