"byte string" and "string"???

"In Python, particularly when you are dealing with text data, you more or less need to know about "string" and "byte string". So what is the difference?"


  • String - A sequence of characters, each identified by a code point (Unicode or ASCII). It cannot be stored on a hard disk directly; it first has to be converted to a "byte string" using an encoding such as UTF-8, UTF-16, or UTF-32.
  • Byte string - A sequence of bytes, which can be stored on a hard disk.
  • Mapping - What bridges the gap between "string" and "byte string" is a mapping mechanism, which we call "encoding".
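
As a minimal sketch of the three ideas above, here is a Python 3 string turned into a byte string and back. The sample text "héllo" is just an illustration and not from the original post; `str.encode` and `bytes.decode` are the standard built-in methods.

```python
# A str is a sequence of Unicode code points; bytes is a sequence of integers 0..255.
text = "héllo"                   # str: 5 characters (sample text, chosen for illustration)
data = text.encode("utf-8")      # bytes: the encoded form that can be stored

print(type(text).__name__, len(text))   # str 5
print(type(data).__name__, len(data))   # bytes 6 (é takes two bytes in UTF-8)
print(data)                             # b'h\xc3\xa9llo'

# Decoding with the same encoding recovers the original string.
restored = data.decode("utf-8")
print(restored == text)                 # True
```

Note that the lengths differ: the string counts characters, while the byte string counts bytes.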

"To make it even clearer, we can think of it as follows:"

  • There are Unicode and ASCII for character representation (each is a set of characters with assigned numbers, and Unicode is a giant one).

  • In order to store characters from those sets on the hard disk and retrieve them again, we have to use an encoding such as UTF-8, UTF-16, or UTF-32 as the translator in between.

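  • What is a code point? It is the number that a character set assigns to each character.
    ASCII

    • ASCII contains 128 code points, ranging from 0x00 to 0x7F in hexadecimal. Extended ASCII contains 256 code points, ranging from 0x00 to 0xFF.

    Unicode

    • Unicode contains 1,114,112 code points, ranging from 0x0 to 0x10FFFF in hexadecimal. The Unicode code space is divided into 17 planes (the Basic Multilingual Plane plus 16 supplementary planes), and each plane has 65,536 (= 2^16) code points. The total is therefore 17 × 65,536 = 1,114,112.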
  • All in all, when working with text in Python, we need a way to represent the text content, i.e. to show the text on the screen or print it on paper. How? Unicode assigns a code point to every character you see, such as "A, B, 1, 2" or even "宣雄民" (my name). Hence, Unicode is a giant set that aims to contain all characters and symbols.

    Then what are these UTF-8/16/32 things?
    Well, they convert the text you can see on the screen or on paper into binary data, i.e. bytes (data that can be stored on a hard disk), and they convert those bytes back into text we can see on the screen or print on paper.
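
To make the code point numbers and the role of UTF-8/16/32 concrete, here is a small Python 3 sketch. It uses the built-in `ord` and `chr` plus `str.encode`. The `-le` (little-endian) codec variants are my own choice so that no byte order mark is added to the byte counts; plain "utf-16" and "utf-32" would prepend one.

```python
# Code points: ord() gives the number behind a character, chr() goes back.
print(hex(ord("A")))      # 0x41
print(hex(ord("宣")))     # 0x5ba3
print(chr(0x6C11))        # 民

# Size of the Unicode code space: 17 planes of 2**16 code points each.
print(17 * 2**16)         # 1114112
print(0x10FFFF + 1)       # 1114112

# The same string becomes different byte strings under different encodings.
name = "宣雄民"
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = name.encode(encoding)
    print(encoding, len(encoded), encoded.decode(encoding) == name)
# utf-8 9 True
# utf-16-le 6 True
# utf-32-le 12 True
```

The string is always three characters long; only the number of bytes changes with the encoding.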


That's it!!!
string - Unicode, the concrete representation for human eyes.
byte string - the result of applying an encoding rule (UTF-8/16/32), the thing that can be stored.
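
As a rough end-to-end sketch of "things that can be stored", the following Python 3 snippet writes a string to disk as bytes and reads it back. The temporary file created with `tempfile.mkstemp` and the sample text are assumptions for illustration only; any writable path would do.

```python
import os
import tempfile

text = "A, B, 1, 2, 宣雄民"

# Create a throwaway file (illustrative only) and close the raw descriptor.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Binary mode: we do the encoding ourselves and write raw bytes.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Reading in binary mode gives a byte string back, not a string.
    with open(path, "rb") as f:
        raw = f.read()
    print(type(raw).__name__)     # bytes

    # Text mode: open() decodes for us when told which encoding to use.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()
    print(loaded == text)         # True
finally:
    os.remove(path)
```

Either way, what sits on the disk is the byte string; the string only exists again after decoding.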
