A Brief Look at UTF-8 Encoding

ASCII, GBK, Unicode, and UTF-8

鍦ㄨ绠楁満鍐呴儴锛屾墍鏈変俊鎭渶缁堥兘鏄竴涓簩杩涘埗鍊笺�傛瘡涓�涓簩杩涘埗浣嶏紙bit锛夋湁 0 鍜� 1 涓ょ鐘舵�侊紝鍥犳鍏釜浜岃繘鍒朵綅灏卞彲浠ョ粍鍚堝嚭 256 绉嶇姸鎬侊紝杩欒绉颁负涓�涓瓧鑺傦紙byte锛夈�備篃灏辨槸璇达紝涓�涓瓧鑺備竴鍏卞彲浠ョ敤鏉ヨ〃绀� 256 绉嶄笉鍚岀殑鐘舵�侊紝姣忎竴涓姸鎬佸搴斾竴涓鍙凤紝灏辨槸 256 涓鍙凤紝浠� 00000000 鍒� 11111111銆�

涓婁釜涓栫邯 60 骞翠唬锛岀編鍥藉埗瀹氫簡涓�濂楀熀浜庢媺涓佸瓧姣嶇殑璁$畻鏈虹紪鐮佺郴缁燂紝鐢ㄤ簬鏄剧ず鐜颁唬鑻辨枃銆傜О涓� ASCII锛�American Standard Code for Information Interchange锛岀編鍥戒俊鎭氦鎹㈡爣鍑嗕唬鐮侊級锛屼竴鐩存部鐢ㄨ嚦浠娿��

ASCII 涓�鍏辫瀹氫簡 128 涓瓧绗︾殑缂栫爜锛屽湪璁$畻鏈轰腑甯哥敤涓�涓瓧鑺傝〃绀猴紝瀛楄妭鏈�鍓嶉潰鐨勪竴浣嶇粺涓�瑙勫畾涓�0锛屽悗闈� 7 浣嶆潵琛ㄧず鍏蜂綋鐨勭爜鐐�(code point)銆傚�煎緱涓�鎻愮殑鏄湪 ASCII 涓爜鐐瑰氨鏄湪 ASCII 瀛楃闆嗕腑鐨勫簭鍙凤紝渚嬪澶у啓鐨勫瓧姣�A鍦� ASCII 瀛楃闆嗕腑瀵瑰簲鐨勪簩杩涘埗鏄�01000001锛岃�屽畠鐨� ASCII 鐮佺偣涓� 65锛屽垰濂戒竴涓�瀵瑰簲銆�
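As a small illustration of that correspondence, the sketch below renders one byte as eight binary digits. The helper name `asciiBits` is made up for this example, and the sketch assumes the compiler's execution character set is ASCII-compatible.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: returns the 8-bit binary form of a single byte,
// most significant bit first.
std::string asciiBits(char c)
{
  return std::bitset<8>(static_cast<unsigned char>(c)).to_string();
}
```

Under that assumption, `asciiBits('A')` yields `01000001`, and `static_cast<int>('A')` is 65: the leading bit is 0 and the other seven bits are the code point.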

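ASCII is enough for English, but not for other languages. Chinese, for example, has on the order of 100,000 characters, so the Chinese government introduced GB 2312 (Code of Chinese Graphic Character Set for Information Interchange, Primary Set; a national standard, or guobiao). It consists mainly of two parts: a coded character set and an encoding scheme. The details are out of scope here, but roughly speaking it is like a Unicode that has only one encoding form, say UTF-16. Note that a two-byte scheme such as GB 2312 can in theory represent at most 256 x 256 = 65536 characters, and the standard itself defines far fewer than that, so what we commonly use today is the GBK character set (Chinese Internal Code Extension Specification (GBK), version 1.0). GBK itself is not a national standard. It is an extension introduced by Microsoft (operating systems evolved much faster than national standards, so OS vendors had to solve users' pain point first), which is why its name carries no standard number after it.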

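So what is the Unicode mentioned above? Just as the Chinese government introduced the GB 2312 character set, other countries and multinational companies naturally introduced character sets of their own. Think of a character set as a classroom: each student sitting at a desk is a character, and each student's ID number is a code point. Different classrooms have their own rules for numbering students, and the same student may sit in a different seat in each classroom, so the same ID number is likely to identify a different student from one classroom to the next. People therefore urgently needed a single scheme that puts every student in the world into one classroom and gives each a unique ID number, so that any student can be found easily. That scheme is the Unicode character set: as its name suggests, it is one character set for all characters.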

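This raised a series of problems of its own. First, Unicode, as an independent organization, wanted to unify text encodings and character set standards worldwide, yet it could not abolish the existing regional encoding schemes. It chose to create a completely independent numbering, Unicode scalar values, which is clearly distinct from the internal-code values of ASCII and the other schemes we are used to. Then, for compatibility with other mainstream schemes, it introduced the Unicode Transformation Formats (UTF), the common ones being UTF-8, UTF-16, and UTF-32. UTF-32 is a fixed four-byte encoding whose values correspond one-to-one with Unicode scalar values, which is rather elegant; UTF-16 switches between two-byte and four-byte forms; UTF-8 is variable-length and is compatible with ASCII in its single-byte form. Second, early Unicode did not anticipate how many characters would eventually be added. Take the 👨👩👧 (family) emoji: for various reasons, people were not satisfied with a family made up only of a man, a woman, and a girl or boy, so more had to be added: 👩👩👦 (woman, woman, boy), 👩👩👧 (woman, woman, girl), 👩👩👧👦 (woman, woman, girl, boy), 👨👨👧👦 (man, man, girl, boy), 👨👨👧👧 (man, man, girl, girl), and so on. Later still, a single fixed skin tone was no longer acceptable either, so variants for other skin tones (and, one imagines, aliens) had to follow.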

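For reasons of economy (English text that fits in ASCII wastes space under UTF-32's fixed four bytes) and of future growth (even four bytes may not be enough to hold an ever-growing number of Unicode characters), UTF-8 has become the most widely used Unicode encoding.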

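UTF-8 rules

The UTF-8 encoding rules are simple. There are only two:

  1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's code point. For English letters, therefore, the UTF-8 encoding is identical to the ASCII encoding.
  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits not mentioned above hold the symbol's code point.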

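Conversion table between Unicode and UTF-8 (x marks the bits occupied by the code point)

| Code point bits | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|---|---|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
| 26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

The five- and six-byte rows reflect the original UTF-8 design. The current definition (RFC 3629) limits UTF-8 to four bytes and to code points no higher than U+10FFFF.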
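The two rules and the table above can be sketched as a small encoder. This is an illustrative sketch, not a production implementation: the name `encodeUtf8` is made up here, and the function covers only the one- to four-byte forms, returning an empty string for surrogates (U+D800 to U+DFFF) and for values above U+10FFFF.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
// Returns an empty string if cp is a surrogate or exceeds U+10FFFF.
std::string encodeUtf8(char32_t cp)
{
  std::string out;
  if (cp <= 0x7F)
  {
    out += static_cast<char>(cp);                         // 0xxxxxxx
  }
  else if (cp <= 0x7FF)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
  }
  else if (cp <= 0xFFFF)
  {
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return out;                                         // surrogate, not a scalar value
    out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
  }
  else if (cp <= 0x10FFFF)
  {
    out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F)); // 10xxxxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
  }
  return out;
}
```

For example, U+4E2D (中) needs 15 bits, so it falls in the three-byte row and `encodeUtf8(0x4E2D)` produces the bytes E4 B8 AD, while U+1F468 (👨) falls in the four-byte row and produces F0 9F 91 A8.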

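Things to watch out for

  1. Chinese characters are not always three bytes long in UTF-8

    I often see even experienced programmers reason that a Chinese character takes two bytes in GBK and three bytes after conversion to UTF-8, and conclude that a Chinese character in UTF-8 is always three bytes long. That is not actually true: it depends on whether the character lies in Unicode's Basic Multilingual Plane. Rare characters can take 4 bytes (a UTF-8 character may be 1 to 4 bytes long). The rule of thumb mostly holds only because the GBK standard was published early, so nearly all of its characters lie in the Basic Multilingual Plane.

  2. Do not make assumptions when computing the length of a UTF-8 string

    Cocos2d-x does not natively provide an API for computing the length of a UTF-8 string, and I have seen many creative ways of computing the length of mixed Chinese and English strings: assuming that every character in the string occupies four bytes, calling the native Objective-C or Java String library functions to get the length, and so on. Once you have a length, though, you may need to truncate the string, and then it is hard to know what length to pass. Moreover, mainstream phones now support emoji input, and when a player's text contains many emoji the truncated result can be very unsatisfactory.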
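To make the first point concrete, the length of a UTF-8 sequence can be read off its first byte. The sketch below is only an illustration: `utf8SequenceLength` is a hypothetical helper, it handles only the one- to four-byte forms, and it does not validate the bytes that follow the lead byte.

```cpp
#include <cstddef>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` cannot start a sequence (e.g. a continuation byte).
std::size_t utf8SequenceLength(unsigned char lead)
{
  if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
  return 0;                            // 10xxxxxx or a longer, obsolete form
}
```

The common character 中 (U+4E2D) is encoded as E4 B8 AD, so its lead byte 0xE4 gives 3. The rare character 𠀀 (U+20000), which lies outside the Basic Multilingual Plane, is encoded as F0 A0 80 80, so its lead byte 0xF0 gives 4.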

璁$畻 UTF-8 缂栫爜瀛楃涓查暱搴︾殑瀹炰緥

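#include <cstddef>
#include <iostream>

// Counts the code points in a NUL-terminated UTF-8 string by counting every
// byte that is not a continuation byte (continuation bytes look like 10xxxxxx).
static inline std::size_t utf8Length(const char *s)
{
  std::size_t i = 0, j = 0;
  while (s[i])
  {
    // Same test written in binary: (s[i] & 0b11000000) != 0b10000000
    if ((static_cast<unsigned char>(s[i]) & 0xc0) != 0x80)
      j++;
    i++;
  }
  return j;
}

int main()
{
  // The cast keeps this compiling under C++20, where u8"" literals are char8_t.
  const char *utf8 = reinterpret_cast<const char *>(
      u8"苍天有井独自空，松柏孤岛唯浏枫。武园枯藤空留兰，星落天川遥映瞳。");
  auto size = utf8Length(utf8);
  std::cout << size << std::endl;
  return 0;
}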
Running the program prints:

32
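The same continuation-byte test can also help with the truncation problem described earlier. The following is a minimal sketch and not part of the original program: `utf8Truncate` is a hypothetical helper that keeps at most `maxChars` code points and never cuts inside a multi-byte sequence. It assumes the input is valid UTF-8, and it counts code points rather than user-perceived characters, so an emoji built from several code points can still be split.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest prefix of `s` that contains at most
// `maxChars` code points, cutting only at code point boundaries.
std::string utf8Truncate(const std::string &s, std::size_t maxChars)
{
  std::size_t count = 0;
  std::size_t i = 0;
  for (; i < s.size(); ++i)
  {
    // A byte that is not a continuation byte (10xxxxxx) starts a new code point.
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
    {
      if (count == maxChars)
        break; // i is where the first code point past the limit begins
      ++count;
    }
  }
  return s.substr(0, i);
}
```

Truncating the five-byte string "a中b" to two code points keeps the four bytes of "a中", whereas a plain `substr(0, 2)` would have cut the three-byte character in half.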
