Codepage vs Charset

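An earlier post in this interview-topics series covered how to explain, concisely, the differences among the various char types. This post takes on another frequently asked question: how codepage and charset differ and how they relate. Both concepts show up at work almost every day, yet in my experience very few people can give a clear answer when asked.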

 

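Start with codepage (in Chinese, “内码表” or “代码页”). Wikipedia defines it as follows: a code page is another name for a character encoding, also called an "internal code table", and is a table of the character set of a particular language. (https://zh.wikipedia.org/wiki/%E4%BB%A3%E7%A0%81%E9%A1%B5)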

 

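And how does Wikipedia define character encoding? A character encoding (also "character set code") maps the characters of a character set to objects in some specified set (for example bit patterns, sequences of natural numbers, octets, or electrical pulses), so that text can be stored in a computer and transmitted over a communication network. By convention, "character set" and "character encoding" are treated as synonyms, because the same standard defines both which characters are provided and how those characters are encoded into a series of code units (usually one unit per character). (https://zh.wikipedia.org/wiki/%E5%AD%97%E7%AC%A6%E7%BC%96%E7%A0%81)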

 

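So after all that, the two turn out to be synonyms, like a formal name and a nickname! Code Page 936 is GB2312 (strictly speaking, Windows code page 936 is GBK, a superset of GB2312, although it is commonly labeled gb2312), Code Page 950 is BIG5, and Code Page 65001 corresponds to UTF-8. Is that the punchline? Can the article end here? If the story were that simple, we would be very naive indeed...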
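As a quick check of the name pairs just mentioned, here is a minimal C# sketch that asks .NET which charset name it associates with a code page number. The names CodePageLookup and CharsetOf are made up for this example. It assumes a runtime where CodePagesEncodingProvider is available (.NET Core or .NET 5+); on .NET Framework the legacy code pages are built in, so the registration line would be dropped.

```csharp
using System;
using System.Text;

public static class CodePageLookup
{
    static CodePageLookup()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 936 and 950 must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns the charset name (Encoding.WebName) for a numeric code page.
    public static string CharsetOf(int codePage)
    {
        return Encoding.GetEncoding(codePage).WebName;
    }

    public static void Main()
    {
        foreach (int codePage in new[] { 936, 950, 65001 })
        {
            Console.WriteLine(codePage + " -> " + CharsetOf(codePage));
        }
    }
}
```

The number and the name identify the same Encoding object here, which is the sense in which the two terms are interchangeable.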

 

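Although the two are synonyms in many contexts, in web development their scopes are clearly different. We will use ASP.NET as the technical backdrop to dig further. First create a new WebForm project and add the following code to the default page: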

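using System;
using System.Web.UI;

namespace CodepageTest
{
    public partial class _Default : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // Session.CodePage is the character-set identifier of the current session.
            Response.Write("Codepage is: " + Session.CodePage);
        }
    }
}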


Run the program and look at the rendered page.


Everything looks fine! Checking the response header shows that charset is set to utf-8.

 

Now modify the code by adding Session.CodePage = 950; and run it again. Garbled text appears.


The charset in the response header has changed along with it and now shows big5.


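At this point someone will suggest: in the browser, just click View -> Encoding and choose UTF-8, and the problem is solved, right? OK, as you wish: make that selection and see how the page displays.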


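By now we have in fact discovered that in ASP.NET the codepage takes effect on the server side, while the charset takes effect on the browser side. That is the biggest difference between the two. They also have to match in order to avoid garbled text; changing either side alone does not help.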
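The matching requirement can be reproduced outside a browser. The following is a minimal console sketch, not the ASP.NET pipeline itself: it encodes a string with one code page (standing in for the server) and decodes the bytes with a named charset (standing in for the browser). CharsetMismatchDemo and SendAndRender are hypothetical names, and the provider registration again assumes .NET Core or .NET 5+.

```csharp
using System;
using System.Text;

public static class CharsetMismatchDemo
{
    static CharsetMismatchDemo()
    {
        // Assumption: .NET Core / .NET 5+, where code page 950 needs this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encodes text with the "server" code page, then decodes the resulting
    // bytes with the charset the "browser" believes it received.
    public static string SendAndRender(string text, int serverCodePage, string browserCharset)
    {
        byte[] wire = Encoding.GetEncoding(serverCodePage).GetBytes(text);
        return Encoding.GetEncoding(browserCharset).GetString(wire);
    }

    public static void Main()
    {
        // Two Chinese characters, written as escapes so the source file's own
        // encoding cannot affect the result.
        const string text = "\u4e2d\u6587";

        // Both sides agree on Big5: the text survives the round trip.
        Console.WriteLine(SendAndRender(text, 950, "big5") == text);

        // Server sends Big5 bytes, browser decodes as UTF-8: the text is garbled.
        Console.WriteLine(SendAndRender(text, 950, "utf-8") == text);
    }
}
```

The program prints comparison results rather than the strings themselves, so the console's own output encoding does not interfere with the demonstration.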

 

Finally, here is the table mapping codepage to charset. Memorizing a few of the common mappings should pay off both at work and in interviews.

CodePage  CharSet                  Display Name
37        IBM037                   IBM EBCDIC (US-Canada)
437       IBM437                   OEM United States
500       IBM500                   IBM EBCDIC (International)
708       ASMO-708                 Arabic (ASMO 708)
720       DOS-720                  Arabic (DOS)
737       ibm737                   Greek (DOS)
775       ibm775                   Baltic (DOS)
850       ibm850                   Western European (DOS)
852       ibm852                   Central European (DOS)
855       IBM855                   OEM Cyrillic
857       ibm857                   Turkish (DOS)
858       IBM00858                 OEM Multilingual Latin I
860       IBM860                   Portuguese (DOS)
861       ibm861                   Icelandic (DOS)
862       DOS-862                  Hebrew (DOS)
863       IBM863                   French Canadian (DOS)
864       IBM864                   Arabic (864)
865       IBM865                   Nordic (DOS)
866       cp866                    Cyrillic (DOS)
869       ibm869                   Greek, Modern (DOS)
870       IBM870                   IBM EBCDIC (Multilingual Latin-2)
874       windows-874              Thai (Windows)
875       cp875                    IBM EBCDIC (Greek Modern)
932       shift_jis                Japanese (Shift-JIS)
936       gb2312                   Chinese Simplified (GB2312)
949       ks_c_5601-1987           Korean
950       big5                     Chinese Traditional (Big5)
1026      IBM1026                  IBM EBCDIC (Turkish Latin-5)
1047      IBM01047                 IBM Latin-1
1140      IBM01140                 IBM EBCDIC (US-Canada-Euro)
1141      IBM01141                 IBM EBCDIC (Germany-Euro)
1142      IBM01142                 IBM EBCDIC (Denmark-Norway-Euro)
1143      IBM01143                 IBM EBCDIC (Finland-Sweden-Euro)
1144      IBM01144                 IBM EBCDIC (Italy-Euro)
1145      IBM01145                 IBM EBCDIC (Spain-Euro)
1146      IBM01146                 IBM EBCDIC (UK-Euro)
1147      IBM01147                 IBM EBCDIC (France-Euro)
1148      IBM01148                 IBM EBCDIC (International-Euro)
1149      IBM01149                 IBM EBCDIC (Icelandic-Euro)
1200      utf-16                   Unicode
1201      UnicodeFFFE              Unicode (Big-Endian)
1250      windows-1250             Central European (Windows)
1251      windows-1251             Cyrillic (Windows)
1252      Windows-1252             Western European (Windows)
1253      windows-1253             Greek (Windows)
1254      windows-1254             Turkish (Windows)
1255      windows-1255             Hebrew (Windows)
1256      windows-1256             Arabic (Windows)
1257      windows-1257             Baltic (Windows)
1258      windows-1258             Vietnamese (Windows)
1361      Johab                    Korean (Johab)
10000     macintosh                Western European (Mac)
10001     x-mac-japanese           Japanese (Mac)
10002     x-mac-chinesetrad        Chinese Traditional (Mac)
10003     x-mac-korean             Korean (Mac)
10004     x-mac-arabic             Arabic (Mac)
10005     x-mac-hebrew             Hebrew (Mac)
10006     x-mac-greek              Greek (Mac)
10007     x-mac-cyrillic           Cyrillic (Mac)
10008     x-mac-chinesesimp        Chinese Simplified (Mac)
10010     x-mac-romanian           Romanian (Mac)
10017     x-mac-ukrainian          Ukrainian (Mac)
10021     x-mac-thai               Thai (Mac)
10029     x-mac-ce                 Central European (Mac)
10079     x-mac-icelandic          Icelandic (Mac)
10081     x-mac-turkish            Turkish (Mac)
10082     x-mac-croatian           Croatian (Mac)
12000     utf-32                   Unicode (UTF-32)
12001     utf-32BE                 Unicode (UTF-32 Big-Endian)
20000     x-Chinese-CNS            Chinese Traditional (CNS)
20001     x-cp20001                TCA Taiwan
20002     x-Chinese-Eten           Chinese Traditional (Eten)
20003     x-cp20003                IBM5550 Taiwan
20004     x-cp20004                TeleText Taiwan
20005     x-cp20005                Wang Taiwan
20105     x-IA5                    Western European (IA5)
20106     x-IA5-German             German (IA5)
20107     x-IA5-Swedish            Swedish (IA5)
20108     x-IA5-Norwegian          Norwegian (IA5)
20127     us-ascii                 US-ASCII
20261     x-cp20261                T.61
20269     x-cp20269                ISO-6937
20273     IBM273                   IBM EBCDIC (Germany)
20277     IBM277                   IBM EBCDIC (Denmark-Norway)
20278     IBM278                   IBM EBCDIC (Finland-Sweden)
20280     IBM280                   IBM EBCDIC (Italy)
20284     IBM284                   IBM EBCDIC (Spain)
20285     IBM285                   IBM EBCDIC (UK)
20290     IBM290                   IBM EBCDIC (Japanese katakana)
20297     IBM297                   IBM EBCDIC (France)
20420     IBM420                   IBM EBCDIC (Arabic)
20423     IBM423                   IBM EBCDIC (Greek)
20424     IBM424                   IBM EBCDIC (Hebrew)
20833     x-EBCDIC-KoreanExtended  IBM EBCDIC (Korean Extended)
20838     IBM-Thai                 IBM EBCDIC (Thai)
20866     koi8-r                   Cyrillic (KOI8-R)
20871     IBM871                   IBM EBCDIC (Icelandic)
20880     IBM880                   IBM EBCDIC (Cyrillic Russian)
20905     IBM905                   IBM EBCDIC (Turkish)
20924     IBM00924                 IBM Latin-1
20932     EUC-JP                   Japanese (JIS 0208-1990 and 0212-1990)
20936     x-cp20936                Chinese Simplified (GB2312-80)
20949     x-cp20949                Korean Wansung
21025     cp1025                   IBM EBCDIC (Cyrillic Serbian-Bulgarian)
21866     koi8-u                   Cyrillic (KOI8-U)
28591     iso-8859-1               Western European (ISO)
28592     iso-8859-2               Central European (ISO)
28593     iso-8859-3               Latin 3 (ISO)
28594     iso-8859-4               Baltic (ISO)
28595     iso-8859-5               Cyrillic (ISO)
28596     iso-8859-6               Arabic (ISO)
28597     iso-8859-7               Greek (ISO)
28598     iso-8859-8               Hebrew (ISO-Visual)
28599     iso-8859-9               Turkish (ISO)
28603     iso-8859-13              Estonian (ISO)
28605     iso-8859-15              Latin 9 (ISO)
29001     x-Europa                 Europa
38598     iso-8859-8-i             Hebrew (ISO-Logical)
50220     iso-2022-jp              Japanese (JIS)
50221     csISO2022JP              Japanese (JIS-Allow 1 byte Kana)
50222     iso-2022-jp              Japanese (JIS-Allow 1 byte Kana - SO/SI)
50225     iso-2022-kr              Korean (ISO)
50227     x-cp50227                Chinese Simplified (ISO-2022)
51932     euc-jp                   Japanese (EUC)
51936     EUC-CN                   Chinese Simplified (EUC)
51949     euc-kr                   Korean (EUC)
52936     hz-gb-2312               Chinese Simplified (HZ)
54936     GB18030                  Chinese Simplified (GB18030)
57002     x-iscii-de               ISCII Devanagari
57003     x-iscii-be               ISCII Bengali
57004     x-iscii-ta               ISCII Tamil
57005     x-iscii-te               ISCII Telugu
57006     x-iscii-as               ISCII Assamese
57007     x-iscii-or               ISCII Oriya
57008     x-iscii-ka               ISCII Kannada
57009     x-iscii-ma               ISCII Malayalam
57010     x-iscii-gu               ISCII Gujarati
57011     x-iscii-pa               ISCII Punjabi
65000     utf-7                    Unicode (UTF-7)
65001     utf-8                    Unicode (UTF-8)
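If you would rather regenerate a listing like this than copy it, the sketch below prints the code page, charset name, and display name of every encoding the running .NET runtime reports through Encoding.GetEncodings(). EncodingTable and Rows are names made up for this example. The set of rows depends on the runtime and platform, so the output may be shorter than, or differ in detail from, the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingTable
{
    // Builds one formatted line per encoding known to the current runtime.
    public static List<string> Rows()
    {
        var rows = new List<string>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            // CodePage = numeric identifier, Name = charset name,
            // DisplayName = human-readable description.
            rows.Add(string.Format("{0,-10}{1,-25}{2}",
                info.CodePage, info.Name, info.DisplayName));
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Rows())
        {
            Console.WriteLine(row);
        }
    }
}
```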
