1. Encodings supported by C++

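C++ provides character types whose code units are 1, 2, or 4 bytes wide (a single UTF-8 character can occupy 1 to 4 of the 1-byte units), and with them a whole family of string types: std::string, std::wstring, std::u8string, std::u16string, and std::u32string.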

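Type	Literal form	Character type	Encoding
std::string	"hello world"	char	ANSI (the execution character set)
std::wstring	L"hello world"	wchar_t	Unicode (UTF-16 on Windows, UTF-32 on most Unix-like systems)
std::u8string	u8"hello world"	char8_t (u8 literals were char before C++20)	UTF-8
std::u16string	u"hello world"	char16_t	UTF-16
std::u32string	U"hello world"	char32_t	UTF-32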
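
To make the table concrete, here is a small sketch that stores the same two characters (U+4F60 U+597D, "你好") in three of the string types and reports how many code units each one needs. The struct and function names are made up for this illustration, and the UTF-8 bytes are written as explicit escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: code-unit counts for one piece of text.
struct CodeUnitCounts {
    std::size_t utf8;   // number of char units
    std::size_t utf16;  // number of char16_t units
    std::size_t utf32;  // number of char32_t units
};

// Returns the size() of the same text held in three string types.
// size() counts code units, not user-visible characters.
inline CodeUnitCounts countCodeUnits()
{
    const std::string    s8  = "\xE4\xBD\xA0\xE5\xA5\xBD"; // UTF-8 bytes of U+4F60 U+597D
    const std::u16string s16 = u"\u4F60\u597D";            // one UTF-16 unit per character
    const std::u32string s32 = U"\u4F60\u597D";            // one UTF-32 unit per character
    return { s8.size(), s16.size(), s32.size() };
}
```

The same text is 6 units long as UTF-8 but only 2 units long as UTF-16 or UTF-32, which is why size() alone says little about the number of characters.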

2. Windows operating system encodings

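The Windows API exposes two sets of interfaces, multibyte (ANSI) and Unicode, distinguished by the A and W suffixes on function names. The character set that the multibyte (ANSI) interface handles depends on the system locale: different countries and regions defined their own standards, which produced encodings such as GB2312, GBK, GB18030, Big5, and Shift_JIS. On Windows, these encodings, which use more than one byte to represent a character, are collectively called ANSI encodings. On Simplified Chinese Windows, ANSI means GBK (CP 936); on Traditional Chinese Windows it means Big5 (CP 950); on Japanese Windows it means Shift_JIS (CP 932).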
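
As a rough illustration of what "multibyte" means for GBK, the sketch below counts characters in a GBK byte string using only the byte ranges of the encoding (lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE except 0x7F). The function name is hypothetical, and the check is structural only: it does not verify that a byte pair is actually assigned to a character, and it does not handle the four-byte forms of GB18030.

```cpp
#include <cstddef>
#include <string>

// Counts characters in a GBK-encoded byte string.
// Returns -1 if the bytes do not follow the GBK lead/trail byte ranges.
inline long gbkCharCount(const std::string& bytes)
{
    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {
            i += 1;  // single-byte ASCII character
        } else if (lead >= 0x81 && lead <= 0xFE && i + 1 < bytes.size()) {
            const unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
            if (trail < 0x40 || trail == 0x7F || trail == 0xFF) {
                return -1;  // not a valid trail byte
            }
            i += 2;  // double-byte character
        } else {
            return -1;  // invalid lead byte or truncated pair
        }
        ++count;
    }
    return count;
}
```

For example, the four GBK bytes C4 E3 BA C3 ("你好") count as two characters, while a lone C4 is rejected as a truncated pair.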

3. Character encoding in C/C++ library functions

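On Windows, the C/C++ library functions use the Windows API to access operating system resources, so they follow the same string encoding rules as the operating system. For example, fopen ultimately goes through the Windows file-opening API and interprets its narrow path argument in the ANSI code page, so on Simplified Chinese Windows the path passed to fopen must be GBK-encoded for the file to open.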

4. Character conversion

Converting between GBK, UTF-8, and Unicode (wide strings) is a very common task. The code below does it with ICU:

// string_util.h
#ifndef STRING_UTIL_H_
#define STRING_UTIL_H_
#include <string>

std::string unicode2gbk(const std::wstring& ws);
std::wstring gbk2unicode(const std::string& str);

std::wstring utf82unicode(const std::string& str);
std::string unicode2utf8(const std::wstring& ws);

std::string utf82gbk(const std::string& str);
std::string gbk2utf8(const std::string& str);
#endif

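// string_util.cpp
#include "string_util.h"
// ICU headers required by the calls below.
#include <unicode/utypes.h>   // UErrorCode, U_FAILURE
#include <unicode/ucnv.h>     // ucnv_open, ucnv_fromUChars, ucnv_toUChars, ucnv_close
#include <unicode/ustring.h>  // u_strFromUTF32, u_strToUTF32
#include <string>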

#define  BUFFER_SIZE 8192

#ifdef WIN32_MSVC
#ifdef _DEBUG
#pragma comment(lib, "icuucd.lib")
#pragma comment(lib, "icudtd.lib")
#else
#pragma comment(lib, "icuuc.lib")
#pragma comment(lib, "icudt.lib")
#endif
#endif

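// Converts a wide string to the byte encoding named by converterName.
// Returns an empty string on error, including input too long for BUFFER_SIZE.
std::string conv(const std::wstring& ws, const char *converterName)
{
    UErrorCode status = U_ZERO_ERROR;
#ifdef _MSC_VER
    // On Windows, wchar_t is a 16-bit UTF-16 unit, the same layout as ICU's UChar.
    const UChar* source = reinterpret_cast<const UChar*>(ws.c_str());
    int32_t srcLength = static_cast<int32_t>(ws.length());
#else
    // Assumes a 32-bit wchar_t holding UTF-32, as on Linux and macOS.
    UChar source[BUFFER_SIZE];
    int32_t srcLength = 0;
    u_strFromUTF32(source, BUFFER_SIZE, &srcLength,
                   reinterpret_cast<const UChar32*>(ws.c_str()),
                   static_cast<int32_t>(ws.length()), &status);
    if (U_FAILURE(status))
    {
        return "";
    }
#endif

    UConverter* converter = ucnv_open(converterName, &status);
    if (U_FAILURE(status))
    {
        return "";
    }

    char buffer[BUFFER_SIZE];
    // Keep the returned length: the buffer is not guaranteed to be NUL-terminated.
    int32_t length = ucnv_fromUChars(converter, buffer, BUFFER_SIZE, source, srcLength, &status);
    ucnv_close(converter);
    if (U_FAILURE(status))
    {
        return "";
    }

    return std::string(buffer, length);
}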

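// Converts a byte string in the encoding named by converterName to a wide string.
// Returns an empty string on error, including input too long for BUFFER_SIZE.
std::wstring conv(const std::string& str, const char *converterName)
{
    UErrorCode status = U_ZERO_ERROR;
    UConverter* converter = ucnv_open(converterName, &status);
    if (U_FAILURE(status))
    {
        return L"";
    }

    UChar dest[BUFFER_SIZE];
    // ucnv_toUChars returns the number of UTF-16 units written to dest.
    int32_t destLength = ucnv_toUChars(converter, dest, BUFFER_SIZE, str.c_str(),
                                       static_cast<int32_t>(str.length()), &status);
    ucnv_close(converter);
    if (U_FAILURE(status))
    {
        return L"";
    }

#ifdef _MSC_VER
    // On Windows, wchar_t is a 16-bit UTF-16 unit, the same layout as ICU's UChar.
    return std::wstring(reinterpret_cast<const wchar_t*>(dest), destLength);
#else
    // Assumes a 32-bit wchar_t holding UTF-32, as on Linux and macOS.
    wchar_t dest32[BUFFER_SIZE];
    int32_t dest32Length = 0;
    // The source length is the UTF-16 length returned by ucnv_toUChars.
    u_strToUTF32(reinterpret_cast<UChar32*>(dest32), BUFFER_SIZE, &dest32Length,
                 dest, destLength, &status);
    if (U_FAILURE(status))
    {
        return L"";
    }

    return std::wstring(dest32, dest32Length);
#endif
}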

std::string unicode2gbk(const std::wstring& ws)
{
    return conv(ws, "gb18030");
}

std::wstring gbk2unicode(const std::string& str)
{
    return conv(str, "gb18030");
}

std::wstring utf82unicode(const std::string& str)
{
    return conv(str, "utf-8");
}

std::string unicode2utf8(const std::wstring& ws)
{
    return conv(ws, "utf-8");
}

std::string utf82gbk(const std::string& str)
{
    return unicode2gbk(utf82unicode(str));
}

std::string gbk2utf8(const std::string& str)
{
    return unicode2utf8(gbk2unicode(str));
}
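
The functions above leave the byte-level work to ICU. For intuition about what a Unicode-to-UTF-8 step involves, here is a hand-written sketch in standard C++ only. It is not the ICU implementation and not a replacement for it: the function name is hypothetical, and the sketch does no validation (it does not reject surrogate code points or values above U+10FFFF).

```cpp
#include <string>

// Encodes UTF-32 code points as UTF-8 bytes.
// Illustrative only: input is assumed to contain valid Unicode scalar values.
inline std::string encodeUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Most Chinese characters fall in the 3-byte branch, so U+4F60 U+597D ("你好") becomes the six bytes E4 BD A0 E5 A5 BD. GBK has no such arithmetic rule and needs a lookup table, which is one reason to rely on a library such as ICU.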

5. No garbled output

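Before running the tests below, a note on terminology: the "text file character set" here means the character set the compiler assumes for the string literals in a source file when it reads that file.
Generally, if you create a new project in Visual Studio and add the following code, it runs correctly.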

#include <iostream>
using namespace std;

int main()
{
    cout << "你好 世界!" << endl;
    return 0;
}

Opening the source file in Notepad3 shows that it is encoded as ANSI (CP-936), that is, GBK.

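The source file is GBK and the literal has no prefix, so the bytes stored for "你好 世界!" are GBK. The console also interprets its output as GBK, so everything displays correctly.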

6. Garbled output caused by the source file's character set

In Notepad3, choose File -> Encoding -> Set document as -> UTF-8 and save. The source file is now encoded as UTF-8.

Run the program again and the output is garbled.

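The source file is now UTF-8, but nothing tells the compiler so. Assuming the file was saved without a byte order mark, the Visual Studio compiler still treats it as ANSI (GBK) and copies the literal's bytes into the program unchanged. The program therefore holds "你好 世界!" as UTF-8 bytes, the console interprets them as GBK, and the output is garbled.
The fix is to convert the string to GBK before printing it: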

#include <iostream>
#include "string_util.h"

using namespace std;

int main()
{
    cout << utf82gbk("你好 世界!") << endl;
    return 0;
}
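
When it is unclear which encoding a literal ended up in, looking at the raw bytes usually settles it. The helper below is a minimal sketch with a hypothetical name; it only formats bytes as hexadecimal and does no conversion.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats each byte of a string as two uppercase hex digits, separated by spaces.
inline std::string toHex(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned int value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For "你好", a GBK literal dumps as C4 E3 BA C3, while a UTF-8 literal dumps as E4 BD A0 E5 A5 BD, so printing toHex of the literal shows at a glance which bytes the compiler embedded.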

7. Garbled output caused by the string literal prefix

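First restore the file's encoding with File -> Encoding -> Set document as -> ANSI, then add the u8 prefix to the string literal. Running the program produces the same garbled output as in the previous section. (The code compiles as written only before C++20. From C++20 on, a u8 literal has type const char8_t[] and can neither be streamed to cout nor passed as a std::string directly.)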

#include <iostream>
using namespace std;

int main()
{
    cout << u8"你好 世界!" << endl;
    return 0;
}

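The source file is GBK, and the u8 prefix tells the compiler to store the literal as UTF-8, so the program again holds "你好 世界!" as UTF-8 bytes. The console interprets them as GBK, and the output is garbled in the same way.
The fix is again to convert the string to GBK: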

#include <iostream>
#include "string_util.h"

using namespace std;

int main()
{
    cout << utf82gbk(u8"你好 世界!") << endl;
    return 0;
}

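This problem is in fact very common. For example, when a string arrives over the network and part of it is extracted and stored in a std::string, that string is very likely UTF-8. Low-level code that passes it to standard library functions such as fopen then fails to handle it correctly. On Simplified Chinese Windows, converting the string to GBK first makes it work.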
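
Before converting a string of unknown origin, it can help to check whether it is plausibly UTF-8 at all. The sketch below is a structural check with a hypothetical function name: it verifies lead bytes and continuation bytes only, does not reject every ill-formed sequence (for example some overlong forms and surrogates), and should be treated as a heuristic, since short GBK text can occasionally pass by accident.

```cpp
#include <cstddef>
#include <string>

// Returns true if the bytes follow the lead/continuation structure of UTF-8.
inline bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
        } else {
            return false;  // continuation byte or invalid lead byte
        }
        if (i + extra >= s.size()) {
            return false;  // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;  // continuation bytes must be 10xxxxxx
            }
        }
        i += extra + 1;
    }
    return true;
}
```

A caller could, for instance, run utf82gbk only on strings for which this check succeeds and treat everything else as already being in the ANSI code page.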

8. UTF-8 for both the source file and the string literal prefix

Now convert the file's encoding to UTF-8 and run the following code:

#include <iostream>
using namespace std;

int main()
{
    cout << u8"你好 世界!" << endl;
    return 0;
}

This time the garbled output differs from the output of the examples in the previous two sections.

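What is involved here is double transcoding: the string is transcoded once because of the source file and once by the u8 prefix. The compiler reads the file's UTF-8 bytes as if they were GBK, and the u8 prefix then converts that misread text to UTF-8 a second time. The fix is therefore to undo both steps: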

#include <iostream>
#include "string_util.h"

using namespace std;

int main()
{
    cout << utf82gbk(utf82gbk(u8"你好 世界!")) << endl;
    return 0;
}
