Character encoding conversion on Linux [iconv]

1. The iconv command

Run man iconv for the full details of this command.

[Basic usage]

iconv -f encoding -t encoding inputfile

-f, --from-code: the encoding of the input file

-t, --to-code: the encoding of the output

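For example, suppose a.txt is saved as UTF-8. To convert it to Big5:

$ iconv -f utf-8 -t Big5 a.txt

This prints the converted text to standard output. To store the result in b.txt instead:

$ iconv -f utf-8 -t Big5 a.txt -o b.txt

-o, --output: the output file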

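If you are not sure what name to use for an encoding, run

$ iconv -l

to list all known character sets.

If you have no more convenient tool at hand, you can use Word to view or create the files used above.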

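2. The iconv functions

As with the iconv command, more information is available from man. Note that the function lives in a different manpage section, so use man 3 iconv.

There are three main functions:

1. iconv_open

[Prototype]: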

#include <iconv.h>

iconv_t iconv_open(const char* tocode, const char* fromcode);

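[Description]

This function allocates a conversion descriptor for converting from fromcode to tocode. Note the argument order: the target encoding comes first.

The suffixes //TRANSLIT and //IGNORE may be appended to the tocode string. In glibc, //TRANSLIT approximates characters that cannot be represented in the target encoding, and //IGNORE silently discards them.

On failure the function returns (iconv_t) -1 and sets errno.

A common errno value:

EINVAL The conversion from fromcode to tocode is not supported by the implementation.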
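As a minimal sketch of the check described above, the helper below opens a descriptor and compares the result against (iconv_t) -1 rather than a plain -1. The name open_converter and its return convention are made up for this illustration; only iconv_open itself comes from the library.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper for illustration: opens a descriptor converting
 * `from` to `to`. Returns 1 and stores the descriptor in *out on success,
 * returns 0 on failure. */
static int open_converter(const char *to, const char *from, iconv_t *out)
{
    iconv_t cd = iconv_open(to, from);   /* target encoding comes first */
    if (cd == (iconv_t)-1) {
        if (errno == EINVAL)
            fprintf(stderr, "conversion %s -> %s is not supported\n", from, to);
        else
            perror("iconv_open");
        return 0;
    }
    *out = cd;
    return 1;
}
```

A caller would pass something like open_converter("BIG5", "UTF-8", &cd) and later release the descriptor with iconv_close(cd).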

2. iconv

[Prototype]:

#include <iconv.h>

size_t iconv(iconv_t cd, char** inbuf, size_t *inbytesleft, char** outbuf, size_t *outbytesleft);

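cd: the conversion descriptor

*inbuf: points to the source bytes

*inbytesleft: the number of input bytes still to be read

*outbuf: points to the destination buffer

*outbytesleft: the number of bytes still available in the destination buffer

Careful readers will have noticed that everything here is passed by pointer. That is because these arguments are not just inputs: iconv updates them as it converts.

To avoid misleading anyone, here is the relevant part of the manpage describing how iconv works: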

The argument cd must be a conversion descriptor created using the function iconv_open(3).

The main case is when inbuf is not NULL and *inbuf is not NULL. In this case, the iconv() function converts the multibyte sequence starting at *inbuf to a multibyte sequence starting at *outbuf. At most *inbytesleft bytes, starting at *inbuf, will be read. At most *outbytesleft bytes, starting at *outbuf, will be written.

The iconv() function converts one multibyte character at a time, and for each character conversion it increments *inbuf and decrements *inbytesleft by the number of converted input bytes, it increments *outbuf and decrements *outbytesleft by the number of converted output bytes, and it updates the conversion state contained in cd. If the character encoding of the input is stateful, the iconv() function can also convert a sequence of input bytes to an update to the conversion state without producing any output bytes; such input is called a shift sequence. The conversion can stop for four reasons:

1. An invalid multibyte sequence is encountered in the input. In this case it sets errno to EILSEQ and returns (size_t) -1. *inbuf is left pointing to the
beginning of the invalid multibyte sequence.

2. The input byte sequence has been entirely converted, that is, *inbytesleft has gone down to 0. In this case iconv() returns the number of nonreversible conversions performed during this call.

3. An incomplete multibyte sequence is encountered in the input, and the input byte sequence terminates after it. In this case it sets errno to EINVAL and
returns (size_t) -1. *inbuf is left pointing to the beginning of the incomplete multibyte sequence.

4. The output buffer has no more room for the next converted character. In this case it sets errno to E2BIG and returns (size_t) -1.

A different case is when inbuf is NULL or *inbuf is NULL, but outbuf is not NULL and *outbuf is not NULL. In this case, the iconv() function attempts to set cd's conversion state to the initial state and store a corresponding shift sequence at *outbuf. At most *outbytesleft bytes, starting at *outbuf, will be written. If the output buffer has no more room for this reset sequence, it sets errno to E2BIG and returns (size_t) -1. Otherwise it increments *outbuf and decrements *outbytesleft by the number of bytes written.

A third case is when inbuf is NULL or *inbuf is NULL, and outbuf is NULL or *outbuf is NULL. In this case, the iconv() function sets cd's conversion state to the
initial state.

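If that is hard to follow, here is my own translation and reading of it:

Starting at *inbuf, iconv converts one character at a time. For each character it advances *inbuf and decreases *inbytesleft by the number of input bytes consumed, and it advances *outbuf and decreases *outbytesleft by the number of output bytes produced.

The conversion stops in one of four situations:

1. An invalid byte sequence in the input: errno is set to EILSEQ, the return value is (size_t) -1, and *inbuf points to the start of the invalid sequence.

2. The input has been completely converted: *inbytesleft has dropped to 0, and the return value is the number of nonreversible conversions performed during the call. As I understand it, a conversion is nonreversible when the input character has no exact counterpart in the target character set and a substitute was written instead (for example with //TRANSLIT), so converting back would not restore the original.

3. An incomplete multibyte sequence at the end of the input: errno is set to EINVAL, the return value is (size_t) -1, and *inbuf points to the start of the incomplete character.

For example, a character occupies 4 bytes and I deliberately pass in only the first 3.

4. Not enough room in the output buffer: errno is set to E2BIG and the return value is (size_t) -1.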
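The four cases above can be sketched in code. The example below is only an illustration under some assumptions: it converts UTF-8 to UTF-16LE (chosen because both are widely available), and the enum conv_status and the function utf8_to_utf16le are names invented here, not part of the iconv API. It makes a single iconv() call and maps errno to one of the stop reasons, while reporting how far the pointers moved.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum conv_status { CONV_OK, CONV_ILLEGAL, CONV_INCOMPLETE, CONV_NO_ROOM, CONV_OTHER };

/* Converts `inlen` bytes of UTF-8 at `in` into UTF-16LE in `out`
 * (capacity `outcap` bytes). *consumed receives the number of input bytes
 * used, *written the number of output bytes produced. */
static enum conv_status utf8_to_utf16le(const char *in, size_t inlen,
                                        char *out, size_t outcap,
                                        size_t *consumed, size_t *written)
{
    enum conv_status st = CONV_OK;
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return CONV_OTHER;

    char *inp = (char *)in;   /* iconv() takes char **, but only reads the input */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        switch (errno) {
        case EILSEQ: st = CONV_ILLEGAL;    break;  /* case 1: invalid sequence */
        case EINVAL: st = CONV_INCOMPLETE; break;  /* case 3: truncated input */
        case E2BIG:  st = CONV_NO_ROOM;    break;  /* case 4: output buffer full */
        default:     st = CONV_OTHER;      break;
        }
    }
    /* The counters were decremented by iconv(), so the differences tell us
     * how much was actually processed, even when the call failed. */
    *consumed = inlen - inleft;
    *written = outcap - outleft;
    iconv_close(cd);
    return st;
}
```

Passing only the first 5 bytes of a 6-byte UTF-8 string such as "你好" should yield CONV_INCOMPLETE with the first character already converted, which matches the "deliberately pass in only part of a character" experiment described above.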

If the return value is still unclear, read this passage from the manpage and judge for yourself:

The iconv() function returns the number of characters converted in a nonreversible way during this call; reversible conversions are not counted. In case of
error, it sets errno and returns (size_t) -1.
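Because E2BIG only means "the output buffer is full", it is usually handled by draining the buffer and calling iconv() again, and the NULL-input form described in the manpage is called once at the end to emit any final shift sequence. The sketch below shows one way this might look. The function convert_all, the 8-byte scratch buffer and the caller-supplied sink are assumptions made for this illustration; a real program would more likely write each chunk to a file.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts all of `in` using descriptor `cd`, through a
 * deliberately tiny scratch buffer so that the E2BIG path is exercised.
 * Converted bytes are appended to `sink`. Returns the number of bytes stored,
 * or (size_t)-1 on invalid/incomplete input or if the sink is too small. */
static size_t convert_all(iconv_t cd, const char *in, size_t inlen,
                          char *sink, size_t sinkcap)
{
    char scratch[8];
    char *inp = (char *)in;       /* iconv() only reads the input bytes */
    size_t inleft = inlen, total = 0;
    int flushing = 0;             /* becomes 1 once all input is consumed */

    for (;;) {
        char *outp = scratch;
        size_t outleft = sizeof scratch;
        size_t rc = flushing
            ? iconv(cd, NULL, NULL, &outp, &outleft)      /* reset + shift sequence */
            : iconv(cd, &inp, &inleft, &outp, &outleft);  /* normal conversion */
        size_t produced = sizeof scratch - outleft;

        if (rc == (size_t)-1 && errno != E2BIG)
            return (size_t)-1;                 /* EILSEQ, EINVAL or other error */
        if (rc == (size_t)-1 && produced == 0)
            return (size_t)-1;                 /* scratch too small to make progress */
        if (produced > sinkcap - total)
            return (size_t)-1;                 /* caller's sink is full */
        memcpy(sink + total, scratch, produced);
        total += produced;

        if (rc == (size_t)-1)
            continue;                          /* E2BIG: scratch drained, call again */
        if (flushing)
            break;                             /* final reset call succeeded */
        flushing = 1;                          /* input fully converted */
    }
    return total;
}
```

For a stateless target such as UTF-16LE the final reset call writes nothing; it matters for stateful encodings, where the output must end in the initial shift state.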

3.iconv_close

[Prototype]:

#include <iconv.h>

int iconv_close(iconv_t cd);

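Closes the conversion descriptor. Returns 0 on success; on failure it sets errno and returns -1.

With all that said, here is an example. The source file is saved as UTF-8, and the resulting test.txt can be viewed in Word using the BIG5 encoding.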

Code Snippet

#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* strlen, memset */
#include <iconv.h>
#include <errno.h>
int main(void)
{
    /* This source file must be saved as UTF-8 so that the literal is UTF-8. */
    char *src = "你好一二三四五六七八九十";
    size_t srcLen = strlen(src);    /* iconv expects size_t, not unsigned int */
    size_t destLen = 256;
    size_t iconvLen;
    char *dest = malloc(destLen);
    char *obj = dest;               /* remember the start: iconv advances dest */
    if (dest == NULL)
    {
        perror("malloc");
        return 1;
    }
    memset(dest, 0, destLen);
    /* tocode comes first, then fromcode */
    iconv_t convId = iconv_open("BIG5", "UTF-8");
    if (convId == (iconv_t)-1)
    {
        perror("iconv_open");
        free(obj);
        return 1;
    }
    printf("before, src: %p, dest: %p, srcLen: %zu, destLen: %zu\n",
           (void *)src, (void *)dest, srcLen, destLen);
    iconvLen = iconv(convId, &src, &srcLen, &dest, &destLen);
    printf("after, src: %p, dest: %p, srcLen: %zu, destLen: %zu\n",
           (void *)src, (void *)dest, srcLen, destLen);
    if (iconvLen == (size_t)-1)
    {
        perror("iconv");
    }
    else
    {
        printf("iconvLen: %zu\n", iconvLen);
    }
    size_t convLen = 256 - destLen; /* bytes actually written to the output */
    FILE *f = fopen("test.txt", "wb");
    if (f == NULL)
    {
        perror("fopen");
    }
    else
    {
        if (fwrite(obj, 1, convLen, f) != convLen)
        {
            printf("fwrite error!\n");
        }
        fclose(f);
    }
    iconv_close(convId);
    free(obj);
    return 0;
}

If you have any questions, feel free to leave a comment.
