[原创]PHP中GBK和UTF8编码处理


一、编码范围

1. GBK (GB2312/GB18030)
\x00-\xff GBK双字节编码范围
\x20-\x7f ASCII
\xa1-\xff 中文
\x80-\xff 中文

2. UTF-8 (Unicode)
\u4e00-\u9fa5 (中文)
\x3130-\x318F (韩文
\xAC00-\xD7A3 (韩文)
\u0800-\u4e00 (日文)
ps: 韩文是大于[\u9fa5]的字符


正则例子:
preg_replace("/([\x80-\xff])/","",$str);
preg_replace("/([u4e00-u9fa5])/","",$str);

二、代码例子


//判断内容里有没有中文-GBK (PHP)
function check_is_chinese ( $ s ) {
return preg_match ( '/[\x80-\xff]./' , $ s ) ;
}

//获取字符串长度-GBK (PHP)
function gb_strlen ( $ str ) {
$ count = 0 ;
for ( $ i =0 ; $ i < strlen ( $ str ) ; $ i + + ) {
$ s = substr ( $ str , $ i , 1 ) ;
if ( preg_match ( "/[\x80-\xff]/" , $ s ) ) + + $ i ;
+ + $ count ;
}
return $ count ;
}

//截取字符串字串-GBK (PHP)
function gb_substr ( $ str , $ len ) {
$ count = 0 ;
for ( $ i =0 ; $ i < strlen ( $ str ) ; $ i + + ) {
if ( $ count = = $ len ) break ;
if ( preg_match ( "/[\x80-\xff]/" , substr ( $ str , $ i , 1 ) ) ) + + $ i ;
+ + $ count ;
}
return substr ( $ str , 0 , $ i ) ;
}

//统计字符串长度-UTF8 (PHP)
function utf8_strlen ( $ str ) {
$ count = 0 ;
for ( $ i = 0 ; $ i < strlen ( $ str ) ; $ i + + ) {
$ value = ord ( $ str [ $ i ] ) ;
if ( $ value > 127 ) {
$ count + + ;
if ( $ value > = 192 & & $ value < = 223 ) $ i + + ;
elseif ( $ value > = 224 & & $ value < = 239 ) $ i = $ i + 2 ;
elseif ( $ value > = 240 & & $ value < = 247 ) $ i = $ i + 3 ;
else die ( 'Not a UTF-8 compatible string' ) ;
}
$ count + + ;
}
return $ count ;
}


//截取字符串-UTF8(PHP)
function utf8_substr ( $ str , $ position , $ length ) {
$ start_position = strlen ( $ str ) ;
$ start_byte = 0 ;
$ end_position = strlen ( $ str ) ;
$ count = 0 ;
for ( $ i = 0 ; $ i < strlen ( $ str ) ; $ i + + ) {
if ( $ count > = $ position & & $ start_position > $ i ) {
$ start_position = $ i ;
$ start_byte = $ count ;
}
if ( ( $ count - $ start_byte ) > = $ length ) {
$ end_position = $ i ;
break ;
}
$ value = ord ( $ str [ $ i ] ) ;
if ( $ value > 127 ) {
$ count + + ;
if ( $ value > = 192 & & $ value < = 223 ) $ i + + ;
elseif ( $ value > = 224 & & $ value < = 239 ) $ i = $ i + 2 ;
elseif ( $ value > = 240 & & $ value < = 247 ) $ i = $ i + 3 ;
else die ( 'Not a UTF-8 compatible string' ) ;
}
$ count + + ;

}
return ( substr ( $ str , $ start_position , $ end_position - $ start_position ) ) ;
}


//字符串长度统计-UTF8 [中文3个字节,俄文、韩文占2个字节,字母占1个字节] (Ruby)
def utf8_string_length (str )
temp = CGI : :unescape (str )
i = 0 ;
j = 0 ;
temp .length .times { |t |
if temp [t ] < 127
i + = 1
elseif temp [t ] > = 127 and temp [t ] < 224
j + = 1
if 0 = = (j % 2 )
i + = 2
j = 0
end
else
j + = 1
if 0 = = (j % 3 )
i + =2
j = 0
end
end
}
return i
}

//判断是否是有韩文-UTF-8 (JavaScript)
function checkKoreaChar (str ) {
for (i =0 ; i <str .length ; i + + ) {
if ( ( (str .charCodeAt (i ) > 0x3130 & & str .charCodeAt (i ) < 0x318F ) | | (str .charCodeAt (i ) > = 0xAC00 & & str .charCodeAt (i ) < = 0xD7A3 ) ) ) {
return true ;
}
}
return false ;
}

//判断是否有中文字符-GBK (JavaScript)
function check_chinese_char(s){
return (s.length != s.replace(/[^\x00-\xff]/g,"**").length);
}

三、参考文档

http://www.unicode.org/
http://examples.oreilly.com/cjkvinfo/doc/cjk.inf
http://www.ansell-uebersetzungen.com/gbuni.html
http://www.haiyan.com/steelk/navigator/ref/gbk/gbindex.htm
http://baike.baidu.com/view/40801.htm
http://www.chedong.com/tech/hello_unicode.html

你可能感兴趣的:(JavaScript,PHP,cgi,J#,Ruby)