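1. Introduction
The goal of this study is to understand how Java handles String in different situations, so that garbled-String (mojibake) problems can be solved more effectively.
2. Getting the byte encoding of a String in Java
Code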
package com.siyuan.jdk.test;

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class StringGetBytes {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "I AM 中国人";
        System.out.println("str = " + str);
        // No-argument getBytes(): the charset is chosen by the JVM
        System.out.println("Default byte codes of str : " + Arrays.toString(str.getBytes()));
        // Explicit charsets: the result does not depend on JVM settings
        System.out.println("GBK codes of str : " + Arrays.toString(str.getBytes("GBK")));
        System.out.println("UTF-8 codes of str : " + Arrays.toString(str.getBytes("UTF-8")));
        System.out.println("UTF-16 codes of str : " + Arrays.toString(str.getBytes("UTF-16")));
    }
}
Output
str = I AM 中国人
Default byte codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
GBK codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
UTF-8 codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
UTF-16 codes of str : [-2, -1, 0, 73, 0, 32, 0, 65, 0, 77, 0, 32, 78, 45, 86, -3, 78, -70]
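The negative numbers in the output above appear because Java's byte type is signed (-128 to 127), so any byte value of 0x80 or higher prints as a negative number. The following is a minimal sketch that prints the same kind of bytes as unsigned hex, which is easier to compare against encoding tables. The class name ByteHexDemo and the helper toHex are made up for this illustration and are not part of the original program.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class ByteHexDemo {

    // Formats each byte as two unsigned hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF widens the signed byte to an int in the range 0..255
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4E2D is the first Chinese character of the test string; it is written
        // as an escape so the result does not depend on the source file encoding.
        String s = "\u4E2D";
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
    }
}
```

For example, the UTF-8 bytes -28, -72, -83 shown in the output are 0xE4, 0xB8, 0xAD.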
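Questions
1) The no-argument getBytes() returns GBK-encoded bytes, not bytes in UTF-16, the Unicode encoding that Java uses for char.
Tracing String.getBytes() shows that the bytes are encoded with the JVM's default charset, Charset.defaultCharset(), not with UTF-16.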
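This behavior can be checked without reading the JDK source. The sketch below compares the no-argument getBytes() with an explicit encode using Charset.defaultCharset(), and prints the settings involved. The class name DefaultCharsetDemo and the method usesDefaultCharset are hypothetical names chosen for the example; the printed charset depends on the machine and JVM version.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class DefaultCharsetDemo {

    // Returns true when getBytes() produces the same bytes as an explicit
    // encode with the JVM's default charset.
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Same text as the test string, with the Chinese characters escaped so the
        // result does not depend on the source file encoding.
        String s = "I AM \u4E2D\u56FD\u4EBA";
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("same bytes      = " + usesDefaultCharset(s));
    }
}
```

When the result must not depend on the machine, passing the charset explicitly, for example getBytes(StandardCharsets.UTF_8), avoids the default altogether.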
Relevant JDK source:
String
    public byte[] getBytes() {
        return StringCoding.encode(value, offset, count);
    }
StringCoding
    static byte[] encode(char[] ca, int off, int len) {
        String csn = Charset.defaultCharset().name();
        try {
            return encode(csn, ca, off, len);
        } catch (UnsupportedEncodingException x) {
            warnUnsupportedCharset(csn);
        }
        try {
            return encode("ISO-8859-1", ca, off, len);
        } catch (UnsupportedEncodingException x) {
            // If this code is hit during VM initialization, MessageUtils is
            // the only way we will be able to get any kind of error message.
            MessageUtils.err("ISO-8859-1 charset not available: "
                             + x.toString());
            // If we can not find ISO-8859-1 (a required encoding) then things
            // are seriously wrong with the installation.
            System.exit(1);
            return null;
        }
    }
Charset
    public static Charset defaultCharset() {
        if (defaultCharset == null) {
            synchronized (Charset.class) {
                java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
                String csn = (String)AccessController.doPrivileged(pa);
                Charset cs = lookup(csn);
                if (cs != null)
                    defaultCharset = cs;
                else
                    defaultCharset = forName("UTF-8");
            }
        }
        return defaultCharset;
    }
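As the source shows, the default charset is read from the file.encoding system property, so it can be changed with the JVM argument -Dfile.encoding="UTF-8".
Output after changing the run arguments in Eclipse: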
str = I AM 涓浗浜?
Default byte codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
GBK codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
UTF-8 codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
UTF-16 codes of str : [-2, -1, 0, 73, 0, 32, 0, 65, 0, 77, 0, 32, 78, 45, 86, -3, 78, -70]
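Problem
Printing str produces garbled text, although the byte values are correct.
Cause: the Console encoding is not UTF-8, so the UTF-8 bytes written by System.out are decoded with a different charset.
Output after changing the Console encoding in Eclipse: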
str = I AM 中国人
Default byte codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
GBK codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
UTF-8 codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
UTF-16 codes of str : [-2, -1, 0, 73, 0, 32, 0, 65, 0, 77, 0, 32, 78, 45, 86, -3, 78, -70]
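The garbling comes from a mismatch between the charset the program writes with and the charset the console reads with. Besides changing the console setting, the writing side can be made explicit by wrapping System.out in a PrintStream with a named charset. The following is a rough sketch under the assumption that the console decodes UTF-8; the class name ConsoleEncodingDemo and the helper printedBytes are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class for illustration only.
class ConsoleEncodingDemo {

    // Returns the bytes a PrintStream with the given charset name writes for the text.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Chinese characters are escaped so the source file encoding does not matter.
        String str = "I AM \u4E2D\u56FD\u4EBA";
        // Assumption: the console decodes UTF-8. Replace the charset name with
        // whatever the real console uses (for example "GBK").
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println("str = " + str);
    }
}
```

The text displays correctly only when the charset name given to the PrintStream matches the console's actual encoding.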
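2) The UTF-16 output starts with two extra bytes, -2 and -1.
Processors differ in how they order the two bytes of a 16-bit unit: big-endian (high byte first, low byte second) or little-endian (low byte first, high byte second). When a string is encoded, it therefore has to be stated which byte order is used, so two leading bytes store the byte order mark (BYTE_ORDER_MARK, U+FEFF). The bytes -2, -1 are 0xFE, 0xFF as signed values, which marks the data as big-endian.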
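The effect of the byte order mark can be seen by encoding the same character with the three UTF-16 variants in StandardCharsets. This is a small sketch; the class name Utf16BomDemo and the helper startsWithBom are hypothetical names used only here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class Utf16BomDemo {

    // True when the array begins with the big-endian byte order mark 0xFE 0xFF.
    static boolean startsWithBom(byte[] bytes) {
        return bytes.length >= 2 && bytes[0] == (byte) 0xFE && bytes[1] == (byte) 0xFF;
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // U+4E2D, the first Chinese character of the test string

        // "UTF-16" writes a BOM and then big-endian code units.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 78, 45]
        // The explicit variants fix the byte order and write no BOM.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [78, 45]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE))); // [45, 78]

        // When decoding with "UTF-16", a leading BOM selects the byte order.
        byte[] littleEndianWithBom = {-1, -2, 45, 78};
        System.out.println(new String(littleEndianWithBom, StandardCharsets.UTF_16).equals(s)); // true
    }
}
```

If the receiving side already knows the byte order, UTF-16BE or UTF-16LE avoids the two extra bytes.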