String Encoding (Part 1): About String.getBytes()

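1. Introduction

The goal of this study is to work out how Java handles a String in different situations, so that garbled-text (mojibake) problems with String can be diagnosed and fixed more reliably.

2. Getting the encoded bytes of a String in Java

Code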

package com.siyuan.jdk.test;

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class StringGetBytes {
	
	public static void main(String[] args) throws UnsupportedEncodingException {
		String str = "I AM 中国人";
		System.out.println("str = " + str);
		// No-argument form: encodes with the JVM's default charset
		System.out.println("Default byte codes of str : " + Arrays.toString(str.getBytes()));
		// Explicit charset names: the result does not depend on JVM settings
		System.out.println("GBK codes of str : " + Arrays.toString(str.getBytes("GBK")));
		System.out.println("UTF-8 codes of str : " + Arrays.toString(str.getBytes("UTF-8")));
		System.out.println("UTF-16 codes of str : " + Arrays.toString(str.getBytes("UTF-16")));
	}
	
}

Output

str = I AM 中国人
Default byte codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
GBK codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
UTF-8 codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
UTF-16 codes of str : [-2, -1, 0, 73, 0, 32, 0, 65, 0, 77, 0, 32, 78, 45, 86, -3, 78, -70]

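Questions

1) The no-argument getBytes() returns GBK-encoded bytes rather than UTF-16, the Unicode encoding Java uses for char

Tracing String.getBytes() shows that the bytes are encoded with the JVM's default charset, Charset.defaultCharset(), not with UTF-16. On this machine the default charset is GBK, which is why the "Default" and "GBK" output lines are identical.

Code snippets (from the JDK source):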

String

    public byte[] getBytes() {
	return StringCoding.encode(value, offset, count);
    }

 StringCoding

    static byte[] encode(char[] ca, int off, int len) {
	String csn = Charset.defaultCharset().name();
	try {
	    return encode(csn, ca, off, len);
	} catch (UnsupportedEncodingException x) {
	    warnUnsupportedCharset(csn);
	}
	try {
	    return encode("ISO-8859-1", ca, off, len);
	} catch (UnsupportedEncodingException x) {
	    // If this code is hit during VM initialization, MessageUtils is
	    // the only way we will be able to get any kind of error message.
	    MessageUtils.err("ISO-8859-1 charset not available: "
			     + x.toString());
	    // If we can not find ISO-8859-1 (a required encoding) then things
	    // are seriously wrong with the installation.
	    System.exit(1);
	    return null;
	}
    }

 Charset

    public static Charset defaultCharset() {
        if (defaultCharset == null) {
	    synchronized (Charset.class) {
		java.security.PrivilegedAction pa =
		    new GetPropertyAction("file.encoding");
		String csn = (String)AccessController.doPrivileged(pa);
		Charset cs = lookup(csn);
		if (cs != null)
		    defaultCharset = cs;
                else 
		    defaultCharset = forName("UTF-8");
            }
	}
	return defaultCharset;
    }
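
The call chain above suggests that the no-argument getBytes() should always agree with encoding through Charset.defaultCharset(). The following is a minimal sketch to check that on your own JVM; the class name DefaultCharsetCheck and the helper usesDefaultCharset are made up for this illustration, and the string literal uses Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the JDK.
class DefaultCharsetCheck {

    // True if the no-argument getBytes() matches encoding with the JVM default charset.
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String str = "I AM \u4E2D\u56FD\u4EBA"; // "I AM 中国人"
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("getBytes() uses default charset: " + usesDefaultCharset(str));
        // Passing the charset explicitly makes the result independent of JVM settings.
        System.out.println("UTF-8 bytes     = " + Arrays.toString(str.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In code that must behave the same on every machine, passing the charset explicitly, as in the last line, is usually the safer choice.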

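The default charset can be changed with the VM argument -Dfile.encoding="UTF-8". Because defaultCharset() caches its result, the property has to be set when the JVM starts.
Changing the run arguments in Eclipse: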


[Screenshot 1: Eclipse run configuration]
Output

str = I AM 涓浗浜?
Default byte codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
GBK codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
UTF-8 codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
UTF-16 codes of str : [-2, -1, 0, 73, 0, 32, 0, 65, 0, 77, 0, 32, 78, 45, 86, -3, 78, -70]

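Problem

Printing str produces garbled text, although the byte arrays are correct.
Cause: the console's encoding is not UTF-8, so the UTF-8 bytes written by System.out are decoded with the wrong charset.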
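
The mismatch can be reproduced without a console by encoding a string with one charset and decoding the bytes with another. This is only a sketch: the class ConsoleMismatchDemo and the method reinterpret are invented for illustration, and treating the console as GBK is an assumption based on the default charset seen earlier. It also assumes the runtime ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class ConsoleMismatchDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // mimicking a program and a console that disagree about the encoding.
    static String reinterpret(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        String str = "I AM \u4E2D\u56FD\u4EBA"; // "I AM 中国人"
        // Assumption: GBK is available in this runtime (it is an extended charset).
        Charset gbk = Charset.forName("GBK");
        System.out.println("UTF-8 read as GBK  : " + reinterpret(str, StandardCharsets.UTF_8, gbk));
        System.out.println("UTF-8 read as UTF-8: " + reinterpret(str, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
    }
}
```

The ASCII part "I AM " survives in both cases because GBK and UTF-8 encode ASCII identically; only the Chinese characters are damaged.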

Changing the console encoding in Eclipse:


[Screenshot 2: Eclipse console encoding setting]
Output:

str = I AM 中国人
Default byte codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
GBK codes of str : [73, 32, 65, 77, 32, -42, -48, -71, -6, -56, -53]
UTF-8 codes of str : [73, 32, 65, 77, 32, -28, -72, -83, -27, -101, -67, -28, -70, -70]
UTF-16 codes of str : [-2, -1, 0, 73, 0, 32, 0, 65, 0, 77, 0, 32, 78, 45, 86, -3, 78, -70]
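
Besides changing the console setting, the program side can also be pinned to a known encoding. The sketch below is one possible approach, not the fix used above: it wraps an output stream in a PrintStream with an explicit encoding name. ExplicitEncodingOutput and printedBytes are hypothetical names introduced for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK.
class ExplicitEncodingOutput {

    // Returns the bytes that a PrintStream with the given encoding writes for text.
    static byte[] printedBytes(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "I AM \u4E2D\u56FD\u4EBA"; // "I AM 中国人"
        System.out.println(Arrays.toString(printedBytes(str, "UTF-8")));
        // Wrapping System.out the same way fixes the bytes sent to the console,
        // whatever file.encoding is; the console must still decode them as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("str = " + str);
    }
}
```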

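2) The UTF-16 output starts with two extra bytes, -2 and -1

Processors differ in how they order the two bytes of a 16-bit value: big-endian (high-order byte first) or little-endian (low-order byte first). An encoded string therefore has to state which byte order it uses, so the output begins with two bytes that hold the byte order mark (BOM). Here -2, -1 are the signed-byte form of 0xFE 0xFF, that is, U+FEFF written in big-endian order.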
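
To see the byte order mark in isolation, the three UTF-16 variants in java.nio.charset.StandardCharsets can be compared on a single character. This is a small sketch; Utf16ByteOrderDemo and hasBigEndianBom are names made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK.
class Utf16ByteOrderDemo {

    // True if the array starts with 0xFE 0xFF, the big-endian byte order mark.
    static boolean hasBigEndianBom(byte[] bytes) {
        return bytes.length >= 2 && bytes[0] == (byte) 0xFE && bytes[1] == (byte) 0xFF;
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // "中", a single code unit 0x4E2D
        // "UTF-16" writes a big-endian BOM first: [-2, -1, 78, 45]
        System.out.println("UTF-16   : " + Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));
        // "UTF-16BE" fixes the byte order and writes no BOM: [78, 45]
        System.out.println("UTF-16BE : " + Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));
        // "UTF-16LE" swaps the two bytes and writes no BOM: [45, 78]
        System.out.println("UTF-16LE : " + Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

If the extra two bytes are unwanted, for example when a length in bytes has to match 2 × the number of chars, UTF-16BE or UTF-16LE can be requested explicitly.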
