UTF-16BE、UTF-16LE、UTF-16 三者之间的区别

简介

实际项目开发中,我们有时候可能需要将字符串转换成字节数组,而转化字节数组跟编码格式有关,不同的编码格式转化的字节数组不一样。下面列举了java支持的几种编码格式:

  • US-ASCII Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
  • ISO-8859-1 ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
  • UTF-8 Eight-bit UCS Transformation Format
  • UTF-16BE Sixteen-bit UCS Transformation Format, big-endian byte order
  • UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
  • UTF-16 Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

本文重点讲解的是 UTF-16 编码格式字节数组的转化。UTF-16 顾名思义,就是用两个字节表示一个字符。那么用两个字节表示必然存在字节序的问题,即大端小端的问题。下面就来讲讲 UTF-16BEUTF-16LEUTF-16 三者之间的区别吧。

UTF-16BE,其后缀是 BEbig-endian,大端的意思。大端就是将高位的字节放在低地址表示。
UTF-16LE,其后缀是 LElittle-endian,小端的意思。小端就是将高位的字节放在高地址表示。
UTF-16,没有指定后缀,即不知道其是大小端,所以其开始的两个字节表示该字节数组是大端还是小端。即FE FF表示大端,FF FE表示小端。

代码测试

TestUTF_16.java

public static void main(String[] args) {
    String test = "代码";

    try {
        //UTF-16BE编码格式 大端形式编码
        byte[] bytesUTF16BE = test.getBytes("UTF-16BE");
        printHex("UTF-16BE", bytesUTF16BE);

        //UTF-16LE编码格式 小端形式编码
        byte[] bytesUTF16LE = test.getBytes("UTF-16LE");
        printHex("UTF-16LE", bytesUTF16LE);

        //UTF-16
        byte[] bytesUTF16 = test.getBytes("UTF-16");
        printHex("UTF-16", bytesUTF16);
        
        //大端编码格式的字节数组转化
        String strUTF16BE = new String(bytesUTF16BE, "UTF-16BE");
        String strUTF16LE = new String(bytesUTF16BE, "UTF-16LE");
        String strUTF16 = new String(bytesUTF16BE, "UTF-16");
        System.out.println("strUTF16BE:"+strUTF16BE);
        System.out.println("strUTF16LE:"+strUTF16LE);
        System.out.println("strUTF16:"+strUTF16);
        
        //小端编码格式的字节数组转化
        strUTF16BE = new String(bytesUTF16LE, "UTF-16BE");
        strUTF16LE = new String(bytesUTF16LE, "UTF-16LE");
        strUTF16 = new String(bytesUTF16LE, "UTF-16");
        System.out.println("strUTF16BE:"+strUTF16BE);
        System.out.println("strUTF16LE:"+strUTF16LE);
        System.out.println("strUTF16:"+strUTF16);
        
        //自带大小端编码格式的字节数组转化
        strUTF16BE = new String(bytesUTF16, "UTF-16BE");
        strUTF16LE = new String(bytesUTF16, "UTF-16LE");
        strUTF16 = new String(bytesUTF16, "UTF-16");
        System.out.println("strUTF16BE:"+strUTF16BE);
        System.out.println("strUTF16LE:"+strUTF16LE);
        System.out.println("strUTF16:"+strUTF16);
        
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}

private static void printHex(String type, byte[] data) {
    StringBuilder builder = new StringBuilder();
    builder.append("type:").append(type).append(":   ");
    int temp = 0;
    for (int i = 0; i < data.length; i++) {
        temp = data[i] & 0xFF;
        builder.append("0x").append(Integer.toHexString(temp).toUpperCase()).append(" ");
    }

    System.out.println(builder.toString());
}

UTF-16BE、UTF-16LE、UTF-16 三者之间的区别_第1张图片

从上面的测试结果可以看出来,指定大小端编码格式的,转化为字节数组时不会带FE FF或者FF FE。带有FE FF或者FF FE的字节数组可以转化成指定大小端编码格式的字符串。

源码

看看String源码中对getBytes的实现

/**
 * Returns a new byte array containing the characters of this string encoded using the named charset.
 *
 * 

The behavior when this string cannot be represented in the named charset * is unspecified. Use {@link java.nio.charset.CharsetEncoder} for more control. * * @throws UnsupportedEncodingException if the charset is not supported */ public byte[] getBytes(String charsetName) throws UnsupportedEncodingException { return getBytes(Charset.forNameUEE(charsetName)); } /** * Returns a new byte array containing the characters of this string encoded using the * given charset. * *

The behavior when this string cannot be represented in the given charset * is to replace malformed input and unmappable characters with the charset's default * replacement byte array. Use {@link java.nio.charset.CharsetEncoder} for more control. * * @since 1.6 */ public byte[] getBytes(Charset charset) { String canonicalCharsetName = charset.name(); if (canonicalCharsetName.equals("UTF-8")) { return Charsets.toUtf8Bytes(value, offset, count); } else if (canonicalCharsetName.equals("ISO-8859-1")) { return Charsets.toIsoLatin1Bytes(value, offset, count); } else if (canonicalCharsetName.equals("US-ASCII")) { return Charsets.toAsciiBytes(value, offset, count); } else if (canonicalCharsetName.equals("UTF-16BE")) { return Charsets.toBigEndianUtf16Bytes(value, offset, count); } else { CharBuffer chars = CharBuffer.wrap(this.value, this.offset, this.count); ByteBuffer buffer = charset.encode(chars.asReadOnlyBuffer()); byte[] bytes = new byte[buffer.limit()]; buffer.get(bytes); return bytes; } }

你可能感兴趣的:(android之路)