Character Set
A character set is a mapping between characters and numbers (ASCII codes, code points).
ASCII
- American Standard Code for Information Interchange
- Maps Latin characters to ASCII codes (e.g. A -> 0x41, a -> 0x61)
- 0-31 and 127 are control characters
- 32-126 are printable characters
- Range: 0x00-0x7F (7 bits)
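The ranges above can be checked with a few lines of Java. This is only a minimal sketch: the class name `AsciiDemo` and the helper `classify` are made up for illustration and are not part of any library.

```java
final class AsciiDemo {
    /** Classifies an ASCII code as "control" or "printable" (hypothetical helper). */
    static String classify(int code) {
        if (code < 0 || code > 0x7F) {
            throw new IllegalArgumentException("not an ASCII code: " + code);
        }
        // 0-31 and 127 are control characters, 32-126 are printable.
        return (code <= 31 || code == 127) ? "control" : "printable";
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'a', ' ', '\n'}) {
            // A Java char holding an ASCII character has the ASCII code as its numeric value.
            System.out.printf("0x%02X %s%n", (int) c, classify(c));
        }
    }
}
```

Casting a `char` to `int` is enough here because ASCII codes and the first 128 Unicode code points are the same numbers.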
GB
Chinese national standard character sets, each one extending its predecessor: GB2312 -> GBK -> GB18030
Unicode
- Maps every character in every language to a unique number (a code point)
- Range: 0x0000-0x10FFFF
- 17 planes, 65536 code points in each plane
- Example: the common Chinese characters are in the BMP, in the CJK Unified Ideographs block 0x4E00-0x9FFF
Plane | Name | Range |
---|---|---|
Plane 0 | Basic Multilingual Plane (BMP) | 0x0000-0xFFFF |
Plane 1 | Supplementary Multilingual Plane | 0x10000-0x1FFFF |
Plane 2 | Supplementary Ideographic Plane | 0x20000-0x2FFFF |
Plane 3 | Tertiary Ideographic Plane | 0x30000-0x3FFFF |
Plane 4-13 | unassigned | 0x40000-0xDFFFF |
Plane 14 | Supplementary Special-purpose Plane | 0xE0000-0xEFFFF |
Plane 15 | Supplementary Private Use Area-A | 0xF0000-0xFFFFF |
Plane 16 | Supplementary Private Use Area-B | 0x100000-0x10FFFF |
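Since every plane holds 65536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A small sketch of that arithmetic follows; `PlaneDemo` and `planeOf` are illustrative names, while `Character.isValidCodePoint` is the standard JDK check for the 0x0000-0x10FFFF range.

```java
final class PlaneDemo {
    /** Returns the plane number (0-16) of a Unicode code point (hypothetical helper). */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Each plane spans 0x10000 code points, so the bits above bit 15 are the plane.
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x4E00, 0x20021, 0xE0001, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X is in plane %d%n", cp, planeOf(cp));
        }
    }
}
```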
Character Encoding
A character set translates characters into numbers; a character encoding translates those numbers into binary.
UTF-8
- Unicode Transformation Format, 8-bit
- Variable-width (1-4 bytes) character encoding
- Backward compatible with ASCII
- Chinese characters in the BMP fall in 0x0800-0xFFFF and are encoded as 3 bytes; those in the supplementary planes take 4 bytes
Code Point Range | UTF-8 binary |
---|---|
0x0000-0x007F | 0xxxxxxx |
0x0080-0x07FF | 110xxxxx 10xxxxxx |
0x0800-0xFFFF | 1110xxxx 10xxxxxx 10xxxxxx |
0x10000-0x10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
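The bit patterns in the table translate almost directly into shifts and masks. The following is a hand-written sketch for illustration only (the class `Utf8Demo` and its methods are made-up names); in real code `String.getBytes(StandardCharsets.UTF_8)` does this work, and the sketch uses it as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Demo {
    /** Encodes one code point to UTF-8 by following the table row by row. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0xBC, 0x4E00, 0x20021}) {
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %s (matches JDK: %b)%n",
                    cp, hex(encode(cp)), Arrays.equals(encode(cp), jdk));
        }
    }
}
```

The surrogate range 0xD800-0xDFFF is rejected because those values are reserved for UTF-16 and are not characters on their own.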
UTF-16
- Unicode Transformation Format, 16-bit
- Variable-width (2 or 4 bytes) character encoding
- Characters in the BMP are encoded as 2 bytes
- Characters in Planes 1-16 are encoded as 4 bytes (a surrogate pair)
- Byte Order Mark (BOM): the code point 0xFEFF written at the start of the data
  - bytes FE FF: big-endian (BE)
  - bytes FF FE: little-endian (LE)
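To see the byte order mark in practice, the sketch below encodes "A" with the JDK's three UTF-16 charsets. The class name `BomDemo` and the `hex` helper are illustrative; the behaviour relied on is what the `java.nio.charset.Charset` documentation describes: `UTF-16` writes a big-endian BOM when encoding and honours a BOM when decoding, while `UTF-16BE` and `UTF-16LE` write no BOM.

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {
    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" prepends the BOM FE FF and then uses big-endian order.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // The BE/LE variants fix the byte order and therefore write no BOM.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00

        // When decoding with "UTF-16", a leading FF FE switches to little-endian.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(new String(littleEndianWithBom, StandardCharsets.UTF_16)); // A
    }
}
```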
Plane | Code Point Range | UTF-16 binary |
---|---|---|
Plane 0 (BMP) | 0x0000-0xFFFF | xxxxxxxx xxxxxxxx |
Plane 1 | 0x10000-0x1FFFF | 11011000 00xxxxxx 110111xx xxxxxxxx |
Plane 2 | 0x20000-0x2FFFF | 11011000 01xxxxxx 110111xx xxxxxxxx |
Plane n | ... | 110110pp ppxxxxxx 110111xx xxxxxxxx (pppp = n - 1) |
Plane 15 | 0xF0000-0xFFFFF | 11011011 10xxxxxx 110111xx xxxxxxxx |
Plane 16 | 0x100000-0x10FFFF | 11011011 11xxxxxx 110111xx xxxxxxxx |
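The 4-byte patterns in the table come from a short calculation: subtract 0x10000 from the code point to get a 20-bit value, put the top 10 bits after the prefix 110110 (0xD800) and the low 10 bits after the prefix 110111 (0xDC00). Here is a sketch of that calculation, with `SurrogateDemo` and `toSurrogates` as illustrative names and the JDK's `Character.toChars` as a cross-check.

```java
import java.util.Arrays;

final class SurrogateDemo {
    /** Splits a supplementary code point (Planes 1-16) into its UTF-16 surrogate pair. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                        // 20-bit value: pppp xxxxxx xxxxxxxxxx
        char high = (char) (0xD800 | (v >> 10));     // 110110pp ppxxxxxx
        char low = (char) (0xDC00 | (v & 0x3FF));    // 110111xx xxxxxxxx
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int cp = 0x20021;
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(cp)));
    }
}
```

For 0x20021 the 20-bit value is 0x10021, which gives the pair D840 DC21; this is why a Java `String` holding that character has length 2.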
UTF-32
- Unicode Transformation Format, 32-bit
- Fixed-width (4 bytes) character encoding
Code Point Range | UTF-32 binary |
---|---|
0x0000-0x10FFFF | 00000000 000xxxxx xxxxxxxx xxxxxxxx |
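Because UTF-32 is fixed width, encoding is just writing the code point as a 32-bit integer. The sketch below does this in big-endian order, which is the order shown in the table; `Utf32Demo` and `encodeBigEndian` are illustrative names, and `ByteBuffer` is big-endian by default.

```java
import java.nio.ByteBuffer;

final class Utf32Demo {
    /** Writes a code point as 4 big-endian bytes (hypothetical helper). */
    static byte[] encodeBigEndian(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a code point: " + cp);
        }
        // A new ByteBuffer uses big-endian byte order unless told otherwise.
        return ByteBuffer.allocate(4).putInt(cp).array();
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0x4E00, 0x20021}) {
            StringBuilder sb = new StringBuilder();
            for (byte b : encodeBigEndian(cp)) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            System.out.printf("U+%04X -> %s%n", cp, sb.toString().trim());
        }
    }
}
```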
Practice in Java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharacterTest {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset UTF16_BE = Charset.forName("UTF-16BE");
    private static final Charset UTF32 = Charset.forName("UTF-32");
    // Typed list, so that charset.encode(...) compiles inside the lambda below.
    private static final List<Charset> charsets = new ArrayList<>();

    static {
        charsets.add(UTF8);
        charsets.add(UTF16_BE);
        charsets.add(UTF32);
    }

    public static void main(String[] args) {
        printCharacter("A");
        printCharacter("¼");
        printCharacter("一");
        // U+20021 (Plane 2), written as its UTF-16 surrogate pair.
        printCharacter("\uD840\uDC21");
    }

    /**
     * Prints the code points of str, then str encoded
     * in UTF-8, UTF-16 and UTF-32 as hex and as binary.
     *
     * @param str the text to inspect
     */
    private static void printCharacter(String str) {
        str.codePoints().forEach((s) ->
                System.out.format("%10s%40s\n", "Code Point", "0x " + Integer.toHexString(s).toUpperCase() + " "));
        charsets.forEach((charset -> {
            ByteBuffer byteBuffer = charset.encode(str);
            System.out.format("%10s%40s\n", charset.name(), byteBufferToHexString(byteBuffer));
            System.out.format("%10s%40s\n", charset.name(), byteBufferToBinaryString(byteBuffer));
        }));
        System.out.println();
    }

    /**
     * Converts a ByteBuffer to a hexadecimal string.
     *
     * @param byteBuffer the encoded bytes; it is rewound before reading
     * @return the bytes as two-digit hex values prefixed with "0x "
     */
    private static String byteBufferToHexString(ByteBuffer byteBuffer) {
        StringBuilder hexString = new StringBuilder("0x ");
        byteBuffer.rewind();
        while (byteBuffer.hasRemaining()) {
            int i = Byte.toUnsignedInt(byteBuffer.get());
            hexString.append(padZeros(Integer.toHexString(i), 2));
        }
        return hexString.toString();
    }

    /**
     * Converts a ByteBuffer to a binary string.
     *
     * @param byteBuffer the encoded bytes; it is rewound before reading
     * @return the bytes as eight-digit binary values prefixed with "0b "
     */
    private static String byteBufferToBinaryString(ByteBuffer byteBuffer) {
        StringBuilder binaryString = new StringBuilder("0b ");
        byteBuffer.rewind();
        while (byteBuffer.hasRemaining()) {
            int i = Byte.toUnsignedInt(byteBuffer.get());
            binaryString.append(padZeros(Integer.toBinaryString(i), 8));
        }
        return binaryString.toString();
    }

    /**
     * Left-pads str with zeros up to len characters, upper-cases it
     * and appends a trailing space as a separator.
     *
     * @param str the digits to pad
     * @param len the minimum number of digits
     * @return the padded, upper-cased string followed by one space
     */
    private static String padZeros(String str, int len) {
        int numOfZeros = len - str.length();
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < numOfZeros; i++) {
            stringBuilder.append("0");
        }
        stringBuilder.append(str.toUpperCase()).append(" ");
        return stringBuilder.toString();
    }
}
Characters encoded in UTF-8/16/32
Character | Code Point | UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|---|
A | 0x41 | 01000001 (0x41) | 00000000 01000001 (0x00 41) | 00000000 00000000 00000000 01000001 (0x00 00 00 41) |
¼ | 0xBC | 11000010 10111100 (0xC2 BC) | 00000000 10111100 (0x00 BC) | 00000000 00000000 00000000 10111100 (0x00 00 00 BC) |
一 | 0x4E00 | 11100100 10111000 10000000 (0xE4 B8 80) | 01001110 00000000 (0x4E 00) | 00000000 00000000 01001110 00000000 (0x00 00 4E 00) |
𠀡 | 0x20021 | 11110000 10100000 10000000 10100001 (0xF0 A0 80 A1) | 11011000 01000000 11011100 00100001 (0xD8 40 DC 21) | 00000000 00000010 00000000 00100001 (0x00 02 00 21) |