Character
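The source is 7000+ lines long, so only the methods and comments I consider essential are pasted here.
First, the class comment. It makes the following points:
- The class provides various checks and conversions (upper/lower case, digits, and others)
- It follows the Unicode Standard, version 6.2.0
- The methods and data of the Character class are defined from the Unicode Character Database
- char was designed around fixed-width 16-bit entities; the current Unicode Standard allows values wider than 16 bits, and the legal code point range is now U+0000 to U+10FFFF (inclusive at both ends, fitting in 21 bits), known as Unicode scalar values
- Unicode is split into two parts: the BMP (Basic Multilingual Plane), the characters from U+0000 to U+FFFF, and the supplementary characters from U+10000 to U+10FFFF
- How supplementary characters are represented
Basis
The 2048 values U+D800-U+DFFF are not assigned to any character; they are called the Unicode surrogate area
Implementation
Subtract 0x10000 from the code point to get a 20-bit value, take its high 10 bits and prefix them with 1101 10 (giving D800-DBFF)
Take its low 10 bits and prefix them with 1101 11 (giving DC00-DFFF); both halves fall inside the surrogate area
- A code point is the number that Unicode assigns to a character
- Methods that accept a char do not support supplementary characters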
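The surrogate split described above can be sketched in a few lines. This is a minimal illustration, not the JDK's code: the class and method names (`SurrogateSketch`, `high`, `low`) are made up for this post, and the result is only cross-checked against the real `Character.highSurrogate` and `Character.lowSurrogate` helpers.

```java
class SurrogateSketch {
    // High surrogate: the bits 110110 followed by the top 10 bits of (codePoint - 0x10000).
    static char high(int codePoint) {
        return (char) (0xD800 | ((codePoint - 0x10000) >>> 10));
    }

    // Low surrogate: the bits 110111 followed by the bottom 10 bits of (codePoint - 0x10000).
    static char low(int codePoint) {
        return (char) (0xDC00 | ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x2F81A; // the CJK ideograph used as an example in the class comment
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) high(cp), (int) low(cp));
        // Cross-check against the JDK's own helpers (available since Java 7).
        System.out.println(high(cp) == Character.highSurrogate(cp));
        System.out.println(low(cp) == Character.lowSurrogate(cp));
    }
}
```

The sketch assumes the argument is already a supplementary code point (U+10000 to U+10FFFF); it does no validation.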
/**
* The {@code Character} class wraps a value of the primitive
* type {@code char} in an object. An object of type
* {@code Character} contains a single field whose type is
* {@code char}.
*
* The Character class wraps the primitive type char in an object; a Character object holds a single char field.
*
*
* In addition, this class provides several methods for determining
* a character's category (lowercase letter, digit, etc.) and for converting
* characters from uppercase to lowercase and vice versa.
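* In addition, the class offers many methods for determining a character's category
* (lowercase letter, digit, and so on) and for converting between upper and lower case.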
*
* Character information is based on the Unicode Standard, version 6.2.0.
*
* Character information follows the Unicode Standard, version 6.2.0.
*
* The methods and data of class {@code Character} are defined by
* the information in the UnicodeData file that is part of the
* Unicode Character Database maintained by the Unicode
* Consortium. This file specifies various properties including name
* and general category for every defined Unicode code point or
* character range.
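* The data and methods of Character are defined by the information in the UnicodeData file.
* That file is part of the Unicode Character Database maintained by the Unicode Consortium.
* It specifies various properties, including the name and general category of every defined code point or character range.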
*
*
* The file and its description are available from the Unicode Consortium at:
*
*
* Unicode Character Representations
*
* The {@code char} data type (and therefore the value that a
* {@code Character} object encapsulates) are based on the
* original Unicode specification, which defined characters as
* fixed-width 16-bit entities. The Unicode Standard has since been
* changed to allow for characters whose representation requires more
* than 16 bits. The range of legal code points is now
* U+0000 to U+10FFFF, known as Unicode scalar value.
* (Refer to the
* definition of the U+n notation in the Unicode
* Standard.)
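* The char data type is based on the original Unicode specification, which used fixed-width 16-bit entities.
* The current Unicode Standard allows values wider than 16 bits; the range is now U+0000 to U+10FFFF,
* known as Unicode scalar values.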
*
*
The set of characters from U+0000 to U+FFFF is
* sometimes referred to as the Basic Multilingual Plane (BMP).
* Characters whose code points are greater
* than U+FFFF are called supplementary characters. The Java
* platform uses the UTF-16 representation in {@code char} arrays and
* in the {@code String} and {@code StringBuffer} classes. In
* this representation, supplementary characters are represented as a pair
* of {@code char} values, the first from the high-surrogates
* range, (\uD800-\uDBFF), the second from the
* low-surrogates range (\uDC00-\uDFFF).
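* The BMP (Basic Multilingual Plane) is the set of characters from U+0000 to U+FFFF.
* Characters above U+FFFF are called supplementary characters.
*
* Java uses UTF-16 in char arrays, String and StringBuffer. In this representation a supplementary
* character is stored as two chars: a high surrogate in \uD800-\uDBFF followed by a low surrogate in \uDC00-\uDFFF.
*
* The algorithm: subtract U+10000 from the supplementary code point (range U+10000-U+10FFFF) to get a value A in the range 0-0xFFFFF,
* that is, 20 bits: 0000 0000 0000 0000 0000 to 1111 1111 1111 1111 1111.
* The 2048 values U+D800-U+DFFF are not assigned to any character and are called the Unicode surrogate area.
* To avoid ambiguity, take the high 10 bits of A and prefix them with 1101 10 (D800-DBFF),
* then take the low 10 bits and prefix them with 1101 11 (DC00-DFFF). Both halves fall in the surrogate area, which makes them easy to recognize.
* from https://blog.csdn.net/wangdingqiaoit/article/details/12083191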
*
*
A {@code char} value, therefore, represents Basic
* Multilingual Plane (BMP) code points, including the surrogate
* code points, or code units of the UTF-16 encoding. An
* {@code int} value represents all Unicode code points,
* including supplementary code points. The lower (least significant)
* 21 bits of {@code int} are used to represent Unicode code
* points and the upper (most significant) 11 bits must be zero.
* Unless otherwise specified, the behavior with respect to
* supplementary characters and surrogate {@code char} values is
* as follows:
*
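* A char value therefore represents a BMP code point, including the surrogate code points, or a code unit of the UTF-16 encoding.
* An int value can represent every Unicode code point, including supplementary code points.
* The lower (least significant) 21 bits of the int hold the code point, and the upper (most significant) 11 bits must be zero.
* Unless otherwise specified, supplementary characters and surrogate char values behave as follows: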
*
*
* - The methods that only accept a {@code char} value cannot support
* supplementary characters. They treat {@code char} values from the
* surrogate ranges as undefined characters. For example,
* {@code Character.isLetter('\u005CuD840')} returns {@code false}, even though
* this specific value if followed by any low-surrogate value in a string
* would represent a letter.
*
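* Methods that only accept a char cannot support supplementary characters. They treat char values in the surrogate ranges as undefined characters.
* For example, Character.isLetter('\uD840') returns false,
* even though this value followed by any low surrogate in a string would represent a letter.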
*
*
- The methods that accept an {@code int} value support all
* Unicode characters, including supplementary characters. For
* example, {@code Character.isLetter(0x2F81A)} returns
* {@code true} because the code point value represents a letter
* (a CJK ideograph).
*
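* Methods that accept an int support all Unicode characters, including supplementary characters.
* For example, Character.isLetter(0x2F81A) returns true because that code point is a letter (a CJK ideograph).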
*
* In the Java SE API documentation, Unicode code point is
* used for character values in the range between U+0000 and U+10FFFF,
* and Unicode code unit is used for 16-bit
* {@code char} values that are code units of the UTF-16
* encoding. For more information on Unicode terminology, refer to the
* Unicode Glossary.
*
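* In the Java SE API documentation, "Unicode code point" refers to character values between U+0000 and U+10FFFF,
* and "Unicode code unit" refers to the 16-bit char values that are code units of UTF-16.
* For more on Unicode terminology, see the Unicode Glossary.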
*
* @author Lee Boynton
* @author Guy Steele
* @author Akira Tanaka
* @author Martin Buchholz
* @author Ulf Zibis
* @since 1.0
*/
Fields & methods
Radix
/**
* The minimum radix available for conversion to and from strings.
* The constant value of this field is the smallest value permitted
* for the radix argument in radix-conversion methods such as the
* {@code digit} method, the {@code forDigit} method, and the
* {@code toString} method of class {@code Integer}.
*
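* The smallest radix usable for conversion to and from strings.
* This constant is the smallest value allowed for the radix argument of radix-conversion methods
* such as digit, forDigit, and Integer.toString.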
*
* @see Character#digit(char, int)
* @see Character#forDigit(int, int)
* @see Integer#toString(int, int)
* @see Integer#valueOf(String)
*/
public static final int MIN_RADIX = 2;
/**
* The maximum radix available for conversion to and from strings.
* The constant value of this field is the largest value permitted
* for the radix argument in radix-conversion methods such as the
* {@code digit} method, the {@code forDigit} method, and the
* {@code toString} method of class {@code Integer}.
* The largest radix
* @see Character#digit(char, int)
* @see Character#forDigit(int, int)
* @see Integer#toString(int, int)
* @see Integer#valueOf(String)
*/
public static final int MAX_RADIX = 36;
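Minimum and maximum radix. The maximum is 36; my guess is 36 = 10 digits (0-9) + 26 letters (a-z), which matches how Character.digit maps the letters a-z to the values 10 to 35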
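A quick way to see the radix limits in action is to call the conversion methods directly. The snippet below is only an illustration; the helper name `isDigitInRadix` is made up for this post, while `Character.digit`, `Character.forDigit` and `Integer.toString(int, int)` are the real JDK methods named in the Javadoc above.

```java
class RadixSketch {
    // Hypothetical helper: true if c is a valid digit in the given radix.
    static boolean isDigitInRadix(char c, int radix) {
        return Character.digit(c, radix) != -1;
    }

    public static void main(String[] args) {
        System.out.println(Character.digit('7', 10));                  // 7
        System.out.println(Character.digit('z', Character.MAX_RADIX)); // 35
        System.out.println(Character.digit('z', 16));                  // -1, 'z' is not a hex digit
        System.out.println(Character.forDigit(35, 36));                // z
        System.out.println(Integer.toString(35, 36));                  // z
        // A radix outside [MIN_RADIX, MAX_RADIX] makes digit() return -1.
        System.out.println(Character.digit('1', 37));                  // -1
    }
}
```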
Directionality
/**
* Undefined bidirectional character type. Undefined {@code char}
* values have undefined directionality in the Unicode specification.
* Undefined visual directionality.
* For Unicode control characters,
* see https://blog.csdn.net/weixin_33709609/article/details/94704685
* @since 1.4
*/
public static final byte DIRECTIONALITY_UNDEFINED = -1;
/**
*
* Strong bidirectional character type "L" in the Unicode specification.
* Left to right
* @since 1.4
*/
public static final byte DIRECTIONALITY_LEFT_TO_RIGHT = 0;
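Directionality simply says whether characters are displayed left to right or the other way round. The two constants above are examples; there are many more, such as the one that ends a directional formatting run (DIRECTIONALITY_POP_DIRECTIONAL_FORMAT)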
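To make the directionality constants concrete, here is a small sketch that maps a few of them to the short names used in the Unicode bidirectional algorithm. The class and the `describe` helper are hypothetical; only `Character.getDirectionality` and the `DIRECTIONALITY_*` constants come from the JDK.

```java
class DirectionalitySketch {
    // Hypothetical helper: returns a short label for a handful of directionality values.
    static String describe(char c) {
        switch (Character.getDirectionality(c)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:        return "L";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:        return "R";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: return "AL";
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER:      return "EN";
            case Character.DIRECTIONALITY_UNDEFINED:            return "undefined";
            default:                                            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe('a'));      // Latin letter
        System.out.println(describe('\u05D0')); // Hebrew letter alef
        System.out.println(describe('\u0627')); // Arabic letter alef
        System.out.println(describe('1'));      // ASCII digit
    }
}
```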
Boundary values
/**
* The minimum value of a
*
* Unicode high-surrogate code unit
* in the UTF-16 encoding, constant {@code '\u005CuD800'}.
* A high-surrogate is also known as a leading-surrogate.
* Minimum value of the high (first) surrogate of a supplementary character
*
* @since 1.5
*/
public static final char MIN_HIGH_SURROGATE = '\uD800';
/**
* The maximum value of a
*
* Unicode high-surrogate code unit
* in the UTF-16 encoding, constant {@code '\u005CuDBFF'}.
* A high-surrogate is also known as a leading-surrogate.
* Maximum value of the high (first) surrogate of a supplementary character
*
* @since 1.5
*/
public static final char MAX_HIGH_SURROGATE = '\uDBFF';
/**
* The minimum value of a
*
* Unicode low-surrogate code unit
* in the UTF-16 encoding, constant {@code '\u005CuDC00'}.
* A low-surrogate is also known as a trailing-surrogate.
* Minimum value of the low (second) surrogate of a supplementary character
*
* @since 1.5
*/
public static final char MIN_LOW_SURROGATE = '\uDC00';
/**
* The maximum value of a
*
* Unicode low-surrogate code unit
* in the UTF-16 encoding, constant {@code '\u005CuDFFF'}.
* A low-surrogate is also known as a trailing-surrogate.
* Maximum value of the low (second) surrogate of a supplementary character
*
* @since 1.5
*/
public static final char MAX_LOW_SURROGATE = '\uDFFF';
/**
* The minimum value of a Unicode surrogate code unit in the
* UTF-16 encoding, constant {@code '\u005CuD800'}.
*
* Minimum value of the surrogate area
* @since 1.5
*/
public static final char MIN_SURROGATE = MIN_HIGH_SURROGATE;
/**
* The maximum value of a Unicode surrogate code unit in the
* UTF-16 encoding, constant {@code '\u005CuDFFF'}.
*
* Maximum value of the surrogate area
* @since 1.5
*/
public static final char MAX_SURROGATE = MAX_LOW_SURROGATE;
/**
* The minimum value of a
*
* Unicode supplementary code point, constant {@code U+10000}.
*
* Smallest supplementary code point
* @since 1.5
*/
public static final int MIN_SUPPLEMENTARY_CODE_POINT = 0x010000;
/**
* The minimum value of a
*
* Unicode code point, constant {@code U+0000}.
* Minimum Unicode code point
* @since 1.5
*/
public static final int MIN_CODE_POINT = 0x000000;
/**
* The maximum value of a
*
* Unicode code point, constant {@code U+10FFFF}.
* Maximum Unicode code point
* @since 1.5
*/
public static final int MAX_CODE_POINT = 0X10FFFF;
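The boundary constants above partition the int range into a few regions. The sketch below classifies a value using them; the `BoundarySketch` class and `classify` method are invented for illustration, while the constants and the `isValidCodePoint` / `isSupplementaryCodePoint` checks are the JDK's.

```java
class BoundarySketch {
    // Hypothetical helper: names the region an int falls into, using the constants above.
    static String classify(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid"; // below MIN_CODE_POINT or above MAX_CODE_POINT
        }
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return "supplementary"; // MIN_SUPPLEMENTARY_CODE_POINT .. MAX_CODE_POINT
        }
        if (codePoint >= Character.MIN_HIGH_SURROGATE && codePoint <= Character.MAX_HIGH_SURROGATE) {
            return "high surrogate";
        }
        if (codePoint >= Character.MIN_LOW_SURROGATE && codePoint <= Character.MAX_LOW_SURROGATE) {
            return "low surrogate";
        }
        return "BMP";
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xD800, 0xDFFF, 0x10000, 0x10FFFF, 0x110000};
        for (int cp : samples) {
            System.out.printf("0x%X -> %s%n", cp, classify(cp));
        }
    }
}
```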
Character.UnicodeBlock
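It mainly defines the range of each block. Blocks never overlap, which is the difference from UnicodeScript. Only a few basic blocks are listed here. The implementation keeps two parallel sorted arrays (block start code points and block objects) and looks a block up with binary search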
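The two-array lookup can be sketched with a miniature table. This is not the JDK's table: `STARTS` and `NAMES` below are a hypothetical four-block excerpt (plus a `null` entry marking "not covered by this sketch"), and `blockOf` finds the last start that is less than or equal to the code point, in the same spirit as `UnicodeBlock.of(int)`.

```java
class BlockLookupSketch {
    // Hypothetical miniature table: first code point of each range, sorted ascending.
    static final int[] STARTS = {0x0000, 0x0080, 0x0100, 0x0180, 0x0250};
    // NAMES[i] is the block that begins at STARTS[i]; null means "outside this sketch".
    static final String[] NAMES = {
        "BASIC_LATIN", "LATIN_1_SUPPLEMENT", "LATIN_EXTENDED_A", "LATIN_EXTENDED_B", null
    };

    static String blockOf(int codePoint) {
        if (codePoint < 0) {
            throw new IllegalArgumentException();
        }
        int lo = 0, hi = STARTS.length; // invariant: the answer index lies in [lo, hi)
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (codePoint >= STARTS[mid]) {
                lo = mid;   // STARTS[mid] is still a candidate
            } else {
                hi = mid;   // the answer is strictly before mid
            }
        }
        return NAMES[lo];
    }

    public static void main(String[] args) {
        System.out.println(blockOf('A'));    // BASIC_LATIN
        System.out.println(blockOf(0x00E9)); // LATIN_1_SUPPLEMENT
        System.out.println(blockOf(0x0250)); // null
    }
}
```

Because the search looks for a range rather than an exact key, there is no "found it" case to return early from; the loop simply narrows until one candidate is left.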
/**
* A family of character subsets representing the character blocks in the
* Unicode specification. Character blocks generally define characters
* used for a specific script or purpose. A character is contained by
* at most one Unicode block.
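* Defines character blocks. Each character belongs to at most one block, and names follow those in the Unicode specification.
*
* The blocks (names and aliases) and the block ranges are defined with two parallel arrays.
* Methods are provided to find the block of a char (code point) and to get a block by name.
*
* Chinese punctuation mostly lives in the following 5 UnicodeBlocks:
*
* U+2000 General Punctuation (per mille sign, single quotes, double quotes, etc.)
*
* U+3000 CJK Symbols and Punctuation (ideographic comma, ideographic full stop, title marks, 〸, 〹, 〺, etc. PS: do you know what the last three characters mean? : ) )
*
* U+FF00 Halfwidth and Fullwidth Forms (greater-than, less-than, equals, parentheses, exclamation mark, plus, minus, colon, semicolon, etc.)
*
* U+FE30 CJK Compatibility Forms (mainly brackets for vertical writing, plus the dashed line ﹉, the wavy line ﹌, etc.)
*
* U+FE10 Vertical Forms (mainly punctuation marks for vertical writing, etc.)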
* @since 1.2
*/
public static final class UnicodeBlock extends Subset {
/**
* Constant for the "Basic Latin" Unicode character block.
* @since 1.2
*/
public static final UnicodeBlock BASIC_LATIN =
new UnicodeBlock("BASIC_LATIN",
"BASIC LATIN",
"BASICLATIN");
/**
* Constant for the "Latin-1 Supplement" Unicode character block.
* @since 1.2
*/
public static final UnicodeBlock LATIN_1_SUPPLEMENT =
new UnicodeBlock("LATIN_1_SUPPLEMENT",
"LATIN-1 SUPPLEMENT",
"LATIN-1SUPPLEMENT");
/**
* Constant for the "Latin Extended-A" Unicode character block.
* @since 1.2
*/
public static final UnicodeBlock LATIN_EXTENDED_A =
new UnicodeBlock("LATIN_EXTENDED_A",
"LATIN EXTENDED-A",
"LATINEXTENDED-A");
/**
* Constant for the "Latin Extended-B" Unicode character block.
* @since 1.2
*/
public static final UnicodeBlock LATIN_EXTENDED_B =
new UnicodeBlock("LATIN_EXTENDED_B",
"LATIN EXTENDED-B",
"LATINEXTENDED-B");
/**
* Returns the object representing the Unicode block containing the
* given character, or {@code null} if the character is not a
* member of a defined block.
*
* Note: This method cannot handle
* supplementary
* characters. To support all Unicode characters, including
* supplementary characters, use the {@link #of(int)} method.
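* Returns the block of a char. Supplementary characters are not supported;
* use the overloaded of(int) method for those. Why not supported?
* Because a supplementary character cannot be represented by a single char.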
*
* @param c The character in question
* @return The {@code UnicodeBlock} instance representing the
* Unicode block of which this character is a member, or
* {@code null} if the character is not a member of any
* Unicode block
*/
public static UnicodeBlock of(char c) {
return of((int)c);
}
/**
* Returns the object representing the Unicode block
* containing the given character (Unicode code point), or
* {@code null} if the character is not a member of a
* defined block.
*
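* I think this binary search could be tighter: when codePoint equals a block start it does not exit early but keeps narrowing until one element is left (the result is still correct).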
*
* @param codePoint the character (Unicode code point) in question.
* @return The {@code UnicodeBlock} instance representing the
* Unicode block of which this character is a member, or
* {@code null} if the character is not a member of any
* Unicode block
* @exception IllegalArgumentException if the specified
* {@code codePoint} is an invalid Unicode code point.
* @see Character#isValidCodePoint(int)
* @since 1.5
*/
public static UnicodeBlock of(int codePoint) {
if (!isValidCodePoint(codePoint)) {
throw new IllegalArgumentException();
}
int top, bottom, current;
bottom = 0;
top = blockStarts.length;
current = top/2;
// invariant: top > current >= bottom && codePoint >= unicodeBlockStarts[bottom]
while (top - bottom > 1) {
if (codePoint >= blockStarts[current]) {
bottom = current;
} else {
top = current;
}
current = (top + bottom) / 2;
}
return blocks[current];
}
/**
* Returns the UnicodeBlock with the given name. Block
* names are determined by The Unicode Standard. The file
* Blocks-<version>.txt defines blocks for a particular
* version of the standard. The {@link Character} class specifies
* the version of the standard that it supports.
*
* This method accepts block names in the following forms:
*
* - Canonical block names as defined by the Unicode Standard.
* For example, the standard defines a "Basic Latin" block. Therefore, this
* method accepts "Basic Latin" as a valid block name. The documentation of
* each UnicodeBlock provides the canonical name.
*
- Canonical block names with all spaces removed. For example, "BasicLatin"
* is a valid block name for the "Basic Latin" block.
*
- The text representation of each constant UnicodeBlock identifier.
* For example, this method will return the {@link #BASIC_LATIN} block if
* provided with the "BASIC_LATIN" name. This form replaces all spaces and
* hyphens in the canonical name with underscores.
*
* Finally, character case is ignored for all of the valid block name forms.
* For example, "BASIC_LATIN" and "basic_latin" are both valid block names.
* The en_US locale's case mapping rules are used to provide case-insensitive
* string comparisons for block name validation.
*
* If the Unicode Standard changes block names, both the previous and
* current names will be accepted.
*
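* This method just resolves a name to its block, so the different spellings of one name all give the same block; names should follow the Unicode specification.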
*
* @param blockName A {@code UnicodeBlock} name.
* @return The {@code UnicodeBlock} instance identified
* by {@code blockName}
* @throws IllegalArgumentException if {@code blockName} is an
* invalid name
* @throws NullPointerException if {@code blockName} is null
* @since 1.5
*/
public static final UnicodeBlock forName(String blockName) {
UnicodeBlock block = map.get(blockName.toUpperCase(Locale.US));
if (block == null) {
throw new IllegalArgumentException();
}
return block;
}
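As a usage sketch for `UnicodeBlock.of` and `forName`, the snippet below checks whether a code point falls in one of the five blocks listed in the class comment. The helper `inChinesePunctuationBlock` is hypothetical, and it is a coarse test: those blocks also contain characters that are not punctuation (fullwidth letters and digits, for example).

```java
import java.lang.Character.UnicodeBlock;

class BlockUsageSketch {
    // Hypothetical helper: true if the code point lies in one of the five blocks named above.
    static boolean inChinesePunctuationBlock(int codePoint) {
        UnicodeBlock b = UnicodeBlock.of(codePoint);
        return b == UnicodeBlock.GENERAL_PUNCTUATION
            || b == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || b == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
            || b == UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || b == UnicodeBlock.VERTICAL_FORMS;
    }

    public static void main(String[] args) {
        System.out.println(UnicodeBlock.of('A'));                      // BASIC_LATIN
        System.out.println(inChinesePunctuationBlock('\u3002'));       // ideographic full stop
        System.out.println(inChinesePunctuationBlock('\u4E2D'));       // a CJK ideograph
        // The three accepted name forms resolve to the same constant.
        System.out.println(UnicodeBlock.forName("Basic Latin") == UnicodeBlock.BASIC_LATIN);
        System.out.println(UnicodeBlock.forName("BasicLatin") == UnicodeBlock.BASIC_LATIN);
        System.out.println(UnicodeBlock.forName("basic_latin") == UnicodeBlock.BASIC_LATIN);
    }
}
```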
UnicodeScript
/**
* A family of character subsets representing the character scripts
* defined in the
* Unicode Standard Annex #24: Script Names. Every Unicode
* character is assigned to a single Unicode script, either a specific
* script, such as {@link Character.UnicodeScript#LATIN Latin}, or
* one of the following three special values,
* {@link Character.UnicodeScript#INHERITED Inherited},
* {@link Character.UnicodeScript#COMMON Common} or
* {@link Character.UnicodeScript#UNKNOWN Unknown}.
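* UnicodeScript groups characters by usage (the writing system they belong to),
* while UnicodeBlock groups them by code point layout: think of one line segment cut into several contiguous segments.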
*
* @since 1.7
*/
public static enum UnicodeScript {
/**
* Returns the enum constant representing the Unicode script of which
* the given character (Unicode code point) is assigned to.
*
* @param codePoint the character (Unicode code point) in question.
* @return The {@code UnicodeScript} constant representing the
* Unicode script of which this character is assigned to.
* Returns the UnicodeScript for codePoint.
*
* Throws IllegalArgumentException if codePoint is outside the Unicode code point range.
* @exception IllegalArgumentException if the specified
* {@code codePoint} is an invalid Unicode code point.
* @see Character#isValidCodePoint(int)
*
*/
public static UnicodeScript of(int codePoint) {
if (!isValidCodePoint(codePoint))
throw new IllegalArgumentException();
int type = getType(codePoint);
// leave SURROGATE and PRIVATE_USE for table lookup
if (type == UNASSIGNED)
return UNKNOWN;
int index = Arrays.binarySearch(scriptStarts, codePoint);
if (index < 0)
index = -index - 2;
return scripts[index];
}
/**
* Returns the UnicodeScript constant with the given Unicode script
* name or the script name alias. Script names and their aliases are
* determined by The Unicode Standard. The files Scripts<version>.txt
* and PropertyValueAliases<version>.txt define script names
* and the script name aliases for a particular version of the
* standard. The {@link Character} class specifies the version of
* the standard that it supports.
*
* Character case is ignored for all of the valid script names.
* The en_US locale's case mapping rules are used to provide
* case-insensitive string comparisons for script name validation.
*
*
* @param scriptName A {@code UnicodeScript} name.
* @return The {@code UnicodeScript} constant identified
* by {@code scriptName}
* @throws IllegalArgumentException if {@code scriptName} is an
* invalid name
* @throws NullPointerException if {@code scriptName} is null
*/
public static final UnicodeScript forName(String scriptName) {
scriptName = scriptName.toUpperCase(Locale.ENGLISH);
//.replace(' ', '_'));
UnicodeScript sc = aliases.get(scriptName);
if (sc != null)
return sc;
return valueOf(scriptName);
}
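The difference between a script and a block shows up clearly on punctuation. In the sketch below (class and helper names are made up for this post) the ideographic full stop sits in a CJK block but its script is COMMON, while a CJK ideograph has script HAN regardless of which block it is in. `forName` also accepts the four-letter aliases such as "Latn".

```java
import java.lang.Character.UnicodeScript;

class ScriptSketch {
    // Hypothetical helper: the script name of a code point.
    static String scriptName(int codePoint) {
        return UnicodeScript.of(codePoint).name();
    }

    public static void main(String[] args) {
        System.out.println(scriptName('a'));      // LATIN
        System.out.println(scriptName('\u4E2D')); // HAN
        System.out.println(scriptName(0x2F81A));  // HAN, a supplementary ideograph
        System.out.println(scriptName('1'));      // COMMON
        System.out.println(scriptName('\u3002')); // COMMON, although its block is CJK Symbols and Punctuation
        System.out.println(UnicodeScript.forName("Latn"));  // alias lookup
        System.out.println(UnicodeScript.forName("latin")); // case-insensitive name lookup
    }
}
```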
Character itself
It stores a single char internally, so it cannot represent a supplementary character (those need more than 16 bits)
/**
* The value of the {@code Character}.
*
* @serial
*/
private final char value;
/** use serialVersionUID from JDK 1.0.2 for interoperability */
private static final long serialVersionUID = 3786198910865385080L;
/**
* Constructs a newly allocated {@code Character} object that
* represents the specified {@code char} value.
* The so-called "constant pool" (the cache) is not used here; see CharacterCache and valueOf below.
* @param value the value to be represented by the
* {@code Character} object.
*/
public Character(char value) {
this.value = value;
}
private static class CharacterCache {
private CharacterCache(){}
static final Character cache[] = new Character[127 + 1];
static {
for (int i = 0; i < cache.length; i++)
cache[i] = new Character((char)i);
}
}
/**
* Returns a Character instance representing the specified
* char value.
* If a new Character instance is not required, this method
* should generally be used in preference to the constructor
* {@link #Character(char)}, as this method is likely to yield
* significantly better space and time performance by caching
* frequently requested values.
*
* This method will always cache values in the range {@code
* '\u005Cu0000'} to {@code '\u005Cu007F'}, inclusive, and may
* cache other values outside of this range.
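* Why cache this range? My understanding is as follows:
* 0x00-0x7F is identical to ASCII, 0x80-0x9F are control characters, and 0xA0-0xFF are letters and symbols.
* In other words, all of ASCII is cached.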
* @param c a char value.
* @return a Character instance representing c.
* @since 1.5
*/
public static Character valueOf(char c) {
if (c <= 127) { // must cache
return CharacterCache.cache[(int)c];
}
return new Character(c);
}
/**
* Returns a {@code String} object representing this
* {@code Character}'s value. The result is a string of
* length 1 whose sole component is the primitive
* {@code char} value represented by this
* {@code Character} object.
* This is just String.valueOf
* @return a string representation of this object.
*/
public String toString() {
char buf[] = {value};
return String.valueOf(buf);
}
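The effect of the cache can be observed with reference comparison. The sketch below (hypothetical class and helper names) only relies on what the Javadoc guarantees: values from 0 to 127 always come from the cache. For larger values the Javadoc says they may or may not be cached, so the example prints the result instead of assuming one.

```java
class CacheSketch {
    // Hypothetical helper: do two valueOf calls return the very same object?
    static boolean sameInstance(char c) {
        return Character.valueOf(c) == Character.valueOf(c);
    }

    public static void main(String[] args) {
        System.out.println(sameInstance('a'));      // true, inside the guaranteed range
        System.out.println(sameInstance('\u007F')); // true, upper end of the guaranteed range
        // Outside 0..127 the outcome is implementation dependent, so compare with equals().
        System.out.println(sameInstance('\u4E2D'));
        System.out.println(Character.valueOf('\u4E2D').equals(Character.valueOf('\u4E2D'))); // true
    }
}
```

Autoboxing (`Character c = 'a';`) goes through `valueOf`, which is why `==` on boxed ASCII characters happens to work and should still not be relied on in general.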
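Main methods
Methods that work on code points
Validity checks: valid code point, BMP, supplementary, and well-formed surrogate pairs
Getting code points in various ways
Case and digit handling, which also covers things like fullwidth Latin letters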
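A short sketch of the code point methods working together: walking a string by code point instead of by char. `CodePointSketch` and `countLetters` are names invented here; `codePointAt`, `Character.charCount`, `Character.isLetter(int)` and `codePointCount` are real JDK methods.

```java
class CodePointSketch {
    // Hypothetical helper: counts letters, treating a surrogate pair as one character.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return letters;
    }

    public static void main(String[] args) {
        // "a", the supplementary ideograph U+2F81A, and "1"
        String s = "a" + new String(Character.toChars(0x2F81A)) + "1";
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(countLetters(s));                 // 2 letters
    }
}
```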
/**
* Determines if the specified character is a lowercase character.
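* This char overload does not support supplementary characters (use isLowerCase(int) for those); it is true for LOWERCASE_LETTER or Other_Lowercase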
*
* A character is lowercase if its general category type, provided
* by {@code Character.getType(ch)}, is
* {@code LOWERCASE_LETTER}, or it has contributory property
* Other_Lowercase as defined by the Unicode Standard.
*
* The following are examples of lowercase characters:
*
* a b c d e f g h i j k l m n o p q r s t u v w x y z
* '\u00DF' '\u00E0' '\u00E1' '\u00E2' '\u00E3' '\u00E4' '\u00E5' '\u00E6'
* '\u00E7' '\u00E8' '\u00E9' '\u00EA' '\u00EB' '\u00EC' '\u00ED' '\u00EE'
* '\u00EF' '\u00F0' '\u00F1' '\u00F2' '\u00F3' '\u00F4' '\u00F5' '\u00F6'
* '\u00F8' '\u00F9' '\u00FA' '\u00FB' '\u00FC' '\u00FD' '\u00FE' '\u00FF'
*
* Many other Unicode characters are lowercase too.
*
*
Note: This method cannot handle supplementary characters. To support
* all Unicode characters, including supplementary characters, use
* the {@link #isLowerCase(int)} method.
*
* @param ch the character to be tested.
* @return {@code true} if the character is lowercase;
* {@code false} otherwise.
* @see Character#isLowerCase(char)
* @see Character#isTitleCase(char)
* @see Character#toLowerCase(char)
* @see Character#getType(char)
*/
public static boolean isLowerCase(char ch) {
return isLowerCase((int)ch);
}
/**
* Determines if the specified character is a digit.
*
* A character is a digit if its general category type, provided
* by {@code Character.getType(ch)}, is
* {@code DECIMAL_DIGIT_NUMBER}.
*
* Some Unicode character ranges that contain digits:
*
* - {@code '\u005Cu0030'} through {@code '\u005Cu0039'},
* ISO-LATIN-1 digits ({@code '0'} through {@code '9'})
*
- {@code '\u005Cu0660'} through {@code '\u005Cu0669'},
* Arabic-Indic digits
*
- {@code '\u005Cu06F0'} through {@code '\u005Cu06F9'},
* Extended Arabic-Indic digits
*
- {@code '\u005Cu0966'} through {@code '\u005Cu096F'},
* Devanagari digits
*
- {@code '\u005CuFF10'} through {@code '\u005CuFF19'},
* Fullwidth digits
*
*
* Many other character ranges contain digits as well.
* Is it a digit? There are all kinds of digits, and I do not know what all of them are. Some examples:
* 0
* 1
* 2
* 3
* 4
* १
* २
* ३
* ४
* ١
* ٢
* ٣
* ٤
* ٥
*
* ۰
* ۱
* ۲
* ۳
* ۴
* ۵
* Note: This method cannot handle supplementary characters. To support
* all Unicode characters, including supplementary characters, use
* the {@link #isDigit(int)} method.
*
* @param ch the character to be tested.
* @return {@code true} if the character is a digit;
* {@code false} otherwise.
* @see Character#digit(char, int)
* @see Character#forDigit(int, int)
* @see Character#getType(char)
*/
public static boolean isDigit(char ch) {
return isDigit((int)ch);
}
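Since `isDigit` accepts decimal digits from many scripts, `Character.digit` can be used to read a number written in any of them. The sketch below is an illustration only; `DigitSketch` and `parseDecimal` are hypothetical names, and the method handles non-negative values without overflow checks.

```java
class DigitSketch {
    // Hypothetical helper: parses a run of decimal digits from any script.
    static int parseDecimal(String s) {
        int value = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int d = Character.digit(cp, 10);
            if (d < 0) {
                throw new NumberFormatException("not a decimal digit at index " + i);
            }
            value = value * 10 + d;
            i += Character.charCount(cp);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseDecimal("123"));                // ASCII digits
        System.out.println(parseDecimal("\u0661\u0662\u0663")); // Arabic-Indic digits 1 2 3
        System.out.println(parseDecimal("\uFF11\uFF12\uFF13")); // fullwidth digits 1 2 3
        System.out.println(parseDecimal("\u0967\u0968\u0969")); // Devanagari digits 1 2 3
    }
}
```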
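Relationship between the encodings
Unicode really involves two steps. The first is to define a standard that assigns every character a unique number.
That is purely a mathematical matter and has nothing to do with computers. The second step is how to store those numbers in a computer, and only then does the number of bytes come into play.
So we can think of it this way: Unicode uses the numbers 0 to 0x10FFFF to represent all characters (the original design used only 0 to 65535; 65536 is 2 to the 16th power). The 128 numbers from 0 to 127 still mean exactly the same characters as in ASCII. That is the first step.
The second step is how to turn those numbers into bit strings stored in the computer, and there are different ways to do it. This is where the UTFs (Unicode Transformation Formats) come in, such as UTF-8 and UTF-16. UTF-8 needs 4 bytes for a supplementary character, and UTF-16 needs two 16-bit code units (a surrogate pair).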
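The "same number, different storage" idea is easy to check by encoding one code point with two charsets and counting bytes. The class and method names below are made up for illustration; `StandardCharsets` and `String.getBytes(Charset)` are standard JDK API.

```java
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Hypothetical helper: number of bytes one code point occupies in UTF-8.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: number of bytes in UTF-16 (big endian, no byte order mark).
    static int utf16Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```

For these four samples UTF-8 takes 1, 2, 3 and 4 bytes, while UTF-16 takes 2, 2, 2 and 4 bytes.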
/**
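* Converts to UTF-16.
* Unicode really involves two steps. The first is to define a standard that assigns every character a unique number.
* That is purely a mathematical matter and has nothing to do with computers. The second step is how to store those numbers in a computer,
* and only then does the number of bytes come into play.
*
* So we can think of it this way: Unicode uses the numbers 0 to 0x10FFFF to represent all characters (the original design used only 0 to 65535).
* The 128 numbers from 0 to 127 still mean exactly the same characters as in ASCII. 65536 is 2 to the 16th power. That is the first step.
* The second step is how to turn those numbers into bit strings stored in the computer, and there are different ways to do it.
* This is where the UTFs (Unicode Transformation Formats) come in, such as UTF-8 and UTF-16.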
* Converts the specified character (Unicode code point) to its
* UTF-16 representation. If the specified code point is a BMP
* (Basic Multilingual Plane or Plane 0) value, the same value is
* stored in {@code dst[dstIndex]}, and 1 is returned. If the
* specified code point is a supplementary character, its
* surrogate values are stored in {@code dst[dstIndex]}
* (high-surrogate) and {@code dst[dstIndex+1]}
* (low-surrogate), and 2 is returned.
*
* @param codePoint the character (Unicode code point) to be converted.
* @param dst an array of {@code char} in which the
* {@code codePoint}'s UTF-16 value is stored.
* @param dstIndex the start index into the {@code dst}
* array where the converted value is stored.
* @return 1 if the code point is a BMP code point, 2 if the
* code point is a supplementary code point.
* @exception IllegalArgumentException if the specified
* {@code codePoint} is not a valid Unicode code point.
* @exception NullPointerException if the specified {@code dst} is null.
* @exception IndexOutOfBoundsException if {@code dstIndex}
* is negative or not less than {@code dst.length}, or if
* {@code dst} at {@code dstIndex} doesn't have enough
* array element(s) to store the resulting {@code char}
* value(s). (If {@code dstIndex} is equal to
* {@code dst.length-1} and the specified
* {@code codePoint} is a supplementary character, the
* high-surrogate value is not stored in
* {@code dst[dstIndex]}.)
* @since 1.5
*/
public static int toChars(int codePoint, char[] dst, int dstIndex) {
if (isBmpCodePoint(codePoint)) {
dst[dstIndex] = (char) codePoint;
return 1;
} else if (isValidCodePoint(codePoint)) {
toSurrogates(codePoint, dst, dstIndex);
return 2;
} else {
throw new IllegalArgumentException();
}
}
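A possible way to use the three-argument `toChars` is to fill one buffer from several code points, advancing by the returned count. `ToCharsSketch` and `fromCodePoints` are hypothetical names; the buffer is sized for the worst case of two chars per code point.

```java
class ToCharsSketch {
    // Hypothetical helper: builds a String from code points using toChars(int, char[], int).
    static String fromCodePoints(int... codePoints) {
        char[] buf = new char[codePoints.length * 2]; // worst case: every code point needs a pair
        int len = 0;
        for (int cp : codePoints) {
            len += Character.toChars(cp, buf, len);   // returns 1 (BMP) or 2 (supplementary)
        }
        return new String(buf, 0, len);
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x41, 0x1F600);
        System.out.println(s.length());                       // 3 chars: 'A' plus a surrogate pair
        System.out.printf("%04X %04X%n", (int) s.charAt(1), (int) s.charAt(2)); // D83D DE00
        System.out.println(Integer.toHexString(s.codePointAt(1)));              // 1f600
    }
}
```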
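Finally, here is one of the important CharacterData subclass implementations. Many methods depend on it, for example digit handling and case conversion. Without further ado, here is the code; the comments inside show how it works: everything is defined in terms of bit fields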
class CharacterDataLatin1 extends CharacterData {
/* The character properties are currently encoded into 32 bits in the following manner:
A[49] = 0-0011-000000000-0-0-0-011-01-10000-01001
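(A[49] is the entry for the character '1', value 0x18003609)
1 bit: 0 means no mirrored property; for characters such as '(' and '[' this bit is 1
4 bits: 3 (directionality, EUROPEAN_NUMBER)
9 bits: no case offset
1 bit: no lowercase mapping
1 bit: no uppercase mapping
1 bit: no titlecase property
3 bits: 3 means it may continue a Unicode identifier or Java identifier
2 bits: 1 means it has a simple numeric value
5 bits: digit offset is 16 (binary 10000); ('1' + 16) & 0x1F = 1
5 bits: character type is 9 (DECIMAL_DIGIT_NUMBER)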
1 bit mirrored property
4 bits directionality property
9 bits signed offset used for converting case
1 bit if 1, adding the signed offset converts the character to lowercase
1 bit if 1, subtracting the signed offset converts the character to uppercase
1 bit if 1, this character has a titlecase equivalent (possibly itself)
3 bits 0 may not be part of an identifier
1 ignorable control; may continue a Unicode identifier or Java identifier
2 may continue a Java identifier but not a Unicode identifier (unused)
3 may continue a Unicode identifier or Java identifier
4 is a Java whitespace character
5 may start or continue a Java identifier;
may continue but not start a Unicode identifier (underscores)
6 may start or continue a Java identifier but not a Unicode identifier ($)
7 may start or continue a Unicode identifier or Java identifier
Thus:
5, 6, 7 may start a Java identifier
1, 2, 3, 5, 6, 7 may continue a Java identifier
7 may start a Unicode identifier
1, 3, 5, 7 may continue a Unicode identifier
1 is ignorable within an identifier
4 is Java whitespace
2 bits 0 this character has no numeric property
1 adding the digit offset to the character code and then
masking with 0x1F will produce the desired numeric value
2 this character has a "strange" numeric value
3 a Java supradecimal digit: adding the digit offset to the
character code, then masking with 0x1F, then adding 10
will produce the desired numeric value
5 bits digit offset
5 bits character type
The encoding of character properties is subject to change at any time.
*/
int getProperties(int ch) {
char offset = (char)ch;
int props = A[offset];
return props;
}
int getPropertiesEx(int ch) {
char offset = (char)ch;
int props = B[offset];
return props;
}
boolean isOtherLowercase(int ch) {
int props = getPropertiesEx(ch);
return (props & 0x0001) != 0;
}
boolean isOtherUppercase(int ch) {
int props = getPropertiesEx(ch);
return (props & 0x0002) != 0;
}
boolean isOtherAlphabetic(int ch) {
int props = getPropertiesEx(ch);
return (props & 0x0004) != 0;
}
boolean isIdeographic(int ch) {
int props = getPropertiesEx(ch);
return (props & 0x0010) != 0;
}
int getType(int ch) {
int props = getProperties(ch);
return (props & 0x1F);
}
boolean isJavaIdentifierStart(int ch) {
int props = getProperties(ch);
return ((props & 0x00007000) >= 0x00005000);
}
boolean isJavaIdentifierPart(int ch) {
int props = getProperties(ch);
return ((props & 0x00003000) != 0);
}
boolean isUnicodeIdentifierStart(int ch) {
int props = getProperties(ch);
return ((props & 0x00007000) == 0x00007000);
}
boolean isUnicodeIdentifierPart(int ch) {
int props = getProperties(ch);
return ((props & 0x00001000) != 0);
}
boolean isIdentifierIgnorable(int ch) {
int props = getProperties(ch);
return ((props & 0x00007000) == 0x00001000);
}
int toLowerCase(int ch) {
int mapChar = ch;
int val = getProperties(ch);
if (((val & 0x00020000) != 0) &&
((val & 0x07FC0000) != 0x07FC0000)) {
int offset = val << 5 >> (5+18);
mapChar = ch + offset;
}
return mapChar;
}
int toUpperCase(int ch) {
int mapChar = ch;
int val = getProperties(ch);
if ((val & 0x00010000) != 0) {
if ((val & 0x07FC0000) != 0x07FC0000) {
int offset = val << 5 >> (5+18);
mapChar = ch - offset;
} else if (ch == 0x00B5) {
mapChar = 0x039C;
}
}
return mapChar;
}
int toTitleCase(int ch) {
return toUpperCase(ch);
}
int digit(int ch, int radix) {
int value = -1;
if (radix >= Character.MIN_RADIX && radix <= Character.MAX_RADIX) {
int val = getProperties(ch);
int kind = val & 0x1F;
if (kind == Character.DECIMAL_DIGIT_NUMBER) {
value = ch + ((val & 0x3E0) >> 5) & 0x1F;
}
else if ((val & 0xC00) == 0x00000C00) {
// Java supradecimal digit
value = (ch + ((val & 0x3E0) >> 5) & 0x1F) + 10;
}
}
return (value < radix) ? value : -1;
}
int getNumericValue(int ch) {
int val = getProperties(ch);
int retval = -1;
switch (val & 0xC00) {
default: // cannot occur
case (0x00000000): // not numeric
retval = -1;
break;
case (0x00000400): // simple numeric
retval = ch + ((val & 0x3E0) >> 5) & 0x1F;
break;
case (0x00000800) : // "strange" numeric
retval = -2;
break;
case (0x00000C00): // Java supradecimal
retval = (ch + ((val & 0x3E0) >> 5) & 0x1F) + 10;
break;
}
return retval;
}
boolean isWhitespace(int ch) {
int props = getProperties(ch);
return ((props & 0x00007000) == 0x00004000);
}
byte getDirectionality(int ch) {
int val = getProperties(ch);
byte directionality = (byte)((val & 0x78000000) >> 27);
if (directionality == 0xF ) {
directionality = -1;
}
return directionality;
}
boolean isMirrored(int ch) {
int props = getProperties(ch);
return ((props & 0x80000000) != 0);
}
int toUpperCaseEx(int ch) {
int mapChar = ch;
int val = getProperties(ch);
if ((val & 0x00010000) != 0) {
if ((val & 0x07FC0000) != 0x07FC0000) {
int offset = val << 5 >> (5+18);
mapChar = ch - offset;
}
else {
switch(ch) {
// map overflow characters
case 0x00B5 : mapChar = 0x039C; break;
default : mapChar = Character.ERROR; break;
}
}
}
return mapChar;
}
static char[] sharpsMap = new char[] {'S', 'S'};
char[] toUpperCaseCharArray(int ch) {
char[] upperMap = {(char)ch};
if (ch == 0x00DF) {
upperMap = sharpsMap;
}
return upperMap;
}
static final CharacterDataLatin1 instance = new CharacterDataLatin1();
private CharacterDataLatin1() {};
// The following tables and code generated using:
// java GenerateCharacter -template c:/re/workspace/8-2-build-windows-amd64-cygwin/jdk8u74/6087/jdk/make/data/characterdata/CharacterDataLatin1.java.template -spec c:/re/workspace/8-2-build-windows-amd64-cygwin/jdk8u74/6087/jdk/make/data/unicodedata/UnicodeData.txt -specialcasing c:/re/workspace/8-2-build-windows-amd64-cygwin/jdk8u74/6087/jdk/make/data/unicodedata/SpecialCasing.txt -proplist c:/re/workspace/8-2-build-windows-amd64-cygwin/jdk8u74/6087/jdk/make/data/unicodedata/PropList.txt -o c:/re/workspace/8-2-build-windows-amd64-cygwin/jdk8u74/6087/build/windows-amd64/jdk/gensrc/java/lang/CharacterDataLatin1.java -string -usecharforbyte -latin1 8
// The A table has 256 entries for a total of 1024 bytes.
static final int A[] = new int[256];
static final String A_DATA =
"\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800"+
"\u100F\u4800\u100F\u4800\u100F\u5800\u400F\u5000\u400F\u5800\u400F\u6000\u400F"+
"\u5000\u400F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800"+
"\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F"+
"\u4800\u100F\u4800\u100F\u5000\u400F\u5000\u400F\u5000\u400F\u5800\u400F\u6000"+
"\u400C\u6800\030\u6800\030\u2800\030\u2800\u601A\u2800\030\u6800\030\u6800"+
"\030\uE800\025\uE800\026\u6800\030\u2000\031\u3800\030\u2000\024\u3800\030"+
"\u3800\030\u1800\u3609\u1800\u3609\u1800\u3609\u1800\u3609\u1800\u3609\u1800"+
"\u3609\u1800\u3609\u1800\u3609\u1800\u3609\u1800\u3609\u3800\030\u6800\030"+
"\uE800\031\u6800\031\uE800\031\u6800\030\u6800\030\202\u7FE1\202\u7FE1\202"+
"\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1"+
"\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202"+
"\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1\202\u7FE1"+
"\202\u7FE1\uE800\025\u6800\030\uE800\026\u6800\033\u6800\u5017\u6800\033\201"+
"\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2"+
"\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201"+
"\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2\201\u7FE2"+
"\201\u7FE2\201\u7FE2\201\u7FE2\uE800\025\u6800\031\uE800\026\u6800\031\u4800"+
"\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u5000\u100F"+
"\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800"+
"\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F"+
"\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800"+
"\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F\u4800\u100F"+
"\u3800\014\u6800\030\u2800\u601A\u2800\u601A\u2800\u601A\u2800\u601A\u6800"+
"\034\u6800\030\u6800\033\u6800\034\000\u7005\uE800\035\u6800\031\u4800\u1010"+
"\u6800\034\u6800\033\u2800\034\u2800\031\u1800\u060B\u1800\u060B\u6800\033"+
"\u07FD\u7002\u6800\030\u6800\030\u6800\033\u1800\u050B\000\u7005\uE800\036"+
"\u6800\u080B\u6800\u080B\u6800\u080B\u6800\030\202\u7001\202\u7001\202\u7001"+
"\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202"+
"\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001"+
"\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\u6800\031\202\u7001\202"+
"\u7001\202\u7001\202\u7001\202\u7001\202\u7001\202\u7001\u07FD\u7002\201\u7002"+
"\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201"+
"\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002"+
"\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\u6800"+
"\031\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002\201\u7002"+
"\u061D\u7002";
// The B table has 256 entries for a total of 512 bytes.
static final char B[] = (
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"+
"\000\000\000\000\000\000\000\000\000").toCharArray();
// In all, the character property tables require 1024 bytes.
static {
{ // THIS CODE WAS AUTOMATICALLY CREATED BY GenerateCharacter:
char[] data = A_DATA.toCharArray();
assert (data.length == (256 * 2));
int i = 0, j = 0;
while (i < (256 * 2)) {
int entry = data[i++] << 16;
A[j++] = entry | data[i++];
}
}
}
public static void main(String[] args) {
System.out.println(A[49]);
}
}
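To check the bit layout by hand, the sketch below decodes two property words copied from A_DATA above: entry 49 ('1', `\u1800\u3609`) and entry 65 ('A', `\202\u7FE1`). The class and method names are made up for this post; the masks and shifts are the same ones that appear in CharacterDataLatin1.

```java
class PropertyWordSketch {
    static final int PROPS_DIGIT_ONE = 0x18003609; // A[49], the entry for '1'
    static final int PROPS_UPPER_A   = 0x00827FE1; // A[65], the entry for 'A'

    static boolean mirrored(int props)   { return (props & 0x80000000) != 0; }
    static int directionality(int props) { return (props & 0x78000000) >> 27; }
    // Shifting left by 5 puts the 9-bit field at the top; the arithmetic right shift sign-extends it.
    static int caseOffset(int props)     { return props << 5 >> (5 + 18); }
    static boolean addToLower(int props) { return (props & 0x00020000) != 0; }
    static boolean subToUpper(int props) { return (props & 0x00010000) != 0; }
    static int identifierKind(int props) { return (props & 0x00007000) >> 12; }
    static int numericKind(int props)    { return (props & 0x00000C00) >> 10; }
    static int digitOffset(int props)    { return (props & 0x000003E0) >> 5; }
    static int type(int props)           { return props & 0x1F; }

    // Numeric value for kinds 1 (simple) and 3 (supradecimal), following digit() above.
    static int digitValue(int ch, int props) {
        int base = (ch + digitOffset(props)) & 0x1F;
        switch (numericKind(props)) {
            case 1:  return base;
            case 3:  return base + 10;
            default: return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(directionality(PROPS_DIGIT_ONE)); // 3
        System.out.println(digitOffset(PROPS_DIGIT_ONE));    // 16
        System.out.println(type(PROPS_DIGIT_ONE));           // 9
        System.out.println(digitValue('1', PROPS_DIGIT_ONE)); // 1
        System.out.println(caseOffset(PROPS_UPPER_A));       // 32
        System.out.println((char) ('A' + caseOffset(PROPS_UPPER_A))); // a
        System.out.println(digitValue('A', PROPS_UPPER_A));  // 10, 'A' as a digit in radix 11 and up
    }
}
```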