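Preface
- This article is compiled from Mastering Regular Expressions (《精通正则表达式》) and Unicode Regular Expressions.
- The examples use Python 3 with the re module or the regex library. Note that the \p{...} syntax requires the third-party regex library; the standard re module does not support it.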
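Basic Unicode property categories
\p{L}|\p{Letter} Letters
\p{M}|\p{Mark} Characters that cannot stand alone and must appear with a base character (accent marks, enclosing boxes, and so on)
\p{Z}|\p{Separator} Characters that separate things but have no visible form of their own (various kinds of whitespace)
\p{S}|\p{Symbol} Various graphic symbols (Dingbats) and letterlike symbols
\p{N}|\p{Number} Any kind of numeric character
\p{P}|\p{Punctuation} Punctuation characters
\p{C}|\p{Other} Catch-all for everything else (rarely used for normal characters)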
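The one-letter classes above correspond to the first letter of each character's Unicode General_Category value. As a rough sketch, using only the standard library's unicodedata module (the helper name major_category is made up for illustration), one can check which class a given character falls into:

```python
import unicodedata


def major_category(ch: str) -> str:
    """Return the one-letter major class (L, M, Z, S, N, P or C) of one character."""
    # unicodedata.category() returns a two-letter code such as 'Lu' or 'Zs';
    # its first letter is the major class used by \p{L}, \p{Z}, and so on.
    return unicodedata.category(ch)[0]


# A letter, a combining accent, a space, a currency sign, a digit,
# a punctuation mark, and a control character.
samples = ["A", "\u0301", " ", "$", "7", "!", "\t"]
for ch in samples:
    print(repr(ch), unicodedata.category(ch), major_category(ch))
```

The Unicode version behind unicodedata depends on the Python build, so results for recently added characters may differ from those of the regex library.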
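Basic Unicode subproperties
Letter
\p{Ll}|\p{Lowercase_Letter} Lowercase letters
\p{Lu}|\p{Uppercase_Letter} Uppercase letters
\p{Lt}|\p{Titlecase_Letter} Titlecase letters, which appear at the start of a word
\p{L&} Shorthand for the union of \p{Ll}, \p{Lu}, and \p{Lt}
\p{Lm}|\p{Modifier_Letter} A small set of letter-like characters with special uses
\p{Lo}|\p{Other_Letter} Letters that have no case and are not modifiers,
including letters from Hebrew, Arabic, Bengali, Thai, and Japanese.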
Mark
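\p{Mn}|\p{Non_Spacing_Mark} "Characters" that modify other characters,
such as accents, umlauts, certain "vowel signs", and tone marks.
\p{Mc}|\p{Spacing_Combining_Mark} Modifying characters that take up space of their own
(mostly the "vowel signs" of the languages that have them, including Bengali, Gujarati,
Tamil, Telugu, Kannada, Malayalam, Sinhala, Myanmar, and Khmer).
\p{Me}|\p{Enclosing_Mark} Marks that can enclose other characters, such as circles, squares, and diamonds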
Separator
\p{Zs}|\p{Space_Separator} Various kinds of spacing characters, such as the normal space, the no-break space,
and various spaces of specific widths.
\p{Zl}|\p{Line_Separator} The LINE SEPARATOR character (U+2028)
\p{Zp}|\p{Paragraph_Separator} The PARAGRAPH SEPARATOR character (U+2029)
Symbol
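\p{Sc}|\p{Currency_Symbol} Currency symbols, such as $, ¥, and so on.
\p{Sk}|\p{Modifier_Symbol} Mostly stand-alone versions of combining characters,
which are full-fledged characters with a meaning of their own.
\p{So}|\p{Other_Symbol} Various Dingbats, box-drawing symbols, Braille patterns,
non-letter Chinese characters, and so on.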
Number
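\p{Nd}|\p{Decimal_Digit_Number} The digits 0 through 9 in various scripts (not including Chinese, Japanese, and Korean)
\p{Nl}|\p{Letter_Number} Mostly Roman numerals.
\p{No}|\p{Other_Number} Numbers used as superscripts or symbols,
and characters that represent numbers but are not digits (not including Chinese, Japanese, and Korean characters).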
Punctuation
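\p{Pd}|\p{Dash_Punctuation} Hyphens and dashes of all sorts
\p{Ps}|\p{Open_Punctuation} Opening characters such as ( and 《
\p{Pe}|\p{Close_Punctuation} Closing characters such as ) and 》
\p{Pi}|\p{Initial_Punctuation} Initial quotation characters such as “ and «
\p{Pf}|\p{Final_Punctuation} Final quotation characters such as ” and »
\p{Pc}|\p{Connector_Punctuation} A few punctuation marks with special linguistic meaning, such as the underscore
\p{Po}|\p{Other_Punctuation} Catch-all for all other punctuation: !, &, ., :, and so on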
Other
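\p{Cc}|\p{Control} Control characters in ASCII and Latin-1 (TAB, LF, CR, and so on)
\p{Cf}|\p{Format} Invisible characters that indicate formatting
\p{Co}|\p{Private_Use} Code points allocated for private use (for example, a company logo)
\p{Cs}|\p{Surrogate} One half of a surrogate pair in UTF-16 encoding
\p{Cn}|\p{Unassigned} Code points that currently have no character assigned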
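One common use of these subcategories is accent removal: after NFD decomposition, accents become separate \p{Mn} characters that can be filtered out. The following is a minimal sketch using only the standard library (the helper name strip_accents is an assumption for illustration). With the regex library, the filtering step could presumably be written as regex.sub(r'\p{Mn}', '', decomposed) instead.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove non-spacing marks (category Mn) after canonical decomposition."""
    # NFD splits a precomposed character such as 'é' into 'e' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character whose General_Category is not Mn.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("café naïve"))
```

The result is left in decomposed form; characters whose marks fall into other categories (Mc or Me) are kept unchanged.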
Unicode Scripts
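- Mainly used to match characters that belong to a particular writing system (script)
Example: matching Han characters
>>> import regex
>>> regex.findall(r'\p{Han}', '孔子/现代价值/Theory of "Knowing"')
['孔', '子', '现', '代', '价', '值']
List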
\p{Common} \p{Arabic} \p{Armenian} \p{Bengali} \p{Bopomofo} \p{Braille} \p{Buhid} \p{Canadian_Aboriginal} \p{Cherokee} \p{Cyrillic} \p{Devanagari} \p{Ethiopic} \p{Georgian} \p{Greek} \p{Gujarati} \p{Gurmukhi} \p{Han} \p{Hangul} \p{Hanunoo} \p{Hebrew} \p{Hiragana} \p{Inherited} \p{Kannada} \p{Katakana} \p{Khmer} \p{Lao} \p{Latin} \p{Limbu} \p{Malayalam} \p{Mongolian} \p{Myanmar} \p{Ogham} \p{Oriya} \p{Runic} \p{Sinhala} \p{Syriac} \p{Tagalog} \p{Tagbanwa} \p{TaiLe} \p{Tamil} \p{Telugu} \p{Thaana} \p{Thai} \p{Tibetan} \p{Yi}
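Script properties can also be used to split mixed text into runs of a single script. The sketch below assumes the third-party regex library is installed; the helper name runs_of and the sample string are made up for illustration.

```python
import regex  # third-party library: pip install regex

text = "漢字とカタカナ and Latin"


def runs_of(script: str, s: str) -> list[str]:
    """Return the maximal runs of characters that belong to the given script."""
    # Builds a pattern such as \p{Han}+ from the script name.
    return regex.findall(rf"\p{{{script}}}+", s)


for script in ("Han", "Hiragana", "Katakana", "Latin"):
    print(script, runs_of(script, text))
```

Spaces and most punctuation belong to the Common script, so they act as separators between runs here.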
Unicode Blocks
- Mapping between regex block names and Unicode code point ranges
List
\p{InBasic_Latin}: U+0000–U+007F \p{InLatin-1_Supplement}: U+0080–U+00FF \p{InLatin_Extended-A}: U+0100–U+017F \p{InLatin_Extended-B}: U+0180–U+024F \p{InIPA_Extensions}: U+0250–U+02AF \p{InSpacing_Modifier_Letters}: U+02B0–U+02FF \p{InCombining_Diacritical_Marks}: U+0300–U+036F \p{InGreek_and_Coptic}: U+0370–U+03FF \p{InCyrillic}: U+0400–U+04FF \p{InCyrillic_Supplementary}: U+0500–U+052F \p{InArmenian}: U+0530–U+058F \p{InHebrew}: U+0590–U+05FF \p{InArabic}: U+0600–U+06FF \p{InSyriac}: U+0700–U+074F \p{InThaana}: U+0780–U+07BF \p{InDevanagari}: U+0900–U+097F \p{InBengali}: U+0980–U+09FF \p{InGurmukhi}: U+0A00–U+0A7F \p{InGujarati}: U+0A80–U+0AFF \p{InOriya}: U+0B00–U+0B7F \p{InTamil}: U+0B80–U+0BFF \p{InTelugu}: U+0C00–U+0C7F \p{InKannada}: U+0C80–U+0CFF \p{InMalayalam}: U+0D00–U+0D7F \p{InSinhala}: U+0D80–U+0DFF \p{InThai}: U+0E00–U+0E7F \p{InLao}: U+0E80–U+0EFF \p{InTibetan}: U+0F00–U+0FFF \p{InMyanmar}: U+1000–U+109F \p{InGeorgian}: U+10A0–U+10FF \p{InHangul_Jamo}: U+1100–U+11FF \p{InEthiopic}: U+1200–U+137F \p{InCherokee}: U+13A0–U+13FF \p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F \p{InOgham}: U+1680–U+169F \p{InRunic}: U+16A0–U+16FF \p{InTagalog}: U+1700–U+171F \p{InHanunoo}: U+1720–U+173F \p{InBuhid}: U+1740–U+175F \p{InTagbanwa}: U+1760–U+177F \p{InKhmer}: U+1780–U+17FF \p{InMongolian}: U+1800–U+18AF \p{InLimbu}: U+1900–U+194F \p{InTai_Le}: U+1950–U+197F \p{InKhmer_Symbols}: U+19E0–U+19FF \p{InPhonetic_Extensions}: U+1D00–U+1D7F \p{InLatin_Extended_Additional}: U+1E00–U+1EFF \p{InGreek_Extended}: U+1F00–U+1FFF \p{InGeneral_Punctuation}: U+2000–U+206F \p{InSuperscripts_and_Subscripts}: U+2070–U+209F \p{InCurrency_Symbols}: U+20A0–U+20CF \p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF \p{InLetterlike_Symbols}: U+2100–U+214F \p{InNumber_Forms}: U+2150–U+218F \p{InArrows}: U+2190–U+21FF \p{InMathematical_Operators}: U+2200–U+22FF \p{InMiscellaneous_Technical}: U+2300–U+23FF \p{InControl_Pictures}: U+2400–U+243F 
\p{InOptical_Character_Recognition}: U+2440–U+245F \p{InEnclosed_Alphanumerics}: U+2460–U+24FF \p{InBox_Drawing}: U+2500–U+257F \p{InBlock_Elements}: U+2580–U+259F \p{InGeometric_Shapes}: U+25A0–U+25FF \p{InMiscellaneous_Symbols}: U+2600–U+26FF \p{InDingbats}: U+2700–U+27BF \p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF \p{InSupplemental_Arrows-A}: U+27F0–U+27FF \p{InBraille_Patterns}: U+2800–U+28FF \p{InSupplemental_Arrows-B}: U+2900–U+297F \p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF \p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF \p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF \p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF \p{InKangxi_Radicals}: U+2F00–U+2FDF \p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF \p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F \p{InHiragana}: U+3040–U+309F \p{InKatakana}: U+30A0–U+30FF \p{InBopomofo}: U+3100–U+312F \p{InHangul_Compatibility_Jamo}: U+3130–U+318F \p{InKanbun}: U+3190–U+319F \p{InBopomofo_Extended}: U+31A0–U+31BF \p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF \p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF \p{InCJK_Compatibility}: U+3300–U+33FF \p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF \p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF \p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF \p{InYi_Syllables}: U+A000–U+A48F \p{InYi_Radicals}: U+A490–U+A4CF \p{InHangul_Syllables}: U+AC00–U+D7AF \p{InHigh_Surrogates}: U+D800–U+DB7F \p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF \p{InLow_Surrogates}: U+DC00–U+DFFF \p{InPrivate_Use_Area}: U+E000–U+F8FF \p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF \p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F \p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF \p{InVariation_Selectors}: U+FE00–U+FE0F \p{InCombining_Half_Marks}: U+FE20–U+FE2F \p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F \p{InSmall_Form_Variants}: U+FE50–U+FE6F \p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF \p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF 
\p{InSpecials}: U+FFF0–U+FFFF
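Because each block is a contiguous code point range, a block can be approximated with an explicit range in a character class, which also works with the standard re module. The sketch below uses the U+4E00–U+9FFF range listed above for CJK_Unified_Ideographs; the constant name CJK_UNIFIED is an assumption for illustration.

```python
import re

# Range taken from the list above: \p{InCJK_Unified_Ideographs} is U+4E00–U+9FFF.
CJK_UNIFIED = re.compile(r"[\u4E00-\u9FFF]")

text = '孔子/现代价值/Theory of "Knowing"'
found = CJK_UNIFIED.findall(text)
print(found)
```

A block is not the same as a script: \p{Han} also covers ideographs in other blocks, such as CJK_Unified_Ideographs_Extension_A, which this single range does not include.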
Unicode code charts
Example
Text filtering: remove punctuation, digits, whitespace, and every other non-letter character
>>> regex.sub(r'[^\p{L}]', '', '1孔子/现代价值/Theory of "Knowing')
'孔子现代价值TheoryofKnowing'
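When the regex library is not available, a similar filter can be approximated with the standard library by checking each character's major category. This is only a sketch; the helper name keep_classes and its classes parameter are hypothetical.

```python
import unicodedata


def keep_classes(text: str, classes: str = "L") -> str:
    """Keep only characters whose major category letter appears in `classes`."""
    # category(ch)[0] is the major class, e.g. 'L' for letters, 'N' for numbers.
    return "".join(ch for ch in text if unicodedata.category(ch)[0] in classes)


sample = '1孔子/现代价值/Theory of "Knowing'
print(keep_classes(sample))        # letters only, like [^\p{L}] removal
print(keep_classes(sample, "LN"))  # letters and numbers
```

Passing "LN" keeps the leading digit as well, which roughly corresponds to the pattern [^\p{L}\p{N}] in the regex library.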
This article is from qbit snap