Unicode Regular Expressions (qbit)

Preface

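  • This article is compiled from Mastering Regular Expressions and Unicode Regular Expressions.
  • The examples default to Python 3 and use either the built-in re module or the third-party regex library. Note that the \p{...} syntax is supported by regex but not by re.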

Basic Unicode Property Categories

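\p{L}|\p{Letter} Letters
\p{M}|\p{Mark} Characters that cannot stand alone and must combine with a base character (accents, enclosing boxes, and so on)
\p{Z}|\p{Separator} Characters that separate things but have no visible form of their own (various kinds of whitespace)
\p{S}|\p{Symbol} Various graphic symbols (dingbats) and letterlike symbols
\p{N}|\p{Number} Any numeric character
\p{P}|\p{Punctuation} Punctuation characters
\p{C}|\p{Other} Everything else (rarely used for ordinary characters)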
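
The one-letter properties above correspond to the first letter of each character's Unicode General_Category. As a rough sketch (the helper name group_by_major_category is made up for illustration), the standard library's unicodedata module can show which bucket each character of a string falls into, without needing the regex library:

```python
import unicodedata
from collections import defaultdict

def group_by_major_category(text):
    """Group characters by the first letter of their General_Category (L, M, Z, S, N, P, C)."""
    groups = defaultdict(list)
    for ch in text:
        # unicodedata.category() returns a two-letter code such as 'Lu' or 'Zs';
        # its first letter is the major category that \p{L}, \p{Z}, ... refer to.
        groups[unicodedata.category(ch)[0]].append(ch)
    return dict(groups)

# One character from each major category: letter, combining mark, space,
# digit, currency symbol, punctuation, control character.
print(group_by_major_category('A\u0301 5$!\t'))
```

A character listed under 'L' here is one that \p{L} would be expected to match, and likewise for the other six letters.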

Basic Unicode Subproperties

Letter

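\p{Ll}|\p{Lowercase_Letter} Lowercase letters
\p{Lu}|\p{Uppercase_Letter} Uppercase letters
\p{Lt}|\p{Titlecase_Letter} Titlecase letters, which appear at the start of a word
\p{L&} Shorthand for the union of \p{Ll}, \p{Lu}, and \p{Lt}
\p{Lm}|\p{Modifier_Letter} A small set of letter-like characters with special uses
\p{Lo}|\p{Other_Letter} Letters that have no case and are not modifiers,
         including letters from Hebrew, Arabic, Bengali, Thai, and Japanese.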

Mark

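\p{Mn}|\p{Non_Spacing_Mark} "Characters" that modify other characters,
         such as accents, umlauts, certain "vowel signs", and tone marks.
\p{Mc}|\p{Spacing_Combining_Mark} Modifying characters that take up space of their own
        (most "vowel signs" in the languages that have them, including Bengali, Gujarati,
        Tamil, Telugu, Kannada, Malayalam, Sinhala, Myanmar, and Khmer).
\p{Me}|\p{Enclosing_Mark} Marks that can enclose other characters, such as circles, squares, and diamonds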
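
A common use of the Mn subproperty is stripping accents. The sketch below assumes the text is first decomposed with NFD normalization so that accents become separate Mn characters; the function name strip_nonspacing_marks is illustrative only, and the code uses the standard library's unicodedata rather than a \p{Mn} pattern:

```python
import unicodedata

def strip_nonspacing_marks(text):
    """Remove non-spacing marks (General_Category Mn) after NFD decomposition."""
    # NFD splits a precomposed letter such as U+00E9 into 'e' + U+0301.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_nonspacing_marks('caf\u00e9 na\u00efve'))  # cafe naive
```

With the regex library, the filtering step would presumably be a substitution of \p{Mn} with the empty string on the decomposed text.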

Separator

\p{Zs}|\p{Space_Separator} Various kinds of spacing characters, such as the space and the no-break space,
         as well as the various fixed-width spaces.
\p{Zl}|\p{Line_Separator} The LINE SEPARATOR character (U+2028)
\p{Zp}|\p{Paragraph_Separator} The PARAGRAPH SEPARATOR character (U+2029)
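
As an illustration of the Zs subproperty, the following sketch replaces every space separator with a plain ASCII space. The helper name normalize_spaces is an assumption made for this example, and the check is done with unicodedata instead of a \p{Zs} pattern so that it runs on the standard library alone:

```python
import unicodedata

def normalize_spaces(text):
    """Replace every Zs character (no-break space, ideographic space, ...) with ' '."""
    return ''.join(' ' if unicodedata.category(ch) == 'Zs' else ch for ch in text)

# U+00A0 NO-BREAK SPACE, U+3000 IDEOGRAPHIC SPACE, U+2003 EM SPACE are all Zs.
print(normalize_spaces('a\u00a0b\u3000c\u2003d'))  # a b c d
```

Line and paragraph separators (Zl, Zp) and control characters such as TAB (Cc) are left untouched, because they belong to other subcategories.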

Symbol

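\p{Sm}|\p{Math_Symbol} Math symbols: +, ÷, ≤, ...
\p{Sc}|\p{Currency_Symbol} Currency symbols: $, ¥, ...
\p{Sk}|\p{Modifier_Symbol} Mostly standalone versions of combining characters,
         which are full-fledged characters in their own right.
\p{So}|\p{Other_Symbol} Various dingbats, box-drawing symbols, Braille patterns,
         non-letter Chinese characters, and so on.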

Number

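\p{Nd}|\p{Decimal_Digit_Number} The digits 0 through 9 in various scripts (not including Chinese, Japanese, and Korean)
\p{Nl}|\p{Letter_Number} Mostly Roman numerals.
\p{No}|\p{Other_Number} Numbers used as superscripts or symbols, and
         characters representing numbers that are not digits (not including Chinese, Japanese, and Korean).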
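
To make the difference between the three Number subproperties concrete, here is a small sketch that reports the category and numeric value of a few characters. The helper describe_number is hypothetical; unicodedata.category and unicodedata.numeric are standard library functions:

```python
import unicodedata

def describe_number(ch):
    """Return (General_Category, numeric value) for a single numeric character."""
    return unicodedata.category(ch), unicodedata.numeric(ch)

# ASCII 7, ARABIC-INDIC DIGIT SEVEN, ROMAN NUMERAL EIGHT, SUPERSCRIPT TWO
for ch in ['7', '\u0667', '\u2167', '\u00b2']:
    print(f'U+{ord(ch):04X}', *describe_number(ch))

# int() accepts decimal digits (Nd) from any script, but not Nl or No characters.
print(int('\u0661\u0662\u0663'))  # 123
```

The first two characters are Nd, the Roman numeral is Nl, and the superscript is No, even though all four carry a numeric value.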

Punctuation

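\p{Pd}|\p{Dash_Punctuation} Hyphens and dashes of all sorts
\p{Ps}|\p{Open_Punctuation} Characters like ( and 《
\p{Pe}|\p{Close_Punctuation} Characters like ) and 》
\p{Pi}|\p{Initial_Punctuation} Characters like “ and «
\p{Pf}|\p{Final_Punctuation} Characters like ” and »
\p{Pc}|\p{Connector_Punctuation} A few punctuation characters with special linguistic meaning, such as the underscore
\p{Po}|\p{Other_Punctuation} Catch-all for all other punctuation: !, &, ., :, and so on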
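
The punctuation subcategories are easy to confuse with look-alike symbols, so it can help to check a character before relying on a property. The sketch below (the function name punctuation_subcategories is invented for this example) looks up the two-letter category of several characters with unicodedata:

```python
import unicodedata

def punctuation_subcategories(chars):
    """Map each character to its two-letter General_Category code."""
    return {ch: unicodedata.category(ch) for ch in chars}

# Hyphen, parentheses, guillemets, underscore, exclamation mark, less-than sign.
for ch, cat in punctuation_subcategories('-()\u00ab\u00bb_!<').items():
    print(repr(ch), cat)
```

Note that the ASCII less-than sign is reported as Sm (Math_Symbol, a Symbol subcategory), not as Pi, so \p{P} would not be expected to match it.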

Other

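\p{Cc}|\p{Control} The ASCII and Latin-1 control characters (TAB, LF, CR, ...)
\p{Cf}|\p{Format} Invisible characters that indicate formatting
\p{Co}|\p{Private_Use} Code points allocated for private use (a company logo, for example)
\p{Cs}|\p{Surrogate} One half of a surrogate pair in UTF-16 encoding
\p{Cn}|\p{Unassigned} Code points that currently have no character assigned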

Unicode Scripts

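  • Mainly used to match text written in a particular script (writing system)
  • Example: matching Han characters
>>> import regex
>>> regex.findall(r'\p{Han}', '孔子/现代价值/Theory of "Knowing"')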
['孔', '子', '现', '代', '价', '值']
  • List
\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}

Unicode Blocks

  • Mapping from regex block names to Unicode code point ranges
  • List
\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InIPA_Extensions}: U+0250–U+02AF
\p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
\p{InCombining_Diacritical_Marks}: U+0300–U+036F
\p{InGreek_and_Coptic}: U+0370–U+03FF
\p{InCyrillic}: U+0400–U+04FF
\p{InCyrillic_Supplementary}: U+0500–U+052F
\p{InArmenian}: U+0530–U+058F
\p{InHebrew}: U+0590–U+05FF
\p{InArabic}: U+0600–U+06FF
\p{InSyriac}: U+0700–U+074F
\p{InThaana}: U+0780–U+07BF
\p{InDevanagari}: U+0900–U+097F
\p{InBengali}: U+0980–U+09FF
\p{InGurmukhi}: U+0A00–U+0A7F
\p{InGujarati}: U+0A80–U+0AFF
\p{InOriya}: U+0B00–U+0B7F
\p{InTamil}: U+0B80–U+0BFF
\p{InTelugu}: U+0C00–U+0C7F
\p{InKannada}: U+0C80–U+0CFF
\p{InMalayalam}: U+0D00–U+0D7F
\p{InSinhala}: U+0D80–U+0DFF
\p{InThai}: U+0E00–U+0E7F
\p{InLao}: U+0E80–U+0EFF
\p{InTibetan}: U+0F00–U+0FFF
\p{InMyanmar}: U+1000–U+109F
\p{InGeorgian}: U+10A0–U+10FF
\p{InHangul_Jamo}: U+1100–U+11FF
\p{InEthiopic}: U+1200–U+137F
\p{InCherokee}: U+13A0–U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
\p{InOgham}: U+1680–U+169F
\p{InRunic}: U+16A0–U+16FF
\p{InTagalog}: U+1700–U+171F
\p{InHanunoo}: U+1720–U+173F
\p{InBuhid}: U+1740–U+175F
\p{InTagbanwa}: U+1760–U+177F
\p{InKhmer}: U+1780–U+17FF
\p{InMongolian}: U+1800–U+18AF
\p{InLimbu}: U+1900–U+194F
\p{InTai_Le}: U+1950–U+197F
\p{InKhmer_Symbols}: U+19E0–U+19FF
\p{InPhonetic_Extensions}: U+1D00–U+1D7F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF
\p{InGreek_Extended}: U+1F00–U+1FFF
\p{InGeneral_Punctuation}: U+2000–U+206F
\p{InSuperscripts_and_Subscripts}: U+2070–U+209F
\p{InCurrency_Symbols}: U+20A0–U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
\p{InLetterlike_Symbols}: U+2100–U+214F
\p{InNumber_Forms}: U+2150–U+218F
\p{InArrows}: U+2190–U+21FF
\p{InMathematical_Operators}: U+2200–U+22FF
\p{InMiscellaneous_Technical}: U+2300–U+23FF
\p{InControl_Pictures}: U+2400–U+243F
\p{InOptical_Character_Recognition}: U+2440–U+245F
\p{InEnclosed_Alphanumerics}: U+2460–U+24FF
\p{InBox_Drawing}: U+2500–U+257F
\p{InBlock_Elements}: U+2580–U+259F
\p{InGeometric_Shapes}: U+25A0–U+25FF
\p{InMiscellaneous_Symbols}: U+2600–U+26FF
\p{InDingbats}: U+2700–U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
\p{InSupplemental_Arrows-A}: U+27F0–U+27FF
\p{InBraille_Patterns}: U+2800–U+28FF
\p{InSupplemental_Arrows-B}: U+2900–U+297F
\p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
\p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
\p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
\p{InKangxi_Radicals}: U+2F00–U+2FDF
\p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
\p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
\p{InHiragana}: U+3040–U+309F
\p{InKatakana}: U+30A0–U+30FF
\p{InBopomofo}: U+3100–U+312F
\p{InHangul_Compatibility_Jamo}: U+3130–U+318F
\p{InKanbun}: U+3190–U+319F
\p{InBopomofo_Extended}: U+31A0–U+31BF
\p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
\p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
\p{InCJK_Compatibility}: U+3300–U+33FF
\p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
\p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
\p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
\p{InYi_Syllables}: U+A000–U+A48F
\p{InYi_Radicals}: U+A490–U+A4CF
\p{InHangul_Syllables}: U+AC00–U+D7AF
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
\p{InPrivate_Use_Area}: U+E000–U+F8FF
\p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
\p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
\p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
\p{InVariation_Selectors}: U+FE00–U+FE0F
\p{InCombining_Half_Marks}: U+FE20–U+FE2F
\p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
\p{InSmall_Form_Variants}: U+FE50–U+FE6F
\p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
\p{InSpecials}: U+FFF0–U+FFFF
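
Because each block in the list above is a single contiguous code point range, a block can be approximated with an explicit character class even in the built-in re module, which lacks \p{In...}. The following is a sketch under that assumption; the constant name CJK_UNIFIED is illustrative, and the range is the one listed for \p{InCJK_Unified_Ideographs}:

```python
import re

# Explicit range standing in for \p{InCJK_Unified_Ideographs} (U+4E00–U+9FFF).
CJK_UNIFIED = re.compile(r'[\u4e00-\u9fff]')

print(CJK_UNIFIED.findall('孔子/现代价值/Theory of "Knowing"'))
# ['孔', '子', '现', '代', '价', '值']
```

This covers one block only. A script such as \p{Han} spans several blocks, so a single range is not a substitute for a script property.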

Unicode Code Charts

Example

  • Text filtering: strip punctuation and other special characters, keeping only letters
>>> import regex
>>> regex.sub(r'[^\p{L}]', '', '1孔子/现代价值/Theory of "Knowing')
'孔子现代价值TheoryofKnowing'
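
Where the third-party regex library is not available, roughly the same filter can be sketched with the standard library alone. The helper keep_letters is a made-up name, and it tests the General_Category directly instead of using a [^\p{L}] pattern:

```python
import unicodedata

def keep_letters(text):
    """Keep only characters whose General_Category starts with 'L' (letters)."""
    return ''.join(ch for ch in text if unicodedata.category(ch).startswith('L'))

print(keep_letters('1孔子/现代价值/Theory of "Knowing'))
# 孔子现代价值TheoryofKnowing
```

The digit, slashes, spaces, and quotation mark are dropped because their categories begin with N, P, or Z.
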
This article comes from qbit snap
