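A language tag is composed of a sequence of one or more “subtags”. Each subtag is a sequence of alphanumeric characters, and the subtags within a tag are separated by a hyphen ("-", [Unicode] U+002D). There are several kinds of subtag, distinguished by their length, their position in the tag, and their content.
The syntax of a language tag is defined in ABNF [RFC5234]. The full grammar comes first; after it I walk through what each rule means: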
Language-Tag = langtag ; normal language tags
/ privateuse ; private use tag
/ grandfathered ; grandfathered tags
langtag = language
["-" script]
["-" region]
*("-" variant)
*("-" extension)
["-" privateuse]
language = 2*3ALPHA ; shortest ISO 639 code
["-" extlang] ; sometimes followed by
; extended language subtags
/ 4ALPHA ; or reserved for future use
/ 5*8ALPHA ; or registered language subtag
extlang = 3ALPHA ; selected ISO 639 codes
*2("-" 3ALPHA) ; permanently reserved
script = 4ALPHA ; ISO 15924 code
region = 2ALPHA ; ISO 3166-1 code
/ 3DIGIT ; UN M.49 code
variant = 5*8alphanum ; registered variants
/ (DIGIT 3alphanum)
extension = singleton 1*("-" (2*8alphanum))
; Single alphanumerics
; "x" reserved for private use
singleton = DIGIT ; 0 - 9
/ %x41-57 ; A - W
/ %x59-5A ; Y - Z
/ %x61-77 ; a - w
/ %x79-7A ; y - z
privateuse = "x" 1*("-" (1*8alphanum))
grandfathered = irregular ; non-redundant tags registered
/ regular ; during the RFC 3066 era
irregular = "en-GB-oed" ; irregular tags do not match
/ "i-ami" ; the 'langtag' production and
/ "i-bnn" ; would not otherwise be
/ "i-default" ; considered 'well-formed'
/ "i-enochian" ; These tags are all valid,
/ "i-hak" ; but most are deprecated
/ "i-klingon" ; in favor of more modern
/ "i-lux" ; subtags or subtag
/ "i-mingo" ; combination
/ "i-navajo"
/ "i-pwn"
/ "i-tao"
/ "i-tay"
/ "i-tsu"
/ "sgn-BE-FR"
/ "sgn-BE-NL"
/ "sgn-CH-DE"
regular = "art-lojban" ; these tags match the 'langtag'
/ "cel-gaulish" ; production, but their subtags
/ "no-bok" ; are not extended language
/ "no-nyn" ; or variant subtags: their meaning
/ "zh-guoyu" ; is defined by their registration
/ "zh-hakka" ; and all of these are deprecated
/ "zh-min" ; in favor of a more modern
/ "zh-min-nan" ; subtag or sequence of subtags
/ "zh-xiang"
alphanum = (ALPHA / DIGIT) ; letters and numbers
Language-Tag
Language-Tag = langtag ; normal language tags
/ privateuse ; private use tag
/ grandfathered ; grandfathered tags
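A Language-Tag is exactly one of langtag, privateuse, or grandfathered. A langtag is a normal language tag, a privateuse is a tag for private use, and a grandfathered tag is one of a fixed list of legacy tags kept for backward compatibility (the list will never change).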
langtag
langtag = language
["-" script]
["-" region]
*("-" variant)
*("-" extension)
["-" privateuse]
A langtag consists of one language, followed by zero or one "-" script, zero or one "-" region, zero or more "-" variant, zero or more "-" extension, and zero or one "-" privateuse, in that order.
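The production above can be sketched as a single regular expression. This is a minimal illustration that checks well-formedness only, not whether the subtags are actually registered; the names LANGTAG_RE and parse_langtag are my own and are not defined by BCP 47.

```python
import re
from typing import Optional

# One alternative or optional group per line of the ABNF production.
# Tags are matched case-insensitively and restricted to ASCII.
LANGTAG_RE = re.compile(
    r"""
    (?P<language>
        [a-z]{2,3}(?:-[a-z]{3}){0,3}    # 2*3ALPHA plus up to three extlang subtags
      | [a-z]{4}                        # 4ALPHA, reserved for future use
      | [a-z]{5,8}                      # 5*8ALPHA, registered language subtag
    )
    (?:-(?P<script>[a-z]{4}))?                                # ["-" script]
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                       # ["-" region]
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)    # *("-" variant)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)      # *("-" extension)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?                # ["-" privateuse]
    """,
    re.IGNORECASE | re.ASCII | re.VERBOSE,
)


def parse_langtag(tag: str) -> Optional[dict]:
    """Split a well-formed langtag into its parts, or return None."""
    m = LANGTAG_RE.fullmatch(tag)
    if m is None:
        return None
    return {
        "language": m["language"],
        "script": m["script"],
        "region": m["region"],
        # "-rozaj-biske" -> ["rozaj", "biske"]; "" -> []
        "variants": m["variants"].split("-")[1:],
        # Raw extension text without the leading hyphen, e.g. "u-co-phonebk".
        "extensions": m["extensions"][1:],
        "privateuse": m["privateuse"],
    }


print(parse_langtag("zh-Hant-TW"))
print(parse_langtag("sl-rozaj-biske"))
print(parse_langtag("en-US-u-co-phonebk-x-private"))
```

Because every optional part starts with a hyphen and the whole tag must match, the regex engine backtracks until each subtag lands in the right slot: in "zh-Hant-TW", "Hant" cannot be an extlang (3 letters) or a region, so it ends up as the script.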
language
language = 2*3ALPHA ; shortest ISO 639 code
["-" extlang] ; sometimes followed by
; extended language subtags
/ 4ALPHA ; or reserved for future use
/ 5*8ALPHA ; or registered language subtag
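A language takes one of three forms: 2*3ALPHA ["-" extlang], 4ALPHA, or 5*8ALPHA. The first is the shortest ISO 639 code (two or three letters), sometimes followed by extended language subtags; the second (four letters) is reserved for future use; the third (five to eight letters) is a registered language subtag. An extlang is itself one three-letter subtag (a selected ISO 639 code), optionally followed by up to two more three-letter subtags that are permanently reserved.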
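To make the three alternatives concrete, here is a small sketch that reports which one a given language portion matches. The function name classify_language and its return labels are hypothetical; only the patterns come from the grammar.

```python
import re

_FLAGS = re.IGNORECASE | re.ASCII

# The three alternatives of the `language` rule.
_ISO639 = re.compile(r"[a-z]{2,3}(?:-[a-z]{3}){0,3}", _FLAGS)  # 2*3ALPHA ["-" extlang]
_RESERVED = re.compile(r"[a-z]{4}", _FLAGS)                    # 4ALPHA
_REGISTERED = re.compile(r"[a-z]{5,8}", _FLAGS)                # 5*8ALPHA


def classify_language(text: str) -> str:
    """Return which alternative of `language` the text matches."""
    if _ISO639.fullmatch(text):
        return "iso639"
    if _RESERVED.fullmatch(text):
        return "reserved"
    if _REGISTERED.fullmatch(text):
        return "registered"
    return "invalid"


for sample in ["zh", "zh-cmn", "abcd", "abcdef", "z", "zh-aaa-bbb-ccc-ddd"]:
    print(sample, "->", classify_language(sample))
```

The last sample is rejected because an extlang may contain at most three three-letter subtags.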
script
script = 4ALPHA ; ISO 15924 code
A script (writing system) is a sequence of exactly 4 letters whose value is one of the codes listed in ISO 15924.
region
region = 2ALPHA ; ISO 3166-1 code
/ 3DIGIT ; UN M.49 code
A region is either 2 letters (an ISO 3166-1 code) or 3 digits (a UN M.49 code).
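Since the two forms differ only in length and character class, telling them apart takes a few lines. The sketch below uses a hypothetical helper, region_kind, and checks shape only; it does not look the code up in ISO 3166-1 or UN M.49.

```python
from typing import Optional


def region_kind(subtag: str) -> Optional[str]:
    """Classify a region subtag by shape: 2ALPHA or 3DIGIT."""
    if not subtag.isascii():
        return None
    if len(subtag) == 2 and subtag.isalpha():
        return "iso3166-1"   # e.g. "CN", "us"
    if len(subtag) == 3 and subtag.isdigit():
        return "un-m49"      # e.g. "419"
    return None


print(region_kind("CN"))    # iso3166-1
print(region_kind("419"))   # un-m49
print(region_kind("USA"))   # None: three letters is not a region
```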
variant
variant = 5*8alphanum ; registered variants
/ (DIGIT 3alphanum)
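A variant is either 5 to 8 alphanumeric characters, or exactly 4 characters of which the first is a digit (one digit followed by 3 alphanumerics). All variants are registered.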
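The "four characters only if the first is a digit" rule is easy to misread, so here is a short shape check. VARIANT_RE and is_variant are illustrative names, and a True result says nothing about whether the variant is actually registered.

```python
import re

# 5*8alphanum, or a digit followed by exactly three alphanumerics.
VARIANT_RE = re.compile(
    r"(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})",
    re.IGNORECASE | re.ASCII,
)


def is_variant(subtag: str) -> bool:
    """True if the subtag has the shape of a variant subtag."""
    return VARIANT_RE.fullmatch(subtag) is not None


for sample in ["rozaj", "1996", "valencia", "abcd", "a996", "199"]:
    print(sample, is_variant(sample))
```

Here "abcd" and "a996" fail because a four-character variant must begin with a digit; four letters in that position would be read as a script instead.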
extension
extension = singleton 1*("-" (2*8alphanum))
; Single alphanumerics
; "x" reserved for private use
singleton = DIGIT ; 0 - 9
/ %x41-57 ; A - W
/ %x59-5A ; Y - Z
/ %x61-77 ; a - w
/ %x79-7A ; y - z
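An extension starts with a singleton, which is a single digit or letter other than "x" (that letter is reserved for privateuse), followed by one or more "-" (2*8alphanum) subtags, each 2 to 8 alphanumeric characters long.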
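As a rough sketch of this rule, the snippet below splits one extension into its singleton and the subtags that follow it. EXTENSION_RE and parse_extension are names I made up for the example.

```python
import re
from typing import List, Optional, Tuple

# singleton 1*("-" (2*8alphanum)); the singleton class skips "x" / "X".
EXTENSION_RE = re.compile(
    r"([0-9a-wy-z])((?:-[a-z0-9]{2,8})+)",
    re.IGNORECASE | re.ASCII,
)


def parse_extension(text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (singleton, subtags) for one extension, or None."""
    m = EXTENSION_RE.fullmatch(text)
    if m is None:
        return None
    singleton = m.group(1).lower()
    subtags = m.group(2)[1:].split("-")  # drop the leading hyphen first
    return singleton, subtags


print(parse_extension("u-co-phonebk"))  # ('u', ['co', 'phonebk'])
print(parse_extension("x-foo"))         # None: "x" is not a singleton
print(parse_extension("u-c"))           # None: subtags need 2 to 8 characters
```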
privateuse
privateuse = "x" 1*("-" (1*8alphanum))
A privateuse sequence begins with "x", followed by one or more "-" (1*8alphanum), where (1*8alphanum) is a sequence of 1 to 8 alphanumeric characters.
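Unlike extension subtags, private-use subtags may be a single character long. A minimal sketch of the rule, with the hypothetical names PRIVATEUSE_RE and parse_privateuse:

```python
import re
from typing import List, Optional

# "x" 1*("-" (1*8alphanum))
PRIVATEUSE_RE = re.compile(
    r"x((?:-[a-z0-9]{1,8})+)",
    re.IGNORECASE | re.ASCII,
)


def parse_privateuse(text: str) -> Optional[List[str]]:
    """Return the subtags after "x", or None if the text is not privateuse."""
    m = PRIVATEUSE_RE.fullmatch(text)
    if m is None:
        return None
    return m.group(1)[1:].split("-")


print(parse_privateuse("x-whatever"))     # ['whatever']
print(parse_privateuse("x-a-b1"))         # ['a', 'b1']
print(parse_privateuse("x"))              # None: at least one subtag is required
print(parse_privateuse("x-toolongsub"))   # None: subtag longer than 8 characters
```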
grandfathered
grandfathered = irregular ; non-redundant tags registered
/ regular ; during the RFC 3066 era
A grandfathered tag is either irregular or regular. Both kinds are non-redundant tags that were registered during the RFC 3066 era.
irregular
irregular = "en-GB-oed" ; irregular tags do not match
/ "i-ami" ; the 'langtag' production and
/ "i-bnn" ; would not otherwise be
/ "i-default" ; considered 'well-formed'
/ "i-enochian" ; These tags are all valid,
/ "i-hak" ; but most are deprecated
/ "i-klingon" ; in favor of more modern
/ "i-lux" ; subtags or subtag
/ "i-mingo" ; combination
/ "i-navajo"
/ "i-pwn"
/ "i-tao"
/ "i-tay"
/ "i-tsu"
/ "sgn-BE-FR"
/ "sgn-BE-NL"
/ "sgn-CH-DE"
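The irregular tags do not match the langtag production, so they would not be considered “well-formed” if the grammar did not list them explicitly. All of them are valid, but most are deprecated in favor of more modern subtags or subtag combinations.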
regular
regular = "art-lojban" ; these tags match the 'langtag'
/ "cel-gaulish" ; production, but their subtags
/ "no-bok" ; are not extended language
/ "no-nyn" ; or variant subtags: their meaning
/ "zh-guoyu" ; is defined by their registration
/ "zh-hakka" ; and all of these are deprecated
/ "zh-min" ; in favor of a more modern
/ "zh-min-nan" ; subtag or sequence of subtags
/ "zh-xiang"
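The regular tags do match the langtag production, but their subtags are not extended language subtags or variant subtags: their meaning is defined by their registration as whole tags. All of them are deprecated in favor of a more modern subtag or sequence of subtags.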
References
- BCP 47
- RFC 5234
- ISO 639
- ISO 15924
- ISO 3166-1
- UN M.49