1 Overview
XML Schema mainly defines references and datatypes. A reference is an <element> or an <attribute>, and each refers to a datatype. A datatype may in turn contain references (<attribute> inside <complexType>, and <element> inside <complexContent>).
1.1 Difference from OO languages
XML Schema defines a type system much like an OO language does. In an OO language, however, a class definition is mixed with the code that initializes and operates on instances, whereas in XML the schema and the instance document are separate.
1.2 Inheritance
As in an OO language, datatypes can be derived from one another and are polymorphic: in an instance document you can use the “xsi:type” attribute to state that an element holds an instance of a derived (child) datatype.
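A minimal sketch of derivation plus “xsi:type”, shown as a schema followed by an instance fragment. The names Address, USAddress and shipTo are hypothetical and chosen only for illustration.

```xml
<!-- Schema: USAddress is derived from Address by extension -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Address">
    <xsd:sequence>
      <xsd:element name="street" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="USAddress">
    <xsd:complexContent>
      <xsd:extension base="Address">
        <xsd:sequence>
          <xsd:element name="zip" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <!-- Declared with the parent type -->
  <xsd:element name="shipTo" type="Address"/>
</xsd:schema>

<!-- Instance: xsi:type selects the child type for this occurrence -->
<shipTo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:type="USAddress">
  <street>1 Main St</street>
  <zip>12345</zip>
</shipTo>
```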
1.3 Substitution
A reference such as <element> is merely a named link to a datatype. Substitution (declared with the “substitutionGroup” attribute) lets one link be replaced by other links. Substitution can form a tree just as inheritance does: the “name” of the child link can be any valid name, and its “type” can be the parent link’s “type” or any datatype derived from it.
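A small sketch of a substitution group, again as a schema followed by an instance fragment. The element names (comment, shipComment, customerComment, order) are assumptions made for the example.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Head of the substitution group -->
  <xsd:element name="comment" type="xsd:string"/>
  <!-- Members: new names, same type as the head -->
  <xsd:element name="shipComment" type="xsd:string"
               substitutionGroup="comment"/>
  <xsd:element name="customerComment" type="xsd:string"
               substitutionGroup="comment"/>

  <xsd:element name="order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- The head or any member of its group may appear here -->
        <xsd:element ref="comment" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- Instance: members stand in for the head -->
<order>
  <shipComment>Leave at the door</shipComment>
  <customerComment>Gift wrap, please</customerComment>
</order>
```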
1.4 Global and local
Both references and datatypes can be global or local; only global ones can be referred to by “ref”, “substitutionGroup”, “type” and “xsi:type”.
Instead of using the “type” attribute to name a global datatype, you can define a local (anonymous) datatype in its place.
A local datatype can still use the “base” attribute of <extension> to reference a global datatype. The “substitutionGroup” attribute has no comparable use on a local <element>: a local <element> cannot be referenced from anywhere else, so substituting it would be meaningless, and the attribute is only allowed on global <element> declarations.
1.5 Final and block
The “final” attribute affects the schema itself, while the “block” attribute affects instance documents.
The “final” attribute of <element> prohibits substitution by elements whose datatype is derived in a given way (by restriction or by extension). Note that there is no way, at the schema level, to prohibit substitution by an element of the same datatype.
The “block” attribute of <element> additionally accepts a “substitution” value, which prohibits substitution in instance documents.
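The two attributes can be sketched side by side as follows. ItemType, item and note are hypothetical names.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="ItemType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- final="extension": an element whose type extends ItemType
       may not name "item" as its substitutionGroup (schema level) -->
  <xsd:element name="item" type="ItemType" final="extension"/>

  <!-- block="substitution": in an instance, members of the "note"
       group may not appear where "note" is expected -->
  <xsd:element name="note" type="xsd:string" block="substitution"/>
</xsd:schema>
```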
1.6 Abstract
The “abstract” attribute only affects instance documents. When used with <element>, it means that the element cannot appear in an instance document and can only serve as a substitution head. When used with <complexType>, it means that the type cannot be used directly in an instance document; use “xsi:type” to specify a child type instead.
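A sketch of both uses of “abstract”, with hypothetical names (shape, circle, Vehicle, Car, transport):

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Abstract element: only members of its group may appear -->
  <xsd:element name="shape" type="xsd:string" abstract="true"/>
  <xsd:element name="circle" type="xsd:string"
               substitutionGroup="shape"/>

  <!-- Abstract type: instances must pick a concrete child type -->
  <xsd:complexType name="Vehicle" abstract="true"/>
  <xsd:complexType name="Car">
    <xsd:complexContent>
      <xsd:extension base="Vehicle"/>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:element name="transport" type="Vehicle"/>
</xsd:schema>

<!-- Instance fragments -->
<circle>r=1</circle>
<transport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:type="Car"/>
```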
2 Built-in datatypes
Built-in datatypes comprise built-in primitive and built-in derived datatypes. Datatypes defined by a schema author are called user-derived datatypes.
Built-in datatypes live alongside the other XML Schema definitions in the namespace "http://www.w3.org/2001/XMLSchema"; they are also available on their own in the namespace "http://www.w3.org/2001/XMLSchema-datatypes".
2.1. String
A string can be classified by the following aspects:
§ Does it contain carriage return, line feed or tab characters (whitespace “preserve” or “replace”)? If not, it is a normalizedString.
§ Is it normalized and also free of leading or trailing spaces and of internal runs of two or more spaces (whitespace “collapse”)? If so, it is a token.
§ Does it contain only name characters (no whitespace)? If so, it may be a Name or an NMTOKEN.
§ Name characters fall into the following groups: letters, digits and other characters (. : _ -).
§ The first character narrows it further: a Name starts with a letter, _ or :; an NCName (non-colonized name) starts with a letter or _ and contains no colon at all; an NMTOKEN may start with any name character.
§ A string may carry external constraints: an ID must be unique, an IDREF must refer to an existing ID, an ENTITY must refer to an external unparsed entity, and a QName (qualified name) consists of an optional prefix and a local part separated by a colon, where both parts are NCNames and the prefix must be bound to a namespace.
2.2. Number
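Numbers fall into two groups: finite, with a fixed range (float, double, and [unsigned](long|int|short|byte)), and infinite, with no size limit (decimal, integer, positiveInteger, negativeInteger, nonPositiveInteger and nonNegativeInteger). Note that decimal can represent a decimal number of arbitrary precision.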
2.3. Binary
The base64Binary and hexBinary datatypes have the same value space (both are byte arrays) but different lexical spaces (one is Base64-encoded, the other hexadecimal-encoded).
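To illustrate, the same five bytes (the ASCII text "Hello") can be written in both lexical forms. The element names asHex and asBase64 are made up for this sketch.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="asHex" type="xsd:hexBinary"/>
  <xsd:element name="asBase64" type="xsd:base64Binary"/>
</xsd:schema>

<!-- Instance fragments: identical value, different lexical form -->
<asHex>48656C6C6F</asHex>
<asBase64>SGVsbG8=</asBase64>
```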
2.4. Date time
Among the time-related datatypes, dateTime, date, gYearMonth and gYear represent an exact period; duration represents the length of a period; time, gMonthDay, gDay and gMonth represent a period that recurs (time recurs every day).
3 Inheritance overview
The XML Schema type system:
Any Type (contains attribute?)
- Simple Type
- Complex Type (contains element?)
  - Simple Content
  - Complex Content (can be mixed with character?)
    - Element Only
    - Mixed
For each "√", unless otherwise noted, the base type must be of the same kind as the new type.
                                                  Create New   By List   By Union   By Restriction   By Extension
Simple Type                                                    √         √          √
Complex Type with Simple Content                                                    √                √ [1]
Complex Type with Complex Content (Element Only)  √ [2]                             √                √
Complex Type with Complex Content (Mixed)         √ [2]                             √                √
[1] can extend from a Simple Type
[2] no need to use <complexContent>; use <sequence>, <choice> or <all> directly
4 Inheritance for simple type
A datatype is composed of a value space, a lexical space and facets. The relationship between the value space and the lexical space is one-to-many: each value may have several lexical representations.
An atomic datatype is a simple datatype such as string or decimal, or a type derived from one of them by restriction; datatypes derived by list or union are therefore not atomic.
4.1. By restriction
A facet defines one aspect of a value space (possibly by way of the lexical space).
Fundamental facets describe read-only properties of a datatype: every datatype has them, but you cannot modify them. I believe they are mainly useful for describing the built-in datatypes so that they can be mapped to other languages. The fundamental facets are: equal, ordered, bounded, cardinality and numeric.
Restriction (constraining) facets are placed inside the <restriction> tag, and several can be used together. Every facet except pattern and enumeration has a “fixed” attribute, which states whether a derived datatype may override the facet. Every facet has a “value” attribute that holds its value. The only content a facet may have is an optional <annotation>. Some facets, such as pattern and enumeration, may appear more than once.
<restriction base="xxx">
<facet_xxx fixed="true|false" value="xxx"><annotation/></facet_xxx>
</restriction>
§ The whiteSpace, enumeration and pattern facets apply to almost all datatypes.
§ The maxExclusive, maxInclusive, minExclusive and minInclusive facets apply to ordered datatypes such as numbers and dates; totalDigits and fractionDigits apply to decimal and the types derived from it.
§ The length, maxLength and minLength facets apply to strings, lists and byte arrays.
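As a sketch of how these facets combine inside <restriction>, here are two user-derived types. The type names Percent and Sku and the chosen limits are assumptions for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Range facets on a number; fixed="true" stops types derived
       from Percent from changing the upper bound -->
  <xsd:simpleType name="Percent">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100" fixed="true"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Pattern facet on a string: three digits, a hyphen, two capitals -->
  <xsd:simpleType name="Sku">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{3}-[A-Z]{2}"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```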
4.2. By list
A value of a list datatype consists of space-separated values of the given itemType:
<simpleType name='sizes'>
<list itemType='decimal'/>
</simpleType>
<cerealSizes xsi:type='sizes'> 8 10.5 12 </cerealSizes>
The value of a list datatype is always split on whitespace first, so the following element contains 18 items, not 3:
<simpleType name='listOfString'>
<list itemType='string'/>
</simpleType>
<someElement xsi:type='listOfString'>
this is not list item 1
this is not list item 2
this is not list item 3
</someElement>
The pattern facet of a list datatype applies to the whole list, not to a single item.
4.3. By union
A union datatype is the union of its member types. The order in which the member types are defined is significant (a value is validated against the first member type that accepts it) and can be overridden with “xsi:type”:
<xsd:element name='size'>
  <xsd:simpleType>
    <xsd:union>
      <xsd:simpleType>
        <xsd:restriction base='integer'/>
      </xsd:simpleType>
      <xsd:simpleType>
        <xsd:restriction base='string'/>
      </xsd:simpleType>
    </xsd:union>
  </xsd:simpleType>
</xsd:element>
<size>1</size>
<size>large</size>
<size xsi:type='xsd:string'>1</size>
5 Definition and inheritance for complex type
5.1. Choice, sequence and all
Compared to <sequence>, which fixes the order of its child elements, <all> allows them to appear in any order. <choice> allows exactly one of its children to appear.
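The three model groups can be compared in a short sketch. The type and element names are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- sequence: first, then last, in exactly this order -->
  <xsd:complexType name="NameInOrder">
    <xsd:sequence>
      <xsd:element name="first" type="xsd:string"/>
      <xsd:element name="last" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- all: both children, in either order -->
  <xsd:complexType name="NameAnyOrder">
    <xsd:all>
      <xsd:element name="first" type="xsd:string"/>
      <xsd:element name="last" type="xsd:string"/>
    </xsd:all>
  </xsd:complexType>

  <!-- choice: either email or phone, not both -->
  <xsd:complexType name="Contact">
    <xsd:choice>
      <xsd:element name="email" type="xsd:string"/>
      <xsd:element name="phone" type="xsd:string"/>
    </xsd:choice>
  </xsd:complexType>
</xsd:schema>
```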
5.2. The mixed content model
Both <complexType> and <complexContent> have a "mixed" attribute that specifies a mixed content model. Note that there is no way to specify a datatype for the text in "mixed" content; it is always treated as character data.
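A minimal sketch of mixed content, assuming a hypothetical para element that may contain b children:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="para">
    <!-- mixed="true": character data may appear between children -->
    <xsd:complexType mixed="true">
      <xsd:sequence>
        <xsd:element name="b" type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- Instance: text and elements interleaved -->
<para>Text may surround <b>child</b> elements.</para>
```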
5.3. Derived by restriction and extension
About restriction: if A is derived from B by restriction, then every instance of A is also valid for B. For a simple type only facets are allowed, so this holds automatically; for a complex type it means that restriction can narrow the valid usage of attributes and elements.
About extension: extension works more like a template. If A is derived from B by extension, the result is equivalent to copying B’s definition and then appending A’s definition. Extension can therefore add new attributes or elements to an existing type, and an instance of A may be invalid for B (and vice versa). Complex content is slightly special in that the element definitions of B and A are treated as if wrapped, in that order, in one enclosing <sequence>.
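The two derivation methods can be sketched against one base type. Base, Restricted and Extended are hypothetical names.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Base">
    <xsd:sequence>
      <xsd:element name="id" type="xsd:string"/>
      <xsd:element name="note" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Restriction: the content model is restated and narrowed
       (the optional note is dropped), so every instance is
       still valid against Base -->
  <xsd:complexType name="Restricted">
    <xsd:complexContent>
      <xsd:restriction base="Base">
        <xsd:sequence>
          <xsd:element name="id" type="xsd:string"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- Extension: effective content is Base's sequence followed
       by the sequence below -->
  <xsd:complexType name="Extended">
    <xsd:complexContent>
      <xsd:extension base="Base">
        <xsd:sequence>
          <xsd:element name="extra" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>
```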
5.4. The “default” and “fixed” attributes
Both <element> and <attribute> have “fixed” and “default” attributes, which are mutually exclusive. “default” supplies a value when none is given: for an attribute, when the attribute is absent; for an element, when the element is present but empty. “fixed” goes further: it supplies the value in the same way and also requires that any value that is given equals it.
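A sketch of both attributes on a hypothetical item element (the names and values are assumptions):

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="item">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="quantity" type="xsd:int" default="1"/>
      </xsd:sequence>
      <xsd:attribute name="currency" type="xsd:string" default="USD"/>
      <xsd:attribute name="version" type="xsd:string" fixed="1.0"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- Instance: quantity is empty and both attributes are absent, so a
     validating processor supplies quantity=1, currency=USD, version=1.0.
     Writing version="2.0" would be invalid because of "fixed". -->
<item><quantity/></item>
```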
5.5. The “form”, “elementFormDefault” and “attributeFormDefault” attributes
The “form” attribute only applies to local <element> and <attribute> declarations (they count as local even when nested inside a global <group> or <attributeGroup>). Because they are local, they are not automatically tied to a namespace, so you can choose the form in which they appear, that is, whether they must be qualified with a namespace.
§ If the value is “unqualified”, the name is in no namespace and must appear unqualified in the instance document.
§ If the value is “qualified”, the name is in the target namespace and must appear qualified.
The <schema> tag has “elementFormDefault” and “attributeFormDefault” attributes that set the default for the whole schema; both default to “unqualified”.
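The effect on an instance document can be sketched as follows. The namespace http://example.com/po and the element names are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.com/po"
            elementFormDefault="qualified">
  <xsd:element name="order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local element, qualified via elementFormDefault -->
        <xsd:element name="item" type="xsd:string"/>
        <!-- Local element, overriding the schema-wide default -->
        <xsd:element name="memo" type="xsd:string" form="unqualified"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- Instance: item carries the namespace, memo does not -->
<po:order xmlns:po="http://example.com/po">
  <po:item>pen</po:item>
  <memo>rush</memo>
</po:order>
```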
5.6. The “ref” attribute
Use the “ref” attribute to refer to a global <element>, <attribute>, <group> or <attributeGroup>. Note that many attributes (such as “name” and “type”) cannot be present when “ref” is used.
5.7. The “maxOccurs”, “minOccurs” and “use” attributes
The “maxOccurs” and “minOccurs” attributes apply to <element>, <any>, <choice>, <sequence>, <all> and <group>. Note that they can only be used in local scope. The <all> tag and all of its child tags must have “maxOccurs” set to 1 and “minOccurs” set to 1 or 0; I think this limitation exists to eliminate ambiguous definitions or to ease XML processor implementation.
The “use” attribute plays a similar role for <attribute>. Its values are:
§ “prohibited”: must not appear
§ “required”: must appear exactly once
§ “optional”: may appear 0 or 1 times
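A short sketch of occurrence constraints on local elements and attributes, using hypothetical names:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- One or more items (minOccurs defaults to 1) -->
        <xsd:element name="item" type="xsd:string" maxOccurs="unbounded"/>
        <!-- Zero or one note (maxOccurs defaults to 1) -->
        <xsd:element name="note" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
      <xsd:attribute name="id" type="xsd:ID" use="required"/>
      <xsd:attribute name="lang" type="xsd:language" use="optional"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```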
5.8. The “nillable” attribute
For <element>, if “nillable” is set to true, the "xsi:nil" attribute can be used in the instance document to state explicitly that the element’s content (whether simple or complex) is null. For example, an element of type "long" cannot simply be left empty to mean null; you must use "xsi:nil".
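A minimal sketch, assuming a hypothetical shipDate element:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="shipDate" type="xsd:date" nillable="true"/>
</xsd:schema>

<!-- Instance: an empty shipDate would not be a valid xsd:date,
     but xsi:nil marks the content as explicitly null -->
<shipDate xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:nil="true"/>
```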
5.9. Grouping
Grouping is another way to build a modular schema: use <group> for elements and <attributeGroup> for attributes. Note that <group> cannot contain <element> or <any> as a direct child; it must contain a <sequence>, <choice> or <all>.
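A sketch of both kinds of group and how they are pulled in with “ref”. All names (nameParts, audit, Person) are made up for the example.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Reusable element group: wraps a model group -->
  <xsd:group name="nameParts">
    <xsd:sequence>
      <xsd:element name="first" type="xsd:string"/>
      <xsd:element name="last" type="xsd:string"/>
    </xsd:sequence>
  </xsd:group>

  <!-- Reusable attribute group -->
  <xsd:attributeGroup name="audit">
    <xsd:attribute name="createdBy" type="xsd:string"/>
    <xsd:attribute name="createdOn" type="xsd:date"/>
  </xsd:attributeGroup>

  <!-- Both groups referenced from one type -->
  <xsd:complexType name="Person">
    <xsd:sequence>
      <xsd:group ref="nameParts"/>
      <xsd:element name="age" type="xsd:int" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attributeGroup ref="audit"/>
  </xsd:complexType>
</xsd:schema>
```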
6 Wildcard
6.1. Any and any attribute
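<any> applies to elements and <anyAttribute> applies to attributes; both have “namespace” and “processContents” attributes.
For “namespace”, here are the valid values:
§ “##any”, any namespace (the default).
§ “##other”, any namespace other than the target namespace.
§ A whitespace-separated list of namespace URIs, which may also include “##targetNamespace” and “##local” (no namespace).
For “processContents”, here are the valid values:
§ “strict”, obtain the schema (error if not found), validate the content (error if invalid).
§ “lax”, obtain the schema (skip validation if not found), validate the content (error if invalid).
§ “skip”, do nothing.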
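A sketch of an open content model that uses both wildcards. The target namespace http://example.com/doc and the type name Extensible are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.com/doc">
  <xsd:complexType name="Extensible">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <!-- Any number of elements from other namespaces; validated
           only if a schema for them can be found -->
      <xsd:any namespace="##other" processContents="lax"
               minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
    <!-- Any attribute at all, never validated -->
    <xsd:anyAttribute namespace="##any" processContents="skip"/>
  </xsd:complexType>
</xsd:schema>
```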
6.2. Any type and any simple type
anySimpleType is quite special: every primitive type is considered to be derived from it by restriction, yet these types are still called "primitive" (normally, primitive types do not derive from anything). anyType is the base type of anySimpleType and of all complex types.
You can use anyType and anySimpleType in attributes such as "type" or "base". anyType allows any content and any attributes, while anySimpleType allows any text content. You can also use “xsi:type” to specify the actual type being used.
These datatypes are called ur-types (the German prefix "ur-" means original or primordial).
7 Namespace
7.1. Several schema namespaces
http://www.w3.org/2001/XMLSchema
http://www.w3.org/2001/XMLSchema-datatypes
http://www.w3.org/2001/XMLSchema-instance
7.2. Difference between prefixed, default and blank namespaces
Elements with a prefixed namespace and elements in a default namespace both belong to some namespace, though they appear in different forms.
Elements with a blank namespace belong to no namespace, though they look the same as elements in a default namespace.
7.3. Namespaces and the “name” and “type” attributes
Values of the “name” attribute (e.g. on <element>, <attribute> or <simpleType>) are considered to be declared in the “targetNamespace” of the <schema> tag. Values of the “type” attribute (or “ref”, “base”…) carry no such assumption: you should always qualify them with a namespace (though with a default namespace they look the same as values of the “name” attribute).
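The asymmetry can be seen in a small sketch. The namespace http://example.com/po, the prefix po and the names Sku and sku are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:po="http://example.com/po"
            targetNamespace="http://example.com/po">
  <!-- "name" is unprefixed: Sku is declared in the target namespace -->
  <xsd:simpleType name="Sku">
    <xsd:restriction base="xsd:string"/>
  </xsd:simpleType>
  <!-- "type" and "base" are qualified names: po:Sku points back into
       the target namespace, xsd:string into the schema namespace -->
  <xsd:element name="sku" type="po:Sku"/>
</xsd:schema>
```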
8 Constraint
8.1. Unique, key and keyref
The <unique>, <key> and <keyref> tags specify constraints on instance documents. The <selector> and <field> tags are nested inside them to retrieve a list of values.
§ <unique>: values in the list must be either null or unique.
§ <key>: values in the list must not be null and must be unique.
§ <keyref>: values in the list must match values retrieved by the <unique> or <key> whose “name” is given by the “refer” attribute.
8.2. Selector and field
Both <selector> and <field> have an “xpath” attribute. For <selector>, it selects a list of elements; for <field>, it selects a value from each element in that list. Note that the <field> tag may appear several times to form a composite value (like a composite primary key in a relational database).
8.3. Xpath scope and name scope
For the “xpath” attribute of <selector>, the context node is the closest enclosing <element>, and the path is evaluated relative to that element, so <unique>, <key> or <keyref> must be placed inside the appropriate <element>. In fact, these tags are only allowed inside an <element> tag.
As for name scope, although <unique>, <key> and <keyref> are never declared in global scope (directly inside <schema>), their names are global and can be referred to from other tags.
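A sketch of <key> and <keyref> working together, in the spirit of a primary key and a foreign key. The element and constraint names (library, author, book, authorKey, bookAuthor) are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="library">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="author" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="id" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="book" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="authorId" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>

    <!-- Paths are relative to the enclosing library element -->
    <xsd:key name="authorKey">
      <xsd:selector xpath="author"/>
      <xsd:field xpath="@id"/>
    </xsd:key>
    <!-- Every book/@authorId must match some author/@id -->
    <xsd:keyref name="bookAuthor" refer="authorKey">
      <xsd:selector xpath="book"/>
      <xsd:field xpath="@authorId"/>
    </xsd:keyref>
  </xsd:element>
</xsd:schema>
```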
9 Import, include, redefine and schema
9.1. Import, include and redefine
The <import> tag brings a schema from a different namespace into the current schema. The <include> tag includes a schema from the same namespace. The <redefine> tag acts like <include>, and in addition lets you redefine the datatypes, groups and attribute groups defined in the included schema.
All of these tags have a “schemaLocation” attribute.
§ The <import> tag has an additional “namespace” attribute, which may be used to guess the schema location if the “schemaLocation” attribute is missing. The “namespace” attribute must exactly match the “targetNamespace” attribute of the imported schema.
§ The <include> tag must refer to a schema that has the same namespace as the current schema, or a blank namespace.
§ The <redefine> tag further requires that a redefined datatype always be derived from itself, while a redefined group or attribute group should contain exactly one reference to itself.
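The three tags can be sketched in one schema. The file names (order-types.xsd, address.xsd, item-types.xsd), the namespaces, and the types Item and Address are all hypothetical and assumed to be defined in those files.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:ord="http://example.com/order"
            xmlns:addr="http://example.com/address"
            targetNamespace="http://example.com/order">
  <!-- Same target namespace as this schema (or none) -->
  <xsd:include schemaLocation="order-types.xsd"/>

  <!-- Different namespace; must equal address.xsd's targetNamespace -->
  <xsd:import namespace="http://example.com/address"
              schemaLocation="address.xsd"/>

  <!-- Like include, but Item is redefined by deriving from itself -->
  <xsd:redefine schemaLocation="item-types.xsd">
    <xsd:complexType name="Item">
      <xsd:complexContent>
        <xsd:extension base="ord:Item">
          <xsd:attribute name="note" type="xsd:string"/>
        </xsd:extension>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:redefine>

  <!-- Uses a type from the imported namespace -->
  <xsd:element name="shipTo" type="addr:Address"/>
</xsd:schema>
```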
9.2. Schema
The <schema> tag is the root tag of a schema. It has attributes such as “targetNamespace”, “version” and “xml:lang” that describe the current schema, and attributes such as “finalDefault”, “blockDefault”, “elementFormDefault” and “attributeFormDefault” that provide default values.
10 Schema instance
These attributes are put inside namespace “http://www.w3.org/2001/XMLSchema-instance”
10.1. The “schemaLocation” and “noNamespaceSchemaLocation”
The “schemaLocation” attribute accepts a list of URI pairs; each pair consists of a namespace and a schema location. The “noNamespaceSchemaLocation” attribute accepts a single URI, which is treated as a schema location.
These attributes can appear anywhere as long as they come before (or on) the first element to validate. Note that if the root element does not carry a “schemaLocation” attribute, my understanding is that the root element’s content is treated as "skip" (see the "processContents" attribute of <any> and <anyAttribute>), so no validation is performed.
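Two instance fragments sketch the two attributes. The namespace http://example.com/po and the file names po.xsd and note.xsd are hypothetical.

```xml
<!-- Namespaced document: pairs of (namespace, schema location) -->
<po:order xmlns:po="http://example.com/po"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://example.com/po po.xsd">
  <po:item>pen</po:item>
</po:order>

<!-- Document without a namespace: a single schema location -->
<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="note.xsd">text</note>
```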
10.2. The “type”
Used to specify the actual type of an element.
10.3. The “nil”
Used to specify that an element is null.
11 Other
11.1. Schema documentation
<annotation> contains <appinfo> (for machines) and <documentation> (for humans). Each has a "source" attribute that refers to a URI; <documentation> may also have an "xml:lang" attribute.
11.2. About DTD
A schema can be used together with a DTD. Note that a schema cannot define entities as a DTD can, while a DTD cannot support local elements, namespaces, grouping or inheritance.
11.3. Root element
There’s no way to use a schema to force a particular element to be the root element; this information must be given to the XML processor instead.
12 Questions
Why are float and double not derived from decimal?
Why does <complexContent> need the “mixed” attribute?
Why not force <choice>, <sequence>, <all> and <group> to be placed inside <complexContent>?
What is <notation> used for?
13 References
"MSDN - XML Standards Reference"
http://msdn.microsoft.com/en-us/library/ms256177.aspx
"XML Schema Design Patterns: Is Complex Type Derivation Unnecessary?"
http://www.oreillynet.com/xml/blog/2007/07/derivation_by_implied_restrict.html
“XML Schema Part 0: Primer Second Edition”
http://www.w3.org/TR/xmlschema-0/
“XML Schema Part 1: Structures Second Edition”
http://www.w3.org/TR/xmlschema-1/
“XML Schema Part 2: Datatypes Second Edition”
http://www.w3.org/TR/xmlschema-2/