http://www.xmlw.ie/aboutxml/wordml.htm
Word 2003 Beta 2 has been released. We installed a copy and saved a Word file as XML for you to examine. Two files are provided: the native binary Word document we used, and the Word XML document generated when it was saved as XML. The XML document is well-formed and conforms to the XML Schema called WordML. Below is a lightly annotated version of the WordML mark-up.
Some more complex structures are included in a second sample: wordsample2.doc, wordsample2.xml.
Here are the top-level XML and namespace declarations, which are similar to those in Word 2000 and XP.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:SL="http://schemas.microsoft.com/schemaLibrary/2003/2/core"
xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/2/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xml:space="preserve">
Here is the document property section, which includes both built-in and user-defined (custom) properties. Note that each user-defined property is written as an element named after the property, which is a bit silly, as it makes validation more difficult: the schema cannot know the element names in advance.
<o:DocumentProperties>
<o:Title>Sample Word file encoded in XML</o:Title>
<o:Subject>XML, XHTML, Word</o:Subject>
<o:Author>Eoin Campbell</o:Author>
<o:LastAuthor>Eoin Campbell</o:LastAuthor>
<o:Revision>2</o:Revision>
<o:TotalTime>0</o:TotalTime>
<o:Created>2003-03-27T14:35:00Z</o:Created>
<o:LastSaved>2003-03-27T14:35:00Z</o:LastSaved>
<o:Pages>1</o:Pages>
<o:Words>103</o:Words>
<o:Characters>588</o:Characters>
<o:Company>XML Workshop Ltd.</o:Company>
<o:Lines>4</o:Lines>
<o:Paragraphs>1</o:Paragraphs>
<o:CharactersWithSpaces>690</o:CharactersWithSpaces>
<o:Version>11.4920</o:Version>
</o:DocumentProperties>
<o:CustomDocumentProperties>
<o:DCIdentifier dt:dt="string">http://www.xmlw.ie/xml2word/xml2word.xml</o:DCIdentifier>
</o:CustomDocumentProperties>
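To illustrate the point about user-defined properties, here is a minimal Python sketch that reads them back. It assumes the o:CustomDocumentProperties fragment above has been isolated into a string; the function name custom_properties is our own and is not part of any Word API. Because the property name exists only as an element name, the code has to loop over whatever children it finds rather than look for a known element.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the declarations shown earlier.
O = "urn:schemas-microsoft-com:office:office"
DT = "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"

SAMPLE = f"""
<o:CustomDocumentProperties xmlns:o="{O}" xmlns:dt="{DT}">
  <o:DCIdentifier dt:dt="string">http://www.xmlw.ie/xml2word/xml2word.xml</o:DCIdentifier>
</o:CustomDocumentProperties>
"""

def custom_properties(xml_text):
    """Return {property name: (declared type, value)} for each child element."""
    root = ET.fromstring(xml_text)
    props = {}
    for child in root:
        # The property name is only available as the element's local name.
        name = child.tag.split("}", 1)[-1]
        props[name] = (child.get(f"{{{DT}}}dt"), child.text)
    return props

print(custom_properties(SAMPLE))
```

A schema-friendly design would have used a fixed element with the property name in an attribute, so a validator could check it.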
Here is a chunk of the style section, which is very long.
<w:fonts>
<w:defaultFonts w:ascii="Times New Roman"
w:fareast="Times New Roman" w:h-ansi="Times New Roman"
w:cs="Times New Roman"/>
<w:font w:name="Tahoma">
<w:panose-1 w:val="020B0604030504040204"/>
<w:charset w:val="00"/>
<w:family w:val="Swiss"/>
<w:pitch w:val="variable"/>
<w:sig w:usb-0="21007A87" w:usb-1="80000000" w:usb-2="00000008" w:usb-3="00000000" w:csb-0="000101FF" w:csb-1="00000000"/>
</w:font>
</w:fonts>
<w:lists>
<w:listDef w:listDefId="0">
<w:lsid w:val="FFFFFF7F"/>
<w:plt w:val="SingleLevel"/>
<w:tmpl w:val="5A5E4FAC"/>
<w:lvl w:ilvl="0">
<w:start w:val="1"/>
<w:pStyle w:val="ListNumber2"/>
<w:lvlText w:val="%1."/>
<w:lvlJc w:val="left"/>
<w:pPr>
<w:tabs>
<w:tab w:val="list" w:pos="643"/>
</w:tabs>
<w:ind w:left="643" w:hanging="360"/></w:pPr>
</w:lvl>
</w:listDef>
</w:lists>
<w:styles>
<w:versionOfBuiltInStylenames w:val="3"/>
<w:latentStyles w:defLockedState="off" w:latentStyleCount="156"/>
<w:style w:type="paragraph" w:default="on" w:styleId="Normal">
<w:name w:val="Normal"/>
<w:pPr>
<w:spacing w:before="60" w:after="60"/>
</w:pPr>
<w:rPr>
<wx:font wx:val="Times New Roman"/>
<w:lang w:val="EN-IE" w:fareast="EN-US" w:bidi="AR-SA"/>
</w:rPr>
</w:style>
<w:style w:type="paragraph" w:styleId="Heading1">
<w:name w:val="heading 1"/>
<wx:uiName wx:val="Heading 1"/>
<w:basedOn w:val="Normal"/>
<w:next w:val="Normal"/>
<w:pPr>
<w:pStyle w:val="Heading1"/>
<w:keepNext/>
<w:spacing w:before="240"/>
<w:outlineLvl w:val="0"/>
</w:pPr>
<w:rPr>
<w:rFonts w:ascii="Arial" w:h-ansi="Arial"/>
<wx:font wx:val="Arial"/><w:b/>
<w:kern w:val="28"/>
<w:sz w:val="28"/>
<w:lang w:val="EN-GB"/>
</w:rPr>
</w:style>
Here are heading levels 1 to 5. Hierarchy is deduced from the headings, and wrapper elements (wx:sub-section) are added in appropriate places to associate headings with the text that follows them. This is really useful, as documents become hierarchical, not linear.
<wx:sub-section>
<w:p>
<w:pPr><w:pStyle w:val="Title"/></w:pPr>
<w:r><w:t>Sample Word file</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>This file contains various paragraph and character styles, custom properties, tables and images.</w:t></w:r>
</w:p>
</wx:sub-section>
<wx:sub-section>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:r>
<w:t>Heading Level 1</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Normal paragraph</w:t>
</w:r>
</w:p>
<wx:sub-section>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading2"/>
</w:pPr>
<w:r>
<w:t>Heading Level 2</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Normal paragraph</w:t>
</w:r>
</w:p>
<wx:sub-section>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading3"/>
</w:pPr>
<w:r>
<w:t>Heading Level 3</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Normal paragraph</w:t>
</w:r>
</w:p>
<wx:sub-section>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading4"/>
</w:pPr>
<w:r>
<w:t>Heading Level 4</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Normal paragraph</w:t>
</w:r>
</w:p>
<wx:sub-section>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading5"/>
</w:pPr>
<w:r>
<w:t>Heading Level 5</w:t>
</w:r>
</w:p>
</wx:sub-section>
</wx:sub-section>
</wx:sub-section>
</wx:sub-section>
</wx:sub-section>
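Because the hierarchy is explicit, an outline can be recovered with a simple recursive walk. The Python sketch below is only an illustration: it assumes the first w:p in each wx:sub-section is the heading (as in the sample above), and it wraps a shortened copy of the mark-up in a made-up fragment element purely to make it well-formed. The function name outline is ours.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/2/auxHint"
NS = {"w": W, "wx": WX}

# "fragment" is a hypothetical wrapper, not a WordML element.
SAMPLE = f"""<fragment xmlns:w="{W}" xmlns:wx="{WX}">
<wx:sub-section>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Heading Level 1</w:t></w:r></w:p>
<w:p><w:r><w:t>Normal paragraph</w:t></w:r></w:p>
<wx:sub-section>
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Heading Level 2</w:t></w:r></w:p>
<w:p><w:r><w:t>Normal paragraph</w:t></w:r></w:p>
</wx:sub-section>
</wx:sub-section>
</fragment>"""

def outline(parent, depth=0):
    """Yield (depth, style, text) for the first paragraph of each sub-section."""
    for section in parent.findall("wx:sub-section", NS):
        first = section.find("w:p", NS)
        if first is not None:
            style_el = first.find("w:pPr/w:pStyle", NS)
            style = style_el.get(f"{{{W}}}val") if style_el is not None else None
            text = "".join(t.text or "" for t in first.iterfind(".//w:t", NS))
            yield depth, style, text
        # Nested sub-sections sit one level deeper in the outline.
        yield from outline(section, depth + 1)

for depth, style, text in outline(ET.fromstring(SAMPLE)):
    print("  " * depth + f"{text} ({style})")
```

The depth comes from the element nesting alone, so no knowledge of the heading style names is needed.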
Here is the XML markup for bulleted and numbered lists. All hierarchy is lost, because the generated mark-up doesn't contain any nested structures. This is not too surprising, as Word doesn't have the concept of nested lists anyway, but hierarchy is deduced for headings, so why not lists too? The hierarchy could be re-instated by post-processing with a very clever piece of XSLT on export, but why should you have to?
Perhaps if an XML Schema is used, you can assign hierarchical levels to lists.
<w:p>
<w:pPr>
<w:pStyle w:val="ListNumber"/>
<w:listPr>
<wx:t wx:val="1." wx:wTabBefore="0" wx:wTabAfter="225"/>
<wx:font wx:val="Times New Roman"/>
</w:listPr>
</w:pPr>
<w:r>
<w:t>This is a numbered list item </w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="ListNumber"/>
<w:listPr>
<wx:t wx:val="2." wx:wTabBefore="0" wx:wTabAfter="225"/>
<wx:font wx:val="Times New Roman"/>
</w:listPr>
</w:pPr>
<w:r>
<w:t>Item 2.</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="ListBullet"/>
<w:listPr>
<wx:t wx:val="·" wx:wTabBefore="0" wx:wTabAfter="270"/>
<wx:font wx:val="Symbol"/>
</w:listPr>
</w:pPr>
<w:r>
<w:t>This is a bullet list item </w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="ListBullet"/>
<w:listPr>
<wx:t wx:val="·" wx:wTabBefore="0" wx:wTabAfter="270"/>
<wx:font wx:val="Symbol"/>
</w:listPr>
</w:pPr>
<w:r>
<w:t>Item 2.</w:t>
</w:r>
</w:p>
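As a rough idea of the post-processing involved, the Python sketch below (standing in for the XSLT mentioned above) groups consecutive paragraphs that share a paragraph style into one list. It is a deliberately simple assumption: it recovers only flat lists, not nesting, and it would also group consecutive non-list paragraphs. The w:listPr children are omitted from the sample for brevity, the fragment wrapper is made up, and the function names are ours.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
NS = {"w": W}

# "fragment" is a hypothetical wrapper, not a WordML element.
SAMPLE = f"""<fragment xmlns:w="{W}">
<w:p><w:pPr><w:pStyle w:val="ListNumber"/></w:pPr><w:r><w:t>This is a numbered list item </w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="ListNumber"/></w:pPr><w:r><w:t>Item 2.</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="ListBullet"/></w:pPr><w:r><w:t>This is a bullet list item </w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="ListBullet"/></w:pPr><w:r><w:t>Item 2.</w:t></w:r></w:p>
</fragment>"""

def style_of(p):
    """Return the paragraph style name, or None if the paragraph has none."""
    style = p.find("w:pPr/w:pStyle", NS)
    return style.get(f"{{{W}}}val") if style is not None else None

def text_of(p):
    """Concatenate the text of every w:t inside the paragraph."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))

def group_lists(parent):
    """Group consecutive sibling paragraphs that share a paragraph style."""
    paragraphs = parent.findall("w:p", NS)
    return [(style, [text_of(p) for p in run])
            for style, run in groupby(paragraphs, key=style_of)]

for style, items in group_lists(ET.fromstring(SAMPLE)):
    print(style, items)
```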
Here are unnamed styles like bold and italic, together with named styles and hyperlinks. Formatting is not applied by wrapping the text in an element; instead, an empty element in the run's properties switches on the formatting required for that run. The chunk of text below is marked up in this way. RTF handles this type of inline formatting in the same manner.
Inline elements

Element | Meaning
w:p | Paragraph
w:r | Text run container
w:rPr | Text run properties container
w:u | Underline property flag
w:i | Italic property flag
w:b | Bold property flag
w:t | Text container
<w:p>
<w:r>
<w:t>Some unnamed character level styles </w:t>
</w:r>
<w:r>
<w:rPr>
<w:u w:val="single"/>
</w:rPr>
<w:t>underline</w:t>
</w:r>
<w:r>
<w:t>, </w:t>
</w:r>
<w:r>
<w:rPr>
<w:i/>
<w:i-cs/>
</w:rPr>
<w:t>italic</w:t>
</w:r>
<w:r>
<w:t>, </w:t>
</w:r>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t>bold</w:t>
</w:r>
<w:r>
<w:t>, </w:t>
</w:r>
<w:hlink w:dest="http://www.xmlw.ie/">
<w:r>
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
</w:rPr>
<w:t>Hyperlink</w:t>
</w:r>
</w:hlink>
<w:r>
<w:t>. </w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Named character styles: </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rStyle w:val="EIRORef"/>
</w:rPr>
<w:t>EIRORef</w:t>
</w:r>
</w:p>
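To show what consuming this flag-style mark-up involves, here is a small Python sketch that turns the runs of a paragraph into wrapped, HTML-like inline tags. It is an assumption-laden illustration: it treats the mere presence of w:b, w:i or w:u as "on" (it ignores any w:val setting), it looks only at runs that are direct children of the paragraph (so the w:hlink run is skipped), and the function names are ours.

```python
import xml.etree.ElementTree as ET
from html import escape

W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
NS = {"w": W}

# WordML property flag -> HTML-like wrapper tag (our own mapping).
FLAGS = [("w:b", "b"), ("w:i", "i"), ("w:u", "u")]

SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:t>Some unnamed character level styles </w:t></w:r>
<w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>underline</w:t></w:r>
<w:r><w:t>, </w:t></w:r>
<w:r><w:rPr><w:i/><w:i-cs/></w:rPr><w:t>italic</w:t></w:r>
<w:r><w:t>, </w:t></w:r>
<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
</w:p>"""

def run_to_html(run):
    """Wrap the run's text in one tag per formatting flag found in w:rPr."""
    text = escape("".join(t.text or "" for t in run.findall("w:t", NS)))
    props = run.find("w:rPr", NS)
    if props is not None:
        for flag, tag in FLAGS:
            if props.find(flag, NS) is not None:
                text = f"<{tag}>{text}</{tag}>"
    return text

def paragraph_to_html(p):
    """Convert the direct child runs of a paragraph, in document order."""
    return "".join(run_to_html(r) for r in p.findall("w:r", NS))

print(paragraph_to_html(ET.fromstring(SAMPLE)))
```

Every run has to be inspected individually, which is the price of the flag approach compared with wrapper elements.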
This is a table with 5 columns and 2 rows, with a spanned row and a spanned column.
<w:tbl><w:tblPr><w:tblW w:w="0" w:type="auto"/><w:tblBorders><w:top w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/><w:left w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/><w:bottom w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/><w:right w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/><w:insideH w:val="single" w:sz="6" wx:bdrwidth="15" w:space="0" w:color="000000"/><w:insideV w:val="single" w:sz="6" wx:bdrwidth="15" w:space="0" w:color="000000"/></w:tblBorders><w:tblLook w:val="0000003F"/></w:tblPr><w:tblGrid><w:gridCol w:w="1704"/><w:gridCol w:w="1704"/><w:gridCol w:w="1704"/><w:gridCol w:w="1705"/><w:gridCol w:w="1705"/></w:tblGrid><w:tr><w:tc><w:tcPr><w:tcW w:w="1704" w:type="dxa"/><w:tcBorders><w:bottom w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr><w:tcW w:w="1704" w:type="dxa"/><w:tcBorders><w:bottom w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr><w:tcW w:w="1704" w:type="dxa"/><w:tcBorders><w:bottom w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr><w:tcW w:w="1705" w:type="dxa"/><w:tcBorders><w:bottom w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr><w:tcW w:w="1705" w:type="dxa"/><w:tcBorders><w:bottom w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc></w:tr><w:tr><w:tc><w:tcPr><w:tcW w:w="1704" w:type="dxa"/><w:tcBorders><w:top w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr><w:tcW w:w="1704"
w:type="dxa"/><w:tcBorders><w:top w:val="single" w:sz="12" wx:bdrwidth="30"
w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr>
<w:tcW w:w="1704" w:type="dxa"/><w:tcBorders><w:top w:val="single" w:sz="12"
wx:bdrwidth="30" w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/>
</w:tc><w:tc><w:tcPr><w:tcW w:w="1705" w:type="dxa"/><w:tcBorders><w:top
w:val="single" w:sz="12" wx:bdrwidth="30" w:space="0" w:color="000000"/>
</w:tcBorders></w:tcPr><w:p/></w:tc><w:tc><w:tcPr><w:tcW w:w="1705"
w:type="dxa"/><w:tcBorders><w:top w:val="single" w:sz="12" wx:bdrwidth="30"
w:space="0" w:color="000000"/></w:tcBorders></w:tcPr><w:p/></w:tc></w:tr>
</w:tbl><w:p/><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440"
w:right="1800" w:bottom="1440" w:left="1800" w:header="708" w:footer="708"
w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:line-pitch="360"/>
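The table mark-up is easier to inspect with a few lines of code than by eye. The Python sketch below reads the column widths from w:tblGrid and counts the w:tc cells in each w:tr. It runs on a shortened copy of the table with the border properties removed, and the function name table_shape is ours.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
NS = {"w": W}

# One row of five empty cells, mirroring the structure of the sample table.
ROW = "<w:tr>" + "<w:tc><w:p/></w:tc>" * 5 + "</w:tr>"

SAMPLE = f"""<w:tbl xmlns:w="{W}">
<w:tblGrid><w:gridCol w:w="1704"/><w:gridCol w:w="1704"/><w:gridCol w:w="1704"/><w:gridCol w:w="1705"/><w:gridCol w:w="1705"/></w:tblGrid>
{ROW}
{ROW}
</w:tbl>"""

def table_shape(tbl):
    """Return (grid column widths, number of cells in each row)."""
    widths = [int(col.get(f"{{{W}}}w"))
              for col in tbl.findall("w:tblGrid/w:gridCol", NS)]
    cells_per_row = [len(tr.findall("w:tc", NS)) for tr in tbl.findall("w:tr", NS)]
    return widths, cells_per_row

print(table_shape(ET.fromstring(SAMPLE)))
```

Comparing the cell count of each row with the number of grid columns is a quick way to spot rows that contain spanned cells.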
Here is the markup for a linked image. Vector Markup Language (VML), Microsoft's non-standard alternative to SVG, is used, which is a pity.
Embedded images are stored within the XML file in an encoded format, and not in an external file. This means a single XML file represents a complete Word file, which is an improvement on Word 2000/XP, where the Save as HTML function creates multiple files.
<w:pict>
<v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75"
o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe"
filled="f" stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum @0 1 0"/><v:f eqn="sum 0 0 @1"/>
<v:f eqn="prod @2 1 2"/>
<v:f eqn="prod @3 21600 pixelWidth"/>
<v:f eqn="prod @3 21600 pixelHeight"/>
<v:f eqn="sum @0 0 1"/><v:f eqn="prod @6 1 2"/>
<v:f eqn="prod @7 21600 pixelWidth"/>
<v:f eqn="sum @8 21600 0"/>
<v:f eqn="prod @7 21600 pixelHeight"/>
<v:f eqn="sum @10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype>
<v:shape id="_x0000_i1025" type="#_x0000_t75"
alt="XML Workshop Ltd." style="width:150pt;height:75pt">
<v:imagedata src="D:\yawconline\test\xmlw.gif"/>
</v:shape>
</w:pict>
This is how a footnote is encoded. The text of the footnote is embedded within the paragraph, which seems like a sensible option.
<w:p><w:r><w:t>This paragraph contains a footnote</w:t></w:r>
<w:r><w:rPr><w:rStyle w:val="FootnoteReference"/></w:rPr>
<w:footnote><w:p><w:pPr><w:pStyle w:val="FootnoteText"/></w:pPr><w:r><w:rPr><w:rStyle w:val="FootnoteReference"/></w:rPr><w:footnoteRef/></w:r><w:r><w:t> This is footnote text</w:t></w:r></w:p></w:footnote>
</w:r><w:r><w:t>.</w:t></w:r></w:p>
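Since the footnote sits inside a run, a consumer has to separate it from the surrounding body text. The Python sketch below shows one way to do that, assuming footnotes always appear as a w:footnote child of a w:r, as in the sample. The bracketed marker format and the function name split_footnotes are our own choices.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
NS = {"w": W}

SAMPLE = f"""<w:p xmlns:w="{W}"><w:r><w:t>This paragraph contains a footnote</w:t></w:r>
<w:r><w:rPr><w:rStyle w:val="FootnoteReference"/></w:rPr>
<w:footnote><w:p><w:pPr><w:pStyle w:val="FootnoteText"/></w:pPr><w:r><w:rPr><w:rStyle w:val="FootnoteReference"/></w:rPr><w:footnoteRef/></w:r><w:r><w:t> This is footnote text</w:t></w:r></w:p></w:footnote>
</w:r><w:r><w:t>.</w:t></w:r></w:p>"""

def split_footnotes(p):
    """Return (body text with [n] markers, list of footnote texts)."""
    body, notes = [], []
    for run in p.findall("w:r", NS):
        note = run.find("w:footnote", NS)
        if note is not None:
            # Collect all text inside the footnote and trim the leading space.
            text = "".join(t.text or "" for t in note.iterfind(".//w:t", NS))
            notes.append(text.strip())
            body.append(f"[{len(notes)}]")
        else:
            body.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return "".join(body), notes

print(split_footnotes(ET.fromstring(SAMPLE)))
```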