How to get the number of pages in a Word Document

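<?php

// Returns the page count stored in a .docx file's metadata, or false on failure.
function get_num_pages_docx($filename)
{
    $zip = new ZipArchive();

    if($zip->open($filename) === true)
    {
        // docProps/app.xml holds application-specific properties such as Pages.
        if(($index = $zip->locateName('docProps/app.xml')) !== false)
        {
            $data = $zip->getFromIndex($index);
            $zip->close();

            $xml = new SimpleXMLElement($data);
            // Cast so the caller gets an integer rather than a SimpleXMLElement.
            return (int) $xml->Pages;
        }

        $zip->close();
    }

    return false;
}

?>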

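Microsoft Office Word 2007 introduced a new default file format called the Microsoft Office Word XML Format (Word XML Format).

The binary file format used by Microsoft Office 97 through Microsoft Office 2003 is still available as a save format, but it is no longer the default when saving new documents.

A Word 2007 file consists of a compressed ZIP archive, called a package.

Put simply, a Word file is a ZIP-compressed set of XML documents, hence the .docx extension.

The details are described at this link:

http://www.microsoft.com/china/msdn/library/office/office/Word2007XMLFormat.mspx?mfr=true

That is why the program above first opens the document with the ZipArchive class and looks up the part it needs.

The app.xml file contains application-specific properties. It holds some useful information, such as the page count, word count, and character count.

So the code looks up the position of docProps/app.xml in the package and reads its contents by that index.

It then uses the SimpleXMLElement class to turn the contents into an XML object that is easy to work with.

Finally, it reads the Pages element.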
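
Since app.xml also carries the word and character counts, the same approach can return several values at once. The following is only a sketch: read_app_stats() is a hypothetical helper, it assumes the contents of docProps/app.xml have already been read into a string, and the sample XML is made up for illustration.

```php
<?php
// Hypothetical helper: $appXml is assumed to be the contents of docProps/app.xml.
function read_app_stats(string $appXml): array
{
    $xml = new SimpleXMLElement($appXml);
    $stats = [];
    foreach (['Pages', 'Words', 'Characters'] as $name) {
        // A missing element is reported as null instead of 0.
        $stats[$name] = isset($xml->$name) ? (int) $xml->$name : null;
    }
    return $stats;
}

// Made-up sample data in the general shape of app.xml.
$sample = '<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">'
        . '<Pages>3</Pages><Words>512</Words><Characters>2920</Characters>'
        . '</Properties>';

print_r(read_app_stats($sample));
```

These numbers are whatever the application last saved, so they may be stale or absent if the file was produced by another tool.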

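The body text of the document lives in the word/document.xml part, which can be fetched by name in the same way.

Note, however, that the XML tags there are all namespaced, generally with the w: prefix.

So whether you iterate with children() or search directly with XPath, you have to specify the namespace.

For children(), the namespace is passed as its argument. For XPath, first register the namespace with registerXPathNamespace() before running the query.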
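
Both ways of handling the namespace can be sketched as follows. This is an illustration only: the two helper functions and the W_NS constant are hypothetical names, and the input is assumed to be the contents of word/document.xml already read into a string.

```php
<?php
// Namespace URI behind the w: prefix in word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: collect the text of each paragraph with XPath.
function docx_paragraph_texts(string $documentXml): array
{
    $xml = new SimpleXMLElement($documentXml);
    $xml->registerXPathNamespace('w', W_NS);
    $texts = [];
    foreach ($xml->xpath('//w:p') as $p) {
        // The registration belongs to one element, so repeat it per paragraph.
        $p->registerXPathNamespace('w', W_NS);
        $text = '';
        foreach ($p->xpath('.//w:t') as $t) {
            $text .= (string) $t;
        }
        $texts[] = $text;
    }
    return $texts;
}

// Hypothetical helper: count top-level paragraphs by walking children().
function docx_count_paragraphs(string $documentXml): int
{
    $xml = new SimpleXMLElement($documentXml);
    // The namespace is passed to children() at each level.
    $body = $xml->children(W_NS)->body;
    return count($body->children(W_NS)->p);
}

// Made-up sample with two paragraphs.
$sample = '<w:document xmlns:w="' . W_NS . '"><w:body>'
        . '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
        . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        . '</w:body></w:document>';

print_r(docx_paragraph_texts($sample));
echo docx_count_paragraphs($sample), "\n";
```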

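After that you can search the body text.

For example, to find the page count again, I can look for w:lastRenderedPageBreak elements to work out how many pages there are. When a new page has content at its very top, this tag is attached to that content block. If the page starts with a bare line break, however, the tag seems to be missing (still to be verified).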
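
A rough sketch of that idea is shown below. It is a heuristic under the assumption stated above (one marker per page after the first), so it can undercount. estimate_pages_from_breaks() is a hypothetical name, and the input is assumed to be the contents of word/document.xml.

```php
<?php
// Hypothetical helper: estimate the page count from w:lastRenderedPageBreak
// markers. Assumes one marker per page after the first, which may not hold
// for every document.
function estimate_pages_from_breaks(string $documentXml): int
{
    $wNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $xml = new SimpleXMLElement($documentXml);
    $xml->registerXPathNamespace('w', $wNs);
    $breaks = $xml->xpath('//w:lastRenderedPageBreak');
    // The first page has no marker, so add 1.
    return count($breaks) + 1;
}

// Made-up sample: two markers, so the estimate is three pages.
$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
        . '<w:p><w:r><w:t>Page one</w:t></w:r></w:p>'
        . '<w:p><w:r><w:lastRenderedPageBreak/><w:t>Page two</w:t></w:r></w:p>'
        . '<w:p><w:r><w:lastRenderedPageBreak/><w:t>Page three</w:t></w:r></w:p>'
        . '</w:body></w:document>';

echo estimate_pages_from_breaks($sample), "\n";
```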


Getting the number of pages for docx files is very easy.

For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation section of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly imo) here and simpler here. I looked at this for an hour today but didn't get very far (it's not a level of abstraction I'm used to). I did, however, output the hex to better understand the structure:

function get_num_pages_doc($filename)
{
    // Debug helper: dumps the file as hex so the OLE structure can be inspected.
    // It does not return a page count yet.
    $handle = fopen($filename, 'rb'); // binary-safe read
    if($handle === false)
    {
        return false;
    }
    $contents = fread($handle, filesize($filename));
    fclose($handle);

    echo '<div style="font-family: courier new;">';

        $hex = bin2hex($contents);
        $hex_array = str_split($hex, 4);
        $i = 0;
        $line = 0;
        $collection = '';
        foreach($hex_array as $key => $string)
        {
            $collection .= hex_ascii($string);
            $i++;

            if($i == 1)
            {
                // Row offset: each row shows 16 bytes.
                echo '<b>'.sprintf('%05X', $line).'0:</b> ';
            }

            echo strtoupper($string).' ';

            if($i == 8)
            {
                echo ' '.$collection.' <br />'."\n";
                $collection = '';
                $i = 0;

                $line += 1;
            }
        }

    echo '</div>';

    exit();
}

function hex_ascii($string, $html_safe = true)
{
    $return = '';

    $conv = array($string);
    if(strlen($string) > 2)
    {
        $conv = str_split($string, 2);
    }

    foreach($conv as $string)
    {
        $num = hexdec($string);

        $ascii = '.';
        if($num > 32)
        {   
            $ascii = unichr($num);
        }

        if($html_safe AND ($num == 62 OR $num == 60))
        {
            $return .= htmlentities($ascii);
        }
        else
        {
            $return .= $ascii;
        }
    }

    return $return;
}

function unichr($intval)
{
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}

This will output a hex dump in which you can find sections such as:

007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................

Which will allow you to see the referencing info such as:

007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........

Which will allow you to determine properties described:

_ab = ("SummaryInformation")
_cb = 0028
_mse = 02 (STGTY_STREAM)
_bflags = 01 (DE_BLACK)
_sidLeftSib = FFFF FFFF
_sidRightSib = FFFF FFFF (none)
_sidChild = FFFF FFFF (n/a for STGTY_STREAM)
_clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a)
_dwUserFlags = 0000 0000 (n/a)
_time[0] = CreateTime = 0000 0000 0000 0000 (n/a)
_time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
_startSect = 2500 0000 (little-endian, sector 0x25)
_ulSize = 0010 0000 (little-endian, 0x1000 bytes)
_dptPropType = 0000 (n/a)
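
Rather than reading these fields off the dump by eye, the 128-byte directory entry can be decoded with unpack(). The sketch below assumes the standard compound-file layout (64-byte UTF-16LE name, then length, type, colour, sibling and child IDs, with the start sector at byte 116 and the size at byte 120). parse_ole_directory_entry() is a hypothetical name, and the name decoding is simplified to ASCII-range characters.

```php
<?php
// Hypothetical helper: decode one 128-byte OLE directory entry.
function parse_ole_directory_entry(string $entry): array
{
    if (strlen($entry) < 128) {
        throw new InvalidArgumentException('A directory entry is 128 bytes long.');
    }
    // v = 16-bit little-endian, C = byte, V = 32-bit little-endian.
    $head = unpack('vnameLength/Ctype/Ccolor/VleftSibling/VrightSibling/Vchild',
                   substr($entry, 64, 16));
    $tail = unpack('VstartSector/Vsize', substr($entry, 116, 8));
    // nameLength counts bytes and includes the 2-byte terminator.
    $nameBytes = substr($entry, 0, max(0, $head['nameLength'] - 2));
    // Simplification: drop the zero high bytes, which only works for ASCII names.
    $name = str_replace("\0", '', $nameBytes);

    return ['name' => $name] + $head + $tail;
}

// The eight rows of the dump shown earlier, joined into one hex string.
$hex = '0500530075006D006D00610072007900'
     . '49006E0066006F0072006D0061007400'
     . '69006F006E0000000000000000000000'
     . str_repeat('00', 16)
     . '28000201FFFFFFFFFFFFFFFFFFFFFFFF'
     . str_repeat('00', 32)
     . '00000000250000000010000000000000';

$info = parse_ole_directory_entry(hex2bin($hex));
printf("type=%d startSector=0x%X size=0x%X\n",
       $info['type'], $info['startSector'], $info['size']);
```

The stream name begins with the control character 0x05, which is why the dump shows a dot before "SummaryInformation".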

With those properties you can find the relevant stream, unpack it and get the page number. Of course this is the hard bit that I just don't have time for, but it should set you in the right direction.
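
As a starting point for that last step, here is a sketch of reading the page count once the raw bytes of the SummaryInformation stream are in hand. It does not cover following the sector chain to extract the stream. It assumes the standard property-set layout: a byte-order mark of 0xFFFE, the first section's offset at byte 44, then a table of ID/offset pairs, with the page count stored under property ID 14 as a 32-bit integer (type 3). summary_info_page_count() is a hypothetical name and the sample stream is constructed by hand.

```php
<?php
// Hypothetical helper: $stream is assumed to be the already-extracted
// bytes of the SummaryInformation stream.
function summary_info_page_count(string $stream): ?int
{
    if (strlen($stream) < 48) {
        return null;
    }
    $header = unpack('vbyteOrder', substr($stream, 0, 2));
    if ($header['byteOrder'] !== 0xFFFE) {
        return null;
    }
    // Offset of the first property set, stored after its 16-byte format ID.
    $sectionOffset = unpack('V', substr($stream, 44, 4))[1];
    if (strlen($stream) < $sectionOffset + 8) {
        return null;
    }
    $section = unpack('Vsize/Vcount', substr($stream, $sectionOffset, 8));

    for ($i = 0; $i < $section['count']; $i++) {
        $pairBytes = substr($stream, $sectionOffset + 8 + $i * 8, 8);
        if (strlen($pairBytes) < 8) {
            return null;
        }
        $pair = unpack('Vid/Voffset', $pairBytes);
        if ($pair['id'] !== 14) { // 14 = page count property
            continue;
        }
        // Value offsets are relative to the start of the section.
        $valueBytes = substr($stream, $sectionOffset + $pair['offset'], 8);
        if (strlen($valueBytes) < 8) {
            return null;
        }
        $value = unpack('Vtype/Vdata', $valueBytes);
        return $value['type'] === 3 ? $value['data'] : null; // 3 = 32-bit integer
    }
    return null;
}

// Hand-built sample stream holding a single property: page count = 7.
$sample = pack('vvV', 0xFFFE, 0, 0) . str_repeat("\0", 16) . pack('V', 1)
        . str_repeat("\0", 16) . pack('V', 48)
        . pack('VV', 24, 1) . pack('VV', 14, 16) . pack('VV', 3, 7);

var_dump(summary_info_page_count($sample));
```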

M$ don't make it easy!
