<?php
function get_num_pages_docx($filename)
{
    $zip = new ZipArchive();

    if ($zip->open($filename) === true) {
        if (($index = $zip->locateName('docProps/app.xml')) !== false) {
            $data = $zip->getFromIndex($index);
            $zip->close();

            $xml = new SimpleXMLElement($data);
            return $xml->Pages;
        }

        $zip->close();
    }

    return false;
}
?>
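Microsoft Office Word 2007 introduced a new default file format, called the Microsoft Office Word XML format (Word XML format).
The binary file format used from Microsoft Office 97 through Microsoft Office 2003 is still available as a save format, but it is no longer the default when saving new documents.
A Word 2007 file is a compressed ZIP archive, called a package.
The app.xml file contains application-specific properties. It holds some useful information, such as the page count, word count, and character count.
So the code looks up the index of docProps/app.xml in the package and reads its contents from that index.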
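The same lookup can return the other counters as well. Here is a minimal sketch, assuming app.xml carries the usual `Pages`, `Words` and `Characters` elements (Word writes them on save, other producers may leave them out). The helper name `parse_app_xml_counts` and the sample values are made up for illustration:

```php
<?php
// Hypothetical helper: read the page, word and character counts from the
// XML text of docProps/app.xml. Elements that are missing come back as null.
function parse_app_xml_counts(string $xmlText): array
{
    $xml = new SimpleXMLElement($xmlText);
    $counts = [];
    foreach (['Pages', 'Words', 'Characters'] as $name) {
        // isset() on a SimpleXMLElement property tests whether the child exists.
        $counts[$name] = isset($xml->$name) ? (int) $xml->$name : null;
    }
    return $counts;
}

// Hand-written app.xml fragment; the numbers are invented.
$sample = '<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">'
    . '<Pages>3</Pages><Words>120</Words>'
    . '</Properties>';

print_r(parse_app_xml_counts($sample));
```

In a real script the XML text would come from `$zip->getFromIndex($index)`, exactly as in `get_num_pages_docx()`. Casting to `int` avoids handing a `SimpleXMLElement` object back to the caller.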
Getting the number of pages for docx files is very easy, as get_num_pages_docx() above shows.
For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation section of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly imo) here and more simply here. I looked at this for an hour today but didn't get very far (not a level of abstraction I'm used to), though I did output the hex to better understand the structure:
function get_num_pages_doc($filename)
{
    // Read the whole file in binary mode, then dump it as hex, 16 bytes per row.
    $handle = fopen($filename, 'rb');
    $data = @fread($handle, filesize($filename));
    fclose($handle);

    echo '<div style="font-family: courier new;">';

    $hex = bin2hex($data);
    $hex_array = str_split($hex, 4);

    $i = 0;
    $line = 0;
    $collection = '';
    foreach ($hex_array as $key => $string) {
        $collection .= hex_ascii($string);
        $i++;

        // Start of a row: print the offset.
        if ($i == 1) {
            echo '<b>'.sprintf('%05X', $line).'0:</b> ';
        }

        echo strtoupper($string).' ';

        // End of a row: print the ASCII column and reset.
        if ($i == 8) {
            echo ' '.$collection.' <br />'."\n";
            $collection = '';
            $i = 0;
            $line += 1;
        }
    }

    echo '</div>';
    exit();
}

// Convert hex digits to displayable characters ('.' for control bytes and spaces).
function hex_ascii($string, $html_safe = true)
{
    $return = '';
    $conv = array($string);
    if (strlen($string) > 2) {
        $conv = str_split($string, 2);
    }

    foreach ($conv as $string) {
        $num = hexdec($string);
        $ascii = '.';
        if ($num > 32) {
            $ascii = unichr($num);
        }

        // Escape < and > so the dump does not break the HTML output.
        if ($html_safe AND ($num == 62 OR $num == 60)) {
            $return .= htmlentities($ascii);
        } else {
            $return .= $ascii;
        }
    }

    return $return;
}

// Turn a code point into its UTF-8 character.
function unichr($intval)
{
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
which will output a hex dump in which you can find sections such as:
007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
Which will allow you to see the referencing info that follows the name, such as:
007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........
Which will allow you to determine the properties described in the spec (values are stored little-endian, shown here in reading order):
_ab = ("SummaryInformation")
_cb = 0028
_mse = 02 (STGTY_STREAM)
_bflags = 01 (DE_BLACK)
_sidLeftSib = FFFF FFFF
_sidRightSib = FFFF FFFF (none)
_sidChild = FFFF FFFF (n/a for STGTY_STREAM)
_clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a)
_dwUserFlags = 0000 0000 (n/a)
_time[0] = CreateTime = 0000 0000 0000 0000 (n/a)
_time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
_startSect = 0000 0025
_ulSize = 0000 1000
_dptPropType = 0000 (n/a)
Which will let you find the relevant section of the file, unpack it and get the page count. Of course this is the hard bit that I just don't have time for, but it should set you in the right direction.
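For anyone picking this up, here is a rough sketch of the two unpacking steps, not a finished reader. It assumes the 128-byte directory entry layout from the dump above, and a SummaryInformation stream laid out as an OLE property set, with the page count stored under property id 0x0E as a 32-bit integer. The function names are hypothetical. Following the sector chain from `_startSect` to pull the stream out of the file is not shown.

```php
<?php
const PIDSI_PAGECOUNT = 0x0E; // property id of the page count
const VT_I4 = 0x0003;         // property type: 32-bit integer

// Decode one 128-byte directory entry (all fields little-endian).
function parse_dir_entry(string $entry): array
{
    if (strlen($entry) < 128) {
        throw new InvalidArgumentException('a directory entry is 128 bytes');
    }
    // unpack codes: v = uint16 LE, C = uint8, V = uint32 LE.
    $head = unpack('vcb/Cmse/Cbflags/VleftSib/VrightSib/Vchild', substr($entry, 64, 16));
    $tail = unpack('VstartSect/VulSize', substr($entry, 116, 8));
    // _cb counts bytes including the 2-byte terminator. Dropping NUL bytes
    // is a shortcut that only works for ASCII names.
    $name = str_replace("\0", '', substr($entry, 0, max(0, $head['cb'] - 2)));
    return ['name' => $name] + $head + $tail;
}

// $stream: raw bytes of the SummaryInformation stream, already extracted.
// Returns the page count, or false if it cannot be found.
function page_count_from_summary_stream(string $stream)
{
    if (strlen($stream) < 48 || substr($stream, 0, 2) !== "\xFE\xFF") {
        return false; // too short, or missing the 0xFFFE byte-order mark
    }
    // 28-byte header + 16-byte format id, then the offset of the property set.
    $setOffset = unpack('V', substr($stream, 44, 4))[1];
    if ($setOffset + 8 > strlen($stream)) {
        return false;
    }
    $count = unpack('V', substr($stream, $setOffset + 4, 4))[1];
    for ($i = 0; $i < $count; $i++) {
        $pair = substr($stream, $setOffset + 8 + 8 * $i, 8);
        if (strlen($pair) < 8) {
            return false;
        }
        $p = unpack('Vid/Voffset', $pair);
        if ($p['id'] !== PIDSI_PAGECOUNT) {
            continue;
        }
        // Property offsets are relative to the start of the property set.
        $value = substr($stream, $setOffset + $p['offset'], 8);
        if (strlen($value) < 8) {
            return false;
        }
        $v = unpack('vtype/vpad/Vraw', $value);
        return $v['type'] === VT_I4 ? $v['raw'] : false;
    }
    return false;
}

// The directory entry from the dump above, rebuilt from its hex.
$entry = hex2bin(
    '0500530075006D006D00610072007900'
    . '49006E0066006F0072006D0061007400'
    . '69006F006E00' . str_repeat('00', 26)
    . '28000201' . str_repeat('FF', 12)
    . str_repeat('00', 36)
    . '25000000' . '00100000' . '00000000'
);
print_r(parse_dir_entry($entry));
```

The directory entry only tells you where the stream starts and how long it is. The bytes passed to `page_count_from_summary_stream()` still have to be collected sector by sector, which is the part that takes the real work.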
M$ don't make it easy!