PDF Document Structure

1. Basics:

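From the PDF file storage structure, we know that a PDF file consists of four parts: the file header (version number), the file body (objects), the cross-reference table, and the file trailer. From the basic PDF object data structures, we know that the basic objects are bool, int, real, string, hexstring, name, array, dictionary, stream, and null, plus the indirect object (indirect reference). The PDF document structure is built on top of these basic objects, and it relies mainly on dictionaries to organize the document data into a tree.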
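To make the idea of "dictionaries linked into a tree" concrete, here is a minimal Python sketch of how these basic objects could be modelled in memory. The class names `Name`, `Ref`, and `Stream`, the `objects` table, and the `resolve` helper are all hypothetical names chosen for illustration; they do not come from any particular PDF library.

```python
from dataclasses import dataclass


class Name(str):
    """A PDF name such as /Type or /Pages (stored without the leading slash)."""


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as `1 0 R`."""
    num: int
    gen: int = 0


@dataclass
class Stream:
    """A stream object: a dictionary plus the raw stream bytes."""
    info: dict
    data: bytes


# Hypothetical object table keyed by (object number, generation number).
# bool/int/real/string/array/dictionary/null map onto Python's own
# bool/int/float/bytes/list/dict/None.
objects = {
    (1, 0): {Name("Type"): Name("Pages"), Name("Kids"): [Ref(2)], Name("Count"): 1},
    (2, 0): {Name("Type"): Name("Page"), Name("Parent"): Ref(1),
             Name("MediaBox"): [0, 0, 612, 792], Name("Contents"): Ref(4)},
    (3, 0): {Name("Type"): Name("Catalog"), Name("Pages"): Ref(1)},
    (4, 0): Stream({Name("Length"): 0}, b""),
}


def resolve(value):
    """Follow an indirect reference; return direct objects unchanged."""
    if isinstance(value, Ref):
        return objects[(value.num, value.gen)]
    return value


# Walk the tree: Catalog -> Pages -> first Page.
catalog = objects[(3, 0)]
pages = resolve(catalog[Name("Pages")])
first_page = resolve(pages[Name("Kids")][0])
print(first_page[Name("MediaBox")])  # prints [0, 0, 612, 792]
```

The point of the sketch is only that every link in the tree is an indirect reference stored inside a dictionary, so walking the document is a series of dictionary lookups followed by `resolve` calls.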

Example:

%PDF-1.4 % The file header, indicating this is a PDF file


% The main part of the PDF file: all indirect objects


1 0 obj % the root node of page tree
<<
/Type /Pages
/Kids [2 0 R] % kids under this node. 
% Can be another page tree node, or a page dictionary
/Count 1
>>
endobj


2 0 obj % a page dictionary
<<
/Type /Page 
/Parent 1 0 R % parent page tree node
/MediaBox [0 0 612 792] % page size (612x792 points)
/Contents 4 0 R % content stream
/Resources << % resource dictionary
/Font << % font list
/Font1 5 0 R
>>
/XObject << % XObject list (including forms and images)
/Image1 6 0 R
>>
>>
>>
endobj


3 0 obj % the catalog dictionary.
<<
/Type /Catalog
/Pages 1 0 R % pointer to the root node of pages tree
>>
endobj


4 0 obj % the page content stream
<<
/Length 0 % should be byte length of the stream data. 
% We use 0 here for convenience.
>> stream
q
% Path example: show a red line across the page
10 w % line width
1 0 0 RG % set to red color
0 592 m % move to the left edge of the page at y = 592
612 592 l % horizontal line to the right edge at y = 592
s % close and stroke the path


% Text example: draw "Hello World!" text
1 w % line width


BT % begin text object
/Font1 70 Tf % set font to /Font1, font size to 70 points
6 Tr % text rendering mode 6: fill, stroke, and add the text to the clipping path
50 700 TD % move text position to 50,700
(Hello World!) Tj % output the text
ET % end text object


% Fill colored rectangles; the text clip set above limits them to the glyph shapes
1 0 0 rg
45 652 70 70 re f
0 1 0 rg
115 652 70 70 re f
0 0 1 rg
185 652 70 70 re f
1 1 0 rg
255 652 70 70 re f
0 1 1 rg
325 652 70 70 re f
1 0 1 rg
375 652 70 70 re f
Q


% Image example: output a colored image


q % save graphic states (including current matrix)
100 0 0 100 300 600 cm % change the current matrix for the image
/Image1 Do % display the image
Q % restore graphic states (including current matrix)


endstream
endobj


5 0 obj % a font dictionary. 
% This is a base-14 font, so only a few entries are required.
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
endobj


6 0 obj % a colored image
<<
/Type /XObject
/Subtype /Image
/Width 8 % pixel width
/Height 8 % pixel height
/ColorSpace /DeviceRGB % color space: each pixel uses R,G,B components
/BitsPerComponent 8 % each component uses one byte
/Length 0 % should be length of image data. 
% Use 0 here for convenience
/Filter /ASCIIHexDecode % use hexadecimal form for convenience
>> stream
FF0000 C00000 A00000 800000 600000 400000 200000 0000FF
FF2000 C00000 A00000 800000 600000 400000 200000 0000C0
FF4000 C00000 A00000 800000 600000 400000 200000 0000A0
FF6000 C00000 A00000 800000 600000 400000 200000 000080
FF8000 C00000 A00000 800000 600000 400000 200000 000060
FFA000 C00000 A00000 800000 600000 400000 200000 000040
FFC000 C00000 A00000 800000 600000 400000 200000 000020
FFFF00 C0C000 A0A000 808000 606000 404000 202000 000000>
endstream
endobj


xref % Cross reference table. 
% Should contain byte offsets of all indirect objects
% We omit the cross reference table here for convenience.


trailer % The trailer part.
<<
/Size 0 % should be the size of cross reference table
% we use 0 here for convenience.
/Root 3 0 R % pointer to the catalog object
>>


startxref
0 % this should be the offset to the cross reference table.
% We use 0 here for convenience.
%%EOF % end of file

Extracting the required parts of the document structure from this example gives:

Trailer - Catalog - Pages - Page - ..

- Page - Contents  - (basic page objects, described with PostScript-like operators)

   - Resources - (resources such as fonts, color spaces, XObjects, and images)

   - MediaBox  - (page size)
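The chain above (Trailer, Catalog, Pages, Page) can be followed with a few regular expressions. The sketch below is a simplification, not a real parser: it assumes a comment-free, trimmed copy of the example (`SAMPLE`), scans for `obj ... endobj` pairs instead of using the cross-reference table, and the helper names `load_objects`, `ref`, and `walk` are hypothetical.

```python
import re

# A trimmed, comment-free copy of the example above (assumption: no nested
# page tree nodes and no inline resource dictionaries).
SAMPLE = b"""%PDF-1.4
1 0 obj
<< /Type /Pages /Kids [2 0 R] /Count 1 >>
endobj
2 0 obj
<< /Type /Page /Parent 1 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>
endobj
3 0 obj
<< /Type /Catalog /Pages 1 0 R >>
endobj
4 0 obj
<< /Length 0 >> stream
endstream
endobj
trailer
<< /Root 3 0 R >>
%%EOF
"""


def load_objects(data):
    """Index every `N G obj ... endobj` body by object number (no xref needed)."""
    pattern = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.S)
    return {int(m.group(1)): m.group(3) for m in pattern.finditer(data)}


def ref(body, key):
    """Return the object number of the indirect reference stored under /key."""
    m = re.search(rb"/" + key + rb"\s+(\d+)\s+\d+\s+R", body)
    if m is None:
        raise KeyError(key.decode("ascii"))
    return int(m.group(1))


def walk(data):
    """Follow Trailer -> Catalog -> Pages -> Page and describe each page."""
    objects = load_objects(data)
    trailer = data[data.rfind(b"trailer"):]
    catalog_num = ref(trailer, b"Root")
    pages_num = ref(objects[catalog_num], b"Pages")
    kids = re.search(rb"/Kids\s*\[(.*?)\]", objects[pages_num], re.S).group(1)
    pages = []
    for num in (int(n) for n in re.findall(rb"(\d+)\s+\d+\s+R", kids)):
        page = objects[num]
        box = re.search(rb"/MediaBox\s*\[(.*?)\]", page, re.S).group(1).split()
        pages.append({
            "page": num,
            "contents": ref(page, b"Contents"),
            "media_box": [float(v.decode("ascii")) for v in box],
        })
    return catalog_num, pages_num, pages


print(walk(SAMPLE))
```

Scanning for `obj` keywords is also roughly how a lenient reader could recover a file whose cross-reference table is missing, although that is an assumption about viewer behavior rather than something the example proves.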

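This makes it clear that parsing a document starts at the file trailer, then moves to the Catalog, then to Pages and Page, and finally to the page objects. This example is not a valid PDF: the xref section has no entries, and the Size and Length values are all 0. Adobe Reader still displays it correctly, because it tolerates such document errors.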
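For comparison, the values the example leaves out (stream Length, the xref entries, Size, and the startxref offset) can all be computed mechanically while the file is written. The following Python sketch shows one way to do that; `make_stream` and `build_pdf` are hypothetical helper names, and the sketch assumes every object has generation number 0 and that objects are numbered consecutively from 1.

```python
def make_stream(data: bytes, extra: bytes = b"") -> bytes:
    """Wrap raw bytes in a stream object body with a correct /Length."""
    return b"<< /Length %d %s>>\nstream\n" % (len(data), extra) + data + b"\nendstream"


def build_pdf(objects: list[bytes], root_num: int) -> bytes:
    """Serialize object bodies and append a matching xref table and trailer.

    `objects[i]` is the body of object number i + 1, generation 0.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_offset = len(out)                  # value written after startxref
    size = len(objects) + 1                 # +1 for the free entry of object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"         # head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off    # every entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_num)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


# The red line from the example as the only page content.
content = b"10 w 1 0 0 RG 0 592 m 612 592 l S"
pdf = build_pdf([
    b"<< /Type /Pages /Kids [2 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 1 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>",
    b"<< /Type /Catalog /Pages 1 0 R >>",
    make_stream(content),
], root_num=3)
```

Writing `pdf` to a file in binary mode should give a document that no longer depends on a viewer's error recovery, since each xref entry points at the first byte of its object.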





