This is a screenshot of the GIOS PDF splitter and merger v1.0, the first open source PDF splitter and merger tool written in C# .NET.
After the success of the GIOS PDF .NET library, released in April 2005, I decided to invest more of my time in the community. Extending and improving the PDF library was one of the things I could do, but which new features should be added?
Well, I have to thank my friend Charles. Last month we were discussing the new features to add to the PDF library. He said: "If you need another challenge, how about developing a PDF merger program?" His words rocked me: there is no free Windows application that does this. Moreover, there is no open source project written in C#. So I picked up Adobe's giant PDF Reference to evaluate whether it could be done.
Reading Adobe's Portable Document Format (PDF) Reference, Third Edition, Version 1.4, Section 3.4, you will find that a PDF file is made of four parts: a header, a body, a cross-reference table and a trailer.
The Body is a tree of generic objects. The Root, or Catalog, is the container of the page containers (Pages objects), which in turn hold the individual pages.
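To make the file structure a little more concrete, here is a minimal sketch, not taken from the GIOS sources, of how a reader can locate the cross-reference table: the last lines of a PDF hold the `startxref` keyword followed by the byte offset of the table. The names `PdfTail` and `FindStartXref`, and the 1024-byte tail size, are assumptions made for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the GIOS sources.
public static class PdfTail
{
    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword is not found in the tail of the file.
    public static long FindStartXref(byte[] pdfBytes)
    {
        // The keyword sits near the end of the file, so only the tail is decoded.
        // 1024 bytes is an arbitrary choice for this sketch.
        int tailLength = Math.Min(pdfBytes.Length, 1024);
        string tail = Encoding.ASCII.GetString(
            pdfBytes, pdfBytes.Length - tailLength, tailLength);

        Match last = null;
        foreach (Match m in Regex.Matches(tail, @"startxref\s+(\d+)"))
            last = m; // keep the last occurrence

        return last == null ? -1 : long.Parse(last.Groups[1].Value);
    }
}
```

Calling `PdfTail.FindStartXref(File.ReadAllBytes(path))` gives the position from which the cross-reference table, and after it the trailer, can be read.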
We have to point out what we need to change in order to split (merge) a PDF:
This is the schema of splitting a document of three pages into a new PDF made (in order) from the third and the first page of the original document:
The application works with these engines: the objects parser, the splitter and the merger.
The objects parser parses the lines of the PDF and stores the objects in memory recognizing their types.
I'm really not proud of my object parser. It's not the best, but it works. Here is an extract of my code in which the object searches its own content for some matches in order to determine its type. I've seen better parsers here, for example in the article A pdf Forms parser. If you are a purist coder, don't look inside! ;-)
The use of Regex here is not necessary, but it's surely a more elegant way of searching for string matches:
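// "/Pages" also contains "/Page", so a Page is anything that matches
// the first pattern but not the second.
if (Regex.IsMatch(s, @"/Page") && !Regex.IsMatch(s, @"/Pages"))
{
    this.type = PdfObjectType.Page;
    return this.type;
}
if (Regex.IsMatch(s, @"stream"))
{
    this.type = PdfObjectType.Stream;
    return this.type;
}
// The document information dictionary carries these keys.
if (Regex.IsMatch(s, @"(/Creator)|(/Author)|(/Producer)"))
{
    this.type = PdfObjectType.Info;
    return this.type;
}
this.type = PdfObjectType.Other;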
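Before an object can guess its own type, the file has to be cut into numbered objects. The sketch below shows one way to do that with a single regular expression instead of the line-by-line parsing used in the project. It is an illustration only: `ObjectScanner` and `Scan` are hypothetical names, and it assumes that no stream data contains the text `endobj` and that objects are not packed into compressed object streams.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the GIOS sources.
public static class ObjectScanner
{
    // Matches "<number> <generation> obj ... endobj"; the body is captured lazily
    // so that each match stops at the first "endobj" that follows.
    private static readonly Regex ObjectPattern = new Regex(
        @"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", RegexOptions.Singleline);

    // Returns a map from object number to the raw text of the object body.
    public static Dictionary<int, string> Scan(string pdfText)
    {
        var objects = new Dictionary<int, string>();
        foreach (Match m in ObjectPattern.Matches(pdfText))
        {
            int number = int.Parse(m.Groups[1].Value);
            // A later definition of the same number replaces an earlier one.
            objects[number] = m.Groups[3].Value.Trim();
        }
        return objects;
    }
}
```

Each body returned by `Scan` could then be handed to a type check like the extract above.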
The splitter takes a collection of objects (input) and returns a collection of objects (output).
The input is provided by the objects parser, and the output is basically a filtered list of the original objects. This is how it works:
This is a recursive method in PdfFileObject.cs used for exploring its children:
internal void PopulateRelatedObjects(PdfFile PdfFile, Hashtable container)
{
    // Find every indirect reference "N 0 R"; [^G] skips the "RG" color operator.
    Match m = Regex.Match(this.OriginalText, @"\d+ 0 R[^G]");
    while (m.Success)
    {
        int num = int.Parse(m.Value.Substring(0, m.Value.IndexOf(" ")));
        // Never follow the /Parent link, or the whole page tree would be pulled in.
        bool notparent = !Regex.IsMatch(this.OriginalText, @"/Parent\s+" + num + " 0 R");
        if (notparent && !container.Contains(num))
        {
            PdfFileObject pfo = PdfFile.LoadObject(num);
            // && short-circuits, so pfo.number is not read when pfo is null.
            if (pfo != null && !container.Contains(pfo.number))
            {
                container.Add(num, null);
                pfo.PopulateRelatedObjects(PdfFile, container);
            }
        }
        m = m.NextMatch();
    }
}
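The same traversal can be tried in isolation with a small, self-contained sketch. It is not the project's code: `ReferenceWalker` and `Collect` are hypothetical names, the objects live in a plain dictionary instead of a `PdfFile`, a negative lookahead `(?!G)` replaces `[^G]` so that a reference at the very end of the text is also found, and, unlike the method above, the starting object is included in the result.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the GIOS sources.
public static class ReferenceWalker
{
    // objects maps an object number to its raw text.
    // Returns the numbers of every object reachable from 'start',
    // never climbing back up through a /Parent link.
    public static HashSet<int> Collect(Dictionary<int, string> objects, int start)
    {
        var visited = new HashSet<int>();
        Visit(objects, start, visited);
        return visited;
    }

    private static void Visit(
        Dictionary<int, string> objects, int number, HashSet<int> visited)
    {
        string text;
        if (!objects.TryGetValue(number, out text) || !visited.Add(number))
            return; // unknown object, or already collected

        foreach (Match m in Regex.Matches(text, @"(\d+) 0 R(?!G)"))
        {
            int child = int.Parse(m.Groups[1].Value);
            bool isParent = Regex.IsMatch(text, @"/Parent\s+" + child + @" 0 R");
            if (!isParent)
                Visit(objects, child, visited);
        }
    }
}
```

Starting from a Page object, `Collect` returns the page together with its contents and resources, but neither its parent Pages object nor the sibling pages.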
The merger is a simple class that appends the output of each splitter and writes the necessary objects (in our example, objects 17 and 18). It also writes the header, the cross-reference table and the trailer. Take a look at PdfSplitterMerger.cs; it's very simple.
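As an illustration of that last step, here is a minimal sketch of writing a header, a list of objects, the cross-reference table and the trailer while tracking byte offsets. It is not the code in PdfSplitterMerger.cs: `MinimalPdfWriter` and `Write` are hypothetical names, and the sketch assumes that objects are numbered consecutively from 1, all have generation 0, and contain only ASCII text.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the GIOS sources.
public static class MinimalPdfWriter
{
    // bodies[i] is the text of object number i + 1; rootNumber is the Catalog.
    public static byte[] Write(IList<string> bodies, int rootNumber)
    {
        var output = new MemoryStream();
        var offsets = new List<long>();

        Append(output, "%PDF-1.4\n"); // header
        for (int i = 0; i < bodies.Count; i++)
        {
            offsets.Add(output.Position); // byte offset where this object starts
            Append(output, (i + 1) + " 0 obj\n" + bodies[i] + "\nendobj\n");
        }

        long xrefStart = output.Position;
        Append(output, "xref\n0 " + (bodies.Count + 1) + "\n");
        Append(output, "0000000000 65535 f \n"); // entry 0 heads the free list
        foreach (long offset in offsets)
            Append(output, offset.ToString("D10") + " 00000 n \n"); // 20 bytes each

        Append(output, "trailer\n<< /Size " + (bodies.Count + 1)
            + " /Root " + rootNumber + " 0 R >>\n");
        Append(output, "startxref\n" + xrefStart + "\n%%EOF\n");
        return output.ToArray();
    }

    private static void Append(Stream output, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        output.Write(bytes, 0, bytes.Length);
    }
}
```

Offsets are recorded in bytes, before each object is written, because the cross-reference table lets a reader jump straight to an object without scanning the file.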
I hope this project is useful for non-coders too. Splitting and merging documents should be free. Let's hope that projects like this one, which demystify the PDF format, will bear fruit in the near future.