Splitting a PDF File into Individual PDFs
Article Update 13-March-2020: I removed the full source code and the code snippets. The article that remains should act as a "design roadmap" for members who want to write the code in the programming language of their choice. If you are interested in discussing the program further, please contact me via the EE message system.
INTRODUCTION
This Article is a follow-up to the Article entitled How To Rename-Move a Batch of PDF Files Based on Contents of the Files, recently published here at Experts Exchange.
I considered adding the new feature (splitting a single document into multiple documents) to that Article and program, but concluded that it is a significant enough enhancement to warrant a new Article and program.
PREVIOUS ARTICLE
To understand this Article, it will be helpful to read the previous Article, but to get started right away, here's a summary of the previous problem and solution.
There is a large batch of PDF files, all with cryptic names, such as [D123456.PDF]. Inside each file on the first line of the first page (always starting at a fixed column and running to the end of the line) is a human-friendly identifier for the file, such as [John Smith]. The requirement is to loop through all of the files in a specified folder in an automated fashion, changing the file names from, for example,
D123456.PDF
to
D123456 John Smith.PDF
That is, add the identifier from the first line of the first page to the file name.
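As a rough illustration of that rename rule (a Python sketch, not the original AutoHotkey code), the identifier can be sliced from the first line at a fixed 1-based column and appended to the file name. The function names and the column value 16 are illustrative assumptions.

```python
from pathlib import Path


def extract_identifier(first_line: str, start_col: int) -> str:
    """Return the text from the 1-based column start_col to the end of the line."""
    return first_line[start_col - 1:].strip()


def renamed(original_name: str, identifier: str) -> str:
    """Append the identifier to the file name, keeping the original extension."""
    p = Path(original_name)
    return f"{p.stem} {identifier}{p.suffix}"


# Hypothetical first line: 15 spaces, then the identifier starting in column 16.
line = " " * 15 + "John Smith"
print(renamed("D123456.PDF", extract_identifier(line, 16)))
# prints: D123456 John Smith.PDF
```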
NEW REQUIREMENT
Following publication of the previous Article and the program that implements the solution, the Original Poster (OP) of the question that prompted the Article asked if an enhancement is possible. Specifically, a single PDF file may really contain several separate documents, and the OP wants the program to split the single PDF into multiple PDFs. For example, pages 1 to 3 of [D123456.PDF] may be an invoice for John Smith, while page 4 may be a different invoice, and pages 5 to 6 yet another invoice. With the previous program, the 6-page [D123456.PDF] would simply be renamed to [D123456 John Smith.PDF], still containing all six pages (three invoices). The OP wants the program to split the original PDF file and create three PDFs, one for each of the invoices. The program still has to rename the files based on content, but, in addition, has to provide a suffix for the multiple files, such as
D123456 John Smith-1.PDF
D123456 John Smith-2.PDF
D123456 John Smith-3.PDF
INSTALLATION INSTRUCTIONS FOR REQUIRED SOFTWARE
The previous solution requires two excellent freeware products – the AutoHotkey scripting language (the program is written in this) and [pdftotext.exe] from the Xpdf package to convert the PDF files to text (so the program can extract the identifying names for renaming the files). This new solution requires another excellent freeware product – PDFtk (the PDF Toolkit) from PDF Labs.
Here are the steps for installation of these three packages:
(1) AutoHotkey – http://ahkscript.org (also, see my EE article: AutoHotkey - Getting Started)
Click the Download button at the page above, save the install file, and then run it.
(2) Xpdf – http://www.foolabs.com/xpdf/download.html
Click the [xpdfbin-win-3.03.zip] link at the page above to download the Windows files. Unzip the zip file and there will be folders for 32-bit Windows (bin32) and 64-bit Windows (bin64). Be sure to select the right folder for your version of Windows (32-bit or 64-bit) and copy the file called [pdftotext.exe] to wherever you want (the Xpdf binaries are "no-install" executables). The script will automatically find it if you put it in [Program Files\xpdf\] or [Program Files (x86)\xpdf\], but if you put it somewhere else, that's fine – the script gives you a browse-for-file dialog so you may navigate to it.
(3) PDFtk – http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Click the [pdftk_server-1.45-windows-setup.msi] link at the page above, save the install file, and then run it. It will create a folder called [Program Files\PDF Labs\PDFtk Server\] or [Program Files (x86)\PDF Labs\PDFtk Server\] with a [bin] folder that contains two files – [pdftk.exe] and [libiconv2.dll]. If you'd like to move those two files, that's fine. The script automatically finds them if you leave them where the installer put them, but if you move them somewhere else, it gives you a browse-for-file dialog so you may navigate to them (place both files in the same folder).
ASSUMPTIONS FOR NEW PROGRAM
All of the assumptions for the previous program apply to the new program, namely:
There is a fixed number of characters in the original file name (before the ".pdf"). For example, with file names like [D123456.PDF], that number is 7.
There is a fixed starting column number for the string that will be in the new file name (and it runs to the end of the line). In other words, following the examples above, this is the column number where "John Smith" begins (for the OP, this is 16).
The user specifies the source and destination folders. If they are the same, the program does just a Rename; if they are different, the program does a Rename and a Move.
Here is the assumption unique to the new program:
The first line of a page contains a string that identifies it as a new document. It is a fixed string (specified by the user) beginning in a fixed column (also specified by the user). An example is that the first line of the first page of an invoice contains "Customer Name:" beginning in column 5, while all subsequent pages of that same invoice do NOT contain "Customer Name:" beginning in column 5.
So the program reads the first line of each page and if it contains the specified new document identifier/separator string (such as "Customer Name:" or "Client Name-" or "Account Number") in the specified starting column (such as 1 or 5 or 10), then it knows this is the first page of a new document; if it does not, then it knows this is a continuation page of the current document.
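The test described above comes down to a fixed-column prefix check. Here is a minimal Python sketch, assuming columns are counted from 1 and the match must begin exactly at the specified column. The function name is hypothetical.

```python
def starts_new_document(first_line: str, separator: str, start_col: int) -> bool:
    """True if separator appears in first_line beginning exactly at 1-based start_col."""
    return first_line.startswith(separator, start_col - 1)


first_page = "    Customer Name: John Smith"   # separator begins in column 5
continuation = "    Item        Qty    Price"  # no separator in column 5

print(starts_new_document(first_page, "Customer Name:", 5))    # prints: True
print(starts_new_document(continuation, "Customer Name:", 5))  # prints: False
```

A line shorter than the starting column simply fails the check, so blank first lines are treated as continuation pages in this sketch.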
HOW TO RUN THIS PROGRAM
Download the attached file called Batch-Mass-Split-Rename-Move-PDF-Files.ahk. After downloading it, you may run it by simply double-clicking on it in Windows Explorer or whatever file manager you use. Since its file type is AHK, AutoHotkey will be launched to process it. If you prefer, the file may be turned into an executable via the AutoHotkey compiler, which is installed during the standard installation of AutoHotkey. If you right-click on an AHK file in Windows Explorer or whatever file manager you use, there will be a context menu pick called Compile Script. Select that and it will create an EXE file, which is a stand-alone, no-install executable of the AHK program.
HOW THE PROGRAM WORKS
For those interested in understanding how the script works, the remainder of this Article shows some code snippets, with a description of what each snippet does, including screenshots where appropriate (this also acts as a form of documentation for the program). However, it does not include code snippets that are the same, or substantially the same, as the code snippets in the previous program, which have already been discussed in the previous Article.
Code snippet:
removed
What it does: Although this code is similar to the "Starting Column" code in the previous script, I decided to document it here, as it is part of the major enhancement in this script. This code asks the user for the starting column number of the new document identifier/separator string. If the entry is not an integer and/or not greater than zero, it displays a message and gives the user the opportunity to try again or exit.
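For readers writing their own version, the validation might look like the following Python sketch. The function name is an assumption, and the retry-or-exit dialog is left to the caller.

```python
from typing import Optional


def parse_start_column(entry: str) -> Optional[int]:
    """Return the column as an int if entry is a whole number greater than zero, else None."""
    text = entry.strip()
    if text.isascii() and text.isdigit() and int(text) > 0:
        return int(text)
    return None


for attempt in ("16", "0", "-3", "2.5", "abc"):
    print(repr(attempt), "->", parse_start_column(attempt))
```

A caller would loop on a `None` result, showing the message and offering another try or an exit.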
Code snippet:
removed
What it does: Asks the user to enter the new document identifier/separator string, which is used to split multiple documents that are in a single PDF file into multiple PDFs. It also gives the user the opportunity to exit the program.
The confirmation dialog is similar to the previous program, but the differences are worth noting here:
Code snippet:
removed
What it does: Calls PDFtk to write a text file (known as dump_data) that contains various information about the PDF file. One of the items that it writes to the dump_data file is the number of pages in the PDF file. If PDFtk returns an error code, a Fatal Error dialog is displayed with some helpful information to troubleshoot the error.
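A hedged Python sketch of this step follows. It uses PDFtk's `dump_data` operation with `output`. The helper names and the choice to raise an exception, rather than show a dialog, are assumptions of the sketch.

```python
import subprocess


def dump_data_command(pdftk_exe: str, pdf_path: str, out_path: str) -> list:
    """Build the PDFtk command line that writes the dump_data report to out_path."""
    return [pdftk_exe, pdf_path, "dump_data", "output", out_path]


def write_dump_data(pdftk_exe: str, pdf_path: str, out_path: str) -> None:
    """Run PDFtk and raise with troubleshooting details if it returns a non-zero exit code."""
    cmd = dump_data_command(pdftk_exe, pdf_path, out_path)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"PDFtk failed (exit code {result.returncode}) on {pdf_path}: "
            f"{result.stderr.strip()}"
        )


print(dump_data_command("pdftk", "D123456.PDF", "dump_data.txt"))
```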
Code snippet:
removed
What it does: Reads all of the lines in the dump_data file looking for the "NumberOfPages:" line. If it finds the line, it stores the number of pages in a variable (numpages); if it doesn't find the line, it displays a Fatal Error dialog.
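In Python, the scan could be sketched as below. The sample text is made up for illustration, and the exception stands in for the Fatal Error dialog.

```python
def number_of_pages(dump_data_text: str) -> int:
    """Return the page count from the 'NumberOfPages:' line of a dump_data report."""
    for line in dump_data_text.splitlines():
        if line.startswith("NumberOfPages:"):
            return int(line.split(":", 1)[1])
    raise ValueError("NumberOfPages: line not found in dump_data")


# Illustrative dump_data excerpt (not taken from a real file).
sample = "InfoKey: Creator\nInfoValue: Example\nNumberOfPages: 6\n"
numpages = number_of_pages(sample)
print(numpages)  # prints: 6
```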
Code snippet:
removed
What it does: Loops through all of the pages of the current PDF file, calling [pdftotext.exe] to write the contents of each PDF page, one at a time, to a text file. If [pdftotext.exe] returns an error code, a Fatal Error dialog is displayed with some helpful information to troubleshoot the error.
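One way to sketch this loop in Python is shown below. The `-f` and `-l` options restrict pdftotext to a single page. Using `-layout` to preserve column positions and reading the output as Latin-1 are assumptions of this sketch, not details confirmed by the article.

```python
import subprocess
from pathlib import Path


def page_text_command(pdftotext_exe: str, pdf_path: str, page: int, txt_path: str) -> list:
    """Build the pdftotext command line that converts exactly one page to text."""
    return [pdftotext_exe, "-f", str(page), "-l", str(page), "-layout", pdf_path, txt_path]


def first_lines(pdftotext_exe: str, pdf_path: str, numpages: int, txt_path: str) -> list:
    """Convert each page in turn and collect the first line of every page."""
    lines = []
    for page in range(1, numpages + 1):
        cmd = page_text_command(pdftotext_exe, pdf_path, page, txt_path)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"pdftotext failed on page {page}: {result.stderr.strip()}")
        text = Path(txt_path).read_text(encoding="latin-1")
        page_lines = text.splitlines()
        lines.append(page_lines[0] if page_lines else "")
    return lines


print(page_text_command("pdftotext", "D123456.PDF", 4, "page.txt"))
```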
Code snippet:
removed
What it does: Checks the first line of the page starting at the specified column to see if it contains the new document identifier/separator string. If it does, then this page begins a new document, and if it isn't the first document in the file, then it calls PDFtk (with the "shuffle" and "output" parameters) to write out the previous document to a new PDF file with a unique suffix. It also increments the suffix for the next new document. If PDFtk returns an error code, a Fatal Error dialog is displayed with some helpful information to troubleshoot the error.
Code snippet:
removed
What it does: If it is the first document in the PDF file, it sets the suffix to 1 (of course, there is no prior document to write out).
Code snippet:
removed
What it does: For any new document, whether or not the first one in the current PDF file, it renames/moves it to the destination folder.
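The naming scheme for the split files can be sketched in Python as follows. The helper names and the default ".PDF" extension are assumptions chosen to match the examples earlier in the Article.

```python
from pathlib import Path


def split_file_name(stem: str, identifier: str, suffix_number: int, extension: str = ".PDF") -> str:
    """Build a name such as 'D123456 John Smith-1.PDF'."""
    return f"{stem} {identifier}-{suffix_number}{extension}"


def destination_path(dest_folder: str, stem: str, identifier: str, suffix_number: int) -> Path:
    """Join the destination folder and the generated file name."""
    return Path(dest_folder) / split_file_name(stem, identifier, suffix_number)


for n in (1, 2, 3):
    print(split_file_name("D123456", "John Smith", n))
# prints:
# D123456 John Smith-1.PDF
# D123456 John Smith-2.PDF
# D123456 John Smith-3.PDF
```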
Code snippet:
removed
What it does: If this page does not have the new document identifier/separator string starting in the specified column, then it is a continuation page, that is, part of the current document. The only action for this is to build up the "shuffle" parameter for the call to PDFtk.
Code snippet:
removed
What it does: Writes out the last document in the PDF file when there are no more pages to process.
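Putting the page-by-page logic together, here is a Python sketch that groups pages into documents and builds one PDFtk command per document. The original script builds a "shuffle" parameter. This sketch swaps in PDFtk's `cat` operation with a page range, which should select the same pages for a contiguous run. Treating a first page that lacks the separator as the start of a document is also an assumption.

```python
def group_pages(first_lines, separator, start_col):
    """Return (first_page, last_page) tuples, one per document, using 1-based page numbers."""
    documents = []
    for page, line in enumerate(first_lines, start=1):
        if line.startswith(separator, start_col - 1) or not documents:
            documents.append([page, page])   # this page begins a new document
        else:
            documents[-1][1] = page          # continuation page extends the current one
    return [tuple(doc) for doc in documents]


def cat_command(pdftk_exe, pdf_path, first_page, last_page, out_path):
    """Build the PDFtk command that writes pages first_page..last_page to out_path."""
    pages = str(first_page) if first_page == last_page else f"{first_page}-{last_page}"
    return [pdftk_exe, pdf_path, "cat", pages, "output", out_path]


# Six hypothetical pages matching the invoice example: pages 1-3, 4, and 5-6.
new, cont = "    Customer Name: X", "    Item   Qty"
ranges = group_pages([new, cont, cont, new, new, cont], "Customer Name:", 5)
print(ranges)  # prints: [(1, 3), (4, 4), (5, 6)]
```

Because the last range is closed when the page list runs out, the final document needs no special case in this formulation.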
Code snippet:
removed
What it does: The previous program and this one both write out an Operation Completed dialog with statistics from the run, as shown above. The difference in this new program is that it offers to save the operational statistics in a text file. If the user says Yes, it creates a file with the name
Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt in the destination folder (where YYYY-MM-DD_HH.MM.SS are the ending date and time of the run). The text file looks like this:
Operational Statistics from Batch-Mass-Split-Rename-Move-PDF-Files
Beginning date and time: 2013-02-11/18:19:22
Number of PDF files processed: 1,969
Number of non-PDF files ignored: 14
Ending date and time: 2013-02-11/18:29:24
Elapsed time (minutes:seconds): 10:2
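The file name and the elapsed time can be derived from the two timestamps. In this Python sketch the function names are hypothetical. Unlike the sample above, it zero-pads the seconds.

```python
from datetime import datetime


def stats_file_name(end: datetime) -> str:
    """Build Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt from the ending time."""
    return end.strftime("Operational_Statistics_%Y-%m-%d_%H.%M.%S.txt")


def elapsed(begin: datetime, end: datetime) -> str:
    """Format the run time as minutes:seconds with zero-padded seconds."""
    minutes, seconds = divmod(int((end - begin).total_seconds()), 60)
    return f"{minutes}:{seconds:02d}"


begin = datetime(2013, 2, 11, 18, 19, 22)
end = datetime(2013, 2, 11, 18, 29, 24)
print(stats_file_name(end))  # prints: Operational_Statistics_2013-02-11_18.29.24.txt
print(elapsed(begin, end))   # prints: 10:02
```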
That's it! I hope this helps the OP as well as other EE members. Although I did a bit of generalization, I realize that the solution is still rather specific to the OP's requirements. However, by providing the source code, I hope that other folks with similar needs will be able to modify the program to suit their purposes.
If you find this article to be helpful, please click the thumbs-up icon below. This lets me know what is valuable for EE members and provides direction for future articles. Thanks very much! Regards, Joe
Translated from: https://www.experts-exchange.com/articles/11211/How-To-Split-Rename-Move-a-Batch-of-PDF-Files-Based-on-Contents-of-the-Files.html