Download xpdf and xpdf-chinese-simplified.tar.gz, then extract xpdf-chinese-simplified.tar.gz into the directory where xpdf is installed so that it forms a subdirectory there.
http://www.foolabs.com/xpdf/download.html
Configuration notes for the Chinese package (from its README):
Xpdf: Chinese Simplified support package
========================================
Xpdf project: http://www.foolabs.com/xpdf/
2004-jul-27
If this package includes CMap files, they contain their own copyright
notices and distribution conditions. All other files in the package
are Copyright 2002-2004 Glyph & Cog, LLC, and are licensed under the
GNU General Public License (GPL), version 2.
This package provides support files needed to use the Xpdf tools with
Chinese (Simplified) PDF files.
Contents:
- Adobe-GB1 character collection support
- ISO-2022-CN encoding
- EUC-CN encoding
- GBK encoding
Place all of these files in a directory, typically:
Unix - /usr/local/share/xpdf/chinese-simplified
Win32 - C:\Program Files\xpdf\chinese-simplified
Add the contents of the "add-to-xpdfrc" file to your system-wide
xpdfrc config file, which is typically:
Unix - /usr/local/etc/xpdfrc
Win32 - C:\Program Files\xpdf\xpdfrc
Alternatively, on Unix systems you can add these lines to your
personal xpdfrc file in $HOME/.xpdfrc.
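xpdf runs on a range of platforms: the download page lists the machines for which precompiled binaries are available, as well as systems on which xpdf has reportedly compiled successfully (but for which binaries are not available on the net).
xpdf is more adaptable than PDFBox: it can parse English PDFs as well as PDFs that contain Chinese. However, xpdf is really a set of command-line tools, so it has to be run from the command line.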
The following shows how to parse an English PDF from the command line.
The command is:
D:\workspace\testsearch2\xpdf>pdftotext ../htmls/xxxx.pdf xxxx.txt
Edit the xpdfrc file:
cidToUnicode Adobe-GB1 D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\ISO-2022-CN.unicodeMap
unicodeMap EUC-CN D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\EUC-CN.unicodeMap
unicodeMap GBK D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\GBK.unicodeMap
cMapDir Adobe-GB1 D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\CMap
toUnicodeDir D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\CMap
fontDir c:\windows\Fonts
displayCIDFontTT Adobe-GB1 c:\windows\fonts\SimHei.ttf
textEOL dos
On Linux, open the add-to-xpdfrc file and copy its contents into xpdfrc.
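Every path in the entries above depends on where the language package was extracted, so the lines can be generated instead of typed by hand. The following is a minimal sketch under that assumption. The class name `XpdfrcLines` and its method are hypothetical helpers, not part of xpdf. It covers only the language-package entries (not fontDir, displayCIDFontTT, or textEOL) and does not handle directories whose names contain spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of xpdf): builds the xpdfrc entries for the
// Simplified Chinese package from the directory it was extracted into.
final class XpdfrcLines {

    // packageDir: directory holding the extracted package files.
    // separator:  path separator to use, e.g. "\\" on Windows or "/" on Unix.
    static List<String> forChineseSimplified(String packageDir, String separator) {
        String base = packageDir.endsWith(separator) ? packageDir : packageDir + separator;
        List<String> lines = new ArrayList<>();
        lines.add("cidToUnicode Adobe-GB1 " + base + "Adobe-GB1.cidToUnicode");
        // One unicodeMap entry per encoding shipped in the package.
        for (String enc : new String[] {"ISO-2022-CN", "EUC-CN", "GBK"}) {
            lines.add("unicodeMap " + enc + " " + base + enc + ".unicodeMap");
        }
        lines.add("cMapDir Adobe-GB1 " + base + "CMap");
        lines.add("toUnicodeDir " + base + "CMap");
        return lines;
    }

    public static void main(String[] args) {
        // Same directory as in the entries shown above.
        String dir = "D:\\workspace\\testsearch2\\xpdf\\xpdf-chinese-simplified";
        for (String line : forChineseSimplified(dir, "\\")) {
            System.out.println(line);
        }
    }
}
```

The printed lines can then be pasted into xpdfrc, followed by the font and end-of-line settings.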
To parse a Chinese PDF, an encoding option must be added (the same -enc GBK option also works for English documents):
D:\workspace\testsearch2\xpdf>pdftotext -layout -enc GBK ..\htmls\readme.pdf
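With -enc GBK the text file contains GBK-encoded bytes, so a program that later reads it should decode with the same charset rather than rely on the platform default. A minimal sketch, assuming the JDK in use ships the GBK charset (standard desktop JDKs do). The class `GbkTextReader` is a hypothetical helper, and the temporary file only stands in for real pdftotext output.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: decodes a text file that was produced with "-enc GBK".
final class GbkTextReader {

    // The charset must match the encoding name that was passed to -enc.
    private static final Charset GBK = Charset.forName("GBK");

    static String read(Path txtFile) throws IOException {
        return new String(Files.readAllBytes(txtFile), GBK);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for pdftotext output: a temporary file holding GBK bytes.
        Path tmp = Files.createTempFile("xpdf-sample", ".txt");
        try {
            // "\u4e2d\u6587" is the word for "Chinese", written as escapes so the
            // source file itself stays ASCII.
            Files.write(tmp, "\u4e2d\u6587 PDF".getBytes(GBK));
            System.out.println(read(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```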
The main options are:
OPTIONS
Many of the following options can be set with configuration file
commands. These are listed in square brackets with the description of
the corresponding command line option.
-f number
Specifies the first page to convert.
-l number
Specifies the last page to convert.
-layout
Maintain (as best as possible) the original physical layout of
the text. The default is to 'undo' physical layout (columns,
hyphenation, etc.) and output the text in reading order.
-fixed number
Assume fixed-pitch (or tabular) text, with the specified
character width (in points). This forces physical layout mode.
-raw Keep the text in content stream order. This is a hack which
often "undoes" column formatting, etc. Use of raw mode is no
longer recommended.
-htmlmeta
Generate a simple HTML file, including the meta information.
This simply wraps the text in <pre> and </pre> and prepends the
meta headers.
-enc encoding-name
Sets the encoding to use for text output. The encoding-name
must be defined with the unicodeMap command (see xpdfrc(5)).
The encoding name is case-sensitive. This defaults to "Latin1"
(which is a built-in encoding). [config file: textEncoding]
Note: the Simplified Chinese package provides only the following three encodings:
ISO-2022-CN
EUC-CN
GBK
-eol unix | dos | mac
Sets the end-of-line convention to use for text output. [config
file: textEOL]
-nopgbrk
Don't insert page breaks (form feed characters) between pages.
[config file: textPageBreaks]
-opw password
Specify the owner password for the PDF file. Providing this
will bypass all security restrictions.
-upw password
Specify the user password for the PDF file.
-q Don't print any messages or errors. [config file: errQuiet]
-cfg config-file
Read config-file in place of ~/.xpdfrc or the system-wide config
file.
-v Print copyright and version information.
-h Print usage information. (-help and --help are equivalent.)
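When pdftotext is driven from a program, the options above end up as separate elements of an argument list. The sketch below shows one way to assemble such a list. It uses only flags that appear in the option list above; the class `PdfToTextCommand` and its parameter names are hypothetical. It only builds the arguments and does not run anything.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles a pdftotext argument list from a few options.
final class PdfToTextCommand {

    // firstPage / lastPage: values <= 0 mean "do not pass -f / -l".
    // encoding: null means "do not pass -enc" (pdftotext then uses its default).
    static List<String> build(String executable, int firstPage, int lastPage,
                              boolean layout, String encoding,
                              String pdfFile, String txtFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        if (firstPage > 0) {
            cmd.add("-f");
            cmd.add(String.valueOf(firstPage));
        }
        if (lastPage > 0) {
            cmd.add("-l");
            cmd.add(String.valueOf(lastPage));
        }
        if (layout) {
            cmd.add("-layout");
        }
        if (encoding != null) {
            cmd.add("-enc");
            cmd.add(encoding); // case-sensitive, must be defined in xpdfrc
        }
        // Options come first, then the input and output files.
        cmd.add(pdfFile);
        cmd.add(txtFile);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdftotext", 1, 3, true, "GBK", "readme.pdf", "readme.txt");
        // prints: pdftotext -f 1 -l 3 -layout -enc GBK readme.pdf readme.txt
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping each flag and value as its own list element avoids quoting problems when a file path contains spaces.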
Next, we wrap the command line in a Java class:
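package extract;

import java.io.IOException;

public class ExtractorCJKPDF {
    // Directory that contains the pdftotext executable
    private static final String XPDF_PATH = "D:/workspace/testsearch2/xpdf/";

    /**
     * Converts a PDF file to text by running pdftotext.
     *
     * @param pdffile path of the PDF file to read
     * @param txtfile path of the text file to write
     */
    public static void pdf2text(String pdffile, String txtfile) throws IOException, InterruptedException {
        // -layout keeps the original layout, -enc sets the output encoding,
        // -nopgbrk suppresses the page breaks between pages
        String[] cmd = new String[] {XPDF_PATH + "pdftotext", "-layout", "-enc", "GBK", "-nopgbrk", pdffile, txtfile};
        // inheritIO() forwards any pdftotext messages to this program's console
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        // Wait for pdftotext to finish so the text file is complete on return
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
    }

    public static void main(String[] args) {
        try {
            pdf2text("D:/workspace/testsearch2/htmls/123.pdf", "D:/workspace/testsearch2/htmls/123.txt");
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}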
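The class above converts one hard-coded file. To convert a whole directory, each PDF needs a matching output path. The following sketch shows only that planning step, under the assumption that each text file should sit next to its PDF with the extension swapped. `BatchPlan` and its methods are hypothetical names, and the sketch only prints the pairs rather than invoking pdftotext.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: works out which text file each PDF in a directory maps to.
final class BatchPlan {

    // Maps "dir/name.pdf" to "dir/name.txt"; the extension check ignores case.
    static Path txtPathFor(Path pdf) {
        String name = pdf.getFileName().toString();
        if (!name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            throw new IllegalArgumentException("Not a PDF file: " + pdf);
        }
        String base = name.substring(0, name.length() - ".pdf".length());
        return pdf.resolveSibling(base + ".txt");
    }

    // Lists the PDF files directly inside dir (no recursion into subdirectories).
    static List<Path> listPdfs(Path dir) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.pdf")) {
            for (Path p : stream) {
                pdfs.add(p);
            }
        }
        return pdfs;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : listPdfs(dir)) {
            System.out.println(pdf + " -> " + txtPathFor(pdf));
        }
    }
}
```

Each printed pair could then be passed to a method such as pdf2text, one file at a time.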