Lucene: Handling Chinese PDFs with xpdf


A simple way to handle Chinese PDFs is xpdf:

http://www.foolabs.com/xpdf/home.html

2.

Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a decent C++ compiler.

Precompiled binaries are available for the following machines:

  • x86, Linux (glibc 2.2, statically linked to Motif, t1lib, and FreeType):

    xpdf-3.02pl2-linux.tar.gz (6615400 bytes)

  • SPARC, Solaris 10 (statically linked to t1lib and FreeType):

    xpdf-3.02pl2-solaris10-sparc.tar.gz (8883747 bytes)

  • x64, Solaris 10 (statically linked to t1lib and FreeType):

    xpdf-3.02pl2-solaris10-x64.tar.gz (9399494 bytes)

  • x86, DOS/Win32 -- pdftops, pdftotext, pdfimages, pdfinfo, and pdffonts only:

    Win32 (built with MSVC): xpdf-3.02pl2-win32.zip (2027995 bytes)

    DOS6 (built with djgpp, with DPMI support from csdpmi5b): xpdf-3.02pl2-dos6.zip (1745421 bytes)

3. Convert the PDF document to TXT with the pdftotext program that ships with Xpdf. This is a standalone program, independent of Lucene.
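The conversion step can be driven from Java before the text is handed to Lucene. The following is only a minimal sketch, not part of Xpdf or Lucene: the class name `PdfTextExtractor` and its methods are hypothetical, it assumes the `pdftotext` binary is on the PATH, and it passes `-enc UTF-8` so that the output is written as UTF-8. Depending on the PDF, Xpdf may also need its Chinese language support package configured in xpdfrc; that setup is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper that shells out to Xpdf's pdftotext.
public class PdfTextExtractor {

    // Builds the command line: pdftotext -enc UTF-8 <pdf> <txt>.
    // "-enc UTF-8" asks pdftotext to write its output as UTF-8.
    static List<String> buildCommand(String pdftotext, Path pdf, Path txt) {
        return List.of(pdftotext, "-enc", "UTF-8", pdf.toString(), txt.toString());
    }

    // Runs pdftotext on one PDF and returns the extracted text.
    static String extract(String pdftotext, Path pdf)
            throws IOException, InterruptedException {
        Path txt = Files.createTempFile("pdftotext-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(pdftotext, pdf, txt))
                    .inheritIO() // let pdftotext's error messages reach the console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("pdftotext exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfTextExtractor <file.pdf>");
            return;
        }
        // Assumption: the pdftotext executable is reachable via the PATH.
        System.out.println(extract("pdftotext", Paths.get(args[0])));
    }
}
```

The returned string is plain text, so it can be fed to whatever Lucene indexing code is already in place; keeping the extraction in a separate process matches the point above that pdftotext is independent of Lucene.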
