Extracting Chinese Text from PDFs with XPDF


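1. Download XPDF from: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip

2. Download the fonts Gbsn00lp.ttf and gkai00mp.ttf. Download address: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-simplified.tar.gz

3. Unpack XPDF and the fonts, then put the fonts in the xpdf/chinese-simplified/CMap directory.

4. Edit the paths in the add-to-xpdfrc file, changing them to the install paths on your own machine: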

#----- begin Chinese Simplified support package (2004-jul-27)
cidToUnicode Adobe-GB1 E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/CMap
toUnicodeDir E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/CMap
displayCIDFontTT Adobe-GB1 E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/CMap/gkai00mp.ttf
#----- end Chinese Simplified support package

5. Edit the xpdfrc file, changing the paths to the ones on your own machine:

cidToUnicode Adobe-GB1 E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/CMap
toUnicodeDir E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/CMap
displayCIDFontTT Adobe-GB1 E:/Study/Flex/xpdf-chinese-simplified/xpdf/chinese-simplified/CMap/gkai00mp.ttf
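Editing every path by hand is error-prone, so the substitution in steps 4 and 5 could also be scripted. The following is only a minimal sketch: the class name XpdfrcPaths and the method Rebase are hypothetical helpers invented for illustration, and it assumes all entries share one root directory, as they do in the listings above.

```csharp
using System;

// Hypothetical helper (not part of XPDF): rewrites the root directory
// used by the entries of an xpdfrc / add-to-xpdfrc file.
public static class XpdfrcPaths
{
    // Replaces every occurrence of oldRoot with newRoot.
    // Backslashes in newRoot are converted to forward slashes so the
    // result matches the path style used in the config listings.
    public static string Rebase(string configText, string oldRoot, string newRoot)
    {
        string from = oldRoot.TrimEnd('/');
        string to = newRoot.Replace('\\', '/').TrimEnd('/');
        return configText.Replace(from, to);
    }
}
```

One possible use is to read the config with File.ReadAllText, pass the text through XpdfrcPaths.Rebase with the old and new root directories, and write the result back with File.WriteAllText.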

6. Write a simple program:

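using System;
using System.Diagnostics;

string xpdfPath = @"E:/Study/Flex/xpdf-chinese-simplified/xpdf/pdftotext.exe";
string filename = @"E:/Work/FlashViewer/FlashViewer/Flex/Pdf/mayun.pdf";
// Quote the file name so that paths containing spaces still work.
// The trailing "-" tells pdftotext to write the text to standard output.
string strCmd = " -cfg xpdfrc -q \"" + filename + "\" - ";
Process p = new Process();
p.StartInfo.FileName = xpdfPath; // exe, bat and so on
p.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
p.StartInfo.Arguments = strCmd;
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.UseShellExecute = false; // required when redirecting output
try
{
    p.Start();
    // Read the output before waiting, so a full pipe cannot block pdftotext.
    string strmsg = p.StandardOutput.ReadToEnd();
    // IOHelp is the author's own helper class and path is the output file path;
    // neither is defined in this post.
    IOHelp.WriteFile(path, strmsg, false);
    p.WaitForExit();
    p.Close();
}
catch (Exception e)
{
    Console.WriteLine(e.Message.ToString());
}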
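If the extracted Chinese text comes out garbled, the output encoding is a likely cause. The sketch below is one possible variation, not the author's method: it assumes the xpdfrc does not already set a text encoding, passes pdftotext's -enc option with the built-in UTF-8 encoding, and tells .NET to decode the redirected output as UTF-8. The class PdfTextExtractor and its method names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical wrapper around pdftotext; the names are illustrative only.
public static class PdfTextExtractor
{
    // Builds the start info. "-enc UTF-8" asks pdftotext for UTF-8 output,
    // and StandardOutputEncoding makes .NET decode the pipe the same way.
    public static ProcessStartInfo BuildStartInfo(string pdftotextPath, string configPath, string pdfPath)
    {
        return new ProcessStartInfo
        {
            FileName = pdftotextPath,
            // Both paths are quoted so that spaces do not split the arguments.
            Arguments = "-cfg \"" + configPath + "\" -enc UTF-8 -q \"" + pdfPath + "\" -",
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Runs pdftotext and returns everything it wrote to standard output.
    public static string Extract(string pdftotextPath, string configPath, string pdfPath)
    {
        using (Process p = Process.Start(BuildStartInfo(pdftotextPath, configPath, pdfPath)))
        {
            // Read first, then wait, to avoid blocking on a full pipe.
            string text = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return text;
        }
    }
}
```

Calling PdfTextExtractor.Extract with the pdftotext.exe path, the xpdfrc path, and the PDF path from the program above would return the text as a string, which can then be written to a file with File.WriteAllText and an explicit encoding.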

 
