本文章引自:http://www.comicer.com/stronghorse/water/software/officeocr.htm
在简体中文Office 2003下用Micorsoft Office Document Imaging (MODI)做OCR的步骤为:
与其他商业OCR软件相比,MODI具有下列特点:
在正式开始讨论系统设置前,先透露一点技术背景:
要想让简体中文Office 2003能够OCR繁体、日文、韩文,需要做的工作包括两个方面:
找一台安装了繁体中文Office 2003的机器,进入MODI的安装文件夹,缺省为:
C:\Program Files\Common Files\Microsoft Shared\MODI\11.0
将下面的文件复制到安装了简体中文Office 2003的相同文件夹下:
TCCODE.UNI
TCPRINT.DAT
TCPRINT2.DAT
TCSERHT.DAT
TCTREE.DAT
TW_BU.DAT
TW_UB.DAT
TWBIG532.DLL
复制完成后,用记事本创建一个reg文件,把下面内容粘贴后存盘:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Microsoft\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"1028"=hex(7):28,00,26,00,48,00,42,00,56,00,6e,00,2d,00,7d,00,66,00,28,00,5a,\
00,58,00,66,00,65,00,41,00,52,00,36,00,2e,00,6a,00,69,00,4f,00,43,00,52,00,\
5f,00,31,00,30,00,32,00,38,00,3e,00,7d,00,60,00,45,00,4d,00,61,00,65,00,2c,\
00,37,00,71,00,39,00,2a,00,44,00,58,00,64,00,55,00,40,00,45,00,50,00,69,00,\
3d,00,00,00,00,00
双击此reg文件导入注册表后,在MODI的OCR选项卡里,“OCR语言”即可看到“中文(繁体)”。注意导入注册表时必须先关闭所有MODI窗口,导入后再打开。
在简体中文环境下,按照上述步骤设置后,用MODI识别出来的繁体中文是GBK编码的繁体字,可以用Word的繁简转换,或TextForever的编码转换功能 (支持批量)转换成GB编码的简体字。
需要从日文MODI复制到简体MODI文件夹下的文件为:
JPCODE.UNI
JPPRINT.DAT
JPPRINT2.DAT
JPSERHT.DAT
JPTREE.DAT
TW_SU.DAT
TW_US.DAT
TWRECJ.DLL
TWSJIS32.DLL
需要导入的reg内容为:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Microsoft\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"1041"=hex(7):30,00,5d,00,67,00,41,00,56,00,6e,00,2d,00,7d,00,66,00,28,00,5a,\
00,58,00,66,00,65,00,41,00,52,00,36,00,2e,00,6a,00,69,00,4f,00,43,00,52,00,\
5f,00,31,00,30,00,34,00,31,00,3e,00,2e,00,61,00,45,00,4d,00,61,00,65,00,2c,\
00,37,00,71,00,39,00,2a,00,44,00,58,00,64,00,55,00,40,00,45,00,50,00,69,00,\
3d,00,00,00,00,00
配置成功后,在MODI的OCR选项卡里,“OCR语言”即可看到“日语”。
在简体中文环境下,按照上述步骤设置后,用MODI识别出来的日文是GBK编码,可以在支持GBK字符集的简体中文环境下正常显示、编辑。
需要从韩文MODI复制到简体MODI文件夹下的文件为:
DATASIM.DAT
HANGULLB.DAT
KRCODE.UNI
KRDIST.DAT
KRPRINT.DAT
KRSERHT.DAT
KRTREE.DAT
TW_KU.DAT
TW_UK.DAT
TWCUTCKR.DLL
TWCUTLKR.DLL
TWKSC32.DLL
TWLAYKR.DLL
TWRECK.DLL
需要导入的reg内容为:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Microsoft\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"1042"=hex(7):31,00,5d,00,67,00,41,00,56,00,6e,00,2d,00,7d,00,66,00,28,00,5a,\
00,58,00,66,00,65,00,41,00,52,00,36,00,2e,00,6a,00,69,00,4f,00,43,00,52,00,\
5f,00,31,00,30,00,34,00,32,00,3e,00,30,00,61,00,45,00,4d,00,61,00,65,00,2c,\
00,37,00,71,00,39,00,2a,00,44,00,58,00,64,00,55,00,40,00,45,00,50,00,69,00,\
3d,00,00,00,00,00
配置成功后,在MODI的OCR选项卡里,“OCR语言”即可看到“朝鲜语”。
在简体中文环境下,按照上述步骤设置后,用MODI识别出来的韩文是韩文编码(charset:129),可以存为HTML、doc,并能在Word里正常显示、编辑。如果存为TXT,则不能在简体中文环境下显示、编辑。
如果需要在繁体中文环境下OCR简体中文,最正宗的方法是下载、安装一个简体MODI:
当然如果想省事,也可以复制下列文件:
SCCODE.UNI
SCPRINT.DAT
SCPRINT2.DAT
SCSERHT.DAT
SCTREE.DAT
TW_GU.DAT
TW_UG.DAT
TWGB32.DLL
需要导入的reg内容为:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Microsoft\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"2052"=hex(7):4d,00,6a,00,33,00,47,00,51,00,66,00,5e,00,62,00,54,00,3f,00,42,\
00,3f,00,56,00,50,00,24,00,5e,00,62,00,53,00,6c,00,6c,00,3e,00,25,00,6d,00,\
45,00,4d,00,61,00,65,00,2c,00,37,00,71,00,39,00,2a,00,44,00,58,00,64,00,55,\
00,40,00,45,00,50,00,69,00,3d,00,00,00,00,00