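Introduction:
Lemur (named after the lemur) is a toolkit jointly released by CMU and UMass for research on language modeling and information retrieval. It supports ad hoc and distributed retrieval based on language models, the traditional vector space model, and Okapi, as well as structured queries, cross-language retrieval, filtering, clustering, and more. When this description was written the latest version was 3.0, and CMU and UMass planned to release a new version in September, Indri (named after the indri, a large lemur), adding support for terabyte-scale collections (1 TB is about 1000 GB) and structured document queries (for example, parsing HTML documents into different document representations using structural information such as tags, title, and meta).
What is needed to run Lemur? Lemur runs on Windows or Unix, so it can be used directly on Windows. However, Lemur ships shell scripts that demonstrate the complete retrieval workflow, so on Windows you need to install Cygwin to emulate a Unix environment. Lemur also provides a GUI program and a CGI-based interactive user interface. The GUI includes Java programs that display retrieval results directly, so a Java virtual machine is required, and the CGI program requires a Perl interpreter.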
Download site:
http://www.lemurproject.org/
Double-click "lemur" to see the releases from 4.3 up to the latest version:
File | Description | Platform
Indri-2.9-install.exe | Indri installer | i386
indri-2.9.tar.gz | Source | Platform-Independent
lemur-4.9.dmg | Mac OS | i386
lemur-4.9-doc.tar.gz | ? | Platform-Independent
lemur-4.9-install.exe | Lemur installer | i386
lemur-4.9.tar.gz | Source | Platform-Independent
Download lemur-4.9-install.exe and run it to install.
Directory layout: ..\Lemur 4.9\
bin\ Lemur Toolkit applications: executables that can be invoked directly from the command line; for details see windoc\lemur-applications.html
include\ The Lemur include files
lib\ The Lemur library
windoc\
    Overview of the Lemur Toolkit
        Overview of the Lemur Toolkit
        Installed Applications
        Using the Lemur Toolkit API
    Indexing
        Indexing Overview
        Document Formats
    Retrieval
        Batch Retrieval Methods
        The Indri Query Language for Retrieval
        The InQuery Query Language for Retrieval
src_vs_2005\ Complete Lemur Toolkit source tree for the Microsoft platform (Visual Studio 2005)
javadoc\ Java API documentation
GUI\
    RetUI.jar provides a basic document retrieval GUI for interactive queries, using the Indri API.
    IndexUI.jar provides a basic collection indexing GUI for building an Indri repository.
    LemurRet.jar provides a basic document retrieval GUI for interactive queries, using the Lemur API.
    LemurIndex.jar provides a basic collection indexing GUI for building Lemur indexes.
    lemur.jar and indri.jar for the Lemur and Indri APIs.
doc\ Lemur Toolkit documentation, for example:
Namespace List | Class Hierarchy | Alphabetical List | Class List | Directories | File List | Namespace Members | Class Members | File Members | Related Pages
CSharp\ The C# wrapper classes assembly is LemurCsharp.dll. This assembly should be referenced by your C# program.
Ways to use it:
(1) Use the Lemur programs directly, i.e. the executables under bin\;
(2) Building applications using Visual Studio .NET, i.e. calling the Lemur library directly from your own project;
After installing the lemur toolkit, you can use the library by adding the subfolder include of the target directory to the "C/C++ / General / Additional Include Directories" property for your project.
Next, add the subfolder lib of the target directory to the "Linker / General / Additional Library Directories" property for your project.
Next, add lemur.lib and wsock32.lib to the "Linker / Input / Additional Dependencies" property for your project.
Also, if your project is configured as "Debug", you should choose the "Multi-threaded Debug DLL(/MDd)" runtime library. If your project is configured as "Release", you should choose the "Multi-threaded DLL(/MD)" runtime library. The installable Lemur Library and applications were built in Release / Multi-Threaded mode.
Finally, you should have the "C/C++ / Language / Enable Run-Time Type Info" property set to yes.
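For readers who build from the command line rather than through the project property pages, the settings above correspond to standard MSVC options. The sketch below is only an illustration: `msvc_flags` is a hypothetical helper name, and the install path passed to it is an assumption that you should replace with your own target directory. It maps each property to its flag (`/I`, `/MD` or `/MDd`, `/GR`, `/LIBPATH:`).

```python
from pathlib import PureWindowsPath


def msvc_flags(lemur_root, config="Release"):
    """Return (compiler_flags, linker_flags) mirroring the project settings.

    lemur_root is the Lemur target directory (a hypothetical location here).
    """
    if config not in ("Debug", "Release"):
        raise ValueError("config must be 'Debug' or 'Release'")
    root = PureWindowsPath(lemur_root)
    compiler = [
        '/I"' + str(root / "include") + '"',      # Additional Include Directories
        "/MDd" if config == "Debug" else "/MD",   # Runtime Library
        "/GR",                                    # Enable Run-Time Type Info
    ]
    linker = [
        '/LIBPATH:"' + str(root / "lib") + '"',   # Additional Library Directories
        "lemur.lib",                              # Additional Dependencies
        "wsock32.lib",
    ]
    return compiler, linker
```

The two lists would be passed to `cl` in the form `cl myapp.cpp <compiler flags> /link <linker flags>`, where `myapp.cpp` stands for your own source file.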
(3) Compiling the Lemur Toolkit with Visual Studio .NET, i.e. modifying Lemur to fit your own needs, then recompiling it and calling the result;
The installer can optionally install the full Lemur Toolkit source tree, placing it in the "src_vs_2003" subfolder and/or the "src_vs_2005" subfolder of the target directory, depending on which version(s) of Visual Studio you have installed. That folder contains the Visual Studio solution file "Lemur.sln". There is a separate project file for each library and for each application in Lemur.
By default the project configurations are built in "Debug" mode. To change this so that it compiles with fewer warnings and runs at higher efficiency, change the configuration setting in the "Build" menu. Then choose "Configuration Manager". In the menu for "Active Solution Configuration", choose "Release".
When built from source, there is a separate library for each of the sub-libraries that are compiled into "lemur.lib". The combined library, "lemur.lib", is built in the lemur subfolder, with output in either Release or Debug, depending on configuration.
Important Notes:
1. Before compiling the toolkit from the source, you must set the proper include path for the Java library. To set it, in the Solution Explorer view, right-click on the "lemur_jni" project and choose "Properties". Set the "Configuration" drop-down box (at the top of the dialog box) to "All Configurations". Next, in the "Additional Include Directories" field, set the appropriate paths to your Java JDK installation's include directory and include/win32 directory. Press the "OK" button when finished, and rebuild. [If the file 'jni.h' still cannot be found, add the JDK's include and win32 directories to Additional Include Directories as well.]
2. To avoid errors such as 'error PRJ0008: Could not delete file "e:\lemur 4.8\src_vs_2005\app\obj\vc80.pdb"', or the file failing to open, adjust the parallel project builds setting: set "maximum number of parallel project builds" to 1. (Possibly a problem on CPUs with two or more cores?)
3. Lemur includes support for Arabic, which can cause character-encoding problems on a Chinese-language system, so the modules that handle Arabic need to be disabled. Locate Arabic_Stemmer.cpp in the parsing module and blank out the contents of its functions: for functions returning void, comment out the entire function body; for functions with a return type, comment out the whole function. Note that the module itself must not be deleted, because other modules call its interfaces, and removing those interfaces will cause the build to fail.
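Note 1 is the step most likely to go wrong, so it can help to check the JDK paths before pasting them into "Additional Include Directories". The following is a minimal sketch, assuming a Windows JDK layout in which `jni.h` sits in `include` and the platform header `jni_md.h` sits in `include\win32`. The function name `jni_include_dirs` is hypothetical and not part of Lemur or the JDK.

```python
from pathlib import Path


def jni_include_dirs(jdk_home):
    """Return the two JDK include directories for the lemur_jni project.

    Raises FileNotFoundError if the expected JNI headers are missing,
    which usually means jdk_home points at a JRE or at the wrong folder.
    """
    include = Path(jdk_home) / "include"
    win32 = include / "win32"
    if not (include / "jni.h").is_file():
        raise FileNotFoundError(f"jni.h not found in {include}")
    if not (win32 / "jni_md.h").is_file():
        raise FileNotFoundError(f"jni_md.h not found in {win32}")
    return [str(include), str(win32)]
```

Joining the returned list with ";" gives a value in the form the "Additional Include Directories" field accepts; the JDK root would typically come from the JAVA_HOME environment variable, if it is set on your machine.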
Reference documentation:
Lemur Toolkit and Indri Search Engine Documentation
http://www.lemurproject.org/docs/index.php/Main_Page
Main contents:
Where to Begin...
    Overview
    Compiling and Installing
    Technical Details
Using the Toolkit
    Toolkit Usage Overview
    Building Indexes
    Retrieval Tasks
    Lemur Toolkit Utilities
    The Indri Query Language
    The Lemur CGI Application
Programming with the Toolkit
    Using the Lemur Toolkit with C/C++
    Using the Lemur Toolkit with C Sharp
    Using the Lemur Toolkit with Java
    Extending the Toolkit Libraries
Lemur and Indri for Multilingual Tasks
    Multilingual Overview
    Lemur/Indri and Chinese Text
    Lemur/Indri and Arabic Text
Reference
    Table of Contents
    The Lemur Toolkit API documentation
    Site Index
from: http://hi.baidu.com/gengshenspirit/blog
reference: http://blog.csdn.net:80/NewNebuladream