In my opinion, introducing Aperture's abstract APIs on their own would leave readers puzzled; abstract APIs stripped of concrete context feel rather pale. People naturally learn from the particular to the general, from the concrete to the abstract. For that reason, this article works through examples that come with real context.
Let's start with a simple data extraction program. The basic flow is:

1. Identify the file's MIME type from its InputStream.
2. Use the identified MIME type to obtain an ExtractorFactory, and from it an Extractor.
3. Call the Extractor's extract method to populate an RDFContainer.
4. Write out the Model in an RDF serialization.

The example code is as follows:
public class ExtractorExample {

    public static void main(String[] args) throws Exception {
        // create a MimeTypeIdentifier
        MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

        // create an ExtractorRegistry containing all available ExtractorFactories
        ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();

        // read as many bytes of the file as desired by the MIME type identifier
        File file = new File("/home/chenying/web/news1.html");
        FileInputStream stream = new FileInputStream(file);
        BufferedInputStream buffer = new BufferedInputStream(stream);
        byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
        stream.close();

        // let the MimeTypeIdentifier determine the MIME type of this file
        String mimeType = identifier.identify(bytes, file.getPath(), null);

        // skip when the MIME type could not be determined
        if (mimeType == null) {
            System.err.println("MIME type could not be established.");
            return;
        }

        // create the RDFContainer that will hold the RDF model
        URI uri = new URIImpl(file.toURI().toString());
        Model model = RDF2Go.getModelFactory().createModel();
        model.open();
        RDFContainer container = new RDFContainerImpl(model, uri);

        // determine and apply an Extractor that can handle this MIME type
        Set factories = extractorRegistry.getExtractorFactories(mimeType);
        if (factories != null && !factories.isEmpty()) {
            // just fetch the first available Extractor
            ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
            Extractor extractor = factory.get();

            // apply the extractor on the specified file
            // (just open a new stream rather than buffer the previous stream)
            stream = new FileInputStream(file);
            buffer = new BufferedInputStream(stream, 8192);
            extractor.extract(uri, buffer, Charset.forName("utf-8"), mimeType, container);
            stream.close();
        }

        // add the MIME type as an additional statement to the RDF model
        container.add(NIE.mimeType, mimeType);

        // report the output to System.out
        //container.getModel().writeTo(new PrintWriter(System.out), Syntax.Ntriples);
        container.getModel().writeTo(new PrintWriter(System.out), Syntax.RdfXml);
    }
}
Run the class above and the Eclipse console will show the Model serialized in Syntax.RdfXml format. My output looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="file:/home/chenying/web/news1.html">
    <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument"/>
    <plainTextContent xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">本文件为测试的解析文件
    </plainTextContent>
    <mimeType xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">text/html</mimeType>
  </rdf:Description>
</rdf:RDF>
In the example above, much of the core work is still done by hand; if that were all the intelligence Aperture offered, we might well be discouraged. But in my view a simple example is the gateway to the advanced usage; without it we would easily get lost in the maze of higher-level features.
Now for a slightly more advanced example. The basic flow is:

1. Create a Model.
2. Wrap the Model in an RDFContainer.
3. Create a FileSystemDataSource and set its configuration properties.
4. Create a FileSystemCrawler and set its DataSource, DataAccessorRegistry, and CrawlerHandler (the callback handler).
5. Call the FileSystemCrawler's crawl method.

The example code is as follows:
public class TutorialCrawlingExample {

    public static void main(String[] args) throws Exception {
        // create a new ExampleFileCrawler instance
        TutorialCrawlingExample crawler = new TutorialCrawlingExample();

        if (args.length != 1) {
            System.err.println("Specify the root folder");
            System.exit(-1);
        }

        // start crawling and exit afterwards
        crawler.doCrawling(new File(args[0]));
    }

    public void doCrawling(File rootFile) throws Exception {
        // create a model that will store the data source configuration
        Model model = RDF2Go.getModelFactory().createModel();

        // open the model
        model.open();

        // .. and wrap it in an RDFContainer
        RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"), false);

        // now create the data source
        FileSystemDataSource source = new FileSystemDataSource();

        // and set the configuration container
        source.setConfiguration(configuration);

        // now we can call the type-specific setters in each DataSource class
        source.setRootFolder(rootFile.getAbsolutePath());

        // setup a crawler that can handle this type of DataSource
        FileSystemCrawler crawler = new FileSystemCrawler();
        crawler.setDataSource(source);
        crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
        crawler.setCrawlerHandler(new TutorialCrawlerHandler());

        // start crawling
        crawler.crawl();
    }
}

class TutorialCrawlerHandler extends CrawlerHandlerBase {

    // our 'persistent' modelSet
    private ModelSet modelSet;

    public TutorialCrawlerHandler() throws ModelException {
        super(new MagicMimeTypeIdentifier(), new DefaultExtractorRegistry(), new DefaultSubCrawlerRegistry());
        modelSet = RDF2Go.getModelFactory().createModelSet();
        modelSet.open();
    }

    public void crawlStopped(Crawler crawler, ExitCode exitCode) {
        try {
            //modelSet.writeTo(System.out, Syntax.Trix);
            modelSet.writeTo(System.out, Syntax.RdfXml);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            modelSet.close();
        }
    }

    public RDFContainer getRDFContainer(URI uri) {
        // we create a new in-memory temporary model for each data source
        Model model = RDF2Go.getModelFactory().createModel(uri);

        // a model needs to be opened before being wrapped in an RDFContainer
        model.open();
        return new RDFContainerImpl(model, uri);
    }

    public void objectNew(Crawler crawler, DataObject object) {
        // first we try to extract the information from the binary file
        try {
            processBinary(crawler, object);
        }
        catch (Exception x) {
            // do some proper logging now in real applications
            x.printStackTrace();
        }

        // then we add this information to our persistent model
        modelSet.addModel(object.getMetadata().getModel());

        // don't forget to dispose of the DataObject
        object.dispose();
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        // first we remove old information about the data object
        modelSet.removeModel(object.getID());

        // then we try to extract metadata and fulltext from the file
        try {
            processBinary(crawler, object);
        }
        catch (Exception x) {
            // do some proper logging now in real applications
            x.printStackTrace();
        }

        // and then we add the information from the temporary model to our 'persistent' model
        modelSet.addModel(object.getMetadata().getModel());

        // don't forget to dispose of the DataObject
        object.dispose();
    }

    public void objectRemoved(Crawler crawler, URI uri) {
        // an object has been removed, we delete it from the rdf store
        modelSet.removeModel(uri);
    }
}
After supplying the program argument (the single command-line argument is the root folder to crawl), run the class above; it likewise prints the Model in Syntax.RdfXml format.
In this example we never had to obtain the files' InputStreams ourselves; Aperture handles that automatically. TutorialCrawlerHandler, a helper class defined in the same source file, serves as the callback handler for the FileSystemCrawler instance. This style of processing is quite similar to parsing XML with SAX under Java's JAXP specification; the two approaches share the same spirit.
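To make the analogy concrete, here is a minimal SAX sketch using only the standard JAXP API (nothing Aperture-specific; the file name news1.xml is merely a placeholder): we hand a callback object to the parser and let the parser drive the traversal, much as FileSystemCrawler drives the crawl and calls back into the CrawlerHandler.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCallbackExample {

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // register the callback object and let the parser drive the processing,
        // much as FileSystemCrawler drives the crawl and calls into CrawlerHandler
        parser.parse(new File("news1.xml"), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // roughly analogous to objectNew(...): invoked once per element the parser encounters
                System.out.println("element: " + qName);
            }
        });
    }
}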
As the TutorialCrawlerHandler class shows, each per-object Model is merged into a ModelSet (essentially a collection of models). When writing the callback methods we could just as well persist the data to the file system; I won't cover the details here, but a rough sketch follows.
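As an illustration only (not from the original tutorial): crawlStopped in TutorialCrawlerHandler could serialize the ModelSet to a file instead of System.out, reusing the same writeTo(OutputStream, Syntax) call shown above. The file name crawl-result.rdf is a placeholder, and java.io.FileOutputStream would need to be imported.

    public void crawlStopped(Crawler crawler, ExitCode exitCode) {
        try (FileOutputStream out = new FileOutputStream("crawl-result.rdf")) {
            // same serialization call as in the example, just directed at a file
            modelSet.writeTo(out, Syntax.RdfXml);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            modelSet.close();
        }
    }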
---------------------------------------------------------------------------
This Web data mining series is the author's original work.
Author: 刺猬的温驯 (cnblogs)
Original post: http://www.cnblogs.com/chenying99/archive/2013/06/15/3137067.html
Copyright belongs to the author. Reproduction or commercial use without the author's consent is strictly prohibited; violations will be pursued under the law.