Design and Implementation of a News Deduplication Algorithm

    I previously wrote about the design of a large-scale deduplication algorithm, but at the time there was no real application scenario to put it into practice, so it stayed on paper. Here is a brief description of an actual news deduplication implementation. The service wraps Yahoo's shingle algorithm so it can be invoked as a service, which makes it easy to reuse and to scale out; Thrift handles the cross-language communication, and the client is written in Java. The original design has the server provide both a computation interface, which computes the shingles for a piece of text, and a deduplication service, which takes a document's content and returns the documents it duplicates. A client may also call only the shingle-computation interface and perform the deduplication itself. The Thrift interface is defined as follows:


service shingleService {
    list<string> getShingleString(1:required string contents), // returns the shingles as an array of strings
    list<i64> getShingleLong(1:required string contents),      // returns the shingles as an array of 64-bit integers
    list<i64> getSimDocuments(1:required string contents),     // returns the IDs of duplicate documents (used when the server provides the dedup service)
}

    On top of the original code, two functions were added to return a document's shingles as strings and as 64-bit integers respectively:


std::vector<fprint_t> yahooShinglesForLongs(unsigned char *input, unsigned ws, unsigned nn) {
    shingle_t s;
    unsigned windowsize = ws;
    unsigned nminima = nn;
    unsigned nsupershingles = 6;   /* tunable parameter */
    fprint_t *minima;              /* array of size nminima */
    fprint_t *supershingles;       /* array of size nsupershingles */

    minima = (fprint_t *) malloc(sizeof(minima[0]) * nminima);
    supershingles = (fprint_t *) malloc(sizeof(supershingles[0]) * nsupershingles);
    s = shingle_new(windowsize, nminima);
    shingle_doc(s, input, minima);
    shingle_supershingle(s, minima, supershingles, nsupershingles);

    std::vector<fprint_t> results;
    for (unsigned i = 0; i != nsupershingles; i++) {
        results.push_back(supershingles[i]);
    }
    shingle_destroy(s);
    free(minima);
    free(supershingles);
    return results;
}

std::vector<std::string> yahooShinglesForStrings(unsigned char *input, unsigned ws, unsigned nn) {
    shingle_t s;
    unsigned windowsize = ws;
    unsigned nminima = nn;
    unsigned nsupershingles = 6;
    fprint_t *minima;              /* array of size nminima */
    fprint_t *supershingles;       /* array of size nsupershingles */

    minima = (fprint_t *) malloc(sizeof(minima[0]) * nminima);
    supershingles = (fprint_t *) malloc(sizeof(supershingles[0]) * nsupershingles);
    s = shingle_new(windowsize, nminima);
    shingle_doc(s, input, minima);
    shingle_supershingle(s, minima, supershingles, nsupershingles);

    std::vector<std::string> results;
    for (unsigned i = 0; i != nsupershingles; i++) {
        std::stringstream ss;
        std::string sg;
        ss << supershingles[i];
        ss >> sg;
        results.push_back(sg);
    }
    shingle_destroy(s);
    free(minima);
    free(supershingles);
    return results;
}

    The parameters passed in are tunable and directly affect the algorithm's precision and recall; the values in the code above worked reasonably well, but they should be adjusted for the specific scenario.
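To make the precision/recall trade-off concrete, here is a minimal sketch (not part of the original code; the method name and threshold are illustrative) of the duplicate decision a client can make with the six supershingles returned above: the lower the match threshold, the higher the recall and the lower the precision.

```java
import java.util.HashSet;
import java.util.Set;

public class SupershingleMatch {
    // Two documents are treated as likely duplicates when they share
    // at least `threshold` supershingles.
    static boolean likelyDuplicate(long[] a, long[] b, int threshold) {
        Set<Long> setA = new HashSet<>();
        for (long x : a) setA.add(x);
        int matches = 0;
        for (long x : b) {
            if (setA.contains(x)) matches++;
        }
        return matches >= threshold;
    }

    public static void main(String[] args) {
        long[] doc1 = {11L, 22L, 33L, 44L, 55L, 66L};
        long[] doc2 = {11L, 99L, 33L, 77L, 88L, 12L}; // shares two supershingles with doc1
        long[] doc3 = {1L, 2L, 3L, 4L, 5L, 6L};       // shares none
        System.out.println(likelyDuplicate(doc1, doc2, 1)); // true
        System.out.println(likelyDuplicate(doc1, doc3, 1)); // false
    }
}
```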

The client-side code is shown below:
package com.hot.cmt.duplicate;

import java.util.ArrayList;
import java.util.List;

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

/**
 * Deduplication client: computes the shingles of an article's content.
 * @author yongleixiao
 */
public class Shingle {
    private TTransport transport = null;
    private TProtocol protocol = null;
    private shingleService.Client client = null;
    private String ip = null;
    private int port = 0;
    private int portNum = 0;
    /* wait time between retries, in milliseconds */
    private int timeOut = 5;
    /* number of retries */
    private int tryTimes = 3;

    public Shingle(String ip, int port, int portNum) throws TTransportException {
        this.ip = ip;
        this.port = port;
        this.portNum = portNum;
        open(ip, port, portNum);
    }

    public void open(String ip, int port, int portNum) throws TTransportException {
        /* pick one of portNum consecutive ports at random for simple load spreading */
        int p = (int) (Math.random() * portNum) + port;
        transport = new TSocket(ip, p);
        protocol = new GBKCompactProtocol(transport);
        client = new shingleService.Client(protocol);
        try {
            transport.open();
        } catch (TTransportException tt) {
            try {
                Thread.sleep(timeOut);
                transport.open();
            } catch (InterruptedException ie) {
            }
        }
    }

    public List<String> getShingleString(String content) throws TException {
        List<String> shingles = new ArrayList<String>();
        for (int i = 0; i < tryTimes; i++) {
            try {
                shingles = client.getShingleString(content);
                return shingles;
            } catch (TException ex) {
                try {
                    Thread.sleep(timeOut);
                    if (i == 1) {
                        /* reopen the connection after the second failure */
                        close();
                        open(this.ip, this.port, this.portNum);
                    }
                } catch (InterruptedException e) {
                }
            }
        }
        return shingles;
    }

    public void close() {
        transport.close();
    }
}

    That is the bulk of the deduplication code, and it should run as-is; it lives in the news-duplicated repository. An article is reduced to n shingles, and if even a single shingle matches another article's, the probability that the two articles are duplicates is very high, so the code above makes it fairly easy to put together a deduplication system. As mentioned in the earlier post, shingle computation is relatively slow, but it parallelizes well; lookup should not be the bottleneck. One caveat: very short articles are not a good fit for this algorithm, and an MD5 hash of the content can be used for them instead.
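The lookup side described above can be sketched as a simple in-memory inverted index from shingle fingerprint to document IDs, with the MD5 fallback for short texts. This is not from the project's code; the class and method names are illustrative, and a production system would back the index with a persistent store.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DedupIndex {
    /* inverted index: shingle fingerprint -> IDs of documents containing it */
    private final Map<Long, Set<String>> index = new HashMap<>();

    /* any document sharing at least one shingle is reported as a duplicate */
    public Set<String> findDuplicates(List<Long> shingles) {
        Set<String> dups = new HashSet<>();
        for (Long s : shingles) {
            Set<String> docs = index.get(s);
            if (docs != null) dups.addAll(docs);
        }
        return dups;
    }

    public void add(String docId, List<Long> shingles) {
        for (Long s : shingles) {
            index.computeIfAbsent(s, k -> new HashSet<>()).add(docId);
        }
    }

    /* MD5 hex digest, as a fallback key for texts too short to shingle reliably */
    public static String md5(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DedupIndex idx = new DedupIndex();
        idx.add("doc1", List.of(11L, 22L, 33L));
        System.out.println(idx.findDuplicates(List.of(33L, 44L))); // [doc1]
        System.out.println(md5("short text").length());            // 32
    }
}
```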
