Recognizing Phone and Mobile Numbers on Web Pages

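An obvious way to recognize phone numbers on a web page is to design regular expressions for phone numbers in advance, match them against the page text, and extract the contact details. This approach assumes that phone numbers appear on the page in fairly ideal formats. Precision is naturally high for numbers that do, but many other phone numbers are missed. Unless you have analyzed and processed web data in depth, it is hard to imagine just how irregular page formats on the Internet are.

Here we implement a method for recognizing phone numbers on web pages that does not rely on precise regular expressions. Instead, it is designed around the most abstract features of a phone number.

A phone number is always a sequence that contains digits, and the digits may be separated by special or common characters such as commas, hyphens, spaces, or letters. When analyzing the text of a page, we therefore relax the definition of a digit string:

If two digit characters are adjacent, they belong to the same sequence; if fewer non-digit characters than a given threshold lie between two digit characters, they also belong to the same sequence. In essence, digit strings that lie close together are merged into one independent sequence, so analyzing the text of a page yields a set of digit sequences.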
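The relaxed definition can be illustrated with a minimal character-level sketch. It is not the implementation used in this article, which works on analyzer tokens; the class name `DigitRunMerger` and the `merge` method are hypothetical, and the sketch treats "at most `maxGap` non-digit characters" as the threshold, mirroring the `gap<=maxGap` comparison in the full listing.

```java
import java.util.ArrayList;
import java.util.List;

class DigitRunMerger {

    /**
     * Groups the digit runs of text into sequences. Two neighbouring runs share
     * a sequence when at most maxGap non-digit characters separate them.
     * Each returned string holds one sequence, with its runs joined by '-'.
     */
    static List<String> merge(String text, int maxGap) {
        List<String> sequences = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int gap = 0;            // non-digit characters seen since the last digit
        boolean inRun = false;  // true while the previous character was a digit
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                if (!inRun && current.length() > 0) {
                    if (gap <= maxGap) {
                        current.append('-');             // close enough: same sequence
                    } else {
                        sequences.add(current.toString()); // too far: start a new one
                        current.setLength(0);
                    }
                }
                current.append(c);
                inRun = true;
                gap = 0;
            } else {
                inRun = false;
                gap++;
            }
        }
        if (current.length() > 0) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        // prints [30303, 404-878-2276]
        System.out.println(merge("atlanta, ga 30303 phone: (404) 878-2276", 5));
    }
}
```

With a threshold of 5, the nine characters between `30303` and `404` keep the postal code apart, while the area code and the two halves of the number are merged.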

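However, this also picks up short numbers such as dates, ages, and ordinal numbers, so a filtering step is the natural next move. Here we use a recommendation model: compute a correlation score for every digit sequence, sort by that score, and then pick the phone numbers from the top-ranked sequences.

Below, we implement this idea in Java and look at the results.

First, define a sequence recommendation interface, SequenceRecommendation. Its recommend method holds the concrete logic, which you can design to suit your own needs.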

package org.shirdrn.webmining.recommend;

public interface SequenceRecommendation {
	public void recommend() throws Exception;
}

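Next, we implement an algorithm that extracts digit sequences and computes their correlation so that they can be ranked and recommended. The basic steps are:

1. Clean the raw web page: strip the HTML tags and so on to obtain the final text content.

2. Tokenize the text: use the SimpleAnalyzer that ships with Lucene (no stop-word filtering). It is chosen because words with a domain-specific meaning often sit immediately before or after a digit sequence (phone or telephone around a phone number, email or email us around an email address, and so on). Such a word may be a stop word for StandardAnalyzer and similar analyzers, and we do not want it filtered out. The position of every token is recorded as well. Note that the listing below actually instantiates EnglishAnalyzer, which stems tokens and applies a default stop-word set; that is why the output contains stems such as signatur.

3. Group the digit sequences and record a given number of words before and after each one (the core step): this is the heart of the method. It requires careful text processing and data-structure design to produce a result set on which correlation can be computed conveniently.

4. Build a domain model (feature word vectors) from the results computed on a sample set, and use it to compute the correlation of each digit sequence: I collected a set of English web pages, analyzed them, and extracted a batch of feature words. For simplicity, term frequency is used directly as the weight (term frequency is simple and reasonable here, but weights could be computed in other ways or supplemented with contributions from other attributes). We use two feature word vectors, shown below:

Forward feature vector (file forwards_feature_vector), matched against the words that follow a digit sequence: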

email                                     9124
e                                         3368
mail                                      4767
e-mail                                    2183
email us at                               178
fax                                       147
email address                             146
email us                                  121
fx                                        115
or                                        113
email us                                  102
email or                                  95
email us at                               76
or e-mail                                 67

Backward feature vector (file backwards_feature_vector), matched against the words that precede a digit sequence:

phone                          27407
call                           13697
free                           13092
toll                           10092
toll free                      9012
tel                            8710
call                           5247
telephone                      4052
call us                        3108
ph                             3067
t                              2838
p                              2830
contact us                     2150
or call                        1889
local                          1477
f                              1437
or                             1362
abn                            1257
call us at                     1194
office                         1183
call us today                  1152
customer service               1101
call toll free                 1080

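The feature word vectors are loaded from files and used in the test case later on.

5. Sort by correlation and recommend: once sorted, the result is clear at a glance, and the sequences at the top are the most likely to be phone numbers.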
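The scoring and ranking of steps 4 and 5 can be sketched in isolation as a weighted sum of feature weights. This is a simplified illustration: the `CorrelationScorer` and `Candidate` names are hypothetical, the feature weights are a small excerpt of the two vectors above, and 1.75 and 1.05 are the values of the `backwardsWeight` and `forwardsWeight` fields in the full listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CorrelationScorer {

    /** Hypothetical holder for one digit sequence and its context words. */
    static class Candidate {
        final String digits;
        final List<String> before;  // words preceding the digits
        final List<String> after;   // words following the digits
        double score;

        Candidate(String digits, List<String> before, List<String> after) {
            this.digits = digits;
            this.before = before;
            this.after = after;
        }
    }

    /** Adds up the weights of those words that appear in the feature vector. */
    static double sum(List<String> words, Map<String, Double> vector) {
        double total = 0;
        for (String word : words) {
            Double weight = vector.get(word);
            if (weight != null) {
                total += weight;
            }
        }
        return total;
    }

    /** Scores every candidate and sorts the list in place, highest score first. */
    static void rank(List<Candidate> candidates,
                     Map<String, Double> backwards, double backwardsWeight,
                     Map<String, Double> forwards, double forwardsWeight) {
        for (Candidate c : candidates) {
            c.score = backwardsWeight * sum(c.before, backwards)
                    + forwardsWeight * sum(c.after, forwards);
        }
        Collections.sort(candidates, new Comparator<Candidate>() {
            @Override
            public int compare(Candidate a, Candidate b) {
                return Double.compare(b.score, a.score);
            }
        });
    }

    public static void main(String[] args) {
        Map<String, Double> backwards = new HashMap<String, Double>();
        backwards.put("phone", 27407.0);
        backwards.put("tel", 8710.0);
        Map<String, Double> forwards = new HashMap<String, Double>();
        forwards.put("email", 9124.0);
        forwards.put("fax", 147.0);

        List<Candidate> candidates = new ArrayList<Candidate>();
        candidates.add(new Candidate("30303",
                Arrays.asList("ga", "atlanta", "center", "cnn", "on"),
                Arrays.asList("phone", "fax")));
        candidates.add(new Candidate("404-878-2276",
                Arrays.asList("phone", "ga", "atlanta", "center"),
                Arrays.asList("fax", "email")));

        rank(candidates, backwards, 1.75, forwards, 1.05);
        for (Candidate c : candidates) {
            System.out.println(c.digits + " -> " + c.score);
        }
    }
}
```

The sequence preceded by phone ends up far ahead of the postal code, whose only matching context word is fax.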

The design and implementation of the whole idea is the NumberSequenceRecommendation class, shown below:

package org.shirdrn.webmining.recommend;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class NumberSequenceRecommendation implements SequenceRecommendation {

	private byte[] content;
	private Charset charset;
	private String baseUri;
	
	/** Max number of non-digit characters allowed between two adjacent number words 
	 * for them to still be treated as one continuous number sequence. */
	private int maxGap = 5;
	/** Max word count after or before a number sequence */
	private int maxWordCount = 5;
	private Pattern numberPattern = Pattern.compile("^\\d+$");
	private String cleanedContent;
	
	/** All words analyzed by Lucene analyzer from specified page text. */
	private LinkedList<Word> wordList = new LinkedList<Word>();
	private LinkedList<NumberSequence> numberSequenceList = new LinkedList<NumberSequence>();
	/** Final result sorted by correlation */
	List<NumberSequence> sortedNumberSequenceSet = new ArrayList<NumberSequence>(1);
	private Map<String, Double> backwardsFeatureVector = new HashMap<String, Double>();
	private Map<String, Double> forwardsFeatureVector = new HashMap<String, Double>();
	
	private double backwardsWeight = 1.75;
	private double forwardsWeight = 1.05;
	
	public NumberSequenceRecommendation() {
		this(new byte[]{}, Charset.defaultCharset(), null, null, null);
	}
	
	public NumberSequenceRecommendation(byte[] content, Charset charset, String baseUri, 
			String backwordsFeatureVectorFile, String forwardsFeatureVectorFile) {
		super();
		this.baseUri = baseUri;
		this.content = content;
		this.charset = charset;
		loadFeatureVectors(backwordsFeatureVectorFile, forwardsFeatureVectorFile);
	}
	
	private void loadFeatureVectors(String backwordsFeatureVectorFile, String forwardsFeatureVectorFile) {
		load(backwordsFeatureVectorFile, backwardsFeatureVector);
		load(forwardsFeatureVectorFile, forwardsFeatureVector);
	}

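	private void load(String featureVectorFile, Map<String, Double> featureVector) {
		// The no-arg constructor passes null: leave the vector empty instead of failing.
		if(featureVectorFile==null) {
			return;
		}
		FileInputStream fis = null;
		BufferedReader reader = null;
		try {
			fis = new FileInputStream(featureVectorFile);
			reader = new BufferedReader(new InputStreamReader(fis, charset));
			String line = null;
			while((line = reader.readLine())!=null) {
				String[] fields = line.trim().split("\\s+");
				if(fields.length<2) {
					continue;
				}
				// The weight is the last field; the fields before it form the
				// (possibly multi-word) feature term, e.g. "toll free 9012".
				// sortByCorrelation() looks up single tokens, so multi-word terms
				// are loaded but never matched there.
				StringBuilder term = new StringBuilder(fields[0]);
				for(int i=1; i<fields.length-1; i++) {
					term.append(' ').append(fields[i]);
				}
				try {
					featureVector.put(term.toString(), Double.parseDouble(fields[fields.length-1]));
				} catch (NumberFormatException e) {
					// skip lines whose last field is not a number
				}
			}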
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}catch (IOException e) {
			e.printStackTrace();
		} finally {
			try {
				if(reader!=null) {
					reader.close();
				}
				if(fis!=null) {
					fis.close();
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}		
	}

	@Override
	public void recommend() throws Exception {
		recommend(content, charset, baseUri);
	}

	private List<NumberSequence> recommend(byte[] content, Charset charset, String baseUri) {
		String html = new String(content, charset);
		Document doc = Parser.parse(html, baseUri);
		StringBuffer buf = new StringBuffer();
		parseHtmlText(doc.body(), buf);
		cleanedContent = buf.toString().trim();
		collectWords(cleanedContent);
		analyzeNumberWords();
		return sortByCorrelation();
	}
	
	/**
	 * Compute correlation, and sort result, for recommending.
	 * @return
	 */
	private List<NumberSequence> sortByCorrelation() {
		// sort numberSequenceList
		for(NumberSequence ns : numberSequenceList) {
			// backwards
			double backwardsCorrelation = 0;
			for(Word w : ns.backwardsWords) {
				if(backwardsFeatureVector.containsKey(w.text)) {
					backwardsCorrelation += backwardsFeatureVector.get(w.text);
				}
			}
			// forwards
			double forwardsCorrelation = 0;
			for(Word w : ns.forwardsWords) {
				if(forwardsFeatureVector.containsKey(w.text)) {
					forwardsCorrelation += forwardsFeatureVector.get(w.text);
				}
			}
			ns.correlation = backwardsWeight * backwardsCorrelation + forwardsWeight * forwardsCorrelation;
			sortedNumberSequenceSet.add(ns);
		}
		
		// sort by correlation
		Collections.sort(sortedNumberSequenceSet, new Comparator<NumberSequence>() {

			@Override
			public int compare(NumberSequence o1, NumberSequence o2) {
				if(o1.correlation<o2.correlation) {
					return 1;
				} else if(o1.correlation>o2.correlation) {
					return -1;
				}
				return 0;
			}
			
		});
		return sortedNumberSequenceSet;
	}

	/**
	 * Extract text data from a HTML page.
	 * @param node
	 * @param buf
	 */
	private void parseHtmlText(Node node, StringBuffer buf) {
		List<Node> children = node.childNodes();
		if(children.isEmpty() && node instanceof TextNode) {
			String text = node.toString().trim();
			for(String ch : ESCAPE_SEQUENCE) {
				text = text.replaceAll(ch, "");
			}
			if(!text.isEmpty()) {
				buf.append(text.toLowerCase().trim()).append("\n");
			}
		} else {
			for(Node child : children) {
				parseHtmlText(child, buf);
			}
		}
	}
	
	/**
	 * Analyze text, extract terms by Lucene analyzer.
	 * @param content
	 */
	private void collectWords(String content) {
		StringReader reader = new StringReader(content);
		Analyzer a = new EnglishAnalyzer(Version.LUCENE_36);
		TokenStream ts = a.tokenStream("", reader);
		TermAttribute ta = ts.addAttribute(TermAttribute.class);
		OffsetAttribute oa = ts.addAttribute(OffsetAttribute.class);
		Pos pos = new Pos();
		try {
			while(ts.incrementToken()) {
				Pos nextPos = new Pos(oa.startOffset(), oa.endOffset());
				nextPos.gap = nextPos.startOffset - pos.endOffset;
				Word word = new Word(ta.term(), nextPos);
				wordList.addLast(word);
				pos = nextPos;
				// is number?
				Matcher m = numberPattern.matcher(word.text);
				if(m.find()) {
					word.isNumber = true;
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Compute number words relations.
	 */
	private void analyzeNumberWords() {
		for(int i=0; i<wordList.size(); i++) {
			Word w = wordList.get(i);
			if(w.isNumber) {
				NumberSequence ns = new NumberSequence();
				ns.numberWords.add(w);
				// compute backwards words
				for(int j=Math.max(0, i-1); j>=Math.max(i-maxWordCount, 0); j--) {
					if(!wordList.get(j).isNumber) {
						ns.backwardsWords.add(wordList.get(j));
					}
				}
				// recognize nearest number string sequence
				int gap = 0;
				int k = i+1;
				// bound k by the list size so that a number at the end of the page cannot overrun the list
				for(; k<wordList.size(); k++) {
					if(gap==0) {
						gap = wordList.get(k).pos.gap;
					}
					if(gap<=maxGap && wordList.get(k).isNumber) {
						ns.numberWords.add(wordList.get(k));
						ns.pos.gap += wordList.get(k).pos.gap;
						gap = 0;
					} else {
						break;
					}
				}
				// continue the outer loop after the last number word of this sequence
				i = k-1;
				// compute forwards words (up to maxWordCount positions after the sequence)
				for(int p=i+1; p<=Math.min(wordList.size()-1, i+maxWordCount); p++) {
					if(!wordList.get(p).isNumber) {
						ns.forwardsWords.add(wordList.get(p));
					}
				}
				numberSequenceList.add(ns);
			}
		}
	}
	
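	// HTML entities to strip from the extracted text
	private static String[] ESCAPE_SEQUENCE = new String[] {
		"&quot;", "&amp;", "&mdash;", "&ndash;", "&permil;",
		"&nbsp;", "&ensp;", "&emsp;", "&thinsp;", "&zwnj;", "&zwj;",
		"&sbquo;", "&tilde;", "&circ;", "&lrm;", "&rlm;",
		"&times;", "&divide;", "&ldquo;", "&rdquo;", "&bdquo;", 
		"&lt;", "&gt;", "&lsaquo;", "&rsaquo;", "&lsquo;", "&rsquo;",
		"&iexcl;", "&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;", 
		"&sect;", "&uml;", "&copy;", "&ordf;", "&laquo;", "&not;",
		"&shy;", "&reg;", "&macr;", "&deg;", "&plusmn;", "&sup2;",
		"&sup3;", "&acute;", "&micro;", "&para;", "&middot;", "&cedil;",
		"&sup1;", "&ordm;", "&raquo;", "&frac14;", "&frac12;", "&frac34;",
		"&iquest;", "&Agrave;", "&Aacute;", "&circ;", "&Atilde;", "&Auml;",
		"&ring;", "&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;",
		"&Euml;", "&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;",
		"&Ntilde;", "&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;",
		"&times;", "&Oslash;", "&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;",
		"&Yacute;", "&THORN;", "&szlig;", "&agrave;", "&aacute;", "&acirc;",
		"&atilde;", "&auml;", "&aring;", "&aelig;", "&ccedil;", "&egrave;",
		"&eacute;", "&ecirc;", "&euml;", "&igrave;", "&iacute;", "&icirc;",
		"&iuml;", "&eth;", "&ntilde;", "&ograve;", "&oacute;", "&ocirc;",
		"&otilde;", "&ouml;", "&divide;", "&oslash;", "&ugrave;", "&uacute;",
		"&ucirc;", "&uuml;", "&yacute;", "&yuml;"
	}; 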
	
	/**
	 * Number sequence who holds:
	 * <pre>
	 * a number {@link Word} list which we analyzed from text of a page
	 * a correlation index
	 * a forwards {@link Word} list 
	 * a backwards {@link Word} list
	 * a {@link Pos} which specifies this number sequence's position information
	 * </pre>
	 * @author shirdrn
	 */
	public static class NumberSequence {
		
		/** This sequence's position metadata */
		Pos pos = new Pos();
		/** Number word collection */
		List<Word> numberWords = new LinkedList<Word>();
		/** Words that follow the number sequence */
		List<Word> forwardsWords = new LinkedList<Word>();
		/** Words that precede the number sequence */
		List<Word> backwardsWords = new LinkedList<Word>();
		double correlation;
		
		@Override
		public String toString() {
			return "[" +
				"correlation=" + correlation + ", " +
				"numberWords=" +numberWords + ", " +
				"forwardsWords=" + forwardsWords + ", " +
				"backwardsWords=" + backwardsWords + ", " + "]";
		}

	}
	
	/**
	 * Word unit analyzed by Lucene's {@link Analyzer}. Here
	 * a {@link Word} is minimum and is not split again. 
	 * @author shirdrn
	 */
	static class Word {
		
		/** Word text */
		String text;
		/** Is this word a number? */
		boolean isNumber;
		/** Word's position metadata */
		Pos pos;
		
		public Word(String text, Pos pos) {
			super();
			this.text = text;
			this.pos = pos;
		}
		
		@Override
		public String toString() {
			return "[" +text + pos + "]";
		}
	}
	
	/**
	 * Position information
	 * @author shirdrn
	 */
	static class Pos {
		
		/** Start offset of a word */
		int startOffset;
		/** End offset of a word */
		int endOffset;
		/** Distance between two words: start offset minus the previous word's end offset */
		int gap;
		
		public Pos() {
			super();
		}
		
		public Pos(int startOffset, int endOffset) {
			super();
			this.startOffset = startOffset;
			this.endOffset = endOffset;
		}
		
		@Override
		public String toString() {
			return "<" + startOffset + ", " + endOffset + ", " + gap + ">";
		}
	}

	public List<NumberSequence> getSortedNumberSequenceSet() {
		return sortedNumberSequenceSet;
	}

	public String getCleanedContent() {
		return cleanedContent;
	}
}

The output, including the text content of the cleaned page, is as follows:

click here to go to our u.s. or arabic versions
close
cnn
edition: international
u.s.
mxico
arabic
tv
:
cnn
cnni
cnn en espaol
hln
sign up
log in
home
video
world
u.s.
africa
asia
europe
latin america
middle east
business
world sport
entertainment
tech
travel
ireport
about cnn.com/international
cnn.com/international:
the international edition of
cnn.com
is constantly updated to bring you the top news stories from around the world. it is produced by dedicated staff in london and hong kong, working with colleagues at cnn's world headquarters in atlanta, georgia, and with bureaus worldwide. cnn.com relies heavily on cnn's global team of over 4,000 news professionals.
cnn.com/international
features the latest multimedia technologies, from live video streaming to audio packages to searchable archives of news features and background information. the site is updated continuously throughout the day.
contact us:
help us make your comments count. use our
viewer comment page
to tell us what you think about our shows, our anchors, and our hot topics for the day.
help page:
visit our
extensive faqs
for answers to all of your questions, from cnn tv programming to rss to the cnn member center.
cnn:
back to top
what's on:
click here for the full rundown of all
cnn daily programming
.
who's on:
click here for full bios on all of
cnn's anchors, correspondents and executives
.
press office:
click here for information from
cnn international press offices
.
cnn's parent company:
time warner inc.
services:
back to top
your e-mail alerts:
your e-mail alerts, is a free, personalized news alerting service created for you.
with cnn's service you can:
•sign up for your e-mail alerts and follow the news that matters to you.
•select key words and topics across the wide range of news and information on the site.
•create your own alerts.
•customize your delivery options to fit your schedule and be alerted as a story is published on cnn.com. receive your alerts daily or weekly.
•easily manage your alerts. edit, delete, suspend or re-activate them at any time.
register
to be a member and begin customizing your e-mail alerts today!
cnn.com preferences:
personalize your cnn.com page experience
today and receive breaking news in your e-mail inbox and on your cell phone, get your hometown weather on the home page and set your news edition to your world region.
cnn mobile:
cnn.com/international content is now available through your mobile phone. with
cnn mobile
, you can read up-to-the-minute news stories with color photos, watch live, streaming video or the latest video on demand clips and receive cnn breaking news text alerts. no matter where your on-the-go lifestyle takes you, cnn brings the news directly to you.
e-mail newsletters:
be the first to know with a variety of e-mail news services. receiving breaking news alerts, delivered straight to your e-mail address. follow the latest news on politics, technology, health or the topics that interest you most. or stay informed on what's coming up on your favorite cnn tv programs.
cnn offers e-mail updates as numerous and diverse as your tastes.
register now
and select from the various e-mails.
advertise on cnn.com:
advertise with us!
get information about advertising on the cnn web sites.
business development:
companies interested in partnering with cnn should contact cnn business development by sending an e-mail to
[email protected]
.
job search:
visit our web sites for information about internships or job opportunities with cnn international in
europe, middle east, africa
and
other regions
legal terms and conditions:
back to top
cnn interactive service agreement:
view the terms of the
cnn interactive services agreement
.
cnn comment policy:
cnn encourages you to add comment to our discussions. you may not post any unlawful, threatening, libelous, defamatory, obscene, pornographic or other material that would violate the law. please note that cnn makes reasonable efforts to review all comments prior to posting and cnn may edit comments for clarity or to keep out questionable or off-topic material. all comments should be relevant to the post and remain respectful of other authors and commenters. by submitting your comment, you hereby give cnn the right, but not the obligation, to post, air, edit, exhibit, telecast, cablecast, webcast, re-use, publish, reproduce, use, license, print, distribute or otherwise use your comment(s) and accompanying personal identifying information via all forms of media now known or hereafter devised, worldwide, in perpetuity.
cnn privacy statement
.
privacy statement:
to better protect your privacy, we provide this notice explaining our
online information practices
and the choices you can make about the way your information is collected and used
cnn's reprint and copyright information:
copyrights and copyright agent. cnn respects the rights of all copyright holders and in this regard, cnn has adopted and implemented a policy that provides for the termination in appropriate circumstances of subscribers and account holders who infringe the rights of copyright holders. if you believe that your work has been copied in a way that constitutes copyright infringement, please provide cnn's copyright agent the following information required by the online copyright infringement liability limitation act of the digital millennium copyright act, 17 u.s.c.  512:
•a physical or electronic signature of a person authorized to act on behalf of the owner of an exclusive right that is allegedly infringed.
•identification of the copyright work claimed to have been infringed, or, if multiple copyrighted works at a single online site are covered by a single notification, a representative list of such works at that site.
•identification of the material that is claimed to be infringing or to be the subject of infringing activity and that is to be removed or access to which is to be disabled, and information reasonably sufficient to permit us to locate the material.
•information reasonably sufficient to permit us to contact the complaining party.
•a statement that the complaining party has a good-faith belief that use of the material in the manner complained of is not authorized by the copyright owner, its agent, or the law.
•a statement that the information in the notification is accurate, and under penalty of perjury, that the complaining party is authorized to act on behalf of the owner of an exclusive right that is allegedly infringed.
cnn's copyright agent for notice of claims of copyright infringement on or regarding this site can be reached by sending an email to
[email protected]
or writing to-
copyright agent
one cnn center
atlanta, ga 30303
phone: (404) 878-2276
fax: (404) 827-1995
email:
copyrightagent@turner.com
for any questions or requests other than copyright issues, please view our
extensive faqs
.
weather forecast
home
|
video
|
world
|
u.s.
|
africa
|
asia
|
europe
|
latin america
|
middle east
|
business
|
world sport
|
entertainment
|
tech
|
travel
|
ireport
tools  widgets
|
podcasts
|
blogs
|
cnn mobile
|
my profile
|
e-mail alerts
|
cnn radio
|
cnn shop
|
site map
|
cnn partner hotels
cnn en espaol
|
cnn chile
|
cnn expansion
|
|
|
|
cnn tv
|
hln
|
transcripts
2010 cable news network.
turner broadcasting system, inc.
all rights reserved.
terms of service
|
privacy guidelines
|
advertising practices
|
advertise with us
|
about us
|
contact us
|
help

Finally, the computation simply produces the sorted list, so the effect of the ranking can be inspected directly:

[correlation=57696.8, numberWords=[[404<6705, 6708, 3>], [878<6710, 6713, 2>], [2276<6714, 6718, 1>]], forwardsWords=[[fax<6719, 6722, 1>], [email<6739, 6744, 1>]], backwardsWords=[[phone<6697, 6702, 1>], [ga<6688, 6690, 2>], [atlanta<6679, 6686, 1>], [center<6672, 6678, 1>]], ]
[correlation=57542.45, numberWords=[[404<6725, 6728, 3>], [827<6730, 6733, 2>], [1995<6734, 6738, 1>]], forwardsWords=[[email<6739, 6744, 1>], [copyrightag<6746, 6760, 2>], [turner.com<6761, 6771, 1>], [ani<6776, 6779, 5>], [question<6780, 6789, 1>]], backwardsWords=[[fax<6719, 6722, 1>], [phone<6697, 6702, 1>]], ]
[correlation=154.35, numberWords=[[30303<6691, 6696, 1>]], forwardsWords=[[phone<6697, 6702, 1>], [fax<6719, 6722, 1>]], backwardsWords=[[ga<6688, 6690, 2>], [atlanta<6679, 6686, 1>], [center<6672, 6678, 1>], [cnn<6668, 6671, 1>], [on<6664, 6667, 1>]], ]
[correlation=0.0, numberWords=[[17<5371, 5373, 2>]], forwardsWords=[[u.s.c<5374, 5379, 1>], [physic<5390, 5398, 5>], [electron<5402, 5412, 4>], [signatur<5413, 5422, 1>]], backwardsWords=[[act<5366, 5369, 1>], [copyright<5356, 5365, 1>], [millennium<5345, 5355, 1>], [digit<5337, 5344, 8>], [act<5326, 5329, 1>]], ]
[correlation=0.0, numberWords=[[512<5382, 5385, 3>]], forwardsWords=[[physic<5390, 5398, 5>], [electron<5402, 5412, 4>], [signatur<5413, 5422, 1>], [person<5428, 5434, 6>], [author<5435, 5445, 1>]], backwardsWords=[[u.s.c<5374, 5379, 1>], [act<5366, 5369, 1>], [copyright<5356, 5365, 1>], [millennium<5345, 5355, 1>]], ]
[correlation=0.0, numberWords=[[2010<7239, 7243, 1>]], forwardsWords=[[cabl<7244, 7249, 1>], [new<7250, 7254, 1>], [network<7255, 7262, 1>], [turner<7264, 7270, 2>], [broadcast<7271, 7283, 1>]], backwardsWords=[[transcript<7227, 7238, 3>], [hln<7221, 7224, 3>], [tv<7216, 7218, 1>], [cnn<7212, 7215, 9>], [expans<7194, 7203, 1>]], ]

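Let us explain the fields. Each word is printed as [text<startOffset, endOffset, gap>], where gap is the number of characters between the end of the previous token and the start of this one.

numberWords is the final collection of digit strings (all digits);

forwardsWords is the collection of words that follow the digit sequence represented by numberWords;

backwardsWords is the collection of words that precede the digit sequence represented by numberWords.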
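The offset and gap bookkeeping behind this output can be reproduced with a small stand-alone sketch. It swaps the Lucene analyzer for a plain regular expression that matches runs of lowercase letters and digits, so it only approximates the real tokenization; `OffsetTokenizer` and `Token` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetTokenizer {

    /** Hypothetical token type: text plus start offset, end offset and gap. */
    static class Token {
        final String text;
        final int start;
        final int end;
        final int gap;  // characters between the previous token's end and this start

        Token(String text, int start, int end, int gap) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.gap = gap;
        }

        @Override
        public String toString() {
            return "[" + text + "<" + start + ", " + end + ", " + gap + ">]";
        }
    }

    // Stand-in for the analyzer: a token is a run of lowercase letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[a-z0-9]+");

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<Token>();
        Matcher m = TOKEN.matcher(text);
        int previousEnd = 0;
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end(), m.start() - previousEnd));
            previousEnd = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [[phone<0, 5, 0>], [404<8, 11, 3>], [878<13, 16, 2>], [2276<17, 21, 1>]]
        System.out.println(tokenize("phone: (404) 878-2276"));
    }
}
```

The gaps of 3, 2 and 1 are the same values that appear for 404, 878 and 2276 in the first result line; all of them are within the maxGap of 5, which is why the three tokens form one sequence.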

Tidied up, the result above is easier to read:

[correlation=57696.8, numberWords=[404-878-2276], forwardsWords=[fax, email], backwardsWords=[phone, ga, atlanta, center]]
[correlation=57542.45, numberWords=[404-827-1995], forwardsWords=[email, copyrightag, turner.com, ani, question], backwardsWords=[fax, phone]]
[correlation=154.35, numberWords=[30303], forwardsWords=[phone, fax], backwardsWords=[ga, atlanta, center, cnn, on]]
[correlation=0.0, numberWords=[17], forwardsWords=[u.s.c, physic, electron, signatur], backwardsWords=[act, copyright, millennium, digit, act]]
[correlation=0.0, numberWords=[512], forwardsWords=[physic, electron, signatur, person, author], backwardsWords=[u.s.c, act, copyright, millennium]]
[correlation=0.0, numberWords=[2010], forwardsWords=[cabl, new, network, turner, broadcast], backwardsWords=[transcript, hln, tv, cnn, expans]]
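
Comparing with the page text above, the first entry has the highest score and is indeed the phone number; the second is the fax number. It scores almost as high because the word phone still falls within its backward window.

Finally, to make the extracted phone numbers more accurate, the candidates can be filtered and validated in various ways, which improves recognition precision to some extent.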
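One simple validation of this kind is a digit-count check, sketched below. It is only an illustration and not part of the original implementation: `DigitCountFilter` is a hypothetical name, the lower bound of 7 digits is an assumption, and the upper bound of 15 follows the maximum length of an international number under ITU-T E.164.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DigitCountFilter {

    /** Counts the ASCII digits in a candidate such as "404-878-2276". */
    static int countDigits(String candidate) {
        int count = 0;
        for (int i = 0; i < candidate.length(); i++) {
            char c = candidate.charAt(i);
            if (c >= '0' && c <= '9') {
                count++;
            }
        }
        return count;
    }

    /** Keeps only the candidates whose digit count lies in [minDigits, maxDigits]. */
    static List<String> filter(List<String> candidates, int minDigits, int maxDigits) {
        List<String> kept = new ArrayList<String>();
        for (String candidate : candidates) {
            int digits = countDigits(candidate);
            if (digits >= minDigits && digits <= maxDigits) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("404-878-2276", "404-827-1995", "30303", "17", "512", "2010");
        // prints [404-878-2276, 404-827-1995]
        System.out.println(filter(ranked, 7, 15));
    }
}
```

A check like this removes the postal code, the statute numbers and the year, but it cannot tell the phone number from the fax number; that still depends on the context words.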

