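Continuing from the previous post:
Example 3:
Data extraction
Requirement: extract every email address, and the link address in every <a href...> tag, from a piece of HTML.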
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTest {
    public static void main(String[] args) {
        String htmlText = ""
                + ".leemaster@163"
                + "luckdog.com"
                + "";
        System.out.println("Checking for email addresses");
        for (String email : extractEmail(htmlText)) {
            System.out.println("Email: " + email);
        }
        System.out.println("Checking for hyperlinks");
        for (String link : extractLink(htmlText)) {
            System.out.println("Hyperlink: " + link);
        }
    }

    // Collects every substring of htmlText that matches the href pattern.
    private static List<String> extractLink(String htmlText) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(Regexes.HREF_LINK_REGEX);
        Matcher m = p.matcher(htmlText);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Collects every substring of htmlText that matches the email pattern.
    private static List<String> extractEmail(String htmlText) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(Regexes.EMAIL_REGEX);
        Matcher m = p.matcher(htmlText);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}

public class Regexes {
    public static final String EMAIL_REGEX =
            "(?i)(?<=\\b)[a-z0-9][-a-z0-9_.]+[a-z0-9]@([a-z0-9][-a-z0-9]+\\.)+[a-z]{2,4}(?=\\b)";
    public static final String HREF_LINK_REGEX
}
Output:
Checking for email addresses
Checking for hyperlinks
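Because the value of HREF_LINK_REGEX is not shown above, here is a minimal sketch of one way the href values of <a> tags could be extracted. The pattern, the class name HrefExtractSketch, and the sample HTML are all assumptions made for illustration, not the article's own definitions. The pattern only handles quoted attribute values, so treat it as a rough approximation rather than a replacement for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractSketch {
    // Assumed pattern (not from the article): inside an <a ...> tag, find
    // href = "..." or href = '...' and capture the value in group 2.
    // Group 1 remembers which quote character opened the value.
    static final Pattern HREF = Pattern.compile(
            "(?i)<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1");

    static List<String> extractLinks(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(2)); // the URL without its surrounding quotes
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String html = "<p>Contact: <a href=\"mailto:someone@example.com\">mail</a>, "
                + "<a class='x' href='http://example.com/page'>page</a></p>";
        for (String link : extractLinks(html)) {
            System.out.println("Hyperlink: " + link);
        }
    }
}
```

Returning m.group(2) instead of m.group() yields only the address itself, which is what the requirement asks for; m.group() would return the matched part of the tag as well.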
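Example 4:
Finding duplicate words
Requirement: check whether a piece of text contains a word repeated twice in a row and, if it does, remove the duplicate.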
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindWord {
    public static void main(String[] args) {
        String[] sentences = new String[] {
            "this is a normal sentence",
            "Oh,my god!Duplicate word word",
            "This sentence contain no duplicate word words"
        };
        for (String sentence : sentences) {
            System.out.println("Checking sentence: " + sentence);
            if (containDupWord(sentence)) {
                System.out.println("Duplicate word found!!");
                System.out.println("Removing duplicate words: " + removeDupWords(sentence));
            }
            System.out.println("");
        }
    }

    // Replaces each "word word" pair with a single copy of the word (group 1).
    private static String removeDupWords(String sentence) {
        String regex = Regexes.DUP_WORD_REGEX;
        return sentence.replaceAll(regex, "$1");
    }

    // Returns true if the sentence contains a word immediately followed by itself.
    private static boolean containDupWord(String sentence) {
        String regex = Regexes.DUP_WORD_REGEX;
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(sentence);
        return m.find();
    }
}

public class Regexes {
    // (\\w+) captures a word, \\s+\\1 requires the same word again after whitespace,
    // and the trailing (?=\\b) keeps "word words" from counting as a repeat.
    public static final String DUP_WORD_REGEX =
            "(?<=\\b)(\\w+)\\s+\\1(?=\\b)";
}
Output:
Checking sentence: this is a normal sentence
Checking sentence: Oh,my god!Duplicate word word
Duplicate word found!!
Removing duplicate words: Oh,my god!Duplicate word
Checking sentence: This sentence contain no duplicate word words
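DUP_WORD_REGEX removes one repeat per match, so a run such as "word word word" is reduced only to "word word" in a single replaceAll pass, and "The the" is not caught because the comparison is case-sensitive. The following is a minimal sketch of a variant that handles both cases. The pattern and the names DupWordSketch and collapse are assumptions for illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

class DupWordSketch {
    // Assumed variant (not from the article):
    //   (?i)            compare the repeated word without regard to case
    //   \b(\w+)         capture a whole word starting at a word boundary
    //   (?:\s+\1\b)+    consume one or more immediate repeats of that word
    static final Pattern DUP_RUN = Pattern.compile("(?i)\\b(\\w+)(?:\\s+\\1\\b)+");

    // Replaces each run of repeated words with its first occurrence.
    static String collapse(String sentence) {
        return DUP_RUN.matcher(sentence).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("word word word"));       // prints "word"
        System.out.println(collapse("The the cat"));          // prints "The cat"
        System.out.println(collapse("duplicate word words")); // unchanged
    }
}
```

The trailing \b inside the repeated group plays the same role as the (?=\\b) lookahead in DUP_WORD_REGEX: it stops "word words" from being treated as a repeat.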
To be continued...