Crawling Email Addresses

Requirement: collect the email addresses that appear on a given web page.

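import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawTest
{
    public static void Craw() throws Exception
    {
        URL url = new URL("https://tieba.baidu.com/p/2702208078?pid=109728187052&cid=109728324464#109728324464");
        URLConnection conn = url.openConnection(); // open a connection to the page
        String mailregx = "\\w+@\\w+(\\.\\w+)+"; // email regex; \w matches a letter, digit, or underscore
        Pattern p = Pattern.compile(mailregx); // compile the regex into a reusable Pattern object
        // An ArrayList was used at first to hold the addresses, but it collected duplicates.
        // A Set keeps each element only once, so a Set is used instead.
        Set<String> set = new HashSet<>();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader bufin = new BufferedReader(new InputStreamReader(conn.getInputStream())))
        {
            String line;
            while ((line = bufin.readLine()) != null) // read the page line by line
            {
                Matcher m = p.matcher(line); // bind the pattern to this line, producing a matcher
                while (m.find()) // loop over every substring that matches the pattern
                {
                    set.add(m.group());
                }
            }
        }
        for (String string : set) {
            System.out.println(string); // print each distinct match
        }
    }
    public static void main(String[] args) throws Exception {
        Craw();
    }
}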
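To try the matching step without a network connection, the same regex can be run against an in-memory string. The following is only a minimal sketch: the class name `EmailExtractDemo`, the helper `extract`, and the sample text (which uses the reserved `example.com` domain) are made up for illustration and are not part of the original program. It also swaps `HashSet` for `LinkedHashSet`, purely so that the printed order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original post.
class EmailExtractDemo {
    // Same pattern as in the crawler above.
    private static final Pattern MAIL = Pattern.compile("\\w+@\\w+(\\.\\w+)+");

    // Returns each distinct match once, in order of first appearance.
    static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = MAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text; no real addresses.
        String sample = "Contact alice@example.com or bob_1@mail.example.com; "
                + "alice@example.com is listed twice, first.last@example.com has a dot.";
        for (String mail : extract(sample)) {
            System.out.println(mail);
        }
    }
}
```

The duplicate address is stored only once. Note that `first.last@example.com` is captured as `last@example.com`, because `\w` does not match a dot; if the target page contains such addresses, the part of the pattern before the `@` would need to be widened.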

Output:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
 ...
