Jsoup Crawling and Bypassing Anti-Crawler Measures

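1 Java can crawl data from third-party websites too.

Note, common anti-crawler checks to watch for: 1 IP-based limits

         2 the referer request header

         3 the User-Agent (UA) header, which you can spoof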

Let's try it out against a third-party proxy-list website.

{
        Random r = new Random();
        String[] ua = {"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
                "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
                "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
                "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0"};
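        // Use the array length as the bound; a hard-coded 14 never selects the last entry
        int i = r.nextInt(ua.length);
        logger.info("Checking ------ {}:{}", ip, port);
        // Form parameters sent with the POST request
        Map<String, String> map = new HashMap<>();
        map.put("waybillNo","DD1838768852");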
        try {
            total ++ ;
            long a = System.currentTimeMillis();
            // Target site to crawl (here a proxy-IP listing site); remember to swap in the real URL
            Document doc = Jsoup.connect("http://xxxx.com/dayProxy/ip/314639.html")
                    .timeout(5000)
                    //.proxy(ip, port)
                    .data(map)
                    .ignoreContentType(true)
                    .userAgent(ua[i])
                    .header("referer","http://xxxx.com/dayProxy.html") // remember to change this referer too
                    .post();
            System.out.println(ip + ":" + port + " elapsed (ms): " + (System.currentTimeMillis() - a) + "   response: " + doc.text());
            suc ++ ;
        } catch (IOException e) {
            e.printStackTrace();
            fail ++ ;
        }finally {
            if (total == count ) {
                System.out.println("Total attempts: " + total);
                System.out.println("Succeeded: " + suc);
                System.out.println("Failed: " + fail);
            }
        }
    }
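
The User-Agent rotation in the snippet above can be pulled into a small helper that takes its bound from the list size, so adding or removing entries never leaves one unreachable or out of range. This is only a sketch: `UserAgentPool` is a hypothetical class name, not part of Jsoup, and the `Random` is passed in purely to make the behaviour easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not a Jsoup class): hands out a random User-Agent string.
final class UserAgentPool {
    private final List<String> agents;
    private final Random random;

    UserAgentPool(List<String> agents, Random random) {
        if (agents.isEmpty()) {
            throw new IllegalArgumentException("need at least one User-Agent");
        }
        this.agents = new ArrayList<>(agents); // defensive copy
        this.random = random;
    }

    // The bound is always the current list size, never a hard-coded number.
    String next() {
        return agents.get(random.nextInt(agents.size()));
    }
}
```

The result of `next()` would then be passed to `.userAgent(...)` in place of `ua[i]`.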

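The returned page is then parsed through org.jsoup.nodes.Document to extract each proxy's IP and port.

After that, in the same code as above, just uncomment

 .proxy(ip, port)

and fill in the extracted IP and port to switch on proxy mode.

This is enough to get past roughly 90% of anti-crawler measures.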
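
As a rough sketch of the extraction step, the text returned by `doc.text()` can be scanned for `ip:port` pairs with a regular expression. Everything here is an assumption: `ProxyListParser` is a hypothetical helper, and the `ip:port` layout is only a guess at how a proxy-list page might present its entries. A real page may put the IP and port in separate table cells, in which case selecting the elements would be the better route.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Jsoup class): pulls "ip:port" pairs out of plain text.
final class ProxyListParser {
    // Assumption: entries look like 1.2.3.4:8080. The lookarounds stop the
    // pattern from matching inside a longer run of digits or dots.
    private static final Pattern IP_PORT = Pattern.compile(
            "(?<![\\d.])(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d{1,5})(?!\\d)");

    static List<InetSocketAddress> parse(String text) {
        List<InetSocketAddress> result = new ArrayList<>();
        Matcher m = IP_PORT.matcher(text);
        while (m.find()) {
            String ip = m.group(1);
            int port = Integer.parseInt(m.group(2));
            if (isValidIp(ip) && port > 0 && port <= 65535) {
                // createUnresolved avoids a DNS lookup while parsing.
                result.add(InetSocketAddress.createUnresolved(ip, port));
            }
        }
        return result;
    }

    // Each octet must be in the range 0..255.
    private static boolean isValidIp(String ip) {
        for (String part : ip.split("\\.")) {
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }
}
```

Each returned address can then be fed to the request as `.proxy(addr.getHostString(), addr.getPort())`.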
