java爬虫(jsoup)如何设置http代理ip爬数据

        做java好长时间了,一直没时间,也没心情写笔记。今天空下来认真写一篇,以后也坚持记录下点滴,好好学习。最近由于项目需要,转战挖掘数据的领域,说实话,一开始没接触过,所以踩的坑会比较多,不过没见过猪跑,还没吃过猪肉吗。所以也就慢慢的摸索过来了。下面有几个大坑爬数据的人应该都得经历过,所以我把自己的经验记录下,方便自己跟大家吧。

    现在爬数据越来越难,各种反爬,简单的网站没做什么反爬,就随便介绍下:

1.随便找点网站弄点免费的爬虫代理ip,去爬一下,太简单就不介绍了,免费的大家都知道一般不太好用,目前最好用的动态代理ip是蘑菇代理的,。

 具体说下,稍微有点爬虫技术含量的吧,怎么样伪装自己的爬虫程序,尽量避免反爬:

1.请求头的user-agent参数必不可少,而且!!!!要随机,这里是大坑,我之前就是没有随机,然后爬了几天就被人反爬了,醉了,我当时还以为代理的问题,一直跟客服比比比比比,说他们代理被封了,后来才发现是我的请求头里面的user-agent被封了,然后心里愧疚的跟客服小姐姐抱歉了下。。。僵硬。 user-agent是浏览器的标识,所以越多越好,大量的随机,跟代理ip一样重要!我先提供一部分,也放不了这么多。

 String[] ua = {"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
        "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0"};

2. 来源!来源!  说实话之前一直没发现,后来是访问某网站的时候发现的,反爬做这么多干嘛,累啊,互联网,数据大家一起用嘛!!  请求头的referer这个参数就是记录的来源。为什么要伪装这个参数。我详细的说明下,你来源不伪装,就直接请求别人的接口,凭什么,他这个接口可能只是给页面调用的。。浏览器请求的时候都有来源,你不伪装,不就暴露了,具体传什么参数,不同的网站都不一样,可以F12看下浏览器请求的时候传的什么。

3. 代理ip必不可少,这里用免费的就不太好了,因为既然要爬数据,肯定要快,ip的要求就比较高,而且要有效的数量比较多,不然别人网站升级什么的,你没爬完,爬虫程序就蹦了。所以让老板花钱省心省力提高效率 。代理Ip的网站现在很多,我就不随便推荐了,免得查水表,现在很多家出的隧道代理是最实惠好用的,代理ip是真的多。

4. 源码我就随便贴一下,应该是可以跑的,就是获取代理ip的url记得换下。

import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import net.sf.json.JSONObject;

public class Test {
    //获取代理ip,记得更换,我用的是蘑菇代理的,可以换成其他的网站的
    private final static String GET_IP_URL = "http://piping.mogumiao.com/proxy/api/get_ip_bs?appKey=xxxxx&count=10&format=1";
    public static void main(String[] args) throws InterruptedException {
        List addrs = new LinkedList();
        Map addr_map = new HashMap();
        Map ipmap = new HashMap();
        ExecutorService exe = Executors.newFixedThreadPool(10);
        for (int i=0 ;i<1;i++) {
            Document doc = null;
            try {
                doc = Jsoup.connect(GET_IP_URL).get();
            } catch (IOException e) {
                continue;
            }
            System.out.println(doc.text());
            JSONObject jsonObject = JSONObject.fromObject(doc.text());
            List> list = (List>) jsonObject.get("msg");
            int count = list.size();

            for (Map map : list ) {
                String ip = (String)map.get("ip");
                String port = (String)map.get("port") ;
                ipmap.put(ip,"1");
                checkIp a = new checkIp(ip, new Integer(port),count);
                exe.execute(a);
            }
            exe.shutdown();
            Thread.sleep(1000);
        }
    }
}


class checkIp implements Runnable {
    private static Logger logger = LoggerFactory.getLogger(aaa.class);
    private static int suc=0;
    private static int total=0;
    private static int fail=0;

    private String ip ;
    private int port;
    private int count;
    public checkIp(String ip, int port,int count) {
        super();
        this.ip = ip;
        this.port = port;
        this.count = count;
    }

    @Override
    public void run() {
        Random r = new Random();
        String[] ua = {"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
                "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
                "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
                "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0"};
        int i = r.nextInt(14);
        logger.info("检测中------ {}:{}",ip,port );
        Map map = new HashMap();
        map.put("waybillNo","DD1838768852");
        try {
            total ++ ;
            long a = System.currentTimeMillis();
            //爬取的目标网站,url记得换下。。。!!!
            Document doc = Jsoup.connect("http://trace.yto.net.cn:8022/TraceSimple.aspx")
                    .timeout(5000)
                    .proxy(ip, port, null)
                    .data(map)
                    .ignoreContentType(true)
                    .userAgent(ua[i])
                    .header("referer","http://trace.yto.net.cn:8022/gw/index/index.html")//这个来源记得换..
                    .post();
            System.out.println(ip+":"+port+"访问时间:"+(System.currentTimeMillis() -a) + "   访问结果: "+doc.text());
            suc ++ ;
        } catch (IOException e) {
            e.printStackTrace();
            fail ++ ;
        }finally {
            if (total == count ) {
                System.out.println("总次数:"+total);
                System.out.println("成功次数:"+suc);
                System.out.println("失败次数:"+fail);
            }
        }
    }

}



总结:反正就是要尽量所有参数都伪装成类似浏览器的样子,而且!!!你要时时刻刻都记住,自己是成千上万个用户,自己不是一个人,自己会影分身,所以你的程序不要伪装成一个人去访问,要看成很多人去访问,这样目标网站就很难从你的影分身中抓到真正的你。爬虫程序,全靠演戏!当然有些网站也有token什么的验证,具体网站具体分析,这种的就比较难了,需要破解验证的,我目前还没有需要研究这种的,等我踩完坑再来续写。

你可能感兴趣的:(java,爬虫,代理ip)