How to Make a Crawler Run Daily at a Fixed Time [Using Stock Data as an Example]

  • Analyzing the data to crawl
  • Packet capture
  • Framework
  • model
  • main
  • util
  • parse
  • db
  • Open problems
  • Solutions
    • job
    • jobmain

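Recently, someone copied posts from my blog and uploaded them directly to Baidu Wenku and similar platforms.
This is an original blog post, intended for technical study only. Copying it and uploading it to Baidu Wenku or similar platforms without permission is prohibited. If you repost it, please cite the address (link) of this blog post.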

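Analyzing the data to crawl

This article uses data from Eastmoney (东方财富网) as its example. It is for technical study only, so please do not abuse it. The data to be crawled are Eastmoney's automobile-sector and petroleum-sector stock data, at the following addresses:
http://quote.eastmoney.com/center/list.html#28002481_0_2
http://quote.eastmoney.com/center/list.html#28002464_0_2
The screenshot below shows the data format.

[Figure 1]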

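Packet capture

The first step in writing a crawler is to capture the network traffic, that is, to find the real address the data is requested from, as covered in my earlier posts. For the reasoning behind the design used here, see my blog column on crawler principles and fundamentals: http://blog.csdn.net/column/details/14269.html.
[Figure 2]

The figure above shows the real request address and the request method. The response contains a JSON array, as shown below:
[Figure 3]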

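Framework

The project layout used in this article is shown below:
[Figure 4]

db: holds the database classes: MyDataSource (registers the database driver and sets the user name and password for the connection) and MYSQLControl (connects to the database and performs inserts, updates, table creation and so on).

model: encapsulates the data objects. Put plainly, it holds the attribute names of the data I want to work with. If this is unclear, see the simple web crawler I wrote earlier (http://blog.csdn.net/qy20115549/article/details/52203722).

parse: parses the content fetched by util. HTML is usually parsed with Jsoup; JSON data can be parsed with regular expressions or with fastjson. I recommend fastjson because it is simple and quick to use.

main: the entry point of the program and its core: it fetches the data, executes the database statements and stores the data.

job: the job task to be executed.

jobmain: the controller, which decides when the job runs; in this article, once per day. Stock trading closes at 3 p.m. each day, so the crawl is scheduled for some time after 3 p.m.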

model

model encapsulates the data I want to crawl, such as the current date, the stock id, the stock name and the stock price. The class is as follows:

package model;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
public class ExtMarketOilStockModel {
    private String date;
    private String stock_id;
    private String stock_name;
    private float stock_price;
    private float stock_change;
    private float stock_range;
    private float stock_amplitude;
    private int stock_trading_number;
    private int stock_trading_value;
    private float stock_yesterdayfinish_price;
    private float stock_todaystart_price;
    private float stock_max_price;
    private float stock_min_price;
    private float stock_fiveminuate_change;
    private String craw_time;
    public String getDate() {
        return date;
    }
    public void setDate(String date) {
        this.date = date;
    }

    public String getStock_id() {
        return stock_id;
    }
    public void setStock_id(String stock_id) {
        this.stock_id = stock_id;
    }
    public String getStock_name() {
        return stock_name;
    }
    public void setStock_name(String stock_name) {
        this.stock_name = stock_name;
    }
    public float getStock_price() {
        return stock_price;
    }
    public void setStock_price(float stock_price) {
        this.stock_price = stock_price;
    }
    public float getStock_change() {
        return stock_change;
    }
    public void setStock_change(float stock_change) {
        this.stock_change = stock_change;
    }
    public float getStock_range() {
        return stock_range;
    }
    public void setStock_range(float stock_range) {
        this.stock_range = stock_range;
    }
    public float getStock_amplitude() {
        return stock_amplitude;
    }
    public void setStock_amplitude(float stock_amplitude) {
        this.stock_amplitude = stock_amplitude;
    }

    public int getStock_trading_number() {
        return stock_trading_number;
    }
    public void setStock_trading_number(int stock_trading_number) {
        this.stock_trading_number = stock_trading_number;
    }
    public int getStock_trading_value() {
        return stock_trading_value;
    }
    public void setStock_trading_value(int stock_trading_value) {
        this.stock_trading_value = stock_trading_value;
    }
    public float getStock_yesterdayfinish_price() {
        return stock_yesterdayfinish_price;
    }
    public void setStock_yesterdayfinish_price(float stock_yesterdayfinish_price) {
        this.stock_yesterdayfinish_price = stock_yesterdayfinish_price;
    }
    public float getStock_todaystart_price() {
        return stock_todaystart_price;
    }
    public void setStock_todaystart_price(float stock_todaystart_price) {
        this.stock_todaystart_price = stock_todaystart_price;
    }
    public float getStock_max_price() {
        return stock_max_price;
    }
    public void setStock_max_price(float stock_max_price) {
        this.stock_max_price = stock_max_price;
    }
    public float getStock_min_price() {
        return stock_min_price;
    }
    public void setStock_min_price(float stock_min_price) {
        this.stock_min_price = stock_min_price;
    }
    public float getStock_fiveminuate_change() {
        return stock_fiveminuate_change;
    }
    public void setStock_fiveminuate_change(float stock_fiveminuate_change) {
        this.stock_fiveminuate_change = stock_fiveminuate_change;
    }
    public String getCraw_time() {
        return craw_time;
    }
    public void setCraw_time(String craw_time) {
        this.craw_time = craw_time;
    }
}

main

The main method should be kept as simple as possible, so this is how I wrote it. It is commented and easy to follow.

package navi.main;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
import java.util.ArrayList;
import java.util.List;

import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;

public class ExtMarketOilStockMain {

    public static void main(String[] args) throws Exception {
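        List<String> urloillist=new ArrayList<String>();
        List<String> urlcarlist=new ArrayList<String>();
        List<ExtMarketOilStockModel> oilstocks=new ArrayList<ExtMarketOilStockModel>();
        List<ExtMarketOilStockModel> carstocks=new ArrayList<ExtMarketOilStockModel>();
        //the petroleum-related stocks span only two pages, hence two URLs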
        String url1="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
        String url2="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
        urloillist.add(url1);
        urloillist.add(url2);
        for (int i = 0; i < urloillist.size(); i++) {
            //parse the url
            oilstocks=ExtMarketOilStockParse.parseurl(urloillist.get(i));
            //store the data of each page
            MYSQLControl.insertoilStocks(oilstocks);
        }
        //the automobile-related stocks span 6 pages, hence 6 URLs
        for (int i = 1; i <= 6; i++) {
            String urli="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page="+i+"&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
            urlcarlist.add(urli);
        }
        for (int i = 0; i < urlcarlist.size(); i++) {
            //parse the url
            carstocks=ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
            //store the data
            MYSQLControl.insertcarStocks(carstocks);
        }

    }

}
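
The hand-written URLs above differ only in the board code (the `cmd` parameter) and the `page` number, so the lists can be produced by one helper. The following is a minimal sketch, not the author's code: `PageUrlBuilder` and `build` are hypothetical names, the endpoint and token are copied from the URLs above, and the `_g` parameter is left out on the assumption that it is only a random cache-busting value.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrlBuilder {
    // Endpoint and query fragments copied from the URLs used in main.
    private static final String BASE =
        "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=";
    private static final String TAIL =
        "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}"
        + "&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123";

    // Builds one URL per page, for pages 1..pageCount inclusive.
    static List<String> build(String boardCode, int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            urls.add(BASE + boardCode
                + "&sty=FCOIATA&sortType=C&sortRule=-1&page=" + page + TAIL);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Board code of the automobile sector as used above, 6 pages.
        for (String url : build("C.BK04811", 6)) {
            System.out.println(url);
        }
    }
}
```

Using an inclusive upper bound makes it harder to drop the last page by accident.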

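util

There are three files here: HTTPUtils, TimeUtils (a class I use often, mainly for date conversions such as converting a String to a Date or getting the current time), and UumericalUtil (a class for rounding a float to a given number of decimal places).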

package util;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
public abstract class HTTPUtils {
    //sends a request to the server and returns the response body (HTML, JSON, etc.)
    public static String  getRawHtml(String personalUrl) throws InterruptedException,IOException {
        URL url = new URL(personalUrl);
        URLConnection conn = url.openConnection();
        InputStream in=null;
        try {
            conn.setConnectTimeout(3000);
            in = conn.getInputStream();
        } catch (Exception e) {
            //the request failed: in stays null, so an empty string is returned below
        }
        //convert the fetched data to a String
        String html = convertStreamToString(in);
        return html;
    }
    //converts an InputStream to a String
    public static String convertStreamToString(InputStream is) throws IOException {
        if (is == null)
            return "";
        BufferedReader reader = new BufferedReader(new InputStreamReader(is,"utf-8"));
        StringBuilder sb = new StringBuilder();
        String line = null;
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        reader.close();
        return sb.toString();

    }
}
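
In the listing above the reader is closed outside the `finally` block, so it can stay open if reading fails. Below is a minimal sketch of the same stream-to-string step written with try-with-resources. `StreamText` and `readAll` are hypothetical names, not part of the original project, and wrapping the checked exception in `UncheckedIOException` is a design choice of this sketch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamText {
    // Reads the whole stream as UTF-8 text, dropping line terminators
    // (the same behaviour as convertStreamToString above).
    static String readAll(InputStream in) {
        if (in == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream)
        // even if readLine throws.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "var quote_123={rank:[],\npages:0}".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```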

The following class handles conversions between various time formats; feel free to reuse it.

package util;

import java.text.DateFormat;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
public class TimeUtils {

    public static void main( String[] args ) throws ParseException{

        String time = getMonth("2002-1-08 14:50:38");
        System.out.println(time);
        System.out.println(getDay("2002-1-08 14:50:38"));
        System.out.println(TimeUtils.parseTime("2016-05-19 19:17","yyyy-MM-dd HH:mm"));

    }
    //get current time
    public static String GetNowDate(String formate){  
        String temp_str="";  
        Date dt = new Date();  
        SimpleDateFormat sdf = new SimpleDateFormat(formate);  
        temp_str=sdf.format(dt);  
        return temp_str;  
    }  
    public static String getMonth( String time ){

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM");
        Date date = null;
        try {

            date = sdf.parse(time);
            Calendar cal = Calendar.getInstance();
            cal.setTime(date);

        } catch (ParseException e) {
            e.printStackTrace();
        }

        return sdf.format(date);

    }

    public static String getDay( String time ){

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        Date date = null;
        try {

            date = sdf.parse(time);
            Calendar cal = Calendar.getInstance();
            cal.setTime(date);

        } catch (ParseException e) {
            e.printStackTrace();
        }

        return sdf.format(date);

    }

    public static Date parseTime(String inputTime) throws ParseException{

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");  
        Date date = sdf.parse(inputTime); 

        return date;

    }
    public static String dateToString(Date date, String type) { 
        DateFormat df = new SimpleDateFormat(type);  
        return df.format(date);  
    }
    public static Date parseTime(String inputTime, String timeFormat) throws ParseException{

        SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);  
        Date date = sdf.parse(inputTime); 

        return date;

    }

    public static Calendar parseTimeToCal(String inputTime, String timeFormat) throws ParseException{

        SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);  
        Date date = sdf.parse(inputTime); 
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);

        return calendar;

    }

    public static int getDaysBetweenCals(Calendar cal1, Calendar cal2) throws ParseException{

        return (int) ((cal2.getTimeInMillis()-cal1.getTimeInMillis())/(1000*24*3600));

    }

    public static Date parseTime(long inputTime){

        //  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date= new Date(inputTime);
        return date;

    }

    public static String parseTimeString(long inputTime){

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date= new Date(inputTime);
        return sdf.format(date);

    }
    public static String parseStringTime(String inputTime){

        String date=null;
        try {
            Date date1 = new SimpleDateFormat("yyyyMMddHHmmss").parse(inputTime);
            date=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date1);
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        return date;
    }
    public static List YearMonth(int year) {
        List yearmouthlist=new ArrayList();
        for (int i = 1; i < 13; i++) {
            DecimalFormat dfInt=new DecimalFormat("00");
            String sInt = dfInt.format(i);
            yearmouthlist.add(year+sInt);
        }

        return yearmouthlist;
    } 
    public static List YearMonth(int startyear,int finistyear) {
        List yearmouthlist=new ArrayList();
        for (int i = startyear; i < finistyear+1; i++) {
            for (int j = 1; j < 13; j++) {
                DecimalFormat dfInt=new DecimalFormat("00");
                String sInt = dfInt.format(j);
                yearmouthlist.add(i +"-"+sInt);
            }
        }
        return yearmouthlist;
    } 
    public static List TOAllDay(int year){
        List daylist=new ArrayList();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd"); 
        int m=1;//month counter 
        while (m<13) 
        { 
            int month=m; 
            Calendar cal=Calendar.getInstance();//get a calendar instance 
            cal.clear();//clear all fields 
            cal.set(Calendar.YEAR,year); 
            cal.set(Calendar.MONTH,month-1);//months are zero-based, so January is 0 
            cal.set(Calendar.DAY_OF_MONTH,1);//set to the 1st, the first day of this month  

            System.out.println("##########___" + sdf.format(cal.getTime())); 

            int count=cal.getActualMaximum(Calendar.DAY_OF_MONTH); 

            System.out.println("$$$$$$$$$$________" + count); 

            daylist.add(sdf.format(cal.getTime()));//include the 1st of the month
            for (int j=0;j<=(count - 2);) 
            { 
                cal.add(Calendar.DAY_OF_MONTH,+1); 
                j++; 
                daylist.add(sdf.format(cal.getTime()));
            } 
            m++; 
        } 
        return daylist;
    }
    //get yesterday's date
    public static String getyesterday(){
        Calendar   cal   =   Calendar.getInstance();
        cal.add(Calendar.DATE,   -1);
        String yesterday = new SimpleDateFormat( "yyyy-MM-dd").format(cal.getTime());
        return yesterday;
    }
}

This class rounds numbers to a given number of decimal places; stock prices, for example, are kept to two decimal places.

package util;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
import java.math.BigDecimal;
import java.text.DecimalFormat;

public class UumericalUtil {

    public static float FloatTO(float f, int number) {
        BigDecimal   b  =   new BigDecimal(f);  
        float   f1   =  b.setScale(number, BigDecimal.ROUND_HALF_UP).floatValue();  
        return f1;  
    }  
    public static String NumberTO(int number) {
        DecimalFormat dfInt=new DecimalFormat("00");
        String sInt = dfInt.format(number);
        System.out.println(sInt);
        return sInt;
    } 

}
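
Passing a `float` to `new BigDecimal(...)` carries the float's binary rounding error into the result. For percentage strings such as `2.49%`, one alternative is to go from the string straight to `BigDecimal`. The sketch below illustrates that idea only; `PercentParser` and `toFraction` are hypothetical names, and it returns a `BigDecimal` rather than the `float` used elsewhere in this article.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class PercentParser {
    // Converts a percentage string such as "2.49%" to a fraction (0.0249),
    // rounded half-up to the given number of decimal places.
    static BigDecimal toFraction(String percent, int scale) {
        BigDecimal value = new BigDecimal(percent.replace("%", "").trim());
        // Dividing by 100 is an exact decimal shift, so no float is involved.
        return value.movePointLeft(2).setScale(scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(toFraction("2.49%", 4));
        System.out.println(toFraction("-0.20%", 4));
    }
}
```

If the model keeps its `float` fields, the result can still be converted with `floatValue()` at the last step.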

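parse

The parse package parses the fetched content, using Jsoup or other tools for HTML, wraps the parsed records in a List, and returns them up to the main method. Here only the simplest approach, plain string splitting, is used. Below is the data of one page; this is the kind of data the parser targets: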

var quote_123={rank:["2,002662,京威股份,15.62,0.38,2.49%,2.95,10294,15948185,15.24,15.28,15.65,15.20,-,-,-,-,-,-,-,-,0.00%,0.62,0.17,33.47","2,002536,西泵股份,13.15,0.32,2.49%,3.74,26558,34710121,12.83,12.88,13.27,12.79,-,-,-,-,-,-,-,-,0.00%,0.99,0.87,41.09","1,600741,华域汽车,16.22,0.39,2.46%,2.59,215140,346480560,15.83,15.85,16.26,15.85,-,-,-,-,-,-,-,-,0.12%,1.23,0.75,8.59","1,601689,拓普集团,29.74,0.68,2.34%,3.20,36329,107964394,29.06,29.06,29.94,29.01,-,-,-,-,-,-,-,-,-0.20%,1.34,2.13,34.32","1,603306,华懋科技,33.87,0.74,2.23%,4.50,9251,31242113,33.13,33.14,34.20,32.71,-,-,-,-,-,-,-,-,-0.03%,0.72,1.25,29.60","1,601799,星宇股份,37.40,0.80,2.19%,3.80,5522,20477010,36.60,36.40,37.50,36.11,-,-,-,-,-,-,-,-,0.03%,0.86,0.23,28.43","1,603166,福达股份,14.02,0.29,2.11%,2.91,47265,66170428,13.73,13.80,14.14,13.74,-,-,-,-,-,-,-,-,0.21%,0.96,3.15,95.59","2,002190,成飞集成,32.44,0.66,2.08%,2.99,25213,81219488,31.78,31.63,32.58,31.63,-,-,-,-,-,-,-,-,0.03%,0.86,0.73,93.58","1,600213,亚星客车,14.77,0.30,2.07%,3.46,18878,27820060,14.47,14.52,14.88,14.38,-,-,-,-,-,-,-,-,-0.07%,0.64,0.86,55.39","2,300432,富临精工,21.28,0.43,2.06%,4.70,28707,60945368,20.85,20.60,21.58,20.60,-,-,-,-,-,-,-,-,-0.14%,1.29,2.07,50.58","2,300375,鹏翎股份,21.25,0.42,2.02%,3.94,11367,24164157,20.83,20.83,21.45,20.63,-,-,-,-,-,-,-,-,-0.14%,0.83,1.44,30.27","2,002363,隆基机械,11.47,0.22,1.96%,2.49,33946,38796837,11.25,11.27,11.55,11.27,-,-,-,-,-,-,-,-,0.00%,0.80,0.88,61.45","1,600469,风神股份,11.55,0.22,1.94%,3.09,38444,44305565,11.33,11.33,11.63,11.28,-,-,-,-,-,-,-,-,0.09%,0.67,0.68,27.07","2,002454,松芝股份,12.98,0.24,1.88%,2.83,27839,36056020,12.74,12.70,13.06,12.70,-,-,-,-,-,-,-,-,0.00%,1.17,0.87,25.84","2,002488,金固股份,14.79,0.27,1.86%,2.48,29002,42872475,14.52,14.52,14.88,14.52,-,-,-,-,-,-,-,-,0.00%,0.72,0.75,-","2,002284,亚太股份,13.18,0.24,1.85%,3.32,61756,81198133,12.94,12.87,13.30,12.87,-,-,-,-,-,-,-,-,0.30%,1.10,0.90,58.15","1,603788,宁波高发,35.97,0.64,1.81%,3.40,6719,24160418,35.33,35.21,36.33,35.13,-,-,-,-,-,-,-,-,0.03%,0.59,1.37,34.10","2,000957,中通客车,14.36,0.25,1.77%,2.69,59696,85581415,14.11,14.07,14.45,14.07,-,-,-,-,-,-,-,-,0.00%,0.79,1.25,13.99","2,300304,云意电气,52.12,0.90,1.76%,5.70,179330,922614032,51.22,50.38,52.83,49.91,-,-,-,-,-,-,-,-,-0.04%,1.12,9.35,108.58","2,002607,亚夏汽车,10.03,0.17,1.72%,4.16,27760,27878904,9.86,9.89,10.19,9.78,-,-,-,-,-,-,-,-,-0.30%,0.97,1.03,57.87"],pages:6}

package parse;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
import java.util.ArrayList;
import java.util.List;
import model.ExtMarketOilStockModel;
import util.HTTPUtils;
import util.TimeUtils;
import util.UumericalUtil;
public class ExtMarketOilStockParse {
    public static List<ExtMarketOilStockModel> parseurl(String url) throws Exception {
        List<ExtMarketOilStockModel> list=new ArrayList<ExtMarketOilStockModel>();
        String response=HTTPUtils.getRawHtml(url);
        String html = response.toString();
        String jsonarra=html.split("rank:")[1].split(",pages")[0];
        String stocks[]=jsonarra.split("\",");
        List<String> stocklist=new ArrayList<String>();
        for (int i = 0; i < stocks.length; i++) {
            stocklist.add(stocks[i].replace("[\"", "").replace("\"", "").replace("]", ""));
            System.out.println(stocks[i].replace("[\"", "").replace("\"", "").replace("]", ""));
        }
        for (int i = 0; i < stocklist.size(); i++) {
            String date=TimeUtils.GetNowDate("yyyy-MM-dd");
            String stock_id=stocklist.get(i).split(",")[1];
            String stock_name=stocklist.get(i).split(",")[2];
            float stock_price=0;
            float stock_change=0;
            float stock_range=0;
            float stock_amplitude=0;
            int stock_trading_number=0;
            int stock_trading_value=0;
            float stock_yesterdayfinish_price=0;
            float stock_todaystart_price=0;
            float stock_max_price=0;
            float stock_min_price=0;
            float stock_fiveminuate_change=0;
            if (!stocklist.get(i).split(",")[3].equals("-")) {
                //price
                stock_price=Float.parseFloat(stocklist.get(i).split(",")[3]);
                //price change
                stock_change=Float.parseFloat(stocklist.get(i).split(",")[4]);
                System.out.println(stock_change);
                //percentage change
                stock_range=UumericalUtil.FloatTO((float) (Float.parseFloat(stocklist.get(i).split(",")[5].replace("%", ""))*0.01),4);
                stock_amplitude=UumericalUtil.FloatTO((float) (Float.parseFloat(stocklist.get(i).split(",")[6].replace("%", ""))*0.01),4);
                stock_trading_number=Integer.parseInt(stocklist.get(i).split(",")[7].replace("%", ""));
                stock_trading_value=Integer.parseInt(stocklist.get(i).split(",")[8].replace("%", ""));
                stock_yesterdayfinish_price=Float.parseFloat(stocklist.get(i).split(",")[9]);
                stock_todaystart_price=Float.parseFloat(stocklist.get(i).split(",")[10]);
                stock_max_price=Float.parseFloat(stocklist.get(i).split(",")[11]);
                stock_min_price=Float.parseFloat(stocklist.get(i).split(",")[12]);
                stock_fiveminuate_change=UumericalUtil.FloatTO((float) (Float.parseFloat(stocklist.get(i).split(",")[21].replace("%", ""))*0.01),4);
                System.out.println(stock_fiveminuate_change);
            }
            String craw_time=TimeUtils.GetNowDate("yyyy-MM-dd HH:mm:ss");
            ExtMarketOilStockModel model=new ExtMarketOilStockModel();
            model.setDate(date);
            model.setStock_id(stock_id);
            model.setStock_name(stock_name);
            model.setStock_price(stock_price);
            model.setStock_change(stock_change);
            model.setStock_range(stock_range);
            model.setStock_amplitude(stock_amplitude);
            model.setStock_trading_number(stock_trading_number);
            model.setStock_trading_value(stock_trading_value);
            model.setStock_yesterdayfinish_price(stock_yesterdayfinish_price);
            model.setStock_todaystart_price(stock_todaystart_price);
            model.setStock_max_price(stock_max_price);
            model.setStock_min_price(stock_min_price);
            model.setStock_fiveminuate_change(stock_fiveminuate_change);
            model.setCraw_time(craw_time);
            list.add(model);
        }
        return list; 
    }
}
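
The parser above relies on chained `split` calls and repeats `split(",")` for every field. As an alternative illustration, the sketch below pulls the quoted rows and the `pages` count out of a body shaped like `var quote_123={rank:[...],pages:6}` with regular expressions and splits each row once. `RankResponse`, `rows` and `pages` are hypothetical names, and the sample string in `main` is a shortened row with a placeholder name, not real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RankResponse {
    private static final Pattern ROW = Pattern.compile("\"([^\"]*)\"");
    private static final Pattern PAGES = Pattern.compile("pages:(\\d+)");

    // Returns one String[] of comma-separated fields per quoted row in rank:[...].
    static List<String[]> rows(String body) {
        List<String[]> rows = new ArrayList<>();
        int start = body.indexOf("rank:[");
        if (start < 0) {
            return rows;
        }
        int end = body.indexOf("]", start);
        if (end < 0) {
            return rows;
        }
        Matcher m = ROW.matcher(body.substring(start, end));
        while (m.find()) {
            rows.add(m.group(1).split(","));
        }
        return rows;
    }

    // Returns the total page count reported by the server, or 0 if absent.
    static int pages(String body) {
        Matcher m = PAGES.matcher(body);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical sample in the same shape as the response above.
        String body = "var quote_123={rank:[\"2,002662,NAME,15.62,0.38,2.49%\"],pages:6}";
        for (String[] fields : rows(body)) {
            System.out.println(fields[1] + " price=" + fields[3]);
        }
        System.out.println("pages=" + pages(body));
    }
}
```

Reading `pages` from the response would also let the caller loop over the real number of pages instead of a hard-coded one.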

db

The db package contains two Java files, MyDataSource and MYSQLControl, whose roles were described above.

package db;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {

    public static DataSource getDataSource(String connectURI){

        BasicDataSource ds = new BasicDataSource();
         //MySQL JDBC driver
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUsername("root");              //MySQL user name
        ds.setPassword("112233");                //MySQL login password
        ds.setUrl(connectURI);

        return ds;

    }

}
package db;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.ResultSetHandler;
import org.apache.commons.dbutils.handlers.BeanListHandler;
import org.apache.commons.dbutils.handlers.ColumnListHandler;
import org.apache.commons.dbutils.handlers.ScalarHandler;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import model.ExtMarketOilStockModel;
/**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
public class MYSQLControl {
    static final Log logger = LogFactory.getLog(MYSQLControl.class);
    static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/datacollection");
    static QueryRunner qr = new QueryRunner(ds);
    //first kind of method: execute an update statement
    public static void executeUpdate(String sql){
        try {
            qr.update(sql);
        } catch (SQLException e) {
            logger.error(e);
        }
    }
    //query a single value by SQL
    public static Object getScalaBySQL ( String sql ){

        ResultSetHandler h = new ScalarHandler(1);
        Object obj = null;
        try {
            obj = qr.query(sql, h);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return obj;

    }
    //query multiple rows by SQL
    public static <T> List<T> getListInfoBySQL (String sql, Class<T> type ){
        List<T> list = null;
        try {
            list = qr.query(sql,new BeanListHandler<T>(type));
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return list;
    }
    //query a single column
    public static List getListOneBySQL (String sql,String id){
        List list=null;

        try {
            list = (List) qr.query(sql, new ColumnListHandler(id));
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return list;
    }
    //this way of writing to the database could be optimized
    public static int insertoilStocks ( List<ExtMarketOilStockModel> oilstocks ) {

        Object[][] params = new Object[oilstocks.size()][17];
        int c = 0;  //success number of update
        int[] sum;
        for ( int i = 0; i < oilstocks.size(); i++ ){
            params[i][0] = oilstocks.get(i).getDate();
            params[i][1] = oilstocks.get(i).getStock_id();
            params[i][2] = oilstocks.get(i).getStock_name();
            params[i][3] = oilstocks.get(i).getStock_price();
            params[i][4] = oilstocks.get(i).getStock_change();
            params[i][5] = oilstocks.get(i).getStock_range();
            params[i][6] = oilstocks.get(i).getStock_amplitude();
            params[i][7] = oilstocks.get(i).getStock_trading_number();
            params[i][8] = oilstocks.get(i).getStock_trading_value();
            params[i][9] = oilstocks.get(i).getStock_yesterdayfinish_price();
            params[i][10] = oilstocks.get(i).getStock_todaystart_price();
            params[i][11] = oilstocks.get(i).getStock_max_price();
            params[i][12] = oilstocks.get(i).getStock_min_price();
            params[i][13] = oilstocks.get(i).getStock_fiveminuate_change();
            params[i][14] = oilstocks.get(i).getCraw_time();
            params[i][15] = null;
            params[i][16] = null;
        }

        QueryRunner qr = new QueryRunner(ds);
        try {
            sum = qr.batch("INSERT INTO `datacollection`.`ext_market_oil_stock` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params);
            c = sum.length;  //number of rows submitted in the batch
        } catch (SQLException e) {
            System.out.println(e);
        }
        System.out.println("Petroleum data stored");

        return c;

    }
    //this way of writing to the database could be optimized
    public static int insertcarStocks ( List<ExtMarketOilStockModel> carstocks ) {

        int c = 0;  //success number of update
        int[] sum;
        Object[][] params1 = new Object[carstocks.size()][17];
        int c1 = 0; //success number of update
        for ( int i = 0; i < carstocks.size(); i++ ){
            params1[i][0] = carstocks.get(i).getDate();
            params1[i][1] = carstocks.get(i).getStock_id();
            params1[i][2] = carstocks.get(i).getStock_name();
            params1[i][3] = carstocks.get(i).getStock_price();
            params1[i][4] = carstocks.get(i).getStock_change();
            params1[i][5] = carstocks.get(i).getStock_range();
            params1[i][6] = carstocks.get(i).getStock_amplitude();
            params1[i][7] = carstocks.get(i).getStock_trading_number();
            params1[i][8] = carstocks.get(i).getStock_trading_value();
            params1[i][9] = carstocks.get(i).getStock_yesterdayfinish_price();
            params1[i][10] = carstocks.get(i).getStock_todaystart_price();
            params1[i][11] = carstocks.get(i).getStock_max_price();
            params1[i][12] = carstocks.get(i).getStock_min_price();
            params1[i][13] = carstocks.get(i).getStock_fiveminuate_change();
            params1[i][14] = carstocks.get(i).getCraw_time();
            params1[i][15] = null;
            params1[i][16] = null;
        }
        QueryRunner qr = new QueryRunner(ds);
        try {
        //target table and the data to insert
            sum = qr.batch("INSERT INTO `datacollection`.`ext_market_car_stock` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params1);
            c = sum.length;  //number of rows submitted in the batch

        } catch (SQLException e) {
            System.out.println(e);
        }
        System.out.println("Automobile data stored");

        return c;

    }

}
 
  

In principle the crawler is now complete, and running the main method is enough. The figure below shows part of the data fetched by the main method.

[Figure 5]

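Open problems

Problem 1: Stock data is published every day from Monday to Friday. How can the program crawl it automatically at a fixed time each day, instead of being run by hand every day?

Problem 2: The market does not open on holidays, yet the page still displays data, and that data carries no timestamp. How should this case be handled?

First, let me show you my database design.


[Figure 6]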

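Solutions

The first problem, running the program on a schedule, is solved with Quartz (http://blog.csdn.net/qy20115549/article/details/52723907).
The second problem, deciding whether the market was closed on a given day, is handled as follows: randomly draw three stocks from the most recent date stored in the database and compare their prices with the prices just crawled for the same stock ids. For example, if today is Saturday, January 21, take three stocks from January 20. If all three prices are unchanged, the day is judged to be a holiday: prices did not move, so the data does not need to be inserted into the database.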
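
The comparison rule can be isolated as a small pure function, which makes it easy to test without a database. This is a minimal sketch under assumptions: `MarketClosedCheck` and `looksClosed` are hypothetical names, prices are passed in as maps from stock id to price, and the ids and prices in `main` are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MarketClosedCheck {
    // Returns true only if every sampled stock has the same price in the
    // previously stored data and in the data just crawled.
    static boolean looksClosed(Map<String, Float> previous,
                               Map<String, Float> today,
                               List<String> sampleIds) {
        if (sampleIds.isEmpty()) {
            return false; // nothing to compare, so do not skip the insert
        }
        for (String id : sampleIds) {
            Float before = previous.get(id);
            Float now = today.get(id);
            if (before == null || now == null || Float.compare(before, now) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical ids and prices, for illustration only.
        Map<String, Float> previous = new HashMap<>();
        previous.put("A", 5.50f);
        previous.put("B", 8.12f);
        previous.put("C", 3.07f);
        Map<String, Float> today = new HashMap<>(previous);
        System.out.println(looksClosed(previous, today, Arrays.asList("A", "B", "C")));
    }
}
```

A stock missing from either side counts as "changed", so a partial crawl never causes the insert to be skipped.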

job

package job;

import java.util.ArrayList;
import java.util.List;
import org.quartz.Job; 
import org.quartz.JobExecutionContext; 
import org.quartz.JobExecutionException;
import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;
import timecontrol.TimeControl;
 /**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
public class ExtMarketOilStockJob implements Job { 

    @Override 
    public void execute(JobExecutionContext arg0) throws JobExecutionException {
        //fetch the stocks of the most recent stored date, used to decide whether today is a holiday
        List<ExtMarketOilStockModel> randomlist = MYSQLControl.getListInfoBySQL("select stock_id,stock_price,stock_change from ext_market_oil_stock where date = (select date from ext_market_oil_stock order by date desc limit 1) ",ExtMarketOilStockModel.class);
        //table update time

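        List<String> urloillist=new ArrayList<String>();
        List<String> urlcarlist=new ArrayList<String>();
        List<ExtMarketOilStockModel> oilstocks=new ArrayList<ExtMarketOilStockModel>();
        List<ExtMarketOilStockModel> carstocks=new ArrayList<ExtMarketOilStockModel>();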
        String url1="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
        String url2="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
        urloillist.add(url1);
        urloillist.add(url2);
        int judge=0;
        for (int i = 0; i < urloillist.size(); i++) {
            try {
                oilstocks=ExtMarketOilStockParse.parseurl(urloillist.get(i));
            } catch (Exception e) {
                e.printStackTrace();
            }

            for (int j = 0; j < oilstocks.size(); j++) {
                String stock_id=oilstocks.get(j).getStock_id();
                float stock_price=oilstocks.get(j).getStock_price();
                if (stock_id.equals(randomlist.get(0).getStock_id())) {
                    if (stock_price==randomlist.get(0).getStock_price()) {
                        judge++;
                    }
                }
            }
            for (int j = 0; j < oilstocks.size(); j++) {
                String stock_id=oilstocks.get(j).getStock_id();
                float stock_price=oilstocks.get(j).getStock_price();
                if (stock_id.equals(randomlist.get(1).getStock_id())) {
                    if (stock_price==randomlist.get(1).getStock_price()) {
                        judge++;
                    }
                }
            }
            for (int j = 0; j < oilstocks.size(); j++) {
                String stock_id=oilstocks.get(j).getStock_id();
                float stock_price=oilstocks.get(j).getStock_price();
                if (stock_id.equals(randomlist.get(2).getStock_id())) {
                    if (stock_price==randomlist.get(2).getStock_price()) {
                        judge++;
                    }
                }
            }
            if (judge!=3) {
                MYSQLControl.insertoilStocks(oilstocks);
            }
        }
        if (judge!=3) {
            for (int i = 1; i <= 6; i++) {
                String urli="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page="+i+"&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
                urlcarlist.add(urli);
            }
            for (int i = 0; i < urlcarlist.size(); i++) {
                try {
                    carstocks=ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
                } catch (Exception e) {
                    e.printStackTrace();
                }
                MYSQLControl.insertcarStocks(carstocks);
            }
        }

    } 

} 

jobmain

As shown below, the job is scheduled to run at 20:39 (8:39 p.m.) every Monday to Friday, so the data is crawled every trading day.

package jobmain;
import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;
import job.ExtMarketOilStockJob;
 /**
 * @author: Qian Yang, School of Management, Hefei University of Technology
 * @email:[email protected]
 * @ 
 */
public class ExtMarketOilStockJobMain {

    public void go() throws Exception { 
        // First, obtain a reference to a Scheduler 
        SchedulerFactory sf = new StdSchedulerFactory(); 
        Scheduler sched = sf.getScheduler(); 
        // Jobs can be scheduled before sched.start() is called 
        JobDetail job = newJob(ExtMarketOilStockJob.class).withIdentity("stockjob", "stockgroup").build(); 
        // Run the job at 20:39 every Monday to Friday
        CronTrigger trigger = newTrigger().withIdentity("stocktrigger", "stockgroup").withSchedule(cronSchedule("0 39 20 ? * MON-FRI")).build(); 
        Date ft = sched.scheduleJob(job, trigger); 
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS"); 
        System.out.println(job.getKey() + " has been scheduled to run at: " + sdf.format(ft) + ", and will repeat according to the cron expression: " + trigger.getCronExpression()); 
        sched.start(); 
    } 
    public static void main(String[] args) throws Exception { 
        ExtMarketOilStockJobMain maingo = new ExtMarketOilStockJobMain(); 
        maingo.go(); 
    } 

}

Running the class in jobmain makes the crawler fetch the data at a fixed time every day.
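
The Quartz cron expression `0 39 20 ? * MON-FRI` reads, field by field: second 0, minute 39, hour 20, no specific day of month, every month, Monday to Friday. To make that rule concrete, the sketch below computes the next firing time with plain `java.time`. It only illustrates what the trigger evaluates; it is not Quartz code, and `NextWeekdayRun` and `next` are hypothetical names.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

class NextWeekdayRun {
    // Returns the first Monday-to-Friday occurrence of the given time of day
    // that is strictly after "now".
    static LocalDateTime next(LocalDateTime now, LocalTime at) {
        LocalDateTime candidate = now.toLocalDate().atTime(at);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's slot has already passed
        }
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1); // skip the weekend
        }
        return candidate;
    }

    public static void main(String[] args) {
        System.out.println(next(LocalDateTime.now(), LocalTime.of(20, 39)));
    }
}
```

Called on a Friday evening after 20:39, the function returns the following Monday at 20:39, which matches the weekday-only schedule above. Public holidays that fall on weekdays are not covered by the schedule and are handled by the price comparison in the job.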
