Web Crawler: Eastmoney Stock Sector Data

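This is an original blog post, intended for technical study only. Do not copy it and upload it to platforms such as Baidu Wenku without permission. If you repost it, please cite the address (link) of this post.
If you need the source code or the jar package, contact: [email protected]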

  • Data to crawl
  • Packet capture
  • Framework
  • model
  • Creating the table
  • Main method
  • util
  • parse
  • db
  • job and jobmain

Data to crawl

This project crawls the stock sector data from Eastmoney (东方财富网).
The link is http://quote.eastmoney.com/center/BKList.html#trade_0_0?sortRule=0


Packet capture

For the details of packet capture, see my earlier blog post.
The link is http://blog.csdn.net/qq_22499377/article/details/78114734


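Framework

The project structure used in this post is shown in the figure below:
[Figure: project package structure]
db: holds the database files, MyDataSource and MYSQLControl.
model: encapsulates the objects, that is, the attributes of the records being handled.
parse: parses the content fetched by util, usually with Jsoup.
main: the program entry point; it fetches the data, runs the SQL statements, and stores the data.
job: the job tasks to be executed.
jobmain: the controller that sets when a job runs. The stock market closes at 3 p.m. each day, so the job is scheduled to start crawling the stock data at some point after 3 p.m.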


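model

The model encapsulates the data I want to crawl, such as the current date, the sector id, the sector name, the sector price, and so on, as in the following program: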

package model;

// Model object; the crawled data contains the following fields
public class ExtMarketPlateModel {
    private String date;
    private String plate_rank;
    private String plate_id;
    private String plate_name;
    private float plate_price;
    private float plate_change;
    private float plate_range;
    private String plate_market_value;
    private float plate_turnover_rate;
    private String craw_time;
    public String getDate() {
        return date;
    }
    public void setDate(String date) {
        this.date = date;
    }
    public String getPlate_rank() {
        return plate_rank;
    }
    public void setPlate_rank(String plate_rank) {
        this.plate_rank = plate_rank;
    }
    public String getPlate_id() {
        return plate_id;
    }
    public void setPlate_id(String plate_id) {
        this.plate_id = plate_id;
    }
    public String getPlate_name() {
        return plate_name;
    }
    public void setPlate_name(String plate_name) {
        this.plate_name = plate_name;
    }
    public float getPlate_price() {
        return plate_price;
    }
    public void setPlate_price(float plate_price) {
        this.plate_price = plate_price;
    }
    public float getPlate_change() {
        return plate_change;
    }
    public void setPlate_change(float plate_change) {
        this.plate_change = plate_change;
    }
    public float getPlate_range() {
        return plate_range;
    }
    public void setPlate_range(float plate_range) {
        this.plate_range = plate_range;
    }
    public String getPlate_market_value() {
        return plate_market_value;
    }
    public void setPlate_market_value(String plate_market_value) {
        this.plate_market_value = plate_market_value;
    }
    public float getPlate_turnover_rate() {
        return plate_turnover_rate;
    }
    public void setPlate_turnover_rate(float plate_turnover_rate) {
        this.plate_turnover_rate = plate_turnover_rate;
    }
    public String getCraw_time() {
        return craw_time;
    }
    public void setCraw_time(String craw_time) {
        this.craw_time = craw_time;
    }

}

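Creating the table

Before writing the program, create the table according to the model's attributes. When creating it, be sure to document what each column really means, so that whoever takes over later can do so easily.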

CREATE TABLE `ext_market_plate` (
  `date` date NOT NULL COMMENT 'current date',
  `plate_rank` char(20) NOT NULL COMMENT 'sector rank',
  `plate_id` char(20) NOT NULL COMMENT 'sector code',
  `plate_name` char(50) DEFAULT NULL COMMENT 'sector name',
  `plate_price` float(10,2) DEFAULT NULL COMMENT 'latest sector price',
  `plate_change` float(10,2) DEFAULT NULL COMMENT 'change amount',
  `plate_range` float(10,4) DEFAULT NULL COMMENT 'change percentage',
  `plate_market_value` char(50) DEFAULT NULL COMMENT 'total market value',
  `plate_turnover_rate` float(10,4) DEFAULT NULL COMMENT 'turnover rate',
  `craw_time` datetime DEFAULT NULL COMMENT 'crawl time',
  `update_time` datetime DEFAULT NULL COMMENT 'update time',
  `extract_time` datetime DEFAULT NULL COMMENT 'extraction time',
  PRIMARY KEY (`date`,`plate_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Main method

The main class is as follows:

package navi.main;

import java.util.List;

import db.MYSQLControl;
import model.ExtMarketPlateModel;
import parse.ExtMarketPlateParse;

// This program collects the stock sector data
public class ExtMarketPlateMain {

    public static void main(String[] args) throws Exception {

        // URL of the sector data interface
        String url = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C._BKHY&sty=FPGBKI&st=c&sr=-1&p=1&ps=5000&cb=&js=var%20BKCache=[(x)]&token=7bc05d0d4c3c22ef9fca8c2a912d779c&v=0.3196612374630905";
        // Fetch and parse the data, then write it to the database
        List<ExtMarketPlateModel> plate = ExtMarketPlateParse.parseurl(url);
        MYSQLControl.insertoilStocks(plate);
    }
}

util

There are three files here: HTTPUtils, TimeUtils, and UumericalUtil.

The HTTPUtils program is as follows:

package util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Uses URLConnection to fetch the response body (an HTML or JSON document)
public abstract class HTTPUtils {
    public static String getRawHtml(String personalUrl) throws InterruptedException, IOException {
        URL url = new URL(personalUrl);
        URLConnection conn = url.openConnection();
        InputStream in = null;
        try {
            conn.setConnectTimeout(3000);
            in = conn.getInputStream();
        } catch (IOException e) {
            // Report the failure instead of swallowing it; an empty string is returned below
            e.printStackTrace();
        }
        // Convert the fetched data to a String
        return convertStreamToString(in);
    }

    // Converts an InputStream to a String (line breaks are dropped)
    public static String convertStreamToString(InputStream is) throws IOException {
        if (is == null)
            return "";
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }
}

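TimeUtils mainly handles date conversions, such as converting a String to a Date or getting the current time. It can be reused directly wherever else it is needed.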

package util;

import java.text.DateFormat;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;

public class TimeUtils {

    public static void main( String[] args ) throws ParseException{

        String time = getMonth("2002-1-08 14:50:38");
        System.out.println(time);
        System.out.println(getDay("2002-1-08 14:50:38"));
        System.out.println(TimeUtils.parseTime("2016-05-19 19:17","yyyy-MM-dd HH:mm"));

    }
    //get current time
    public static String GetNowDate(String formate){  
        String temp_str="";  
        Date dt = new Date();  
        SimpleDateFormat sdf = new SimpleDateFormat(formate);  
        temp_str=sdf.format(dt);  
        return temp_str;  
    }  
    public static String getMonth( String time ){

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM");
        try {
            // parse() reads the leading "yyyy-MM" part and ignores the rest
            Date date = sdf.parse(time);
            return sdf.format(date);
        } catch (ParseException e) {
            e.printStackTrace();
            return null; // unparseable input
        }

    }

    public static String getDay( String time ){

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        try {
            // parse() reads the leading "yyyy-MM-dd" part and ignores the rest
            Date date = sdf.parse(time);
            return sdf.format(date);
        } catch (ParseException e) {
            e.printStackTrace();
            return null; // unparseable input
        }

    }

    public static Date parseTime(String inputTime) throws ParseException{

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");  
        Date date = sdf.parse(inputTime); 

        return date;

    }
    public static String dateToString(Date date, String type) { 
        DateFormat df = new SimpleDateFormat(type);  
        return df.format(date);  
    }
    public static Date parseTime(String inputTime, String timeFormat) throws ParseException{

        SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);  
        Date date = sdf.parse(inputTime); 

        return date;

    }

    public static Calendar parseTimeToCal(String inputTime, String timeFormat) throws ParseException{

        SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);  
        Date date = sdf.parse(inputTime); 
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);

        return calendar;

    }

    public static int getDaysBetweenCals(Calendar cal1, Calendar cal2) throws ParseException{

        return (int) ((cal2.getTimeInMillis()-cal1.getTimeInMillis())/(1000*24*3600));

    }

    public static Date parseTime(long inputTime){

        //  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date= new Date(inputTime);
        return date;

    }

    public static String parseTimeString(long inputTime){

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date= new Date(inputTime);
        return sdf.format(date);

    }
    public static String parseStringTime(String inputTime){

        String date=null;
        try {
            Date date1 = new SimpleDateFormat("yyyyMMddHHmmss").parse(inputTime);
            date=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date1);
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        return date;
    }
    // Returns the twelve months of a year as "yyyyMM" strings
    public static List<String> YearMonth(int year) {
        List<String> yearmouthlist=new ArrayList<String>();
        for (int i = 1; i < 13; i++) {
            DecimalFormat dfInt=new DecimalFormat("00");
            String sInt = dfInt.format(i);
            yearmouthlist.add(year+sInt);
        }

        return yearmouthlist;
    } 
    // Returns every month from startyear to finistyear as "yyyy-MM" strings
    public static List<String> YearMonth(int startyear,int finistyear) {
        List<String> yearmouthlist=new ArrayList<String>();
        for (int i = startyear; i < finistyear+1; i++) {
            for (int j = 1; j < 13; j++) {
                DecimalFormat dfInt=new DecimalFormat("00");
                String sInt = dfInt.format(j);
                yearmouthlist.add(i +"-"+sInt);
            }
        }
        return yearmouthlist;
    } 
    // Returns every day of the given year as "yyyy-MM-dd" strings
    public static List<String> TOAllDay(int year){
        List<String> daylist=new ArrayList<String>();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd"); 
        for (int month = 1; month < 13; month++) { // month counter
            Calendar cal=Calendar.getInstance(); // get a calendar object
            cal.clear(); // clear all fields
            cal.set(Calendar.YEAR,year); 
            cal.set(Calendar.MONTH,month-1); // Calendar months start at 0
            cal.set(Calendar.DAY_OF_MONTH,1); // start at the first day of the month

            int count=cal.getActualMaximum(Calendar.DAY_OF_MONTH); 

            for (int j=0;j<count;j++) 
            { 
                // Add the current day first so that the 1st is not skipped
                daylist.add(sdf.format(cal.getTime()));
                cal.add(Calendar.DAY_OF_MONTH,1); 
            } 
        } 
        return daylist;
    }
    // Returns yesterday's date
    public static String getyesterday(){
        Calendar   cal   =   Calendar.getInstance();
        cal.add(Calendar.DATE,   -1);
        // No trailing space in the pattern, so the result can be used directly in SQL
        String yesterday = new SimpleDateFormat( "yyyy-MM-dd").format(cal.getTime());
        return yesterday;
    }
}
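
As a side note, on Java 8 or later the same kind of date handling can be written with java.time, which avoids the mutable Calendar and the non-thread-safe SimpleDateFormat. The following is only a minimal sketch under that assumption; DateSketch, dayBefore, and allDaysOf are hypothetical names and are not part of the original project.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project
final class DateSketch {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Formats the day before the given date as "yyyy-MM-dd"
    static String dayBefore(LocalDate date) {
        return date.minusDays(1).format(DAY);
    }

    // Lists every day of the given year, including the 1st of each month
    static List<String> allDaysOf(int year) {
        List<String> days = new ArrayList<>();
        for (LocalDate d = LocalDate.of(year, 1, 1); d.getYear() == year; d = d.plusDays(1)) {
            days.add(d.format(DAY));
        }
        return days;
    }

    public static void main(String[] args) {
        System.out.println(dayBefore(LocalDate.now()));
        System.out.println(allDaysOf(2017).size());
    }
}
```

Passing the date in as a parameter, instead of reading the clock inside the method, makes the helper easy to test.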

UumericalUtil: stock prices need to be kept to a fixed number of decimal places, and this class does that rounding.

package util;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;

public class UumericalUtil {

    // Rounds f to the given number of decimal places (half up)
    public static float FloatTO(float f, int number) {
        // Float.toString gives the short decimal form; new BigDecimal(f) would use the exact binary value
        BigDecimal b = new BigDecimal(Float.toString(f));
        return b.setScale(number, RoundingMode.HALF_UP).floatValue();
    }
    // Formats an integer with at least two digits, e.g. 7 becomes "07"
    public static String NumberTO(int number) {
        DecimalFormat dfInt=new DecimalFormat("00");
        String sInt = dfInt.format(number);
        System.out.println(sInt);
        return sInt;
    }
}
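
The parser later turns percentage strings such as "1.25%" into fractions with four decimal places. As a minimal sketch of that conversion, the version below stays in BigDecimal from the string onward, so no float rounding is involved. PercentSketch and percentToFraction are hypothetical names used for illustration only; the original project uses FloatTO for this step.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper, not part of the original project
final class PercentSketch {

    // Converts a string such as "1.25%" to a fraction with the given scale
    static BigDecimal percentToFraction(String percent, int scale) {
        BigDecimal value = new BigDecimal(percent.replace("%", "").trim());
        return value.divide(BigDecimal.valueOf(100), scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(percentToFraction("1.25%", 4));   // prints 0.0125
        System.out.println(percentToFraction("-0.376%", 4)); // prints -0.0038
    }
}
```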

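parse

parse mainly uses Jsoup or other tools to parse the fetched HTML document, wraps the parsed data in a List, and passes it back up to the main method. Here the simplest approach, plain string parsing, is used.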

package parse;

import java.util.ArrayList;
import java.util.List;
import model.ExtMarketPlateModel;
import util.HTTPUtils;
import util.TimeUtils;
import util.UumericalUtil;
public class ExtMarketPlateParse {
    public static List<ExtMarketPlateModel> parseurl(String url) throws Exception {
        List<ExtMarketPlateModel> list=new ArrayList<ExtMarketPlateModel>();
        // Fetch the response for the given URL
        String html=HTTPUtils.getRawHtml(url);
        // Parse the response and store the records in a list
        String jsonarra=html.split("BKCache=")[1];
        String plates[]=jsonarra.split("\",");
        List<String> platelist=new ArrayList<String>();
        for (int i = 0; i < plates.length; i++) {
            platelist.add(plates[i].replace("[\"", "").replace("\"", "").replace("]", ""));
        }
        for (int i = 0; i < platelist.size(); i++) {
            // Split each record once and reuse the fields
            String[] fields=platelist.get(i).split(",");
            String date=TimeUtils.GetNowDate("yyyy-MM-dd");
            String plate_rank=Integer.toString(i+1);
            String plate_id=fields[1];
            String plate_name=fields[2];
            float plate_price=0;
            float plate_change=0;
            float plate_range=0;
            String plate_market_value=null;
            float plate_turnover_rate=0;

            // "-" means the sector has no quote, so the defaults above are kept
            if (!fields[3].equals("-")) {
                // price
                plate_price=Float.parseFloat(fields[18]);
                // change amount
                plate_change=Float.parseFloat(fields[19]);
                // change percentage, stored as a fraction
                plate_range=UumericalUtil.FloatTO((float) (Float.parseFloat(fields[3].replace("%", ""))*0.01),4);
                plate_market_value=fields[4];
                System.out.println(plate_market_value);
                // turnover rate, stored as a fraction
                plate_turnover_rate=UumericalUtil.FloatTO((float) (Float.parseFloat(fields[5].replace("%", ""))*0.01),4);
            }
            String craw_time=TimeUtils.GetNowDate("yyyy-MM-dd HH:mm:ss");
            ExtMarketPlateModel model=new ExtMarketPlateModel();
            model.setDate(date);
            model.setPlate_rank(plate_rank);
            model.setPlate_id(plate_id);
            model.setPlate_name(plate_name);
            model.setPlate_price(plate_price);
            model.setPlate_change(plate_change);
            model.setPlate_range(plate_range);
            model.setPlate_market_value(plate_market_value);
            model.setPlate_turnover_rate(plate_turnover_rate);
            model.setCraw_time(craw_time);
            list.add(model);
        }
        // Return the list
        return list; 
    }
}
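
To make the string splitting easier to follow, here is a minimal, self-contained sketch that runs the same idea on a made-up payload. BkCacheSplitSketch and splitRecords are hypothetical names, and the sample string is invented for illustration; it only mimics the "var BKCache=[...]" shape that the code above expects and has fewer fields than a real response.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project
final class BkCacheSplitSketch {

    // Splits a "var BKCache=[...]" payload into one String[] of fields per sector
    static List<String[]> splitRecords(String payload) {
        List<String[]> records = new ArrayList<>();
        int eq = payload.indexOf('=');
        if (eq < 0) {
            return records; // unexpected payload, nothing to parse
        }
        String body = payload.substring(eq + 1).trim();
        // Each record is a quoted string; records are separated by a quote followed by a comma
        for (String raw : body.split("\",")) {
            String cleaned = raw.replace("[", "").replace("]", "").replace("\"", "").trim();
            if (!cleaned.isEmpty()) {
                records.add(cleaned.split(","));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample data, shaped like the response but much shorter
        String sample = "var BKCache=[\"1,BK0001,SectorA,1.25%,1000,0.50%\",\"1,BK0002,SectorB,-,-,-\"]";
        for (String[] fields : splitRecords(sample)) {
            System.out.println(fields[1] + " " + fields[2] + " " + fields[3]);
        }
    }
}
```

Returning an empty list for an unexpected payload, rather than indexing into the split result directly, keeps a failed download from turning into an ArrayIndexOutOfBoundsException.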

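db

The db package contains two Java files: MyDataSource and MYSQLControl. MyDataSource registers the database driver and sets the user name and password for the connection;
MYSQLControl connects to the database and performs the insert, update, and create-table operations.

The MyDataSource program is as follows: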

package db;

import javax.sql.DataSource;

import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {

    public static DataSource getDataSource(String connectURI){

        BasicDataSource ds = new BasicDataSource();
         // MySQL JDBC driver
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUsername("root");           // MySQL user name
        ds.setPassword("123456");         // MySQL login password
        ds.setUrl(connectURI);  
        return ds;

    }
}

The MYSQLControl program is as follows:

package db;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.handlers.BeanListHandler;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import model.ExtMarketPlateModel;


public class MYSQLControl {
    static final Log logger = LogFactory.getLog(MYSQLControl.class);
    // Database address and the database to use
    static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/gupiaobankuai?useUnicode=true&characterEncoding=UTF-8");
    static QueryRunner qr = new QueryRunner(ds);
    // First kind of method: run an arbitrary update statement
    public static void executeUpdate(String sql){
        try {
            qr.update(sql);
        } catch (SQLException e) {
            logger.error(e);
        }
    }
    // Called by ExtMarketPlateJob but not shown in the original post; this is an
    // assumed minimal version based on the BeanListHandler import
    public static <T> List<T> getListInfoBySQL(String sql, Class<T> type){
        try {
            return qr.query(sql, new BeanListHandler<T>(type));
        } catch (SQLException e) {
            logger.error(e);
            return new ArrayList<T>();
        }
    }
    // This way of writing to the database still needs optimizing
    public static int insertoilStocks ( List<ExtMarketPlateModel> plate ) {

        Object[][] params = new Object[plate.size()][12];
        int c = 0;  // number of successfully inserted rows
        for ( int i = 0; i < plate.size(); i++ ){
            ExtMarketPlateModel m = plate.get(i);
            params[i][0] = m.getDate();
            params[i][1] = m.getPlate_rank();
            params[i][2] = m.getPlate_id();
            params[i][3] = m.getPlate_name();
            params[i][4] = m.getPlate_price();
            params[i][5] = m.getPlate_change();
            params[i][6] = m.getPlate_range();
            params[i][7] = m.getPlate_market_value();
            params[i][8] = m.getPlate_turnover_rate();
            params[i][9] = m.getCraw_time();
            params[i][10] = null;  // update_time
            params[i][11] = null;  // extract_time
        }

        try {
            int[] sum = qr.batch("INSERT INTO `gupiaobankuai`.`ext_market_plate` VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", params);
            // Count the statements that the driver reports as successful
            for (int n : sum) {
                if (n > 0 || n == Statement.SUCCESS_NO_INFO) {
                    c++;
                }
            }
        } catch (SQLException e) {
            System.out.println(e);
        }
        System.out.println("Sector data written to the database: " + c + " rows");
        return c;
    }
}
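
The INSERT above relies on the positional order of the twelve table columns, so it breaks silently if a column is added or reordered. One possible alternative, shown here only as a sketch, is to build the statement with an explicit column list. InsertSqlSketch and buildInsert are hypothetical names that do not appear in the original project.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original project
final class InsertSqlSketch {

    // Builds "INSERT INTO `table` (`a`, `b`) VALUES (?, ?)" for the given columns
    static String buildInsert(String table, List<String> columns) {
        String cols = "`" + String.join("`, `", columns) + "`";
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO `" + table + "` (" + cols + ") VALUES (" + marks + ")";
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList(
                "date", "plate_rank", "plate_id", "plate_name", "plate_price",
                "plate_change", "plate_range", "plate_market_value",
                "plate_turnover_rate", "craw_time");
        System.out.println(buildInsert("ext_market_plate", columns));
    }
}
```

With this form, update_time and extract_time can simply be left out, since both columns default to NULL in the table definition.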

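job and jobmain

Stock data is a little special because it only needs to be crawled from Monday to Friday, so a scheduled task is used to crawl it automatically. Another key point is that there are days when the market does not open. On those days the page still shows data, but the page carries no date label. This was anticipated when the table was designed, which is why the crawl date and the sector id form the composite primary key. The job itself compares three randomly chosen rows from yesterday with the freshly crawled data and skips the insert when all three prices are unchanged.

The job program is as follows: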

package job;

import java.util.ArrayList;
import java.util.List;
import org.quartz.Job; 
import org.quartz.JobExecutionContext; 
import org.quartz.JobExecutionException;
import db.MYSQLControl;
import model.ExtMarketPlateModel;
import parse.ExtMarketPlateParse;
import util.TimeUtils;

public class ExtMarketPlateJob implements Job { 

    @Override 
    public void execute(JobExecutionContext arg0) throws JobExecutionException {
        // Holiday check: sample three of yesterday's rows for comparison
        String yesterday=TimeUtils.getyesterday();
        List<ExtMarketPlateModel> randomlist = MYSQLControl.getListInfoBySQL("select plate_id,plate_price,plate_change from ext_market_plate where date='"+yesterday+"' ORDER BY rand() LIMIT 3",ExtMarketPlateModel.class);

        List<ExtMarketPlateModel> plate=new ArrayList<ExtMarketPlateModel>();
        String url="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C._BKHY&sty=FPGBKI&st=c&sr=-1&p=1&ps=5000&cb=&js=var%20BKCache=[(x)]&token=7bc05d0d4c3c22ef9fca8c2a912d779c&v=0.3196612374630905";

        int judge=0;

        try {
            plate=ExtMarketPlateParse.parseurl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Count how many sampled sectors still have yesterday's price
        for (int j = 0; j < plate.size(); j++) {
            String plate_id=plate.get(j).getPlate_id();
            float plate_price=plate.get(j).getPlate_price();
            for (ExtMarketPlateModel sample : randomlist) {
                if (plate_id.equals(sample.getPlate_id()) && plate_price==sample.getPlate_price()) {
                    judge++;
                }
            }
        }
        // Insert unless all three samples are unchanged, which indicates a non-trading day.
        // With fewer than three samples (for example on Mondays) the data is always inserted.
        if (judge!=3) {
            MYSQLControl.insertoilStocks(plate);
        }
    } 
} 
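
The non-trading-day check in the job can also be isolated as a pure function, which makes it testable without a database. The following is only a sketch of that idea: StaleDataCheckSketch and looksStale are hypothetical names, and the maps stand in for the sampled rows and the freshly crawled prices.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original project
final class StaleDataCheckSketch {

    // Returns true when every sampled sector has the same price today as in the stored sample
    static boolean looksStale(Map<String, Float> storedSample, Map<String, Float> todayPrices) {
        if (storedSample.isEmpty()) {
            return false; // nothing to compare against, so treat the data as fresh
        }
        for (Map.Entry<String, Float> e : storedSample.entrySet()) {
            Float today = todayPrices.get(e.getKey());
            if (today == null || Float.compare(today, e.getValue()) != 0) {
                return false; // at least one price moved or is missing
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample values for illustration
        Map<String, Float> stored = new HashMap<>();
        stored.put("BK0001", 10.5f);
        stored.put("BK0002", 20.25f);
        Map<String, Float> today = new HashMap<>(stored);
        System.out.println(looksStale(stored, today)); // prints true
        today.put("BK0002", 20.5f);
        System.out.println(looksStale(stored, today)); // prints false
    }
}
```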

The jobmain program is as follows:

package jobmain;

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

import java.text.SimpleDateFormat;
import java.util.Date;

import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;

import job.ExtMarketPlateJob;

// Scheduled task: crawl the stock data at 3:30 p.m. every Monday to Friday
public class ExtMarketPlateJobMain {

    public void go() throws Exception { 
        // First, obtain a reference to a Scheduler 
        SchedulerFactory sf = new StdSchedulerFactory(); 
        Scheduler sched = sf.getScheduler(); 
        // Jobs can be scheduled before sched.start() is called 
        JobDetail job = newJob(ExtMarketPlateJob.class).withIdentity("platejob", "plategroup").build(); 
        // Fire at 15:30 every Monday to Friday
        CronTrigger trigger = newTrigger().withIdentity("platetrigger", "plategroup").withSchedule(cronSchedule("0 30 15 ? * MON-FRI")).build(); 
        Date ft = sched.scheduleJob(job, trigger); 
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS"); 
        System.out.println(job.getKey() + " has been scheduled to run at: " + sdf.format(ft) + " and will repeat based on the expression: " + trigger.getCronExpression()); 
        sched.start(); 
    } 
    public static void main(String[] args) throws Exception { 
        ExtMarketPlateJobMain maingo = new ExtMarketPlateJobMain(); 
        maingo.go(); 
    } 
}
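
To illustrate what the cron expression "0 30 15 ? * MON-FRI" means, the sketch below computes the next 15:30 on a weekday with plain java.time. It is not how Quartz works internally, it ignores time zones and public holidays, and NextRunSketch and nextRun are hypothetical names.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper, not part of the original project
final class NextRunSketch {

    // Returns the next Monday-to-Friday 15:30 strictly after the given time
    static LocalDateTime nextRun(LocalDateTime now) {
        LocalDateTime candidate = now.toLocalDate().atTime(LocalTime.of(15, 30));
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's slot has already passed
        }
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1); // skip the weekend
        }
        return candidate;
    }

    public static void main(String[] args) {
        System.out.println(nextRun(LocalDateTime.now()));
    }
}
```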

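A friendly reminder: this project requires importing the corresponding jar packages, or writing a pom.xml so that they are downloaded automatically.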
