Web Scraping Basics (8)

  • Contents
  1. Structure design
  2. Program flow design
  3. Implementation
  4. Debugging
  • Structure design:
  1. Fetch the stock list
  2. Query the previous day's details for each stock in the list
  3. Write the results to a file
  • Program flow design
    (figure: program flow diagram)
  • Implementation
# -*- coding: utf-8 -*-
'''
# Goal: fetch the Shanghai/Shenzhen A-share list, query each stock's
# previous-day details, and write the results to a file
# Structure design:
1. Fetch the stock list
2. Query the previous day's details for each stock in the list
3. Write the results to a file
'''
from bs4 import BeautifulSoup
import pandas as pd
import requests, bs4, re

class spider(object):
    '''
    Description: fetch the stock list, then query further details for each stock
    Param:
    Return: ls - per-stock info list: stock code, stock name, closing price,
                 daily change, daily amplitude, high, open, P/E ratio,
                 circulating market cap, low, previous close, amplitude,
                 total shares, volume, turnover rate, turnover, total market cap
    '''

    def __init__(self):
        self.__user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
        self.lsUrl = 'http://www.bestopview.com/stocklist.html'
        self.urlQuery = 'https://www.laohu8.com/hq/s/'
        self.ls = []  

    def getHtmlText(self, url):
        try:
            req = requests.get(url, headers=self.__user_agent)
            req.raise_for_status()
            req.encoding = req.apparent_encoding
            return req.text
        except Exception as e:
            print('getHtmlText error: {}'.format(e))

    def parserLsHtml(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for ul in soup.find_all('div', class_='result'):
            if isinstance(ul, bs4.element.Tag):
                lis = ul('li')
                self.ls += [i.string for i in lis]
        # reshape self.ls into [stock code, company name] pairs
        self.ls = list(map(lambda x: x.split('('), self.ls))
        self.ls = [list(map(lambda x: re.sub(r'[\)]', '', x), i)) for i in self.ls]
        self.ls = [[i[1], i[0]] for i in self.ls]

    def parserQueryHtml(self, html, n):
        try:
            soup = BeautifulSoup(html, 'html.parser')
            # The quote block's class is 'current-quote decrease' or
            # 'current-quote increase' depending on price direction;
            # class_='current-quote' matches either variant.
            quote = soup.find('div', class_='current-quote')
            if quote is not None:
                for i in quote:
                    if isinstance(i, bs4.element.Tag):
                        for j in i:  # closing price, daily change, daily amplitude
                            if j != '\n':
                                self.ls[n].append(j.string)

            for tr in soup.find('table', class_='detail').children:
                if isinstance(tr, bs4.element.Tag):
                    for span in tr('td'):
                        for j in span:  # high, open, P/E ratio, circulating cap...
                            if isinstance(j, bs4.element.NavigableString):
                                self.ls[n].append(j)

        except Exception as e:
            print('{}Query error: {}'.format(html[-6:], e))

    def write_to(self):
        def uniteCnt(ls):
            '''Pad rows of delisted/unqueryable stocks to a uniform length'''
            cnt_max = max(map(len, ls))
            for row in filter(lambda y: len(y) < cnt_max, ls):
                row += [''] * (cnt_max - len(row))
            return ls
        # Truncated in the original; a minimal completion (the column names
        # and the output filename are illustrative):
        columns = ['code', 'name', 'close', 'change', 'amplitude', 'high',
                   'open', 'P/E', 'circulating cap', 'low', 'prev close', 'amp',
                   'total shares', 'volume', 'turnover rate', 'turnover',
                   'total cap']  # 17 columns
        df = pd.DataFrame(uniteCnt(self.ls), columns=columns)
        df.to_csv('stock_info.csv', index=False, encoding='utf-8-sig')
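The string massaging in `parserLsHtml` is compact; a standalone sketch of that step, using two made-up `<li>` strings of the "name(code)" form the code expects:

```python
import re

# Each <li> string has the form "公司名(股票代码)"; splitting on '(' and
# stripping the trailing ')' yields [name, code], which is then swapped
# into [code, name] order.
raw = ['浦发银行(600000)', '邯郸钢铁(600001)']
pairs = [s.split('(') for s in raw]
pairs = [[re.sub(r'[\)]', '', part) for part in item] for item in pairs]
pairs = [[item[1], item[0]] for item in pairs]
print(pairs)  # [['600000', '浦发银行'], ['600001', '邯郸钢铁']]
```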
  • Debugging
  1. Query errors
sp.ls done
['600000', '浦发银行']
['600001', '邯郸钢铁']
html>
Query error: 'NoneType' object has no attribute 'children'
['600003', 'ST东北高']
html>
Query error: 'NoneType' object has no attribute 'children'

# 邯郸钢铁 and ST东北高 have been delisted
1. The fetched stock list is not current, so it lacks timeliness;
2. Queries succeed but writing the output fails: empty query results were not handled, so the pandas output step raised an exception.
  1. Because the stock list is stale, some stocks have already been delisted and return no details
AssertionError: 17 columns passed, passed data had 2 columns
AssertionError: 17 columns passed, passed data had 14 columns
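The column-count error above can be reproduced in isolation (newer pandas versions raise a ValueError with the same message instead of an AssertionError). The three column names here are illustrative, not the script's actual schema:

```python
import pandas as pd

cols = ['code', 'name', 'close']   # 3 expected columns
rows = [['600001', '邯郸钢铁']]     # a delisted stock yields only 2 fields
try:
    df = pd.DataFrame(rows, columns=cols)
    failed = False
except Exception as e:
    failed = True
    print('pandas rejected the data:', e)

# Padding every row to the expected width avoids the error:
rows = [r + [''] * (len(cols) - len(r)) for r in rows]
df = pd.DataFrame(rows, columns=cols)
print(df.shape)  # (1, 3)
```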
1. pandas raised an exception when writing the output, because rows of unequal length were not handled;
2. soup = BeautifulSoup(html, 'html.parser')
   After conversion, soup is a bs4.BeautifulSoup object, not a string,
   so it cannot be used directly in substring tests:
print('current-quote decrease' in soup, 'current-quote increase' in soup)   --False False
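This pitfall is easy to demonstrate with a minimal HTML snippet: `in` on a soup object checks membership among its child nodes, not substring containment, so the substring test must run on the raw html (or on `str(soup)`):

```python
from bs4 import BeautifulSoup

html = '<div class="current-quote decrease">9.99</div>'
soup = BeautifulSoup(html, 'html.parser')

in_html = 'current-quote decrease' in html        # True: substring test
in_text = 'current-quote decrease' in str(soup)   # True: substring test
in_soup = 'current-quote decrease' in soup        # False: child-node test
print(in_html, in_text, in_soup)
```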
