Building a simple web crawler - scraping author quotes from Goodreads and company stock prices from Yahoo Finance

Tools: Python 3; PyCharm
Required modules: requests, BeautifulSoup4

  1. Crawl all of Stephen King's quotes on Goodreads  # the same approach can be used to crawl other information from other sites

The code snippet is as follows:

import requests
from bs4 import BeautifulSoup

#requests can be installed by pip through cmd
#BeautifulSoup4 can be installed by File-Settings-Project Interpreter, search and add BeautifulSoup4

''' This is a simple web crawler that crawls quotes by Stephen King on Goodreads.com '''

def stephen_king_quote(max_pages):
    page = 1

    while page <= max_pages:

        #get the url string correctly
        url = 'https://www.goodreads.com/search?page=' + str(page) + '&q=stephen+king&search%5Bsource%5D=goodreads&search_type=quotes&tab=quotes'

        #connect to the web, and store the result to source_code
        source_code = requests.get(url)

        #store the plain text of the source code
        plain_text = source_code.text

        #parse the plain text into a BeautifulSoup object
        soup = BeautifulSoup(plain_text, "html.parser")

        #gather all the quote blocks on this page, identified by the class 'quoteText' - the tag and class may differ if you want to gather other info
        for link in soup.find_all('div', {'class': 'quoteText'}):

            #store the content as text to quote, and print it out
            quote = link.text
            print(quote)
        page += 1

#call the function and get the quotes (for Stephen King, max_pages is 100)
stephen_king_quote(1)

Result: the quotes on Goodreads were crawled successfully.

You can crawl other sites for whatever information you need. If you need to follow specific links found on successive pages and crawl them further, add another loop inside the for loop.
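The nested-loop idea above can be sketched without touching the network. In the sketch below, a dict of hypothetical detail pages stands in for requests.get(url).text, and the class names smallText and quoteDetails are made-up placeholders, not Goodreads' real markup:

```python
from bs4 import BeautifulSoup

# A hypothetical results page: each quote block links to its own detail page.
listing_html = '''
<div class="quoteText">"Get busy living." <a class="smallText" href="/quotes/1">more</a></div>
<div class="quoteText">"Books are a uniquely portable magic." <a class="smallText" href="/quotes/2">more</a></div>
'''

# Hypothetical detail pages keyed by href; in a real crawler this
# lookup would be a second requests.get() call.
detail_pages = {
    '/quotes/1': '<div class="quoteDetails">tags: hope, life</div>',
    '/quotes/2': '<div class="quoteDetails">tags: books, magic</div>',
}

def crawl_details(listing):
    soup = BeautifulSoup(listing, 'html.parser')
    details = []
    # outer loop: every quote block on the listing page
    for block in soup.find_all('div', {'class': 'quoteText'}):
        link = block.find('a', {'class': 'smallText'})
        # inner step: fetch and parse the linked page
        sub_soup = BeautifulSoup(detail_pages[link.get('href')], 'html.parser')
        details.append(sub_soup.find('div', {'class': 'quoteDetails'}).text)
    return details

print(crawl_details(listing_html))
```

The same two-level structure works for any listing-page-then-detail-page site: the outer loop walks the paginated results, the inner step follows each link.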

2. Crawl a company's stock prices from Yahoo Finance  # Apple Inc. is used as the example here

The URL can be found on Yahoo Finance. Specifically: go to http://finance.yahoo.com/, enter the name of the company you want to crawl, and once the page loads, click "Historical Prices" in the menu on the left to open the historical stock price page. Below the table there is a "Download to Spreadsheet" link; right-click it and copy the link address. It points to a CSV file, which is exactly the file we want to crawl.
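The query string of the copied link can also be rebuilt programmatically. The parameter mapping below (s = ticker, a/b/c = start month (0-based)/day/year, d/e/f = end month (0-based)/day/year, g = interval) is inferred from the example URL used later in this post and is an assumption about the old Yahoo API, not documented behavior:

```python
from urllib.parse import urlencode

def build_quote_url(ticker, start, end, interval='d'):
    # start and end are (year, month, day) tuples;
    # Yahoo's month parameters appear to be 0-based, hence the "- 1"
    s_year, s_month, s_day = start
    e_year, e_month, e_day = end
    params = {
        's': ticker,
        'a': s_month - 1, 'b': s_day, 'c': s_year,
        'd': e_month - 1, 'e': e_day, 'f': e_year,
        'g': interval,
        'ignore': '.csv',
    }
    return 'http://real-chart.finance.yahoo.com/table.csv?' + urlencode(params)

print(build_quote_url('AAPL', (1980, 12, 12), (2016, 6, 16)))
```

With these dates the function reproduces the same parameter values as the hard-coded apple_url below (a=11, b=12, c=1980, d=5, e=16, f=2016).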

from urllib import request

apple_url = 'http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=5&e=16&f=2016&g=d&a=11&b=12&c=1980&ignore=.csv'

# download the CSV data from the internet and save it locally

def download_stock_data(url):
    # to connect to the internet first, use REQUEST
    # then store the information to response
    response = request.urlopen(url)
    # read all the raw bytes from the url and decode them to text
    csv = response.read()
    csv_str = csv.decode('utf-8')
    lines = csv_str.split("\n")
    #to save the file and name it 'apple.csv'
    dest_url = r'apple.csv'
    fx = open(dest_url,'w')
    for line in lines:
        fx.write(line + '\n')
    fx.close()

download_stock_data(apple_url)
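Once downloaded, the file can be parsed with the standard csv module instead of splitting strings by hand. The sketch below works on a small in-memory sample with the same column headers as the downloaded file; the price rows themselves are made up for illustration:

```python
import csv
import io

# A few lines in the same shape as the downloaded historical-price CSV
# (the header matches Yahoo's file; the numbers are invented).
sample = (
    'Date,Open,High,Low,Close,Volume,Adj Close\n'
    '2016-06-16,96.45,97.75,96.07,97.55,31236300,97.55\n'
    '2016-06-15,97.82,98.41,97.03,97.14,29445200,97.14\n'
)

def closing_prices(csv_text):
    # DictReader maps each row to the header, so columns can be
    # read by name ('Close') instead of by position
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row['Date']: float(row['Close']) for row in reader}

print(closing_prices(sample))
```

To use it on the real download, pass open('apple.csv').read() in place of the sample string.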

The result is as follows:


References:
http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices
