Tools: Python 3; PyCharm
Required modules: requests, BeautifulSoup4
The code snippet is as follows:
import requests
from bs4 import BeautifulSoup
# requests can be installed with pip from the command line
# BeautifulSoup4 can be installed in PyCharm via File > Settings > Project Interpreter: search for and add BeautifulSoup4

'''This is a simple web crawler that crawls quotes by Stephen King on Goodreads.com'''
def stephen_king_quote(max_pages):
    page = 1
    while page <= max_pages:
        # build the url string correctly (note: str(page), not str('page'))
        url = 'https://www.goodreads.com/search?page=' + str(page) + '&q=stephen+king&search%5Bsource%5D=goodreads&search_type=quotes&tab=quotes'
        # connect to the web, and store the response in source_code
        source_code = requests.get(url)
        # store the plain text of the page source
        plain_text = source_code.text
        # parse the HTML and store the resulting tree in soup
        soup = BeautifulSoup(plain_text, "html.parser")
        # gather all the divs on this page, but only those whose class is quoteText - this may vary if you want to gather other info or if the element is different
        for link in soup.findAll('div', {'class': 'quoteText'}):
            # store the content as text in quote, and print it out
            quote = link.text
            print(quote)
        page += 1

# call the function and get the quotes [for Stephen King, max_pages can be up to 100]
stephen_king_quote(1)
The same approach works for crawling other sites for the information you need. If you need to follow specific links found on consecutive pages and crawl them further, add another level of loop inside the for loop.
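The two-level pattern described above can be sketched as follows. To keep the example runnable offline, the pages here are hypothetical hard-coded HTML strings and the standard-library html.parser stands in for BeautifulSoup; the URLs, PAGES dictionary, and fetch_links helper are all illustrative stand-ins, not part of the crawler above.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags, mimicking soup.findAll('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# hypothetical stand-in for requests.get(url).text
PAGES = {
    'https://example.com/list?page=1':
        '<a href="https://example.com/item/1">one</a>'
        '<a href="https://example.com/item/2">two</a>',
    'https://example.com/item/1': '<a href="https://example.com/deep/1">deep</a>',
    'https://example.com/item/2': '<a href="https://example.com/deep/2">deep</a>',
}

def fetch_links(url):
    """Return every link found on the (stubbed) page at url."""
    parser = LinkCollector()
    parser.feed(PAGES.get(url, ''))
    return parser.links

def crawl(max_pages):
    found = []
    for page in range(1, max_pages + 1):
        list_url = 'https://example.com/list?page=' + str(page)
        # outer loop: links gathered from the listing page
        for link in fetch_links(list_url):
            # inner loop: follow each link and crawl one level deeper
            for deep_link in fetch_links(link):
                found.append(deep_link)
    return found

print(crawl(1))
```

In a real crawler, fetch_links would issue a requests.get for each URL and parse the response with BeautifulSoup, but the nesting of the loops is the same.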
2. Crawling a company's stock price data from Yahoo Finance (using Apple Inc. as the example)
For the URL, refer to Yahoo Finance. Specifically: go to http://finance.yahoo.com/, enter the name of the company you want to crawl, and once its page loads, click Historical Prices in the menu on the left to open the historical stock price page, which shows the company's historical prices. Below the table there is a "Download to Spreadsheet" link; right-click it and copy the link address. It points to a csv file, and that csv file is what we want to crawl.
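The long query string can also be built programmatically instead of copied by hand. In Yahoo's old table.csv endpoint (since retired), s is the ticker, a/b/c appear to be the start month (0-based), day, and year, d/e/f the end month/day/year, and g the interval; this reading of the parameters is an assumption inferred from the copied URL, so treat the sketch below as illustrative.

```python
from urllib.parse import urlencode

def build_history_url(ticker, start, end, interval='d'):
    """Build a historical-prices CSV URL for the old Yahoo Finance endpoint.

    start and end are (year, month, day) tuples with 1-based months;
    the API itself uses 0-based months (assumed from the example URL).
    """
    (sy, sm, sd), (ey, em, ed) = start, end
    params = {
        's': ticker,
        'a': sm - 1, 'b': sd, 'c': sy,  # start date (month is 0-based)
        'd': em - 1, 'e': ed, 'f': ey,  # end date (month is 0-based)
        'g': interval,                  # d=daily, w=weekly, m=monthly
        'ignore': '.csv',
    }
    return 'http://real-chart.finance.yahoo.com/table.csv?' + urlencode(params)

print(build_history_url('AAPL', (1980, 12, 12), (2016, 6, 16)))
```

With those arguments the function reproduces the same parameters as the apple_url used below (December 12, 1980 through June 16, 2016, daily).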
from urllib import request

apple_url = 'http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=5&e=16&f=2016&g=d&a=11&b=12&c=1980&ignore=.csv'

# download the csv data from the internet and save it locally
def download_stock_data(url):
    # connect to the internet first with urlopen,
    # then store the response object in response
    response = request.urlopen(url)
    # read all the bytes from the url and decode them to a string
    # (decode, rather than str(), so no b'...' wrapper ends up in the file)
    csv_str = response.read().decode('utf-8')
    lines = csv_str.split('\n')
    # save the file and name it 'apple.csv'
    dest_url = r'apple.csv'
    fx = open(dest_url, 'w')
    for line in lines:
        fx.write(line + '\n')
    fx.close()

download_stock_data(apple_url)
The output is as follows:
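Once apple.csv has been saved, the stdlib csv module can read it back into structured rows. The sample text below is an illustrative stand-in in the same Date,Open,High,Low,Close,Volume,Adj Close layout as the downloaded file; the values are made up, not real AAPL prices.

```python
import csv
import io

# illustrative sample in the layout of the downloaded file (values are made up)
sample = """Date,Open,High,Low,Close,Volume,Adj Close
2016-06-16,96.45,97.75,96.07,97.55,31236300,97.55
2016-06-15,97.82,98.41,97.03,97.14,29445200,97.14
"""

def parse_stock_csv(text):
    """Parse CSV text into a list of {column: value} dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = parse_stock_csv(sample)
for row in rows:
    print(row['Date'], row['Close'])
```

For the real file, replace io.StringIO(text) with open('apple.csv') after running download_stock_data.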
References:
http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices