Web Scraping Study Notes: Scraping Douban Page Data as an Example

1. Libraries to import

from bs4 import BeautifulSoup   # parse web pages and extract data
import re                       # regular expressions for text matching
import urllib.request, urllib.error   # build requests and fetch page data
import xlwt                     # Excel operations
import sqlite3                  # database operations
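bs4 and xlwt are third-party packages (re, urllib, and sqlite3 ship with Python), so they may need to be installed first, e.g. with pip:

pip install beautifulsoup4 xlwt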

2. Steps

(1) Fetch the pages

(2) Parse the data

(3) Save the data

(1) Fetch the pages

Get the content of the page at a given URL:

def askURL(url):
    # user agent: header info that makes the request look like it comes from a browser
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64"
    }

    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):     # HTTP status code, if the server answered
            print(e.code)
        if hasattr(e, "reason"):   # reason for the failure
            print(e.reason)

    return html
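A quick sanity check, assuming the network is reachable (start=0 requests the first page of the list):

html = askURL("https://movie.douban.com/top250?start=0")
print(html[:200])   # first 200 characters of the page source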

How to get the head content:

(1) Open the page in a browser and press F12

(2) Click the Network tab

(3) Press Ctrl+R (or click refresh), then click the red dot to pause recording

[Screenshot 1: the Network panel after refreshing]

(4) Click the first file in the request list

(5) Scroll down and find User-Agent, i.e. the user agent string

[Screenshot 2: the User-Agent field under the request headers]

(6) Copy the value and assign it to head. Note: put double quotes around both User-Agent and the content after the colon, since the copied text acts as a key-value pair, as in the snippet below.
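For example, a copied header line turns into a single dictionary entry (the user agent string is abbreviated here):

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."}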

(2) Parse the data

# The regular expressions below are global variables describing the extraction rules.
# The HTML in each pattern matches the markup of the Douban Top 250 list page.

# link to the movie's detail page
findLink = re.compile(r'<a href="(.*?)">')
# poster image URL (re.S lets . match newlines)
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)
# title (the Chinese title and, when present, the foreign title)
findTitle = re.compile(r'<span class="title">(.*)</span>')
# rating
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# number of ratings (matches the page's Chinese text "...人评价")
findJudge = re.compile(r'<span>(\d*)人评价</span>')
# one-line synopsis
findInq = re.compile(r'<span class="inq">(.*)</span>')
# related details (director, cast, year, genre, ...)
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)


def getData(baseurl):
    datalist = []
    for i in range(0, 10):              # 10 pages, 25 movies per page
        url = baseurl + str(i * 25)
        html = askURL(url)

        # parse each item on the page
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            # print(item)
            data = []
            item = str(item)

            link = re.findall(findLink, item)[0]
            data.append(link)
            # print(link)

            Img = re.findall(findImgSrc, item)[0]
            data.append(Img)

            title = re.findall(findTitle, item)
            if len(title) == 2:
                ctitle = title[0]                    # Chinese title
                data.append(ctitle)
                otitle = title[1].replace("/", "")   # foreign title, strip the separator
                data.append(otitle)
            else:
                data.append(title[0])
                data.append(' ')                     # no foreign title: leave blank

            rating = re.findall(findRating, item)[0]
            data.append(rating)

            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)

            Inq = re.findall(findInq, item)
            if len(Inq) != 0:
                Inq = Inq[0].replace("。", "")       # drop the trailing Chinese full stop
                data.append(Inq)
            else:
                data.append(" ")

            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', " ", bd)   # remove <br/> tags
            bd = re.sub('/', " ", bd)
            data.append(bd.strip())

            datalist.append(data)
    # print(datalist)
    return datalist

The regular expressions normalize each field, and the fields for one movie are stored as one list inside datalist.
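As a quick illustration, the rules can be tried against small HTML fragments (both fragments, including the subject URL, are made up for the example):

print(re.findall(findJudge, '<span>2543012人评价</span>'))
# ['2543012']
print(re.findall(findLink, '<a href="https://movie.douban.com/subject/1292052/">'))
# ['https://movie.douban.com/subject/1292052/']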

(3) Save the data

<1> Save to Excel

def saveData(datalist, savepath):
    print("save...")
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)
    # sheet name: "Douban Movie Top 250"
    sheet = book.add_sheet('豆瓣电影TOP250', cell_overwrite_ok=True)
    # column headers: movie link, image link, Chinese title, foreign title,
    # rating, number of ratings, synopsis, related info
    col = ("电影链接", "图片链接", "影片中文名", "影片外文名", "评分", "评价数", "概况", "相关信息")
    for i in range(0, 8):
        sheet.write(0, i, col[i])          # row 0: header row
    for i in range(len(datalist)):         # one row per movie (250 in total)
        print("row %d" % (i + 1))
        data = datalist[i]
        for j in range(0, 8):
            sheet.write(i + 1, j, data[j])
    book.save(savepath)
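At this point the fetch, parse, and save steps can already be chained; the same calls appear (partly commented out) in main below:

datalist = getData("https://movie.douban.com/top250?start=")
saveData(datalist, "豆瓣电影TOP250.xls")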

[Screenshot 3: the generated Excel sheet]

<2> Save to a database

1. Initialize the database

def init_db(dbpath):
    # create the movie250 table: one column per field in datalist, plus an auto-increment id
    sql = '''
        create table if not exists movie250
        (
        id integer primary key autoincrement,
        info_link text,
        pic_link text,
        cname varchar,
        ename varchar,
        score numeric,
        rated numeric,
        introduction text,
        info text
        )
    '''
    conn = sqlite3.connect(dbpath)   # creates the .db file if it does not exist yet
    cursor = conn.cursor()
    cursor.execute(sql)
    conn.commit()
    conn.close()
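One way to confirm the table was created is to query sqlite_master; a minimal check, assuming dbpath was "movie.db":

conn = sqlite3.connect("movie.db")
print(conn.execute("select name from sqlite_master where type='table'").fetchall())
# expect movie250 in the list (sqlite_sequence also appears, created for the autoincrement id)
conn.close()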

2. Insert the data into the database

def saveData_DB(datalist, dbpath):
    init_db(dbpath)
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()

    for data in datalist:
        for index in range(len(data)):
            if index == 4 or index == 5:   # score and rated are numeric, so no quotes
                continue
            data[index] = '"' + data[index] + '"'   # wrap the text fields in double quotes
        sql = '''
                insert into movie250(info_link,pic_link,cname,ename,score,rated,introduction,info)
                values(%s)
            ''' % ",".join(data)
        print(sql)
        cur.execute(sql)
        conn.commit()
    cur.close()
    conn.close()
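Building the SQL by string concatenation breaks as soon as a field itself contains a double quote. A safer sketch of the same insert uses sqlite3's parameter placeholders instead of manual quoting (the name saveData_DB2 is illustrative):

def saveData_DB2(datalist, dbpath):
    # same effect as saveData_DB, but sqlite3 handles the quoting;
    # expects the raw (unquoted) datalist from getData
    init_db(dbpath)
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()
    sql = '''insert into movie250(info_link,pic_link,cname,ename,score,rated,introduction,info)
             values(?,?,?,?,?,?,?,?)'''
    cur.executemany(sql, datalist)   # binds each 8-element row safely
    conn.commit()
    cur.close()
    conn.close()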

[Screenshot 4]

After installing the Database Navigator plugin, you can browse the .db file and click the table to check whether the data was inserted successfully.

[Screenshot 5: the table contents viewed in Database Navigator]
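Without the plugin, the same check works from Python (movie.db is the path used in main):

conn = sqlite3.connect("movie.db")
for row in conn.execute("select id, cname, score from movie250 limit 3"):
    print(row)   # e.g. (1, '肖申克的救赎', 9.7)
conn.close()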

3. The main function

def main():
    baseurl = "https://movie.douban.com/top250?start="
    # 1. fetch the pages and 2. parse the data (getData does both)
    datalist = getData(baseurl)
    # 3. save the data
    # savepath = "豆瓣电影TOP250.xls"
    dbpath = "movie.db"

    # saveData(datalist, savepath)
    saveData_DB(datalist, dbpath)
    # askURL("https://movie.douban.com/top250?start=")


if __name__ == "__main__":
    main()   # entry point: run the whole pipeline

Note: saveData saves to Excel and saveData_DB saves to the database; uncomment whichever one you want in main.
