Learning Python: Day 4

Web Scraping

Parsing an HTML document with fromstring from lxml's html module

from lxml import html
# html_data is the page content (an HTML string)
selector=html.fromstring(html_data)

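fromstring() turns the string into a tree of elements. It treats anything written like a tag as a tag, but it does not check whether that tag is a real HTML tag. The xpath() method then queries the tree; for element and text() queries like the one below, the result is always a list [], even when there is only one match.
You can debug XPath expressions first in Chrome with the XPath Helper extension.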

h1=selector.xpath('/html/body/h1/text()')
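
To make the list behaviour concrete, here is a minimal sketch that runs without a network connection. The HTML string and the `<madeup>` tag are made up for illustration and are not from any real page.

```python
from lxml import html

# Hypothetical page content, used only for illustration.
html_data = """
<html>
  <body>
    <h1>Hello lxml</h1>
    <p class="note">first</p>
    <p class="note">second</p>
    <madeup>not a real HTML tag</madeup>
  </body>
</html>
"""

selector = html.fromstring(html_data)

# Element and text() queries return a list, even for a single match.
h1 = selector.xpath('/html/body/h1/text()')
notes = selector.xpath('//p[@class="note"]/text()')

# No match gives an empty list, so check before indexing with [0].
missing = selector.xpath('//h2/text()')
first_h2 = missing[0] if missing else None

# The parser does not validate tag names: <madeup> becomes an element too.
madeup = selector.xpath('//madeup/text()')

print(h1, notes, first_h2, madeup)
```

Checking for an empty list before taking `[0]` avoids an IndexError when a page does not contain the element you expect.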

To fetch a response, import the requests package

import requests
url='http://www.dangdang.com/'
response=requests.get(url)
print(response)

# Response body as str
print(response.text)
# Response body as bytes
print(response.content)
# Response headers
print(response.headers)
# HTTP status code
print(response.status_code)
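
The difference between the str and bytes forms can be shown without sending a request. In the sketch below, `content` and `text` are plain stand-ins for `response.content` and `response.text`; as far as I understand, requests produces `text` by decoding `content` with `response.encoding`, so a wrong encoding shows up as garbled characters.

```python
# Stand-in for response.content: the raw bytes a server might send.
content = '当当网'.encode('utf-8')

# Stand-in for response.text: the same bytes decoded into str.
text = content.decode('utf-8')

# Decoding with the wrong encoding produces garbled text.
garbled = content.decode('latin-1')

print(type(content), len(content))
print(type(text), text)
print(garbled == text)
```

If `response.text` looks garbled on a real page, setting `response.encoding` to the page's actual encoding before reading `response.text` is usually the first thing to try.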

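Finding the request headers: in Chrome, open the developer tools, select a request in the Network panel, and copy the values listed under Request Headers (for example User-Agent).

Adding request headers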

headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"}
resp=requests.get('https://www.zhihu.com/',headers=headers)
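
To check which headers would actually be sent, one option is to build the request without sending it. This is only a sketch: it uses `requests.Request(...).prepare()`, and the shortened User-Agent string is a placeholder rather than a value taken from a real browser.

```python
import requests

# Shortened, hypothetical User-Agent used only for this illustration.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Build the request without sending it, to inspect what would go out.
request = requests.Request('GET', 'https://www.zhihu.com/', headers=headers)
prepared = request.prepare()

print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

Nothing is sent over the network here, so this is a safe way to confirm the header dictionary is spelled correctly before calling requests.get().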

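Steps for scraping information from a site

1. Choose the target site
2. Set the request headers and fetch the site's response as str with requests.get()
3. Extract the relevant part of the page
4. Iterate over the extracted items and pull out the fields you want
5. Append each item to a list as a dictionary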

import requests
from lxml import html

# Target site
url = 'https://movie.douban.com/cinema/later/chongqing/'
# Fetch the site's response as str
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"}
resp = requests.get(url, headers=headers)
html_data = resp.text
# Extract the relevant part of the page
selector = html.fromstring(html_data)
div_list = selector.xpath('//div[@id="showing-soon"]/div')
print('{} movies are coming soon'.format(len(div_list)))
# One dictionary per movie is collected here
movie_list = []
# Iterate over div_list
for div in div_list:
    # Movie title
    movie_name = div.xpath('./div[@class="intro"]/h3/a/text()')[0]
    # Texts of the <li> items: release date, genre, country (in that order)
    li_texts = div.xpath('./div[@class="intro"]/ul/li/text()')
    movie_date = li_texts[0]
    movie_type = li_texts[1]
    movie_country = li_texts[2]
    # Number of people who want to see it, e.g. "1234人想看" -> 1234.0
    movie_pnum = div.xpath('./div[@class="intro"]/ul/li/span/text()')[0]
    movie_pnum = float(movie_pnum.replace('人想看', ''))
    print(movie_pnum)

    # Add this movie's information to the list
    movie_list.append({
        'name': movie_name,
        'date': movie_date,
        'type': movie_type,
        'country': movie_country,
        'pnum': movie_pnum
    })
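
The extraction steps can also be tried offline. The sketch below wraps them in a hypothetical helper, `parse_movies`, and feeds it a made-up HTML snippet that only imitates the structure the XPath expressions expect; the real page may differ, and the titles, dates and counts are invented.

```python
from lxml import html

# Hypothetical snippet imitating the structure the XPath expressions expect.
SAMPLE_HTML = """
<html><body>
<div id="showing-soon">
  <div class="item">
    <div class="intro">
      <h3><a href="#">Movie A</a></h3>
      <ul>
        <li>08月01日</li>
        <li>剧情</li>
        <li>中国大陆</li>
        <li><span>1234人想看</span></li>
      </ul>
    </div>
  </div>
  <div class="item">
    <div class="intro">
      <h3><a href="#">Movie B</a></h3>
      <ul>
        <li>08月09日</li>
        <li>动画</li>
        <li>美国</li>
        <li><span>56人想看</span></li>
      </ul>
    </div>
  </div>
</div>
</body></html>
"""


def parse_movies(html_data):
    """Return one dict per movie found under div#showing-soon."""
    selector = html.fromstring(html_data)
    movies = []
    for div in selector.xpath('//div[@id="showing-soon"]/div'):
        names = div.xpath('./div[@class="intro"]/h3/a/text()')
        info = div.xpath('./div[@class="intro"]/ul/li/text()')
        pnum = div.xpath('./div[@class="intro"]/ul/li/span/text()')
        # Skip entries that do not have the expected fields.
        if not names or len(info) < 3 or not pnum:
            continue
        movies.append({
            'name': names[0],
            'date': info[0],
            'type': info[1],
            'country': info[2],
            'pnum': float(pnum[0].replace('人想看', '')),
        })
    return movies


movie_list = parse_movies(SAMPLE_HTML)
for movie in movie_list:
    print(movie)
```

Keeping the parsing in a function that takes a string means the same code can be run on `resp.text` from a live request or on a saved copy of the page.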
