Python Web Scraping for Beginners: Hands-On Project (1)

  • Collecting static data
    In this first project we will scrape job listings from Lagou (拉勾网). A minimal sketch of the static request-and-parse pattern the whole project builds on follows below; after that, let's jump straight in!
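
Before the project itself, here is a minimal sketch of the static-page workflow every step below follows: fetch the HTML with requests, parse it with lxml, and pull text out with an XPath expression. The URL and the XPath here are illustrative placeholders only, not part of the Lagou crawl.

import requests
from lxml import etree

# Fetch a static page (example.com stands in for any static URL)
resp = requests.get('https://example.com')
resp.raise_for_status()

# Parse the HTML and extract text nodes with an XPath expression
tree = etree.HTML(resp.text)
print(tree.xpath('//h1/text()'))  # e.g. ['Example Domain']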

1. First, import the required libraries:

import requests
from lxml import etree
import pandas as pd
from time import sleep
import random

2. Inspect our cookie in the browser:
(Screenshot: viewing our cookie in the browser)
3. Set up the request headers:

cookie = 'user_trace_token=20190329130619-9fcf5ee7-dcc5-4a9b-b82e-53a0eba6862c...LGRID=20190403124044-a4a8c961-55ca-11e9-bd16-5254005c3644'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3650.400 QQBrowser/10.4.3341.400',
    'Cookie': cookie  # pass the cookie variable, not the literal string 'cookie'
}
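
Before launching the full crawl it can be worth a quick sanity check that the headers and cookie are accepted. The snippet below is only a rough heuristic added here, not part of the original walkthrough: it requests the first listing page and checks whether we were bounced to a login URL, which would suggest the cookie has expired.

# Optional sanity check: does a single request with these headers succeed?
test_url = 'https://www.lagou.com/zhaopin/jiqixuexi/1/?filterOption=3'
resp = requests.get(test_url, headers=headers)
print(resp.status_code, len(resp.text))
if 'login' in resp.url:  # heuristic: a redirect to the login page means the cookie is likely stale
    print('Redirected to the login page - the cookie may have expired.')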

4. Examine the page structure and loop over the pages to collect the data:

# Accumulators for the fields collected from every page
job_name, job_address, job_company, job_salary, job_des = [], [], [], [], []

for i in range(2, 8):
    sleep(random.randint(3, 10))  # random pause between requests to avoid hammering the site
    url = 'https://www.lagou.com/zhaopin/jiqixuexi/{}/?filterOption=3'.format(i)
    print('Scraping page {}...'.format(i), url)
    # Request the page and parse the HTML
    con = etree.HTML(requests.get(url=url, headers=headers).text)

5. Extract each target field with XPath expressions:
(Screenshots: the listing-page HTML structure targeted by the XPath expressions below)

    # Extract each target field with an XPath expression and append it to the accumulators
    job_name += con.xpath("//a[@class='position_link']/h3/text()")
    job_address += con.xpath("//span[@class='add']/em/text()")
    job_company += con.xpath("//div[@class='company_name']/a/text()")
    job_salary += con.xpath("//span[@class='money']/text()")
    job_links = con.xpath("//a[@class='position_link']/@href")

    # Follow each detail-page link and collect the job description
    for link in job_links:
        sleep(random.randint(3, 10))
        con2 = etree.HTML(requests.get(url=link, headers=headers).text)
        # string(.) concatenates all the text inside each <p> of the job-detail block
        job_des.append([p.xpath('string(.)') for p in con2.xpath("//div[@class='job-detail']/p")])

6. Pack the data into a dictionary:

dataset = {'job_title': job_name, 'location': job_address, 'company': job_company, 'salary': job_salary, 'job_requirements': job_des}

# Convert to a DataFrame and save it as CSV
data = pd.DataFrame(dataset)
data.to_csv('machine_learning_LG_job.csv')

7. The scraped results:
(Screenshot: a preview of the scraped results in machine_learning_LG_job.csv)
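
To inspect the output without opening the file by hand, a quick read-back with pandas works; the column names match the dictionary keys from step 6, and index_col=0 is needed because to_csv also wrote the row index.

# Load the saved CSV and preview the first few rows
df = pd.read_csv('machine_learning_LG_job.csv', index_col=0)
print(df.shape)
print(df.head())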
