Python Web Scraping in Practice: Jikexueyuan

       Today we'll scrape the course listings on Jikexueyuan. This time we'll use requests together with XPath, and you'll see what a great combination they make.

Before we start, let's set out our goals:

1. Crawl the course listings on Jikexueyuan.

2. For each course, extract the title, description, duration, level, and number of learners.

1. Determine the URL

     First let's pin down the page URL. Jikexueyuan's course catalog lives at http://www.jikexueyuan.com/course/. As last time, a look at the second page reveals the pattern: the URL is http://www.jikexueyuan.com/course/?pageNum=1, where the trailing 1 is the page number. We can pass in different values to fetch the courses on any page; the rest of the URL stays the same.
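As a quick sketch, the page URLs can be generated from that pattern like this (the range of 3 pages is just for illustration):

```python
# Build listing-page URLs by substituting the page number into the pattern
base = 'http://www.jikexueyuan.com/course/?pageNum='
urls = [base + str(n) for n in range(1, 4)]  # pages 1 to 3
print(urls[0])
```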

2. Downloading the page with requests

      Without further ado, here's the code:
import requests

page = 1
url = 'http://www.jikexueyuan.com/course/?pageNum=' + str(page)
html = requests.get(url)   # download the listing page
print(html.text)           # raw HTML of the page


Run it and check the result:
(screenshot: the raw HTML printed to the console)
That's all it takes.

3. Parsing the page with XPath

     Now that the page source is downloaded, let's parse it. Here we introduce a new principle: big first, then small. That is, first grab the whole block for each course, then match the individual fields inside it. Press F12 to inspect the elements and locate a course; you can see that each course's basic information sits inside its own `<li>...</li>` element, as shown below:
    (screenshot: a course's `<li>` block in the element inspector)

    OK, so let's grab those big blocks first:

    from lxml import etree
    import requests
    
    page = 1
    url = 'http://www.jikexueyuan.com/course/?pageNum=' + str(page)
    html = requests.get(url)
    
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//div[@class="lesson-list"]/ul/li')
    print(content_field)  # a list of lxml Element objects, one per course


    (screenshot: the list of Element objects printed to the console)

    After running it, you can see that the entire page's courses were captured, each stored as an lxml Element corresponding to one `<li>`. Next let's parse out each course's basic information. A quick word on the XPath again, taking the course title as an example:
    div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/text() selects the text inside the `<a>` tag under the `<h2>` inside the div box. Note there is no leading slash, so the path is evaluated relative to each `<li>` element.

    (screenshot: the course title's position in the element inspector)
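To make the relative path concrete, here is a self-contained sketch on a made-up HTML fragment; only the class names follow the real page, the content is invented:

```python
from lxml import etree

# Made-up fragment mimicking the listing page's structure
html = '''
<div class="lesson-list"><ul>
  <li><div class="lesson-infor">
    <h2 class="lesson-info-h2"><a href="#">Course A</a></h2>
    <p> A short intro. </p>
  </div></li>
</ul></div>'''

selector = etree.HTML(html)
for li in selector.xpath('//div[@class="lesson-list"]/ul/li'):
    # no leading slash: the path is evaluated relative to this <li>
    title = li.xpath('div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/text()')[0]
    intro = li.xpath('div[@class="lesson-infor"]/p/text()')[0].strip()
    print(title, intro)
```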
    
    
    for each in content_field:
        title = each.xpath('div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/text()')[0]
        content = (each.xpath('div[@class="lesson-infor"]/p/text()')[0]).strip()
        classtime = each.xpath('div[@class="lesson-infor"]/div/div/dl/dd[@class="mar-b8"]/em/text()')[0]
        classlevel = each.xpath('div[@class="lesson-infor"]/div/div/dl/dd[@class="zhongji"]/em/text()')[0]
        learnnum = each.xpath('div[@class="lesson-infor"]/div[@class="timeandicon"]/div/em/text()')[0]
        print(title)
        print(content)
        print(classtime)
        print(classlevel)
        print(learnnum)


    (screenshot: the parsed course fields printed to the console)


    OK, it's that simple. Now let's pull the whole thing together:

    
    
    # -*- coding: utf-8 -*-
    
    from lxml import etree
    import requests
    
    page = 1
    url = 'http://www.jikexueyuan.com/course/?pageNum=' + str(page)
    html = requests.get(url)
    
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//div[@class="lesson-list"]/ul/li')
    for each in content_field:
        title = each.xpath('div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/text()')[0]
        content = (each.xpath('div[@class="lesson-infor"]/p/text()')[0]).strip()
        classtime = each.xpath('div[@class="lesson-infor"]/div/div/dl/dd[@class="mar-b8"]/em/text()')[0]
        classlevel = each.xpath('div[@class="lesson-infor"]/div/div/dl/dd[@class="zhongji"]/em/text()')[0]
        learnnum = each.xpath('div[@class="lesson-infor"]/div[@class="timeandicon"]/div/em/text()')[0]
        print(title)
        print(content)
        print(classtime)
        print(classlevel)
        print(learnnum)


    Hmm, I have to admit that output is pretty rough. Let's do something a bit nicer: scrape 10 pages of courses and save them to a txt file.

    4. Complete code, refactored into functions

    
    
    # -*- coding: utf-8 -*-
    from lxml import etree
    import requests
    
    # Save one course's info to info.txt
    def saveinfo(classinfo):
        with open('info.txt', 'a', encoding='utf-8') as f:
            f.write('title:' + classinfo['title'] + '\n')
            f.write('content:' + classinfo['content'] + '\n')
            f.write('classtime:' + classinfo['classtime'] + '\n')
            f.write('classlevel:' + classinfo['classlevel'] + '\n')
            f.write('learnnum:' + classinfo['learnnum'] + '\n\n')
    
    # Scrape one listing page and return a list of course dicts
    def spider(url):
        html = requests.get(url)
        selector = etree.HTML(html.text)
        content_field = selector.xpath('//div[@class="lesson-list"]/ul/li')
        info = []
        for each in content_field:
            classinfo = {}
            classinfo['title'] = each.xpath('div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/text()')[0]
            classinfo['content'] = (each.xpath('div[@class="lesson-infor"]/p/text()')[0]).strip()
            # collapse the whitespace inside the duration string
            classtime = (each.xpath('div[@class="lesson-infor"]/div/div/dl/dd[@class="mar-b8"]/em/text()')[0]).split()
            classinfo['classtime'] = ''.join(classtime)
            classinfo['classlevel'] = each.xpath('div[@class="lesson-infor"]/div/div/dl/dd[@class="zhongji"]/em/text()')[0]
            classinfo['learnnum'] = each.xpath('div[@class="lesson-infor"]/div[@class="timeandicon"]/div/em/text()')[0]
            info.append(classinfo)
        return info
    
    
    if __name__ == '__main__':
        print('Start crawling...')
        pages = []
        # Build the link for each of the 10 pages
        for i in range(1, 11):
            newpage = 'http://www.jikexueyuan.com/course/?pageNum=' + str(i)
            print('Page %d' % i)
            print('Queueing page: ' + newpage)
            pages.append(newpage)
        for url in pages:
            for classinfo in spider(url):
                saveinfo(classinfo)
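As a variation not in the original, the list of dicts that spider() returns could just as easily be written out as JSON instead of a flat txt file, which keeps the records machine-readable. The sample record below is hypothetical, in the same shape spider() produces:

```python
import json

# Hypothetical record matching the shape spider() returns
info = [{'title': 'Course A', 'content': 'A short intro.',
         'classtime': '1h30m', 'classlevel': 'Beginner', 'learnnum': '100'}]

with open('info.json', 'w', encoding='utf-8') as f:
    json.dump(info, f, ensure_ascii=False, indent=2)
```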


    OK, now go try it yourself!
