Python Crawler + Java Web API + Android: A Complete Small Project Walkthrough (Part 1)

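After three years working in testing I have picked up a fair amount of development knowledge, so on a whim I decided to write a small project for fun during a quiet spell at work.
The overall plan for the project is as follows:
1. Python: write a crawler that scrapes Qiushibaike (糗事百科), extracts the required elements, and inserts them into a database
2. Java: develop an API that serves the data as JSON, with pagination
3. Android: write an APK that parses the JSON API and displays the data in a ListView, with pagination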

This post covers the Python crawler that scrapes the Qiushibaike data.

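Preparation: a Python environment with lxml and pymysql installed. You can install them by running pip install from the Scripts directory of your Python installation. The crawler code in this post also imports requests, so install it the same way if it is missing.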
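As a quick sanity check before running the crawler, the sketch below reports which of the three packages are not importable. It assumes Python 3, and the helper name `missing_modules` is made up for this illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # The three third-party packages imported by the crawler in this post.
    missing = missing_modules(['requests', 'lxml', 'pymysql'])
    if missing:
        print('Missing packages, install with: pip install ' + ' '.join(missing))
    else:
        print('All crawler dependencies are installed.')
```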
Database preparation: install MySQL and create the table below (the crawler code connects to a database named `shop`):

CREATE TABLE `qiushibaike` (
    `id` INT  NOT NULL AUTO_INCREMENT ,
    `imgUrl` VARCHAR (3000),
    `username` VARCHAR (3000),
    `content` VARCHAR (3000),
    `vote` VARCHAR (3000),
    `comments` VARCHAR (3000),
    `imgpath` VARCHAR (3000),
    PRIMARY KEY ( id )
)DEFAULT CHARSET=utf8; 

Open the Qiushibaike site and page through it: the number that follows page in the URL is the page number.


[Figure 1: image.png]
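The pagination pattern can be sketched in a few lines. This is only an illustration: the helper names `build_page_url` and `build_page_urls` are made up here, and the base URL is the one used by the crawler code in this post.

```python
# Base URL taken from the crawler code in this post; the page number is appended to it.
BASE_URL = 'http://www.qiushibaike.com/8hr/page/'


def build_page_url(page):
    """Return the listing URL for the given page number."""
    return BASE_URL + str(page)


def build_page_urls(first, last):
    """Return the listing URLs for pages first..last, inclusive."""
    return [build_page_url(number) for number in range(first, last + 1)]


if __name__ == '__main__':
    for url in build_page_urls(1, 3):
        print(url)
```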

The crawler code is below:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from lxml import etree
import pymysql

def insert(imgUrl, username, content, vote, comments, imgpath):
    # Connect to the database
    connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', db='shop',
                                 charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    try:
        # Create a cursor from the connection
        with connection.cursor() as cursor:
            # Build the parameterized SQL statement and execute it
            sql = 'INSERT INTO `qiushibaike` (`imgUrl`,`username`,`content`,`vote`,`comments`,`imgpath`) VALUES (%s,%s,%s,%s,%s,%s)'
            cursor.execute(sql, (imgUrl, username, content, vote, comments, imgpath))
        # Commit the transaction
        connection.commit()
    finally:
        # Always close the connection, even if the insert fails
        connection.close()

def loadPage(page):

    url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8'}

    try:
        response = requests.get(url, headers=headers)
        resHtml = response.text

        html = etree.HTML(resHtml)
        result = html.xpath('//div[contains(@id,"qiushi_tag")]')

        # Iterate over the posts and extract each field
        for site in result:
            # User avatar URL via XPath; empty string if not found
            try:
                imgUrl = site.xpath('./div/a/img/@src')[0]
            except IndexError:
                imgUrl = ""

            # Username
            try:
                username = site.xpath('./div/a/h2/text()')[0]
            except IndexError:
                username = ""

            # Post content
            try:
                content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
            except (IndexError, AttributeError):
                content = ""

            # Vote count
            try:
                vote = site.xpath('.//i')[0].text
            except IndexError:
                vote = ""

            # Comment count
            try:
                comments = site.xpath('.//i')[1].text
            except IndexError:
                comments = ""

            # Image attached to the post
            try:
                imgpath = site.xpath('./div/a/img/@src')[1]
            except IndexError:
                imgpath = ""

            print(imgUrl, username, content, vote, comments, imgpath)

            # Insert into the database
            insert(imgUrl, username, content, vote, comments, imgpath)

    except Exception as e:
        print(e)


if __name__ == '__main__':
    # Load pages 1 through 12
    for num in range(1, 13):
        loadPage(num)
        print("===============Page " + str(num) + " loaded================")
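The six try/except blocks in loadPage all follow the same pattern: take the n-th XPath match, or fall back to an empty string. One possible way to cut the repetition is a pair of small helpers. This is a sketch rather than part of the original crawler; the names `first_or_default` and `text_or_default` are hypothetical, and the demo uses plain lists and a stand-in class so that it runs without lxml.

```python
def first_or_default(matches, index=0, default=""):
    """Return matches[index], or default when the list is too short."""
    try:
        return matches[index]
    except IndexError:
        return default


def text_or_default(element, default=""):
    """Return the stripped .text of an element-like object, or default."""
    text = getattr(element, 'text', None)
    return text.strip() if text else default


if __name__ == '__main__':
    # Plain lists stand in for the lists that site.xpath(...) returns.
    print(first_or_default(['avatar.jpg', 'post.jpg'], index=1))
    print(repr(first_or_default([], index=1)))

    class FakeElement(object):
        """Stand-in for an lxml element; only the .text attribute is used."""
        text = '  some post text  '

    print(text_or_default(FakeElement()))
    print(repr(text_or_default(None)))
```

Inside the loop this could be used as, for example, `imgUrl = first_or_default(site.xpath('./div/a/img/@src'))` and `imgpath = first_or_default(site.xpath('./div/a/img/@src'), index=1)`.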


Once the crawl finishes, check the database. Rows like the ones shown below mean the crawl succeeded.
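The same check can be done from Python instead of a MySQL client. The sketch below counts the rows in `qiushibaike` through any DB-API connection. The function name `count_rows` is made up for this illustration, and the demo uses an in-memory sqlite3 database purely as a stand-in so that it runs without a MySQL server; with the real setup you would pass a connection created by `pymysql.connect(...)` instead.

```python
import sqlite3


def count_rows(connection):
    """Return the number of rows in the qiushibaike table."""
    cursor = connection.cursor()
    try:
        cursor.execute('SELECT COUNT(*) FROM qiushibaike')
        row = cursor.fetchone()
    finally:
        cursor.close()
    # A pymysql DictCursor returns a dict; default cursors return a tuple.
    if isinstance(row, dict):
        return list(row.values())[0]
    return row[0]


if __name__ == '__main__':
    # Stand-in database with a simplified table, for demonstration only.
    demo = sqlite3.connect(':memory:')
    demo.execute('CREATE TABLE qiushibaike (id INTEGER PRIMARY KEY, content TEXT)')
    demo.executemany('INSERT INTO qiushibaike (content) VALUES (?)',
                     [('first post',), ('second post',)])
    demo.commit()
    print('rows stored:', count_rows(demo))
    demo.close()
```

A count greater than zero after the crawl is a quick signal that rows were written; inspecting a few rows by hand is still worthwhile to confirm the fields look right.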


[Figure 2: image.png]
