Python Study Notes: Data Extraction with BeautifulSoup4 + Extracting Random ID Card Numbers

I. Preparation

1. Install BeautifulSoup4

The quickest way is to install it directly with pip:

pip install beautifulsoup4

2. BeautifulSoup4 basic tutorial

Link to the basic usage documentation:
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

3. Notes on commonly used methods

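As a rough summary of the methods used most often, here is a minimal sketch run against a small HTML snippet. The HTML, the tag names, and the class names are made up for illustration; only `find`, `find_all`, `select_one`, `get_text`, and `.string` are BeautifulSoup4 calls.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
html = """
<div id="main">
  <p class="intro">Hello <b>world</b></p>
  <a href="/first" class="link">First</a>
  <a href="/second" class="link">Second</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                    # first matching tag, or None
all_links = soup.find_all("a", class_="link")  # list of every matching tag
intro = soup.select_one("div#main p.intro")    # CSS selector, first match

print(first_link["href"])                 # read an attribute
print([a.get_text() for a in all_links])  # text of each tag
print(intro.get_text())                   # text including nested tags
print(intro.string)                       # None, because <p> has more than one child
```

Note the difference between the last two lines: `.string` only returns text when a tag has exactly one child, while `get_text()` joins the text of all descendants.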

II. Hands-on practice project

1. Practice site: http://www.chineseidcard.com/


2. Call the endpoint and analyze the returned data

http://www.chineseidcard.com/?region=110101&birthday=19900307&sex=1&num=5&r=30
The data we want is the actual ID card numbers.
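To see which parameters the request URL carries, the query string can be split apart with the standard library. This is only a sketch for inspecting the URL. The meaning of each parameter is inferred from its name (`region`, `birthday`, `sex`, and `num` as the number of results), and the purpose of `r` is not stated anywhere in these notes.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.chineseidcard.com/?region=110101&birthday=19900307&sex=1&num=5&r=30"

# parse_qs maps each parameter name to a list of values
query = parse_qs(urlsplit(url).query)

# Each parameter appears once here, so keep only the first value
params = {name: values[0] for name, values in query.items()}

print(params)
```

The resulting dictionary has the same keys that are later passed to `requests.get` through the `params` argument.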


Analysis shows that this key information is stored under this table tag:

110101199003072631
110101199003070492
110101199003075314
110101199003078398
110101199003071532
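The request parameters show up directly inside the numbers listed above. The sketch below assumes the standard 18-character layout (6-digit region code, 8-digit birthday, 3-digit sequence code, 1 check character) and the GB 11643-1999 check-digit scheme. The helper names `split_id` and `check_char` are made up for illustration.

```python
# Weights and check characters from the GB 11643-1999 check-digit scheme
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"

def split_id(card_id):
    """Split an 18-character ID number into its four fields."""
    return {
        "region": card_id[:6],
        "birthday": card_id[6:14],
        "sequence": card_id[14:17],
        "check": card_id[17],
    }

def check_char(card_id):
    """Compute the expected check character from the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(card_id[:17], WEIGHTS))
    return CHECK_CHARS[total % 11]

sample = "110101199003072631"
print(split_id(sample))
print(check_char(sample) == sample[17])
```

In all five samples the last digit of the sequence code is odd, which would be consistent with `sex=1` if that parameter means male (an assumption, not something the site documents here).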

3. First simulate the request and fetch the data the page returns

#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def gethtml(IDnum):
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
        "X-Requested-With":"XMLHttpRequest"
    }
    params = {
        "region":"110101",
        "birthday":"19900307",
        "sex":"1",
        "num":IDnum,
        "r":30
    }
    res = requests.get(url,headers=headers,params=params)
    # json.loads() no longer accepts an encoding argument (removed in Python 3.9)
    data = json.loads(res.text)
    return data
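The `json.loads` step suggests that the endpoint does not return raw HTML but a JSON string that wraps an HTML fragment. The sketch below illustrates that decoding step offline. The `body` value is a hypothetical stand-in for `res.text`, not a captured response.

```python
import json

# Hypothetical response body: a JSON string literal wrapping an HTML fragment
body = '"<table class=\\"table\\"><tr><td>110101199003072631</td></tr></table>"'

# Decoding the JSON string yields a plain Python str containing the HTML
html = json.loads(body)

print(type(html).__name__)
print(html)
```

The decoded `str` is what gets handed to `BeautifulSoup` in the next step.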

4. Use BeautifulSoup4 to find the tags

#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def gethtml(IDnum):
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
        "X-Requested-With":"XMLHttpRequest"
    }
    params = {
        "region":"110101",
        "birthday":"19900307",
        "sex":"1",
        "num":IDnum,
        "r":30
    }
    res = requests.get(url,headers=headers,params=params)
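    # json.loads() no longer accepts an encoding argument (removed in Python 3.9)
    data = json.loads(res.text)
    soup = BeautifulSoup(data,"html.parser")

    # Get the data under the second table tag
    table = soup.find_all('table',class_='table')[1]
    # Get a single ID number (search inside table, not the built-in id)
    cardID = table.find_all('td')[0].string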
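Indexing `[1]` directly raises an `IndexError` if the page layout ever changes. The sketch below shows the same lookup with a guard, run against hypothetical HTML that mimics a page with two tables of class `table`. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking a page with two tables of class "table"
html = """
<table class="table"><tr><td>header info</td></tr></table>
<table class="table">
  <tr><td>110101199003072631</td></tr>
  <tr><td>110101199003070492</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", class_="table")

# Guard against a layout change before indexing the second table
if len(tables) < 2:
    raise ValueError("expected at least two tables with class 'table'")

table = tables[1]
# .string works here because each td holds a single text node
first_id = table.find_all("td")[0].string
print(first_id)
```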

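5. Iterate over the results and print all ID numbers

table = soup.find_all('table',class_='table')[1]
The index [1] is needed because, among everything returned, the ID information is stored in the second table.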


#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def gethtml(IDnum):
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
        "X-Requested-With":"XMLHttpRequest"
    }
    params = {
        "region":"110101",
        "birthday":"19900407",
        "sex":"1",
        "num":IDnum,
        "r":30
    }
    res = requests.get(url,headers=headers,params=params)
    # json.loads() no longer accepts an encoding argument (removed in Python 3.9)
    data = json.loads(res.text)
    soup = BeautifulSoup(data,"html.parser")

    # Get the data under the second table tag
    table = soup.find_all('table',class_='table')[1]
    # Get a single ID number
    # cardID = table.find_all('td')[0].string

    # Iterate over every td node
    for td_label in table.find_all('td'):
        # Get the text inside the td tag
        cardID = td_label.string
        print(cardID)

if __name__ == "__main__":
    gethtml(5)

The output is as follows:

110101199004070873
110101199004077979
110101199004076853
110101199004079552
110101199004076634
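The script above prints each number. If the numbers need to be reused elsewhere, one possible variant returns them as a list instead. This is a sketch only: `extract_ids` is a hypothetical helper, and `sample_html` stands in for the decoded response so the example runs without a network request.

```python
from bs4 import BeautifulSoup

def extract_ids(html):
    """Return the text of every td in the second table with class 'table'."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", class_="table")
    if len(tables) < 2:
        # Layout differs from what these notes assume: nothing to extract
        return []
    return [td.get_text(strip=True) for td in tables[1].find_all("td")]

# Hypothetical HTML standing in for the decoded response
sample_html = """
<table class="table"><tr><td>header info</td></tr></table>
<table class="table">
  <tr><td>110101199004070873</td></tr>
  <tr><td>110101199004077979</td></tr>
</table>
"""

ids = extract_ids(sample_html)
print(ids)
```

Here `get_text(strip=True)` is used rather than `.string` so that surrounding whitespace or nested tags inside a cell do not produce `None` or padded values.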
