Python in Practice: Scraping Phone Numbers

Results:

Links to the detail pages, collected from the list page:

[Image 1]

Part of the information on a detail page:

[Image 2]

My code:

from bs4 import BeautifulSoup
import requests, time
import pymongo

client = pymongo.MongoClient("localhost", 27017)
phone_number = client["phone_number"]
sheet1 = phone_number["sheet1"]
item_phone_number = phone_number["item_phone_number"]

def get_links_from(channel, pages):
    # Example list page: http://bj.58.com/shoujihao/pn2/
    list_view = "{}pn{}/".format(channel, str(pages))
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select("strong.number")
    links = soup.select("li > a.t")
    for title, link in zip(titles, links):
        # Drop the query string so only the bare detail-page URL is stored
        data = {"title": title.text, "link": link.get("href").split('?')[0]}
        sheet1.insert_one(data)
        print(data)

def get_info_from(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, "lxml")
    # split() then join() collapses spaces, tabs and newlines in the title
    title1 = soup.select("h1")[0].text.split()
    title = " ".join(title1)
    date = soup.select("li.time")[0].text.strip()
    price = soup.select("div.su_con > span")[0].text.strip()
    sellor = soup.select("ul > ul > li > a")[0].text
    data = {
        'title': title,
        'date': date,
        'price': price,
        'sellor': sellor,
    }
    item_phone_number.insert_one(data)
    print(data)

get_links_from("http://bj.58.com/shoujihao/", 2)
for item in sheet1.find():
    get_info_from(item["link"])

'''
To collect more information:
numbers = soup.select("div.hm_1 > span > a")[3:]
types = soup.select("ul > li > div.hm_2 > span")[3:]
prices = soup.select("div.hm_3 > span")[3:]
for number, type, price in zip(numbers, types, prices):
    print(number.text, type.text, price.text+'元')
'''
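
The two string operations that get_links_from relies on, building the list-page URL and stripping the query string from each link, can be tried on their own without touching the network or the database. The following is only a minimal sketch: the helper names build_list_url and clean_link are hypothetical (they do not appear in the code above), and the href in the demo is a made-up example rather than a real listing.

```python
def build_list_url(channel, page):
    """Build a list-page URL such as http://bj.58.com/shoujihao/pn2/."""
    return "{}pn{}/".format(channel, page)


def clean_link(href):
    """Keep only the part of the link before the first '?'."""
    return href.split("?")[0]


if __name__ == "__main__":
    print(build_list_url("http://bj.58.com/shoujihao/", 2))
    # Made-up href, used only to show the query string being dropped.
    print(clean_link("http://bj.58.com/shoujihao/12345x.shtml?psid=1&entinfo=2"))
```

Storing the link without its query string means the same detail page is always saved under the same URL, which makes later lookups simpler.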

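Summary:

-1 When the text inside a tag contains spaces, tabs, or newlines, first break it into pieces with split(), then put the pieces back together with join().

-2 Keep the code as simple as possible; avoid nesting loops inside loops or dictionaries inside dictionaries where you can.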
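
The split-then-join idea from the first point can be shown in isolation. This is a small sketch, assuming a hypothetical helper name normalize_whitespace and a made-up sample string; it is the same pattern get_info_from applies to the h1 title.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops the empty pieces at both ends.
    pieces = text.split()
    # Joining with a single space leaves exactly one space between words.
    return " ".join(pieces)


if __name__ == "__main__":
    raw = "\n\t  138 0000 0000 \t lucky   number \n"  # made-up sample text
    print(normalize_whitespace(raw))
```

Unlike strip(), which only trims the two ends, this also collapses whitespace in the middle of the text.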

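Note: only the upper part of the detail page is scraped here. To scrape the lower part as well, you can create a separate collection (table) and store those results in it.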
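
One possible way to store the lower part separately is sketched below. It is an assumption, not the author's code: build_rows, store_lower_part and InMemoryCollection are hypothetical names, the field names are invented for illustration, and the sample values are made up. The in-memory class only imitates the insert_many method of a pymongo collection so the sketch runs without MongoDB; in a real run a collection from the phone_number database could be passed in its place.

```python
class InMemoryCollection:
    """Hypothetical stand-in that imitates a collection's insert_many."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


def build_rows(numbers, types, prices):
    # zip walks the three lists in parallel, so no nested loop is needed
    # and each row stays a flat dictionary.
    return [
        {"number": n, "type": t, "price": p}
        for n, t, p in zip(numbers, types, prices)
    ]


def store_lower_part(collection, numbers, types, prices):
    rows = build_rows(numbers, types, prices)
    if rows:  # pymongo's insert_many raises an error on an empty list
        collection.insert_many(rows)
    return len(rows)


if __name__ == "__main__":
    lower_part = InMemoryCollection()
    # Made-up values standing in for the text of the selected tags.
    count = store_lower_part(
        lower_part, ["13800000000", "13900000000"], ["A", "B"], ["100", "200"]
    )
    print(count, lower_part.documents)
```

Keeping the row building apart from the storage call also makes it easy to check the parsed data before anything is written to the database.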
