Python test: crawl the text content of my CSDN blog

 (Owned by: 春夜喜雨 http://blog.csdn.net/chunyexiyu)

Reference: the tutorial on python123

Reference: usage of requests_html

A crawling exercise in Python: fetch the titles and text content of my personal blog posts.

The goals were:

1. to practice crawling;

2. to crawl my own blog posts so they can serve as a backup.

Once the exercise was done the script had no further use, so I am sharing it here for fellow Python learners to reference:

The crawl output looks like this:

[Figure 1: screenshot of the crawl output]

 

The crawl script is below.

The crawl approach:

1. Starting from article list page 1, walk the list pages one by one;

2. open every article link found on each list page, extract its text content, and save that text to a file.

from requests_html import HTMLSession
import os
import re

# CSDN user name whose articles will be crawled
userName = "chunyexiyu"
session = HTMLSession()
mainUrl = "https://blog.csdn.net/{}/".format(userName)

# create the output directory, named after the user
outputDir = userName
if not os.path.exists(outputDir):
    os.mkdir(outputDir)

# crawl at most 20 list pages by default
checkUrlUserName = "/" + userName + "/"
for i in range(1, 21):
    print("process page {}".format(i).center(50, "-"))
    listPage = session.get(mainUrl + "article/list/" + str(i))
    articleUrlList = listPage.html.find("#mainBox > main > div.article-list", first=True)
    if articleUrlList is None:  # no more list pages
        break
    uid = listPage.html.find("body > header > div > div.title-box > h1", first=True).text

    for url in articleUrlList.links:
        # skip links that belong to other users, e.g. /yoyo_liyy/article/details/82762601
        if checkUrlUserName.upper() not in url.upper():
            continue
        page = session.get(url)
        title = page.html.find("head > title", first=True).text
        title = re.sub(r'[ :<>/:*?"|\\\-]', "_", title)             # replace characters invalid in file names with '_'
        title = re.sub(r'(_*{})*_*CSDN博客'.format(uid), "", title)  # remove the "...CSDN博客" tail
        title = re.sub(r'_+', "_", title)                           # collapse duplicate '_'
        print("article: {}".format(title))
        contentElement = page.html.find("#content_views", first=True)
        if contentElement is None:  # page without an article body
            continue
        with open(os.path.join(outputDir, "Page{}_{}.txt".format(i, title)), "w", encoding="utf-8") as file:
            file.write(contentElement.text)
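The three re.sub calls that clean the page title can be pulled out into a small standalone helper, which makes the cleanup easy to test in isolation. A minimal sketch — the sanitize_title name is my own, and re.escape is a small hardening the script above does not need for this uid:

```python
import re

def sanitize_title(title, uid):
    """Clean a CSDN page title so it can be used as a file name."""
    # Replace characters that are not allowed in file names with '_'
    title = re.sub(r'[ :<>/:*?"|\\\-]', "_", title)
    # Drop the trailing "<uid>...CSDN博客" suffix CSDN appends to page titles;
    # re.escape guards against regex metacharacters in the uid
    title = re.sub(r'(_*{})*_*CSDN博客'.format(re.escape(uid)), "", title)
    # Collapse runs of '_' into a single one
    return re.sub(r'_+', "_", title)

print(sanitize_title("hello: world_chunyexiyu_CSDN博客", "chunyexiyu"))  # hello_world
```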
