[Regular Expressions] Scraping Doupo Cangqiong (斗破苍穹) with Python & R

Python

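# Load modules
import re
import time
import requests

# Spoof the request headers so the request looks like it comes from a browser
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/63.0.3239.84 Safari/537.36')
}

# Open (or create) a txt file at the given path in append mode; data is written to it later.
# encoding='utf-8' keeps the Chinese text writable regardless of the platform default.
f = open('D://Spyder/WD/novel.txt', 'a+', encoding='utf-8')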

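# Define get_links: collect the link to every chapter
def get_links(url):
    destination = requests.get(url, headers=headers)
    # NOTE: the HTML tags in this pattern were stripped from the source text.
    # '<li><a href="(.*?)"' is an assumed reconstruction and may need adjusting
    # to the real page markup.
    links = re.findall('<li><a href="(.*?)"', destination.text)
    urls = ['http://www.doupoxs.com{}'.format(link) for link in links]
    for url in urls:
        get_info(url)
        print(url + '...Done')

# Define get_info: extract the body text from each chapter page;
# mind the encoding when writing to the local txt file
def get_info(url):
    destination = requests.get(url, headers=headers)
    # NOTE: the tags here were also stripped from the source text;
    # '<p>(.*?)</p>' is inferred from the R version, which removes <p> tags.
    contents = re.findall('<p>(.*?)</p>', destination.content.decode('utf-8'), re.S)
    for content in contents:
        f.write(content + '\n')

# Entry point
if __name__ == '__main__':
    url = 'http://www.doupoxs.com/doupocangqiong/'
    get_links(url)
    time.sleep(1)
    # Close the output file
    f.close()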
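
To see what the two `re.findall` calls do without hitting the site, here is a minimal offline sketch. The sample HTML, the chapter paths, and the helper names `extract_links` and `extract_paragraphs` are made up for illustration. The patterns assume chapter links sit in `<li><a href="...">` items and body text in `<p>` tags, which may not match the real page markup.

```python
import re

# Hypothetical sample markup; the real site's HTML may differ.
SAMPLE_INDEX = '''
<ul>
<li><a href="/doupocangqiong/1.html">Chapter 1</a></li>
<li><a href="/doupocangqiong/2.html">Chapter 2</a></li>
</ul>
'''

SAMPLE_CHAPTER = '''
<div>
<p>First paragraph,
spanning two lines.</p>
<p>Second paragraph.</p>
</div>
'''

def extract_links(html, base='http://www.doupoxs.com'):
    """Return absolute URLs for every <li><a href="..."> entry."""
    # (.*?) is a non-greedy capture group, so findall returns only the href value.
    paths = re.findall(r'<li><a href="(.*?)"', html)
    return [base + path for path in paths]

def extract_paragraphs(html):
    """Return the text inside each <p>...</p> pair."""
    # re.S lets '.' match newlines, so paragraphs that wrap across lines still match.
    return re.findall(r'<p>(.*?)</p>', html, re.S)

links = extract_links(SAMPLE_INDEX)
paragraphs = extract_paragraphs(SAMPLE_CHAPTER)
print(links)
print(paragraphs)
```

Without `re.S`, the first sample paragraph would be skipped because it contains a line break, which is likely why the chapter-text pattern passes that flag while the link pattern does not.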


R

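# Load packages
library(stringr)

# Define GetlinkFunc: collect the link to every chapter
GetlinkFunc <- function(url) {
  destination <- readLines(url, encoding = 'UTF-8')
  # NOTE: both patterns below were stripped from the source text; they are
  # assumed reconstructions and may need adjusting to the real page markup.
  data <- str_extract_all(destination, '<li><a href="(.*?)"') %>%
    unlist() %>%
    gsub('<li><a href="|"', '', .) %>%
    paste0('http://www.doupoxs.com', .)
}

# Define GetinfoFunc: extract the body text from each chapter link.
# str_extract_all also returns the <p> tags themselves, so gsub strips them.
GetinfoFunc <- function(url) {
  for (i in seq_along(url)) {
    destination <- readLines(url[i], encoding = 'UTF-8')
    # NOTE: the tags in these patterns are inferred; the source text lost them.
    data <- str_extract_all(destination, '<p>(.*?)</p>') %>%
      unlist() %>%
      gsub('<p>|</p>', '', .)
    print(sprintf('Link %d (%s) scraped successfully', i, url[i]))
    write.table(data, file = 'novel.txt', row.names = FALSE, col.names = FALSE,
                sep = '\n', append = TRUE)
  }
}

# Run the functions (exports the txt file)
url <- 'http://www.doupoxs.com/doupocangqiong/'
link <- GetlinkFunc(url)
novel <- GetinfoFunc(link)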
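
The R version extracts whole matches, tags included, and then strips the tags with `gsub`. For comparison with the Python section, the same two-step idea can be sketched in Python with `re.finditer` and `re.sub`. The sample string and the function name `extract_then_strip` are illustrative assumptions, not part of the original scripts.

```python
import re

def extract_then_strip(html):
    """Mimic str_extract_all + gsub: take whole matches, then delete the tags."""
    # group(0) is the full match, so each item still carries <p> and </p>.
    whole = [m.group(0) for m in re.finditer(r'<p>.*?</p>', html, re.S)]
    # Remove the opening and closing tags, like gsub('<p>|</p>', '', .) in R.
    return [re.sub(r'<p>|</p>', '', item) for item in whole]

sample = '<p>one</p><p>two</p>'  # made-up input
stripped = extract_then_strip(sample)

# A capture group gets the same text in a single step.
captured = re.findall(r'<p>(.*?)</p>', sample, re.S)
print(stripped == captured)
```

Both routes end with the same list of paragraph strings; the capture-group form simply skips the cleanup pass.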