Scraping Website Information, Downloading PDFs, Saving JSON Files, and Converting JSON to CSV

Requirements:

Scrape the title, time, content, and PDF links from the 20 page_urls listed at http://www.fsb.org/publications/, and download the PDFs.

 

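Modules:

  • Scrape every page_url from the list pages
  • Scrape the title, time, content, and PDF links from each news page
  • Download the PDFs and save them all in one folder
  • Save the time, title, and content to a JSON file
  • Convert the JSON file to a CSV file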

 

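The final result is as follows. Microsoft Office may show garbled characters when opening the CSV file, so WPS is the better choice.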

[Image 1: screenshot of the final result]

 

Steps:

 

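1 Scraping the list pages

Each list page holds 10 items, so only two pages are needed to reach 20.

  • for page in range(1,3) loops over pages 1 and 2
  • For each list page, send a request with requests, parse the response with etree.HTML, and locate the news page links with list_tree.xpath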
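The two bullets above can be sketched as follows. This is a minimal illustration, not the article's exact code: the helper names `list_page_urls` and `extract_page_urls` and the sample HTML fragment are made up for the example, and it assumes the list pages still use the `media-heading` markup that the XPath in the implementation below targets.

```python
from lxml import etree

# Base URL taken from the implementation later in this post
BASE_URL = "http://www.fsb.org/publications/?mt_page="


def list_page_urls(first=1, last=2):
    """Build the list-page URLs; 2 pages x 10 items gives the 20 items needed."""
    return [BASE_URL + str(page) for page in range(first, last + 1)]


def extract_page_urls(html):
    """Return the href of every news page link found on one list page."""
    tree = etree.HTML(html)
    return tree.xpath("//h3[@class='media-heading']/a/@href")


# Hypothetical list-page fragment, used here instead of a live request
sample = """
<html><body>
  <h3 class="media-heading"><a href="http://www.fsb.org/2019/03/a/">A</a></h3>
  <h3 class="media-heading"><a href="http://www.fsb.org/2019/03/b/">B</a></h3>
  <h3 class="other"><a href="http://www.fsb.org/about/">About</a></h3>
</body></html>
"""

print(list_page_urls())
print(extract_page_urls(sample))
```

In a real run, the `html` argument would be the `text` of a requests response for each URL returned by `list_page_urls()`.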

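2 Getting the news page information

  • for each in page_url sends a requests request to every page_url collected in step 1
  • Parse each response with etree.HTML, then use page_tree.xpath to locate the title, time, content, and PDF links of the news page
  • PDFs sit in different places on different news pages, so no single XPath can locate the a tags that point to them. The design here is to collect every a tag on the news page, check each link, and keep only those that end in pdf.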
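The PDF filter described in the last bullet could look like the sketch below. The function names `filter_pdf_links` and `pdf_name` are hypothetical, the sample list mixes one URL from this post with made-up ones, and the de-duplication step is an addition that the original code does not perform.

```python
from urllib.parse import urlparse


def filter_pdf_links(hrefs):
    """Keep hrefs whose URL path ends in .pdf, in order, without duplicates."""
    seen = set()
    pdf_links = []
    for href in hrefs:
        path = urlparse(href).path
        if path.lower().endswith(".pdf") and href not in seen:
            seen.add(href)
            pdf_links.append(href)
    return pdf_links


def pdf_name(url):
    """Use the last path segment of the URL as the file name."""
    return urlparse(url).path.split("/")[-1]


# Sample hrefs as they might come back from page_tree.xpath("//a/@href")
hrefs = [
    "http://www.fsb.org/wp-content/uploads/P150319.pdf",
    "http://www.fsb.org/publications/?mt_page=2",
    "http://www.fsb.org/wp-content/uploads/P150319.pdf",  # duplicate link
    "http://www.fsb.org/pdf-archive/",                    # contains "pdf" but is not a PDF
    "#top",
]

pdf_links = filter_pdf_links(hrefs)
print(pdf_links)
print([pdf_name(u) for u in pdf_links])
```

Checking the end of the URL path, rather than testing whether "pdf" appears anywhere in the link, avoids keeping pages that merely have "pdf" in their address.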

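3 Downloading the PDFs and saving them all in one folder

Saving PDFs, downloading MP3 files, and similar tasks are all the same kind of problem, so one template can be written once and reused.

  • Write a function save_pdf() to wrap the code; its parameters are name, the file name to save under, and url, the link of the PDF being downloaded
  • Create a folder under the current directory
  • Send a requests request to each PDF link and write the response body, resp.content, to a file opened in wb (binary write) mode; this is the core step of downloading a PDF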
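The file-writing half of this template can be sketched with the standard library alone. The function name `save_binary` and the demo bytes are assumptions for illustration; the network call is deliberately left to the caller so the example runs offline.

```python
import os
import tempfile


def save_binary(folder, name, data):
    """Write bytes to folder/name; return the path, or None if the file exists."""
    os.makedirs(folder, exist_ok=True)    # create the folder on first use
    file_path = os.path.join(folder, name)
    if os.path.exists(file_path):
        return None                       # skip files saved on an earlier run
    with open(file_path, "wb") as f:      # 'wb' writes the raw bytes unchanged
        f.write(data)
    return file_path


# Demo in a temporary directory with placeholder bytes instead of a real PDF
demo_folder = os.path.join(tempfile.mkdtemp(), "allPDF")
first = save_binary(demo_folder, "demo.pdf", b"%PDF-1.4")
second = save_binary(demo_folder, "demo.pdf", b"%PDF-1.4")
print(first, second)
```

With requests, a caller would pass `resp.content` as `data` after checking that `resp.status_code == 200`. The same function works for MP3s or any other binary file, which is why the step is described as a reusable template.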

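4 Saving the time, title, and content as JSON

The requirement is the content, title, and time of 20 news pages. For each page_url, the fields obtained by requesting and parsing the page are stored in a dictionary, and the dictionary is then written out as one line of JSON. The JSON file is opened in a+ mode: readable and writable, created if it does not exist, never overwritten, with new data appended at the end.

  • Store each field from step 2 as a key-value pair in the dictionary, mainly:
  •  res_dict['Title'] = str(page_tree.xpath("//h1[@class='post-heading']/text()")[0])
  •  res_dict['TiMe'] = str(page_tree.xpath("//span[@class='post-date']/text()")[0])
  • The paragraphs of the content sit under different p tags, so iterate over the text pieces returned by the XPath and concatenate them into one str
  • The number of PDFs varies from page to page, so res_dict['PDF'] is defined as a list. Iterate over the PDF list from step 2, append each link to res_dict['PDF'], and append the last segment of each link, used as the file name, to res_dict['PDF_Name'].
  • Finally, serialize the dictionary with json.dumps() and write it to the JSON file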
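A minimal sketch of this step, assuming the same keys that appear in the implementation below. The helper names `build_record` and `append_record` and the demo file path are hypothetical; the sample values are shortened from the output shown at the end of this post.

```python
import json
import os
import tempfile


def build_record(title, date, paragraphs, pdf_urls):
    """Collect the fields of one news page into a dictionary."""
    return {
        "Title": title,
        "TiMe": date,
        "content": "".join(paragraphs),   # paragraphs come from separate p tags
        "PDF": list(pdf_urls),
        "PDF_Name": [url.split("/")[-1] for url in pdf_urls],
    }


def append_record(path, record):
    """Append the record as one JSON object per line; earlier lines are kept."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
record = build_record(
    "FSB work programme for 2019",
    "12 February 2019",
    ["This work programme details ", "the FSB’s planned work."],
    ["http://www.fsb.org/wp-content/uploads/P120219.pdf"],
)
append_record(demo_path, record)
append_record(demo_path, record)

with open(demo_path, encoding="utf-8") as fp:
    lines = fp.read().splitlines()
print(len(lines))
```

Because the file is opened in append mode, running the scraper twice adds every record a second time; deleting the file before a fresh run avoids duplicates.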

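5 Converting the JSON file to CSV

This is essentially a template as well and can be reused directly.

  • Read the JSON file with readlines(), treating each line as one record, and append each line to a list as a dictionary
  •  Import the pandas module
  •  pd.DataFrame(inp).to_csv('res1.csv') saves the list of dictionaries directly as a CSV file; res1.csv is the file name.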
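As an alternative sketch, the same conversion can be done with the standard-library csv module instead of pandas. The function names and demo records below are made up for illustration. Writing with the utf-8-sig encoding adds a byte order mark, which typically lets Microsoft Excel detect UTF-8 and may avoid the garbled characters mentioned earlier; that effect is an assumption about the reader's setup, not something tested in this post.

```python
import csv
import json
import os
import tempfile


def load_json_lines(path):
    """Parse a file holding one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def write_csv(records, path):
    """Write the records to CSV; list values are joined with '; '."""
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for rec in records:
            row = {k: "; ".join(v) if isinstance(v, list) else v
                   for k, v in rec.items()}
            writer.writerow(row)


# Demo with two made-up records in a temporary directory
demo_dir = tempfile.mkdtemp()
json_path = os.path.join(demo_dir, "demo.json")
csv_path = os.path.join(demo_dir, "demo.csv")
with open(json_path, "w", encoding="utf-8") as f:
    f.write('{"Title": "A", "TiMe": "8 March 2019", "PDF": ["a.pdf", "b.pdf"]}\n')
    f.write("\n")
    f.write('{"Title": "B", "TiMe": "12 February 2019", "PDF": []}\n')

records = load_json_lines(json_path)
write_csv(records, csv_path)
with open(csv_path, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Parsing each line with `json.loads` rather than `eval` also handles JSON literals such as `null` and `true`, and it never executes the file's contents as code.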

 

Implementation:

import json
import os

import requests
from lxml import etree


# Download one PDF into the allPDF folder
def save_pdf(pdf_name, url):
    pdf_path = 'allPDF'
    if not os.path.exists(pdf_path):
        os.makedirs(pdf_path)
    try:
        resp = requests.get(url)
        if resp.status_code == 200:
            file_path = os.path.join(pdf_path, pdf_name)
            if not os.path.exists(file_path):
                # 'wb' writes the response body as raw bytes
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded PDF path is %s' % file_path)
            else:
                print('Already downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save PDF, item %s' % pdf_name)


if __name__ == '__main__':
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
    # 'Connection: close' asks the server not to keep the connection alive
    headers = {'User-Agent': ua, 'Connection': 'close'}
    base_url = "http://www.fsb.org/publications/?mt_page="
    for page in range(1, 3):
        try:
            # Request one list page and collect the news page links
            req = requests.get(base_url + str(page), headers=headers)
            req.encoding = "UTF-8"
            list_tree = etree.HTML(req.text)
            page_url = list_tree.xpath("//h3[@class='media-heading']/a/@href")

            for each in page_url:
                res_dict = {"PDF": [], "PDF_Name": []}
                page_req = requests.get(each, headers=headers)
                page_req.encoding = "UTF-8"
                page_tree = etree.HTML(page_req.text)
                res_dict['Title'] = str(page_tree.xpath("//h1[@class='post-heading']/text()")[0])
                res_dict['TiMe'] = str(page_tree.xpath("//span[@class='post-date']/text()")[0])
                # The content is spread over several p tags; join the pieces
                con = page_tree.xpath("//span[@class='post-content']/p/text()")
                res_dict['content'] = ''.join(con)

                # Keep every link that ends in .pdf, then download it
                for link in page_tree.xpath("//a/@href"):
                    if link.lower().endswith(".pdf") and "page" not in link:
                        name = link.split("/")[-1]
                        res_dict["PDF"].append(link)
                        res_dict["PDF_Name"].append(name)
                        save_pdf(name, link)

                # Append one JSON object per line; 'with' closes the file
                try:
                    with open("6.json", 'a+', encoding="utf-8") as fp:
                        fp.write(json.dumps(res_dict, ensure_ascii=False) + "\n")
                except IOError as err:
                    print('error' + str(err))
        except requests.exceptions.ConnectionError as e:
            print(e)

JSON to CSV

# -*- coding: utf-8 -*-
import json

import pandas as pd

inp = []
with open("6.json", "r", encoding="utf-8") as f:
    lines = f.readlines()  # list in which each element is one line of text

for line in lines:
    if line.strip():  # skip blank lines
        inp.append(json.loads(line))  # each line must parse to a dictionary
print(inp)

pd.DataFrame(inp).to_csv('res1.csv')

Appendix:

The saved JSON file contains:

{"PDF": ["http://www.fsb.org/wp-content/uploads/AMAFI.pdf", "http://www.fsb.org/wp-content/uploads/Austrian-Fed-Econ-Chamber.pdf", "http://www.fsb.org/wp-content/uploads/BPI.pdf", "http://www.fsb.org/wp-content/uploads/Business-at-OECD.pdf", "http://www.fsb.org/wp-content/uploads/EBF-1.pdf", "http://www.fsb.org/wp-content/uploads/FESE.pdf", "http://www.fsb.org/wp-content/uploads/FBF.pdf", "http://www.fsb.org/wp-content/uploads/IIF-2.pdf", "http://www.fsb.org/wp-content/uploads/Intesa.pdf", "http://www.fsb.org/wp-content/uploads/Invest-Europe.pdf", "http://www.fsb.org/wp-content/uploads/IT-Coop-Alliance.pdf", "http://www.fsb.org/wp-content/uploads/MEW-Consul.pdf", "http://www.fsb.org/wp-content/uploads/Moodys-Investors-Service.pdf.pdf", "http://www.fsb.org/wp-content/uploads/Nasdaq.pdf", "http://www.fsb.org/wp-content/uploads/OakNorth-Bank-plc.pdf", "http://www.fsb.org/wp-content/uploads/Per-Kurowski.pdf", "http://www.fsb.org/wp-content/uploads/Rete-Imprese-Italia.pdf", "http://www.fsb.org/wp-content/uploads/WOCCU-1.pdf", "http://www.fsb.org/wp-content/uploads/WFE-1.pdf", "http://www.fsb.org/wp-content/uploads/WSBI-ESBG.pdf"], "PDF_Name": ["AMAFI.pdf", "Austrian-Fed-Econ-Chamber.pdf", "BPI.pdf", "Business-at-OECD.pdf", "EBF-1.pdf", "FESE.pdf", "FBF.pdf", "IIF-2.pdf", "Intesa.pdf", "Invest-Europe.pdf", "IT-Coop-Alliance.pdf", "MEW-Consul.pdf", "Moodys-Investors-Service.pdf.pdf", "Nasdaq.pdf", "OakNorth-Bank-plc.pdf", "Per-Kurowski.pdf", "Rete-Imprese-Italia.pdf", "WOCCU-1.pdf", "WFE-1.pdf", "WSBI-ESBG.pdf"], "Title": "Feedback on the effects of financial regulatory reforms on SME financing", "TiMe": "26 March 2019", "content": "On 25 February 2019, the FSB invited . Interested parties were invited to provide written comments by 18 March 2019. The public comments received are available below.The feedback will be considered by the FSB as it prepares the draft report for its SME evaluation, which will be issued for public consultation ahead of the June 2019 G20 Summit. The final report, reflecting the feedback from the public consultation, will be published in October 2019."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P150319.pdf"], "PDF_Name": ["P150319.pdf"], "Title": "FSB letter to ISDA about derivative contract robustness to risks of interest rate benchmark discontinuation", "TiMe": "15 March 2019", "content": "This letter from the Co-chairs of the FSB’s Official Sector Steering Group (OSSG) encourages the International Swaps and Derivatives Association (ISDA) to continue its work on derivatives contractual robustness to risks of interest rate determination. The letter raises three important issues that the OSSG believes ISDA is moving to address:The letter encourages ISDA to ask for market opinion on the events that would trigger a move to the spread-adjusted fallback rate for derivatives referencing IBORs. Triggers that would only take effect on the date on which LIBOR permanently or indefinitely stopped publication could leave those with LIBOR-referencing contracts still exposed to a number of risks. The OSSG also understands that ISDA intends to consult on USD LIBOR, CDOR, HIBOR and SOR in early 2019, and the OSSG strongly supports this. The OSSG Co-chairs also encourage ISDA to consult on the key technical details that ISDA’s Board Benchmark Committee will need to decide on before implementation can begin.The FSB and member authorities through the OSSG are working to implement and monitor the recommendations of the 2014 FSB report .Since July 2016, ISDA has undertaken work, at the request of the OSSG, to strengthen the robustness of derivatives markets to the discontinuation of widely-used interest rate benchmarks. The OSSG engages regularly with ISDA and other stakeholders with a view to their taking action to enhance contractual robustness in derivatives products and cash products, such as loans, mortgages and floating rate notes."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P080319.pdf"], "PDF_Name": ["P080319.pdf"], "Title": "FSB compensation workshop: Key takeaways", "TiMe": "8 March 2019", "content": "This note provides key takeways from a workshop with banks in October 2018 about implementation of the FSB’s international standards on compensation. As part of its work to monitor implementation of its and their the FSB engages regularly with firms across financial sectors to assess the extent to which the standards have been effectively implemented. This workshop focused on: Executives responsible for managing processes related to compensation at 17 large internationally active banks and officials from the FSB Compensation Monitoring Contact Group participated in the workshop.The workshop provides one input into the FSB’s biennial compensation progress report that will be published later in 2019.The FSB welcomes any feedback on topics discussed at the workshop and summarised in this note. Comments should be sent by   to ."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P250219.pdf"], "PDF_Name": ["P250219.pdf"], "Title": "FSB roundtable on the effects of reforms on SME financing: Key takeaways", "TiMe": "25 February 2019", "content": "The FSB organised a roundtable in Amsterdam on 12 December 2018 to exchange views with stakeholders on recent trends and drivers in small- and medium-sized enterprise financing across FSB jurisdictions, including the possible effects that financial regulatory reforms may have had on this market. This note summarises the main points raised in the roundtable."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P140219.pdf"], "PDF_Name": ["P140219.pdf"], "Title": "FinTech and market structure in financial services: Market developments and potential financial stability implications", "TiMe": "14 February 2019", "content": "This report assesses FinTech market developments in the financial system and the potential implications for financial stability. The FSB defines FinTech as technology-enabled innovation in financial services that could result in new business models, applications, processes or products with an associated material effect on the provision of financial services. Technological innovation holds great promise for the provision of financial services, with the potential to increase market access, the range of product offerings, and convenience while also lowering costs to clients. At the same time, new entrants into the financial services space, including FinTech firms and large, established technology companies (‘BigTech’), could materially alter the universe of financial services providers.  Greater competition and diversity in lending, payments, insurance, trading, and other areas of financial services can create a more efficient and resilient financial system. However, heightened competition could also put pressure on financial institutions’ profitability and this could lead to additional risk taking among incumbents in order to maintain margins. Moreover, there could be new implications for financial stability from BigTech in finance and greater third-party dependencies, e.g. in cloud services.  Some key considerations from the FSB’s analysis of the link between technological innovation and market structure are the following: "}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P120219.pdf"], "PDF_Name": ["P120219.pdf"], "Title": "FSB work programme for 2019", "TiMe": "12 February 2019", "content": "This work programme details the FSB’s planned work and an indicative timetable of main publications for 2019. It reflects the FSB’s continued pivot from policy design to the implementation and evaluation of the effects of reforms and, in particular, vigilant monitoring to identify and address new and emerging risks to financial stability."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/AFR.pdf", "http://www.fsb.org/wp-content/uploads/Amundi-2.pdf", "http://www.fsb.org/wp-content/uploads/ASX.pdf", "http://www.fsb.org/wp-content/uploads/BlackRock-1.pdf", "http://www.fsb.org/wp-content/uploads/BME-CLEARING.pdf", "http://www.fsb.org/wp-content/uploads/CCIL-1.pdf", "http://www.fsb.org/wp-content/uploads/CNMV-Advisory-Committee-1.pdf", "http://www.fsb.org/wp-content/uploads/DTCC-4.pdf", "http://www.fsb.org/wp-content/uploads/Eurex-Clearing-2.pdf", "http://www.fsb.org/wp-content/uploads/EACH.pdf", "http://www.fsb.org/wp-content/uploads/CCP12-4.pdf", "http://www.fsb.org/wp-content/uploads/ICI-Global-2.pdf", "http://www.fsb.org/wp-content/uploads/ICE.pdf", "http://www.fsb.org/wp-content/uploads/FIA-IIF-ISDA.pdf", "http://www.fsb.org/wp-content/uploads/ISDA-FIA-IIF-addendum-Incentives-Analysis.pdf", "http://www.fsb.org/wp-content/uploads/LSE-Group.pdf", "http://www.fsb.org/wp-content/uploads/Manfred-E-Will-1.pdf", "http://www.fsb.org/wp-content/uploads/Robert-Rutkowski.pdf", "http://www.fsb.org/wp-content/uploads/SIFMA.pdf", "http://www.fsb.org/wp-content/uploads/World-Federation-of-Exchanges-3.pdf"], "PDF_Name": ["AFR.pdf", "Amundi-2.pdf", "ASX.pdf", "BlackRock-1.pdf", "BME-CLEARING.pdf", "CCIL-1.pdf", "CNMV-Advisory-Committee-1.pdf", "DTCC-4.pdf", "Eurex-Clearing-2.pdf", "EACH.pdf", "CCP12-4.pdf", "ICI-Global-2.pdf", "ICE.pdf", "FIA-IIF-ISDA.pdf", "ISDA-FIA-IIF-addendum-Incentives-Analysis.pdf", "LSE-Group.pdf", "Manfred-E-Will-1.pdf", "Robert-Rutkowski.pdf", "SIFMA.pdf", "World-Federation-of-Exchanges-3.pdf"], "Title": "Public responses to consultation on Financial resources to support CCP resolution and the treatment of CCP equity in resolution", "TiMe": "8 February 2019", "content": "On 15 November 2018, the FSB published a consultation document on proposed . Interested parties were invited to provide written comments by 1 February 2019. The public comments received are available below.The FSB thanks those who have taken the time and effort to express their views. The FSB expects to publish further guidance for public consultation in 2020."}

When opened and pretty-printed, each line is displayed as:

{
  "PDF": [
    "http://www.fsb.org/wp-content/uploads/AMAFI.pdf",
    "http://www.fsb.org/wp-content/uploads/Austrian-Fed-Econ-Chamber.pdf",
    "http://www.fsb.org/wp-content/uploads/BPI.pdf",
    "http://www.fsb.org/wp-content/uploads/Business-at-OECD.pdf",
    "http://www.fsb.org/wp-content/uploads/EBF-1.pdf",
    "http://www.fsb.org/wp-content/uploads/FESE.pdf",
    "http://www.fsb.org/wp-content/uploads/FBF.pdf",
    "http://www.fsb.org/wp-content/uploads/IIF-2.pdf",
    "http://www.fsb.org/wp-content/uploads/Intesa.pdf",
    "http://www.fsb.org/wp-content/uploads/Invest-Europe.pdf",
    "http://www.fsb.org/wp-content/uploads/IT-Coop-Alliance.pdf",
    "http://www.fsb.org/wp-content/uploads/MEW-Consul.pdf",
    "http://www.fsb.org/wp-content/uploads/Moodys-Investors-Service.pdf.pdf",
    "http://www.fsb.org/wp-content/uploads/Nasdaq.pdf",
    "http://www.fsb.org/wp-content/uploads/OakNorth-Bank-plc.pdf",
    "http://www.fsb.org/wp-content/uploads/Per-Kurowski.pdf",
    "http://www.fsb.org/wp-content/uploads/Rete-Imprese-Italia.pdf",
    "http://www.fsb.org/wp-content/uploads/WOCCU-1.pdf",
    "http://www.fsb.org/wp-content/uploads/WFE-1.pdf",
    "http://www.fsb.org/wp-content/uploads/WSBI-ESBG.pdf"
  ],
  "Title": "Feedback on the effects of financial regulatory reforms on SME financing",
  "TiMe": "26 March 2019",
  "content": "On 25 February 2019, the FSB invited . Interested parties were invited to provide written comments by 18 March 2019. The public comments received are available below.The feedback will be considered by the FSB as it prepares the draft report for its SME evaluation, which will be issued for public consultation ahead of the June 2019 G20 Summit. The final report, reflecting the feedback from the public consultation, will be published in October 2019."
}

 
