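Goal: scrape the title, time, content and PDF links from 20 page_url entries under http://www.fsb.org/publications/, and download the PDFs.
The final result is a CSV file. Microsoft Office shows garbled characters when opening it, so it is best opened with WPS.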
1 Scraping the list pages
Each list page holds 10 entries, so only two pages need to be fetched to get 20 entries.
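The page count above is simple arithmetic. A minimal sketch, assuming the `?mt_page=` query parameter used by the script later in this post; the helper name `list_page_urls` is made up for illustration:

```python
import math

# Base URL taken from the script in this post.
BASE_URL = "http://www.fsb.org/publications/?mt_page="


def list_page_urls(total_items, items_per_page=10):
    """Return the list-page URLs needed to cover total_items entries."""
    pages = math.ceil(total_items / items_per_page)
    # The script starts counting list pages at 1, so do the same here.
    return [BASE_URL + str(page) for page in range(1, pages + 1)]


urls = list_page_urls(20)
print(urls)
```

With 20 entries and 10 per page this yields two URLs, which matches the `range(1, 3)` loop in the script.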
2 Extracting the information from each news page
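The saved content in this post has gaps such as "the FSB invited . Interested parties", most likely because `p/text()` returns only the text directly inside each paragraph and skips text inside nested tags such as links. The sketch below illustrates the difference on a made-up, well-formed HTML snippet. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages; lxml elements offer an `itertext()` method as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that imitates the class names used in the script.
SAMPLE = """<div>
<h1 class="post-heading">Sample title</h1>
<span class="post-date">26 March 2019</span>
<span class="post-content"><p>The FSB invited <a href="#">feedback</a>. Comments are below.</p></span>
</div>"""

root = ET.fromstring(SAMPLE)
title = root.find(".//h1[@class='post-heading']").text
date = root.find(".//span[@class='post-date']").text
paragraphs = root.findall(".//span[@class='post-content']/p")

# Equivalent of p/text(): the paragraph's own text plus the text that
# follows each child element, but not the text inside the children.
direct_text = "".join(
    (p.text or "") + "".join(child.tail or "" for child in p)
    for p in paragraphs
)

# itertext() walks the whole subtree, so link text is kept.
full_text = "".join("".join(p.itertext()) for p in paragraphs)

print(direct_text)
print(full_text)
```

Whether the link text belongs in the stored content is a design choice; the point is only that the two approaches give different results.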
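3 Downloading the PDFs and saving them all in one folder
Saving PDFs, downloading MP3 files and similar tasks are all the same kind of problem, so one template can be written once and reused as-is.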
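One way such a reusable template might look is sketched below. It is not the script's own function: it uses the standard library's `urllib` in place of requests, and the names `filename_from_url`, `save_bytes` and `download` are made up for illustration. The save step is separated from the network step so it can be tried offline.

```python
import os
import tempfile
from urllib.parse import urlparse, unquote
from urllib.request import urlopen


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return unquote(os.path.basename(urlparse(url).path))


def save_bytes(folder, name, data):
    """Write data to folder/name unless the file already exists.

    Returns (path, written) so the caller can tell a fresh save from a skip.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name)
    if os.path.exists(path):
        return path, False
    with open(path, "wb") as f:
        f.write(data)
    return path, True


def download(url, folder):
    """Fetch url and store it in folder (this performs a network request)."""
    with urlopen(url, timeout=30) as resp:
        return save_bytes(folder, filename_from_url(url), resp.read())


# Offline demonstration of the save step with placeholder bytes.
demo_dir = tempfile.mkdtemp()
pdf_url = "http://www.fsb.org/wp-content/uploads/P150319.pdf"
first = save_bytes(demo_dir, filename_from_url(pdf_url), b"%PDF-1.4 demo")
second = save_bytes(demo_dir, "P150319.pdf", b"ignored")
print(first, second)
```

The same three functions work for any binary file type, since nothing in them is specific to PDFs.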
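4 Saving the time, title and content as JSON
The requirement is to scrape the content, title and time of 20 news pages. For each page_url, the fields parsed from the response are stored in a dictionary, which is then serialized as JSON and written to a file. The file is opened in a+ mode: readable and writable, created if it does not exist, never overwritten, with new data appended at the end.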
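Writing one JSON object per line in append mode can be sketched as follows. The helper names `append_record` and `read_records` and the temporary file are assumptions for illustration; the field names follow the script.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one dictionary to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read the file back, parsing each non-empty line separately."""
    with open(path, "r", encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
append_record(demo_path, {"Title": "FSB work programme for 2019", "TiMe": "12 February 2019"})
append_record(demo_path, {"Title": "FSB compensation workshop: Key takeaways", "TiMe": "8 March 2019"})
records = read_records(demo_path)
print(len(records))
```

Two consequences of this layout are worth keeping in mind: the file as a whole is not a single valid JSON document, so it has to be parsed line by line, and running the scraper twice appends the same records again.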
5 Converting the JSON file to a CSV file
This step is effectively a template as well and can be reused as-is.
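A likely cause of the garbled characters in Office mentioned at the top is that the CSV is plain UTF-8 without a byte order mark, which Excel often does not detect. A common workaround is to write the file with the `utf-8-sig` encoding. The sketch below uses the standard library's `csv` module instead of pandas; the helper name `jsonl_to_csv` and the "; " separator for list values are assumptions.

```python
import csv
import json
import os
import tempfile


def jsonl_to_csv(json_path, csv_path, fields):
    """Convert a one-JSON-object-per-line file to CSV; return the row count."""
    with open(json_path, "r", encoding="utf-8") as src:
        rows = [json.loads(line) for line in src if line.strip()]
    # utf-8-sig prepends a BOM, which helps Excel recognise the file as UTF-8.
    with open(csv_path, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            # Lists (for example the PDF links) are flattened into one cell.
            writer.writerow({
                key: "; ".join(value) if isinstance(value, list) else value
                for key, value in row.items()
            })
    return len(rows)


work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "demo.json")
dst_path = os.path.join(work_dir, "demo.csv")
with open(src_path, "w", encoding="utf-8") as f:
    sample = {"Title": "FSB’s work", "TiMe": "12 February 2019", "PDF": ["a.pdf", "b.pdf"]}
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
count = jsonl_to_csv(src_path, dst_path, ["Title", "TiMe", "PDF"])
print(count)
```

With pandas, passing `encoding='utf-8-sig'` to `to_csv` should have the same effect.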
import requests
from lxml import etree
import os
import json
# Save a PDF (the function keeps its save_mp3 name from the MP3 download template)
def save_mp3(mp3_name, url):
    mp3_path = 'allPDF'
    # Create the output folder on first use
    if not os.path.exists(mp3_path):
        os.makedirs(mp3_path)
    try:
        resp = requests.get(url)
        if resp.status_code == 200:
            file_path = os.path.join(mp3_path, mp3_name)
            # Skip files that were already downloaded
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded PDF path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save PDF, item %s' % mp3_name)

if __name__ == '__main__':
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
    Base_url1 = "http://www.fsb.org/publications/?mt_page="
    # One session for all requests: retry failed connections up to 5 times
    # and ask the server to close the connection after each response.
    s = requests.Session()
    s.headers.update({'User-Agent': ua, 'Connection': 'close'})
    adapter = requests.adapters.HTTPAdapter(max_retries=5)
    s.mount('http://', adapter)
    s.mount('https://', adapter)
    for page in range(1, 3):  # 2 list pages x 10 entries = 20 news pages
        try:
            req = s.get(Base_url1 + str(page))
            req.encoding = "UTF-8"
            list_tree = etree.HTML(req.text)
            # Links to the individual news pages
            page_url = list_tree.xpath("//h3[@class='media-heading']/a/@href")
            for each in page_url:
                res_dict = {"PDF": [], "PDF_Name": []}
                page_req = s.get(each)
                page_req.encoding = "UTF-8"
                page_tree = etree.HTML(page_req.text)
                res_dict['Title'] = str(page_tree.xpath("//h1[@class='post-heading']/text()")[0])
                res_dict['TiMe'] = str(page_tree.xpath("//span[@class='post-date']/text()")[0])
                con = page_tree.xpath("//span[@class='post-content']/p/text()")
                res_dict['content'] = ''.join(con)
                # Collect every link that points to a PDF
                for href in page_tree.xpath("//a/@href"):
                    if "pdf" in href and "page" not in href:
                        pdf_name = href.split("/")[-1]
                        res_dict["PDF"].append(href)
                        res_dict["PDF_Name"].append(pdf_name)
                        # Step 3: download the PDF into the allPDF folder
                        save_mp3(pdf_name, href)
                # Step 4: append the record to the file as one JSON line
                try:
                    with open("6.json", 'a+', encoding="utf-8") as fp:
                        fp.write(json.dumps(res_dict, ensure_ascii=False) + "\n")
                except IOError as err:
                    print('error' + str(err))
        except requests.exceptions.ConnectionError as e:
            print(e)

JSON to CSV
# -*- coding: utf-8 -*-
import json
import pandas as pd

inp = []
with open("6.json", "r", encoding="utf-8") as f:
    for line in f:  # each line is one JSON record
        if line.strip():
            inp.append(json.loads(line))  # parse the line into a dict
print(inp)
pd.DataFrame(inp).to_csv('res1.csv')
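Each line of 6.json is a JSON document, so `json.loads` is the natural parser for it. Python's `eval` only accepts such a line as long as it also happens to be a valid Python literal, and it executes whatever expression the line contains. The short sketch below shows a line where the two differ; the keys `draft` and `note` are invented for the demonstration and do not occur in the scraped data.

```python
import json

# A JSON line containing literals that exist in JSON but not in Python.
line = '{"Title": "demo", "PDF": [], "draft": false, "note": null}'

record = json.loads(line)  # false -> False, null -> None

try:
    eval(line)
    eval_ok = True
except NameError:
    # Python has no names called false or null.
    eval_ok = False

print(record, eval_ok)
```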
The saved JSON file contains:
{"PDF": ["http://www.fsb.org/wp-content/uploads/AMAFI.pdf", "http://www.fsb.org/wp-content/uploads/Austrian-Fed-Econ-Chamber.pdf", "http://www.fsb.org/wp-content/uploads/BPI.pdf", "http://www.fsb.org/wp-content/uploads/Business-at-OECD.pdf", "http://www.fsb.org/wp-content/uploads/EBF-1.pdf", "http://www.fsb.org/wp-content/uploads/FESE.pdf", "http://www.fsb.org/wp-content/uploads/FBF.pdf", "http://www.fsb.org/wp-content/uploads/IIF-2.pdf", "http://www.fsb.org/wp-content/uploads/Intesa.pdf", "http://www.fsb.org/wp-content/uploads/Invest-Europe.pdf", "http://www.fsb.org/wp-content/uploads/IT-Coop-Alliance.pdf", "http://www.fsb.org/wp-content/uploads/MEW-Consul.pdf", "http://www.fsb.org/wp-content/uploads/Moodys-Investors-Service.pdf.pdf", "http://www.fsb.org/wp-content/uploads/Nasdaq.pdf", "http://www.fsb.org/wp-content/uploads/OakNorth-Bank-plc.pdf", "http://www.fsb.org/wp-content/uploads/Per-Kurowski.pdf", "http://www.fsb.org/wp-content/uploads/Rete-Imprese-Italia.pdf", "http://www.fsb.org/wp-content/uploads/WOCCU-1.pdf", "http://www.fsb.org/wp-content/uploads/WFE-1.pdf", "http://www.fsb.org/wp-content/uploads/WSBI-ESBG.pdf"], "PDF_Name": ["AMAFI.pdf", "Austrian-Fed-Econ-Chamber.pdf", "BPI.pdf", "Business-at-OECD.pdf", "EBF-1.pdf", "FESE.pdf", "FBF.pdf", "IIF-2.pdf", "Intesa.pdf", "Invest-Europe.pdf", "IT-Coop-Alliance.pdf", "MEW-Consul.pdf", "Moodys-Investors-Service.pdf.pdf", "Nasdaq.pdf", "OakNorth-Bank-plc.pdf", "Per-Kurowski.pdf", "Rete-Imprese-Italia.pdf", "WOCCU-1.pdf", "WFE-1.pdf", "WSBI-ESBG.pdf"], "Title": "Feedback on the effects of financial regulatory reforms on SME financing", "TiMe": "26 March 2019", "content": "On 25 February 2019, the FSB invited . Interested parties were invited to provide written comments by 18 March 2019. The public comments received are available below.The feedback will be considered by the FSB as it prepares the draft report for its SME evaluation, which will be issued for public consultation ahead of the June 2019 G20 Summit. The final report, reflecting the feedback from the public consultation, will be published in October 2019."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P150319.pdf"], "PDF_Name": ["P150319.pdf"], "Title": "FSB letter to ISDA about derivative contract robustness to risks of interest rate benchmark discontinuation", "TiMe": "15 March 2019", "content": "This letter from the Co-chairs of the FSB’s Official Sector Steering Group (OSSG) encourages the International Swaps and Derivatives Association (ISDA) to continue its work on derivatives contractual robustness to risks of interest rate determination. The letter raises three important issues that the OSSG believes ISDA is moving to address:The letter encourages ISDA to ask for market opinion on the events that would trigger a move to the spread-adjusted fallback rate for derivatives referencing IBORs. Triggers that would only take effect on the date on which LIBOR permanently or indefinitely stopped publication could leave those with LIBOR-referencing contracts still exposed to a number of risks. The OSSG also understands that ISDA intends to consult on USD LIBOR, CDOR, HIBOR and SOR in early 2019, and the OSSG strongly supports this. The OSSG Co-chairs also encourage ISDA to consult on the key technical details that ISDA’s Board Benchmark Committee will need to decide on before implementation can begin.The FSB and member authorities through the OSSG are working to implement and monitor the recommendations of the 2014 FSB report .Since July 2016, ISDA has undertaken work, at the request of the OSSG, to strengthen the robustness of derivatives markets to the discontinuation of widely-used interest rate benchmarks. The OSSG engages regularly with ISDA and other stakeholders with a view to their taking action to enhance contractual robustness in derivatives products and cash products, such as loans, mortgages and floating rate notes."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P080319.pdf"], "PDF_Name": ["P080319.pdf"], "Title": "FSB compensation workshop: Key takeaways", "TiMe": "8 March 2019", "content": "This note provides key takeways from a workshop with banks in October 2018 about implementation of the FSB’s international standards on compensation. As part of its work to monitor implementation of its and their the FSB engages regularly with firms across financial sectors to assess the extent to which the standards have been effectively implemented. This workshop focused on: Executives responsible for managing processes related to compensation at 17 large internationally active banks and officials from the FSB Compensation Monitoring Contact Group participated in the workshop.The workshop provides one input into the FSB’s biennial compensation progress report that will be published later in 2019.The FSB welcomes any feedback on topics discussed at the workshop and summarised in this note. Comments should be sent by to ."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P250219.pdf"], "PDF_Name": ["P250219.pdf"], "Title": "FSB roundtable on the effects of reforms on SME financing: Key takeaways", "TiMe": "25 February 2019", "content": "The FSB organised a roundtable in Amsterdam on 12 December 2018 to exchange views with stakeholders on recent trends and drivers in small- and medium-sized enterprise financing across FSB jurisdictions, including the possible effects that financial regulatory reforms may have had on this market. This note summarises the main points raised in the roundtable."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P140219.pdf"], "PDF_Name": ["P140219.pdf"], "Title": "FinTech and market structure in financial services: Market developments and potential financial stability implications", "TiMe": "14 February 2019", "content": "This report assesses FinTech market developments in the financial system and the potential implications for financial stability. The FSB defines FinTech as technology-enabled innovation in financial services that could result in new business models, applications, processes or products with an associated material effect on the provision of financial services. Technological innovation holds great promise for the provision of financial services, with the potential to increase market access, the range of product offerings, and convenience while also lowering costs to clients. At the same time, new entrants into the financial services space, including FinTech firms and large, established technology companies (‘BigTech’), could materially alter the universe of financial services providers. Greater competition and diversity in lending, payments, insurance, trading, and other areas of financial services can create a more efficient and resilient financial system. However, heightened competition could also put pressure on financial institutions’ profitability and this could lead to additional risk taking among incumbents in order to maintain margins. Moreover, there could be new implications for financial stability from BigTech in finance and greater third-party dependencies, e.g. in cloud services. Some key considerations from the FSB’s analysis of the link between technological innovation and market structure are the following: "}
{"PDF": ["http://www.fsb.org/wp-content/uploads/P120219.pdf"], "PDF_Name": ["P120219.pdf"], "Title": "FSB work programme for 2019", "TiMe": "12 February 2019", "content": "This work programme details the FSB’s planned work and an indicative timetable of main publications for 2019. It reflects the FSB’s continued pivot from policy design to the implementation and evaluation of the effects of reforms and, in particular, vigilant monitoring to identify and address new and emerging risks to financial stability."}
{"PDF": ["http://www.fsb.org/wp-content/uploads/AFR.pdf", "http://www.fsb.org/wp-content/uploads/Amundi-2.pdf", "http://www.fsb.org/wp-content/uploads/ASX.pdf", "http://www.fsb.org/wp-content/uploads/BlackRock-1.pdf", "http://www.fsb.org/wp-content/uploads/BME-CLEARING.pdf", "http://www.fsb.org/wp-content/uploads/CCIL-1.pdf", "http://www.fsb.org/wp-content/uploads/CNMV-Advisory-Committee-1.pdf", "http://www.fsb.org/wp-content/uploads/DTCC-4.pdf", "http://www.fsb.org/wp-content/uploads/Eurex-Clearing-2.pdf", "http://www.fsb.org/wp-content/uploads/EACH.pdf", "http://www.fsb.org/wp-content/uploads/CCP12-4.pdf", "http://www.fsb.org/wp-content/uploads/ICI-Global-2.pdf", "http://www.fsb.org/wp-content/uploads/ICE.pdf", "http://www.fsb.org/wp-content/uploads/FIA-IIF-ISDA.pdf", "http://www.fsb.org/wp-content/uploads/ISDA-FIA-IIF-addendum-Incentives-Analysis.pdf", "http://www.fsb.org/wp-content/uploads/LSE-Group.pdf", "http://www.fsb.org/wp-content/uploads/Manfred-E-Will-1.pdf", "http://www.fsb.org/wp-content/uploads/Robert-Rutkowski.pdf", "http://www.fsb.org/wp-content/uploads/SIFMA.pdf", "http://www.fsb.org/wp-content/uploads/World-Federation-of-Exchanges-3.pdf"], "PDF_Name": ["AFR.pdf", "Amundi-2.pdf", "ASX.pdf", "BlackRock-1.pdf", "BME-CLEARING.pdf", "CCIL-1.pdf", "CNMV-Advisory-Committee-1.pdf", "DTCC-4.pdf", "Eurex-Clearing-2.pdf", "EACH.pdf", "CCP12-4.pdf", "ICI-Global-2.pdf", "ICE.pdf", "FIA-IIF-ISDA.pdf", "ISDA-FIA-IIF-addendum-Incentives-Analysis.pdf", "LSE-Group.pdf", "Manfred-E-Will-1.pdf", "Robert-Rutkowski.pdf", "SIFMA.pdf", "World-Federation-of-Exchanges-3.pdf"], "Title": "Public responses to consultation on Financial resources to support CCP resolution and the treatment of CCP equity in resolution", "TiMe": "8 February 2019", "content": "On 15 November 2018, the FSB published a consultation document on proposed . Interested parties were invited to provide written comments by 1 February 2019. The public comments received are available below.The FSB thanks those who have taken the time and effort to express their views. The FSB expects to publish further guidance for public consultation in 2020."}
When opened in a JSON viewer, each line is displayed as:
{
"PDF": [
"http://www.fsb.org/wp-content/uploads/AMAFI.pdf",
"http://www.fsb.org/wp-content/uploads/Austrian-Fed-Econ-Chamber.pdf",
"http://www.fsb.org/wp-content/uploads/BPI.pdf",
"http://www.fsb.org/wp-content/uploads/Business-at-OECD.pdf",
"http://www.fsb.org/wp-content/uploads/EBF-1.pdf",
"http://www.fsb.org/wp-content/uploads/FESE.pdf",
"http://www.fsb.org/wp-content/uploads/FBF.pdf",
"http://www.fsb.org/wp-content/uploads/IIF-2.pdf",
"http://www.fsb.org/wp-content/uploads/Intesa.pdf",
"http://www.fsb.org/wp-content/uploads/Invest-Europe.pdf",
"http://www.fsb.org/wp-content/uploads/IT-Coop-Alliance.pdf",
"http://www.fsb.org/wp-content/uploads/MEW-Consul.pdf",
"http://www.fsb.org/wp-content/uploads/Moodys-Investors-Service.pdf.pdf",
"http://www.fsb.org/wp-content/uploads/Nasdaq.pdf",
"http://www.fsb.org/wp-content/uploads/OakNorth-Bank-plc.pdf",
"http://www.fsb.org/wp-content/uploads/Per-Kurowski.pdf",
"http://www.fsb.org/wp-content/uploads/Rete-Imprese-Italia.pdf",
"http://www.fsb.org/wp-content/uploads/WOCCU-1.pdf",
"http://www.fsb.org/wp-content/uploads/WFE-1.pdf",
"http://www.fsb.org/wp-content/uploads/WSBI-ESBG.pdf"
],
"Title": "Feedback on the effects of financial regulatory reforms on SME financing",
"TiMe": "26 March 2019",
"content": "On 25 February 2019, the FSB invited . Interested parties were invited to provide written comments by 18 March 2019. The public comments received are available below.The feedback will be considered by the FSB as it prepares the draft report for its SME evaluation, which will be issued for public consultation ahead of the June 2019 G20 Summit. The final report, reflecting the feedback from the public consultation, will be published in October 2019."
}