1. The workflow of a Python web crawler
Step 1: fetch the page; Step 2: parse the page (extract the data); Step 3: store the data.
Technical implementation:
Basics to know first: iterating over a dictionary's key-value pairs with items():
namebook = {"Name": "Alex", "Age": 7, "Class": "First"}
for key, value in namebook.items():
    print(key, value)
#!/usr/bin/python
# coding: UTF-8
import requests

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
print(r.text)
The code above fetches the HTML of the blog's home page.
First import requests, then call requests.get(link, headers=headers) to fetch the page.
The headers dictionary makes the request look like it comes from a browser, since some sites block the default requests User-Agent.
r is the Response object that requests returns.
r.text is the text content of the fetched page.
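In practice the request can fail with a timeout or a 4xx/5xx status, so it is worth wrapping the call in error handling. A minimal sketch, assuming the same blog URL as above; fetch_html is a hypothetical helper written for this note, not part of the requests API:

```python
import requests

def fetch_html(link, timeout=10):
    """Fetch a page; return its HTML text, or None on any network error."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    try:
        r = requests.get(link, headers=headers, timeout=timeout)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        return r.text
    except requests.RequestException:
        return None

html = fetch_html("http://www.santostang.com/")
print("fetched" if html else "request failed")
```

raise_for_status() turns HTTP error codes into exceptions, so one except clause covers timeouts, connection errors, and bad status codes alike.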
#!/usr/bin/python
# coding: UTF-8
import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")  # parse the HTML with BeautifulSoup
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)
After fetching the HTML, we need to extract the title of the first post from the page.
The BeautifulSoup library is used to parse the downloaded page.
First import the library, then convert the HTML into a soup object.
soup.find("h1", class_="post-title").a.text.strip() locates the first h1 element with class post-title, takes the text of the a tag inside it, and strips surrounding whitespace.
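Note that find() returns only the first matching tag, while find_all() returns every match. A minimal sketch on a made-up HTML fragment (the class name post-title comes from the text above; the titles themselves are invented), using the built-in html.parser so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the blog's structure, not the real page:
html = '''
<html><body>
<h1 class="post-title"><a href="/post/1">  Hello World  </a></h1>
<h1 class="post-title"><a href="/post/2">Second Post</a></h1>
</body></html>
'''
soup = BeautifulSoup(html, "html.parser")

# find() -> first match only; find_all() -> a list of all matches
first = soup.find("h1", class_="post-title").a.text.strip()
titles = [h.a.text.strip() for h in soup.find_all("h1", class_="post-title")]

print(first)   # Hello World
print(titles)  # ['Hello World', 'Second Post']
```

strip() matters here because the text inside a tag often carries the surrounding indentation and newlines of the HTML source.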
#!/usr/bin/python
# coding: UTF-8
import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

with open('title.txt', "a+") as f:
    f.write(title)  # the with statement closes the file automatically
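The "a+" mode appends rather than overwrites, so running the script repeatedly keeps adding to the file. A minimal sketch of the storage step, with an invented title standing in for the scraped one, reading the file back to confirm the write:

```python
title = "Hello World"  # stand-in for the scraped title

# "a" appends; setting the encoding explicitly keeps Chinese titles intact
with open("title.txt", "a", encoding="utf-8") as f:
    f.write(title + "\n")  # a trailing newline keeps one title per line

with open("title.txt", encoding="utf-8") as f:
    print(f.read())
```

Writing an explicit newline after each title makes the output file readable when several runs append to it.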