Section 1: Data Challenges in the Big Data Era

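Unstructured data has no fixed format.
Web page content is one example.
It must be converted into structured data with ETL (Extract, Transformation, Loading) tools before it can be used.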

What is ETL?

E: Extract, data extraction
T: Transformation, data transformation
L: Loading, data storage
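
The three steps can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the `RAW_HTML` fragment, the function names `extract`, `transform`, and `load`, and the choice of CSV as the storage format are all made up here and do not come from the original notes.

```python
import csv
import io
import re

# Hypothetical raw input: a fragment of unstructured HTML text.
RAW_HTML = (
    '<div class="item"><h2>Title A</h2><span class="time">08-28 15:22</span></div>'
    '<div class="item"><h2>Title B</h2><span class="time">08-28 15:11</span></div>'
)

def extract(raw):
    """Extract: pull the (title, time) pairs out of the raw text."""
    pattern = r'<h2>(.*?)</h2><span class="time">(.*?)</span>'
    return re.findall(pattern, raw)

def transform(pairs):
    """Transform: turn each pair into a record with named fields."""
    return [{"title": title.strip(), "time": time.strip()} for title, time in pairs]

def load(records):
    """Load: write the records to CSV text (a stand-in for a real data store)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "time"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

records = transform(extract(RAW_HTML))
csv_text = load(records)
print(csv_text)
```

A regular expression is used only to keep the sketch dependency-free; real pages are better handled with an HTML parser.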

(Figure: ETL)

Section 2: Web Crawlers

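How can we collect useful information from the web and process this unstructured data?

By writing a web crawler that converts unstructured web data into structured information.

The example used here is http://news.sina.com.cn/china/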

Web crawler architecture

(Figure: web crawler architecture diagram)

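Section 3: Understanding the Secrets Behind Web Crawlers

Using the developer tools
In the Chrome browser, right-click the page and choose Inspect --> open the Network tab --> select Doc --> select china/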

Section 4: Writing Your First Web Crawler

Requests

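A package for retrieving network resources (URLs)
Improves on the shortcomings of urllib2 so that users can fetch network resources in the simplest way
Supports REST operations (POST, PUT, GET, DELETE) for accessing network resources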
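
As a rough illustration of the REST operations listed above, the sketch below prepares one request per verb without sending anything over the network. The endpoint `https://example.com/api/items/1`, the payload, and the helper name `build_request` are hypothetical placeholders, not part of the original notes; `requests.Request(...).prepare()` is used only so the requests can be inspected offline.

```python
import requests

# Hypothetical endpoint and payload, used only for illustration.
API_URL = 'https://example.com/api/items/1'
PAYLOAD = {'name': 'demo'}

def build_request(method, url, payload=None):
    """Prepare (but do not send) a request so it can be inspected."""
    return requests.Request(method, url, json=payload).prepare()

prepared = {
    'GET': build_request('GET', API_URL),             # read a resource
    'POST': build_request('POST', API_URL, PAYLOAD),  # create a resource
    'PUT': build_request('PUT', API_URL, PAYLOAD),    # replace a resource
    'DELETE': build_request('DELETE', API_URL),       # remove a resource
}

for method, req in prepared.items():
    print(method, req.url, req.body)
```

To actually send such requests, the shortcut functions `requests.get`, `requests.post`, `requests.put`, and `requests.delete` accept the same URL and `json` arguments.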

  import requests
  newsurl = 'http://news.sina.com.cn/china/'
  res = requests.get(newsurl)  # send a GET request for the page
  print(res.text)  # print the response body as text
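
If `res.text` prints garbled characters, the encoding that Requests guessed is probably wrong; the crawler later in this section handles this by setting `res.encoding = 'utf-8'`. The sketch below reproduces the effect offline with a hand-built `Response` object. Assigning the private `_content` attribute is a testing shortcut assumed here for illustration, not something a real crawler would do.

```python
import requests

# Build a Response by hand so the example runs without network access.
# Setting the private _content attribute is a testing shortcut, not normal usage.
res = requests.Response()
res.status_code = 200
res._content = '中国新闻'.encode('utf-8')  # bytes as a UTF-8 server would send them

res.encoding = 'ISO-8859-1'  # wrong guess: the text comes out garbled
garbled = res.text

res.encoding = 'utf-8'       # correct encoding: the text is readable again
fixed = res.text

print(repr(garbled))
print(fixed)
```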



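  from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 package
  # The HTML tags of this sample were lost from the source text; the tags
  # below are an assumed reconstruction around the surviving text.
  html_sample = '\
  <html>\
    <body>\
    <h1>hello world</h1>\
    <a href="#">this is link1</a>\
    <a href="#">this is link2</a>\
    </body>\
    </html>'

  soup = BeautifulSoup(html_sample, 'html.parser')  # 'html.parser' selects the parser
  print(type(soup))  # shows that soup is a BeautifulSoup object
  print(soup.text)
  # soup.text extracts the text inside the markup and drops the tags that are not needed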
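
Beyond `soup.text`, the crawler later in this section relies on `select`, which takes a CSS selector and returns a list of matching elements. A minimal sketch follows; the sample markup, including the `title` id, the `link` class, and the `/page1` and `/page2` paths, is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the id, class names, and hrefs are illustrative assumptions.
html_sample = """
<html>
  <body>
    <h1 id="title">hello world</h1>
    <a class="link" href="/page1">this is link1</a>
    <a class="link" href="/page2">this is link2</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_sample, 'html.parser')

heading = soup.select('#title')[0].text               # select by id
link_texts = [a.text for a in soup.select('.link')]   # select by class
link_urls = [a['href'] for a in soup.select('a')]     # select by tag, read an attribute

print(heading)
print(link_texts)
print(link_urls)
```

Because `select` always returns a list, indexing with `[0]` fails on an empty result, which is why the crawler checks the length before reading the first match.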

The first web crawler

Input code:

import requests  # requests fetches the page
from bs4 import BeautifulSoup  # BeautifulSoup parses the HTML

# Call requests.get to fetch the target page
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'  # set the encoding explicitly if the text comes out garbled
# print(res.text)
soup = BeautifulSoup(res.text, 'html.parser')  # parse the page so the wanted parts can be selected
# print(soup)
for news in soup.select('.news-item'):  # .news-item was found by inspecting the page
    if len(news.select('h2')) > 0:
        h2 = news.select('h2')[0].text
        time = news.select('.time')[0].text
        a = news.select('a')[0]['href']
        print(time, h2, a)

Output:
8月28日 15:22 中国外交部:印方将越界人员和设备全部撤回 http://news.sina.com.cn/c/nd/2017-08-28/doc-ifykkfat0626932.shtml
8月28日 15:21 中国外交部:印方将越界人员和设备撤回印方一侧 http://news.sina.com.cn/c/nd/2017-08-28/doc-ifykiqfe2229593.shtml
8月28日 15:11 河北督导唐山沧州候鸟保护:严格部署野保清网 http://news.sina.com.cn/c/nd/2017-08-28/doc-ifykiqfe2227305.shtml
8月28日 14:55 贵阳消防队出动10辆消防车和搜救犬赴纳雍灾区 http://news.sina.com.cn/c/nd/2017-08-28/doc-ifykiuaz1482376.shtml
8月28日 14:31 “爱上”中国发展经验:印度三年计划67次提中国 http://news.sina.com.cn/o/2017-08-28/doc-ifykiqfe2215890.shtml
8月28日 14:20 村里土地可建房租给白领这13个城市成首批试点 http://news.sina.com.cn/o/2017-08-28/doc-ifykiqfe2213030.shtml
8月28日 14:01 云南大学等非985高校已入列42所双一流大学名单 http://news.sina.com.cn/o/2017-08-28/doc-ifykiqfe2207861.shtml
8月28日 13:45 贵州毕节纳雍县发生山体垮塌消防紧急出动救援 http://news.sina.com.cn/o/2017-08-28/doc-ifykiurx2347838.shtml
... and so on.
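
Printing is enough for a first look, but the stated goal is structured information. One possible refinement, sketched here under assumptions, moves the parsing into a function that returns a list of dictionaries. The function name `parse_news` and the inline sample markup are hypothetical; the selectors `.news-item`, `h2`, `.time`, and `a` are the ones used in the crawler code above.

```python
from bs4 import BeautifulSoup

def parse_news(html):
    """Return one dict per complete news item found in the given HTML text."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for news in soup.select('.news-item'):
        headings = news.select('h2')
        times = news.select('.time')
        links = news.select('a')
        # Skip entries missing any of the three parts.
        if not (headings and times and links):
            continue
        records.append({
            'time': times[0].text.strip(),
            'title': headings[0].text.strip(),
            'url': links[0].get('href'),
        })
    return records

# Hypothetical markup imitating the structure the crawler expects.
sample = """
<div class="news-item">
  <h2><a href="http://example.com/a.shtml">Headline A</a></h2>
  <div class="time">8月28日 15:22</div>
</div>
<div class="news-item"></div>
"""

print(parse_news(sample))
```

With live data, `parse_news(res.text)` would take the response text fetched by `requests.get`, and the resulting list could then be written to a file or database.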

Python Web Crawler in Practice