Python Web Crawlers: Introduction

  • Check robots.txt
  • Check the sitemap
  • Identify the technologies a website uses
  • Find the website's owner

Check robots.txt

Most websites define a robots.txt file that tells crawlers which restrictions apply when crawling the site.
For example: https://www.baidu.com/robots.txt
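
As a minimal sketch of how a crawler might honor these restrictions, the standard library's urllib.robotparser can parse the rules and answer whether a given user agent may fetch a URL. The robots.txt content, the user-agent names, and the example.com URLs below are made up for illustration so that the example runs offline; a real crawler would call set_url() and read() on the site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# BadCrawler is banned from the whole site.
print(parser.can_fetch("BadCrawler", "http://example.com/index.html"))   # False
# Any other agent may fetch everything except paths under /trap.
print(parser.can_fetch("GoodCrawler", "http://example.com/index.html"))  # True
print(parser.can_fetch("GoodCrawler", "http://example.com/trap"))        # False
# Requested delay in seconds between requests for this agent.
print(parser.crawl_delay("GoodCrawler"))                                 # 5
```

Checking can_fetch() before every download, and sleeping for crawl_delay() seconds between requests, is one simple way to stay within a site's stated limits.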

Check the sitemap

The Sitemap file that a website provides helps crawlers locate the site's latest content without having to crawl every page.
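
A sitemap is usually an XML file whose <loc> elements list page URLs. The sketch below shows one way to pull those URLs out with the standard library's xml.etree.ElementTree. The sample XML, the example.com URLs, and the extract_links helper are hypothetical and exist only so the example runs without network access; it also assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content (an assumption for illustration).
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/view/1</loc></url>
  <url><loc>http://example.com/view/2</loc></url>
</urlset>
"""

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_links(sitemap_xml):
    """Return the URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


for link in extract_links(SAMPLE_SITEMAP):
    print(link)
```

In a real crawler the XML text would come from downloading the sitemap URL, which is often listed in the site's robots.txt.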

Identify the technologies a website uses

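Install the Python builtwith module, then pass a URL to builtwith.parse() to see which technologies the site was built with: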

pip install builtwith
C:\Users\chenj>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import builtwith
>>> builtwith.parse('http://www.baidu.com')
{'javascript-frameworks': ['jQuery']}
>>> builtwith.parse('http://example.webscraping.com')
{'web-servers': ['Nginx'], 'web-frameworks': ['Web2py', 'Twitter Bootstrap'], 'programming-languages': ['Python'], 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}
>>>

Find the website's owner

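Install the Python whois module (published on PyPI as python-whois), then query a domain's registration record with whois.whois():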

pip install python-whois
C:\Users\chenj>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import whois
>>> print(whois.whois('csdn.net'))
{
  "domain_name": "CSDN.NET",
  "registrar": "NETWORK SOLUTIONS, LLC.",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": [
    "2017-03-10 00:52:46",
    "2018-02-09 01:43:52"
  ],
  "creation_date": "1999-03-11 05:00:00",
  "expiration_date": "2020-03-11 04:00:00",
  "name_servers": [
    "NS3.DNSV3.COM",
    "NS4.DNSV3.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": [
    "[email protected]",
    "[email protected]"
  ],
  "dnssec": "unsigned",
  "name": "Beijing Chuangxin Lezhi Co.ltd",
  "org": "Beijing Chuangxin Lezhi Co.ltd",
  "address": "B3-2-1 ZHaowei Industry Park",
  "city": "Beijng",
  "state": "Beijing",
  "zipcode": "100016",
  "country": "CN"
}
>>>
