Most websites define a robots.txt file, which tells crawlers what restrictions apply when crawling the site.
For example: https://www.baidu.com/robots.txt
A site's Sitemap file (the site map) helps a crawler locate the site's most recent content without having to crawl every page.
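Python's standard library can read these rules directly via urllib.robotparser. Below is a minimal sketch; the robots.txt rules are a hypothetical example fed in as text (not Baidu's actual file), so it runs without a network connection:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
sample_robots = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(sample_robots.splitlines())

# can_fetch(useragent, url) reports whether the rules permit crawling the URL
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/x'))   # False
```

In a real crawler you would call `rp.set_url('https://www.baidu.com/robots.txt')` followed by `rp.read()` to fetch the live file instead of parsing a string.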
Install the Python builtwith module:
pip install builtwith
C:\Users\chenj>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import builtwith
>>> builtwith.parse('http://www.baidu.com')
{'javascript-frameworks': ['jQuery']}
>>> builtwith.parse('http://example.webscraping.com')
{'web-servers': ['Nginx'], 'web-frameworks': ['Web2py', 'Twitter Bootstrap'], 'programming-languages': ['Python'], 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}
>>>
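`builtwith.parse()` returns a plain dict mapping technology categories to lists of technology names, so the result is easy to post-process. A small sketch, with the dict hard-coded from the output above so it runs offline (in a real script it would come from `builtwith.parse(url)`):

```python
# Result copied from builtwith.parse('http://example.webscraping.com') above.
tech = {
    'web-servers': ['Nginx'],
    'web-frameworks': ['Web2py', 'Twitter Bootstrap'],
    'programming-languages': ['Python'],
    'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI'],
}

# Print one readable line per technology category.
for category, names in tech.items():
    print(f"{category}: {', '.join(names)}")
```

Knowing the stack matters for scraping: for example, a site built on Web2py/Python suggests server-rendered pages, while heavy JavaScript frameworks may require rendering the page before extracting data.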
Install the python-whois module:
pip install python-whois
C:\Users\chenj>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import whois
>>> print(whois.whois('csdn.net'))
{
"domain_name": "CSDN.NET",
"registrar": "NETWORK SOLUTIONS, LLC.",
"whois_server": "whois.networksolutions.com",
"referral_url": null,
"updated_date": [
"2017-03-10 00:52:46",
"2018-02-09 01:43:52"
],
"creation_date": "1999-03-11 05:00:00",
"expiration_date": "2020-03-11 04:00:00",
"name_servers": [
"NS3.DNSV3.COM",
"NS4.DNSV3.COM"
],
"status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
"emails": [
"[email protected]",
"[email protected]"
],
"dnssec": "unsigned",
"name": "Beijing Chuangxin Lezhi Co.ltd",
"org": "Beijing Chuangxin Lezhi Co.ltd",
"address": "B3-2-1 ZHaowei Industry Park",
"city": "Beijng",
"state": "Beijing",
"zipcode": "100016",
"country": "CN"
}
>>>