Practical Tips | Python from Beginner to Writing POCs: The Crawler Edition

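"Python from Beginner to Writing POCs" is a complete series of original tutorials by 「Exp1ore」, an author on the i春秋 forum. If you want to learn Python systematically, don't miss it!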


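When it comes to crawlers, a great many of them are written in Python.

Useful modules include re, BeautifulSoup, pyspider, and pyquery. You will also need requests, urllib, or urllib2, and there is hackhttp, developed by the company 四叶草.

PS: BeautifulSoup, requests, and pyspider are third-party libraries, so they have to be installed before use.

The BeautifulSoup module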

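Take this HTML as the sample document:

<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>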

Create an object with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
... <head>
... <title>The Dormouse's story</title>
... </head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... <p class="story">...</p>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html)
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))

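Ways to navigate the structured data:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
u'title'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
[u'title']
>>> soup.head
<head>\n<title>The Dormouse's story</title>\n</head>
>>> soup.p.attrs
{u'class': [u'title']}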
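The navigation steps above can be sketched in Python 3 as follows. This is a minimal sketch, not the author's code: it assumes the bs4 package is installed, uses a shortened stand-in for the sample HTML, and passes "html.parser" explicitly, which is the fix the UserWarning itself suggests.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for this sketch).
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie">Elsie</a>,
<a href="http://example.com/lacie">Lacie</a> and
<a href="http://example.com/tillie">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Naming the parser explicitly avoids the UserWarning and keeps the
# behaviour the same on other machines.
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title   # first <title> element in the document
first_p = soup.p         # first <p> element in the document

print(title_tag.name)    # the tag's name
print(title_tag.string)  # the text inside the tag
print(first_p["class"])  # class is multi-valued, so this is a list
print(first_p.attrs)     # all attributes as a dict
```

Attribute access such as soup.title or soup.p only ever returns the first matching tag; use find_all when every match is needed.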

For a crawler, say you want to collect every link. Inspecting the HTML shows that the links all live in <a> tags, so a simple loop takes care of it.

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
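For comparison, the same link collection can be done with nothing but the standard library. The sketch below swaps BeautifulSoup for Python 3's html.parser.HTMLParser; the class name LinkCollector and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Illustrative sample, modelled on the three links in the document above.
sample = (
    '<p><a href="http://example.com/elsie">Elsie</a>, '
    '<a href="http://example.com/lacie">Lacie</a> and '
    '<a href="http://example.com/tillie">Tillie</a></p>'
)

collector = LinkCollector()
collector.feed(sample)
for link in collector.links:
    print(link)
```

BeautifulSoup remains the more convenient choice for messy real-world pages; this version is mainly useful when installing third-party packages is not an option.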

And what if you want to scrape all of the text instead? Then use the following command:

>>> print soup.get_text()
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Next, let's write a simple crawler that calls 站长帮手 (i.links.cn) to build a subdomain lookup tool.

First, capture the request and analyze it. Burp is used here.

POST /subdomain/ HTTP/1.1
Host: i.links.cn
Content-Length: 34
Cache-Control: max-age=0
Origin: http://i.links.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://i.links.cn/subdomain/
Accept-Language: zh-CN,zh;q=0.8
Cookie: ASPSESSIONIDCCRSRCQS=NFFNBODCNABACIGOEODDFKLG; __guid=12224748.1912086146849820700.1503481265395.9385; UM_distinctid=15e0e7780082dd-0f197d4291ddaa-5d4e211f-1fa400-15e0e7780091e6; linkhelper=sameipb3=1&sameipb4=1&sameipb2=1; serverurl=; ASPSESSIONIDQARRSARR=DNCFMEADGBBFOICPGKMFCNPK; safedog-flow-item=; monitor_count=2; umid=umid=f449b116e07d1d4f3d2dc5352b7fede9&querytime=2017%2D8%2D24+14%3A09%3A09; CNZZDATA30012337=cnzz_eid%3D226371595-1503478989-%26ntime%3D1503554751
Connection: close

domain=ichunqiu.com&b2=1&b3=1&b4=1

This shows that it is a POST request, and the POST data it submits is:

domain=ichunqiu.com&b2=1&b3=1&b4=1
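To see what that body contains, it can be split into fields and rebuilt with the standard library. This is a small Python 3 sketch; the replacement domain example.com is only a placeholder.

```python
from urllib.parse import parse_qs, urlencode

raw_body = "domain=ichunqiu.com&b2=1&b3=1&b4=1"

# The body length matches the Content-Length header of the captured request.
print(len(raw_body))

# parse_qs returns a list per key; keep the single value of each field.
fields = {key: values[0] for key, values in parse_qs(raw_body).items()}
print(fields)

# Rebuild the body for another target (placeholder domain).
fields["domain"] = "example.com"
new_body = urlencode(fields)
print(new_body)
```

Only the domain field changes between queries; b2, b3, and b4 are sent unchanged, exactly as they appear in the capture.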

So use the requests module:

# coding=utf-8
import requests
url = 'http://i.links.cn/subdomain/'
payload = 'domain=ichunqiu.com&b2=1&b3=1&b4=1'
r = requests.post(url=url,data=payload)
print r.text

This raised an error:

Traceback (most recent call last):
  File "demo.py", line 8, in <module>
    print r.text
UnicodeEncodeError: 'gbk' codec can't encode character u'\xcf' in position 386: illegal multibyte sequence

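So the encoding needs fixing:

import requests
url = 'http://i.links.cn/subdomain/'
payload = ("domain=ichunqiu.com&b2=1&b3=1&b4=1")
r = requests.post(url=url,params=payload)
# Turn the mis-decoded text back into the raw bytes the server sent
con = r.text.encode('ISO-8859-1')
print con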
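Why does encoding to ISO-8859-1 help? One likely explanation, assuming the server sends GBK bytes without declaring a charset, is that requests falls back to ISO-8859-1 when decoding r.text. Since ISO-8859-1 maps every byte to exactly one character, encoding back with it restores the original bytes. The Python 3 sketch below demonstrates the round trip offline; the sample text is made up.

```python
# Bytes as a GBK-encoded page might deliver them (sample text is an assumption).
raw = "子域名查询".encode("gbk")

# Decoding with the wrong charset produces mojibake, but ISO-8859-1 maps
# every byte to exactly one code point, so no information is lost.
wrong_text = raw.decode("iso-8859-1")

# Encoding with the same charset gives back the original bytes ...
recovered = wrong_text.encode("iso-8859-1")

# ... which can then be decoded with the page's real charset.
print(recovered == raw)
print(recovered.decode("gbk"))
```

With requests, a more direct route under the same assumption would be to set r.encoding = 'gbk' before reading r.text.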

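After that the page prints correctly, and re or BeautifulSoup can take over.

re is used here because it is simple and direct. Viewing the page source shows that each result sits in markup like this: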

value="http://ichunqiu.com"/>

import re
a = re.compile('value="(.+?)">

Then join the list of matches into a single newline-separated string:

output = '\n'.join(result)
print output
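The extract-and-join step can be tried offline. In the Python 3 sketch below, the markup and the two subdomains are hypothetical stand-ins, since the real page source is only partly shown, and the pattern simply captures whatever follows value=" up to the next quote.

```python
import re

# Hypothetical markup; the real page's HTML is not shown in full.
page = (
    '<input type="checkbox" value="http://bbs.ichunqiu.com">'
    '<input type="checkbox" value="http://www.ichunqiu.com">'
)

# Non-greedy group: stop at the first closing quote after value=".
pattern = re.compile(r'value="(.+?)"')

result = pattern.findall(page)  # list of captured strings
output = "\n".join(result)      # one entry per line
print(output)
```

The non-greedy (.+?) matters: a greedy (.+) would run to the last quote on the line and swallow several results at once.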

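Let's keep improving the code. Editing the source for every query is a bit of a hassle, isn't it?

So here we use the sys library to read the target domain from the command line, adjust the code accordingly, and build the payload with the format function.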

payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))

Then define a get function that takes domain as its parameter.

# coding=utf-8
import requests
import re
import sys

def get(domain):
    url = 'http://i.links.cn/subdomain/'
    payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))
    r = requests.post(url=url,params=payload)
    con = r.text.encode('ISO-8859-1')
    a = re.compile('value="(.+?)">
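Putting the pieces together, a command-line version might look like the Python 3 sketch below. It is not the author's finished script: the helper names build_payload and extract_values, the file name subdomain.py, the regular expression, and the GBK charset are all assumptions, and the fields are sent with data= so that they travel in the POST body as in the captured request.

```python
import re
import sys

URL = "http://i.links.cn/subdomain/"

# Assumed pattern; adjust it to the markup the page actually returns.
VALUE_RE = re.compile(r'value="(.+?)"')


def build_payload(domain):
    """Form fields seen in the captured request, with the domain filled in."""
    return {"domain": domain, "b2": "1", "b3": "1", "b4": "1"}


def extract_values(html):
    """Return the content of every value="..." attribute in the page."""
    return VALUE_RE.findall(html)


def get(domain):
    # Imported here so the helpers above work without the third-party package.
    import requests

    # data= puts the fields in the POST body, form-encoded.
    r = requests.post(URL, data=build_payload(domain), timeout=10)
    r.encoding = "gbk"  # assumption: the page is GBK-encoded
    return extract_values(r.text)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print("\n".join(get(sys.argv[1])))
    else:
        print("usage: python subdomain.py <domain>")
```

It would be run as python subdomain.py ichunqiu.com. Keeping the payload builder and the parser separate from the network call makes both easy to check without touching the site.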

That's it. Let's try it out.

That's all for today. Did everything make sense? If you enjoyed this article, remember to give it a like at the end!
