Crawling every link on a web page with Python (down to the deepest level)

Related course link: Crawl Web


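Today's exercise builds on the previous one: after extracting the links from a page, it follows each of them and extracts the links on the new pages as well, digging deeper level by level.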

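I haven't learned how to write get_page properly yet. Calling urllib2.urlopen() directly runs into HTTP errors, which the course only covers in chapter 4. So here the page source is pasted in directly, mainly to test the deep-crawl function.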
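For comparison, a get_page that really fetches pages might look like the sketch below. It assumes Python 3, where urllib2 became urllib.request. The timeout parameter is an addition for illustration, and returning an empty string on failure follows the convention of the stub in this post. It is not the course's version.

```python
import urllib.request


def get_page(url, timeout=10):
    """Return the decoded body of url, or "" if it cannot be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses, so network and HTTP
        # failures land here; ValueError covers URLs with no scheme.
        return ""
```

Swallowing every error keeps the crawler loop simple, at the cost of hiding why a page failed. Logging the exception would be a reasonable next step.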


def get_page(url):   # Not finished yet: takes a URL and returns the page source
    try:
        if url == "http://xkcd.com/353":
            return """xkcd: Python

Python

Python

Permanent link to this comic: http://xkcd.com/353/

Image URL (for hotlinking/embedding): http://imgs.xkcd.com/comics/python.png

Selected Comics Grownups Circuit Diagram Angular Momentum Self-Description Alternative Energy Revolution

Search comic titles and transcripts:
RSS Feed - Atom Feed

Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for adults), and advanced mathematics (which may be unsuitable for liberal-arts majors).

We did not invent the algorithm. The algorithm consistently finds Jesus. The algorithm killed Jeeves.
The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.
This is not the algorithm. This is close.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.
This means you're free to copy and share these comics (but not to sell them). More details.
"""
        elif url == "http://xkcd.com/554":
            return """xkcd: Not Enough Work

Not Enough Work

Not Enough Work

Permanent link to this comic: http://xkcd.com/554/

Image URL (for hotlinking/embedding): http://imgs.xkcd.com/comics/not_enough_work.png

Selected Comics Grownups Circuit Diagram Angular Momentum Self-Description Alternative Energy Revolution

Search comic titles and transcripts:
RSS Feed - Atom Feed

Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for adults), and advanced mathematics (which may be unsuitable for liberal-arts majors).

We did not invent the algorithm. The algorithm consistently finds Jesus. The algorithm killed Jeeves.
The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.
This is not the algorithm. This is close.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.
This means you're free to copy and share these comics (but not to sell them). More details.
"""
    except Exception:
        return ""
    return ""

def get_next_target(page):   # Search the source for a link; return the link and the position where it ends
    start_link = page.find('
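The listing is cut off inside get_next_target, so the rest of the original code is not available here. As a rough sketch of how link extraction could continue, the version below assumes links are written as `<a href="...">`. That search string, and the get_all_links helper, are assumptions for illustration rather than the post's own code.

```python
def get_next_target(page):
    """Return (url, end_quote) for the first link in page, or (None, 0)."""
    # Assumption: links look like <a href="...">; the original search string
    # is truncated in the listing.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:
        # Malformed tag with no quoted URL: treat as "no more links".
        return None, 0
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every link in page, in order of appearance."""
    links = []
    while True:
        url, end_pos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[end_pos:]  # continue searching after the link just found
    return links
```

String searching like this is fragile against single-quoted or unquoted attributes. An HTML parser would be sturdier, but this keeps the idea visible.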

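In short, the main idea is this: mine the current page, collect all of its child links, and add them to the to-crawl queue.

Each step takes one element from the to-crawl queue and mines it, adding any child links it finds back to the queue, until the queue is empty.

Every URL is recorded as crawled once it has been mined, with a duplicate check along the way, and the list of crawled URLs is returned at the end.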
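The loop described above can be sketched as follows. To keep the example self-contained, link extraction is passed in as a function, and a small made-up link graph (LINKS) stands in for real pages. The names crawl, get_links, and LINKS are all hypothetical. A deque is used for first-in, first-out order, which is one possible choice and not necessarily the order the original code visits pages in.

```python
from collections import deque


def crawl(seed, get_links):
    """Return every URL reachable from seed, in the order it was crawled."""
    to_crawl = deque([seed])  # URLs waiting to be mined
    crawled = []              # URLs already mined, in order
    seen = {seed}             # every URL ever queued, for fast duplicate checks
    while to_crawl:
        url = to_crawl.popleft()
        crawled.append(url)
        for link in get_links(url):
            if link not in seen:  # skip anything already queued or crawled
                seen.add(link)
                to_crawl.append(link)
    return crawled


# Hypothetical link graph standing in for real pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

print(crawl("a", lambda url: LINKS.get(url, [])))  # ['a', 'b', 'c', 'd']
```

The duplicate check is what guarantees termination: pages "a" and "b" link to each other, yet each is mined only once.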
