Hadoop学习之网络爬虫+分词+倒排索引实现搜索引擎案例

本项目实现的是:自己写一个网络爬虫,对搜狐(或者csdn)爬取新闻(博客)标题,然后把这些新闻标题和它的链接地址上传到hdfs多个文件上,一个文件对应一个标题和链接地址,然后通过分词技术对每个文件中的标题进行分词,分词后建立倒排索引以此来实现搜索引擎的功能,建立倒排索引不熟悉的朋友可以看看我上篇博客 
Hadoop–倒排索引过程详解

首先 要自己写一个网络爬虫

由于我开始写爬虫的时候用了htmlparser,把所有搜到的链接存到队列,并且垂直搜索,这个工作量太大,爬了一个小时还没爬完造成了我电脑的死机,所以,现在我就去掉了垂直搜索,只爬搜狐的主页的新闻文章链接

不多说,看代码

首先看下载工具类,解释看注释

<code class="hljs java has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">package</span> com.yc.spider;

<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.BufferedInputStream;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.BufferedOutputStream;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.File;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.FileOutputStream;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.IOException;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.HttpURLConnection;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.MalformedURLException;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.URL;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.text.DateFormat;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.text.SimpleDateFormat;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Date;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.List;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Random;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Scanner;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Set;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Matcher;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Pattern;
<span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/**
 * 下载工具类
 *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @author</span> 汤高 
 *
 */</span>
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">DownLoadTool</span> {</span>
    <span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/**
     * 下载页面的内容
     * 就是根据地址下载整个html标签
     * 
     *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @param</span> addr
     *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @return</span>
     */</span>
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> String <span class="hljs-title" style="box-sizing: border-box;">downLoadUrl</span>(<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">final</span> String addr) {
        StringBuffer sb = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> StringBuffer();
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">try</span> {
            URL url;
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span>(addr.startsWith(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://"</span>)==<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">false</span>){
                String urladdr=addr+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://"</span>;
                 url = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> URL(urladdr);
            }<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">else</span>{
            System.out.println(addr);
            url = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> URL(addr);
            }
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setConnectTimeout(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5000</span>);
            con.connect();
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> (con.getResponseCode() == <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">200</span>) {
                BufferedInputStream bis = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> BufferedInputStream(con
                        .getInputStream());
                Scanner sc = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> Scanner(bis,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"gbk"</span>);
                <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">while</span> (sc.hasNextLine()) {
                    sb.append(sc.nextLine());
                }
            }
        } <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">catch</span> (IOException e) {
            e.printStackTrace();
        }
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> sb.toString();
    }

}

</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li><li style="box-sizing: border-box; padding: 0px 5px;">51</li><li style="box-sizing: border-box; padding: 0px 5px;">52</li><li style="box-sizing: border-box; padding: 0px 5px;">53</li><li style="box-sizing: border-box; padding: 0px 5px;">54</li><li style="box-sizing: border-box; padding: 0px 5px;">55</li><li style="box-sizing: border-box; padding: 0px 5px;">56</li><li style="box-sizing: border-box; padding: 0px 5px;">57</li><li style="box-sizing: border-box; padding: 0px 5px;">58</li><li style="box-sizing: border-box; padding: 0px 5px;">59</li><li style="box-sizing: border-box; padding: 0px 5px;">60</li><li style="box-sizing: border-box; padding: 0px 5px;">61</li><li style="box-sizing: border-box; padding: 0px 5px;">62</li><li style="box-sizing: border-box; padding: 0px 5px;">63</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li><li style="box-sizing: border-box; padding: 0px 5px;">51</li><li style="box-sizing: border-box; padding: 0px 5px;">52</li><li style="box-sizing: border-box; padding: 0px 5px;">53</li><li style="box-sizing: border-box; padding: 0px 5px;">54</li><li style="box-sizing: border-box; padding: 0px 5px;">55</li><li style="box-sizing: border-box; padding: 0px 5px;">56</li><li style="box-sizing: border-box; padding: 0px 5px;">57</li><li style="box-sizing: border-box; padding: 0px 5px;">58</li><li style="box-sizing: border-box; padding: 0px 5px;">59</li><li style="box-sizing: border-box; padding: 0px 5px;">60</li><li style="box-sizing: border-box; padding: 0px 5px;">61</li><li style="box-sizing: border-box; padding: 0px 5px;">62</li><li style="box-sizing: border-box; padding: 0px 5px;">63</li></ul>

然后看一个文章链接的匹配类

<code class="hljs java has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">package</span> com.yc.spider;

<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.BufferedInputStream;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.IOException;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.HttpURLConnection;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.URL;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.ArrayList;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.HashSet;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.List;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Scanner;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Set;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Matcher;
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Pattern;

<span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/**
 * 文章下载类
 * 
 *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @author</span> 汤高
 *        
 *         
 */</span>
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">ArticleDownLoad</span> {</span>
    <span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/**
     * 取出文章的a标记href
     */</span>
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">static</span> String ARTICLE_URL = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"<a\\s+[^<>]*\\s+href=\"?(http[^<>\"]*)\"[^<>]*>([^<]*)</a>"</span>;


    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">static</span> Set<String> getImageLink(String html) {
        Set<String> result = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> HashSet<String>();
        <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">// 创建一个Pattern模式类,编译这个正则表达式</span>
        Pattern p = Pattern.compile(ARTICLE_URL, Pattern.CASE_INSENSITIVE);
        <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">// 定义一个匹配器的类</span>
        Matcher matcher = p.matcher(html);
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">while</span> (matcher.find()) {
            System.out.println(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"======="</span>+matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>)+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\t"</span>+matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>));
            result.add(matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>)+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\t"</span>+matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>));
        }
        <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> result;

    }


}
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li></ul>

下面看爬虫类

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">package <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">com</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.yc</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spider</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>

import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.io</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.IOException</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.net</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.URI</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.net</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.URISyntaxException</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.List</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Set</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.regex</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Matcher</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.regex</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Pattern</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>

import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.conf</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Configuration</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.fs</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.FSDataOutputStream</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.fs</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.FileSystem</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.fs</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Path</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.htmlparser</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.ParserException</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">/**
 * 把爬到的内容上传到hdfs
 * @author 汤高
 * 
 *
 */</span>
public class Spider {
    private DownLoadTool dlt = new DownLoadTool()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
    private ArticleDownLoad pdl = new ArticleDownLoad()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
    public void crawling(String url) {
        try {
            Configuration conf = new Configuration()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
            URI uri = new URI(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"hdfs://192.168.52.140:9000"</span>)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
            FileSystem hdfs = FileSystem<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.get</span>(uri, conf)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>

            String html = dlt<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.downLoadUrl</span>(url)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
            <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">Set</span><String> allneed = pdl<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.getImageLink</span>(html)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
            for (String addr : allneed) {
                //生成文件路径
                Path p1 = new Path(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"/myspider/"</span> + System<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.currentTimeMillis</span>())<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
                FSDataOutputStream dos = hdfs<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.create</span>(p1)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
                String a = addr + <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\n"</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
                //把内容写入文件
                dos<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.write</span>(a<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.getBytes</span>())<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
            }
        } catch (IllegalArgumentException e) {
            e<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.printStackTrace</span>()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
        } catch (URISyntaxException e) {
            e<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.printStackTrace</span>()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
        } catch (IOException e) {
            e<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.printStackTrace</span>()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span>
        }
    }
}
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li></ul>

最后看测试类来爬内容

<code class="hljs java has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">package</span> com.yc.spider;

<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.FileNotFoundException;

<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> org.htmlparser.util.ParserException;

<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">Test</span> {</span>
    <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">static</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">void</span> <span class="hljs-title" style="box-sizing: border-box;">main</span>(String[] args) <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">throws</span> ParserException {
        Spider s=<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> Spider();
        <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//  s.crawling("http://blog.csdn.net");</span>
        s.crawling(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://www.sohu.com"</span>);

    }
}
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li></ul>

看看结果 
Hadoop学习之网络爬虫+分词+倒排索引实现搜索引擎案例_第1张图片

爬到的内容上传到hdfs上了

你可能感兴趣的:(Hadoop学习之网络爬虫+分词+倒排索引实现搜索引擎案例)