本项目实现的是:自己写一个网络爬虫,对搜狐(或者csdn)爬取新闻(博客)标题,然后把这些新闻标题和它的链接地址上传到hdfs多个文件上,一个文件对应一个标题和链接地址,然后通过分词技术对每个文件中的标题进行分词,分词后建立倒排索引以此来实现搜索引擎的功能,建立倒排索引不熟悉的朋友可以看看我上篇博客
Hadoop–倒排索引过程详解
首先 要自己写一个网络爬虫
由于我开始写爬虫的时候用了htmlparser,把所有搜到的链接存到队列,并且垂直搜索,这个工作量太大,爬了一个小时还没爬完造成了我电脑的死机,所以,现在我就去掉了垂直搜索,只爬搜狐的主页的新闻文章链接
不多说,看代码
首先看下载工具类,解释看注释
<code class="hljs java has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">package</span> com.yc.spider; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.BufferedInputStream; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.BufferedOutputStream; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.File; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.FileOutputStream; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.IOException; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.HttpURLConnection; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.MalformedURLException; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.URL; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.text.DateFormat; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.text.SimpleDateFormat; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Date; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.List; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Random; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Scanner; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Set; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Matcher; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Pattern; <span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/** * 下载工具类 *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @author</span> 汤高 * */</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">DownLoadTool</span> {</span> <span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/** * 下载页面的内容 * 就是根据地址下载整个html标签 * *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @param</span> addr *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @return</span> */</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> String <span class="hljs-title" style="box-sizing: border-box;">downLoadUrl</span>(<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">final</span> String addr) { StringBuffer sb = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> StringBuffer(); <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">try</span> { URL url; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span>(addr.startsWith(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://"</span>)==<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">false</span>){ String urladdr=addr+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://"</span>; url = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> URL(urladdr); }<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">else</span>{ System.out.println(addr); url = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> URL(addr); } HttpURLConnection con = (HttpURLConnection) url.openConnection(); con.setConnectTimeout(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5000</span>); con.connect(); <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> (con.getResponseCode() == <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">200</span>) { BufferedInputStream bis = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> BufferedInputStream(con .getInputStream()); Scanner sc = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> Scanner(bis,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"gbk"</span>); <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">while</span> (sc.hasNextLine()) { sb.append(sc.nextLine()); } } } <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">catch</span> (IOException e) { e.printStackTrace(); } <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> sb.toString(); } } </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li><li style="box-sizing: border-box; padding: 0px 5px;">51</li><li style="box-sizing: border-box; padding: 0px 5px;">52</li><li style="box-sizing: border-box; padding: 0px 5px;">53</li><li style="box-sizing: border-box; padding: 0px 5px;">54</li><li style="box-sizing: border-box; padding: 0px 5px;">55</li><li style="box-sizing: border-box; padding: 0px 5px;">56</li><li style="box-sizing: border-box; padding: 0px 5px;">57</li><li style="box-sizing: border-box; padding: 0px 5px;">58</li><li style="box-sizing: border-box; padding: 0px 5px;">59</li><li style="box-sizing: border-box; padding: 0px 5px;">60</li><li style="box-sizing: border-box; padding: 0px 5px;">61</li><li style="box-sizing: border-box; padding: 0px 5px;">62</li><li style="box-sizing: border-box; padding: 0px 5px;">63</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li><li style="box-sizing: border-box; padding: 0px 5px;">51</li><li style="box-sizing: border-box; padding: 0px 5px;">52</li><li style="box-sizing: border-box; padding: 0px 5px;">53</li><li style="box-sizing: border-box; padding: 0px 5px;">54</li><li style="box-sizing: border-box; padding: 0px 5px;">55</li><li style="box-sizing: border-box; padding: 0px 5px;">56</li><li style="box-sizing: border-box; padding: 0px 5px;">57</li><li style="box-sizing: border-box; padding: 0px 5px;">58</li><li style="box-sizing: border-box; padding: 0px 5px;">59</li><li style="box-sizing: border-box; padding: 0px 5px;">60</li><li style="box-sizing: border-box; padding: 0px 5px;">61</li><li style="box-sizing: border-box; padding: 0px 5px;">62</li><li style="box-sizing: border-box; padding: 0px 5px;">63</li></ul>
然后看一个文章链接的匹配类
<code class="hljs java has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">package</span> com.yc.spider; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.BufferedInputStream; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.IOException; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.HttpURLConnection; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.net.URL; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.ArrayList; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.HashSet; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.List; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Scanner; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.Set; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Matcher; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.util.regex.Pattern; <span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/** * 文章下载类 * *<span class="hljs-javadoctag" style="color: rgb(102, 0, 102); box-sizing: border-box;"> @author</span> 汤高 * * */</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">ArticleDownLoad</span> {</span> <span class="hljs-javadoc" style="color: rgb(136, 0, 0); box-sizing: border-box;">/** * 取出文章的a标记href */</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">static</span> String ARTICLE_URL = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"<a\\s+[^<>]*\\s+href=\"?(http[^<>\"]*)\"[^<>]*>([^<]*)</a>"</span>; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">static</span> Set<String> getImageLink(String html) { Set<String> result = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> HashSet<String>(); <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">// 创建一个Pattern模式类,编译这个正则表达式</span> Pattern p = Pattern.compile(ARTICLE_URL, Pattern.CASE_INSENSITIVE); <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">// 定义一个匹配器的类</span> Matcher matcher = p.matcher(html); <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">while</span> (matcher.find()) { System.out.println(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"======="</span>+matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>)+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\t"</span>+matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>)); result.add(matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>)+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\t"</span>+matcher.group(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>)); } <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">return</span> result; } } </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li></ul>
下面看爬虫类
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">package <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">com</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.yc</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spider</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.io</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.IOException</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.net</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.URI</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.net</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.URISyntaxException</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.List</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Set</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.regex</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Matcher</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import java<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.regex</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Pattern</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.conf</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Configuration</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.fs</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.FSDataOutputStream</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.fs</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.FileSystem</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.hadoop</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.fs</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.Path</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.htmlparser</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.util</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.ParserException</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">/** * 把爬到的内容上传到hdfs * @author 汤高 * * */</span> public class Spider { private DownLoadTool dlt = new DownLoadTool()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> private ArticleDownLoad pdl = new ArticleDownLoad()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> public void crawling(String url) { try { Configuration conf = new Configuration()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> URI uri = new URI(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"hdfs://192.168.52.140:9000"</span>)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> FileSystem hdfs = FileSystem<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.get</span>(uri, conf)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> String html = dlt<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.downLoadUrl</span>(url)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">Set</span><String> allneed = pdl<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.getImageLink</span>(html)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> for (String addr : allneed) { //生成文件路径 Path p1 = new Path(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"/myspider/"</span> + System<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.currentTimeMillis</span>())<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> FSDataOutputStream dos = hdfs<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.create</span>(p1)<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> String a = addr + <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\n"</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> //把内容写入文件 dos<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.write</span>(a<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.getBytes</span>())<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> } } catch (IllegalArgumentException e) { e<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.printStackTrace</span>()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> } catch (URISyntaxException e) { e<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.printStackTrace</span>()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> } catch (IOException e) { e<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.printStackTrace</span>()<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">;</span> } } } </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li><li style="box-sizing: border-box; padding: 0px 5px;">30</li><li style="box-sizing: border-box; padding: 0px 5px;">31</li><li style="box-sizing: border-box; padding: 0px 5px;">32</li><li style="box-sizing: border-box; padding: 0px 5px;">33</li><li style="box-sizing: border-box; padding: 0px 5px;">34</li><li style="box-sizing: border-box; padding: 0px 5px;">35</li><li style="box-sizing: border-box; padding: 0px 5px;">36</li><li style="box-sizing: border-box; padding: 0px 5px;">37</li><li style="box-sizing: border-box; padding: 0px 5px;">38</li><li style="box-sizing: border-box; padding: 0px 5px;">39</li><li style="box-sizing: border-box; padding: 0px 5px;">40</li><li style="box-sizing: border-box; padding: 0px 5px;">41</li><li style="box-sizing: border-box; padding: 0px 5px;">42</li><li style="box-sizing: border-box; padding: 0px 5px;">43</li><li style="box-sizing: border-box; padding: 0px 5px;">44</li><li style="box-sizing: border-box; padding: 0px 5px;">45</li><li style="box-sizing: border-box; padding: 0px 5px;">46</li><li style="box-sizing: border-box; padding: 0px 5px;">47</li><li style="box-sizing: border-box; padding: 0px 5px;">48</li><li style="box-sizing: border-box; padding: 0px 5px;">49</li><li style="box-sizing: border-box; padding: 0px 5px;">50</li></ul>
最后看测试类来爬内容
<code class="hljs java has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">package</span> com.yc.spider; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> java.io.FileNotFoundException; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> org.htmlparser.util.ParserException; <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-class" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">class</span> <span class="hljs-title" style="box-sizing: border-box; color: rgb(102, 0, 102);">Test</span> {</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">public</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">static</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">void</span> <span class="hljs-title" style="box-sizing: border-box;">main</span>(String[] args) <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">throws</span> ParserException { Spider s=<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">new</span> Spider(); <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">// s.crawling("http://blog.csdn.net");</span> s.crawling(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://www.sohu.com"</span>); } } </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li></ul>
爬到的内容上传到hdfs上了