Scraping Xinhuanet share pages

1. The first time I did it:

 // Needs java.util.*, java.util.regex.*, java.text.*, and com.alibaba.fastjson.{JSON, JSONObject}.
 if (url.matches(regxhp)) {

            // Join the text of every <div>, then collect all timestamps shaped like yyyy-MM-dd HH:mm:ss.
            List time1 = getElementAgainstXpath(s, "//div");
            time = listToString(time1);
            String regtime = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}";
            List<String> strs = new ArrayList<>();
            Pattern r = Pattern.compile(regtime);
            Matcher m = r.matcher(time);
            while (m.find()) {
                strs.add(m.group());
            }
            time = listToString(strs);
            SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            Date d = null;
            try {
                // Use the first timestamp found on the page as the publish time.
                if (!strs.isEmpty()) {
                    d = formatter.parse(strs.get(0));
                    article.setPublishTime(d);
                }
            } catch (ParseException e) {
                d = null;
                e.printStackTrace();
            }
            // The response is a JS assignment; drop the variable prefix so that only the JSON remains.
            cret = s.replace("var XinhuammNews =", "");
            String json = cret;
            JSONObject jsonObject = JSON.parseObject(json);

            String video = null;
            String videourl = "
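
The timestamp step in the snippet above can be tried on its own. Below is a minimal sketch, not the code from my crawler: the class name PublishTimeExtractor and the method firstTimestamp are made up for illustration, and it uses java.time instead of SimpleDateFormat. It assumes the first well-formed timestamp in the text is the publish time.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
final class PublishTimeExtractor {
    // Same pattern as in the snippet above: yyyy-MM-dd HH:mm:ss.
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns the first match that parses as a real date-time (assumption:
    // the first timestamp on the page is the publish time).
    static Optional<LocalDateTime> firstTimestamp(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = TIMESTAMP.matcher(text);
        while (m.find()) {
            try {
                return Optional.of(LocalDateTime.parse(m.group(), FORMAT));
            } catch (DateTimeParseException e) {
                // Matched the digit pattern but is not a valid date (e.g. month 13); try the next match.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String pageText = "Published 2019-03-05 14:30:00 by Xinhua";
        System.out.println(firstTimestamp(pageText).isPresent());
    }
}
```

Returning an Optional avoids the d = null bookkeeping: the caller only sets the publish time when a timestamp was actually found.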

 

2. The second time I ran into it:

 

   if ("首页-新华发布".equals(article.getChannelName())) {
            // Build the share-page API URL from the value stored in articleUrl.
            String url = "http://xhpfmapi.zhongguowangshi.com/v600/news/%s.js";
            url = String.format(url, article.getArticleUrl());
            article.setArticleUrl(url);
            String html = "";
            try {
                HttpClientResponse response = request(url, "GET", null, null, context);
                // Drop the JS variable prefix so that the template receives plain JSON.
                html = response.getHtml().replace("var XinhuammNews =", "");
            } catch (Exception e) {
                // On failure, log the URL and the exception, then return the empty string.
                LOG.error("detail request failed. {}", url, e);
            }
            return html;
        }

 

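The main difference: the first time, I parsed the response in code with FastJson.

The second time, I only did some light cleanup on what the Xinhua share page returns, turning it into standard JSON, and then handed it to the template for processing.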
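
The cleanup step can be sketched without any JSON library. This is only an illustration: the class XinhuaJsNormalizer and the method toJson are hypothetical names, and stripping a trailing semicolon is my assumption about the response shape (the snippets above only remove the "var XinhuammNews =" prefix).

```java
// Hypothetical helper, named for illustration only.
final class XinhuaJsNormalizer {
    // Prefix seen in the share-page response, taken from the snippets above.
    private static final String PREFIX = "var XinhuammNews =";

    // Turns "var XinhuammNews = {...};" into "{...}".
    static String toJson(String js) {
        if (js == null) {
            return "";
        }
        String body = js.trim();
        // Strip only a leading prefix, so the same text inside the article body is left alone.
        if (body.startsWith(PREFIX)) {
            body = body.substring(PREFIX.length()).trim();
        }
        // Assumption: the JS statement may end with a semicolon, which is not valid JSON.
        if (body.endsWith(";")) {
            body = body.substring(0, body.length() - 1).trim();
        }
        return body;
    }

    public static void main(String[] args) {
        String sample = "var XinhuammNews = {\"title\":\"demo\"};";
        System.out.println(toJson(sample)); // prints {"title":"demo"}
    }
}
```

Unlike a plain replace, this touches only the start and the end of the response, so the result can go to the template or to a JSON parser as it is.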
