Scraping Website Data with Jsoup in Spring Boot

 

Scraping the data

 

Import the dependencies (pom.xml)

    
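    <properties>
        <java.version>1.8</java.version>
        <elasticsearch.version>7.6.1</elasticsearch.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.62</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-redis</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>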
    

Create the entity class

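import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// One scraped product: its title, image URL, and price text.
@Data
@NoArgsConstructor
@AllArgsConstructor
public class Content {
    private String title;
    private String img;
    private String price;
}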
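
For readers who do not use Lombok, the sketch below shows approximately what the three annotations on Content provide: a no-argument constructor, an all-arguments constructor, getters and setters, and value-based equals, hashCode, and toString. The class name PlainContent is hypothetical and chosen only to avoid clashing with Content; the code Lombok actually generates differs in detail.

```java
import java.util.Objects;

// Hypothetical plain-Java counterpart of the Lombok-annotated Content class.
class PlainContent {
    private String title;
    private String img;
    private String price;

    // Roughly what @NoArgsConstructor provides
    public PlainContent() {}

    // Roughly what @AllArgsConstructor provides
    public PlainContent(String title, String img, String price) {
        this.title = title;
        this.img = img;
        this.price = price;
    }

    // @Data: one getter and one setter per field
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }
    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    // @Data: equality based on field values
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PlainContent)) return false;
        PlainContent other = (PlainContent) o;
        return Objects.equals(title, other.title)
                && Objects.equals(img, other.img)
                && Objects.equals(price, other.price);
    }

    @Override
    public int hashCode() {
        return Objects.hash(title, img, price);
    }

    // @Data: readable toString listing every field
    @Override
    public String toString() {
        return "PlainContent(title=" + title + ", img=" + img + ", price=" + price + ")";
    }

    public static void main(String[] args) {
        PlainContent c = new PlainContent();
        c.setTitle("bag");
        c.setImg("//img.example/1.jpg");
        c.setPrice("99.00");
        System.out.println(c);
    }
}
```

The readable toString is what makes forEach(System.out::println) in the scraper print each item's fields rather than an object hash.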

Write the scraper utility class

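import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class HtmlParseUtil {
    public static void main(String[] args) throws Exception {
        new HtmlParseUtil().parseDDJJ("包").forEach(System.out::println);
    }

    public List<Content> parseDDJJ(String keywords) throws Exception {
        // Search URL to scrape; the keyword is URL-encoded so that
        // non-ASCII input such as "包" yields a valid query string
        String url = "https://search.xxxx.com/Search?keyword="
                + URLEncoder.encode(keywords, "UTF-8");
        // Fetch and parse the page; an exception is thrown if it does not load within 30 s
        Document document = Jsoup.parse(new URL(url), 30000);
        // Get the element that wraps the goods list
        Element element = document.getElementById("DJ_goodsList");

        List<Content> goodsList = new ArrayList<>();
        if (element == null) {
            // Container not found (for example, the page layout changed): nothing to parse
            return goodsList;
        }
        // Get all li tags inside the container
        Elements elements = element.getElementsByTag("li");

        // Extract image, price, and title from each li
        for (Element el : elements) {
            String img = el.getElementsByTag("img").eq(0).attr("src");
            String price = el.getElementsByClass("p-price").eq(0).text();
            String title = el.getElementsByClass("p-name").eq(0).text();

            Content content = new Content();
            content.setTitle(title);
            content.setPrice(price);
            content.setImg(img);
            goodsList.add(content);
        }
        return goodsList;
    }
}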
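
The scraper builds its search URL by appending the keyword to the query string. A non-ASCII keyword such as "包" is safer when percent-encoded first. The following is a minimal sketch using only the JDK; SearchUrlDemo and buildSearchUrl are hypothetical names, and the host is the placeholder used in the code above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating how the search keyword can be encoded.
class SearchUrlDemo {
    // Placeholder search endpoint taken from the scraper above
    private static final String BASE = "https://search.xxxx.com/Search?keyword=";

    static String buildSearchUrl(String keyword) {
        try {
            // URLEncoder applies form encoding: non-ASCII bytes become %XX, spaces become '+'
            return BASE + URLEncoder.encode(keyword, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM, so this should never happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("包"));        // ...keyword=%E5%8C%85
        System.out.println(buildSearchUrl("java book")); // ...keyword=java+book
    }
}
```

The two-argument encode(String, String) overload is used because the project targets Java 1.8; the overload that takes a Charset directly only exists from Java 10 onward.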

Running main prints every scraped item, showing that the title, image, and price are all retrieved.
