Full-Text Search: A Comprehensive Hands-On Project

Development environment: Elasticsearch 6.8.6 + MySQL 5.7 + IntelliJ IDEA + Logstash

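I. Java Web Crawler

1. Page analysis

Target sites:

Representative of pages that embed JSON data

Huawei phone store: https://consumer.huawei.com/cn/phones/?ic_medium=hwdc&ic_source=corp_header_consumer

Representative of plain HTML pages

Meizu phone store: https://lists.meizu.com/page/list?categoryid=76

HTML special-character entity reference table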

2. Apache HttpComponents

Official site

Quick start

Maven repository

An open-source Apache project used mainly to issue HTTP requests programmatically.

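3. jsoup

Official site

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

HttpClient quick-start guide: http://hc.apache.org/httpcomponents-client-4.5.x/quickstart.html

  • Phone entity class

  • Crawling Huawei phones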

        // Uses Apache HttpClient 4.5 and jsoup; JSON is assumed to be fastjson's com.alibaba.fastjson.JSON
        CloseableHttpClient httpclient = HttpClients.createDefault();
        HttpGet httpget = new HttpGet("https://consumer.huawei.com/cn/phones/?ic_medium=hwdc&ic_source=corp_header_consumer"); // create the GET request
        CloseableHttpResponse response = httpclient.execute(httpget); // execute the GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        String content = EntityUtils.toString(entity, "utf-8");
        response.close(); // close the stream and release system resources

        Document document = Jsoup.parse(content);

        // The product data is embedded in the page as JSON text inside hidden elements
        Elements elements = document.select("#content-v3-plp #pagehidedata .plphidedata");
        for (Element element : elements) {
            String jsonStr = element.text();
            List<HuaWeiPhoneBean> huaWeiPhoneBeanlist = JSON.parseArray(jsonStr, HuaWeiPhoneBean.class);
            for (HuaWeiPhoneBean bean : huaWeiPhoneBeanlist) {
                String productName = bean.getProductName();
                List<ColorModeBean> colorModeBeanList = bean.getColorsItemMode();

                // Join all color names into one ";"-terminated string
                String colors = "";

                for (ColorModeBean colorModeBean : colorModeBeanList) {
                    String colorName = colorModeBean.getColorName();
                    colors += colorName + ";";
                }

                // Join all selling points the same way
                List<String> sellingPointList = bean.getSellingPoints();
                String sellingPoints = "";
                for (String sellingPoint : sellingPointList) {
                    sellingPoints += sellingPoint + ";";
                }
                System.out.println("Product name: " + productName);
                System.out.println("Colors: " + colors);
                System.out.println("Selling points: " + sellingPoints);

                // Map the crawled data onto the Phone entity and save it to MySQL
                Phone phone = new Phone();
                phone.setName(productName);
                phone.setColors(colors);
                phone.setSellingPoints(sellingPoints);
                phone.setCreateTime(new Date());
                phone.setMarketTime(new Date());
                phoneMysqlRepository.save(phone);
            }
        }
        return content;
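The crawler builds its color and selling-point strings by appending each item followed by ";". As a minimal sketch of the same joining step using the Stream API (the class and method names here are hypothetical and not part of the original crawler):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, shown only to illustrate the joining step.
class DelimitedJoin {

    // Appends ";" after every item, producing the same "a;b;" format as the loops in the crawler.
    static String joinWithTrailingSemicolon(List<String> items) {
        return items.stream()
                .map(item -> item + ";")
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(joinWithTrailingSemicolon(List.of("Black", "White")));
    }
}
```

If the trailing ";" is not wanted, String.join(";", items) is a shorter alternative, but it changes the stored format.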
  • Crawling Meizu phones
        CloseableHttpClient httpclient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpget = new HttpGet("https://lists.meizu.com/page/list?categoryid=76"); // create the GET request

        CloseableHttpResponse response = httpclient.execute(httpget); // execute the GET request
        HttpEntity entity = response.getEntity(); // get the response entity

        String content = EntityUtils.toString(entity, "utf-8");
        response.close(); // close the stream and release system resources

        Document document = Jsoup.parse(content);
        // Three parallel selections: product names, selling points, and color sliders
        Elements names = document.select("#goodsListWrap .gl-item .gl-item-link .item-title");

        Elements sellingPoints = document.select("#goodsListWrap .gl-item .gl-item-link .item-desc");

        Elements colorsElements = document.select(".container .goods-list #goodsListWrap .gl-item .gl-item-link .item-slide");
        int i = 0;

        for (Element nameElement : names) {
            Phone phone = new Phone();
            phone.setName(nameElement.text());
            // Each color dot carries the color name in its title attribute
            Elements elements = colorsElements.get(i).select(".item-slide-dot");
            String endcolors = "";
            for (Element color : elements) {
                endcolors += color.attr("title") + ";";
            }
            phone.setSellingPoints(sellingPoints.get(i).text());
            phone.setColors(endcolors);
            phone.setCreateTime(new Date());
            phone.setMarketTime(new Date());
            phoneMysqlRepository.save(phone);
            i++; // advance to the next product; without this every phone gets the first item's data
        }
        return null;
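The Meizu loop reads the name, selling-point, and color selections as parallel lists through a shared index, so the index has to advance on every iteration and must stay within the shortest list. A minimal sketch of that pairing with plain string lists (the class and method names are hypothetical and the data is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating index-based pairing of two parallel lists.
class ParallelLists {

    // Pairs names.get(i) with points.get(i); stops at the shorter list so no index goes out of range.
    static List<String[]> pairByIndex(List<String> names, List<String> points) {
        int count = Math.min(names.size(), points.size());
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            pairs.add(new String[] {names.get(i), points.get(i)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Phone A", "Phone B");
        List<String> points = List.of("Fast charging", "Large screen");
        for (String[] pair : pairByIndex(names, points)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```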

4. Online JSON editor

http://www.newjson.com/Static/Json/jsoneditor.html

II. Quick MySQL Integration with Spring Boot

Why should model classes implement the Serializable interface?
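On the Serializable question: one common reason is that a model object may need to be turned into bytes, for example when it is cached, stored in a replicated session, or passed by value to another process, and Java's built-in serialization only accepts classes that implement Serializable. A minimal sketch of such a round trip, assuming a simplified, hypothetical PhoneModel class rather than the project's actual Phone entity:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializableRoundTrip {

    // Hypothetical, simplified stand-in for the Phone model.
    static class PhoneModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String colors;

        PhoneModel(String name, String colors) {
            this.name = name;
            this.colors = colors;
        }
    }

    // Serializes the object to a byte array using Java's built-in serialization.
    static byte[] toBytes(Serializable value) {
        try (ByteArrayOutputStream buffer = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds an object from bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PhoneModel original = new PhoneModel("Phone A", "Black;White;");
        PhoneModel copy = (PhoneModel) fromBytes(toBytes(original));
        System.out.println(copy.name + " / " + copy.colors);
    }
}
```

If a class written this way did not implement Serializable, writeObject would fail with a NotSerializableException.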

MySQL driver, the JPA data-access abstraction layer, and a connection pool

1. Add dependencies

MySQL driver

        <!-- MySQL driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>

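Aliyun repository address

    <repositories>
        <repository>
            <id>maven-ali</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public//</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
                <updatePolicy>always</updatePolicy>
                <checksumPolicy>fail</checksumPolicy>
            </snapshots>
        </repository>
    </repositories>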

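MyBatis-Plus

# Maven dependency
    <dependency>
        <groupId>com.baomidou</groupId>
        <artifactId>mybatis-plus-boot-starter</artifactId>
        <version>3.3.0</version>
    </dependency>
# YAML configuration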
spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/amussh?useUnicode=true&characterEncoding=utf8&serverTimezone=GMT%2B8&useSSL=false
    username: root
    password: root
# Logger Config
logging:
  level:
    com.amu.esstudy.mapper: debug
mybatis-plus:
  configuration:
    log-impl: org.apache.ibatis.logging.stdout.StdOutImpl

2. Official Spring Data JPA reference

Keywords: @Query, @Entity, #{#entityName}, %:lastname%, @Param("lastname"), extends Repository

3. Database connection settings

spring.datasource.url=jdbc:mysql://127.0.0.1:3306/es?characterEncoding=utf-8&serverTimezone=GMT%2B8
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.username=root
spring.datasource.password=root
# Use the Druid data source
spring.datasource.type: com.alibaba.druid.pool.DruidDataSource
spring.datasource.initialSize: 5
spring.datasource.minIdle: 5
spring.datasource.maxActive: 20
spring.datasource.maxWait: 60000
spring.datasource.timeBetweenEvictionRunsMillis: 60000
spring.datasource.minEvictableIdleTimeMillis: 300000
spring.datasource.validationQuery: SELECT 1 FROM DUAL
spring.datasource.testWhileIdle: true
spring.datasource.testOnBorrow: false
spring.datasource.testOnReturn: false
spring.datasource.poolPreparedStatements: true
spring.datasource.filters: stat
spring.datasource.maxPoolPreparedStatementPerConnectionSize: 20
spring.datasource.useGlobalDataSourceStat: true
spring.datasource.connectionProperties: druid.stat.mergeSql=true;druid.stat.slowSqlMillis=500
# Spring Boot JPA
spring.jpa.show-sql=true
# create: recreate the tables on every startup; update: keep existing tables and only apply schema changes
spring.jpa.hibernate.ddl-auto=update
spring.jpa.database-platform=org.hibernate.dialect.MySQL55Dialect
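In the JDBC URL, the value GMT%2B8 is percent-encoded: %2B stands for "+", so the time zone being requested is GMT+8. The sketch below only illustrates that encoding by splitting the query string and decoding each value with java.net.URLDecoder. The JdbcUrlParams class is hypothetical and is not how the MySQL driver itself parses the URL.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that shows what the encoded URL parameters mean.
class JdbcUrlParams {

    // Returns the decoded key/value pairs found after the '?' in the URL, in order.
    static Map<String, String> parse(String jdbcUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int queryStart = jdbcUrl.indexOf('?');
        if (queryStart < 0) {
            return params;
        }
        for (String pair : jdbcUrl.substring(queryStart + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] keyValue = pair.split("=", 2);
            String value = keyValue.length > 1
                    ? URLDecoder.decode(keyValue[1], StandardCharsets.UTF_8)
                    : "";
            params.put(keyValue[0], value);
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "jdbc:mysql://127.0.0.1:3306/es?characterEncoding=utf-8&serverTimezone=GMT%2B8";
        System.out.println(parse(url));
    }
}
```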

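4. Database connection test

See the official manual.

Add a full-text search method.

5. SQL logging with bound parameters

Create logback.xml in the resources directory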

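Contents:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Placeholder names and values: set LOG_PATH, APPDIR, the appender names,
         the logger names, and the levels to match your project -->
    <property name="LOG_PATH" value="..."/>
    <property name="APPDIR" value="..."/>

    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>1-%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger - %msg%n</pattern>
            <charset>GBK</charset>
        </encoder>
    </appender>

    <!-- ERROR-level log file, rolled by day and by size -->
    <appender name="FILE_ERROR" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOG_PATH}/${APPDIR}/log_error.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_PATH}/${APPDIR}/error/log-error-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>500MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
        <append>true</append>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n</pattern>
            <charset>utf-8</charset>
        </encoder>
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>error</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- WARN-level log file -->
    <appender name="FILE_WARN" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOG_PATH}/${APPDIR}/log_warn.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_PATH}/${APPDIR}/warn/log-warn-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>2MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
        <append>true</append>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n</pattern>
            <charset>utf-8</charset>
        </encoder>
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>warn</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- INFO-level log file -->
    <appender name="FILE_INFO" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOG_PATH}/${APPDIR}/log_info.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_PATH}/${APPDIR}/info/log-info-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>2MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
        <append>true</append>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n</pattern>
            <charset>utf-8</charset>
        </encoder>
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>info</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- Assumed loggers for printing SQL and bound parameters with Hibernate 5;
         replace them with the loggers your project actually needs -->
    <logger name="org.hibernate.SQL" level="DEBUG"/>
    <logger name="org.hibernate.type.descriptor.sql.BasicBinder" level="TRACE"/>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE_ERROR"/>
        <appender-ref ref="FILE_WARN"/>
        <appender-ref ref="FILE_INFO"/>
    </root>
</configuration>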

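III. Quick Elasticsearch Integration with Spring Boot

Developer documentation

1. Create the index and type, and specify the analyzer plugin

2. Using the analyzer plugin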

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "可折叠设计,靓丽全面屏;巴龙5000,华为首款多模5G芯片;55W华为超级快充"
}

IV. Logstash, the Matchmaker

1. Download Logstash

2. Install plugins

logstash-plugin install logstash-input-jdbc
logstash-plugin install logstash-output-elasticsearch

3. Configuration

input {
    jdbc{
        # Location of the JDBC driver JAR
        jdbc_driver_library => "C:\\exp\\logstash-6.8.6\\config\\mysql-connector-java-8.0.17.jar"
        # Driver class to use; it differs from database to database
        jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
        # Database connection string
        jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/es?characterEncoding=utf-8&serverTimezone=GMT%2B8"
        # MySQL user
        jdbc_user => "root"
        # MySQL password
        jdbc_password => "root"
        # Cron-style schedule for the query; "* * * * *" runs it once every minute
        schedule => "* * * * *"
        # Discard the sql_last_value saved by the previous run
        clean_run => true
        # The statement to execute
        statement => "SELECT * FROM phone WHERE create_time > :sql_last_value AND create_time < NOW() ORDER BY create_time desc"

    }
}
output {
    elasticsearch{
        # ES host:port
        hosts => ["http://localhost:9200"]
        # Index name
        index => "phones"
        # Use the row's id column as the document _id
        document_id => "%{id}"
        document_type => "phone"
    }
}
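The statement in this configuration selects only rows whose create_time falls between the stored sql_last_value and the current time, so each scheduled run picks up just the rows added since the previous one. The sketch below simulates that window in plain Java to make the behavior concrete. It is an illustration under simplifying assumptions, not Logstash's implementation: the IncrementalWindow and Row names are hypothetical, the "table" is an in-memory list, and the last value is assumed to start at the Unix epoch, as it does after a clean run.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "create_time > :sql_last_value AND create_time < NOW()".
class IncrementalWindow {

    // Minimal stand-in for a row of the phone table.
    static class Row {
        final int id;
        final Instant createTime;

        Row(int id, Instant createTime) {
            this.id = id;
            this.createTime = createTime;
        }
    }

    // Assumed starting point after a clean run: the Unix epoch.
    private Instant lastValue = Instant.EPOCH;

    // Returns rows created strictly after lastValue and strictly before now, then moves the window forward.
    List<Row> poll(List<Row> table, Instant now) {
        List<Row> selected = new ArrayList<>();
        for (Row row : table) {
            if (row.createTime.isAfter(lastValue) && row.createTime.isBefore(now)) {
                selected.add(row);
            }
        }
        lastValue = now;
        return selected;
    }

    public static void main(String[] args) {
        List<Row> table = List.of(
                new Row(1, Instant.ofEpochSecond(10)),
                new Row(2, Instant.ofEpochSecond(70)));
        IncrementalWindow sync = new IncrementalWindow();
        System.out.println(sync.poll(table, Instant.ofEpochSecond(60)).size());  // first run sees row 1 only
        System.out.println(sync.poll(table, Instant.ofEpochSecond(120)).size()); // second run sees row 2 only
    }
}
```

Because document_id is set to the row's id, a row that is selected again (for example after a restart with clean_run enabled) overwrites its existing document instead of creating a duplicate.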

4. Start

logstash -f mysql2es.conf

This syncs the data to Elasticsearch.

V. Further Reading

The MyBatis Chinese documentation is divided into the following parts:

XML configuration: https://mybatis.org/mybatis-3/zh/configuration.html

XML mappers: https://mybatis.org/mybatis-3/zh/sqlmap-xml.html

Dynamic SQL: https://mybatis.org/mybatis-3/zh/dynamic-sql.html

Java API: https://mybatis.org/mybatis-3/zh/java-api.html

SQL statement builders: https://mybatis.org/mybatis-3/zh/statement-builders.html

Logging: https://mybatis.org/mybatis-3/zh/logging.html

In addition, the Chinese documentation for using MyBatis together with Spring is:

http://mybatis.org/spring/zh/

VI. Common Query Statements

Common field types: text, keyword, date, long, integer, short, byte, double, float, half_float, scaled_float, boolean, ip

# Create an index that uses the specified analyzer plugin (PUT)
curl --location --request PUT 'localhost:9200/fulltext' \
--header 'Content-Type: application/json' \
--data-raw '{
  "mappings": {
    "phone": {
      "properties": {
        "id": {
          "type": "integer"
        },
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "colors": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "selling_points": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}'
