Full-Text Search: A Comprehensive Hands-On Project
Environment: Elasticsearch 6.8.6 + MySQL 5.7 + IntelliJ IDEA + Logstash
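I. Java Web Crawler
1. Page analysis
Target sites:
Representative of JSON data
Huawei phone store: https://consumer.huawei.com/cn/phones/?ic_medium=hwdc&ic_source=corp_header_consumer
Representative of HTML data
Meizu phone store: https://lists.meizu.com/page/list?categoryid=76
HTML special character encoding reference table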
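Because the Huawei page carries its JSON inside HTML, characters such as quotes can show up in the raw source as entities like &quot;. jsoup normally decodes these when text() is called, so the following is only a minimal sketch of what such a reference table maps. The class name HtmlEntities and the five named entities are illustrative choices, not part of the original project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlEntities {
    // Tiny illustrative subset of named entities; a full reference table is much longer.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches the &name; form plus decimal (&#123;) and hexadecimal (&#x7B;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[a-zA-Z]+);");

    static String unescape(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            out.append(decode(m.group(1), m.group()));
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    // Returns the decoded character, or the original text when the entity is unknown or invalid.
    private static String decode(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
        int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }

    public static void main(String[] args) {
        // Hypothetical fragment of entity-escaped JSON as it might appear in page source.
        System.out.println(unescape("{&quot;productName&quot;:&quot;P40&quot;}"));
    }
}
```

Unknown names and out-of-range numeric references are left untouched, so the function never loses input it does not understand.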
2. Apache HttpComponents
Official site
Quick start
Maven repository
An open-source Apache project, used here mainly to issue HTTP requests programmatically.
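3. jsoup
Official site
jsoup is a Java HTML parser that can parse a URL or raw HTML text directly. It offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Reference manual (Apache HttpClient quick start): http://hc.apache.org/httpcomponents-client-4.5.x/quickstart.html
Phone entity class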
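The entity source is not reproduced here, so the following is a minimal sketch reconstructed from how the crawler code uses it. The fields name, colors, sellingPoints, createTime and marketTime come from the setters called in the crawler; the id field and its Integer type are assumptions based on the Logstash document_id and the index mapping. JPA annotations such as @Entity and @Id are left out so the class compiles on its own; a Spring Data JPA project would need to add them.

```java
import java.io.Serializable;
import java.util.Date;

// Sketch only: in the real project this class would also carry JPA annotations.
class Phone implements Serializable {
    private static final long serialVersionUID = 1L;

    private Integer id;            // assumed primary key
    private String name;           // product name
    private String colors;         // colors joined with ';'
    private String sellingPoints;  // selling points joined with ';'
    private Date createTime;       // when the row was written
    private Date marketTime;       // launch date (the crawler fills in the current time)

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getColors() { return colors; }
    public void setColors(String colors) { this.colors = colors; }

    public String getSellingPoints() { return sellingPoints; }
    public void setSellingPoints(String sellingPoints) { this.sellingPoints = sellingPoints; }

    public Date getCreateTime() { return createTime; }
    public void setCreateTime(Date createTime) { this.createTime = createTime; }

    public Date getMarketTime() { return marketTime; }
    public void setMarketTime(Date marketTime) { this.marketTime = marketTime; }
}
```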
Crawling Huawei phones
// Method body. Needs imports from org.apache.http.* (HttpClient 4.5), org.jsoup.*, java.util.*,
// and com.alibaba.fastjson.JSON (assumed from the parseArray call).
CloseableHttpClient httpclient = HttpClients.createDefault(); // create the HttpClient instance
HttpGet httpget = new HttpGet("https://consumer.huawei.com/cn/phones/?ic_medium=hwdc&ic_source=corp_header_consumer"); // create the GET request
CloseableHttpResponse response = httpclient.execute(httpget); // execute the GET request
HttpEntity entity = response.getEntity(); // get the response entity
String content = EntityUtils.toString(entity, "utf-8");
response.close(); // close the stream and release system resources
Document document = Jsoup.parse(content);
// The product data is embedded in the page as JSON text inside hidden elements
Elements elements = document.select("#content-v3-plp #pagehidedata .plphidedata");
for (Element element : elements) {
    String jsonStr = element.text();
    List<HuaWeiPhoneBean> huaWeiPhoneBeanlist = JSON.parseArray(jsonStr, HuaWeiPhoneBean.class);
    for (HuaWeiPhoneBean bean : huaWeiPhoneBeanlist) {
        String productName = bean.getProductName();
        // Join all color names into one ';'-separated string
        List<ColorModeBean> colorModeBeanList = bean.getColorsItemMode();
        String colors = "";
        for (ColorModeBean colorModeBean : colorModeBeanList) {
            String colorName = colorModeBean.getColorName();
            colors += colorName + ";";
        }
        // Join all selling points the same way
        List<String> sellingPointList = bean.getSellingPoints();
        String sellingPoints = "";
        for (String sellingPoint : sellingPointList) {
            sellingPoints += sellingPoint + ";";
        }
        System.out.println("Product name: " + productName);
        System.out.println("Colors: " + colors);
        System.out.println("Selling points: " + sellingPoints);
        // Persist the phone to MySQL
        Phone phone = new Phone();
        phone.setName(productName);
        phone.setColors(colors);
        phone.setSellingPoints(sellingPoints);
        phone.setCreateTime(new Date());
        phone.setMarketTime(new Date());
        phoneMysqlRepository.save(phone);
    }
}
return content;
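The crawler builds its colors and sellingPoints strings by appending each item followed by a semicolon. As a small sketch of the same formatting with the Streams API, the helper below produces the identical trailing-semicolon layout. FieldJoiner and joinWithTrailingSemicolon are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FieldJoiner {
    // Mirrors the crawler's format: every item is followed by ';', including the last one.
    static String joinWithTrailingSemicolon(List<String> items) {
        return items.stream()
                .map(item -> item + ";")
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("Black", "White", "Green");
        System.out.println(joinWithTrailingSemicolon(colors));
    }
}
```

If the trailing semicolon is not wanted, String.join(";", items) gives the separator-only form instead.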
Crawling Meizu phones
CloseableHttpClient httpclient = HttpClients.createDefault(); // create the HttpClient instance
HttpGet httpget = new HttpGet("https://lists.meizu.com/page/list?categoryid=76"); // create the GET request
CloseableHttpResponse response = httpclient.execute(httpget); // execute the GET request
HttpEntity entity = response.getEntity(); // get the response entity
// System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page using the given encoding
String content = EntityUtils.toString(entity, "utf-8");
response.close(); // close the stream and release system resources
Document document = Jsoup.parse(content);
// Three parallel selections: element i of each one belongs to the same product
Elements names = document.select("#goodsListWrap .gl-item .gl-item-link .item-title");
Elements sellingPoints = document.select("#goodsListWrap .gl-item .gl-item-link .item-desc");
Elements colorsElements = document.select(".container .goods-list #goodsListWrap .gl-item .gl-item-link .item-slide");
int i = 0;
for (Element nameElement : names) {
    Phone phone = new Phone();
    phone.setName(nameElement.text());
    // Each color is stored in the title attribute of a slide dot
    Elements elements = colorsElements.get(i).select(".item-slide-dot");
    String endcolors = "";
    for (Element color : elements) {
        endcolors += color.attr("title") + ";";
    }
    // System.out.println(endcolors);
    phone.setSellingPoints(sellingPoints.get(i).text());
    phone.setColors(endcolors);
    phone.setCreateTime(new Date());
    phone.setMarketTime(new Date());
    phoneMysqlRepository.save(phone);
    i++; // advance to the next product; without this every phone would reuse the first entry
}
return null;
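The Meizu crawler relies on three selections lining up by position: the i-th title, description and color strip are assumed to describe the same product. A minimal sketch of that pairing, using plain string lists instead of jsoup elements, is shown below. ParallelLists and zip are hypothetical names, and stopping at the shortest list is a defensive choice that the original code does not make.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParallelLists {
    // Pairs up entries by index and stops at the shortest list, so a missing
    // description or color strip cannot cause an IndexOutOfBoundsException.
    static List<String[]> zip(List<String> names, List<String> descriptions, List<String> colors) {
        int count = Math.min(names.size(), Math.min(descriptions.size(), colors.size()));
        List<String[]> rows = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            rows.add(new String[] {names.get(i), descriptions.get(i), colors.get(i)});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> rows = zip(
                Arrays.asList("Phone A", "Phone B"),
                Arrays.asList("Big battery", "Fast charging"),
                Arrays.asList("Black;White;", "Blue;"));
        for (String[] row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```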
4. Online JSON editor
http://www.newjson.com/Static/Json/jsoneditor.html
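II. Quick MySQL Integration with Spring Boot
Why the model should implement the Serializable interface
MySQL driver, the JPA data access abstraction layer, and the connection pool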
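On the Serializable question: a commonly cited reason is that a serializable model can be turned into bytes and back, which caches, HTTP session stores and remote calls may require. The sketch below shows such a round trip with the JDK's object streams. SerializationDemo and PhoneModel are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationDemo {
    // Stand-in for a model class; only Serializable types can pass through object streams.
    static class PhoneModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;

        PhoneModel(String name) {
            this.name = name;
        }
    }

    // Writes the model to an in-memory byte array and reads a fresh copy back.
    static PhoneModel roundTrip(PhoneModel model) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (PhoneModel) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        PhoneModel copy = roundTrip(new PhoneModel("Mate 30"));
        System.out.println(copy.name);
    }
}
```

Removing "implements Serializable" from PhoneModel makes writeObject throw NotSerializableException, which is the failure this convention guards against.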
1. Add dependencies
MySQL driver
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
</dependency>
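Aliyun repository address
<repository>
    <id>maven-ali</id>
    <url>http://maven.aliyun.com/nexus/content/groups/public//</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>true</enabled>
        <updatePolicy>always</updatePolicy>
        <checksumPolicy>fail</checksumPolicy>
    </snapshots>
</repository>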
mybatis-plus
# Maven dependency
<dependency>
    <groupId>com.baomidou</groupId>
    <artifactId>mybatis-plus-boot-starter</artifactId>
    <version>3.3.0</version>
</dependency>
# YAML configuration
spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/amussh?useUnicode=true&characterEncoding=utf8&serverTimezone=GMT%2B8&useSSL=false
    username: root
    password: root
# Logger Config
logging:
  level:
    com.amu.esstudy.mapper: debug
mybatis-plus:
  configuration:
    log-impl: org.apache.ibatis.logging.stdout.StdOutImpl
2. Official Spring Data JPA reference
Keywords: @Query, @Entity, #{#entityName}, %:lastname%, @Param("lastname"), extends Repository
3. Database parameter configuration
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/es?characterEncoding=utf-8&serverTimezone=GMT%2B8
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.username=root
spring.datasource.password=root
# Use the Druid data source
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.initialSize=5
spring.datasource.minIdle=5
spring.datasource.maxActive=20
spring.datasource.maxWait=60000
spring.datasource.timeBetweenEvictionRunsMillis=60000
spring.datasource.minEvictableIdleTimeMillis=300000
spring.datasource.validationQuery=SELECT 1 FROM DUAL
spring.datasource.testWhileIdle=true
spring.datasource.testOnBorrow=false
spring.datasource.testOnReturn=false
spring.datasource.poolPreparedStatements=true
spring.datasource.filters=stat
spring.datasource.maxPoolPreparedStatementPerConnectionSize=20
spring.datasource.useGlobalDataSourceStat=true
spring.datasource.connectionProperties=druid.stat.mergeSql=true;druid.stat.slowSqlMillis=500
# SpringBoot JPA
spring.jpa.show-sql=true
# create: recreate the tables on every startup; update: keep existing tables and only apply schema changes
spring.jpa.hibernate.ddl-auto=update
spring.jpa.database-platform=org.hibernate.dialect.MySQL55Dialect
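In the JDBC URL above, the time zone is written as GMT%2B8 rather than GMT+8. Percent-encoding the plus sign avoids ambiguity, because form-style URL decoding turns a raw + into a space. The sketch below uses the JDK's URLDecoder as a stand-in to show that effect; how a given driver version actually treats a raw + is not something this example asserts. JdbcUrlParams is a hypothetical helper, not part of any driver.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class JdbcUrlParams {
    // Splits the part after '?' into decoded key/value pairs, keeping their order.
    static Map<String, String> parse(String jdbcUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int queryStart = jdbcUrl.indexOf('?');
        if (queryStart < 0) {
            return params;
        }
        for (String pair : jdbcUrl.substring(queryStart + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String url = "jdbc:mysql://127.0.0.1:3306/es?characterEncoding=utf-8&serverTimezone=GMT%2B8";
        System.out.println(parse(url));
    }
}
```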
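4. Database connection test
See the official reference.
Create a full-text search method.
5. SQL logging with parameter printing
Create logback.xml under the resources directory with the following content: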
1-%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger - %msg%n
GBK
${LOG_PATH}/${APPDIR}/log_error.log
${LOG_PATH}/${APPDIR}/error/log-error-%d{yyyy-MM-dd}.%i.log
500MB
true
%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n
utf-8
error
ACCEPT
DENY
${LOG_PATH}/${APPDIR}/log_warn.log
${LOG_PATH}/${APPDIR}/warn/log-warn-%d{yyyy-MM-dd}.%i.log
2MB
true
%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n
utf-8
warn
ACCEPT
DENY
${LOG_PATH}/${APPDIR}/log_info.log
${LOG_PATH}/${APPDIR}/info/log-info-%d{yyyy-MM-dd}.%i.log
2MB
true
%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n
utf-8
info
ACCEPT
DENY
%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n
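III. Quick Elasticsearch Integration with Spring Boot
Developer documentation
1. Create the index and type, and specify the analyzer plugin
2. Using the analyzer plugin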
GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "可折叠设计,靓丽全面屏;巴龙5000,华为首款多模5G芯片;55W华为超级快充"
}
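The same _analyze call can also be issued from Java. The sketch below uses only the JDK's HttpURLConnection and sends the request as a POST, which the endpoint accepts alongside GET. AnalyzeClient and its method names are hypothetical, the JSON escaping covers only the basics, and a real project would more likely use an Elasticsearch client library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class AnalyzeClient {
    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the request body used by the console examples.
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escapeJson(analyzer) + "\",\"text\":\"" + escapeJson(text) + "\"}";
    }

    // Posts the body to <baseUrl>/_analyze and returns the raw JSON response.
    static String analyze(String baseUrl, String analyzer, String text) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(baseUrl + "/_analyze").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body(analyzer, text).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream is = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = is.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Prints the request body only; call analyze(...) when a node with the IK plugin is running.
        System.out.println(body("ik_smart", "5G chip"));
    }
}
```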
IV. Logstash, the Matchmaker
1. Download Logstash
2. Install the plugins
logstash-plugin install logstash-input-jdbc
logstash-plugin install logstash-output-elasticsearch
3. Configuration
input {
  jdbc {
    # Location of the JDBC driver jar
    jdbc_driver_library => "C:\\exp\\logstash-6.8.6\\config\\mysql-connector-java-8.0.17.jar"
    # Driver class to use; it differs from database to database
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # Database connection string
    jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/es?characterEncoding=utf-8&serverTimezone=GMT%2B8"
    # MySQL user
    jdbc_user => "root"
    # MySQL password
    jdbc_password => "root"
    # Cron-style schedule for running the query; "* * * * *" runs it once every minute
    schedule => "* * * * *"
    # Discard the previously saved sql_last_value when the pipeline starts
    clean_run => true
    # The statement to execute
    statement => "SELECT * FROM phone WHERE create_time > :sql_last_value AND create_time < NOW() ORDER BY create_time desc"
  }
}
output {
  elasticsearch {
    # Elasticsearch host:port
    hosts => ["http://localhost:9200"]
    # Index name
    index => "phones"
    # Use the row's id column as the document _id
    document_id => "%{id}"
    document_type => "phone"
  }
}
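The WHERE clause in the statement selects only rows created after the previous run and before the current moment. The sketch below simulates that window in plain Java, assuming the plugin's default tracking mode in which sql_last_value holds the time of the previous query and starts at 1970-01-01. IncrementalWindow is a hypothetical name and the timestamps are made up for the example.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IncrementalWindow {
    // Equivalent of: WHERE create_time > :sql_last_value AND create_time < NOW()
    static List<Instant> select(List<Instant> createTimes, Instant sqlLastValue, Instant now) {
        List<Instant> picked = new ArrayList<>();
        for (Instant createTime : createTimes) {
            if (createTime.isAfter(sqlLastValue) && createTime.isBefore(now)) {
                picked.add(createTime);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        // Made-up rows, identified only by their create_time.
        List<Instant> rows = Arrays.asList(
                Instant.ofEpochSecond(10), Instant.ofEpochSecond(20), Instant.ofEpochSecond(30));

        // First run: the tracker starts at the epoch, so every row older than "now" is picked up.
        Instant firstRun = Instant.ofEpochSecond(25);
        System.out.println("first run:  " + select(rows, Instant.EPOCH, firstRun));

        // Second run: only rows created since the first run are returned.
        Instant secondRun = Instant.ofEpochSecond(40);
        System.out.println("second run: " + select(rows, firstRun, secondRun));
    }
}
```

With clean_run => true the tracker is reset whenever the pipeline starts, so a restart re-reads every row. Because document_id is set to the row id, re-read rows overwrite their existing documents instead of creating duplicates.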
4. Start Logstash
logstash -f mysql2es.conf
This synchronizes the data to Elasticsearch.
V. Further Reading
The MyBatis Chinese documentation is organized into the following parts:
XML configuration: https://mybatis.org/mybatis-3/zh/configuration.html
XML mappers: https://mybatis.org/mybatis-3/zh/sqlmap-xml.html
Dynamic SQL: https://mybatis.org/mybatis-3/zh/dynamic-sql.html
Java API: https://mybatis.org/mybatis-3/zh/java-api.html
SQL statement builder: https://mybatis.org/mybatis-3/zh/statement-builders.html
Logging: https://mybatis.org/mybatis-3/zh/logging.html
The Chinese documentation for using MyBatis together with Spring is at:
http://mybatis.org/spring/zh/
VI. Common API Query Statements
Common field types: text, keyword, date, long, integer, short, byte, double, float, half_float, scaled_float, boolean, ip
# Create an index that uses the specified analyzer plugin (PUT)
curl --location --request PUT 'localhost:9200/fulltext' \
--header 'Content-Type: application/json' \
--data-raw '{
  "mappings": {
    "phone": {
      "properties": {
        "id": {
          "type": "integer"
        },
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "colors": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "selling_points": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}'