9.1 Search Engines
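A database is the project's central data store, and any data it holds can be retrieved with SQL. Full-text search, however, is a weak spot for databases: they speed up retrieval by adding indexes on specific columns, which means locating records by an indexed key. Full-text search instead needs to locate records by the terms that appear in their content, and the structure that supports this is called an inverted index.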
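The idea of an inverted index can be illustrated with a minimal sketch. The class name InvertedIndexSketch and the whitespace tokenization are assumptions made for illustration only; a real engine uses a language-aware tokenizer (essential for Chinese text) and far more compact data structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted ids of the records whose text contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split the text on whitespace and register the record id under each term.
    public void add(int id, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Look a term up directly; no scan over the records is needed.
    public List<Integer> search(String term) {
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Peking University Beijing");
        index.add(2, "Tsinghua University Beijing");
        index.add(3, "Fudan University Shanghai");
        System.out.println(index.search("beijing"));  // prints [1, 2]
        System.out.println(index.search("shanghai")); // prints [3]
    }
}
```

A search for a term is a single map lookup, whereas a database without a suitable index would have to scan the text of every row.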
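ElasticSearch is currently the most common choice of full-text search engine. It can store and search large volumes of data quickly and with near-real-time results; sites such as Stack Overflow and GitHub use it.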
ElasticSearch is a document-oriented database. Its storage model is easiest to understand by comparison with a relational database:
Relational DB -> Database -> Table -> Row -> Column
ElasticSearch -> Index -> Type -> Document -> Field
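Alibaba Cloud offers two search services: ElasticSearch and OpenSearch. OpenSearch is also based on ElasticSearch; it wraps ElasticSearch and streamlines data synchronization, so that once the database connection is configured, data is synchronized automatically. OpenSearch also provides an SDK for calling the service.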
9.2 OpenSearch
Activating OpenSearch
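OpenSearch uses a database as its data source, so the OpenSearch service for a database can be created from the database instance. In ApsaraDB RDS, find the database instance created earlier and go to Manage > OpenSearch > Activate Now > Create Application. Two editions are offered, Standard and Advanced, each with a detailed description of its intended use cases, so choose according to your needs; this demonstration uses the Standard edition. We use the address table as the data source, that is, we run full-text search over the contents of the address table.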
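Creation takes five steps. Note that step 2 (defining the application schema) requires choosing a primary key, and step 3 is where you choose a tokenizer as needed.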
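After the application is created it must be activated, and after activation the instance must build a full index, that is, OpenSearch pulls the data from the data source and builds the index. This step takes a relatively long time.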
Integrating OpenSearch
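The address table in the database contains only 2 rows, so only 2 documents are synchronized to OpenSearch. With too few documents, OpenSearch returns no results for a search, so insert at least 10 rows of sample data into the address table before searching.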
Open the starter project and add the opensearch dependency to pom.xml:
<dependency>
    <groupId>com.aliyun.opensearch</groupId>
    <artifactId>aliyun-sdk-opensearch</artifactId>
    <version>3.3.0</version>
</dependency>
Calling the OpenSearch service requires an Access Key and an Access Key Secret. Log in to the Alibaba Cloud console and click accesskeys under the account avatar to view your Access Key ID and Access Key Secret.
package cn.mx.starter.service;
import org.springframework.stereotype.Service;
import com.aliyun.opensearch.OpenSearchClient;
import com.aliyun.opensearch.SearcherClient;
import com.aliyun.opensearch.sdk.dependencies.com.google.common.collect.Lists;
import com.aliyun.opensearch.sdk.dependencies.org.json.JSONObject;
import com.aliyun.opensearch.sdk.generated.OpenSearch;
import com.aliyun.opensearch.sdk.generated.commons.OpenSearchClientException;
import com.aliyun.opensearch.sdk.generated.commons.OpenSearchException;
import com.aliyun.opensearch.sdk.generated.search.Config;
import com.aliyun.opensearch.sdk.generated.search.SearchFormat;
import com.aliyun.opensearch.sdk.generated.search.SearchParams;
import com.aliyun.opensearch.sdk.generated.search.general.SearchResult;
@Service
public class OpenSearchService {
    // Placeholder values: replace with your own application name, credentials and region endpoint.
    String appName = "dev";
    String accesskey = "accesskey";
    String secret = "secret";
    String host = "http://opensearch-cn-beijing.aliyuncs.com";

    public JSONObject search() {
        // Create the client from the credentials and the endpoint.
        OpenSearch openSearch = new OpenSearch(accesskey, secret, host);
        OpenSearchClient searchClient = new OpenSearchClient(openSearch);
        // Return the first 10 hits as full JSON, fetching only the id and addr fields.
        Config config = new Config(Lists.newArrayList(appName));
        config.setStart(0);
        config.setHits(10);
        config.setSearchFormat(SearchFormat.FULLJSON);
        config.setFetchFields(Lists.newArrayList("id", "addr"));
        // Search the addr index for the term "北京".
        SearchParams searchParams = new SearchParams(config);
        searchParams.setQuery("addr:'北京'");
        SearcherClient searcherClient = new SearcherClient(searchClient);
        SearchResult searchResult = null;
        try {
            searchResult = searcherClient.execute(searchParams);
        } catch (OpenSearchException e) {
            e.printStackTrace();
        } catch (OpenSearchClientException e) {
            e.printStackTrace();
        }
        // Parse the raw JSON string; return null if the request failed.
        if (searchResult != null) {
            String result = searchResult.getResult();
            JSONObject obj = new JSONObject(result);
            return obj;
        }
        return null;
    }
}
OpenSearch returns the result as a JSON string:
{
    "result": {
        "searchtime": 0.035508,
        "total": 19,
        "num": 10,
        "viewtotal": 19,
        "compute_cost": [
            {
                "index_name": "130027524",
                "value": 0.515
            }
        ],
        "items": [
            {
                "fields": {
                    "addr": "学院路北京大学",
                    "id": "2",
                    "user_id": "1",
                    "index_name": "130027524"
                },
                "property": {},
                "attribute": {},
                "variableValue": {},
                "sortExprValues": [
                    "10000.0767986625"
                ],
                "tracerInfo": ""
            },
            {
                "fields": {
                    "addr": "学院路北京大学",
                    "id": "20",
                    "user_id": "1",
                    "index_name": "130027524"
                },
                "property": {},
                "attribute": {},
                "variableValue": {},
                "sortExprValues": [
                    "10000.0767986625"
                ],
                "tracerInfo": ""
            }
        ],
        "facet": []
    }
}
Once OpenSearch returns the results, convert them to a JSON object or a Map as needed for further processing.
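In a real project there is no need to synchronize every column of a database table to OpenSearch. Synchronize only the fields that need full-text search, have the search return only a set of ids, and then use those ids to fetch the full records from the database.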
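The second half of that pattern, fetching full rows by the ids that the search returned, might look like the following sketch. The class IdLookupSql and the method buildInQuery are hypothetical helpers, not part of the OpenSearch SDK; the table name is assumed to be a trusted constant, and the ids are meant to be bound as parameters of a PreparedStatement rather than concatenated into the SQL.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class IdLookupSql {
    // Build "SELECT * FROM <table> WHERE id IN (?, ?, ...)" with one placeholder per id.
    // The ids themselves are bound later through PreparedStatement.setString.
    public static String buildInQuery(String table, List<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("ids must not be empty");
        }
        String placeholders = String.join(", ", Collections.nCopies(ids.size(), "?"));
        return "SELECT * FROM " + table + " WHERE id IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // The ids "2" and "20" are the ones that appear in the sample search result.
        System.out.println(buildInQuery("address", Arrays.asList("2", "20")));
        // prints SELECT * FROM address WHERE id IN (?, ?)
    }
}
```

Keeping the full rows in the database and only the searchable fields in the search service keeps the index small and leaves the database as the single source of truth.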
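This example showed how to run a full-text search with OpenSearch. The Alibaba Cloud documentation describes every API in detail, including filtering, aggregation and sorting; the calls are straightforward, so they are not repeated here.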
9.3 ElasticSearch
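OpenSearch wraps ElasticSearch, synchronizes with the database seamlessly and in real time, and comes with a complete SDK, so it meets the full-text search requirements of the vast majority of scenarios.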
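However, an OpenSearch application can use at most 2 tables of the source database (Advanced edition). If a project needs to store the combined result of several tables as a single search record, OpenSearch cannot do it and native ElasticSearch is the only option. Another advantage of ElasticSearch is finer-grained access to the native API, for example setting field weights.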
Activating ElasticSearch
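Alibaba Cloud also provides an ElasticSearch service: go to Management Console > Products and Services > Alibaba Cloud ElasticSearch > Create, and enter an account name and password.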
Data Synchronization
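Data is synchronized to ElasticSearch by calling its API: a POST request adds JSON data to an index as a document. You could write a script that queries the database on a schedule and sends those requests, but the more common approach is the tool logstash, whose job here is exactly that: query the database periodically and send the results to ElasticSearch. logstash is also simple to use: edit the configuration file and run it.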
Download the logstash archive and extract it, copy mysql-connector-java.jar into the logstash directory, and edit logstash.conf:
input {
  stdin {
  }
  jdbc {
    type => "dev"
    jdbc_driver_library => "/root/logstash-6.4.0/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://rm-*.mysql.rds.aliyuncs.com/test?"
    jdbc_user => "user"
    jdbc_password => "password"
    record_last_run => "true"
    last_run_metadata_path => "/root/logstash-6.4.0/data/dev.txt"
    clean_run => "false"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "100"
    statement => "select u.id, u.name, a.addr from user u inner join address a on a.user_id = u.id where u.updated>:sql_last_value"
    schedule => "*/5 * * * *"
  }
}
filter {
  json {
    source => "message"
    remove_field => ["message"]
  }
}
output {
  elasticsearch {
    hosts => "es-cn-*.elasticsearch.aliyuncs.com:9200"
    user => "user"
    password => "password"
    index => "dev"
    document_id => "%{id}"
  }
}
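The user table needs an updated column that is refreshed whenever a row changes, so that each run can select only the rows modified since the previous run:
alter table user add column updated timestamp not null default current_timestamp on update current_timestamp;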
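With the MySQL driver added and the configuration file edited, copy the logstash folder to the root directory on the server, then log in to the server and run logstash in the background. It synchronizes once every 5 minutes.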
/root/logstash/bin/logstash -f /root/logstash/logstash.conf &
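The incremental selection that the statement expresses with u.updated > :sql_last_value can be pictured with the in-memory sketch below. It is only an illustration of the bookkeeping, not how logstash is implemented: the class IncrementalSyncSketch, its Row type and the sync method are hypothetical names, and the marker starts at the epoch so that the first run picks up every row.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class IncrementalSyncSketch {
    // Hypothetical row type: only the fields needed to illustrate the idea.
    public static final class Row {
        final int id;
        final Instant updated;

        public Row(int id, Instant updated) {
            this.id = id;
            this.updated = updated;
        }
    }

    // Plays the role of :sql_last_value; starts at the epoch so the first run selects everything.
    private Instant lastRun = Instant.EPOCH;

    // Return the ids of rows changed since the previous run, then advance the marker.
    public List<Integer> sync(List<Row> table, Instant now) {
        List<Integer> changed = new ArrayList<>();
        for (Row row : table) {
            if (row.updated.isAfter(lastRun)) {
                changed.add(row.id);
            }
        }
        lastRun = now;
        return changed;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2020-01-01T00:00:00Z");
        List<Row> table = new ArrayList<>();
        table.add(new Row(1, t0));
        IncrementalSyncSketch sync = new IncrementalSyncSketch();
        System.out.println(sync.sync(table, t0.plusSeconds(300))); // prints [1]
        System.out.println(sync.sync(table, t0.plusSeconds(600))); // prints []
    }
}
```

Because only changed rows are selected, each scheduled run stays cheap even when the table is large, which is why the updated column matters.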
Integrating ElasticSearch
Open the starter project and add the elasticsearch dependency to pom.xml:
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
</dependency>
Add ElasticSearchService.java:
package cn.mx.starter.service;
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.apache.http.message.BasicHeader;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.springframework.stereotype.Service;
import com.aliyun.opensearch.sdk.dependencies.org.json.JSONObject;
@Service
public class ElasticSearchService {
    private static RestClient restClient = null;

    public JSONObject query() {
        // multi_match over name and addr; name^2 boosts matches in the name field by a factor of 2.
        String queryJSON = String.format(
            "{\n" +
            " \"query\": {\n" +
            " \"bool\": {\n" +
            " \"must\": {\n" +
            " \"multi_match\": {\n" +
            " \"query\": \"%s\",\n" +
            " \"fields\": [\n" +
            " \"name^2\",\n" +
            " \"addr\"\n" +
            " ],\n" +
            " \"type\": \"best_fields\",\n" +
            " \"tie_breaker\": 0.3\n" +
            " }\n" +
            " }\n" +
            " }\n" +
            " },\n" +
            " \"from\": %d,\n" +
            " \"size\": %d\n" +
            "}",
            "john 北京大学", 0, 10);
        String endpoint = "dev/_search";
        Map<String, String> params = Collections.singletonMap("pretty", "true");
        HttpEntity queryEntity = new StringEntity(queryJSON, ContentType.APPLICATION_JSON);
        Header[] headers = new Header[] { new BasicHeader("Content-Type", "application/json") };
        String result = null;
        try {
            Response response = restClient.performRequest("GET", endpoint, params, queryEntity, headers);
            result = EntityUtils.toString(response.getEntity());
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Return null when the request failed instead of parsing a null string.
        if (result == null) {
            return null;
        }
        return new JSONObject(result);
    }

    // Build a REST client that authenticates with HTTP basic credentials (placeholder values).
    private RestClient getRestClient() {
        final String HOST = "es-cn-*.elasticsearch.aliyuncs.com";
        final Integer PORT = 9200;
        final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("user", "password"));
        RestClient restClient = RestClient.builder(new HttpHost(HOST, PORT))
            .setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
                @Override
                public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
                    return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
                }
            }).build();
        return restClient;
    }

    // Create the shared client once the bean has been constructed.
    @PostConstruct
    public void init() {
        restClient = this.getRestClient();
    }

    // Release the client's connections when the application shuts down.
    @PreDestroy
    public void destroy() {
        if (restClient != null) {
            try {
                restClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
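This method works close to the native API: it sends the search conditions as JSON to the ElasticSearch server, and the server returns the search results. name^2 adds a weight to the name field, giving finer-grained control over the fields being searched. ElasticSearch provides a standard query language, Query DSL, documented at https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html.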
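When the search text comes from a user rather than a literal, inserting it into the JSON template unescaped would produce an invalid request as soon as it contains a double quote or a backslash. The sketch below shows one way to escape it before building a multi_match body. QueryJsonSketch, escapeJson and multiMatch are hypothetical names introduced for illustration; in practice a JSON library would normally do this work.

```java
public class QueryJsonSketch {
    // Escape characters that would otherwise terminate or corrupt a JSON string literal.
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become six-character hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build a compact multi_match body; the boost "name^2" is passed through unchanged.
    public static String multiMatch(String text, int from, int size) {
        return "{\"query\":{\"multi_match\":{\"query\":\"" + escapeJson(text)
            + "\",\"fields\":[\"name^2\",\"addr\"]}},\"from\":" + from + ",\"size\":" + size + "}";
    }

    public static void main(String[] args) {
        System.out.println(multiMatch("john \"北京大学\"", 0, 10));
    }
}
```

The resulting string can be wrapped in a StringEntity and sent to dev/_search in the same way as the hard-coded query.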