[TOC]
Movie Search Service
Requirements Analysis and Architecture Design
https://www.elastic.co/blog/e...
Importing Movie Data into Elasticsearch
Building Your Movie Search Service
Stack Overflow Developer Survey Analysis
Requirements Analysis and Architecture Design
https://insights.stackoverflo...
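Data Extract & Enrichment
- Data to analyze: the Stack Overflow 2020 Developer Survey
- Import the data into ES with Logstash
Configuration in ES
- Use an Ingest Pipeline to split fields and convert field types
- Use a dynamic template to set the field types of the imported data
- After the steps below are complete, create an Index Pattern
In Kibana: under Management > Index Patterns, create an Index Pattern named "final-stackoverflow-survey" that matches only the final-stackoverflow-survey index.
Logstash configuration: stackoverflow-surver.conf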
input {
file {
# The path must be absolute; relative paths are not accepted
path => ["D:/Programming/logstash-7.9.3/survey_results_public.csv"]
start_position => "beginning"
# sincedb_path => "/dev/null"
# Windows has no /dev/null; use nul instead.
sincedb_path => "nul"
}
}
filter {
csv {
autogenerate_column_names => false
skip_empty_columns => true
separator => ","
columns => [
"Respondent",
"MainBranch",
"Hobbyist",
"Age",
"Age1stCode",
"CompFreq",
"CompTotal",
"ConvertedComp",
"Country",
"CurrencyDesc",
"CurrencySymbol",
"DatabaseDesireNextYear",
"DatabaseWorkedWith",
"DevType",
"EdLevel",
"Employment",
"Ethnicity",
"Gender",
"JobFactors",
"JobSat",
"JobSeek",
"LanguageDesireNextYear",
"LanguageWorkedWith",
"MiscTechDesireNextYear",
"MiscTechWorkedWith",
"NEWCollabToolsDesireNextYear",
"NEWCollabToolsWorkedWith",
"NEWDevOps",
"NEWDevOpsImpt",
"NEWEdImpt",
"NEWJobHunt",
"NEWJobHuntResearch",
"NEWLearn",
"NEWOffTopic",
"NEWOnboardGood",
"NEWOtherComms",
"NEWOvertime",
"NEWPurchaseResearch",
"NEWPurpleLink",
"NEWSOSites",
"NEWStuck",
"OpSys",
"OrgSize",
"PlatformDesireNextYear",
"PlatformWorkedWith",
"PurchaseWhat",
"Sexuality",
"SOAccount",
"SOComm",
"SOPartFreq",
"SOVisitFreq",
"SurveyEase",
"SurveyLength",
"Trans",
"UndergradMajor",
"WebframeDesireNextYear",
"WebframeWorkedWith",
"WelcomeChange",
"WorkWeekHrs",
"YearsCode",
"YearsCodePro"
]
}
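# Drop the CSV header row: it is parsed like any data line, so its Respondent field holds the literal text "Respondent"
if [Respondent] == "Respondent" {
drop {
}
}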
# Remove fields that are not needed
mutate {
remove_field => ["message", "@version", "@timestamp", "host"]
}
}
output {
# Print one dot per event to show progress
stdout {
codec => "dots"
}
# Write to the stackoverflow-survey-raw index in ES
elasticsearch {
hosts => ["http://localhost:9200"]
index => "stackoverflow-survey-raw"
document_type => "_doc"
}
}
Input Plugin
- File Input
Filter Plugin
- CSV Filter
- Mutate Filter
Output Plugin
- ES Output
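As a rough illustration of what the filter stage above does to each line, the following Python sketch parses CSV text into named fields, omits empty columns, and discards the header row. It only imitates the behavior locally and is not how Logstash is implemented. The shortened column list, the function name `parse_survey_csv`, and the sample rows are assumptions made for the example, as is the reading that the conditional `drop` is meant to remove the header row.

```python
import csv
import io

# Shortened column list for illustration; the real config names all 61 columns.
COLUMNS = ["Respondent", "MainBranch", "Hobbyist", "Age"]


def parse_survey_csv(text, columns=COLUMNS):
    """Imitate the csv filter, skip_empty_columns, and the header-row drop."""
    events = []
    for row in csv.reader(io.StringIO(text)):
        # skip_empty_columns => true: empty values produce no field at all
        event = {name: value for name, value in zip(columns, row) if value != ""}
        # The header row parses like any other line, so its first column
        # holds the literal column name; such events are discarded.
        if event.get(columns[0]) == columns[0]:
            continue
        events.append(event)
    return events


# Hypothetical sample input: one header line and two respondents.
sample = (
    "Respondent,MainBranch,Hobbyist,Age\n"
    "1,I am a developer by profession,Yes,\n"
    "2,I code primarily as a hobby,No,25\n"
)
events = parse_survey_csv(sample)
```

Note that every value is still a string at this point, which is why the later ingest pipeline converts the numeric fields.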
Example: running Logstash on Windows to import the data
.\bin\logstash.bat -f .\stackoverflow-surver.conf
Run the following in Kibana
DELETE stackoverflow-survey-raw
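// Check the field types of the imported data; all of them were mapped as strings
// We do not need full-text search on this data but do need aggregations, so these fields should be mapped as keyword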
GET stackoverflow-survey-raw
// Define the dynamic mapping
PUT final-stackoverflow-survey
{
"mappings": {
// Map every string field as keyword
"dynamic_templates": [
{
"string_as_keyword": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword"
}
}
}
]
},
"settings": {
// Set the number of replica shards to 0
"number_of_replicas": 0
}
}
GET stackoverflow-survey-raw/_search
- Use a Dynamic Template to handle the mapping of text fields
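To make the effect of the `string_as_keyword` template concrete, here is a deliberately simplified Python model of how a mapping is chosen for a newly seen field. It is a sketch under assumptions, not Elasticsearch's actual logic: it ignores date and numeric detection inside strings, handles only `match_mapping_type`, and the helper names `detected_type` and `resolve_mapping` are made up for this example.

```python
# The dynamic template from the index definition above.
DYNAMIC_TEMPLATES = [
    {"string_as_keyword": {"match_mapping_type": "string", "mapping": {"type": "keyword"}}}
]

# Default dynamic mappings used when no template matches (simplified).
DEFAULT_MAPPINGS = {
    "string": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
    "long": {"type": "long"},
    "double": {"type": "float"},
    "boolean": {"type": "boolean"},
}


def detected_type(value):
    """Return the JSON data type name that match_mapping_type is compared against."""
    # bool is checked first because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def resolve_mapping(value, dynamic_templates):
    """Pick the mapping of the first matching template, else the default."""
    if isinstance(value, list):
        # Simplification: a non-empty array is typed by its first element.
        value = value[0]
    json_type = detected_type(value)
    for template in dynamic_templates:
        for rule in template.values():
            if rule.get("match_mapping_type") in (json_type, "*"):
                return rule["mapping"]
    return DEFAULT_MAPPINGS[json_type]


with_template = resolve_mapping("Germany", DYNAMIC_TEMPLATES)
without_template = resolve_mapping("Germany", [])
array_field = resolve_mapping(["C#", "Go"], DYNAMIC_TEMPLATES)
integer_field = resolve_mapping(40, DYNAMIC_TEMPLATES)
```

With the template, string values (including the arrays produced by splitting) become `keyword` fields that can be aggregated directly, while values already converted to integers are unaffected.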
Run the following in Kibana
// Create an Ingest Pipeline that splits some fields and converts others to integers
PUT _ingest/pipeline/stackoverflow_pipeline
{
"description": "Pipeline for stackoverflow survey",
"processors": [
{
"split": {
"field": "NEWPurchaseResearch",
"separator": ";"
}
},
{
"split": {
"field": "NEWSOSites",
"separator": ";"
}
},
{
"split": {
"field": "NEWStuck",
"separator": ";"
}
},
{
"split": {
"field": "DevType",
"separator": ";"
}
},
{
"split": {
"field": "NEWJobHunt",
"separator": ";"
}
},
{
"split": {
"field": "NEWJobHuntResearch",
"separator": ";"
}
},
{
"split": {
"field": "DatabaseDesireNextYear",
"separator": ";"
}
},
{
"split": {
"field": "DatabaseWorkedWith",
"separator": ";"
}
},
{
"split": {
"field": "LanguageWorkedWith",
"separator": ";"
}
},
{
"split": {
"field": "LanguageDesireNextYear",
"separator": ";"
}
},
{
"split": {
"field": "MiscTechDesireNextYear",
"separator": ";"
}
},
{
"split": {
"field": "MiscTechWorkedWith",
"separator": ";"
}
},
{
"split": {
"field": "PlatformDesireNextYear",
"separator": ";"
}
},
{
"split": {
"field": "PlatformWorkedWith",
"separator": ";"
}
},
{
"split": {
"field": "WebframeWorkedWith",
"separator": ";"
}
},
{
"split": {
"field": "WebframeDesireNextYear",
"separator": ";"
}
},
{
"split": {
"field": "NEWCollabToolsDesireNextYear",
"separator": ";"
}
},
{
"split": {
"field": "NEWCollabToolsWorkedWith",
"separator": ";"
}
},
{
"split": {
"field": "JobFactors",
"separator": ";"
}
},
{
"split": {
"field": "Ethnicity",
"separator": ";"
}
},
{
"split": {
"field": "Sexuality",
"separator": ";"
}
},
{
"convert": {
"field": "YearsCode",
"type": "integer",
"on_failure": [
{
"set": {
"field": "YearsCode",
"value": 0
}
}
]
}
},
{
"convert": {
"field": "WorkWeekHrs",
"type": "integer",
"on_failure": [
{
"set": {
"field": "WorkWeekHrs",
"value": 0
}
}
]
}
},
{
"convert": {
"field": "Age",
"type": "integer",
"on_failure": [
{
"set": {
"field": "Age",
"value": 0
}
}
]
}
},
{
"convert": {
"field": "Age1stCode",
"type": "integer",
"on_failure": [
{
"set": {
"field": "Age1stCode",
"value": 0
}
}
]
}
},
{
"convert": {
"field": "YearsCodePro",
"type": "integer",
"on_failure": [
{
"set": {
"field": "YearsCodePro",
"value": 0
}
}
]
}
}
]
}
// Reindex the data imported by Logstash into the final-stackoverflow-survey index, applying the ingest pipeline created above
POST _reindex
{
"source": {
"index": "stackoverflow-survey-raw"
},
"dest": {
"index": "final-stackoverflow-survey",
"pipeline": "stackoverflow_pipeline"
}
}
GET final-stackoverflow-survey
GET final-stackoverflow-survey/_search
Create the Ingest Pipeline
- Split some string fields
- Convert fields to integers
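The pipeline body above repeats the same two processor shapes many times, so it can also be generated from two field lists. The Python sketch below builds an equivalent body and includes a small local simulator to show what the two processor types do to one document. The function names `build_pipeline` and `simulate` and the sample document are assumptions for illustration; the simulator covers only these two processors and is not a substitute for Elasticsearch's own behavior.

```python
SPLIT_FIELDS = [
    "NEWPurchaseResearch", "NEWSOSites", "NEWStuck", "DevType", "NEWJobHunt",
    "NEWJobHuntResearch", "DatabaseDesireNextYear", "DatabaseWorkedWith",
    "LanguageWorkedWith", "LanguageDesireNextYear", "MiscTechDesireNextYear",
    "MiscTechWorkedWith", "PlatformDesireNextYear", "PlatformWorkedWith",
    "WebframeWorkedWith", "WebframeDesireNextYear", "NEWCollabToolsDesireNextYear",
    "NEWCollabToolsWorkedWith", "JobFactors", "Ethnicity", "Sexuality",
]
INT_FIELDS = ["YearsCode", "WorkWeekHrs", "Age", "Age1stCode", "YearsCodePro"]


def build_pipeline(split_fields, int_fields, separator=";", fallback=0):
    """Build the pipeline body: one split per list field, one convert per number."""
    processors = [{"split": {"field": f, "separator": separator}} for f in split_fields]
    for f in int_fields:
        processors.append({
            "convert": {
                "field": f,
                "type": "integer",
                # If the value is not an integer (e.g. "NA"), store the fallback.
                "on_failure": [{"set": {"field": f, "value": fallback}}],
            }
        })
    return {"description": "Pipeline for stackoverflow survey", "processors": processors}


def simulate(doc, pipeline):
    """Apply split/convert processors to a copy of doc (local approximation)."""
    out = dict(doc)
    for processor in pipeline["processors"]:
        (kind, cfg), = processor.items()
        if kind == "split":
            # Elasticsearch treats the separator as a regex; for ";" a literal split is equivalent.
            out[cfg["field"]] = out[cfg["field"]].split(cfg["separator"])
        elif kind == "convert":
            try:
                out[cfg["field"]] = int(out[cfg["field"]])
            except (KeyError, ValueError):
                for handler in cfg["on_failure"]:
                    out[handler["set"]["field"]] = handler["set"]["value"]
    return out


full_pipeline = build_pipeline(SPLIT_FIELDS, INT_FIELDS)

# Hypothetical document, run through a reduced pipeline that touches only its fields.
sample_doc = {"LanguageWorkedWith": "C#;HTML/CSS;JavaScript", "YearsCode": "Less than 1 year", "Age": "25"}
small_pipeline = build_pipeline(["LanguageWorkedWith"], ["YearsCode", "Age"])
result = simulate(sample_doc, small_pipeline)
```

One consequence worth keeping in mind when building dashboards: any value that cannot be parsed as an integer is stored as 0, so 0 also stands for "unknown" in the converted fields.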
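Building the Insights Dashboard
Elastic Certification
Certification
... (omitted)
Exam Objectives
Installation and Configuration
- Configure and deploy a cluster to meet given requirements
- Configure the nodes of a cluster
- Secure the cluster
- Configure RBAC for the cluster with X-Pack
Indexing Data
- Define an index that meets given requirements
- Perform index, create, read, update, and delete operations on the documents of an index
- Define and use an Index Alias
- Define and use an Index Template
- Define and use a Dynamic Template
- Reindex documents with the Reindex API and Update By Query
- Define an Ingest Pipeline (including the use of Painless scripts)
Queries
- Query one or more fields for terms or phrases
- Use the Bool query
- Highlight search results
- Sort search results
- Paginate search results
- Use the Scroll API
- Use fuzzy queries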
- Use Search Templates (rarely needed in day-to-day work, but they cleanly separate the definition of a search from its use)
- Cross-cluster search
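The Search Template item in the list above is about separating where a query is defined from where it is used. The Python sketch below illustrates only that idea: a query is written once with Mustache-style `{{name}}` placeholders, and callers supply nothing but parameters. The template text, parameter names, and the `render` helper are assumptions for this example; real search templates are rendered by Elasticsearch itself, and this naive string substitution does no JSON escaping.

```python
import json

# Query defined once, with placeholders instead of concrete values.
TEMPLATE_SOURCE = '{"query": {"match": {"{{field}}": "{{value}}"}}, "size": {{size}}}'


def render(source, params):
    """Substitute each {{name}} placeholder and parse the result as JSON."""
    rendered = source
    for name, value in params.items():
        rendered = rendered.replace("{{" + name + "}}", str(value))
    return json.loads(rendered)


# The caller knows only the parameter names, not the query structure.
query = render(TEMPLATE_SOURCE, {"field": "Country", "value": "Germany", "size": 5})
```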
Aggregations
- Metric & Bucket Aggregations
- sub-aggregation
- pipeline aggregation
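Mappings and Text Analysis
- Define an index mapping as required
- Define a custom analyzer as required
- Define multi-fields for a field (each sub-field with its own type and analyzer)
- Define and query nested documents
- Define and query parent/child documents
Cluster Administration
- Allocate the shards of an index to specific nodes as required
- Configure shard allocation awareness and forced awareness for an index
- Diagnose shard issues and restore the cluster's health status
- Back up and restore a cluster or specific indices
- Configure a cluster with a hot & warm architecture
- Configure cross-cluster search
Practice Exam
Cluster Backup and Restore
- Here, my_fs_backup is the snapshot Repository that was created
- A snapshot can be limited to specific indices
- Before restoring, an existing index of the same name must be deleted first (closing it also works); otherwise the restore fails with an error.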
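The three notes above correspond to a fixed order of REST calls. The Python sketch below only assembles those calls as `(method, path, body)` tuples so the sequence can be read in one place; it does not contact a cluster. The snapshot name `snapshot_1`, the repository location, and the function name `snapshot_requests` are hypothetical, and a filesystem repository location must also be listed under `path.repo` in `elasticsearch.yml`.

```python
def snapshot_requests(repository, snapshot, location, indices):
    """Return the REST calls for: register repo, snapshot, delete indices, restore."""
    index_list = ",".join(indices)
    calls = [
        # 1. Register a shared-filesystem repository.
        ("PUT", f"/_snapshot/{repository}", {"type": "fs", "settings": {"location": location}}),
        # 2. Snapshot only the named indices instead of the whole cluster.
        ("PUT", f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true", {"indices": index_list}),
    ]
    # 3. Remove the live indices so the restore does not collide with them.
    for index in indices:
        calls.append(("DELETE", f"/{index}", None))
    # 4. Restore the same indices from the snapshot.
    calls.append(("POST", f"/_snapshot/{repository}/{snapshot}/_restore", {"indices": index_list}))
    return calls


calls = snapshot_requests(
    repository="my_fs_backup",
    snapshot="snapshot_1",                   # hypothetical snapshot name
    location="/mount/backups/my_fs_backup",  # hypothetical path
    indices=["final-stackoverflow-survey"],
)
```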