(1) Elasticsearch Introduction (study notes on the official 7.7 documentation)

Contents:

I. Data in: documents and indices

II. Information out: search and analyze

III. Scalability and resilience: clusters, nodes, and shards

 

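1. How Elasticsearch, Logstash, Beats, and Kibana relate: Elasticsearch is the distributed search and analytics engine; Logstash and Beats collect, aggregate, and enrich data and store it in Elasticsearch; Kibana provides interactive search and visualization.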

Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. 

Logstash and Beats facilitate collecting, aggregating, and enriching your data and storing it in Elasticsearch. 

Kibana enables you to interactively explore, visualize, and share insights into your data and manage and monitor the stack.

Elasticsearch is where the indexing, search, and analysis magic happens.

 

2. What can Elasticsearch do, and what data can it store? It provides real-time search and analytics for all types of data.

Elasticsearch provides real-time search and analytics for all types of data.

 Whether you have structured or unstructured text, numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. 

You can go far beyond simple data retrieval and aggregate information to discover trends and patterns in your data.

 And as your data and query volume grows, the distributed nature of Elasticsearch enables your deployment to grow seamlessly right along with it.

 

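3. Elasticsearch use cases: it handles data quickly and flexibly, and the problem does not have to be a search problem; analytics and storage are covered as well.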

 

While not every problem is a search problem, Elasticsearch offers speed and flexibility to handle data in a wide variety of use cases:

* Add a search box to an app or website

* Store and analyze logs, metrics, and security event data

* Use machine learning to automatically model the behavior of your data in real time

* Automate business workflows using Elasticsearch as a storage engine

* Manage, integrate, and analyze spatial information using Elasticsearch as a geographic information system (GIS)

* Store and process genetic data using Elasticsearch as a bioinformatics research tool

 

 

4. Whatever problem you use Elasticsearch for, the way you work with data, documents, and indices is the same.

 

We’re continually amazed by the novel ways people use search. But whether your use case is similar to one of these, or you’re using Elasticsearch to tackle a new problem, the way you work with your data, documents, and indices in Elasticsearch is the same.

 

I. Data in: documents and indices

 

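1. Elasticsearch stores complex data structures that have been serialized as JSON documents. When a cluster has multiple Elasticsearch nodes, documents are distributed across the cluster and can be accessed immediately from any node.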

 

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents.

 When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.
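
As a rough illustration of what "serialized as JSON documents" means, the sketch below turns a nested Python structure into the JSON body that could be sent to an index. The index name `customer`, the document ID, and every field shown are hypothetical.

```python
import json

# A hypothetical nested record; a relational store would spread this over several tables.
customer = {
    "name": "Jane Doe",
    "age": 34,
    "address": {"city": "Berlin", "zip": "10115"},
    "orders": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Elasticsearch receives the whole structure as one JSON document,
# for example as the body of a request such as PUT /customer/_doc/1.
body = json.dumps(customer, indent=2)
print(body)

# The structure survives a round trip through JSON unchanged.
restored = json.loads(body)
```

The nested address and the list of orders stay inside the one document, which is the contrast with "rows of columnar data" drawn above.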

 

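2. Elasticsearch indexes documents and makes them fully searchable in near real time, within about 1 second. It uses an inverted index, which supports very fast full-text searches.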

When a document is stored, it is indexed and fully searchable in near real-time—within 1 second. 

Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches.

An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
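
The idea can be sketched in a few lines of Python. This is a toy model only: it splits on whitespace and lowercases, whereas Elasticsearch builds its inverted indices through Lucene with far more sophisticated analysis and storage. The function name and sample documents are made up for the illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique lowercase word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)

print(index["quick"])  # [1, 3]
print(index["the"])    # [1, 2]
```

Looking up a word is a single dictionary access instead of a scan over every document, which is why this structure suits full-text search.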

 

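3. Each document is a collection of fields, and each field is a key-value pair. Elasticsearch indexes the value of every field, and the index structure depends on the field's data type: text fields use inverted indices, while numeric and geo fields use BKD trees. Assembling and returning results from these per-field structures is what makes Elasticsearch queries fast.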

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. 

By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure. 

For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees.

 The ability to use the per-field data structures to assemble and return search results is what makes Elasticsearch so fast.

 

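4. Elasticsearch can also be schema-less: you do not have to specify how each field of a document should be handled. With dynamic mapping enabled, it detects new fields and adds them to the index automatically, mapping booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch data types.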

Elasticsearch also has the ability to be schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document.

 When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. 

This default behavior makes it easy to index and explore your data: just start indexing documents and Elasticsearch will detect and map booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch datatypes.
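
To make the detection step concrete, here is a deliberately simplified imitation in Python. It is not Elasticsearch's real logic: the function `guess_field_type` is hypothetical, and the single date pattern is an assumption chosen for the sketch (Elasticsearch's date detection has its own configurable formats, and strings are normally mapped to `text` with a `keyword` sub-field).

```python
from datetime import datetime

def guess_field_type(value):
    """Very rough imitation of dynamic field detection (not Elasticsearch's real logic)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed date format for this sketch
            return "date"
        except ValueError:
            return "text"
    return "object"

doc = {
    "active": True,
    "count": 7,
    "price": 9.99,
    "created": "2020-05-13",
    "title": "hello world",
}
mapping = {field: guess_field_type(value) for field, value in doc.items()}
print(mapping)
```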

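5. Ultimately, you know your data better than Elasticsearch does, so you can define rules to control dynamic mapping and define explicit mappings to take full control of how each field is stored and indexed.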

Ultimately, however, you know more about your data and how you want to use it than Elasticsearch can. You can define rules to control dynamic mapping and explicitly define mappings to take full control of how fields are stored and indexed.

6. Benefits of defining your own mappings

Defining your own mappings enables you to:

* Distinguish between full-text string fields and exact value string fields

* Perform language-specific text analysis

* Optimize fields for partial matching

* Use custom date formats

* Use data types such as geo_point and geo_shape that cannot be automatically detected
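
A minimal sketch of an explicit mapping that touches several of the points above, written as the request body for creating an index. The index name `shops` and all field names are hypothetical; the body is only built and printed here, not sent to a cluster.

```python
import json

# Request body for creating a hypothetical "shops" index with explicit mappings
# (for example, sent as PUT /shops).
create_index_body = {
    "mappings": {
        "properties": {
            # Full-text field analyzed with a language-specific analyzer.
            "description": {"type": "text", "analyzer": "english"},
            # Exact-value string, not analyzed.
            "status": {"type": "keyword"},
            # Date field with a custom format.
            "opened": {"type": "date", "format": "dd/MM/yyyy"},
            # geo_point is never detected automatically, so it must be declared.
            "location": {"type": "geo_point"},
        }
    }
}

print(json.dumps(create_index_body, indent=2))
```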

 

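7. It is often useful to index the same field in different ways for different purposes. For example, a string field can be indexed as a text field for full-text search and as a keyword field for sorting or aggregating your data.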

It’s often useful to index the same field in different ways for different purposes. 

For example, you might want to index a string field as both a text field for full-text search and as a keyword field for sorting or aggregating your data. 

Or, you might choose to use more than one language analyzer to process the contents of a string field that contains user input.
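
One way this is commonly expressed is with multi-fields, sketched below as plain request bodies. The field name `title`, the sub-field names `raw` and `fr`, and the search text are assumptions made for the example.

```python
import json

# Mapping body: "title" is indexed three ways under one source value.
multi_field_mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",  # default full-text analysis
                "fields": {
                    "raw": {"type": "keyword"},                      # for sorting and aggregations
                    "fr": {"type": "text", "analyzer": "french"},    # a second language analyzer
                },
            }
        }
    }
}

# Search body: full-text match on "title", sorted by the exact-value sub-field.
search_body = {
    "query": {"match": {"title": "quick fox"}},
    "sort": [{"title.raw": "asc"}],
}

print(json.dumps(multi_field_mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

The sub-fields are addressed with a dotted path (`title.raw`, `title.fr`), so each query can pick the representation that fits its purpose.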

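8. The analysis chain applied to a full-text field at index time is also used at search time: when you query a full-text field, the query text undergoes the same analysis before the terms are looked up in the index.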

The analysis chain that is applied to a full-text field during indexing is also used at search time.

 When you query a full-text field, the query text undergoes the same analysis before the terms are looked up in the index.
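
The following toy model shows why the query has to pass through the same analysis as the indexed text. The analyzer here (lowercasing plus stripping a little punctuation) is a stand-in invented for the sketch, not one of Elasticsearch's analyzers.

```python
def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, strip basic punctuation."""
    tokens = (raw.strip(".,!?") for raw in text.lower().split())
    return [token for token in tokens if token]

def build_index(docs):
    """Index every analyzed term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query_text, analyze_query=True):
    """Return IDs of documents matching any query term."""
    terms = analyze(query_text) if analyze_query else query_text.split()
    matching = set()
    for term in terms:
        matching |= index.get(term, set())
    return sorted(matching)

docs = {1: "Quick Brown Fox!", 2: "The lazy dog."}
index = build_index(docs)

print(search(index, "Quick Fox?"))                       # [1]
print(search(index, "Quick Fox?", analyze_query=False))  # []
```

Without analysis the raw tokens `Quick` and `Fox?` never match the stored terms `quick` and `fox`, so the second search finds nothing.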

 

II. Information out: search and analyze

 

1. Clients supported by Elasticsearch: Java, JavaScript, Go, .NET, PHP, Perl, Python, or Ruby.

 

For testing purposes, you can easily submit requests directly from the command line or through the Developer Console in Kibana. 

From your applications, you can use the Elasticsearch client for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python or Ruby.

 

Searching your data

 

2. Elasticsearch supports structured queries, full-text queries, and complex queries that combine the two.

The Elasticsearch REST APIs support structured queries, full text queries, and complex queries that combine the two. Structured queries are similar to the types of queries you can construct in SQL. For example, you could search the gender and age fields in your employee index and sort the matches by the hire_date field. Full-text queries find all documents that match the query string and return them sorted by relevance—how good a match they are for your search terms.
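
The employee example can be sketched as Query DSL request bodies, for instance for `POST /employee/_search`. The field values, the `about` field, and the age range are assumptions added for illustration; only `gender`, `age`, and `hire_date` come from the text above.

```python
import json

# Structured query: exact filters on gender and age, sorted by hire_date.
structured_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"gender": "F"}},
                {"range": {"age": {"gte": 30, "lte": 40}}},
            ]
        }
    },
    "sort": [{"hire_date": {"order": "desc"}}],
}

# Full-text query: results come back sorted by relevance score by default.
full_text_query = {"query": {"match": {"about": "rock climbing"}}}

# Combined query: full-text matching narrowed by a structured filter.
combined_query = {
    "query": {
        "bool": {
            "must": [{"match": {"about": "rock climbing"}}],
            "filter": [{"range": {"age": {"gte": 30}}}],
        }
    }
}

for body in (structured_query, full_text_query, combined_query):
    print(json.dumps(body, indent=2))
```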

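3. You can run phrase searches, similarity searches, and prefix searches, and get autocomplete suggestions. Non-textual data is indexed in optimized structures for high-performance geo and numerical queries. You can query with the Query DSL or with SQL-style queries, and JDBC and ODBC drivers let third-party applications interact with Elasticsearch.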

 

In addition to searching for individual terms, you can perform phrase searches, similarity searches, and prefix searches, and get autocomplete suggestions.

 

Have geospatial or other numerical data that you want to search? Elasticsearch indexes non-textual data in optimized data structures that support high-performance geo and numerical queries.

 

You can access all of these search capabilities using Elasticsearch’s comprehensive JSON-style query language (Query DSL). You can also construct SQL-style queries to search and aggregate data natively inside Elasticsearch, and JDBC and ODBC drivers enable a broad range of third-party applications to interact with Elasticsearch via SQL.
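
A few of these query styles, sketched as request bodies. The `employee` index and its fields are hypothetical, and the SQL statement is only an example of the kind of query that could be sent to the SQL endpoint (for instance `POST /_sql?format=txt`).

```python
import json

# Phrase search: the words must appear together, in this order.
phrase_query = {"query": {"match_phrase": {"about": "rock climbing"}}}

# Prefix search: matches terms that start with the given characters.
prefix_query = {"query": {"prefix": {"last_name": {"value": "smi"}}}}

# SQL-style query: the statement travels as a string inside a JSON body.
sql_request = {
    "query": (
        "SELECT last_name, age FROM employee "
        "WHERE age > 30 ORDER BY hire_date DESC LIMIT 5"
    )
}

for body in (phrase_query, prefix_query, sql_request):
    print(json.dumps(body, indent=2))
```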

 

Analyzing your data

 

4. Elasticsearch aggregations let you build complex summaries of your data and gain insight into key metrics, patterns, and trends.

Elasticsearch aggregations enable you to build complex summaries of your data and gain insight into key metrics, patterns, and trends. Instead of just finding the proverbial “needle in a haystack”, aggregations enable you to answer questions like:

 

 

* How many needles are in the haystack?

* What is the average length of the needles?

* What is the median length of the needles, broken down by manufacturer?

* How many needles were added to the haystack in each of the last six months?

 

You can also use aggregations to answer more subtle questions, such as:

* What are your most popular needle manufacturers?

* Are there any unusual or anomalous clumps of needles?
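
Several of the needle questions can be sketched as one aggregation request body. The index (say, `needles`) and its fields `length`, `manufacturer`, and `added` are hypothetical; the body is built and printed only.

```python
import json

needle_aggs = {
    "size": 0,  # return no individual hits; the hit total answers "how many needles"
    "aggs": {
        # Average length of all needles.
        "average_length": {"avg": {"field": "length"}},
        # One bucket per manufacturer (largest buckets first by default),
        # each with the median (50th percentile) length.
        "by_manufacturer": {
            "terms": {"field": "manufacturer"},
            "aggs": {
                "median_length": {
                    "percentiles": {"field": "length", "percents": [50]}
                }
            },
        },
        # Needles added per month, restricted to roughly the last six months.
        "added_recently": {
            "filter": {"range": {"added": {"gte": "now-6M/M"}}},
            "aggs": {
                "per_month": {
                    "date_histogram": {"field": "added", "calendar_interval": "month"}
                }
            },
        },
    },
}

print(json.dumps(needle_aggs, indent=2))
```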

 

But wait, there's more

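5. Want to automate time-series analysis? You can use machine learning features to build baselines of normal behavior and identify anomalous patterns.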

 

Want to automate the analysis of your time-series data? You can use machine learning features to create accurate baselines of normal behavior in your data and identify anomalous patterns. With machine learning, you can detect:

* Anomalies related to temporal deviations in values, counts, or frequencies

* Statistical rarity

* Unusual behaviors for a member of a population

 

 

III. Scalability and resilience: clusters, nodes, and shards

Document - index - shard - node - cluster

 

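1. Elasticsearch is built to be always available and to scale with your needs. It is distributed by nature: adding servers (nodes) to a cluster increases capacity, and Elasticsearch automatically distributes data and query load across all available nodes. You do not need to overhaul your application; Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the better.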

Elasticsearch is built to be always available and to scale with your needs.

It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity and Elasticsearch automatically distributes your data and query load across all of the available nodes.

No need to overhaul your application, Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the merrier.

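2. An Elasticsearch index is a logical grouping of one or more physical shards, and each shard is a self-contained index. By distributing the documents of an index across multiple shards, and those shards across multiple nodes, Elasticsearch provides redundancy against hardware failures and gains query capacity as nodes are added. As the cluster grows or shrinks, Elasticsearch automatically migrates shards to rebalance it.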

How does this work? 

Under the covers, an Elasticsearch index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index. 

By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster.

As the cluster grows (or shrinks), Elasticsearch automatically migrates shards to rebalance the cluster.
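
The two levels of distribution (documents to shards, shards to nodes) can be modeled roughly as follows. This is a simplified sketch under stated assumptions: `zlib.crc32` stands in for Elasticsearch's own routing hash, and round-robin placement stands in for its shard allocator, which weighs many more factors. All names are hypothetical.

```python
import zlib
from collections import Counter

def pick_shard(doc_id, num_primary_shards):
    """Simplified routing: stable hash of the document ID modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

def place_shards(num_primary_shards, nodes):
    """Round-robin placement of shards on nodes (a stand-in for the real allocator)."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_primary_shards)}

num_shards = 6
nodes = ["node-a", "node-b", "node-c"]

# Level 1: each document lands on exactly one shard.
doc_ids = [f"doc-{i}" for i in range(1000)]
docs_per_shard = Counter(pick_shard(doc_id, num_shards) for doc_id in doc_ids)

# Level 2: shards are spread over the available nodes.
placement = place_shards(num_shards, nodes)

print(dict(docs_per_shard))
print(placement)
```

Because the shard is derived from the ID and the shard count, the same document always maps to the same shard as long as the number of shards stays unchanged.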

 

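3. There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard, and a replica shard is a copy of a primary shard. Replicas provide redundant copies that protect against hardware failure and add capacity for read requests such as searching or retrieving a document.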

There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.

 

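4. The number of primary shards in an index is fixed when the index is created, but the number of replica shards can be changed at any time without interrupting indexing or query operations.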

The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time, without interrupting indexing or query operations.
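
In terms of index settings, this might look like the sketch below: the primary count is given once at creation, and only the replica count is updated later. The index name `my-index` and the numbers are assumptions; the helper `total_shard_copies` is hypothetical and just does the arithmetic.

```python
import json

# Body for creating a hypothetical index (e.g. PUT /my-index):
# the primary shard count is chosen here and stays fixed afterwards.
create_body = {
    "settings": {
        "index": {"number_of_shards": 3, "number_of_replicas": 1}
    }
}

# Body for a later settings update (e.g. PUT /my-index/_settings):
# only the replica count changes.
update_body = {"index": {"number_of_replicas": 2}}

def total_shard_copies(primaries, replicas_per_primary):
    """Each primary plus its replicas: primaries * (1 + replicas)."""
    return primaries * (1 + replicas_per_primary)

before = total_shard_copies(3, 1)
after = total_shard_copies(3, 2)

print(json.dumps(create_body), before)
print(json.dumps(update_body), after)
```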

 

It depends...  (choosing the number and size of shards)

 

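5. Shard size and the number of primary shards involve many performance trade-offs: more shards mean more overhead in maintaining those indices, and larger shards take longer to move when Elasticsearch rebalances the cluster.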

There are a number of performance considerations and trade-offs with respect to shard size and the number of primary shards configured for an index.

The more shards, the more overhead there is simply in maintaining those indices. 

The larger the shard size, the longer it takes to move shards around when Elasticsearch needs to rebalance a cluster.

 

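6. Querying lots of small shards makes per-shard processing faster, but more queries mean more overhead, so querying a smaller number of larger shards might be faster. In short, it depends.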

Querying lots of small shards makes the processing per shard faster, but more queries mean more overhead, so querying a smaller number of larger shards might be faster.

In short…it depends.

 

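7. As a starting point, keep the average shard size between a few GB and a few tens of GB; for time-based data, 20GB to 40GB is common. To avoid the "gazillion shards" problem, remember that the number of shards a node can hold is proportional to its available heap space; as a general rule, keep fewer than 20 shards per GB of heap. The best way to settle on a configuration is to test with your own data and queries.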

As a starting point:

* Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.

* Avoid the gazillion shards problem. The number of shards a node can hold is proportional to the available heap space. As a general rule, the number of shards per GB of heap space should be less than 20.

 

The best way to determine the optimal configuration for your use case is through testing with your own data and queries.
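
The rules of thumb above reduce to simple arithmetic, sketched here. The helper names, the 30GB target (picked from the 20GB to 40GB range), and the sample figures are assumptions for illustration, not recommendations.

```python
import math

def shard_ceiling_for_heap(heap_gb, shards_per_gb=20):
    """Rule-of-thumb ceiling on shards per node; the guidance is to stay below it."""
    return heap_gb * shards_per_gb

def primary_shards_for(data_gb, target_shard_gb=30):
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(data_gb / target_shard_gb))

# Example: 600GB of data, nodes with 8GB of heap each.
shards_needed = primary_shards_for(600)   # 600 / 30 = 20
ceiling = shard_ceiling_for_heap(8)       # 8 * 20 = 160

print(shards_needed, ceiling)
```

Numbers like these only give a starting point; testing with real data and queries remains the deciding step.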

 

In case of disaster  (introducing cross-cluster replication)

 

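8. For performance reasons, all nodes in a cluster should be on the same network; balancing shards across nodes in different data centers takes too long. But high-availability architectures require that you not put all of your eggs in one basket: when one location suffers a major outage, servers in another location must be able to take over seamlessly. The answer is cross-cluster replication (CCR).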

 

For performance reasons, the nodes within a cluster need to be on the same network. 

Balancing shards in a cluster across nodes in different data centers simply takes too long. 

But high-availability architectures demand that you avoid putting all of your eggs in one basket. 

In the event of a major outage in one location, servers in another location need to be able to take over. Seamlessly. The answer? Cross-cluster replication (CCR).

 

 

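9. CCR automatically synchronizes indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters that serve read requests close to your users.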

 

CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.

 

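10. CCR is active-passive. The index on the primary cluster is the active leader and handles all write requests; indices replicated to secondary clusters are read-only followers.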

Cross-cluster replication is active-passive.

The index on the primary cluster is the active leader index and handles all write requests. 

Indices replicated to secondary clusters are read-only followers.
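
A minimal sketch of the leader/follower split. The follow request body uses the `remote_cluster` and `leader_index` fields of the CCR follow API; the cluster alias `primary-dc`, the index names, and the helper `target_index` are hypothetical, and nothing is sent to a cluster here.

```python
import json

# Issued on the secondary cluster, e.g. PUT /logs-copy/_ccr/follow
follower_index = "logs-copy"
follow_path = f"/{follower_index}/_ccr/follow"
follow_body = {
    "remote_cluster": "primary-dc",  # assumed alias under which the primary cluster is registered
    "leader_index": "logs",          # index on the primary cluster to replicate
}

def target_index(operation):
    """Active-passive routing: writes go to the leader, reads may use the follower.

    The leader lives on the primary cluster and the follower on the secondary one.
    """
    if operation == "write":
        return follow_body["leader_index"]
    if operation == "read":
        return follower_index
    raise ValueError(f"unknown operation: {operation}")

print(follow_path, json.dumps(follow_body))
print(target_index("write"), target_index("read"))
```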

 

Care and feeding  (Kibana is recommended as the management tool)

 

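11. As with any enterprise system, you need tools to secure, manage, and monitor your Elasticsearch clusters. The security, monitoring, and administrative features integrated into Elasticsearch let you use Kibana as a control center for managing a cluster. Features such as data rollups and index lifecycle management help you manage your data intelligently over time.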

 

As with any enterprise system, you need tools to secure, manage, and monitor your Elasticsearch clusters. Security, monitoring, and administrative features that are integrated into Elasticsearch enable you to use Kibana as a control center for managing a cluster. Features like data rollups and index lifecycle management help you intelligently manage your data over time.

 

 

 

 

 

 

 

 

 

 

 

 

 

 
