A few concepts are core to Elasticsearch. Understanding them from the outset will make the rest of the learning process much easier.
Elasticsearch is a near-real-time search platform: there is a slight latency (normally one second) between the time you index a document and the time it becomes searchable.
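To make the "near real time" idea concrete, here is a toy model in Python. It is only a sketch and not how Elasticsearch is implemented: the class name and its methods are hypothetical, and the explicit refresh() call stands in for the periodic step (Elasticsearch calls it a refresh) that makes recently indexed documents searchable.

```python
class ToyNearRealTimeIndex:
    """Toy model of near-real-time search; not Elasticsearch's implementation."""

    def __init__(self):
        self.pending = []      # indexed, but not yet visible to search
        self.searchable = []   # visible to search

    def index(self, doc):
        # A newly indexed document first waits in the pending buffer.
        self.pending.append(doc)

    def refresh(self):
        # A refresh makes everything indexed so far searchable.
        self.searchable.extend(self.pending)
        self.pending.clear()

    def search(self, word):
        # Only documents that were refreshed can be found.
        return [doc for doc in self.searchable if word in doc]


idx = ToyNearRealTimeIndex()
idx.index("quick brown fox")
before = idx.search("fox")   # no refresh has happened yet
idx.refresh()
after = idx.search("fox")    # the document is visible now
print(before, after)
```

The gap between index() and refresh() in this sketch plays the role of the roughly one-second latency described above.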
Cluster
A cluster is a collection of one or more nodes (servers) that together hold your entire data set and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster name in different environments; otherwise nodes might end up joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it. You may also have multiple independent clusters, each with its own unique cluster name.
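The naming advice above could be captured in a small helper. This is a sketch under assumptions: the function names and the list of allowed environments are made up for illustration, while cluster.name is the elasticsearch.yml setting that names the cluster.

```python
def cluster_name(app, environment):
    """Build a per-environment cluster name such as 'logging-dev'."""
    allowed = ("dev", "stage", "prod")  # hypothetical set of environments
    if environment not in allowed:
        raise ValueError(f"unknown environment: {environment!r}")
    return f"{app}-{environment}"


def config_line(app, environment):
    # Render the line you would place in that environment's elasticsearch.yml.
    return f"cluster.name: {cluster_name(app, environment)}"


lines = [config_line("logging", env) for env in ("dev", "stage", "prod")]
print("\n".join(lines))
```

Generating the name from the environment, rather than typing it by hand, makes it harder to start a development node with the production cluster name by accident.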
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name, which by default is a random Marvel character name assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration, where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch, which means that if you start up a number of nodes on your network, and assuming they can discover each other, they will all automatically form and join a single cluster named elasticsearch.
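As a rough illustration of joining by name, the sketch below groups nodes by their configured cluster name. It assumes every node can discover every other node, and it ignores how discovery really works; form_clusters and the node names are hypothetical.

```python
from collections import defaultdict


def form_clusters(nodes):
    """Group (node_name, cluster_name) pairs by cluster name.

    Assumes all nodes can discover each other on the network.
    """
    clusters = defaultdict(list)
    for node_name, cluster in nodes:
        clusters[cluster].append(node_name)
    return dict(clusters)


nodes = [
    ("node-a", "elasticsearch"),   # default cluster name
    ("node-b", "elasticsearch"),   # default cluster name
    ("node-c", "logging-dev"),     # explicitly configured
]
clusters = form_clusters(nodes)
print(clusters)
```

The two nodes left on the default name end up in the same cluster, while the node with its own cluster name forms a separate single-node cluster.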
In a single cluster, you can have as many nodes as you want. Furthermore, if no other Elasticsearch nodes are currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (which must be all lowercase), and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.
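Because the index name appears in every request, it can help to check it up front. The sketch below enforces only the all-lowercase rule mentioned above (Elasticsearch has further naming restrictions that are not covered here) and builds the /{index}/_search path for a few example indexes. The helper names are hypothetical.

```python
def check_index_name(name):
    """Reject names that are empty or not all lowercase."""
    if not name or name != name.lower():
        raise ValueError(f"index name must be all lowercase: {name!r}")
    return name


def search_path(index):
    # /{index}/_search is the REST endpoint for searching a single index.
    return f"/{check_index_name(index)}/_search"


paths = [search_path(name) for name in ("customer", "catalog", "orders")]
print(paths)

try:
    search_path("Customer")   # contains an uppercase letter
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```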
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics are completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document must actually be indexed/assigned to a type inside an index.
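To show how index, type, and document fit together, the sketch below builds (but does not send) a request that would index one JSON document. It assumes a version of Elasticsearch that still uses types, as this section does; the index name customer, the type name external, and the field values are made up for illustration.

```python
import json


def index_request(index, doc_type, doc_id, document):
    """Return (method, path, body) for indexing one document under index/type/id."""
    path = f"/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document, sort_keys=True)  # the document travels as JSON
    return "PUT", path, body


method, path, body = index_request(
    "customer", "external", 1, {"name": "John Doe"}
)
print(method, path)
print(body)
```

Keeping the request as plain data makes it easy to inspect before handing it to whichever HTTP client you prefer.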
An index can potentially store a large amount of data that exceeds the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node, or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
· It allows you to horizontally split/scale your content volume
· It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
The mechanics of how a shard is distributed, and of how its documents are aggregated back into search responses, are completely managed by Elasticsearch and are transparent to you as the user.
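Although Elasticsearch handles this for you, the basic idea of spreading documents over shards can be sketched as a hash of a routing value (here, the document id) taken modulo the number of primary shards. This is a simplified illustration: zlib.crc32 is a stand-in hash chosen for the sketch, not the hash Elasticsearch uses, and pick_shard is a hypothetical name.

```python
import zlib


def pick_shard(routing_key, number_of_primary_shards):
    """Map a routing key (here, the document id) to a shard number."""
    # zlib.crc32 is a stand-in hash used for this sketch only.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

# Count how many documents landed on each of the 5 shards.
counts = [list(placement.values()).count(shard) for shard in range(5)]
print(counts)
```

A scheme like this also hints at why the number of primary shards is fixed once the index exists: changing the modulus would send existing documents to different shards.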
In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
· It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
· It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
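The rule that a replica never shares a node with its primary can be illustrated with a toy allocator. This is a sketch, not Elasticsearch's allocation algorithm: the allocate function, the node names, and the simple round-robin placement are all assumptions made for illustration.

```python
def allocate(nodes, primaries, replicas):
    """Return {(shard, copy): node}; copy 0 is the primary, 1..replicas are replicas.

    Copies that cannot get a node of their own stay unassigned (None).
    """
    layout = {}
    for shard in range(primaries):
        for copy in range(replicas + 1):
            if copy < len(nodes):
                # Offsetting by the copy number places every copy of a
                # shard on a different node.
                layout[(shard, copy)] = nodes[(shard + copy) % len(nodes)]
            else:
                layout[(shard, copy)] = None
    return layout


two_nodes = allocate(["node-a", "node-b"], primaries=5, replicas=1)
one_node = allocate(["node-a"], primaries=5, replicas=1)
print(two_nodes[(0, 0)], two_nodes[(0, 1)])
print(one_node[(0, 0)], one_node[(0, 1)])
```

With two nodes, every replica lands on the node that does not hold its primary; with a single node, the replicas have nowhere to go and remain unassigned.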
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index has primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of shards after the fact.
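The two moments described above, creation time and afterwards, map onto two different requests. The sketch below only builds the requests and does not send them. The index name customer and the helper names are hypothetical; number_of_shards and number_of_replicas are the index settings involved.

```python
import json


def create_index_request(index, shards, replicas):
    """Request that creates an index with explicit shard and replica counts."""
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    return "PUT", f"/{index}", json.dumps(body, sort_keys=True)


def update_replicas_request(index, replicas):
    """Request that changes only the replica count of an existing index."""
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index}/_settings", json.dumps(body, sort_keys=True)


create = create_index_request("customer", shards=5, replicas=1)
update = update_replicas_request("customer", replicas=2)
print(create)
print(update)
```

Note that the second helper deliberately has no parameter for the shard count, mirroring the restriction that it cannot be changed once the index has been created.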
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
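The arithmetic behind these numbers is simple enough to write down. In the sketch below, total_shards counts every shard copy an index asks for, and assigned_shards estimates how many of them can be placed, assuming that no two copies of the same shard may share a node. Both function names are hypothetical.

```python
def total_shards(primaries, replicas):
    """Primary shards plus all of their replica copies."""
    return primaries * (1 + replicas)


def assigned_shards(primaries, replicas, node_count):
    """Shard copies that can be placed when copies of one shard need distinct nodes."""
    copies_per_shard = min(1 + replicas, node_count)
    return primaries * copies_per_shard


print(total_shards(5, 1))                   # defaults: 5 primaries, 1 replica
print(assigned_shards(5, 1, node_count=1))  # single node: replicas stay unassigned
print(assigned_shards(5, 1, node_count=2))  # two nodes: everything fits
```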
Each Elasticsearch shard is a Lucene index, and there is a maximum number of documents a single Lucene index can hold. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
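The limit above gives a hard lower bound on how many primary shards an index needs for a given document count. The sketch below computes that bound. It considers the document limit only, so real sizing would also have to account for disk space and search performance, and min_shards_for is a hypothetical helper.

```python
INTEGER_MAX_VALUE = 2**31 - 1                  # Java's Integer.MAX_VALUE
MAX_DOCS_PER_SHARD = INTEGER_MAX_VALUE - 128   # 2,147,483,519

CAT_SHARDS_PATH = "/_cat/shards"               # endpoint for monitoring shards


def min_shards_for(doc_count):
    """Smallest number of primary shards that stays under the per-shard limit."""
    # Integer ceiling division avoids floating-point rounding on large counts.
    return max(1, (doc_count + MAX_DOCS_PER_SHARD - 1) // MAX_DOCS_PER_SHARD)


print(MAX_DOCS_PER_SHARD)
print(min_shards_for(1_000_000_000))   # one billion documents
print(min_shards_for(5_000_000_000))   # five billion documents
```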
With that out of the way, let’s get started with the fun part…