学技术学英文:elasticsearch部署架构-容错设计

Unless you're running Elasticsearch on a single node, prepare to design for failure. Designing for failure means running your cluster in multiple locations and be ready to lose a whole data center without service interruption. It's not theoretical thinking here. You WILL lose a whole data center several times during your cluster's life.

The minimum requirement for a fault tolerant cluster is:

  • 3 locations to host your nodes. 2 locations to run half of your cluster, and one for the backup master node.
  • 3 master nodes. You need an odd number of eligible master nodes to avoid split brains when you lose a whole data center. Put one master node in each location so you hopefully never lose the quorum.
  • 2 ingest nodes, one in each primary data center.

As many data nodes as you need, split evenly between both main locations.

Architecture of a fault tolerant Elasticsearch cluster

学技术学英文:elasticsearch部署架构-容错设计_第1张图片

Elasticsearch design for failure

Elasticsearch provides an interesting feature called shard allocation awareness. It allows to split the primary shards and their replica in separated zones. Allocate nodes within a same data center to a same zone to limit the odds of having your cluster go red.

cluster:
  routing:
    allocation:
      awareness: 
        attributes: "rack_id"

node:
  attr:
    rack_id: "dontsmokecrack"

Using rack_id on the ingest nodes is interesting too, as Elasticsearch will run the queries on the closest neighbours. A query sent to the ingest node located in the datacenter 1 will more likely run on the same data center data nodes.

中文总结:

  1. 设计容错集群的必要性

    • 在分布式环境中,故障是不可避免的,因此必须设计容错集群以应对数据中心级别的故障。

    • 集群需要能够在多个地理位置运行,并确保在失去一个数据中心时服务不中断。

  2. 容错集群的最低要求

    • 3 个地理位置:两个主要数据中心运行集群的大部分节点,第三个位置用于备份主节点。

    • 3 个主节点:为了避免脑裂问题,主节点数量必须为奇数,每个地理位置部署一个主节点。

    • 2 个摄入节点:每个主要数据中心部署一个摄入节点。

    • 多个数据节点:根据需求部署数据节点,并均匀分布在两个主要数据中心。

  3. 分片分配感知(Shard Allocation Awareness)

    • Elasticsearch 提供了分片分配感知功能,可以将主分片和副本分片分配到不同的区域。

    • 通过配置 rack_id,将同一数据中心的节点分配到同一区域,减少集群故障的风险

  4. 查询优化

    • 在摄入节点上使用 rack_id,可以使查询更可能在最近的数据中心执行,从而减少延迟。

    • 示例:发送到数据中心 1 的查询更可能在数据中心 1 的数据节点上执行。

你可能感兴趣的:(elasticsearch,架构,全文检索,容错)