COMP9313-Introduction to ElasticSearch

Indexing Overview

Why do we need indexing?

  • Much information is represented as text (web pages, business documents, health records).
  • Searching can be done through a linear scan (e.g. with grep), to a certain extent.
  • Linear scan has its limitations:
    Scanning large collections of documents (with billions or trillions of words) becomes too slow for most applications, especially interactive ones.
    More flexible matching operations may be impractical with grep.
    Ranked retrieval (ordering results by how well they match the given criteria) is not supported by a plain scan.

Inverted Index

Key idea: index that maps terms to the documents where they occur.

[Figure: Inverted Index]

Steps to build an inverted index

  1. Collect documents that need to be indexed.
  2. Turn each document into a list of tokens (tokenization).
  3. Preprocess the tokens to produce a normalized list (e.g. stemming).
  4. Create a list of terms and the corresponding postings (the documents where they occur).
  5. Sort the terms and postings.
  6. Record statistics such as document frequency in the dictionary.
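
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sample documents are made up, and `normalize` is a crude stand-in for a real stemmer (it only strips a trailing "s"), not what Lucene actually does.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(token):
    """Step 3: crude stand-in for stemming (strips a trailing 's')."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def build_inverted_index(docs):
    """Steps 4-6: map each term to its sorted postings and document frequency."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Sort terms and postings; store the document frequency (df) per term.
    return {term: {"df": len(ids), "postings": sorted(ids)}
            for term, ids in sorted(postings.items())}

# Step 1: hypothetical documents keyed by document ID.
docs = {
    1: "Elasticsearch ranks documents",
    2: "Lucene builds an inverted index",
    3: "Documents live in an index",
}
index = build_inverted_index(docs)
print(index["document"])  # {'df': 2, 'postings': [1, 3]}
```

Keeping the postings sorted is what later allows queries to be answered by merging lists instead of scanning documents.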


[Figure: Boolean queries using an inverted index]
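
A Boolean AND query over an inverted index amounts to intersecting the postings lists of the query terms. The sketch below uses the usual linear merge of two sorted lists; the terms and document IDs are hypothetical.

```python
def intersect(p1, p2):
    """Return the doc IDs that appear in both sorted postings lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller doc ID
        else:
            j += 1
    return result

# Hypothetical postings lists for two terms.
postings = {
    "search": [1, 2, 4, 11, 31],
    "engine": [2, 4, 5, 31, 54],
}

# Query: search AND engine
print(intersect(postings["search"], postings["engine"]))  # [2, 4, 31]
```

Because both lists are sorted, the merge touches each posting at most once, so the cost grows with the combined length of the two lists rather than with the size of the collection.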

ElasticSearch

Open-source search engine based on Apache Lucene.
Provides a distributed, full-text search engine with a REST API.
Document-oriented (JSON as the serialization format for documents).
Developed in Java (cross-platform).
Focused on scalability: distributed by design.
Highly efficient search.

ElasticSearch Use Cases
E-commerce
Storage, analysis and mining of transaction data
Analytics/Business intelligence

ElasticSearch Elements
Cluster

  • An ElasticSearch cluster is a collection of nodes (servers).
  • Identified by a unique name.
  • Data is stored across this collection of nodes.
  • Provides indexing and search capabilities across all nodes.

Node

  • A single server in the cluster.
  • Identified by a unique name.
  • Stores all or parts of the whole dataset.
  • Contributes to the indexing and search capabilities of ElasticSearch.

Shard

  • An individual instance of an Apache Lucene index.
  • ElasticSearch leverages Lucene indexing in a distributed environment.

Index

  • Distributed across shards.
  • Collection of documents.
  • Identifiable by a name.
  • Can have replicas (for fault tolerance).
  • Analogy to RDBMS: Index -> Database

Type

  • Category of documents of the same class.
  • Types have a name and mapping.
  • Indexes can have one or more types (in ElasticSearch versions before 6.0; mapping types were later deprecated and removed).
  • Analogy to RDBMS: Type -> Table

Mapping

  • Defines the fields contained in a given type.
  • Describes data type for each field.
  • Describes how fields must be indexed and stored.
  • Dynamic mapping is possible.
  • Analogy to RDBMS: Mapping -> Schema of Table
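
As an illustration, a mapping can be written as a JSON body like the one built below. The field names (`title`, `author`, `year`) are hypothetical, and the layout assumes the typeless mapping format of recent ElasticSearch versions; older releases nest `properties` under a type name instead.

```python
import json

# Hypothetical mapping: each field gets a data type that controls
# how it is indexed ("text" is analyzed, "keyword" is matched exactly).
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "author": {"type": "keyword"},
            "year": {"type": "integer"},
        }
    }
}

# JSON body that would be sent when creating the index.
body = json.dumps(mapping, indent=2)
print(body)
```

With dynamic mapping, fields that are not declared this way get a type inferred from the first document that contains them.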

Document

  • Basic unit of information.
  • Document contains fields.
  • ElasticSearch uses JSON to represent documents.
  • Analogy to RDBMS: Document -> Tuple
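
To make the RDBMS analogy concrete, the sketch below builds (but does not send) a request that stores one JSON document. The index name, type name and field values are hypothetical. The path follows the typed `/index/type/id` layout of older ElasticSearch versions described in these notes; newer versions use `/index/_doc/id` instead.

```python
import json

def index_request(index, doc_type, doc_id, document):
    """Return (method, path, body) for a request that stores one document."""
    path = f"/{index}/{doc_type}/{doc_id}"
    return "PUT", path, json.dumps(document)

# A document is a JSON object made of fields.
document = {
    "title": "Introduction to ElasticSearch",
    "course": "COMP9313",
    "year": 2018,  # hypothetical value
}

method, path, body = index_request("courses", "lecture", 1, document)
print(method, path)  # PUT /courses/lecture/1
```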

Replicas

  • Copy of a shard
  • Provides fault tolerance
  • Scalability -> Queries can be executed in parallel.
  • Default ElasticSearch configuration (versions before 7.0):
    5 primary shards per index.
    1 replica for each primary shard.
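
A short worked example of what these defaults imply: with 5 primary shards and 1 replica each, an index consists of 10 shard copies in total. The helper name `total_shards` is made up for this sketch; `number_of_shards` and `number_of_replicas` are the index settings that control the two values.

```python
import json

def total_shards(primaries, replicas):
    """Total shard copies: each primary plus its replicas."""
    return primaries * (1 + replicas)

# Index settings corresponding to the defaults listed above.
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}
s = settings["settings"]

print(total_shards(s["number_of_shards"], s["number_of_replicas"]))  # 10
print(json.dumps(settings))
```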
[Figure: ElasticSearch Ecosystem]
[Figure: Search APIs: Query String]
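
A query-string search passes the query in the `q` parameter of the `_search` endpoint. The sketch below only builds the URL and does not contact a server. The host, index name and field are assumptions made for illustration.

```python
from urllib.parse import urlencode

def query_string_url(base_url, index, query):
    """Build a _search URL that carries the query in the q parameter."""
    return f"{base_url}/{index}/_search?{urlencode({'q': query})}"

# Hypothetical local node, index and field.
url = query_string_url("http://localhost:9200", "books", "title:elasticsearch")
print(url)  # http://localhost:9200/books/_search?q=title%3Aelasticsearch
```

This style is convenient for quick ad hoc searches from a browser or command line, since the whole request fits in the URL.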

[Figure: Search APIs: DSL]
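
With the query DSL, the query is sent as a JSON body to the `_search` endpoint instead of being packed into the URL. The sketch below builds two such bodies: a simple `match` query, and a `bool` query that combines a `match` clause with a `range` filter. Field names and values are hypothetical, and nothing is sent to a server.

```python
import json

def match_query(field, text):
    """Full-text match on a single field."""
    return {"query": {"match": {field: text}}}

def bool_query(field, text, year_field, min_year):
    """Match on a text field, filtered to documents from min_year onward."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "filter": [{"range": {year_field: {"gte": min_year}}}],
            }
        }
    }

simple_body = json.dumps(match_query("title", "elasticsearch"))
combined_body = json.dumps(bool_query("title", "elasticsearch", "year", 2015))
print(simple_body)  # {"query": {"match": {"title": "elasticsearch"}}}
```

The DSL form is more verbose than a query string, but it makes it possible to nest and combine clauses in a structured way.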
