Indexing Overview

Why do we need indexing?

Much information is represented as text (web pages, business documents, health records).
To a certain extent, searching can be done through a linear scan.
Linear scan has its limitations:
Scanning large collections of documents (with billions or trillions of words) is too slow for most applications, especially interactive ones.
More flexible matching operations may be impractical with grep.
Ranked retrieval (ordering results by a given match criterion) is not supported by a plain scan.

Inverted Index

Key idea: an index that maps terms to the documents in which they occur.


Steps to build an inverted index

Collect the documents that need to be indexed.
Turn each document into a list of tokens (tokenization).
Preprocess the tokens to produce a normalized list (e.g., stemming).
Create a list of terms and the corresponding postings (the documents where they occur).
Sort the terms and postings.
Record statistics, such as document frequency, in the dictionary.
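
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene or ElasticSearch implement indexing: the sample documents, the function names, and the toy "strip a trailing s" normalization (standing in for a real stemmer) are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (simplified tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(token):
    """Toy normalization: drop a trailing 's' from longer tokens.
    A real system would use a proper stemmer here."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_inverted_index(docs):
    """Map each term to its document frequency and sorted postings list."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Sort terms and postings, and record the document frequency per term.
    return {
        term: {"df": len(ids), "postings": sorted(ids)}
        for term, ids in sorted(postings.items())
    }


# Hypothetical sample collection.
docs = {
    1: "Search engines index documents",
    2: "An inverted index maps terms to documents",
    3: "Linear scan is slow",
}
index = build_inverted_index(docs)
print(index["index"])  # {'df': 2, 'postings': [1, 2]}
```

Looking up a term is now a dictionary access instead of a scan over every document.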

Boolean queries using Inverted Index
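
With sorted postings lists, Boolean operators reduce to list operations: AND is an intersection, OR is a union, and AND NOT is a difference. The sketch below assumes a small hand-written index (the terms and document IDs are made up) and hypothetical helper names.

```python
def intersect(p1, p2):
    """AND: walk two sorted postings lists in step, keeping shared IDs."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: every ID that appears in either list, sorted."""
    return sorted(set(p1) | set(p2))


def difference(p1, p2):
    """AND NOT: IDs in p1 that do not appear in p2."""
    excluded = set(p2)
    return [doc_id for doc_id in p1 if doc_id not in excluded]


# Hypothetical postings lists (term -> sorted document IDs).
index = {
    "elastic": [1, 2, 4, 7],
    "search": [2, 3, 4, 8],
    "lucene": [4, 5],
}

print(intersect(index["elastic"], index["search"]))  # [2, 4]
print(union(index["elastic"], index["lucene"]))      # [1, 2, 4, 5, 7]
print(difference(index["search"], index["lucene"]))  # [2, 3, 8]
```

The merge in intersect touches each posting at most once, which is why keeping postings sorted matters.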

ElasticSearch

Open-source search engine based on Apache Lucene.
Provides a distributed, full-text search engine with a REST API.
Document-oriented (JSON as the serialization format for documents).
Developed in Java (cross-platform).
Focused on scalability: distributed by design.
Highly efficient search.

ElasticSearch Use Cases
E-commerce
Storage, analysis and mining of transaction data
Analytics/Business intelligence

ElasticSearch Elements
Cluster

An ElasticSearch cluster is a collection of nodes (servers).
Identified by a unique name.
Data is stored in this collection of nodes.
Provides indexing and search capabilities across all nodes.

Node

A single server in the cluster.
Identified by a unique name.
Stores all or parts of the whole dataset.
Contributes to the indexing and search capabilities of ElasticSearch.

Shard

An individual instance of an Apache Lucene index.
ElasticSearch leverages Lucene indexing in a distributed environment.

Index

Distributed across shards.
Collection of documents.
Identifiable by a name.
Can be replicated (replicas provide fault tolerance).
Analogy to RDBMS: Index -> Database
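
To illustrate how an index can be spread across shards, the sketch below assigns each document to a primary shard by hashing a routing value (here the document ID) modulo the number of primary shards. This is a simplified assumption for illustration: zlib.crc32 is a stand-in hash, ElasticSearch uses its own hash function internally, and the document IDs and the function name are made up.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Choose a shard number in [0, num_primary_shards) from a routing value.
    crc32 is only a stand-in for the hash a real cluster would use."""
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


# Hypothetical document IDs placed on 5 primary shards.
doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
print(placement)
```

Because the shard is derived from the ID alone, the same document is always found on the same shard, as long as the number of primary shards does not change.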

Type

Category of documents of the same class.
Types have a name and mapping.
Indexes can have one or more types.
Analogy to RDBMS: Type -> Table

Mapping

Defines the fields contained in a given type.
Describes data type for each field.
Describes how fields must be indexed and stored.
Dynamic mapping is possible.
Analogy to RDBMS: Mapping -> Schema of Table
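
As a rough illustration, a mapping can be written as a JSON object that lists each field with its data type and indexing options. The field names below are hypothetical; the `properties` key, the field types, and the `index` option follow the mapping format as I understand it, so treat this as a sketch rather than a reference.

```python
import json

# Hypothetical mapping for a "product" type: one entry per field.
product_mapping = {
    "properties": {
        "name": {"type": "text"},          # analyzed for full-text search
        "sku": {"type": "keyword"},        # exact-match value
        "price": {"type": "float"},
        "created_at": {"type": "date"},
        # Kept in the stored document but not searchable.
        "internal_note": {"type": "text", "index": False},
    }
}

print(json.dumps(product_mapping, indent=2))
```

With dynamic mapping, fields missing from such a definition are added automatically when a document containing them is first indexed.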

Document

Basic unit of information.
Document contains fields.
ElasticSearch uses JSON to represent documents.
Analogy to RDBMS: Document -> Tuple
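
A document is simply a JSON object whose keys are the fields. The example below uses made-up product data to show the round trip between a Python dictionary and the JSON text that would be sent over the REST API.

```python
import json

# Hypothetical product document; each key is a field.
document = {
    "name": "Mechanical keyboard",
    "sku": "KB-001",
    "price": 79.5,
    "created_at": "2019-05-01",
}

serialized = json.dumps(document)   # JSON text, as sent to the server
restored = json.loads(serialized)   # back to a Python dictionary
print(serialized)
```

In the RDBMS analogy, the document plays the role of a tuple and its fields play the role of columns.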

Replicas

A copy of a shard.
Provides fault tolerance.
Scalability: queries can be executed in parallel.
Default ElasticSearch configuration (versions before 7.0):
5 primary shards.
1 replica for each primary shard.
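
The defaults above translate into index settings and a total shard count as follows. The setting names `number_of_shards` and `number_of_replicas` are the index settings I am aware of; the helper function is a hypothetical name for this sketch.

```python
def total_shards(primaries, replicas_per_primary):
    """Each primary shard has `replicas_per_primary` copies in addition to itself."""
    return primaries * (1 + replicas_per_primary)


# Index settings matching the defaults described above.
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}

s = settings["settings"]
total = total_shards(s["number_of_shards"], s["number_of_replicas"])
print(total)  # 10
```

So an index created with these defaults occupies 10 shards across the cluster: 5 primaries and 5 replicas.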


ElasticSearch Ecosystem


Search APIs: Query String
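
A query-string search passes the query in the `q` parameter of the `_search` endpoint. The sketch below only builds the request URL and does not contact a server; the host, the index name `products`, and the field name are assumptions for the example.

```python
from urllib.parse import urlencode


def query_string_url(base_url, index, query):
    """Build a URL for a query-string search against one index."""
    return f"{base_url}/{index}/_search?{urlencode({'q': query})}"


# Hypothetical local node and index.
url = query_string_url("http://localhost:9200", "products", "name:keyboard")
print(url)  # http://localhost:9200/products/_search?q=name%3Akeyboard
```

The URL can then be requested with any HTTP client, for example a GET from curl or a browser.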


Search APIs: DSL
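
With the query DSL, the search is described by a JSON body sent to the `_search` endpoint instead of a URL parameter. The sketch below builds such a body for a hypothetical product index; the field names and values are made up, and the body is only serialized, not sent.

```python
import json

# Match "keyboard" in the name field, restricted to a price range.
search_body = {
    "query": {
        "bool": {
            "must": [{"match": {"name": "keyboard"}}],
            "filter": [{"range": {"price": {"gte": 50, "lte": 100}}}],
        }
    },
    "size": 10,  # return at most 10 hits
}

payload = json.dumps(search_body)
print(payload)
```

Compared with the query string, the DSL makes it easier to combine clauses and to separate scoring conditions (must) from plain filtering (filter).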
