[Summary] The Solr Search Service

1, Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary results.
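
The HTTP interaction described in item 1 can be sketched as follows. This is a minimal illustration that only builds the requests and does not send them; the base URL and the core name `mycore` are assumptions, not values taken from these notes.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL and core name; adjust for a real installation.
SOLR_BASE = "http://localhost:8983/solr/mycore"


def build_update_request(docs):
    """Return (url, headers, body) for indexing documents as JSON over HTTP."""
    url = SOLR_BASE + "/update?" + urlencode({"commit": "true"})
    headers = {"Content-Type": "application/json"}
    body = json.dumps(docs).encode("utf-8")
    return url, headers, body


def build_select_url(query, wt="json", rows=10):
    """Return the URL of an HTTP GET search request."""
    params = {"q": query, "wt": wt, "rows": rows}
    return SOLR_BASE + "/select?" + urlencode(params)
```

Either result can be passed to any HTTP client, for example `urllib.request`, once a Solr server is actually running at that address.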
2, Solr Administration User Interface
  • Logging
  • Cloud Screens
  • Core Admin
  • Java Properties
  • Thread Dump
  • Core-Specific Tools
    • Analysis Screen
    • Dataimport Screen
    • Documents Screen
    • Files Screen
    • Ping
    • Plugin & Stats Screen
    • Query Screen
    • Replication Screen
    • Schema Browser Screen
    • Segments Info
3, Documents, Fields, Schema Design
  • Field Properties
    • indexed
    • stored
    • docValues
    • sortMissingFirst / sortMissingLast
    • multiValued
    • omitNorms
    • omitTermFreqAndPositions
    • omitPositions
    • termVectors / termPositions / termOffsets / termPayloads
    • required
  • Field Types
    • BinaryField
    • BoolField
    • CollationField
    • CurrencyField
    • DateRangeField
    • ExternalFileField
    • EnumField
    • LatLonType
    • PointType
    • TextField
    • StrField
    • TrieField
    • TrieInt/Long/FloatField
    • UUIDField
4, Analyzers, Tokenizers and Filters
  • Analyzers
    • An analyzer examines the text of fields and generates a token stream
    • Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file
    • Analyzers
      • WhitespaceAnalyzer
      • SimpleAnalyzer
      • StopAnalyzer
      • StandardAnalyzer
  • Tokenizers
    • The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text
    • An analyzer is aware of the field it is configured for, but a tokenizer is not
    • Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream)
    • You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>
    • Tokenizers
      • WhitespaceTokenizer
      • KeywordTokenizer
      • LetterTokenizer
      • StandardTokenizer
  • Filters
    • Like tokenizers, filters consume input and produce a stream of tokens
    • Filters also derive from org.apache.lucene.analysis.TokenStream
    • Unlike tokenizers, a filter's input is another TokenStream. The job of a filter is usually easier than that of a tokenizer since in most cases a filter looks at each token in the stream sequentially and decides whether to pass it along, replace it or discard it
    • A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although this is less common
    • One hypothetical use for such a filter might be to normalize state names that would be tokenized as two words. For example, the single token "california" would be replaced with "CA", while the token pair "rhode" followed by "island" would become the single token "RI"
    • Filters
      • LowerCaseFilter
      • StopFilter
      • PorterStemFilter
      • ASCIIFoldingFilter
      • StandardFilter
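
The tokenizer-then-filters pipeline described in this section can be sketched in plain Python. This is only a conceptual model, not Lucene code: the function names and the tiny stop list are made up for illustration, and the last filter reproduces the look-ahead state-name example from the Filters notes.

```python
STOP_WORDS = {"a", "an", "the", "of", "in"}  # tiny illustrative stop list


def whitespace_tokenizer(text):
    """Tokenizer: break a stream of text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Filter: replace each token with its lowercase form."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens):
    """Filter: discard tokens that appear in the stop list."""
    for token in tokens:
        if token not in STOP_WORDS:
            yield token


def state_filter(tokens):
    """Look-ahead filter: 'california' -> 'CA', 'rhode' + 'island' -> 'RI'."""
    buffered = list(tokens)
    i = 0
    while i < len(buffered):
        is_rhode_island = (
            buffered[i] == "rhode"
            and i + 1 < len(buffered)
            and buffered[i + 1] == "island"
        )
        if is_rhode_island:
            yield "RI"
            i += 2  # two input tokens collapse into one output token
        elif buffered[i] == "california":
            yield "CA"
            i += 1
        else:
            yield buffered[i]
            i += 1


def analyze(text):
    """Analyzer: one tokenizer followed by a chain of filters."""
    stream = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, stop_filter, state_filter):
        stream = token_filter(stream)
    return list(stream)
```

The order of the filters matters: lowercasing runs first so that the stop list and the state lookup only need lowercase entries.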
5, Indexing
  • The three most common ways of loading data into a Solr index
    • Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats
    • Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated
    • Writing a custom Java application to ingest data through Solr's Java Client API
  • Uploading Data with Index Handlers
    • Index Handlers are Request Handlers designed to add, delete and update documents to the index
    • In addition to having plugins for importing rich documents using Tika or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON
  • Uploading Data with Solr Cell using Apache Tika
    • Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself
    • Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing
    • When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell
  • Uploading Structured Data Store Data with the Data Import Handler
    • The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it
    • In addition to relational databases, DIH can index content from HTTP based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields
  • Detecting Languages During Indexing
    • Solr can identify languages and map text to language-specific fields during indexing using the langid UpdateRequestProcessor. Solr supports two implementations of this feature
      • Tika's language detection feature
      • LangDetect language detection
  • UIMA Integration
    • UIMA (the Apache Unstructured Information Management Architecture) lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations
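
For the "uploading XML files by sending HTTP requests" route, the body of the request is an `<add>` message containing `<doc>` and `<field>` elements. A minimal sketch of building such a message is shown below; the helper name `build_add_xml` and the sample field names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build an <add> update message from a list of {field: value} dicts.

    A list value is written as repeated <field> elements, which is how a
    multiValued field is expressed in the XML update format.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", {"name": name})
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be posted to the core's update handler with an XML content type.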
6, Searching
  • The search query is processed by a request handler
    • Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication
    • To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query
    • Input to a query parser can include
      • search strings, that is, terms to search for in the index
      • parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results
      • parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application's schema
    • Search parameters may also specify a query filter
  • Query Syntax and Parsing
    • The Standard Query Parser
      • Solr's default Query Parser is also known as the "lucene" parser
      • The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries
      • The largest disadvantage is that it's very intolerant of syntax errors, as compared with something like the DisMax query parser which is designed to throw as few errors as possible
    • The DisMax Query Parser
      • The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field
      • Additional options enable users to influence the score based on rules specific to each use case (independent of user input)
    • The Extended DisMax Query Parser
      • The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to supporting all the DisMax query parser parameters, it supports the full Lucene query parser syntax
    • Other Parsers
      • Block Join Query Parsers
      • Boost Query Parser
      • Collapsing Query Parser
      • Complex Phrase Query Parser
      • Field Query Parser
      • Function Query Parser
      • Function Range Query Parser
      • Join Query Parser
      • Lucene Query Parser
      • Max Score Query Parser
      • More Like This Query Parser
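
The DisMax idea of searching a simple user phrase across several weighted fields maps onto the `defType`, `q` and `qf` request parameters. The sketch below only builds the query string; the field names `title` and `body` and their boosts are hypothetical.

```python
from urllib.parse import urlencode


def build_dismax_params(user_input, field_boosts):
    """Build a DisMax query string from raw user input and per-field boosts."""
    # qf lists the fields to search, each with an optional ^boost suffix.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return urlencode({"defType": "dismax", "q": user_input, "qf": qf})
```

For example, `build_dismax_params("solr search", {"title": 2, "body": 1})` asks Solr to weight matches in `title` twice as heavily as matches in `body`.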
  • Query
    • TermQuery
    • TermRangeQuery
    • NumericRangeQuery
    • PrefixQuery
    • BooleanQuery
    • PhraseQuery
    • WildcardQuery
    • FuzzyQuery
    • MatchAllDocsQuery
  • Faceting
    • Faceting is the arrangement of search results into categories based on indexed terms
    • Searchers are presented with the indexed terms, along with numerical counts of how many matching documents were found for each term
    • Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for
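
What a facet response contains can be modelled in a few lines: for one field, count how many matching documents carry each term. This is a conceptual sketch over in-memory dictionaries, not how Solr computes facets internally; in a real request the feature is switched on with the `facet=true` and `facet.field` parameters.

```python
from collections import Counter


def facet_counts(matching_docs, field):
    """Return (term, count) pairs for one field, most frequent first."""
    counts = Counter()
    for doc in matching_docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]  # treat a single value like a one-item list
        counts.update(values)
    return counts.most_common()
```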
  • Highlighting
    • Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response
    • There are three highlighting implementations available
      • The Standard Highlighter is the Swiss Army knife of the highlighters. It has the most sophisticated and fine-grained query representation of the three highlighters
      • FastVector Highlighter
        • The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind
        • It tends to work better for more languages than the Standard Highlighter, because it supports Unicode breakiterators. On the other hand, its query representation is less advanced than the Standard Highlighter's; for example, it will not work well with the surround parser. This highlighter is a good choice for large documents and highlighting text in a variety of languages
      • Postings Highlighter
        • The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field
        • This is a much more compact and efficient structure than term vectors, but is not appropriate for huge numbers of query terms (e.g. wildcard queries). Like the FastVector Highlighter, it supports Unicode algorithms for dividing up the document
  • Spell Checking
  • Query Re-Ranking
  • Suggester
  • MoreLikeThis
  • Pagination of Results
  • Result Grouping
  • Spatial Search
  • The Term Vector Component: For each document in the response, the TermVectorComponent can return the term vector, the term frequency, inverse document frequency, position, and offset information
  • The Stats Component: The Stats component returns simple statistics for numeric, string, and date fields within the document set
  • Response Writers
    • CSVResponseWriter
    • JSONResponseWriter
    • VelocityResponseWriter
    • XMLResponseWriter
7, The Well-Configured Solr Instance
  • Configuring solrconfig.xml
    • request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
    • listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm up caches
    • the Request Dispatcher for managing HTTP communications
    • the Admin Web interface
    • parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
  • Solr Cores and solr.xml
    • solr.xml has evolved from configuring a single Solr core to supporting multiple Solr cores, and ultimately to defining parameters for SolrCloud
8, SolrCloud
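  • Concepts
    • Collection: a complete index, in the logical sense, within a SolrCloud cluster. It is usually divided into one or more Shards, all of which use the same Config Set. If there is more than one Shard, it is a distributed index; SolrCloud lets you refer to it by Collection name without worrying about the Shard-related parameters that distributed search would otherwise need
    • Config Set: the set of configuration files a Solr Core needs in order to serve requests. Each config set has a name. At a minimum it must include solrconfig.xml (SolrConfigXml) and schema.xml (SchemaXml); depending on what those two files configure, other files may be needed as well. It is stored in ZooKeeper. Config sets can be re-uploaded or updated with the upconfig command, and can be initialized or updated by specifying the Solr startup parameter bootstrap_confdir
    • Core: that is, a Solr Core. One Solr instance contains one or more Solr Cores, and each Solr Core can provide indexing and query functionality independently. Each Solr Core corresponds to one index or to one Shard of a Collection. Solr Cores were introduced to make management more flexible and to share resources. One difference in SolrCloud is that the configuration it uses lives in ZooKeeper, whereas a traditional Solr core keeps its configuration files in a configuration directory on disk
    • Leader: the Shard replica that won the election. Each Shard has multiple Replicas, which hold an election to determine a Leader. An election can happen at any time, but is usually triggered only when a Solr instance fails. When documents are indexed, SolrCloud passes them to the leader of the corresponding Shard, and the leader then distributes them to all of that Shard's replicas
    • Replica: a copy of a Shard. Each Replica lives in a Solr Core. If a collection named "test" is created with numShards=1 and replicationFactor set to 2, this produces 2 replicas, that is, 2 Cores, each on a different machine or Solr instance. One will be named test_shard1_replica1 and the other test_shard1_replica2. One of them will be elected Leader
    • Shard: a logical slice of a Collection. Each Shard is made up of one or more replicas, and an election determines which one is the Leader
    • ZooKeeper: ZooKeeper provides distributed locking and is required by SolrCloud. It handles Leader election. Solr can run with an embedded ZooKeeper, but a standalone ZooKeeper is recommended, preferably with 3 or more hosts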
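
The "test" collection from the Replica entry can be sketched as a Collections API `CREATE` request plus the core names the notes say it produces. The base URL is an assumption, and the naming scheme in `expected_core_names` simply follows the example in these notes; actual core names may differ between Solr versions.

```python
from urllib.parse import urlencode


def build_create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE request URL (the request is not sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return base_url + "/admin/collections?" + urlencode(params)


def expected_core_names(name, num_shards, replication_factor):
    """List core names using the <collection>_shardN_replicaM pattern from the notes."""
    return [
        f"{name}_shard{shard}_replica{replica}"
        for shard in range(1, num_shards + 1)
        for replica in range(1, replication_factor + 1)
    ]
```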
  • Features
    • Central configuration for the entire cluster
    • Automatic load balancing and fail-over for queries
    • ZooKeeper integration for cluster coordination and configuration
  • Nodes, Cores, Clusters and Leaders
    • Nodes and Cores
      • In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data
      • A Solr core is basically an index of the text and fields found in documents
      • A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria
      • When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it
    • Clusters
      • A cluster is a set of Solr nodes managed by ZooKeeper as a single unit
      • When you have a cluster, you can always make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted
    • Leaders and Replicas
      • The concept of a leader is similar to that of a master when thinking of traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader
      • However, with SolrCloud, you don't simply have one master and one or more "slaves"; instead, you have likely distributed your search and index traffic to multiple machines
  • Shards and Indexing Data in SolrCloud
    • When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index
    • A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for data that represents each state, or different categories that are likely to be searched independently, but are often combined
    • Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results
    • ZooKeeper provides failover and load balancing
  • Distributed Requests
    • One of the advantages of using SolrCloud is the ability to distribute requests among various shards that may or may not contain the data that you're looking for. You have the option of searching over all of your data or just parts of it
    • Configuring the ShardHandlerFactory
      • You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This allows for finer grained control and you can tune it to target your own specific requirements. The default configuration favors throughput over latency
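
Searching all of the data versus only part of it comes down to whether the request carries a `shards` parameter. A minimal sketch, assuming a hypothetical collection named `test` whose shards are called `shard1` and `shard2`:

```python
from urllib.parse import urlencode


def build_distributed_query(base_url, collection, query, shards=None):
    """Build a select URL; pass shard names to restrict the search to them."""
    params = {"q": query}
    if shards:
        # A comma-separated list limits the request to the named shards.
        params["shards"] = ",".join(shards)
    return f"{base_url}/{collection}/select?" + urlencode(params)
```

Leaving `shards` unset lets SolrCloud query every shard of the collection.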
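9, Chinese Word Segmenters
  • mmseg4j
    • mmseg4j is a Chinese word segmenter that implements Chih-Hao Tsai's MMSeg algorithm
    • The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex adds four filtering rules
  • paoding
    • Paoding's Knives Chinese word segmentation is extremely efficient and highly extensible. It introduces a metaphor, uses a fully object-oriented design, and is advanced in concept
    • High efficiency: on a PIII personal machine with 1 GB of memory, it can accurately segment 1 million Chinese characters per second
    • Segments text effectively using an unlimited number of dictionary files, which makes it possible to define vocabulary by category
    • Can parse unknown words reasonably
  • ictclas4j
    • ictclas4j is an open-source Java word segmentation project completed by sinboy on the basis of FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences
  • IKAnalyzer
    • A Chinese word segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms
    • Uses its own "forward iterative finest-granularity segmentation algorithm" and can process 600,000 characters per second
    • Uses a multi-subprocessor analysis mode that supports segmenting English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese words (handling of personal and place names)
    • Support for mixed Chinese and English text is not very good and handling it is cumbersome, requiring an additional query. It also provides optimized dictionary storage that supports personal entries, with a smaller memory footprint
    • Supports user-defined dictionary extensions
    • IKQueryParser, a query analyzer optimized for Lucene full-text search; it uses an ambiguity analysis algorithm to optimize the permutations of query keywords, which can greatly improve Lucene's hit rate
  • ansj
    • A Java implementation of ictclas. It rewrites essentially all of the data structures and algorithms. The dictionary is the one provided by the open-source version of ictclas, with some manual optimization
    • In-memory Chinese segmentation runs at about 1 million characters per second (already faster than ictclas)
    • Segmentation while reading from files runs at about 300,000 characters per second
    • Accuracy can reach 96% or higher
    • Currently implements Chinese word segmentation, Chinese name recognition, and user-defined dictionaries
    • Can be applied to natural language processing and similar areas, and suits any project that demands high segmentation quality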
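
Forward maximum matching, which the mmseg4j notes name as the basis of MMSeg, can be sketched in a few lines. This is a generic greedy matcher written for illustration, not mmseg4j's code; the four-word dictionary is made up, and none of the Complex-mode rules are implemented.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    if max_word_len is None:
        max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback token.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Illustrative dictionary: "research", "graduate student", "life", "origin".
SAMPLE_DICTIONARY = {"研究", "研究生", "生命", "起源"}
```

With this dictionary, `forward_max_match("研究生命起源", SAMPLE_DICTIONARY)` greedily takes 研究生 first and leaves 命 stranded as a single character, even though 研究 / 生命 / 起源 is the intended reading. Ambiguities of this kind are what rule-based refinements on top of plain maximum matching try to resolve.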
10, Solr Performance Factors
  • Schema Design Considerations (data model)
    • indexed fields
  • Configuration Considerations (configuration)
    • mergeFactor
