Efficient in-memory extensible inverted file

Abstract

The growing amount of on-line data demands efficient parallel and distributed indexing mechanisms to manage large resource requirements and unpredictable system failures. Parallel and distributed indices built on commodity hardware such as personal computers (PCs) can substantially reduce cost because PCs are produced in bulk, achieving economies of scale. However, PCs have a limited amount of random access memory (RAM), so effective utilization of RAM for in-memory inversion is crucial. This paper presents an analytical investigation and an empirical evaluation of storage-efficient in-memory extensible inverted files, which are represented by fixed- or variable-sized linked list nodes. The size of these linked list nodes is determined by minimizing storage waste or maximizing storage utilization under different conditions, leading to different storage allocation schemes. Minimizing storage waste also reduces the number of address indirections (i.e., chainings). We evaluated our storage allocation schemes using a number of reference collections and found that the arrival rate scheme is the best in terms of both storage utilization and the mean number of chainings per term. The final storage utilization can be over 90% in our evaluation if a sufficient number of documents is indexed, and the mean number of chainings is small (less than 2.6 for all the reference collections). We have also shown that our best storage allocation scheme can be used for our extensible compressed inverted file, whose final storage utilization can likewise be over 90% in our evaluation provided that a sufficient number of documents is indexed. The proposed storage allocation schemes can also be used by compressed extensible inverted files with word positions.

Keywords: Information retrieval; Indexing; Optimization



1. Introduction

As more and more data are made available on-line, it becomes increasingly difficult to manage a single inverted file. This difficulty arises from the substantial resource requirements of large-scale indexing and from the long indexing time, which makes the system vulnerable to unpredictable failures. For example, the very large collection (VLC) from TREC [1] requires 100GB of storage and the TREC terabyte track requires 426GB [2]. The WebBase repository [3] requires 220GB, estimated to be only 4% of the indexable web pages. The volume of well-written, non-English content is also increasing; in the near future, Japanese patent data from NTCIR [4] may be as large as 160GB. One way to manage such large quantities of data is to create the index by merging smaller indices, which are built using multiple machines indexing different document subsets in parallel [5]. This limits the impact of system failures to individual machines and increases indexing speed.

Acquiring computing machines in bulk as commodity hardware substantially reduces monetary costs. Commodity hardware such as personal computers (PCs) also makes in-memory inversion an attractive proposition because random access memory (RAM) for the PC market is relatively cheap and fast, and because RAM can potentially be upgraded later at lower prices (e.g., DDR-300 RAM to DDR-400 RAM). However, PCs can hold only a relatively small amount of RAM (e.g., 4GB) compared with mainframe computers. Efficient RAM utilization is therefore an important issue for in-memory inversion using a large number of PCs, because typically the entire inverted index cannot be stored in RAM due to the large volume of on-line data. Instead, the inverted file is typically built in relatively small batches [6] and [7]. For each batch, a partial index is built and held in RAM, then written out as a run on disk. The runs are finally merged into the final inverted file.

Efficient RAM utilization can reduce indexing time because it enables more documents to be indexed per run, reducing the number of inverted files to merge. During updates, temporary indices are maintained in memory and then integrated into the main inverted file in batches. Lester and Zobel [8] showed that, for different inverted file maintenance methods (i.e., re-build, re-merge and in-place methods), the amortized time cost of integrating the temporary index with the main inverted file is reduced when more documents are indexed. Therefore, during both initial construction and update, making better use of memory resources can reduce overall costs. Since indexing is memory-intensive whereas loading and flushing data are disk- or network-intensive, efficient in-memory inversion also has the potential to better balance system resource utilization, making it crucial to index construction.

The major contribution of this paper is to enhance existing, simple-to-implement, single-pass in-memory inversion so that it is storage-efficient for creating partial inverted files and/or temporary indices, by developing novel storage allocation schemes that predict the needed storage with minimal waste. The partial index created by our in-memory inversion can be merged with the main inverted file or with other partial inverted files. The temporary index created by our in-memory inversion can also be searched while it is being built. This reduces the latency with which recently indexed documents become searchable, which is important for certain applications (e.g., searching recently published news articles).

An evaluation was carried out to determine which of our storage allocation schemes is the best and whether the results are comparable to existing methods (Section 5). The evaluation used 3.5GB of test data from the VLC. The best allocation scheme was the arrival rate scheme, which achieved 95% final storage utilization for this VLC dataset. To ascertain the generality of the results, various independent datasets for both English (TREC-2, TREC-6 and TREC-2005) and Chinese (NTCIR-5) were also used to evaluate the best storage allocation scheme. We also showed that the indexing speed of our best storage allocation scheme is similar to the indexing speeds reported by others [6] and [9].

The rest of this paper is organized as follows. Section 2 describes our extensible inverted file structures, the modifications needed to incorporate compressed postings and word positions, and the associated storage wastes. This section also provides the rationale behind the choice of data structure for our storage allocation schemes and behind the need to optimize storage waste. Section 3 describes the first approach to determining optimal node sizes, which uses a stepwise optimization strategy and results in three related storage allocation schemes. Section 4 discusses the second approach, which determines the optimal node size that minimizes the asymptotic worst-case storage waste per period for individual terms. Section 5 evaluates these storage allocation schemes and discusses the best scheme in terms of storage utilization, the mean number of chainings, robustness of performance and indexing speed. This section also shows that our storage allocation schemes can be used to allocate nodes for compressed postings, using the best storage allocation scheme and variable byte compression as an example. Section 6 discusses related work and describes how the storage allocation schemes apply to our extended inverted file that incorporates compressed postings and word positions. Section 7 concludes.

2. Extensible inverted file and storage wastes

This section describes the structure of the extensible inverted file and the related considerations for handling compressed postings and word position information. It also discusses the storage wastes of the extensible inverted file and the rationale for optimizing them, as well as the rationale for using the variable-size linked list data structure.

2.1. Extensible inverted file

An in-memory inverted file can be considered as a set of inverted lists, which are implemented as linked lists. Each linked list node holds a set of (basic) postings of the form ⟨di, tf(di,tk)⟩, where each basic posting consists of a document identifier di and the within-document term frequency tf(di,tk) of the kth term in the ith document. The rest of this paper assumes that, unless otherwise indicated, all postings are basic postings. If a linked list node can hold a variable number of postings, then two additional fields of information are stored in each node besides the postings: a node size variable and an extension pointer. The node size variable specifies the amount of storage allocated for the current node, and the extension pointer facilitates the chaining of nodes.

Fig. 1 shows the conceptual structure of the extensible inverted file, implemented using a set of variable-size linked list nodes. The dictionary data structure holds the set of index terms and the start address of the corresponding variable-size linked list. A new node is allocated whenever the linked list of the kth index term tk is full (e.g., the linked list of the index term “Network” in Fig. 1) and a new posting for tk arrives. The size of the new node is determined using one of the storage allocation schemes discussed in the next two sections. If the linked list nodes hold a fixed number of postings per node, then the node size variable can be discarded, saving storage space.
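As a concrete illustration, the following C sketch shows one plausible layout for these nodes; the 4-byte document identifier, 2-byte term frequency and flexible array payload are illustrative assumptions, not necessarily the exact layout used in our implementation.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {             /* basic posting <di, tf(di,tk)> */
    uint32_t doc_id;         /* document identifier di */
    uint16_t tf;             /* within-document term frequency */
} Posting;

typedef struct Node {
    uint16_t     size;       /* node size variable: capacity in postings */
    struct Node *next;       /* extension pointer for chaining */
    Posting      postings[]; /* variable-size payload */
} Node;

/* Allocate a new node whose capacity is chosen by one of the
   storage allocation schemes of Sections 3 and 4. */
Node *node_alloc(uint16_t capacity)
{
    Node *n = malloc(sizeof(Node) + capacity * sizeof(Posting));
    if (n != NULL) {
        n->size = capacity;
        n->next = NULL;
    }
    return n;
}
```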



Fig. 1. The conceptual structure of our extensible inverted file, represented as variable-sized nodes. The start and last pointers point, respectively, to the first and last linked list nodes of the inverted list.

Each dictionary entry for an index term has a start pointer and a last pointer, which point, respectively, to the first and last linked list nodes of the inverted list. The last pointer avoids traversing the linked list when a new posting for the index term is inserted. During insertion, the last linked list node needs to be exclusively locked to maintain the integrity of the data under concurrent access [10]. To reduce memory usage, the start pointers can be stored in a file, since start pointers are used only for retrieval and not for inserting new postings. For clarity of presentation, additional information that each dictionary entry may contain (e.g., document frequency) is not shown in Fig. 1. In particular, each dictionary entry should hold a variable, say mpos, which indicates the position of the unfilled portion of the last node, to improve posting insertion speed.
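Continuing the sketch above, a dictionary entry and the corresponding insertion routine might look as follows; new_node_size() is a hypothetical placeholder for whichever storage allocation scheme of Sections 3 and 4 is in use.

```c
typedef struct {
    char    *term;   /* the index term tk */
    Node    *start;  /* first node of the inverted list (retrieval only) */
    Node    *last;   /* last node, where new postings are inserted */
    uint16_t mpos;   /* position of the unfilled portion of the last node */
} DictEntry;

uint16_t new_node_size(const DictEntry *e); /* hypothetical: an allocation scheme */

void insert_posting(DictEntry *e, uint32_t doc_id, uint16_t tf)
{
    if (e->last == NULL || e->mpos == e->last->size) { /* last node is full */
        Node *n = node_alloc(new_node_size(e));        /* chain a new node */
        if (e->last != NULL)
            e->last->next = n;
        else
            e->start = n;
        e->last = n;
        e->mpos = 0;
    }
    e->last->postings[e->mpos].doc_id = doc_id;        /* insert at mpos */
    e->last->postings[e->mpos].tf = tf;
    e->mpos++;
}
```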

The extensible inverted file can support a special type of posting for block addressing inverted files [9] and [11], which index fixed-size text blocks instead of variable-size documents. This special type of posting, called a block-address posting in this paper, has only the di field, without the term frequency tf(di,tk) field of the basic posting, where di is the block identifier instead of the document identifier and tk is the kth term. Our storage waste optimization, discussed in Sections 3 and 4, can minimize the storage wastes of nodes that store either basic postings or block-address postings because the storage for each such posting is a constant (i.e., c1) in our storage waste optimization.

The extensible inverted file can support storage of compressed postings [12], as well as word positions. For compressed postings (e.g., γ [13] or variable byte compression [14]), each dictionary entry keeps track of the bit position (again using mpos) or the byte position of the unfilled portion of the last node. The new compressed posting is inserted at mpos as a bit/byte string. If the last node does not have enough free space for a new compressed posting, then the new compressed posting is split between the last node and a newly allocated node.
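The sketch below, continuing in the same style, illustrates the byte-oriented case: a value (e.g., a document identifier gap) is variable-byte encoded and split across the node boundary when the last node runs out of space. The ByteNode layout and the encoding convention (7 data bits per byte, high bit marking the final byte) are assumptions for illustration.

```c
typedef struct ByteNode {
    uint16_t         size;   /* capacity in bytes */
    struct ByteNode *next;   /* extension pointer */
    uint8_t          bytes[];
} ByteNode;

/* Variable byte encoding: 7 data bits per byte; the high bit marks
   the final byte. Returns the number of bytes written (at most 5). */
static int vbyte_encode(uint32_t v, uint8_t *out)
{
    int n = 0;
    while (v >= 128) {
        out[n++] = (uint8_t)(v & 127);
        v >>= 7;
    }
    out[n++] = (uint8_t)(v | 128);
    return n;
}

/* Append one compressed value at byte position *mpos of the last node,
   splitting it into a freshly allocated node when the last node fills. */
void append_compressed(ByteNode **last, uint16_t *mpos,
                       uint32_t value, uint16_t next_node_size)
{
    uint8_t buf[5];
    int len = vbyte_encode(value, buf);
    for (int i = 0; i < len; i++) {
        if (*mpos == (*last)->size) {
            ByteNode *n = malloc(sizeof(ByteNode) + next_node_size);
            n->size = next_node_size;
            n->next = NULL;
            (*last)->next = n;
            *last = n;
            *mpos = 0;
        }
        (*last)->bytes[(*mpos)++] = buf[i];
    }
}
```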

There are two general approaches to storing postings with word positions. The first approach stores the word positions in the nodes; one way to do this (as in [6]) is to store the posting followed by a sequence of word positions, i.e., ⟨di, tf(di,tk)⟩.⟨pos(di,tk,1), …, pos(di,tk,tf(di,tk))⟩, where di is the identifier of the ith document, tf(di,tk) is the within-document frequency of the kth term in the ith document, and pos(di,tk,x) is the xth word position of the kth term in the ith document. In this case, the node size includes the storage to hold the word positions as well as the postings. The second approach stores extended postings of the form ⟨di, tf(di,tk), f(di,tk)⟩, where f(di,tk) is the position in an auxiliary file that stores the sequence of word positions of the kth term in the ith document. In this approach, the word positions are stored in the auxiliary file. Whenever a new extended posting is added, the last position of the auxiliary file is stored as f(di,tk) of this new extended posting, and the word positions of the term associated with the new extended posting are appended sequentially to the auxiliary file. These word positions in the auxiliary file can be compressed, for example, using differential coding [9], [12], [13], [14], [15] and [16]. If the within-document term frequency is one (i.e., tf(di,tk)=1), then f(di,tk) of the extended posting can directly store the single word position of the kth term in the ith document, saving both storage and access time. For both approaches, the storage allocation schemes can be modified to determine the node sizes, as discussed in Section 5.
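A minimal sketch of the second (auxiliary file) approach, with fixed-width fields assumed for illustration and reusing the headers of the earlier sketches:

```c
#include <stdio.h>

typedef struct {
    uint32_t doc_id;  /* di */
    uint16_t tf;      /* tf(di,tk) */
    uint32_t fpos;    /* f(di,tk): offset into the auxiliary file, or the
                         single word position itself when tf == 1 */
} ExtPosting;

/* Build an extended posting, appending the word positions to the
   auxiliary file unless there is only one (stored in-line in fpos). */
ExtPosting make_ext_posting(FILE *aux, uint32_t doc_id, uint16_t tf,
                            const uint32_t *positions)
{
    ExtPosting p = { doc_id, tf, 0 };
    if (tf == 1) {
        p.fpos = positions[0];
    } else {
        p.fpos = (uint32_t)ftell(aux);                 /* start of sequence */
        fwrite(positions, sizeof(uint32_t), tf, aux);  /* append positions */
    }
    return p;
}
```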

2.2. Rationale for variable-size linked lists

The variable-size linked-list data structure is chosen here because it is used in many in-memory inverted files (the nodes are sometimes called buckets or fixed list blocks) and because it is relatively simple to implement and to analyze. Instead of linked lists, postings can be stored in RAM blocks that expand to hold more postings as they are inserted; this type of RAM block expansion may involve copying and moving data chunks. Our work can be considered as extending the RAM block approach: each block is pre-allocated with storage to hold the expected number of postings, so that it is not necessary to copy or move data chunks in the RAM blocks. This pre-allocation avoids memory fragmentation, and the difficulty shifts to predicting the expected number of postings for each term instead of relying on advanced storage allocators. If a fast storage allocator is used so that the allocation time is amortized to a constant, then storage utilization may be sacrificed. Instead of using advanced storage allocators, dynamic data structures like hash tables, skip lists and balanced trees can be used, which support deletion as well as insertion of postings. However, the storage utilizations of these dynamic data structures are typically low (i.e., no more than 60% if each node contains at least one 4-byte pointer and one 6-byte posting). It is possible to store multiple postings per node in these data structures, but then the problem of optimizing the storage waste per node re-appears, whether one is dealing with a dynamic data structure (e.g., balanced trees) or a variable-size linked list. Therefore, we propose to use variable-size linked lists in this study because they are simple to program, use simple and fast storage allocators, are commonly used by in-memory inverted files, can easily be adapted to store compressed postings, and can be optimized for storage waste in the same way as other dynamic data structures (e.g., balanced trees).

Our choice of linked lists to store (compressed) postings implies that our extensible inverted files are designed largely for append-only operations, where direct deletions and direct modifications can be avoided. Deletions can be done indirectly by filtering out document identifiers that are known to be invalid (or deleted) using a registry of stale documents [8] and [17], because search engines can trade off data consistency against availability [17] according to the CAP theorem [18] and [19]. Few deletions or modifications are expected during in-memory inversion because the incoming documents are relatively fresh; deletions and modifications occur more often after the inverted file has been transferred to disk. Since disk storage cost per byte is much cheaper than RAM, deletion by filtering document identifiers is a practical solution for large-scale search engines. Similarly, a modification can be implemented effectively as a deletion (implemented as filtering) followed by a document insertion. When the registry of stale document identifiers becomes large or when the temporary index is full, the main inverted file on disk can be maintained by the re-building, re-merging or in-place updating approaches [8]. Therefore, the choice of an append-only data structure, like the variable-size linked list, may not be a severe handicap.

The use of the variable-size linked list representation of inverted lists requires some consideration of how to merge partial inverted files. The first approach saves the in-memory inverted lists, represented as linked lists, as contiguous lists on disk. This requires the system to follow the extension pointers of the linked list nodes when transferring the in-memory inverted file to disk. Tracing all the extension pointers incurs some additional time due to cache misses, but most of the time cost is due to transferring data to disk, provided that the mean number of chainings per term is not large. Once the in-memory inverted file is transferred to disk as a set of contiguous lists, the conventional inverted file merging algorithm can be used to merge these partial inverted files on disk. An alternative approach dumps the in-memory inverted file onto disk as it is. During the first level of partial inverted file merging, the merging algorithm combines two inverted lists on disk, represented by two sets of linked lists, into a single contiguous inverted list on disk. The merged partial inverted file then has a set of contiguous inverted lists and can subsequently be merged with other merged partial inverted files using the conventional merging algorithm. However, following extension pointers on disk requires file seek operations, which incur more time than cache misses. Therefore, we prefer the first approach, because the time cost of a cache miss is less than that of a file seek and because this approach is applicable to the inverted file maintenance methods described by Lester and Zobel [8] (i.e., re-build, re-merge and in-place updates) using an in-memory temporary index (called the document buffer in [8]).
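Continuing the earlier sketches, the first approach amounts to flattening each chain while writing it out, as in the following illustrative routine:

```c
/* Write one in-memory inverted list to disk as a contiguous list by
   following the extension pointers; only the last node may be partial. */
void flush_inverted_list(FILE *out, const DictEntry *e)
{
    for (const Node *n = e->start; n != NULL; n = n->next) {
        uint16_t filled = (n == e->last) ? e->mpos : n->size;
        fwrite(n->postings, sizeof(Posting), filled, out);
    }
}
```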

2.3. Rationale for storage waste optimization

The success of representing inverted lists by linked lists rests on the ability to accurately predict the required storage so that the RAM storage utilization is maximized. Otherwise, if the final storage utilization is low (say 60%), other data structures that can support deletion should be used instead. The storage utilization U of the extensible inverted file is the ratio of the total storage P of all (compressed) postings to the total allocated storage (i.e., P+S, where S is the total storage waste). Maximization of the storage utilization U can be considered as the minimization of the storage waste of the extensible inverted file as follows:

    max U = max [P/(P+S)]  ⟺  min S

since P is treated as a constant which is fixed for a particular data collection.
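For example, if P = 9MB of postings occupy 10MB of allocated nodes (so S = 1MB), then U = 9/10 = 90%.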

Storage wastes in the extensible inverted file can be divided into two types:

(a) The first type of storage waste is the storage overhead ε that includes the storage for extension pointers and for the node size variables; and

(b) The second type of storage waste is the latent free storage, which has been allocated but is not yet filled with posting information. If this latent free storage were not counted as storage waste, then the optimal linked-list node size would be as large as possible, so that the overhead appears minimal relative to the total storage allocated to that node.

The storage waste of each node in the extensible inverted file is the sum of these two types of storage wastes of the node.

There are several advantages to optimizing storage waste. First, it maximizes storage utilization, which can reduce the number of inverted file merge operations and the amortized time cost of inverted file maintenance [8]. Second, it can indirectly reduce the number of chainings per term. This reduces (a) the time to search the temporary index when it is built on the fly, (b) the time to store the in-memory inverted lists as contiguous lists on disk when the partial inverted file is transferred to disk, and (c) the number of file seeks when merging two partial inverted files on disk if these files represent inverted lists as linked lists. Third, the analysis of optimizing storage waste applies not just to linked lists but also to other dynamic data structures (e.g., balanced trees) in which each node holds more than one (compressed) posting. In this case, the optimization analysis treats the per-node storage overhead of these dynamic structures as a constant ε with a different value.

3. Stepwise storage allocation approach

The stepwise storage allocation approach determines the optimal node size for the incoming documents based on statistics of the current set of indexed documents. This approach optimizes the expected worst-case storage waste E(W(S(N))) after N documents are indexed as follows:


    E(W(S(N))) = (1/N) Σ_{n=1}^{N} W(S(n))    (1)
where E(.) is the expectation operator, W(.) returns the worst-case function of its argument, and S(n) is the storage waste after indexing n documents. The reason for optimizing the expected storage waste is to minimize the area under the curve of storage waste against the number of documents indexed, so that the storage waste is kept small across the different numbers of documents indexed (up to N documents). This approach assumes that the optimal node size for N documents is close to the optimal node size for N+ΔN documents, where ΔN is a small quantity compared with N; this assumption holds when N is sufficiently large. It also assumes that the measured system parameters used to determine the optimal node size are smooth, without large discontinuities. Otherwise, parameters (e.g., the size of the vocabulary) obtained after indexing N documents may vary substantially, leading to drastic changes in the optimal node size and implying that the optimal node size cannot be predicted from past statistics.

This approach yields three related storage allocation schemes. The first determines the optimal node size after indexing N documents, which is the same as the optimal node size for a static collection of N documents; this scheme is called the fixed-sized node scheme (F16). The second scheme, called the vocabulary growth rate (VGR) scheme, extends the formula of the F16 scheme by determining the optimal node size from the parameter values extracted at the time when a new node is allocated. The assumption is that this optimal node size remains more or less the same between the time that the node is allocated and the time that the node is filled (i.e., the system behavior should be smooth). Unfortunately, the VGR scheme allocates the same optimal node size at a given time instance for common and non-common terms, which are known to have widely different numbers of postings and different desirable node sizes. Thus, the final storage allocation scheme, called the term growth rate (TGR) scheme, determines the optimal node size for individual terms. The first two schemes optimize the expected worst-case storage waste E(W(S(N))) over all terms (as in Eq. (1)). The TGR scheme optimizes the expected worst-case storage waste E(W(S(N, tk))) for the kth term after indexing N documents, where S(n, tk) is the storage waste of the kth term after indexing n documents. Similar to Eq. (1), the quantity E(W(S(N, tk))) is defined as follows:


    E(W(S(N, tk))) = (1/N) Σ_{n=1}^{N} W(S(n, tk))    (2)

3.1. Fixed-sized node scheme (F16)

The fixed-sized node storage allocation scheme allocates storage to hold B postings for each new node. The overhead εp of a node allocated by this scheme is the storage for the extension pointer. Assuming that each posting occupies c1 bytes, a node requires c1·B+εp bytes. The storage waste S(n, tk) for term tk after indexing n documents is the latent free storage of the last node plus the storage overhead of all the chained nodes. The latent free storage of the last node is c1(⌈df(n,tk)/B⌉·B − df(n,tk)), where df(n,tk) is the number of documents that contain the kth term and ⌈.⌉ is the ceiling function. The storage overhead of all the chained nodes (including the last, unfilled node) is due to the extension pointers and is εp⌈df(n,tk)/B⌉. The storage waste S(n, tk) for term tk is therefore

    S(n, tk) = c1(⌈df(n,tk)/B⌉·B − df(n,tk)) + εp⌈df(n,tk)/B⌉

The relative frequency estimate of the probability p(tk) that the kth term appears in a document is df(n,tk)/n; hence df(n,tk)=p(tk)n. The total storage waste over all terms after indexing n documents can therefore be rewritten as

    S(n) = Σ_{k=1}^{D(n)} [c1(⌈p(tk)n/B⌉·B − p(tk)n) + εp⌈p(tk)n/B⌉]

where D(n) is the number of unique terms after indexing n documents. Since it is hard to optimize the closed form of S(n), which contains discontinuous functions (e.g., the ceiling function), the upper and lower bounds of S(n), and the optimal node sizes they induce, are considered as follows. An upper bound W(S(n)) of S(n) is the storage overhead due to the extension pointers of all the chained nodes plus the latent free space when the last node is assumed to hold no postings. Hence,

    W(S(n)) = Σ_{k=1}^{D(n)} [εp(p(tk)n/B + 1) + c1·B]    (3)
A lower bound of S(n) counts only the total storage of the extension pointers, assuming that there is no latent free space in the last node. Therefore, the lower bound of S(n) is

    Σ_{k=1}^{D(n)} εp⌈p(tk)n/B⌉

The above two bounds differ only by the amount of latent free space and the storage of one extension pointer. As the number of indexed documents increases, the two bounds converge to S(n), since the latent free space becomes small compared with the storage overhead of the set of chained filled nodes. Thus, the optimal node sizes obtained from the upper and lower bounds are valid approximations of the optimal node size for S(n) in large collections.
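As a concrete illustration (with assumed byte costs), let c1 = 6 bytes per posting, εp = 4 bytes per pointer, B = 16 postings per node and df(n,tk) = 20. Then ⌈20/16⌉ = 2 nodes are allocated, and the exact waste is S(n,tk) = 6·(32 − 20) + 4·2 = 80 bytes, which indeed lies between the per-term lower bound 4·2 = 8 bytes and the per-term upper bound 4·(20/16 + 1) + 6·16 = 105 bytes.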

By disregarding the storage waste due to the latent free space (and the ceiling function), the lower bound of the storage waste can be approximated as

    S(n) ≥ (εp·n/B) Σ_{k=1}^{D(n)} p(tk) = εp·n·R(n)/B

where R(n) = Σ_{k=1}^{D(n)} p(tk) is called the total nominal (term) arrival rate after indexing n documents. This lower bound is not useful for finding the optimal node size, because it attains no optimum for finite values of B and because it discounts the latent free storage, encouraging the undesirable allocation of larger-than-necessary node sizes. Alternatively, optimizing the upper bound W(S(n)) of S(n) (as in Eq. (3)) limits the storage waste after indexing n documents while accounting for the latent free storage. Based on Eq. (3), the worst-case (upper bound) storage waste after indexing n documents is

    W(S(n)) = εp·n·R(n)/B + D(n)·(c1·B + εp).

Substituting W(S(n)) into Eq. (1), the expected worst-case storage waste E(W(S(N))) after indexing N documents is

    E(W(S(N))) = (1/N) Σ_{n=1}^{N} [εp·n·R(n)/B + D(n)·(c1·B + εp)].
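As a numeric sketch (with assumed byte costs c1 and εp), the fixed node size B that minimizes E(W(S(N))) can be found by direct search over candidate sizes, given the measured profiles R(n) and D(n):

```c
#define C1    6.0  /* bytes per posting (assumed) */
#define EPS_P 4.0  /* bytes per extension pointer (assumed) */

/* E(W(S(N))) of Eq. (1) with W(S(n)) taken from Eq. (3); R and D are
   1-indexed arrays holding the total nominal arrival rate and the
   vocabulary size measured after indexing n documents. */
double expected_worst_waste(int B, int N, const double *R, const double *D)
{
    double total = 0.0;
    for (int n = 1; n <= N; n++)
        total += EPS_P * n * R[n] / B + D[n] * (C1 * B + EPS_P);
    return total / N;
}

/* Search the candidate node sizes for the minimizer of E(W(S(N))). */
int optimal_node_size(int N, const double *R, const double *D)
{
    int best_B = 1;
    double best_w = expected_worst_waste(1, N, R, D);
    for (int B = 2; B <= 1024; B++) {
        double w = expected_worst_waste(B, N, R, D);
        if (w < best_w) {
            best_w = w;
            best_B = B;
        }
    }
    return best_B;
}
```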

The optimal node size is then obtained by minimizing this expected worst-case storage waste with respect to B.
