上一小节,我们对p稳定分布LSH的基本原理进行了介绍(http://blog.csdn.net/jasonding1354/article/details/38237353),在接下来的博文中,我将以E2LSH开源代码为基础,对E2LSH的源码进行注解学习,从而为掌握LSH的基本原理以及未来对相似性搜索的扩展学习打下基础。
1、代码概况
E2LSH的核心代码可以分为3部分:
其他代码说明:
2、主要的数据结构
(1)RNearNeighborStructureT(LocalitySensitiveHashing.h中定义)——R near neighbor数据结构。该结构包含构建数据结构的参数、哈希函数族gi的描述、结构中数据点的索引和用于存储哈希桶的L个哈希表的指针。
typedef struct _RNearNeighborStructT { IntT dimension; // dimension of points. IntT parameterK; // parameter K of the algorithm. IntT parameterL; // parameter L of the algorithm. RealT parameterW; // parameter W of the algorithm. IntT parameterT; // parameter T of the algorithm. RealT parameterR; // parameter R of the algorithm. RealT parameterR2; // = parameterR^2 // Whether to use <u> hash functions instead of usual <g> // functions. When this flag is set to TRUE, <u> functions are // generated (which are roughly k/2-tuples of LSH), and a <g> // function is a pair of 2 different <u> functions. BooleanT useUfunctions; // the number of tuples of hash functions used (= # of rows of // <lshFunctions>). When useUfunctions == FALSE, this field is equal // to parameterL, otherwise, to <m>, the number of <u> hash // functions (in this case, parameterL = m*(m-1)/2 = nHFTuples*(nHFTuples-1)/2 IntT nHFTuples; // How many LSH functions each of the tuple has (it is <k> when // useUfunctions == FALSE, and <k/2> when useUfunctions == TRUE). IntT hfTuplesLength; // number of points in the data set Int32T nPoints; // The array of pointers to the points that are contained in the // structure. Some types of this structure (of UHashStructureT, // actually) use indeces in this array to refer to points (as // opposed to using pointers). PPointT *points; // The size of the array <points> Int32T pointsArraySize; // If <reportingResult> == FALSE, no points are reported back in a // <get*> function. In particular any point that is found in the // bucket is considered to be outside the R-ball of the query point // (the distance is still computed). If <reportingResult> == TRUE, // then the structure behaves normally. BooleanT reportingResult; // This table stores the LSH functions. There are <nHFTuples> rows // of <hfTuplesLength> LSH functions. LSHFunctionT **lshFunctions; // Precomputed hashes of each of the <nHFTuples> of <u> functions // (to be used by the bucket hashing module). Uns32T **precomputedHashesOfULSHs; // The set of non-empty buckets (which are hashed using // PUHashStructureT). PUHashStructureT *hashedBuckets; // *** // The following vectors are used only for temporary operations // within this R-NN structure during a query operation. // *** // This vector is used to store the values of hash functions <u> // (<hfTuplesLength>-tuple of LSH fuctions). One <g> function is a concatenation // of <nHFTuples> of <u> LSH functions. Uns32T **pointULSHVectors; // A vector of length <dimension> to store the reduced point (point // with coordinates divided by <parameterR>). RealT *reducedPoint; // This vector is used for storing marked points in a query // operation (for computing distances to a point at most once). If // markedPoints[i]=TRUE then point <i> was examined already. BooleanT *markedPoints; // This vector stored the indeces in the vector <markedPoints> of all // TRUE entries. Int32T *markedPointsIndeces; // the size of <markedPoints> and of <markedPointsIndeces> IntT sizeMarkedPoints; } RNearNeighborStructT, *PRNearNeighborStructT;
(2)UHashStructureT(BucketHashing.h中定义)——该结构定义了用于映射哈希桶的哈希表。通过链表方式解决冲突的问题。
主要有两种哈希表类型:HT_LINKED_LIST 和HT_HYBRID_CHAINS。
HT_LINKED_LIST对应哈希表的链表版本:对于每一个哈希表,我们存储一个指向由桶(buckets)构成的单向链表(singly-linked list)的指针;对于每一个桶,我们存储该桶的指纹h2(·)和指向由数据点构成的单向链表的指针。
HT_HYBRID_CHAINS对应混合存储的哈希表:该方案会对数据集P中的所有数据点建立索引,并且存储所有桶(all buckets with values h2(·))和数据点。在哈希表Hi中,我们在索引l中保存混合存储表Y的一些索引ek,其中ek标识了链表k的起始位置;对于一个链表(chain),分别存储着链中第一个桶的h2(·)值和桶中所有数据点的索引,链中第二个桶的h2(·)值和桶中所有数据点的索引,以此类推。
两种哈希表都具有指向哈希函数h1(·)和h2(·)的指针。
typedef struct _UHashStructureT { // The type of the hash table (can take values HT_*). when // <typeHT>=HT_LINKED_LIST, chains&buckets are linked lists. when // <typeHT>=HT_PACKED, chains&buckets are static arrays. when // <typeHT>=HT_STATISTICS, chains are static arrays and buckets only // count # of elements. when <typeHT>=HT_HYBRID_CHAINS, a chain is // a "hybrid" array that contains both the buckets and the points // (the an element of the chain array is of type // <HybridChainEntryT>). all chains are conglamerated in the same // array <hybridChainsStorage>. IntT typeHT; // The array containing the hash slots of the universal hashing. union _hashTableT { PGBucketT *llHashTable; PackedGBucketT **packedHashTable; LinkPackedGBucketT **linkHashTable; PHybridChainEntryT *hybridHashTable; } hashTable; // The sizes of each of the chains of the hashtable (used only when // typeHT=HT_PACKED or HT_STATISTICS. IntT *chainSizes; union _bucketPoints{ PPointT *pointsArray; PointsListEntryT *pointsList; } bucketPoints; HybridChainEntryT *hybridChainsStorage; // The size of hashTable. Int32T hashTableSize; // Number of elements(buckets) stored in the hash table in total (that // is the number of non-empty buckets). Int32T nHashedBuckets; Int32T nHashedPoints; // Unused (but allocated) instances of the corresponding // structs. May be reused if needed (instead of allocated new // memory). PGBucketT unusedPGBuckets; PBucketEntryT unusedPBucketEntrys; Uns32T prime; // the prime used for the universal hash functions. IntT hashedDataLength;// the number of IntT's in an element from U (U is the set of values to hash). // The hash functions used for the universal hashing. // The main hash function (that defines the index // of the slot in the table). // The type of the hash function is: h_{a}(k) = ((a\cdot k)mod p)mod hashTableSize. Uns32T *mainHashA; // Control hash functions: used to compute/check the <controlValue>s // of <GBucket>s. // The type of the hash function is: h_{a}(k) = (a\cdot k)mod p Uns32T *controlHash1; } UHashStructureT, *PUHashStructureT;
(3)RNNParametersT(LocalitySensitiveHashing.h中定义)——包含构建RNearNeighborStructureT数据结构的必要参数的结构体。
typedef struct _RNNParametersT { RealT parameterR; // parameter R of the algorithm. RealT successProbability; // the success probability 1-\delta IntT dimension; // dimension of points. RealT parameterR2; // = parameterR^2 // Whether to use <u> hash functions instead of usual <g> // functions. When this flag is set to TRUE, <u> functions are // generated (which are roughly k/2-tuples of LSH), and a <g> // function is a pair of 2 different <u> functions. BooleanT useUfunctions; IntT parameterK; // parameter K of the algorithm. // parameter M (# of independent tuples of LSH functions) // if useUfunctions==TRUE, parameterL = parameterM * (parameterM - 1) / 2 // if useUfunctions==FALSE, parameterL = parameterM IntT parameterM; IntT parameterL; // parameter L of the algorithm. RealT parameterW; // parameter W of the algorithm. IntT parameterT; // parameter T of the algorithm. // The type of the hash table used for storing the buckets (of the // same <g> function). IntT typeHT; } RNNParametersT, *PRNNParametersT;
(4)PPoint(Geometry.h中定义)——用于存储数据点的结构体。该结构包含数据的坐标(coordinates),数据点范数的平方和该数据点在数据集P的下标索引。
typedef struct _PointT { IntT index; // the index of this point in the dataset list of points RealT *coordinates; RealT sqrLength; // the square of the length of the vector } PointT, *PPointT;
转载请注明作者及文章出处:http://blog.csdn.net/jasonding1354/article/details/38331229
E2LSH源代码下载地址:http://download.csdn.net/detail/jasonding1354/7704277
参考资料:
1、M.Datar,N.Immorlica,P.Indyk,and V.Mirrokni,“Locality-SensitiveHashing Scheme Based on p-Stable Distributions,”Proc.Symp. ComputationalGeometry, 2004.
2、A.Andoni,P.Indyk.E2lsh:Exact Euclidean locality-sensitive hashing.http://web.mit.edu/andoni/www/LSH/.2004.