【E2LSH源码分析】E2LSH源码综述及主要数据结构

上一小节,我们对p稳定分布LSH的基本原理进行了介绍(http://blog.csdn.net/jasonding1354/article/details/38237353),在接下来的博文中,我将以E2LSH开源代码为基础,对E2LSH的源码进行注解学习,从而为掌握LSH的基本原理以及未来对相似性搜索的扩展学习打下基础。


1、代码概况

E2LSH的核心代码可以分为3部分:

  • LocalitySensitiveHashing.cpp——主要包含基于LSH的RNN(R-near neighbor)数据结构。其主要功能是根据参数构建数据结构进行查询数据对象的功能;
  • BucketHashing.cpp——主要包含对于哈希桶的普通哈希表。其主要功能是构建哈希表,添加哈希桶到表中和查询哈希桶;
  • SelfTuning.cpp——包含计算RNN数据结构最佳参数的函数。

其他代码说明:

  • Geometry.h——包含对数据点的定义(数据类型PPoint);
  • NearNeighbors.cpp, NearNeighbors.h——包含E2LSH核心代码的函数接口;
  • Random.cpp, Random.h——包含伪随机数产生器;
  • BasicDefinitions.h——通用的类型定义和宏定义;
  • Utils.cpp, Utils.h——包含一些通用的函数(如复制向量)。

2、主要的数据结构

(1)RNearNeighborStructureT(LocalitySensitiveHashing.h中定义)——R near neighbor数据结构。该结构包含构建数据结构的参数、哈希函数族gi的描述、结构中数据点的索引和用于存储哈希桶的L个哈希表的指针。

typedef struct _RNearNeighborStructT {
  IntT dimension; // dimension of points.
  IntT parameterK; // parameter K of the algorithm.
  IntT parameterL; // parameter L of the algorithm.
  RealT parameterW; // parameter W of the algorithm.
  IntT parameterT; // parameter T of the algorithm.
  RealT parameterR; // parameter R of the algorithm.
  RealT parameterR2; // = parameterR^2

  // Whether to use <u> hash functions instead of usual <g>
  // functions. When this flag is set to TRUE, <u> functions are
  // generated (which are roughly k/2-tuples of LSH), and a <g>
  // function is a pair of 2 different <u> functions.
  BooleanT useUfunctions;

  // the number of tuples of hash functions used (= # of rows of
  // <lshFunctions>). When useUfunctions == FALSE, this field is equal
  // to parameterL, otherwise, to <m>, the number of <u> hash
  // functions (in this case, parameterL = m*(m-1)/2 = nHFTuples*(nHFTuples-1)/2
  IntT nHFTuples;
  // How many LSH functions each of the tuple has (it is <k> when
  // useUfunctions == FALSE, and <k/2> when useUfunctions == TRUE).
  IntT hfTuplesLength;

  // number of points in the data set
  Int32T nPoints;

  // The array of pointers to the points that are contained in the
  // structure. Some types of this structure (of UHashStructureT,
  // actually) use indeces in this array to refer to points (as
  // opposed to using pointers).
  PPointT *points;

  // The size of the array <points>
  Int32T pointsArraySize;

  // If <reportingResult> == FALSE, no points are reported back in a
  // <get*> function. In particular any point that is found in the
  // bucket is considered to be outside the R-ball of the query point
  // (the distance is still computed).  If <reportingResult> == TRUE,
  // then the structure behaves normally.
  BooleanT reportingResult;
  
  // This table stores the LSH functions. There are <nHFTuples> rows
  // of <hfTuplesLength> LSH functions.
  LSHFunctionT **lshFunctions;

  // Precomputed hashes of each of the <nHFTuples> of <u> functions
  // (to be used by the bucket hashing module).
  Uns32T **precomputedHashesOfULSHs;

  // The set of non-empty buckets (which are hashed using
  // PUHashStructureT).
  PUHashStructureT *hashedBuckets;

  // ***
  // The following vectors are used only for temporary operations
  // within this R-NN structure during a query operation.
  // ***

  // This vector is used to store the values of hash functions <u>
  // (<hfTuplesLength>-tuple of LSH fuctions). One <g> function is a concatenation
  // of <nHFTuples> of <u> LSH functions.
  Uns32T **pointULSHVectors;
  
  // A vector of length <dimension> to store the reduced point (point
  // with coordinates divided by <parameterR>).
  RealT *reducedPoint;

  // This vector is used for storing marked points in a query
  // operation (for computing distances to a point at most once). If
  // markedPoints[i]=TRUE then point <i> was examined already.
  BooleanT *markedPoints;
  // This vector stored the indeces in the vector <markedPoints> of all
  // TRUE entries.
  Int32T *markedPointsIndeces;
  // the size of <markedPoints> and of <markedPointsIndeces>
  IntT sizeMarkedPoints;
} RNearNeighborStructT, *PRNearNeighborStructT;

(2)UHashStructureT(BucketHashing.h中定义)——该结构定义了用于映射哈希桶的哈希表。通过链表方式解决冲突的问题。

主要有两种哈希表类型:HT_LINKED_LISTHT_HYBRID_CHAINS

HT_LINKED_LIST对应哈希表的链表版本:对于每一个哈希表,我们存储一个指向由桶(buckets)构成的单向链表(singly-linked list)的指针;对于每一个桶,我们存储该桶的指纹h2(·)和指向由数据点构成的单向链表的指针

HT_HYBRID_CHAINS对应混合存储的哈希表:该方案会对数据集P中的所有数据点建立索引,并且存储所有桶(all buckets with values h2(·))和数据点。在哈希表Hi中,我们在索引l中保存混合存储表Y的一些索引ek,其中ek标识了链表k的起始位置;对于一个链表(chain),分别存储着链中第一个桶的h2(·)值和桶中所有数据点的索引,链中第二个桶的h2(·)值和桶中所有数据点的索引,以此类推。

两种哈希表都具有指向哈希函数h1(·)和h2(·)的指针。

typedef struct _UHashStructureT {
  // The type of the hash table (can take values HT_*). when
  // <typeHT>=HT_LINKED_LIST, chains&buckets are linked lists. when
  // <typeHT>=HT_PACKED, chains&buckets are static arrays. when
  // <typeHT>=HT_STATISTICS, chains are static arrays and buckets only
  // count # of elements.  when <typeHT>=HT_HYBRID_CHAINS, a chain is
  // a "hybrid" array that contains both the buckets and the points
  // (the an element of the chain array is of type
  // <HybridChainEntryT>). all chains are conglamerated in the same
  // array <hybridChainsStorage>.
  IntT typeHT;

  // The array containing the hash slots of the universal hashing.
  union _hashTableT {
    PGBucketT *llHashTable;
    PackedGBucketT **packedHashTable;
    LinkPackedGBucketT **linkHashTable;
    PHybridChainEntryT *hybridHashTable;
  } hashTable;

  // The sizes of each of the chains of the hashtable (used only when
  // typeHT=HT_PACKED or HT_STATISTICS.
  IntT *chainSizes;

  union _bucketPoints{
    PPointT *pointsArray;
    PointsListEntryT *pointsList;
  } bucketPoints;

  HybridChainEntryT *hybridChainsStorage;

  // The size of hashTable.
  Int32T hashTableSize;

  // Number of elements(buckets) stored in the hash table in total (that
  // is the number of non-empty buckets).
  Int32T nHashedBuckets;

  Int32T nHashedPoints;

  // Unused (but allocated) instances of the corresponding
  // structs. May be reused if needed (instead of allocated new
  // memory).
  PGBucketT unusedPGBuckets;
  PBucketEntryT unusedPBucketEntrys;

  Uns32T prime; // the prime used for the universal hash functions.
  IntT hashedDataLength;// the number of IntT's in an element from U (U is the set of values to hash).

  // The hash functions used for the universal hashing.  

  // The main hash function (that defines the index
  // of the slot in the table).
  // The type of the hash function is: h_{a}(k) = ((a\cdot k)mod p)mod hashTableSize.
  Uns32T *mainHashA;

  // Control hash functions: used to compute/check the <controlValue>s
  // of <GBucket>s.
  // The type of the hash function is: h_{a}(k) = (a\cdot k)mod p
  Uns32T *controlHash1;
} UHashStructureT, *PUHashStructureT;

(3)RNNParametersT(LocalitySensitiveHashing.h中定义)——包含构建RNearNeighborStructureT数据结构的必要参数的结构体。

typedef struct _RNNParametersT {
  RealT parameterR; // parameter R of the algorithm.
  RealT successProbability; // the success probability 1-\delta
  IntT dimension; // dimension of points.
  RealT parameterR2; // = parameterR^2

  // Whether to use <u> hash functions instead of usual <g>
  // functions. When this flag is set to TRUE, <u> functions are
  // generated (which are roughly k/2-tuples of LSH), and a <g>
  // function is a pair of 2 different <u> functions.
  BooleanT useUfunctions;

  IntT parameterK; // parameter K of the algorithm.
  
  // parameter M (# of independent tuples of LSH functions)
  // if useUfunctions==TRUE, parameterL = parameterM * (parameterM - 1) / 2
  // if useUfunctions==FALSE, parameterL = parameterM
  IntT parameterM;

  IntT parameterL; // parameter L of the algorithm.
  RealT parameterW; // parameter W of the algorithm.
  IntT parameterT; // parameter T of the algorithm.

  // The type of the hash table used for storing the buckets (of the
  // same <g> function).
  IntT typeHT;
} RNNParametersT, *PRNNParametersT;

(4)PPoint(Geometry.h中定义)——用于存储数据点的结构体。该结构包含数据的坐标(coordinates),数据点范数的平方和该数据点在数据集P的下标索引。

typedef struct _PointT {
  IntT index; // the index of this point in the dataset list of points
  RealT *coordinates;
  RealT sqrLength; // the square of the length of the vector
} PointT, *PPointT;

转载请注明作者及文章出处:http://blog.csdn.net/jasonding1354/article/details/38331229

E2LSH源代码下载地址:http://download.csdn.net/detail/jasonding1354/7704277

参考资料:

1、M.Datar,N.Immorlica,P.Indyk,and V.Mirrokni,“Locality-SensitiveHashing Scheme Based on p-Stable Distributions,”Proc.Symp. ComputationalGeometry, 2004.

2、A.Andoni,P.Indyk.E2lsh:Exact Euclidean locality-sensitive hashing.http://web.mit.edu/andoni/www/LSH/.2004.


你可能感兴趣的:(E2LSH,P稳定分布LSH,相似性搜索)