Locality-Sensitive Hashing (LSH) Open-Source Projects and Papers · Jian Zhou

Locality-Sensitive Hashing (LSH)

Scala LSH project

Similarity Measure

A similarity measure is a real-valued function that quantifies how similar two objects are. No single definition exists, but such measures are usually, in some sense, the inverse of a distance metric: the smaller the distance between two objects, the larger their similarity.

  • Cosine similarity
  • Euclidean similarity
  • Nucleotide similarity
  • Amino acid similarity
  • Hamming similarity
  • Jaccard similarity
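
As a rough illustration of the "inverse of distance" idea, here is a minimal Python sketch of four of the measures listed above. The function names are hypothetical, and the `1 / (1 + d)` form used for Euclidean similarity is only one common convention, chosen here as an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_similarity(u, v):
    """Assumed convention: map distance d in [0, inf) to 1 / (1 + d) in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))


def hamming_similarity(x, y):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)
```

Hamming similarity applies equally to bit strings and to nucleotide or amino acid sequences of the same length, since it only compares positions.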

Types of LSH

  • HyperplaneLSH for cosine distance
  • Super-Bit Locality-Sensitive Hashing for Hamming distance
  • MinHash for Jaccard similarity
  • Min-wise independent permutations
  • Nilsimsa Hash
  • Random projection
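
To make two of the families above concrete, the following is a minimal sketch of random-hyperplane signatures (for cosine distance) and MinHash signatures (for Jaccard similarity). All names are hypothetical, and several choices are assumptions made for illustration: Gaussian hyperplanes, SHA-1 to map tokens to integers, and linear hash functions modulo a Mersenne prime in place of true random permutations.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus for the MinHash functions


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian normal vector per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hyperplane_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def make_minhash_params(num_hashes, seed=0):
    """Coefficients (a, b) of the hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def token_to_int(token):
    """Map a string token to a stable 64-bit integer."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, params):
    """Minimum hash value of a non-empty token set under each hash function."""
    values = [token_to_int(t) for t in tokens]
    return tuple(min((a * x + b) % PRIME for x in values) for a, b in params)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

In both cases similar inputs agree on more signature positions than dissimilar ones, which is what lets signatures (or bands of them) serve as bucket keys. The hyperplane signature depends only on a vector's direction, so rescaling a vector by a positive factor leaves it unchanged.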

Implementations

  • tdebatty/java-LSH: A Java implementation of Locality Sensitive Hashing (LSH), MinHash and Super-Bit.
  • apache/incubator-datafu: A collection of libraries for working with large-scale data in Hadoop.
  • marufaytekin/lsh-spark: HyperplaneLSH for Spark.
  • soundcloud/cosine-lsh-join-spark: Approximate nearest neighbors in Spark.
  • karlhigley/spark-neighbors: Spark-based approximate nearest neighbor search using locality-sensitive hashing; supports Hamming, Jaccard, Euclidean, and cosine distance.
  • rholder/nilsimsa: Nilsimsa locality-sensitive hashing algorithm in Java.
  • chrisjmccormick/MinHash: MinHash tutorial with Python code, including an example of mining document similarity.
  • barneygovan/lsh-scala: A Locality-Sensitive Hashing library for Scala with optional Redis storage.
  • treadstone90/Locality-Sensitive-Hashing: Works only for text and supports only Jaccard similarity.
  • richwhitjr/DistNN: Distributed LSH implementation in Scala.
  • beckgael/Mean-Shift-LSH: Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing (DNNMS-LSH). Scala/Spark implementation.
  • ohtaman/LSH: C++ implementation of MinHash and SimHash.
  • JorenSix/TarsosLSH: A Java library implementing Locality-Sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time.

Papers

  1. Practical and Optimal LSH for Angular Distance
  2. Optimal Data-Dependent Hashing for Approximate Near Neighbors
  3. Beyond Locality Sensitive Hashing
  4. Original LSH algorithm (1999)
  5. Efficient Distributed Locality Sensitive Hashing
  6. Jaccard distance: Mining of Massive Datasets, Chapter 3
  7. Hamming norm: A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB (1999).
  8. Lp norms: M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://people.csail.mit.edu/indyk/nips-nn.ps
  9. Cosine distance and Earth mover's distance (EMD): M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
  10. Very Sparse Random Projections: Ping Li, T. Hastie, and K. W. Church, 2006
  11. Similarity Estimation Techniques from Rounding Algorithms
  12. Random projection: Random projection in dimensionality reduction: Applications to image and text data
  13. An Introduction to Sequence Similarity (“Homology”) Searching
  14. Efficient large-scale sequence comparison by locality-sensitive hashing

Finding Nearest Neighbors

  • Locality-Sensitive Hashing for Finding Nearest Neighbors
  • k-nearest neighbors algorithm

Additional Reading

  • What are some good LSH implementations?
  • Why do hash functions use prime numbers?

Issues for LSH

  1. SPARK-5992: Locality Sensitive Hashing (LSH) for Spark
  2. spark/pull/15148

Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.

Detailed changes are as follows:

  1. Implement abstract LSH, LSHModel classes as Estimator-Model
  2. Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
  3. Implement Random Projection as LSH subclass for Euclidean distance, MinHash for Jaccard Distance
  4. Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin

Things that will be implemented in a follow-up PR:

  • Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
  • PySpark Integration for the scala classes and methods.
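
The Estimator-Model split described in this pull request can be sketched in plain Python. This is not the Spark API: the class names, method names, and parameters below are hypothetical, the data is a plain list of tuples instead of a distributed dataset, and the hash `floor(v · x / bucket_length)` is an assumed form of random projection for Euclidean distance.

```python
import math
import random
from collections import defaultdict


class RandomProjectionLSH:
    """Estimator: holds parameters; fit() draws the random directions."""

    def __init__(self, bucket_length, num_hash_tables=4, seed=0):
        self.bucket_length = bucket_length
        self.num_hash_tables = num_hash_tables
        self.seed = seed

    def fit(self, dataset):
        dim = len(dataset[0])  # the data is only used to learn the dimension
        rng = random.Random(self.seed)
        directions = []
        for _ in range(self.num_hash_tables):
            v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(x * x for x in v))
            directions.append([x / norm for x in v])  # random unit vector
        return RandomProjectionLSHModel(directions, self.bucket_length)


class RandomProjectionLSHModel:
    """Model: hashes points and answers approximate queries."""

    def __init__(self, directions, bucket_length):
        self.directions = directions
        self.bucket_length = bucket_length

    def hash(self, point):
        """One bucket index per hash table."""
        return tuple(
            math.floor(sum(d * x for d, x in zip(direction, point)) / self.bucket_length)
            for direction in self.directions
        )

    def approx_nearest_neighbors(self, dataset, key, k):
        """Rank by true distance only the points sharing a bucket with the key."""
        key_hash = self.hash(key)
        candidates = [
            p for p in dataset
            if any(a == b for a, b in zip(self.hash(p), key_hash))
        ]
        return sorted(candidates, key=lambda p: math.dist(p, key))[:k]

    def approx_similarity_join(self, left, right, threshold):
        """Pairs (p, q, distance) that share a bucket and lie closer than threshold."""
        buckets = defaultdict(list)
        for q in right:
            for table, value in enumerate(self.hash(q)):
                buckets[(table, value)].append(q)
        pairs = {}  # keyed by (p, q) so a pair colliding in several tables appears once
        for p in left:
            for table, value in enumerate(self.hash(p)):
                for q in buckets.get((table, value), []):
                    dist = math.dist(p, q)
                    if dist < threshold:
                        pairs[(p, q)] = dist
        return [(p, q, d) for (p, q), d in pairs.items()]
```

Both queries are approximate because a true neighbor that shares no bucket with the query in any table is never examined. Adding hash tables reduces such misses at the cost of more candidates to check.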

Datasets

  • Text Technologies for Data Science Assessment 3
