pipisorry

Mining Massive Datasets (MMDS) week 2: Locality-Sensitive Hashing (LSH)

http://blog.csdn.net/pipisorry/article/details/48858661

Study notes on Locality-Sensitive Hashing (LSH) from Jure Leskovec's Mining Massive Datasets (MMDS) course

{This is the first half of discussion of a powerful technique for focusing search on things that are likely to be relevant, while avoiding the examination of things unlikely to be what we are looking for, in much the way ordinary hashing gets us to records we want without looking through an entire database. This subject will continue in week 7.}

Introduction to Locality-Sensitive Hashing (LSH)

Hashing probably seemed like a bit of magic: you have a large set of keys, and when you want to find some key k, you go right to it without having to look very far at all.
Shingling is a way to convert the informal notion of similar documents into a formal test for similarity of sets.
Minhashing allows us to replace a large set by a much smaller list of values. The similarity of the small lists, called signatures, predicts the similarity of the whole sets.
Locality-sensitive hashing is another bit of magic: we are pointed right at the similar pairs without having to wade through the morass of all pairs. It finds similar sets or similar documents without doing anything that involves searching all pairs.

Hashing

As most readers know, a hash table lookup takes O(1) time on average, so lookups are extremely fast. Take an ordinary hash function such as f(x)=(x*7)%10 and two data points x1=123 and x2=124. Hashing them gives f(x1)=1 and f(x2)=8. What does this show? The original values are very similar (say, under Euclidean distance), but their hash values are far apart. This hash function does not preserve similarity, so it is not a locality-sensitive hash.
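The arithmetic above is easy to check. The following is a minimal sketch that only reuses the hash function and the two values from the paragraph; the names `input_gap` and `hash_gap` are introduced here for illustration.

```python
def f(x):
    """The ordinary (not locality-sensitive) hash function from the text."""
    return (x * 7) % 10

x1, x2 = 123, 124

# The inputs are neighbours, but their hash values are far apart.
input_gap = abs(x1 - x2)
hash_gap = abs(f(x1) - f(x2))

print(f(x1), f(x2), input_gap, hash_gap)
```

Two inputs at distance 1 end up at distance 7 in a range of only 10 values, which is the behaviour a locality-sensitive hash is designed to avoid.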

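Locality sensitivity

So what makes a hash "locality-sensitive"? It means that if the original data points are similar, their hash values also stay similar to some degree. LSH is a technique that preserves a certain amount of similarity after hashing. Why only "a certain amount"? The range of a hash function is generally finite, while the data to be hashed cannot be known in advance and may far outnumber the possible hash values, so some data points will inevitably share the same hash value. Suppose we have a very good hash function that preserves the similarity of the original data well, and suppose it has 10 distinct output values. If we hash 11 mutually dissimilar data points with it, at least one bucket must contain more than one point, so similarity can only be preserved to some degree. The underlying reason is that, for similarity search, the hash function usually maps high-dimensional data into a low-dimensional space, because computing directly in the high-dimensional space is too expensive.

How LSH works

The idea in one sentence: LSH hashes data from the original space into a new space so that data points that are similar in the original space are very likely to be similar in the new space, while points that are dissimilar in the original space are unlikely to be similar in the new space.

In practice, some dimensionality reduction is usually applied before LSH.

The overall pipeline is as follows. First, the data points (raw data or extracted feature vectors) are arranged into a matrix. A first set of hash functions (several of them, drawn from some family of hash functions) then hashes this matrix into a "signature matrix", which can be thought of as the dimension-reduced data. Next, LSH hashes the signature matrix, which determines the bucket each data point finally lands in. When a new data point arrives, say the feature vector of a web page for which we want to find similar pages, we hash that feature vector and see which bucket it falls into. The pages already in that bucket are the candidate similar pages. This greatly reduces the number of pages to compare and so greatly improves efficiency.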

Pipi's blog

Finding similar sets

The problem of scale

{Why we need LSH}

A fundamental problem of scale: if we have even a million sets, the number of pairs of sets is half a trillion. We don't have the resources to compare them all, so we need some magic to focus us on the pairs that are likely to be highly similar, never looking at the vast majority of pairs.

Applications of set similarity

Note:

1. Dual: we can use the same idea backwards, where we think of a movie as the set of users who like that movie. Movies with similar sets of users can be expected to belong to the same genre.
2. Entity resolution: people create records of data about themselves at many different sites, such as Google, Amazon and Facebook. We may want to figure out when two records refer to the same individual.

Similar documents

Three essential techniques for measuring document similarity

Outline for finding similar sets

Note:

1. There will be false positives and false negatives, but by carefully choosing the parameters involved we can make the probability of false positives and negatives as small as we like.

2. The result of minhashing a set is a short vector of integers. The key property, which we'll prove, is that the expected fraction of components in which two of these vectors agree equals the similarity of the underlying sets.
3. The reason we want to replace sets by their signatures is that the signatures take up much less space. We'd like to be able to work in main memory.

Shingles

Shingles and their similarity

Shingling is how we convert documents to sets, so that documents that have a lot of text in common are converted to sets that are similar in the sense that they have a lot of members in common.

k-shingle (k-gram): a sequence of k consecutive characters in the document.

Characters: the blanks that separate the words of the document are normally considered characters. If the document involves tags, as an HTML document does, then the tags may also be considered characters, or they can be ignored.

Choice of k: a k in the range five to ten is generally used.

Note: replacing a document by its shingles still lets us detect pairs of documents that are intuitively similar.

Compressed representation of shingles: tokens

We construct the set of shingles and then hash each shingle to get a token.

Note:

1. Why compress: documents tend to consist mostly of the 26 letters, and we want most shingles not to appear in a given document, so we are often forced to use a large value of k, like k equals ten. But the number of different strings of length ten that will actually appear in any document is much smaller than 256 to the tenth power, or even 26 to the tenth power. Thus it is common to compress shingles to save space while still preserving the property that most shingles do not appear in a given document. When k is fairly large, each shingle can be hashed once. For example, with k=9 we can hash each 9-byte shingle to a 32-bit integer. This saves space and also simplifies equality tests. The two-step method takes the same space as 4-shingling but works better: characters are not uniformly distributed, so with 4-shingling a large share of the possible 4-letter combinations never occur, whereas 9-shingles hashed down to 4 bytes are spread much more evenly.

2. For example, we can hash strings of length ten to 32 bits, or four bytes, thus saving 60% of the space needed to store the shingle sets. The result of hashing a shingle is often called a token.
3. Since documents are much shorter than two to the 32nd power bytes, we can still be sure that a document's set contains only a small fraction of the possible tokens.
4. Hash collisions almost never occur. There is a small chance of a collision, where two shingles hash to the same token, which could make two documents appear to have shingles in common when in fact they have different shingles. But such an occurrence will be quite rare.
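To make the shingle-to-token step concrete, here is a minimal sketch. The notes do not say which hash function is used, so CRC-32 from Python's standard `zlib` module is an assumption chosen only because it yields a 32-bit value; the function names and the sample sentence are also illustrative.

```python
import zlib

def shingles(text, k):
    """Return the set of all k-shingles (k consecutive characters) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def tokens(text, k):
    """Hash every k-shingle to a 32-bit token.

    CRC-32 is an illustrative choice, not the hash used in the course.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(text, k)}

doc = "the quick brown fox jumps over the lazy dog"   # hypothetical document
doc_shingles = shingles(doc, 9)
doc_tokens = tokens(doc, 9)

print(len(doc_shingles), len(doc_tokens))
```

Each token fits in four bytes regardless of k, so sets of 9-shingles cost no more to store than sets of 4-shingles.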

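In some cases we need to represent sets in compressed form but still want to compute (approximately) the similarity between sets. The Minhashing method below does exactly that.

Minhashing

Jaccard similarity

{The formal definition of similarity that is commonly used for sets}

From sets of document shingles to a Boolean matrix

Note:

1. For example, if the sets come from k-shingling documents, then the universal set is the set of all possible sequences of k characters, or the set of all tokens if we hash the shingles.

2. Put simply, each row is an element (attribute) and each column is a set.

3. The entries do not count how many times an element occurs, because similarity is measured with Jaccard (only presence or absence matters, so the entries are Boolean values 0 and 1).

Jaccard similarity of two columns

Compressed representation of document shingles: the signature matrix

A minhash function maps a column C to the number of the first row, in a random permutation of the rows, in which column C has a 1. (This may not look like hashing at all, and in itself it is not, but the row number can be simulated with a hash function, which is how it is actually implemented.) Applying one such function to every column yields one row of hash values, that is, one row of the signatures. Repeating this with several hash functions over all columns gives the signature matrix, which is a compressed representation of the original matrix.

Note:

1. Each minhash function is associated with a permutation of the rows of the matrix, and about 100 such functions are typically used. For the entire matrix or collection of sets, we select the minhash functions once and apply the same minhash functions to each of the columns.

2. What is a signature: for each column, the signature is the sequence of row numbers we get when we apply each of these minhash functions in turn to the column.

Minhashing example

Suppose we have 4 web pages (treated as documents). The occurrence of terms in the pages is represented by the following matrix, where 1 means the term occurs and 0 means it does not. Occurrences are not counted, because similarity is measured with Jaccard.

Next we look for a hash function that preserves the Jaccard similarity between these documents as well as possible after hashing. The goal is a hash function h() such that if two documents have high Jaccard similarity, their hash values are equal with high probability, and if their Jaccard similarity is low, their hash values differ with high probability.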

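Definition and computation of min-hashing

First generate a set of random permutations and permute the rows of the input matrix. The hash function then maps a column C to the number of the first row, in the permuted order, in which column C has a 1.

The figure shows three permutations (the three colored columns). Consider the blue one. The permuted input matrix (in practice the rows need not actually be permuted; an index suffices) is:

For the first column, the first row containing a 1 is row 2; for the second column it is row 1; likewise, for the third and fourth columns it is rows 2 and 1. So under this permutation the four columns (four documents) are hashed to 2, 1, 2, 1, which corresponds to the blue row on the right of the figure. Each document is now 1-dimensional. (It happens that neither of those columns has been assigned a value yet, because we haven't encountered a row in which either of those columns has 1. But they both get the value 2, because the second row in the permuted order, but not the first row in that order, has 1 in each of these columns.) Note that each permutation produces one row of the signature matrix. Hashing with the other two permutations gives the other two rows on the right, so in the end each document is reduced from 7 dimensions to 3.
[Basic principles of MinHashing]

Properties of minhashing

For two documents, the probability that their minhash values are equal equals their Jaccard similarity before dimensionality reduction. In other words, the similarity of two columns in the signature matrix (the shingle/token representations of two documents) is approximately the similarity of the same two columns in the original shingle matrix.

Proof of the minhashing property

Note:

1. That is, if the first row with a non-zero entry that C1 and C2 encounter under some permutation is a type-a row (both entries are 1), then C1 and C2 have the same minhash value. Otherwise their minhash values differ.

2. A minhash value is only assigned when a 1 is encountered.

For two columns, the entries in a given row fall into three cases: a. both are 1; b+c. one is 1 and the other is 0; d. both are 0. The Jaccard similarity is clearly a/(a+b+c). On the other hand, if all permutations are equally likely, the probability that a type-a row appears before any row of type b or c is also a/(a+b+c), and only in that case do the two sets have the same minhash value. Thus the probability that the two columns will have the same minhash value is the probability that the first row that isn't of type d is a type-a row. That probability is the number of type-a rows divided by the number of rows of any of the types a, b or c, that is, a/(a+b+c). This completes the proof.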
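The property can be checked exactly on a small case by enumerating every permutation of the rows. This is a sketch under stated assumptions: the two columns `c1` and `c2` below are hypothetical, and columns are represented as the sets of row numbers that hold a 1.

```python
from fractions import Fraction
from itertools import permutations

def minhash(column, order):
    """Position of the first row, in the permuted order, where column has a 1."""
    for position, row in enumerate(order):
        if row in column:
            return position
    raise ValueError("column has no 1s")

# Hypothetical columns over rows 0..5, given as the sets of rows holding a 1.
c1 = {0, 2, 3}
c2 = {2, 3, 5}
rows = range(6)

# a / (a + b + c): type-a rows are {2, 3}; rows 0 and 5 are type b or c.
jaccard = Fraction(len(c1 & c2), len(c1 | c2))

# Count the permutations (all 720 of them) under which the minhash values agree.
orders = list(permutations(rows))
agree = sum(minhash(c1, p) == minhash(c2, p) for p in orders)
probability = Fraction(agree, len(orders))

print(jaccard, probability)
```

Because every permutation is counted once, the agreement probability is exact rather than sampled, and it coincides with a/(a+b+c). Rows 1 and 4 are type-d rows and do not affect the result.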

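This means we have found the hash function we need, and the method follows. We draw random permutations n times to obtain n minhash functions h1,h2,…,hn, and compute the n minhash values of every column in turn. For two sets, the fraction of the n values that agree is an estimate of their Jaccard similarity. Writing the n minhash values of each set as a column gives a signature matrix with n rows and C columns. Since n can be far smaller than R, the number of rows of the original matrix, the sets are now stored in compressed form while their similarity can still be computed approximately.

Note: this hash function only works for Jaccard similarity; there is no universal hash function.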

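A rigorous proof for minhashing

Let x be a term (a row of the input matrix) that satisfies:

π(x) = min(π(C1 ∪ C2))

That is, the position of term x after the permutation equals the position of the first row containing a 1 in the permuted union of columns C1 and C2 (the element-wise OR of the two columns). (π is a random permutation.)

Then the following holds:

x ∈ C1 ∪ C2

That is, term x occurs in C1 (the corresponding row of C1 is 1), or in C2, or in both. This is easy to see, because that 1 must come from C1 or C2.

The probability that term x occurs in both C1 and C2 (the row for x is 1 in both columns) equals the probability that x belongs to the intersection of C1 and C2.

So the question becomes: given that x belongs to the union of C1 and C2, what is the probability that x belongs to their intersection? Every element of the union is equally likely to come first under a random permutation, so the probability that x belongs to the intersection is the size of the intersection divided by the size of the union, which is the Jaccard similarity. Hence:

Pr[x ∈ C1 ∩ C2] = |C1 ∩ C2| / |C1 ∪ C2| = Jaccard(C1, C2)

Our hash function is

h_π(C) = min(π(C))

and h_π(C1) = h_π(C2) exactly when x ∈ C1 ∩ C2. Substituting gives:

Pr[h_π(C1) = h_π(C2)] = Jaccard(C1, C2)

This completes the proof.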

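Similarity of signatures

Note: if we use several hundred minhash functions, that is, signatures of several hundred components, we get a small enough standard deviation that we can estimate the true Jaccard similarity of the represented sets to within a few percent.

Here we have converted the sets (shingles) into signatures (the columns of the matrix), and the similarity of two signatures is the fraction of rows in which their values agree.

Let us look at the similarities after dimensionality reduction.

The table at the lower right gives the pairwise similarities of the documents after dimensionality reduction.

In the original input matrix, columns col1 and col3 have similarity 0.75, while in the signature matrix after minhashing, columns sig1 and sig3 have similarity 0.67, and so on.

Note: columns 1 and 2 had no similarity to begin with, and they have none after minhashing either. It turns out that when the similarity is zero, it is impossible for any minhash function to return the same value for these two columns.

As we can see, using the signature matrix instead of the input matrix to compute document similarity is quite accurate: documents with high Jaccard similarity have equal hash values with high probability, and documents with low Jaccard similarity have different hash values with high probability.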

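Implementing minhashing

In actual computation there is no need to generate real random permutations (and doing so is rarely practical). A hash function mapping [0..R-1] to [0..R-1] is enough. Because R is large, it does little harm if several values are occasionally mapped to the same value.

Simulating random permutations

We simulate a permutation without actually permuting the rows. For each minhash function, pick a normal sort of hash function h that hashes integers to some number of buckets, and treat h(r) as the position of row r in the permutation. Then, for each column, we look for the row r that has a 1 in that column and for which h(r) is smallest. Each minhash function uses a different h, so the rows get different hash values each time, which amounts to a fresh random permutation of the rows. In other words, h(c) is no longer the first row number met in an explicit permutation (which was not really hashing; with this change it becomes hashing in the true sense). Instead, we hash the numbers of all rows in which column c has a 1 and take the smallest of those hash values as h(c). This is equivalent to permuting the rows with the hash function.

Each hash function hi() thus corresponds to a different permutation. We compute the minhash value of every column c and obtain a matrix M, where M(i,c) is the smallest position, under the i-th simulated permutation, of a row in which column c has a 1.

The figure below gives the details:

Note:

1. With 100 minhash functions, the number of slots is 100 times the number of columns.
2. Possible collisions: it is entirely possible that hi maps two or more rows to the same value. But if we make the number of buckets into which hi hashes very large, larger than the number of rows, then the probability of a collision at the smallest value is very small, and we can ignore it.

A simple min-hashing algorithm

Note:

1. M is initialized to infinity.

2. Compute each hi(r) only once; do not recompute it. It is important that we compute hi(r) only once for each hash function and each row.

Worked example of the min-hashing algorithm

The hash functions chosen below map integers to 5 buckets (the signature slots start at infinity).

In the first step only sig1 is updated. Because the second column has a 0 in row 1, its values are unchanged and remain infinity.

Note: incidentally, notice that the two signatures disagree in both components, so they estimate the Jaccard similarity of the columns to be zero. That's off by a little, since, as you can see, the true Jaccard similarity of the columns is one fifth.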
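The row-by-row algorithm can be sketched as follows. The concrete matrix and the two hash functions h(x) = x mod 5 and g(x) = (2x+1) mod 5 are assumptions, since the original figure is not reproduced in these notes; they were chosen because they reproduce the behaviour described above (5 buckets, true Jaccard similarity 1/5, estimated similarity 0). The function name `minhash_signatures` is illustrative.

```python
import math

def minhash_signatures(matrix, hash_funcs, first_row=1):
    """Row-by-row min-hashing. matrix[r][c] is 0 or 1; returns M with M[i][c]."""
    n_cols = len(matrix[0])
    sig = [[math.inf] * n_cols for _ in hash_funcs]   # M starts at infinity
    for offset, row in enumerate(matrix):
        r = first_row + offset
        hashes = [h(r) for h in hash_funcs]           # each h_i(r) is computed once per row
        for c, bit in enumerate(row):
            if bit:                                   # only rows with a 1 can lower M(i, c)
                for i, value in enumerate(hashes):
                    if value < sig[i][c]:
                        sig[i][c] = value
    return sig

# Assumed example data: rows numbered 1..5, two columns.
matrix = [
    [1, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 1],
]
# Assumed hash functions, both mapping row numbers to 5 buckets.
hash_funcs = [lambda x: x % 5, lambda x: (2 * x + 1) % 5]

sig = minhash_signatures(matrix, hash_funcs)
sig1 = [row[0] for row in sig]
sig2 = [row[1] for row in sig]
estimate = sum(a == b for a, b in zip(sig1, sig2)) / len(sig)

print(sig1, sig2, estimate)
```

With only two hash functions the estimate is very coarse, which is why it comes out as 0 here even though the columns share one of their five rows.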

Things to watch for in an implementation

Note: the algorithm we described assumes that we can visit the matrix row by row. But often the data is available by columns and not by rows. For instance, if we have a file of documents, it's natural to process each document once, computing its shingles. That, in effect, gives us one column of the matrix.
Start with a list of row-column pairs where the ones are. Initially the list is sorted by column; sort these pairs by row.
[Locality Sensitive Hashing explained in detail]


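Locality-Sensitive Hashing (LSH)

Minhashing reduces the cost of comparing two documents, but even with a modest number of documents the number of pairs to compare can still be very large. We therefore generate candidate pairs and compute similarity only for the columns that form a candidate pair, which reduces the cost of similarity computation further.

For the signature matrix, we create a large number of hash functions (ordinary hash functions, not minhash functions). For each chosen hash function we hash the columns into buckets and treat every pair in the same bucket as a candidate pair. In other words, as soon as one hash function sends two columns to the same bucket, they become a candidate pair and their similarity is computed afterwards. Similarity is thus computed only between columns that share a bucket, rather than between all pairs of columns, which reduces the amount of computation.

Balancing the number of hash functions and the number of buckets: we need to tune the number of hash functions and the number of buckets for each hash function so that the buckets have relatively few signatures in them. That way, not too many candidate pairs are generated. But we can't use too many buckets, or else pairs that are truly similar will not wind up in the same bucket for even one of the hash functions we use.

Generating candidate pairs from signatures

Intuition

{The effect we want to achieve}

Note:

1. The threshold t is the minimum similarity for a pair to be a candidate pair. How to determine this threshold is covered later.

2. For columns c and d to be a candidate pair, the similarity of their signatures must be at least t. That similarity is the fraction of rows of the signature matrix in which they agree, that is, the fraction of rows i with M(i,c)=M(i,d). M(i,c) is the hash value of column c in row i (which corresponds to the i-th hash function).

Locality-sensitive hashing of signatures

Basic idea: columns c and d should be a candidate pair when the similarity of their signatures is at least t. For the signature matrix, we hash the columns into buckets with a large number of hash functions. The aim is that similar columns (signatures) are hashed to the same bucket as often as possible, while dissimilar columns are not.

Implementing LSH

First divide the signature matrix into bands, each consisting of some rows.

Note: b times r is the total length of the signatures, that is, the number of minhash functions we use to create the signatures.
Then hash each band into a number of buckets (different bands use different hash functions, so we create one hash function per band).

Hashing strategy: this hash function hashes only the values that a given column has in that band. Ideally, we would make one bucket for each possible vector of r values that a column could have in that band. That is, we'd like to have so many buckets that the hash function is really the identity function, but that is probably too many buckets. For example, if r equals 5 and the components of a signature are 32-bit integers, then there would be 2 to the 5 times 32, or 2 to the 160th power, buckets. We can't even look at all these buckets to see what is in them at the end. So we'll probably want to pick a smaller number of buckets, say a million or a billion.

The intuition behind this method is that if two signatures are similar in some band (they hash to the same bucket), there is a reasonable chance that they are similar overall, so they are added to the candidate pairs for the next comparison step. Originally all pairs of columns needed a similarity computation, but after hashing only two columns in the same bucket need one. Two columns in the same bucket are locally similar (they land in the same bucket as soon as one band, a part of the column, matches, which is presumably where the name locality-sensitive hashing comes from), so every pair within a bucket is a candidate pair. The only way we can be sure a pair of signatures will become a candidate pair is if they have exactly the same components in at least one of the bands. Notice that if most of the components of two signatures agree, then there's a good chance that they will have 100% agreement in some band. Otherwise they are unlikely to agree 100% in any band. There's a small chance that these segments of the columns are not identical but just happen to hash to the same bucket. We will generally neglect that probability, as it can be made tiny, like 1 in 4 billion, if we use 2 to the 32nd power buckets.

As long as there are enough buckets that two different bands are hashed to different buckets, we have: if at least one band of two documents shares a bucket, the two documents are a candidate pair, that is, they are likely to be similar.

Guidelines for choosing the band size and the similarity threshold: if b is large and r is small, many pairs are hashed to the same bucket. Thus making b large is good if the similarity threshold is relatively low.
Conversely, if you make b small and r large, then it will be very hard for two signatures to hash to the same bucket. That is best if we have a high threshold of similarity.

Note:

1. Perhaps columns six and seven will hash to the same bucket for some other hash function and will therefore become a candidate pair from that one. But looking only at this one hashing, they do not form a candidate pair.

2. It is also possible to use a single hash function for all bands, provided each band gets its own array of buckets, so that columns with the same values in different bands do not hash to the same bucket. The author thinks this may be better than using different hash functions with one shared bucket array, as described above, because collisions are less likely.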
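The banding scheme can be sketched in a few lines. This sketch follows the variant in note 2: one bucket table per band, keyed directly by the tuple of values in that band, so Python's dictionary plays the role of the hash function and two columns share a bucket only when the band is identical. The signatures and the name `lsh_candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures maps a column name to a list of b*r minhash values."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)                     # a separate bucket table per band
        for name, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])   # this column's values in the band
            buckets[key].append(name)
        for names in buckets.values():
            # every pair sharing a bucket in some band becomes a candidate pair
            candidates.update(combinations(sorted(names), 2))
    return candidates

# Hypothetical signatures of length 6, split into b=3 bands of r=2 rows.
signatures = {
    "d1": [1, 4, 2, 7, 5, 5],
    "d2": [1, 4, 3, 8, 6, 0],   # identical to d1 in band 0
    "d3": [9, 9, 3, 8, 2, 2],   # identical to d2 in band 1
    "d4": [0, 3, 1, 1, 4, 4],   # identical to nobody in any band
}
pairs = lsh_candidate_pairs(signatures, b=3, r=2)

print(sorted(pairs))
```

Only the candidate pairs then need a full signature comparison; d1 and d3 are never compared because they share no band.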

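LSH example

Note:

1. Here there are 20 bands, each with 5 rows. The rows of the signature matrix are also called components.

2. Number of pairs that might have to be compared: about 5,000,000,000, that is, the number of ways to choose 2 out of 100,000: 100,000*99,999/2.

3. Because of the randomness involved in minhashing, the columns C1 and C2 may agree in more or fewer than 80% of their rows, but the figure will be approximately 80%.

Suppose two documents correspond to columns C1 and C2 of the signature matrix, which is again divided into 20 bands of 5 rows each.

Suppose C1 and C2 are actually 80% similar. Then:

Note: 0.328 is 0.8^5, the probability that two columns C1 and C2 that are 80% similar are identical in one particular band. That is fairly small, but the probability that C1 and C2 are identical in none of the 20 bands works out to (1-0.328)^20 = 0.00035, almost 0. In other words, some band among the 20 will almost always detect that C1 and C2 are similar, and the probability of detection (of being hashed to the same bucket) is as high as 0.99965. Here 0.00035, about 1/3000, is the probability of a false negative.

Suppose C1 and C2 are actually 40% similar. Then:

Note: in this case there are about 20% false positives, but the false positive rate falls rapidly as the similarity of the underlying sets decreases. For example, for 20% Jaccard similarity we get less than 1% false positives.
Suppose C1 and C2 are actually 30% similar. Then:

The probability that a band of C1 is identical to the corresponding band of C2 is 0.3^5=0.00243, so the probability that at least one of the 20 bands of C1 is identical to the corresponding band of C2 is 1-(1-0.00243)^20=0.0474. In other words, if the two documents are 30% similar, LSH marks them as a candidate pair with probability 0.0474, so it will almost never consider them similar.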
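The figures in this example all come from one expression, 1-(1-s^r)^b, evaluated at b=20 and r=5. A short sketch that recomputes them (the function name `candidate_probability` is introduced here for illustration):

```python
def candidate_probability(s, b, r):
    """Probability that two columns with similarity s agree in at least one of b bands of r rows."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
probabilities = {s: candidate_probability(s, b, r) for s in (0.8, 0.4, 0.3, 0.2)}

for s, p in probabilities.items():
    print(f"s = {s:.1f}  P(candidate) = {p:.5f}  P(missed) = {1 - p:.5f}")
```

For s = 0.8 the interesting number is the probability of a miss (a false negative); for the lower similarities it is the probability of becoming a candidate (a false positive).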

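Analysis of LSH

The ideal case we would like

Let s be the true similarity of two sets (docs, columns, signatures) and t the similarity threshold. In the ideal case, two columns whose similarity is greater than t share a bucket with probability 1 (they are always hashed to the same bucket, so they are always reported as similar).

Put simply, this ideal step function means that when two sets are highly similar (>t) they are always placed in the same bucket, and when their similarity is low they are never placed in the same bucket (the probability of sharing a bucket is 0).

In practice: analysis of a single row of the signature matrix (two columns)

False positives and false negatives at threshold t.

Note:

1. Because the probability of two minhash values being equal is the Jaccard similarity of the underlying sets, a single row corresponds to the diagonal (the red line) in the figure.

2. That's not too bad. At least the probability goes in the right direction, but it does leave a lot of false positives and negatives.

Controlling the probability of false positives and negatives

The probabilities of false positives and negatives can be controlled by choosing the number of bands and the number of rows per band. The larger we make b and r, that is, the longer the signatures we use (the more rows in the signature matrix), the closer the S-curve will be to a step function, and therefore the fewer false positives and negatives we have. But the longer we make the signatures, the more space they take and the more work it is to perform all the minhashing. {The author's question: if the input signature matrix has a fixed size, can a larger b and r only be simulated by cascading hash functions with AND/OR constructions? [Mining Massive Datasets (MMDS) week 7: Locality-Sensitive Hashing (advanced)]}

Analysis of the S-curve: how to determine the threshold

It is hard to pick a threshold by comparing similarities directly in the signature matrix. Once b and r are introduced, theory gives a good choice: threshold = (1/b)^(1/r).

Let s be the true similarity of two sets (docs, columns, signatures). This gives the following probabilities:

1. All r rows of one band agree: s^r
2. Some row of one band disagrees: 1-s^r
3. No band is identical: (1-s^r)^b
4. At least one band is identical, so the pair becomes a candidate pair: 1-(1-s^r)^b

As b and r grow, the function 1-(1-s^r)^b rises like a step function, with the jump located at roughly (1/b)^(1/r).

The author views this threshold as an approximate fixed point of the function f(s) = 1-(1-s^r)^b, that is, a similarity s for which the probability p of hashing to the same bucket is about the same as s. As the similarity rises above this point, the probability of hashing to the same bucket rises as well, and as it falls below this point, that probability falls too. The threshold t will be approximately (1/b)^(1/r). But there are many suitable values of b and r for a given threshold.

So we should choose t = (1/b)^(1/r) as the threshold: a similarity above it indicates that the two columns are similar, and one below it indicates that they are not.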
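To see how the choice of b and r moves the threshold, the sketch below tabulates (1/b)^(1/r) for several ways of splitting a signature of length 100. The signature length of 100 and the listed splits are assumptions for illustration; the helper names are hypothetical.

```python
def threshold(b, r):
    """Approximate similarity at which the S-curve 1-(1-s^r)^b jumps."""
    return (1 / b) ** (1 / r)

def candidate_probability(s, b, r):
    return 1 - (1 - s ** r) ** b

# All splits of an assumed signature length of 100 into b bands of r rows.
splits = [(b, 100 // b) for b in (1, 2, 4, 5, 10, 20, 25, 50, 100)]
table = [(b, r, threshold(b, r)) for b, r in splits]

t = threshold(20, 5)
p_at_t = candidate_probability(t, 20, 5)   # equals 1 - (1 - 1/b)^b at the threshold

for b, r, value in table:
    print(f"b = {b:3d}  r = {r:3d}  threshold = {value:.3f}")
print(f"b = 20, r = 5: t = {t:.4f}, P(candidate at t) = {p_at_t:.4f}")
```

More bands with fewer rows lower the threshold and fewer bands with more rows raise it, in line with the guideline given for choosing b and r. At the threshold itself the candidate probability is 1-(1-1/b)^b, about 0.64 for b = 20, so the threshold is only roughly, not exactly, a fixed point.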

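Besides this, the probabilities can also be controlled with AND and OR constructions, which yield the same kind of S-curve. In fact the banding method above is just a special case of the AND-OR construction. [Mining Massive Datasets (MMDS) week 7: Locality-Sensitive Hashing (advanced)]

Worked LSH analysis

Note: The rise from 0.4 to 0.6 is more than 0.6

LSH summary

[Locality Sensitive Hashing explained in detail]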


Applications of LSH (three variants of LSH)

Application 1: entity resolution (LSH variant 1)

Entity resolution runs into many problems. For example, merging people by name can go wrong: many people share the same name, so grouping by name will merge records for different people, and worse, the same person may have their name written in different ways: middle initials, nickname versus formal name, misspellings.

Matching customer records

Note:

1. We gave 100 points each for identical names, addresses, and phone numbers, so 300 was the top score. Only 7,000 pairs of records received this top score, although we identified over 180 thousand pairs that were very likely the same person. Then we penalized differences in these three fields. Completely different names, addresses or phones got zero score, but small changes gave scores close to 100.

2. With a small spelling difference in the first names, the score for the name would be 90. If the last names were the same but the first names completely different, the score for the names would be 50.

3. How do we set the threshold without knowing ground truth, that is, which pairs of records really were created by the same individuals? This is not a job you can do with machine learning, because there's no training set available.

Note:

1. We used exactly three hash functions. One had a bucket for each possible name, the second had a bucket for each possible address, and the third had a bucket for each possible phone number. The candidate pairs were those placed in the same bucket by at least one of these hash functions.
2. Finding those extra missed pairs would probably have cost more than they were worth to either company.

How to hash names, addresses and phones

How do we hash strings such as names so there is one bucket for each string?

First hashing method: there are too many names in practice, so instead of hashing we can simply sort the records by name. Records with the same name then sit next to each other and can be scored afterwards. How can we hash to one bucket for each possible name, since there are an infinite number of possible names? We didn't really hash to buckets; rather, we sorted the records by name, so the records with identical names appear consecutively in the list and we can score each pair with identical names. After that we re-sorted by address and did the same thing with records that had identical addresses.
Second hashing method: another option was to use a few million buckets and deal with buckets that contain several different strings. Following the strategy we used when we did LSH for signatures, we could hash to, say, several million buckets and compare all pairs of records within one bucket. If the number of buckets is far larger than the number of names that actually occur in the data, the probability of a collision is tiny.
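The sort-based method can be sketched with the standard library. The records, field names and the helper `candidates_by_field` are all hypothetical; the sketch only shows how sorting on one field at a time brings identical values together so that pairs can be emitted for scoring.

```python
from itertools import combinations, groupby
from operator import itemgetter

def candidates_by_field(records, field):
    """Sort records by one field and pair up records whose values are identical."""
    ordered = sorted(records, key=itemgetter(field))
    pairs = set()
    for _, group in groupby(ordered, key=itemgetter(field)):
        ids = sorted(rec["id"] for rec in group)
        pairs.update(combinations(ids, 2))
    return pairs

# Hypothetical customer records.
records = [
    {"id": 1, "name": "ann lee", "address": "1 main st", "phone": "555-0100"},
    {"id": 2, "name": "ann lee", "address": "9 oak ave", "phone": "555-0199"},
    {"id": 3, "name": "a. lee", "address": "1 main st", "phone": "555-0100"},
    {"id": 4, "name": "bob ray", "address": "7 elm rd", "phone": "555-0142"},
]

# A pair is a candidate if it matches on at least one of the three fields.
candidate_pairs = set()
for field in ("name", "address", "phone"):
    candidate_pairs |= candidates_by_field(records, field)

print(sorted(candidate_pairs))
```

Records 1 and 3 become a candidate pair through address and phone even though their names differ, which is the point of using three separate "hash functions".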
Result validation: what score counts as sure?

Note:

1. For the identical pairs, we looked at the creation dates at companies A and B. It turns out that there was a 10-day lag on average between the time the record was created by company A and the time that the same person went to company B to begin their service.

2. To further reduce the pairs of records we needed to score, we only looked at pairs of records where the A record was created between 0 and 90 days before the B record. For pairs matched at random within this window, the average delay is 45 days.

3. If the average lag time in a pool of matches is 10 days (every match is valid), then (45-10)/35=100% of the pool is valid. Conversely, if the average lag time is 45 days (the matches are no better than random pairs, so none can be considered valid), then (45-45)/35=0% is valid.
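The two endpoints generalize to any pool. If a fraction f of a pool is valid, its average lag is 10f + 45(1-f), so an observed average lag of x days gives f = (45-x)/35. A minimal sketch, where the function name and the sample lag of 17 days are assumptions for illustration:

```python
def valid_fraction(avg_lag_days, true_lag=10.0, random_lag=45.0):
    """Estimated fraction of valid matches in a pool, from its average lag in days."""
    return (random_lag - avg_lag_days) / (random_lag - true_lag)

all_valid = valid_fraction(10)    # every match is valid
none_valid = valid_fraction(45)   # no better than random pairs
mixed = valid_fraction(17)        # hypothetical pool with a 17-day average lag

print(all_valid, none_valid, mixed)
```

A pool whose average lag is 17 days would therefore be estimated as 80% valid, which is how a score threshold can be judged without ground truth.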
Which data fields can be used for validation?
Any field not used in the LSH could have been used to validate, provided corresponding values were closer for true matches than for false ones.

Application 2: Minutiae: A New Way of Bucketing

(This does not need the original LSH to create buckets, let alone shingles and minhashing.)

Fingerprint matching

Representing fingerprints

Note: minutiae are particular locations where something interesting happens to the ridges that form a fingerprint. Examples are where two ridges merge into one or where a ridge ends. So the image of a fingerprint is replaced by a set of coordinates in two-dimensional space where minutiae are located.

LSH in fingerprint matching

A fingerprint is represented by the set of grid squares that contain minutiae.

Note:

1. The grid must be scaled and oriented properly.

2. Since some minutiae will be right on or near a boundary, it is useful to regard such minutiae as present in the squares on both sides of the boundary.
3. Reformulating the fingerprint-matching problem: we have reduced the problem of finding matching fingerprints to the problem of finding similar sets of grid squares that have minutiae.
The difficulty in fingerprint matching: the matrix is dense, and min-hashing works better on sparse matrices; otherwise it may fail to separate different fingerprints. The problem is that the resulting matrix is not sparse. The grid cannot be too fine, or it will be unclear where minutiae belong. As a result, the matrix whose rows are the grid squares and whose columns are the fingerprint sets will not be sparse. That means minhashing will not work very well. Each minhash will have relatively few different values, so we don't get a good distribution into a large number of buckets when we do the LSH.
Discretizing minutiae

Note: the crossing point in the figure is a minutia and is added to the fingerprint's representation, and the grid squares very close to it are added to the set as well. It appears the point of merger lies within this grid square, so we add that square to the set representing the fingerprint. We might also want to add the squares that are very close to the exact point of merger, because in another image of the same fingerprint the grid might be shifted slightly to the left or down.
Applying LSH to fingerprints

Representation as a bit-vector

Candidate pairs of fingerprints

Note:

1. Pick 1,024 sets at random, each containing 3 grid squares. For our LSH, we pick some number of sets of grid squares, that is, sets of components of the bit-vectors that represent fingerprints. In our example, we'll use 1,024 sets of three grid squares each. For each set of three squares, we look at all the prints that have minutiae in each of these three squares.
2. If for some set both fingerprints have a 1 in all 3 of its squares, the two fingerprints are a candidate pair (to be compared further to decide whether they are the same).
3. The chosen sets act like the buckets in LSH. But unlike with a hash function, a fingerprint can be placed in many buckets.
LSH/fingerprint example

Note:

1. Each fingerprint has minutiae in 20% of the grid squares, and if one of two prints of the same finger has a minutia in a given square, the other has one there too with probability 80%. The fact that we place minutiae in nearby squares if they are at the boundary helps make this assumption true.

2. If the fingerprints come from different fingers, then the probability that both prints are placed in this bucket (a set of 3 squares) is really tiny.
3. The probability that one fingerprint has minutiae in all 3 squares is (0.2)^3, and for two fingerprints from different fingers it is (0.2)^6.
Analysis of the probability of matching identical fingerprints

Note: by using a larger number of sets of squares, and perhaps four or five squares per set, we can reduce the false positive rate substantially while still keeping the false negative rate low.
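The numbers in this example can be combined into overall match probabilities. The sketch below is an illustration under the assumptions stated in the notes (20% of squares hold a minutia, 80% agreement between prints of the same finger, 1,024 sets of 3 squares); treating the squares as independent is an additional simplifying assumption, and the variable names are introduced here.

```python
p_minutia = 0.2       # chance that a print has a minutia in a given square
p_agree = 0.8         # chance the other print of the same finger has one there too
squares_per_set = 3
n_sets = 1024

# Probability that both prints land in one particular bucket (set of 3 squares).
p_diff_one = (p_minutia ** squares_per_set) ** 2        # different fingers: 0.2^6
p_same_one = (p_minutia * p_agree) ** squares_per_set   # same finger: (0.2 * 0.8)^3

# Probability that they share at least one of the 1,024 buckets.
p_same_any = 1 - (1 - p_same_one) ** n_sets   # chance a true match becomes a candidate
p_diff_any = 1 - (1 - p_diff_one) ** n_sets   # chance two different fingers become a candidate

print(f"{p_same_any:.3f} {p_diff_any:.3f}")
```

Under these assumptions roughly 1.5% of true matches are missed, while about 6% of pairs from different fingers become candidates and have to be compared in full.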

Application 3: A New Way of Shingling: Bucketing by Length

(LSH variant 3: a variant shingling technique; in fact neither minhashing nor LSH is used.)

Finding duplicate news articles

How this differs from computing general document similarity: there is a special way of shingling that works well when the differences are mostly in the ads associated with the article. We shall also talk about a simple bucketing method that works when the number of sets is not too great.

Why identifying news articles is hard

Implementation with minhashing + LSH

They invented a form of shingling that is probably better than the standard approach we covered for web pages of the type we just described. And they invented a simple substitute for LSH that worked adequately well for the scale of the problem. They partitioned the pages into groups of similar length and only compared pages in the same group or nearby groups.

Minhashing + LSH runs faster and works better: they found that minhashing + LSH was better as long as the similarity threshold was less than 80%.

Note: a small tip for improving efficiency and reducing errors: do the minhashing row by row, where you compute the hash value for each row number once and for all rather than once for each column. Remember that the rows correspond to the shingles and the columns to the web pages.
The special shingling technique

Note:

1. The key observation was to give more weight to the articles themselves than to the ads and other elements surrounding the article. That is, they did not want to identify as similar two articles from the same newspaper with the same ads and other elements but different underlying stories.
2. "Buy Sudzo" is an ad, while "I recommend ..." is article text (it contains more stop words).
3. Under their definition of a shingle, the article text becomes the shingles: I recommend that, that you buy...
4. Notice that there are relatively few shingles, and the definition does not guarantee that each word is part of even one shingle.
Analysis of the shingling variant

The reason this notion of shingle makes sense is that it biases the set of shingles for a page in favor of the news article. Article text contains stop words, so the shingles are drawn from the article rather than from the ads. So these two pages have almost the same shingles, and therefore have very high Jaccard similarity.
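A sketch of this kind of shingling is given below. It assumes that a shingle is a stop word together with the next two words, which is consistent with the example shingles "I recommend that" and "that you buy" but is not spelled out in these notes. The stop-word list, the sample sentence and the function name are hypothetical.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"i", "that", "you", "for", "your", "the", "a", "to", "and"}

def stopword_shingles(text, following=2):
    """Shingles made of a stop word plus the next `following` words (assumed definition)."""
    words = text.lower().split()
    result = set()
    for i, word in enumerate(words):
        # a shingle is only formed when enough words follow the stop word
        if word in STOP_WORDS and i + following < len(words):
            result.add(" ".join(words[i:i + following + 1]))
    return result

article = "I recommend that you buy Sudzo for your laundry"
ad = "Buy Sudzo"

article_shingles = stopword_shingles(article)
ad_shingles = stopword_shingles(ad)

print(sorted(article_shingles), ad_shingles)
```

The ad contributes no shingles at all because it contains no stop words, so two pages carrying the same article but different ads end up with nearly identical shingle sets.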


Distance measures for LSH

Mining Massive Datasets - week2: distance measures for LSH

Review

Shingles

Note: 2-shingles for ABRACADABRA: AB BR RA AC CA AD DA (7 in total)

2-shingles for BRICABRAC: BR RI IC CA AB RA AC (7 in total)

2-shingles in common: BR CA AB RA AC (5 in total)

Jaccard similarity: 5/(7+7-5) = 5/9
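This review exercise can be reproduced directly. The helper names `shingles` and `jaccard` are introduced here for illustration; `Fraction` is used so the result is the exact ratio rather than a rounded float.

```python
from fractions import Fraction

def shingles(text, k):
    """Set of all k-shingles (k consecutive characters) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| as an exact fraction."""
    return Fraction(len(a & b), len(a | b))

s1 = shingles("ABRACADABRA", 2)
s2 = shingles("BRICABRAC", 2)
common = s1 & s2
similarity = jaccard(s1, s2)

print(len(s1), len(s2), sorted(common), similarity)
```

Repeated shingles such as AB, BR and RA in ABRACADABRA count once because the shingles form a set, which is why each word has 7 distinct 2-shingles.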

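Min-Hashing

Note: R4 gives C3 the value R4; R6 gives C2 the value R6; R1 changes nothing; R3 gives C4 the value R3; R5 gives C1 the value R5; R2 changes nothing.

So: C1 C2 C3 C4

R5 R6 R4 R3

LSH

Note: the candidate pairs hashed to the same bucket in band 1 are C1-C4 and C2-C5

The candidate pairs hashed to the same bucket in band 2 are C1-C6

The candidate pairs hashed to the same bucket in band 3 are C1-C3 and C4-C7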

from:http://blog.csdn.net/pipisorry/article/details/48858661

ref: Mining Massive Datasets (MMDS) week 7: Locality-Sensitive Hashing (advanced)

Redis布隆过滤器详解枸杞配码 redis 数据库缓存
1.布隆过滤器是什么redis的布隆过滤器其实有点像我们之前学习过的hyperloglog深入理解redis——新类型bitmap/hyperloglgo/GEO，它也是不保存元素的一个集合，它也不保存元素的具体内容，但是能判定这个元素是否在这个集合中存在（hyperloglog是判定集合中存在的不重复元素的个数）。1）它是由一个初值都为零的bit数组和多个哈希函数构成，用来快速判断某个数据是否存
Redis（十五）Bitmap、Hyperloglog、GEO案例、布隆过滤器 Lucky_Turtle Java redis 面试数据库
文章目录面试题常见统计类型聚合统计排序统计二值统计基数统计Hyperloglog专有名词UV（UniqueVisitor）独立访客PV（PageView）页面浏览量DAU（DailyActiveUser）日活跃用户量MAU（MonthlyActiveUser）需求原理亿级UV的Redis统计方案GEO面试题命令GEOADD获取某位置的经纬度GEOPOS返回坐标的Geohash表示GEOHASH两个
C#区块链共识的3大必杀技：PoW、PoS、DPoS谁才是代码界的“链主”？墨瑾轩一起学学C#【二】c#区块链开发语言
关注墨瑾轩，带你探索编程的奥秘！超萌技术攻略，轻松晋级编程高手技术宝库已备好，就等你来挖掘订阅墨瑾轩，智趣学习不孤单即刻启航，编程之旅更有趣**3大必杀技，让你的代码成为“链主”**必杀技1：工作量证明（PoW）——“算力擂台赛”问题：为什么比特币的“矿工”要疯狂算哈希？答案：因为他们在参与“算力擂台赛”！PoW核心逻辑：
SAX解析xml文件小猪猪08 xml
1.创建SAXParserFactory实例 2.通过SAXParserFactory对象获取SAXParser实例 3.创建一个类SAXParserHander继续DefaultHandler，并且实例化这个类 4.SAXParser实例的parse来获取文件 public static void main(String[] args) { //
为什么mysql里的ibdata1文件不断的增长？ brotherlamp linux linux运维 linux资料 linux视频 linux运维自学
我们在 Percona 支持栏目经常收到关于 MySQL 的 ibdata1 文件的这个问题。当监控服务器发送一个关于 MySQL 服务器存储的报警时，恐慌就开始了 —— 就是说磁盘快要满了。一番调查后你意识到大多数地盘空间被 InnoDB 的共享表空间 ibdata1 使用。而你已经启用了 innodbfileper_table，所以问题是： ibdata1存了什么？当你启用了 i
Quartz-quartz.properties配置 eksliang quartz
其实Quartz JAR文件的org.quartz包下就包含了一个quartz.properties属性配置文件并提供了默认设置。如果需要调整默认配置，可以在类路径下建立一个新的quartz.properties，它将自动被Quartz加载并覆盖默认的设置。下面是这些默认值的解释 #-----集群的配置 org.quartz.scheduler.instanceName =
informatica session的使用 18289753290 workflow session log Informatica
如果希望workflow存储最近20次的log，在session里的Config Object设置，log options做配置，save session log :sessions run ;savesessio log for these runs:20 session下面的source 里面有个tracing
Scrapy抓取网页时出现CRC check failed 0x471e6e9a != 0x7c07b839L的错误酷的飞上天空 scrapy
Scrapy版本0.14.4 出现问题现象： ERROR: Error downloading <GET http://xxxxx CRC check failed 解决方法 1.设置网络请求时的header中的属性'Accept-Encoding': '*;q=0' 明确表示不支持任何形式的压缩格式，避免程序的解压
java Swing小集锦永夜-极光 java swing
1.关闭窗体弹出确认对话框 1.1 this.setDefaultCloseOperation (JFrame.DO_NOTHING_ON_CLOSE); 1.2 this.addWindowListener ( new WindowAdapter () { public void windo
强制删除.svn文件夹随便小屋 java
在windows上，从别处复制的项目中可能带有.svn文件夹，手动删除太麻烦，并且每个文件夹下都有。所以写了个程序进行删除。因为.svn文件夹在windows上是只读的，所以用File中的delete()和deleteOnExist()方法都不能将其删除，所以只能采用windows命令方式进行删除
GET和POST有什么区别？及为什么网上的多数答案都是错的。 aijuans get post
如果有人问你，GET和POST，有什么区别？你会如何回答？我的经历前几天有人问我这个问题。我说GET是用于获取数据的，POST，一般用于将数据发给服务器之用。这个答案好像并不是他想要的。于是他继续追问有没有别的区别？我说这就是个名字而已，如果服务器支持，他完全可以把G
谈谈新浪微博背后的那些算法 aoyouzi 谈谈新浪微博背后的那些算法
本文对微博中常见的问题的对应算法进行了简单的介绍，在实际应用中的算法比介绍的要复杂的多。当然，本文覆盖的主题并不全，比如好友推荐、热点跟踪等就没有涉及到。但古人云“窥一斑而见全豹”，希望本文的介绍能帮助大家更好的理解微博这样的社交网络应用。微博是一个很多人都在用的社交应用。天天刷微博的人每天都会进行着这样几个操作：原创、转发、回复、阅读、关注、@等。其中，前四个是针对短博文，最后的关注和@则针
Connection reset 连接被重置的解决方法百合不是茶 java 字符流连接被重置
流是java的核心部分,,昨天在做android服务器连接服务器的时候出了问题,就将代码放到java中执行,结果还是一样连接被重置被重置的代码如下; 客户端代码; package 通信软件服务器; import java.io.BufferedWriter; import java.io.OutputStream; import java.io.O
web.xml配置详解之filter bijian1013 java web.xml filter
一.定义 <filter> <filter-name>encodingfilter</filter-name> <filter-class>com.my.app.EncodingFilter</filter-class> <init-param> <param-name>encoding<
Heritrix Bill_chen 多线程 xml 算法制造配置管理
作为纯Java语言开发的、功能强大的网络爬虫Heritrix，其功能极其强大，且扩展性良好，深受热爱搜索技术的盆友们的喜爱，但它配置较为复杂，且源码不好理解，最近又使劲看了下，结合自己的学习和理解，跟大家分享Heritrix的点点滴滴。 Heritrix的下载（http://sourceforge.net/projects/archive-crawler/）安装、配置，就不罗嗦了，可以自己找找资
【Zookeeper】FAQ bit1129 zookeeper
1.脱离IDE，运行简单的Java客户端程序 #ZkClient是简单的Zookeeper~$ java -cp "./:zookeeper-3.4.6.jar:./lib/*" ZKClient 1. Zookeeper是的Watcher回调是同步操作，需要添加异步处理的代码 2. 如果Zookeeper集群跨越多个机房，那么Leader/
The user specified as a definer ('aaa'@'localhost') does not exist 白糖_ localhost
今天遇到一个客户BUG，当前的jdbc连接用户是root，然后部分删除操作都会报下面这个错误：The user specified as a definer ('aaa'@'localhost') does not exist 最后找原因发现删除操作做了触发器，而触发器里面有这样一句 /*!50017 DEFINER = ''aaa@'localhost' */ 原来最初
javascript中showModelDialog刷新父页面 bozch JavaScript 刷新父页面 showModalDialog
在页面中使用showModalDialog打开模式子页面窗口的时候，如果想在子页面中操作父页面中的某个节点，可以通过如下的进行： window.showModalDialog('url',self,‘status...’); // 首先中间参数使用self 在子页面使用w
编程之美-买书折扣 bylijinnan 编程之美
import java.util.Arrays; public class BookDiscount { /**编程之美买书折扣书上的贪心算法的分析很有意思，我看了半天看不懂，结果作者说，贪心算法在这个问题上是不适用的。。下面用动态规划实现。哈利波特这本书一共有五卷，每卷都是8欧元，如果读者一次购买不同的两卷可扣除5%的折扣，三卷10%，四卷20%，五卷
关于struts2.3.4项目跨站执行脚本以及远程执行漏洞修复概要 chenbowen00 struts WEB安全
因为近期负责的几个银行系统软件，需要交付客户，因此客户专门请了安全公司对系统进行了安全评测，结果发现了诸如跨站执行脚本，远程执行漏洞以及弱口令等问题。下面记录下本次解决的过程以便后续 1、首先从最简单的开始处理，服务器的弱口令问题，首先根据安全工具提供的测试描述中发现应用服务器中存在一个匿名用户，默认是不需要密码的，经过分析发现服务器使用了FTP协议，而使用ftp协议默认会产生一个匿名用
[电力与暖气]煤炭燃烧与电力加温 comsci
在宇宙中,用贝塔射线观测地球某个部分,看上去,好像一个个马蜂窝,又像珊瑚礁一样,原来是某个国家的采煤区..... 不过,这个采煤区的煤炭看来是要用完了.....那么依赖将起燃烧并取暖的城市,在极度严寒的季节中...该怎么办呢? &nbs
oracle O7_DICTIONARY_ACCESSIBILITY参数 daizj oracle
O7_DICTIONARY_ACCESSIBILITY参数控制对数据字典的访问.设置为true,如果用户被授予了如select any table等any table权限,用户即使不是dba或sysdba用户也可以访问数据字典.在9i及以上版本默认为false,8i及以前版本默认为true.如果设置为true就可能会带来安全上的一些问题.这也就为什么O7_DICTIONARY_ACCESSIBIL
比较全面的MySQL优化参考 dengkane mysql
本文整理了一些MySQL的通用优化方法，做个简单的总结分享，旨在帮助那些没有专职MySQL DBA的企业做好基本的优化工作，至于具体的SQL优化，大部分通过加适当的索引即可达到效果，更复杂的就需要具体分析了，可以参考本站的一些优化案例或者联系我，下方有我的联系方式。这是上篇。 1、硬件层相关优化 1.1、CPU相关在服务器的BIOS设置中，可
C语言homework2，有一个逆序打印数字的小算法 dcj3sjt126com c
#h1# 0、完成课堂例子 1、将一个四位数逆序打印 1234 ==> 4321 实现方法一： # include <stdio.h> int main(void) { int i = 1234; int one = i%10; int two = i / 10 % 10; int three = i / 100 % 10;
apacheBench对网站进行压力测试 dcj3sjt126com apachebench
ab 的全称是 ApacheBench ，是 Apache 附带的一个小工具，专门用于 HTTP Server 的 benchmark testing ，可以同时模拟多个并发请求。前段时间看到公司的开发人员也在用它作一些测试，看起来也不错，很简单，也很容易使用，所以今天花一点时间看了一下。通过下面的一个简单的例子和注释，相信大家可以更容易理解这个工具的使用。
2种办法让HashMap线程安全 flyfoxs java jdk jni
多线程之--2种办法让HashMap线程安全多线程之--synchronized 和reentrantlock的优缺点多线程之--2种JAVA乐观锁的比较( NonfairSync VS. FairSync) HashMap不是线程安全的,往往在写程序时需要通过一些方法来回避.其实JDK原生的提供了2种方法让HashMap支持线程安全.
Spring Security（04）——认证简介 234390216 Spring Security 认证过程
认证简介目录 1.1 认证过程 1.2 Web应用的认证过程 1.2.1 ExceptionTranslationFilter 1.2.2 在request之间共享SecurityContext 1
Java 位运算 Javahuhui java 位运算
// 左移( << ) 低位补0 // 0000 0000 0000 0000 0000 0000 0000 0110 然后左移2位后，低位补0： // 0000 0000 0000 0000 0000 0000 0001 1000 System.out.println(6 << 2);// 运行结果是24 // 右移( >> ) 高位补"
mysql免安装版配置 ldzyz007 mysql
1、my-small.ini是为了小型数据库而设计的。不应该把这个模型用于含有一些常用项目的数据库。 2、my-medium.ini是为中等规模的数据库而设计的。如果你正在企业中使用RHEL,可能会比这个操作系统的最小RAM需求(256MB)明显多得多的物理内存。由此可见，如果有那么多RAM内存可以使用，自然可以在同一台机器上运行其它服务。 3、my-large.ini是为专用于一个SQL数据
MFC和ado数据库使用时遇到的问题你不认识的休道人 sql C++mfc
=================================================================== 第一个 =================================================================== try{ CString sql; sql.Format("select * from p
表单重复提交Double Submits rensanning double
可能发生的场景： *多次点击提交按钮 *刷新页面 *点击浏览器回退按钮 *直接访问收藏夹中的地址 *重复发送HTTP请求（Ajax）（1）点击按钮后disable该按钮一会儿，这样能避免急躁的用户频繁点击按钮。这种方法确实有些粗暴，友好一点的可以把按钮的文字变一下做个提示，比如Bootstrap的做法： http://getbootstrap.co
Java String 十大常见问题 tomcat_oracle java 正则表达式
　1.字符串比较，使用“==”还是equals()? 　　"=="判断两个引用的是不是同一个内存地址(同一个物理对象)。　　equals()判断两个字符串的值是否相等。　　除非你想判断两个string引用是否同一个对象，否则应该总是使用equals()方法。　　如果你了解字符串的驻留(String Interning)则会更好地理解这个问题。　　
SpringMVC 登陆拦截器实现登陆控制 xp9802 springMVC
思路，先登陆后，将登陆信息存储在session中，然后通过拦截器，对系统中的页面和资源进行访问拦截，同时对于登陆本身相关的页面和资源不拦截。实现方法： 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23