CS521 Advanced Algorithm Design Study Notes (1): Course Intro and Hashing

Lecture 1 Course Intro and Hashing

Aim: how to design and analyze a variety of algorithms.

Graduate algorithms is largely about algorithms discovered since 1990. There has been a gradual shift in the emphasis and goals of CS as it became a more mature field: it enjoys a broadened horizon and has started looking at new problems, like big data, e-commerce, and bioinformatics.

Many more realistic concerns now come into play:

  1. Changing problems: formulating the problem itself is not easy.
  2. Changing data: data come from sources we do not control, e.g., they may be noisy or inexact.
  3. Changing notion of I/O: data may come from data streams, online sources, social network graphs, etc. The appropriate output is hard to pin down.
  4. Changing analysis: from exact algorithms that work on all inputs to approximation.

Hashing

Preliminaries

We want to store a subset $S$, with $|S| = m$, of a huge universe $U$ (e.g., $|U| = 2^{32}$), and to support three operations on $S$: insert, delete, and query. A hash table is exactly what we need.

Define a hash function $h: U \rightarrow [n]$. We then only need $n$ slots, one for each hash value in $0 \sim n-1$, and elements that collide are chained together in a linked list.
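As a minimal sketch of this setup in Python (the class name `ChainedHashTable` and the use of Python's built-in `hash` in place of $h$ are my own illustrative choices, not from the lecture):

```python
class ChainedHashTable:
    """Hash table with chaining: n buckets, colliding elements kept in a per-bucket list."""

    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _index(self, x):
        # Map a key from U to a bucket in [n]; Python's built-in hash stands in for h.
        return hash(x) % self.n

    def insert(self, x):
        bucket = self.buckets[self._index(x)]
        if x not in bucket:      # S is a set, so avoid duplicates
            bucket.append(x)

    def delete(self, x):
        bucket = self.buckets[self._index(x)]
        if x in bucket:
            bucket.remove(x)

    def query(self, x):
        return x in self.buckets[self._index(x)]


t = ChainedHashTable(n=8)
t.insert(42); t.insert(7); t.delete(7)
print(t.query(42), t.query(7))   # True False
```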

Two possible assumptions: (1) the input is random; (2) the input is arbitrary (given), but the hash function is chosen at random.

Hash Functions

How do we define what it means for a hash function to be random?

Ideally, for given $x_1, \dots, x_m \in S$ and arbitrary $a_1, \dots, a_m \in [n]$, a random hash family $\mathcal{H}$ should satisfy:

  • $\operatorname{Pr}_{h \in \mathcal{H}}\left[h(x_1) = a_1\right] = \frac{1}{n}$.
  • $\operatorname{Pr}_{h \in \mathcal{H}}\left[h(x_1) = a_1 \wedge h(x_2) = a_2\right] = \frac{1}{n^2}$. Pairwise independence.
  • $\operatorname{Pr}_{h \in \mathcal{H}}\left[h(x_1) = a_1 \wedge h(x_2) = a_2 \wedge \cdots \wedge h(x_k) = a_k\right] = \frac{1}{n^k}$. $k$-wise independence.
  • $\operatorname{Pr}_{h \in \mathcal{H}}\left[h(x_1) = a_1 \wedge h(x_2) = a_2 \wedge \cdots \wedge h(x_m) = a_m\right] = \frac{1}{n^m}$. Full independence (note that here $m = |U|$). In this case there are $n^m$ possible functions $h$ (we would store $h(x)$ for each $x \in U$), so we need $m \log n$ bits to represent each hash function. Since $m$ is usually very large, this is not practical.

For any $x$, let $L_x$ be the length of the linked list containing $x$; then $L_x$ is just the number of elements with the same hash value as $x$. Let the indicator random variable

$$I_y = \begin{cases} 1 & \text{if } h(y) = h(x) \\ 0 & \text{otherwise.} \end{cases}$$

So $L_x = 1 + \sum_{y \neq x} I_y$, and

$$E\left[L_x\right] = 1 + \sum_{y \neq x} E\left[I_y\right] = 1 + \frac{m-1}{n}.$$

Note that we don't need full independence to prove this property; pairwise independence would actually suffice.
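As a quick sanity check of $E[L_x] = 1 + \frac{m-1}{n}$, here is a small simulation with a fully random hash function (the function name and the parameters below are arbitrary choices of mine):

```python
import random

def avg_chain_length(m=1000, n=1000, trials=200):
    """Empirically estimate E[L_x]: hash m items into n buckets uniformly at
    random and measure the chain length of one fixed item x (here item 0)."""
    total = 0
    for _ in range(trials):
        h = [random.randrange(n) for _ in range(m)]  # fully random hash values
        total += sum(1 for v in h if v == h[0])      # L_x = items sharing x's bucket (incl. x)
    return total / trials

# Theory predicts 1 + (m-1)/n = 1 + 999/1000 ≈ 2.0 for m = n = 1000.
print(avg_chain_length())
```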

2-Universal Hash

A family of functions $\mathcal{H}$ is 2-universal if for any $x \neq y \in U$, $\operatorname{Pr}_{h \in \mathcal{H}}[h(x) = h(y)] \leq \frac{1}{n}$. (This is weaker than pairwise independence.)

Construction: pick a prime $p \in [|U|, 2|U|]$, let $f_{a,b}(x) = ax + b \bmod p$ (with $a \neq 0$), and define the hash function $h_{a,b}(x) = f_{a,b}(x) \bmod n$.
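A sketch of this construction in Python; the naive prime search and the class name `TwoUniversalHash` are my own illustrative choices, not from the lecture:

```python
import random

def is_prime(q):
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def find_prime_at_least(u):
    """Return a prime p >= u (Bertrand's postulate guarantees one below 2u)."""
    p = u
    while not is_prime(p):
        p += 1
    return p

class TwoUniversalHash:
    """h_{a,b}(x) = ((a*x + b) mod p) mod n with random a != 0 and b."""

    def __init__(self, universe_size, n):
        self.p = find_prime_at_least(universe_size)
        self.n = n
        self.a = random.randrange(1, self.p)   # a != 0
        self.b = random.randrange(self.p)

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.n


h = TwoUniversalHash(universe_size=2**20, n=97)
print(h(12345), h(54321))
```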

Fix $x_1 \neq x_2$ and any $s \neq t \in [p]$. Since $[p]$ constitutes a finite field, the system $f_{a,b}(x_1) = s$, $f_{a,b}(x_2) = t$ has the unique solution $a = (x_1 - x_2)^{-1}(s - t)$ and $b = s - a x_1$. Since there are $p(p-1)$ different hash functions in $\mathcal{H}$,

$$\operatorname{Pr}_{h \in \mathcal{H}}\left[f_{a,b}(x_1) = s \wedge f_{a,b}(x_2) = t\right] = \frac{1}{p(p-1)}.$$
Theorem: $\mathcal{H} = \left\{h_{a,b} : a, b \in [p] \wedge a \neq 0\right\}$ is 2-universal.

Proof: For any $x_1 \neq x_2$,

$$\begin{aligned}
\operatorname{Pr}\left[h_{a,b}(x_1) = h_{a,b}(x_2)\right] &= \sum_{s, t \in [p],\, s \neq t} \delta_{(s \equiv t \bmod n)} \operatorname{Pr}\left[f_{a,b}(x_1) = s \wedge f_{a,b}(x_2) = t\right] \\
&= \frac{1}{p(p-1)} \sum_{s, t \in [p],\, s \neq t} \delta_{(s \equiv t \bmod n)} \\
&\leq \frac{1}{p(p-1)} \cdot \frac{p(p-1)}{n} \\
&= \frac{1}{n}.
\end{aligned}$$
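As an illustrative sanity check of the theorem (not part of the lecture), we can enumerate the entire family for tiny parameters, say $p = 7$ and $n = 3$, and verify that every pair $x_1 \neq x_2$ collides under at most a $1/n$ fraction of the functions:

```python
from itertools import product

p, n = 7, 3  # tiny parameters so the whole family can be enumerated

def h(a, b, x):
    return ((a * x + b) % p) % n

worst = 0.0
for x1, x2 in product(range(p), repeat=2):
    if x1 == x2:
        continue
    collisions = sum(1 for a in range(1, p) for b in range(p)
                     if h(a, b, x1) == h(a, b, x2))
    worst = max(worst, collisions / (p * (p - 1)))

print(worst, "<=", 1 / n)  # 2-universality: worst-case collision probability <= 1/n
```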
Note that if we choose $n$ large enough that $n \geq m^2$, the expected number of collisions in the hash table is

$$E\Big[\sum_{x_1 \neq x_2} \mathbf{1}_{[h(x_1) = h(x_2)]}\Big] \leq \binom{m}{2}\frac{1}{n} \leq \frac{1}{2},$$

so by Markov's inequality, a randomly chosen $h \in \mathcal{H}$ is collision-free on $S$ with probability at least $\frac{1}{2}$.
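Resampling $h$ until $S$ is collision-free therefore takes at most two attempts in expectation. A self-contained sketch with arbitrary toy parameters (and an inline copy of the $((ax+b) \bmod p) \bmod n$ family):

```python
import random

S = random.sample(range(10**6), 100)      # m = 100 keys from a universe of size 10^6
p, n = 1_000_003, 100 * 100               # prime p >= |U|, table size n = m^2

def sample_h():
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % n

attempts = 1
h = sample_h()
while len({h(x) for x in S}) < len(S):    # any collision? resample
    h = sample_h()
    attempts += 1

print("collision-free after", attempts, "attempts")  # expected <= 2
```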

However, there is a better way to avoid the $n \geq m^2$ space blow-up: two-level hashing. Suppose slot $i$ receives $s_i$ colliding elements; we build a secondary hash table of size $s_i^2$ for that slot, which (by the argument above) can hold all of them without collisions.

Note that $E\big(\sum_i s_i^2\big) = E\big(\sum_i s_i(s_i - 1)\big) + E\big(\sum_i s_i\big) \leq \frac{m(m-1)}{n} + m \leq 2m$ (taking $n \geq m$ for the top-level table), so the total size of the secondary tables is only $O(m)$ in expectation.
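Here is a compact sketch of the whole two-level scheme (the classic FKS perfect-hashing idea); the helper names are mine, and each secondary table simply resamples its hash function until it is collision-free:

```python
import random

def make_hash(p, n):
    """One random member of the ((a*x + b) mod p) mod n family."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % n

def build_two_level(S, p):
    m = len(S)
    top = make_hash(p, m)                      # top level: n = m buckets
    buckets = [[] for _ in range(m)]
    for x in S:
        buckets[top(x)].append(x)

    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2                # secondary table of size s_i^2
        if size == 0:
            tables.append((None, []))
            continue
        while True:                            # resample until this bucket is collision-free
            h2 = make_hash(p, size)
            slots = [None] * size
            ok = True
            for x in bucket:
                if slots[h2(x)] is not None:
                    ok = False
                    break
                slots[h2(x)] = x
            if ok:
                tables.append((h2, slots))
                break
    return top, tables

def lookup(top, tables, x):
    h2, slots = tables[top(x)]
    return h2 is not None and slots[h2(x)] == x

S = random.sample(range(10**6), 200)
top, tables = build_two_level(S, p=1_000_003)
assert all(lookup(top, tables, x) for x in S)
print("total secondary space:", sum(len(slots) for _, slots in tables), "for m =", len(S))
```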

Load Balancing

In the load-balancing problem, imagine throwing balls into bins. If we throw $n$ balls into $n$ bins uniformly at random, then the probability that bin $i$ receives at least $k$ balls is at most $\binom{n}{k}\frac{1}{n^k}$ (a union bound over which $k$ balls land in bin $i$; treating all other bins as one outcome, the load of bin $i$ is binomially distributed), which is at most $\frac{1}{k!}$ since $\binom{n}{k} \leq \frac{n^k}{k!}$.

By Stirling's formula, choosing $k = O\big(\frac{\log n}{\log \log n}\big)$ gives $\frac{1}{k!} \leq \frac{1}{n^2}$. Hence, by a union bound over the $n$ bins, the probability that some bin has at least $k$ balls is at most $\frac{1}{n}$.

Therefore, with probability at least $1 - \frac{1}{n}$, the maximum load is $O\big(\frac{\log n}{\log \log n}\big)$.
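To get a feel for this threshold numerically, the following toy computation (my own check, not from the lecture) finds the smallest $k$ with $k! \geq n^2$ for a few values of $n$; it grows very slowly, matching the $\Theta\big(\frac{\log n}{\log\log n}\big)$ scale asymptotically:

```python
for n in [10**3, 10**6, 10**9]:
    k, fact = 1, 1
    while fact < n ** 2:      # smallest k with k! >= n^2, i.e. 1/k! <= 1/n^2
        k += 1
        fact *= k
    print(f"n = {n:>10}: smallest k with k! >= n^2 is {k}")
    # prints k = 10, 15, 20 for n = 10^3, 10^6, 10^9
```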

Improvement (the power of two choices): when a ball arrives, pick two bins uniformly at random and place the ball in the one with fewer balls. Then the maximum load is $O(\log \log n)$ with high probability, a huge improvement!
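A quick simulation contrasting the two placement rules (one random bin versus the better of two); the parameters and function names are my own arbitrary choices:

```python
import random

def max_load_one_choice(n):
    bins = [0] * n
    for _ in range(n):
        bins[random.randrange(n)] += 1
    return max(bins)

def max_load_two_choices(n):
    bins = [0] * n
    for _ in range(n):
        i, j = random.randrange(n), random.randrange(n)
        bins[i if bins[i] <= bins[j] else j] += 1   # place in the less loaded bin
    return max(bins)

n = 100_000
print("one choice :", max_load_one_choice(n))   # typically around log n / log log n
print("two choices:", max_load_two_choices(n))  # typically around log log n
```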
