Aim: how to design and analyze a variety of algorithms.
Graduate algorithms is largely about algorithms discovered since 1990. As CS became a more mature field, its emphasis and goals gradually shifted: it broadened its horizons and started looking at new problems, such as big data, e-commerce and bioinformatics.
As a result, many more practical concerns have to be taken into account.
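Suppose we want to store a subset $S$ of a huge universe $U$ (e.g., $|U| = 2^{32}$), with $|S| = m$. We want to support three operations on $S$: insert, delete, and query. A hash table is exactly what we need.
Define a hash function $h: U \rightarrow [n]$. We then need only $n$ slots, one for each hash value $0, \ldots, n-1$, and elements that collide in the same slot are chained together in a linked list.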
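A minimal sketch of such a chained hash table is given here. Python lists stand in for the linked lists, and the class name `ChainedHashTable` and the hash function `x % 8` are placeholders chosen for illustration, not part of the original notes.

```python
class ChainedHashTable:
    """Hash table with chaining: n slots, each holding the keys that collide there."""

    def __init__(self, n, h):
        self.n = n                            # number of slots
        self.h = h                            # hash function U -> {0, ..., n-1}
        self.slots = [[] for _ in range(n)]   # one chain per slot

    def insert(self, x):
        chain = self.slots[self.h(x)]
        if x not in chain:                    # S is a set, so ignore duplicates
            chain.append(x)

    def delete(self, x):
        chain = self.slots[self.h(x)]
        if x in chain:
            chain.remove(x)

    def query(self, x):
        return x in self.slots[self.h(x)]


table = ChainedHashTable(8, lambda x: x % 8)  # placeholder hash function
for key in (3, 11, 19, 42):                   # 3, 11 and 19 all land in slot 3
    table.insert(key)
table.delete(11)
print(table.query(19), table.query(11))       # True False
```

Every operation costs time proportional to the length of one chain, which is why the rest of the analysis focuses on chain lengths.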
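There are two possible assumptions: 1. the input is random; 2. the input is fixed and the hash function is random.
How do we define what it means for a function to be random?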
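Ideally, for the given $x_1, \ldots, x_m \in S$ and any $a_1, \ldots, a_m \in [n]$, a hash function $h$ drawn at random from the family $\mathcal{H}$ should satisfy:
$$\Pr_{h \in \mathcal{H}}\left[h(x_1) = a_1 \wedge \cdots \wedge h(x_m) = a_m\right] = \frac{1}{n^m}$$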
For any $x$, let $L_x$ be the length of the linked list containing $x$; then $L_x$ is just the number of elements with the same hash value as $x$. Let the random variable
$$I_y = \begin{cases} 1 & \text{if } h(y) = h(x) \\ 0 & \text{otherwise} \end{cases}$$
So $L_x = 1 + \sum_{y \neq x} I_y$, and
$$E\left[L_x\right] = 1 + \sum_{y \neq x} E\left[I_y\right] = 1 + \frac{m-1}{n}$$
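As a sanity check, the formula $1 + \frac{m-1}{n}$ can be compared with a small simulation. The sketch below assumes the idealized model in which every key gets an independent, uniformly random hash value; the function name and the parameters $m = 50$, $n = 20$ are arbitrary choices for illustration.

```python
import random

def average_chain_length(m, n, trials, seed=0):
    """Estimate E[L_x] when all m keys get independent uniform hash values in [n]."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hashes = [rng.randrange(n) for _ in range(m)]
        # L_x for x = key 0: number of keys (x included) sharing its hash value.
        total += sum(1 for v in hashes if v == hashes[0])
    return total / trials

m, n = 50, 20
estimate = average_chain_length(m, n, trials=10000)
predicted = 1 + (m - 1) / n
print(f"simulated {estimate:.3f}, predicted {predicted:.3f}")
```

The two numbers should agree up to sampling noise, which shrinks as `trials` grows.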
Note that we don’t need full independence to prove this property, and pairwise independence would actually suffice.
Define a family of functions $\mathcal{H}$ to be 2-universal if for any $x \neq y \in U$, $\Pr_{h \in \mathcal{H}}[h(x) = h(y)] \leq \frac{1}{n}$. (This is weaker than 2-independence.)
Construction: pick a prime $p \in [|U|, 2|U|]$ and let $f_{a,b}(x) = (ax + b) \bmod p$ with $a \neq 0$. The hash function is then $h_{a,b}(x) = f_{a,b}(x) \bmod n$.
Fix $x_1 \neq x_2$ and $s \neq t$ in $[p]$, and suppose $f_{a,b}(x_1) = s$ and $f_{a,b}(x_2) = t$. Since $[p]$ constitutes a finite field, we have that $a = \left(x_1 - x_2\right)^{-1}(s - t)$ and $b = s - a x_1$, so exactly one pair $(a, b)$ works, and $a \neq 0$ because $s \neq t$. Since we have $p(p-1)$ different hash functions in $\mathcal{H}$ in this case,
$$\operatorname{Pr}_{a, b}\left[f_{a,b}\left(x_1\right) = s \wedge f_{a,b}\left(x_2\right) = t\right] = \frac{1}{p(p-1)}$$
Theorem: $\mathcal{H} = \left\{h_{a,b} : a, b \in [p] \wedge a \neq 0\right\}$ is 2-universal.
Proof: For any $x_1 \neq x_2$,
$$\begin{aligned} & \operatorname{Pr}\left[h_{a,b}\left(x_1\right) = h_{a,b}\left(x_2\right)\right] \\ = & \sum_{s, t \in [p],\, s \neq t} \delta_{(s = t \bmod n)} \operatorname{Pr}\left[f_{a,b}\left(x_1\right) = s \wedge f_{a,b}\left(x_2\right) = t\right] \\ = & \frac{1}{p(p-1)} \sum_{s, t \in [p],\, s \neq t} \delta_{(s = t \bmod n)} \\ \leq & \frac{1}{p(p-1)} \cdot \frac{p(p-1)}{n} \\ = & \frac{1}{n} \end{aligned}$$
The inequality holds because, for each $s \in [p]$, at most $\lceil p/n \rceil - 1 \leq \frac{p-1}{n}$ values $t \neq s$ in $[p]$ satisfy $t \equiv s \pmod n$.
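The family $h_{a,b}(x) = ((ax + b) \bmod p) \bmod n$ is easy to implement, and for toy parameters the theorem can be checked by brute force. The sketch below is only an illustration: the helper names are made up, and $p = 13$, $n = 4$ (so $U = \{0, \ldots, 12\}$) are chosen small enough to enumerate every $(a, b)$.

```python
import random
from itertools import combinations

def make_hash(a, b, p, n):
    """h_{a,b}(x) = ((a*x + b) mod p) mod n."""
    return lambda x: ((a * x + b) % p) % n

def random_hash(p, n, rng=random):
    """Draw h_{a,b} uniformly from the family: a in {1..p-1}, b in {0..p-1}."""
    return make_hash(rng.randrange(1, p), rng.randrange(p), p, n)

def worst_collision_count(p, n):
    """Largest number of (a, b) pairs under which some x1 != x2 in [p] collide."""
    worst = 0
    for x1, x2 in combinations(range(p), 2):
        count = sum(
            1
            for a in range(1, p)
            for b in range(p)
            if ((a * x1 + b) % p) % n == ((a * x2 + b) % p) % n
        )
        worst = max(worst, count)
    return worst

p, n = 13, 4                       # toy values, small enough to enumerate
family_size = p * (p - 1)          # number of functions in the family
worst = worst_collision_count(p, n)
# 2-universality says worst / family_size <= 1 / n for every pair of keys.
print(worst, family_size, worst * n <= family_size)
```

In practice one would call `random_hash` once per table with a large prime `p`; the exhaustive check is only feasible for tiny parameters.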
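Note that if we choose $n$ large enough that $n \geq m^2$, then the expected number of collisions in the hash table is:
$E\left[\sum_{x_1 \neq x_2} \mathbf{1}[h(x_1) = h(x_2)]\right] \leq \binom{m}{2}\frac{1}{n} \leq \frac{1}{2}$, where the sum ranges over unordered pairs of distinct elements of $S$. By Markov's inequality, a random $h$ therefore has no collisions at all with probability at least $\frac{1}{2}$.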
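However, $n \geq m^2$ slots is a lot of space, and there is a better way: two-level hashing. Suppose $s_i$ elements hash to slot $i$ of the first-level table; for that slot we build a second-level hash table of size $s_i^2$, which can hold all $s_i$ colliding elements without further collisions.
Note that, for a first-level table with $n \geq m$ slots, $E(\sum_i s_i^2) = E(\sum_i s_i(s_i - 1)) + E(\sum_i s_i) \leq \frac{m(m-1)}{n} + m \leq 2m$, so the total space stays linear in $m$.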
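The two-level idea can be sketched for a static set of integer keys as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the prime `P = 2**31 - 1`, the function names, and the retry threshold of $4m$ (twice the $2m$ bound, so each attempt succeeds with probability at least $\frac{1}{2}$ by Markov's inequality) are choices made here. Keys are assumed to be distinct integers in $[0, P)$.

```python
import random

P = 2**31 - 1   # a prime larger than every key used below (assumption)

def draw_hash(n, rng):
    """Draw h_{a,b}(x) = ((a*x + b) mod P) mod n from the 2-universal family."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % n

def build_two_level(keys, rng):
    """Static two-level table: a bucket with s_i keys gets its own table of size s_i**2."""
    m = len(keys)
    while True:
        top = draw_hash(m, rng)
        buckets = [[] for _ in range(m)]
        for x in keys:
            buckets[top(x)].append(x)
        # Retry until the total second-level space is at most 4m.
        if sum(len(b) ** 2 for b in buckets) <= 4 * m:
            break
    second = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:                      # retry until this bucket is collision-free
            h = draw_hash(size, rng) if size else None
            cells = [None] * size
            ok = True
            for x in bucket:
                j = h(x)
                if cells[j] is not None:
                    ok = False
                    break
                cells[j] = x
            if ok:
                break
        second.append((h, cells))
    return top, second

def lookup(table, x):
    """Query in O(1): one first-level hash, one second-level hash, one comparison."""
    top, second = table
    h, cells = second[top(x)]
    return bool(cells) and cells[h(x)] == x

rng = random.Random(1)
keys = rng.sample(range(10**6), 1000)
table = build_two_level(keys, rng)
print(all(lookup(table, x) for x in keys))   # True
```

Both retry loops terminate quickly in expectation because each attempt succeeds with probability at least one half.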
In the load balancing problem, we can imagine throwing balls into bins. If we have $n$ balls and $n$ bins and place each ball in a uniformly random bin, then the probability that bin $i$ receives at least $k$ balls is $\leq \binom{n}{k}\frac{1}{n^k}$ (a union bound over the $\binom{n}{k}$ sets of $k$ balls that could all land in bin $i$) $\leq \frac{1}{k!}$.
By Stirling's formula, choosing $k = O(\frac{\log n}{\log \log n})$ gives $\frac{1}{k!} \leq \frac{1}{n^2}$. Hence, by a union bound over the $n$ bins, the probability that some bin has at least $k$ balls is $\leq \frac{1}{n}$.
Therefore, with probability $\geq 1 - \frac{1}{n}$, the maximum load is $O(\frac{\log n}{\log \log n})$.
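The bound can be observed empirically with a short simulation. The sketch below throws $n$ balls into $n$ bins uniformly at random and prints the maximum load next to $\frac{\ln n}{\ln \ln n}$; the function name and the values of $n$ are arbitrary, and since the $O(\cdot)$ hides a constant, only the slow growth is meaningful, not the exact numbers.

```python
import math
import random

def max_load_one_choice(n, seed=0):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)

for n in (10**3, 10**4, 10**5):
    reference = math.log(n) / math.log(math.log(n))
    print(n, max_load_one_choice(n), round(reference, 2))
```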
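Improvement: when a ball arrives, pick two bins at random and place the ball in the one that currently holds fewer balls. The maximum load is then $O(\log \log n)$ with high probability, which is a huge improvement!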
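A small simulation makes the effect of the second choice visible. In the sketch below, `choices=1` is plain random placement and `choices=2` is the two-choice rule; the function name, the tie-breaking rule (first sampled bin wins) and $n = 10^5$ are assumptions made for illustration.

```python
import random

def max_load(n, choices, seed=0):
    """Throw n balls into n bins; each ball goes to the least loaded of `choices` random bins."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(choices)]
        best = min(candidates, key=lambda i: loads[i])   # ties: first sampled bin
        loads[best] += 1
    return max(loads)

n = 10**5
print("one choice:", max_load(n, 1), "two choices:", max_load(n, 2))
```

For values of $n$ in this range the two-choice maximum load is typically noticeably smaller than the one-choice maximum load, in line with the $O(\log \log n)$ versus $O(\frac{\log n}{\log \log n})$ bounds.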