N-d Selection (MA.hashing 多属性散列)

Hashing for N-d Selection

For a pmr query like:
select * from R where A1 = C1 and A2 = C2 and ....An=Cn
if one is the hash key, query is O(1)
if no is the hash key, need to use linear scan
对于多属性的查询操作，只要有一个属性有散列索引，查询就是O（1）。

如果没有一个属性上有散列索引，那么我们可以依据所有的属性来构建一个的hash key --- Multi-attribute hashing

Multi-attribute hashing 多属性散列：

多属性散列key 的长度
use d-bit hash values （file size = b = pages）
长度为d的比特位表示一个散列key。
多属性散列由每个属性值的散列值的某些比特位组成。
组成规则遵循 check vector 。
e.g. CV = <(1,0),(2.0),(3,0),(1,1),(2,1),(3,1),(1,2),(2,2)>

check vector（从左往右拼接）

比如某条记录 ('John Smith','BSC(CompSci)',1990, 99.5)
d = 6
cv = <(1,0), (1,1), (2,0), (3,0), (1,2), (2,1), (3,1), (1,3), ...>
hash1('John Smith') = ...0101010110110100
hash2('BSc(CompSci)') = ...1011111101101111
hash3(1990) = ...0001001011000000
最终该条记录的散列值就是
第1属性的第0 个比特位+ 第1属性的第1个比特位+ 第2个属性的第0个比特位+....
所以 hash( ('John Smith','BSC(CompSci)',1990, 99.5)) = 110100

当部分属性没有被查询时,比如 select * from R where name = "Green"

image.png

这种情况，需要查询的key就变成了100, 101, 110 ,111 四种可能。
假设 (a，b，？) -->会被表示成类似0001*1*111 的形式，但是二进制没有*这个东西，所以引入 stars ，将0001*1*111 转化为 bit0001010111, stars0000101000, stars中的 1 位表示 bits 中应该为*的位置。

Multi-attribute hashing 多属性散列的复杂度

多属性散列的查询复杂度和该query 中没有涉及到的属性数量有关。
比如某个query 散列key为 01001*11*1*1, 存在3个*，那么就需要查次。
所以复杂度为, .

Multi-attribute hashing 多属性散列的优化

多属性散列的优化办法就是优化 check vector cv，尽可能给容易被查的属性在散列key中分配多一些的比特位。

Consider a table with four attributes:
表 b =, 所以散列key长度为 log(10000,2) = 14 bits.
(branch, account, name, amount) 简写为（br,ac,nm,amt）
以下是所有可能的Query types 和相应的复杂度和用户查询概率。

image.png

上图可以简化成

image.png

一共 14 个比特位
br 出现了 0.5 + 0.25 = 0.75次，分配 6 bits
nm 出现了 0.5 + 0.25 = 0.75次，分配 4 bits
amt 出现了0.5 次，分配 2 bits
ac 出现了 0.25 次，分配 2 bits

分配大小这个自己定，并没有什么严格的规则。

N-d Selection (MA.hashing 多属性散列)

Hashing for N-d Selection

Multi-attribute hashing 多属性散列 ：

Multi-attribute hashing 多属性散列的复杂度

Multi-attribute hashing 多属性散列的优化

你可能感兴趣的:(N-d Selection (MA.hashing 多属性散列))

Multi-attribute hashing 多属性散列：