golang - map

1. Underlying implementation

hmap

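In Go, a map value is a pointer (8 bytes on a 64-bit platform) to an underlying hmap struct, which is the hash table. This article describes the bucket-based implementation used before Go 1.24. The struct is defined in src/runtime/map.go as follows: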

// A header for a Go map.
type hmap struct {
    // Note: the format of the hmap is also encoded in cmd/compile/internal/reflectdata/reflect.go.
    // Make sure this stays in sync with the compiler's definition.
    count     int // # live cells == size of map.  Must be first (used by len() builtin). Number of elements in the hash table; len(map) returns this field.
    flags     uint8  // state flags (for example, whether a write is in progress)
    B         uint8  // log_2 of # of buckets (can hold up to loadFactor * 2^B items). With B=5 the buckets array has 2^5 = 32 buckets.
    noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details
    hash0     uint32 // hash seed, a random value mixed into every hash

    buckets    unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
    oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing. It is nil when the map is not growing.
    nevacuate  uintptr        // progress counter for evacuation (buckets less than this have been evacuated). Buckets with an index below this value have already been moved.

    extra *mapextra // optional fields
}

(Figure: map structure)
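As a small sketch of two observable consequences of this design: a map variable is one pointer wide, and passing a map to a function shares the underlying table instead of copying it. The helper names addEntry and mapIsPointerSized are made up for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// addEntry is a hypothetical helper. It receives a copy of the map value,
// which is just a pointer to the same underlying hash table.
func addEntry(m map[string]int, k string, v int) {
	m[k] = v
}

// mapIsPointerSized reports whether a map variable occupies exactly one pointer.
func mapIsPointerSized() bool {
	var m map[string]int
	return unsafe.Sizeof(m) == unsafe.Sizeof(uintptr(0))
}

func main() {
	m := make(map[string]int)
	addEntry(m, "a", 1)
	fmt.Println(len(m), m["a"]) // the caller sees the write made inside addEntry
	fmt.Println(mapIsPointerSized())
}
```

Because only the pointer is copied, there is no need to pass a *map to let a function modify the caller's map.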

bmap

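As noted above, hmap holds a buckets array. Each element of that array is a bucket, also called a bmap.
A bucket holds at most 8 key/value pairs. Keys land in the same bucket because the low B bits of their hashes (B is the field in hmap) are identical; key location is covered in detail under map lookup. Within a bucket, the high 8 bits of the key's hash are used to find the key's slot quickly (a bucket has at most 8 slots). The bmap struct is defined below. tophash is an array of length 8 (bucketCnt = 1 << bucketCntBits, bucketCntBits = 3) used for fast key matching: if a key's tophash value is absent from the array, the key is not in this bucket; if it is present, the key may be in this bucket, and a full key comparison confirms it.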

// A bucket for a Go map.
type bmap struct {
    // tophash generally contains the top byte of the hash value
    // for each key in this bucket. If tophash[0] < minTopHash,
    // tophash[0] is a bucket evacuation state instead.
    tophash [bucketCnt]uint8
    // Followed by bucketCnt keys and then bucketCnt elems.
    // NOTE: packing all the keys together and then all the elems together makes the
    // code a bit more complicated than alternating key/elem/key/elem/... but it allows
    // us to eliminate padding which would be needed for, e.g., map[int64]int8.
    // Followed by an overflow pointer.
}

The bmap above is only the static part. At compile time the compiler expands runtime.bmap into a struct equivalent to the following:

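type bmap struct{
    tophash [8]uint8
    keys [8]keytype
    // keytype is determined by the compiler
    values [8]elemtype
    // elemtype is determined by the compiler
    overflow uintptr
    // overflow points to the next bmap. When neither keytype nor elemtype contains pointers, overflow is a uintptr rather than a *bmap, so the bucket holds no pointers and the GC does not need to scan it; the overflow buckets are then kept alive through the extra field of hmap.
}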

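tophash is what makes locating a key fast. The high 8 bits of the key's hash are stored in the tophash field of the bmap. The same field also stores state values that describe a cell, all of which are smaller than minTopHash. To keep a real hash byte from being mistaken for a state value, any high byte smaller than minTopHash has minTopHash added to it before it is stored. The cell state values are: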

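emptyRest = 0 // this cell is empty, and so are all cells at higher indexes and in overflow buckets
emptyOne = 1 // this cell is empty
evacuatedX = 2 // key/elem is valid; the entry was evacuated to the first half of the larger table
evacuatedY = 3 // same as above, but evacuated to the second half of the larger table
evacuatedEmpty = 4 // cell is empty and the bucket has been evacuated
minTopHash = 5 // minimum tophash for a normal filled cell: smaller values are cell states, values greater than or equal to it are tophash values of keys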

The tophash function below computes a key's tophash value. If the high byte is smaller than minTopHash (5), minTopHash is added to it:

// tophash calculates the tophash value for hash.
func tophash(hash uintptr) uint8 {
    top := uint8(hash >> (goarch.PtrSize*8 - 8))
    if top < minTopHash {
        top += minTopHash
    }
    return top
}
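The runtime function cannot be called from user code, so the following is a standalone sketch of the same logic. It assumes a 64-bit hash value (goarch.PtrSize = 8); the name tophash64 is invented for this example.

```go
package main

import "fmt"

const minTopHash = 5 // same threshold as in the runtime listing

// tophash64 mirrors the runtime's tophash, assuming a 64-bit hash value.
func tophash64(hash uint64) uint8 {
	top := uint8(hash >> 56) // keep only the highest byte
	if top < minTopHash {
		top += minTopHash // move it out of the range reserved for cell states
	}
	return top
}

func main() {
	fmt.Println(tophash64(0x97 << 56)) // 151: a top byte of 0x97 is kept as is
	fmt.Println(tophash64(0x03 << 56)) // 8: a top byte of 3 would look like evacuatedY, so 5 is added
}
```

One consequence is that stored tophash values always fall in the range 5 to 255, and a few distinct top bytes share a stored value (for example 3 and 8), which is harmless because a tophash match is always followed by a full key comparison.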

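(Figure: bmap memory layout)

Note that keys and values are each stored together rather than interleaved as key/value/key/value/... When the key and value types differ in size, the interleaved form can waste space on alignment padding, so Go stores all keys first and then all values, which saves memory.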
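The saving can be made concrete with unsafe.Sizeof. The sketch below models both layouts for map[int64]int8, the case mentioned in the runtime comment; the type names interleaved and grouped are made up, and the exact sizes depend on the platform's alignment rules.

```go
package main

import (
	"fmt"
	"unsafe"
)

// interleaved models a key/value/key/value layout for map[int64]int8.
// Each pair is padded so that the next int64 key stays aligned.
type interleaved struct {
	pairs [8]struct {
		k int64
		v int8
	}
}

// grouped models the layout the runtime uses: all keys, then all values.
type grouped struct {
	keys [8]int64
	vals [8]int8
}

// layoutSizes returns the size in bytes of both layouts.
func layoutSizes() (interleavedSize, groupedSize uintptr) {
	return unsafe.Sizeof(interleaved{}), unsafe.Sizeof(grouped{})
}

func main() {
	i, g := layoutSizes()
	fmt.Printf("interleaved: %d bytes, grouped: %d bytes\n", i, g)
}
```

On a typical 64-bit platform the interleaved form takes 128 bytes against 72 for the grouped form, because every int8 value is followed by 7 bytes of padding.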

map lookup

The lookup flow of a map is as follows:

(Figure: map lookup flow)

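Write-protection check:
The function first checks the flags field of the underlying hmap. If the write bit is set, another goroutine is writing to the map, and the runtime aborts the program with a fatal error (it is raised by throw, so recover cannot catch it). This is why a map is not safe for concurrent use. The flag values are: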

// flags
iterator = 1 // there may be an iterator using buckets
oldIterator = 2 // there may be an iterator using oldbuckets
hashWriting = 4 // a goroutine is writing to the map
sameSizeGrow = 8 // the current map growth is to a new map of the same size

// If the map is currently being written to, abort: a map is not safe for concurrent use
if h.flags&hashWriting != 0 {
    throw("concurrent map read and map write")
}

Compute the hash:
Compute the hash value of the key

hash := t.hasher(key, uintptr(h.hash0))

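hasher has type func(unsafe.Pointer, uintptr) uintptr; it takes a pointer to the key and the hash seed of the hmap.
The hash function returns a hash value that is 64 bits wide on mainstream 64-bit machines. Different key types use different hash functions.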

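Find the bucket for the hash:
Bucket location: the low B bits of the key's hash select the bucket that stores the key.
If the map is growing and the corresponding old bucket has not been evacuated yet, the old bucket (in oldbuckets, the array from before the growth) is searched instead.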

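// m is the bucket mask, 1<<B - 1 (for B=5, buckets are numbered 0 to 31)
m := bucketMask(h.B)
// locate the bucket for this hash: bucket index = hash & m
b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))
// the map is growing if oldbuckets is non-nil
if c := h.oldbuckets; c != nil {
    if !h.sameSizeGrow() {
        // the old array had half as many buckets, so drop one bit from the mask
        m >>= 1
    }
    // locate the old bucket for this hash
    oldb := (*bmap)(add(c, (hash&m)*uintptr(t.bucketsize)))
    // if the old bucket has not been evacuated yet, search it instead
    if !evacuated(oldb) {
        b = oldb
    }
}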

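Scan the bucket:
The tophash function yields the high 8 bits of the hash, which are used to quickly rule out cells that cannot hold the key. If the key is not found in the current bucket, the search continues in the bucket's overflow buckets.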

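Example:
Suppose a key hashes to 1001011100001111011011001000111100101010001001011001010101000110 and the hmap has B=5, that is, 2^5 = 32 buckets (numbered 0 to 31).

First, the low B bits of the hash, here the last five bits, select the bucket: 00110 = bucket 6. Then the high 8 bits of the hash, 10010111 = 151, are passed through tophash and compared against the bucket's tophash array to find the key's index within the bucket.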
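The arithmetic of this example can be checked with a few lines of Go. This is a sketch that assumes a 64-bit hash; the function locate is hypothetical and only reproduces the two bit operations, not the runtime's lookup.

```go
package main

import "fmt"

// locate returns the bucket index (the low B bits) and the top byte of a 64-bit hash.
func locate(hash uint64, B uint8) (bucket uint64, top uint8) {
	mask := uint64(1)<<B - 1 // B=5 gives 0b11111
	return hash & mask, uint8(hash >> 56)
}

func main() {
	// The hash from the example, written one byte per group.
	const hash uint64 = 0b10010111_00001111_01101100_10001111_00101010_00100101_10010101_01000110

	bucket, top := locate(hash, 5)
	fmt.Println(bucket, top) // 6 151
}
```

Since 151 is not below minTopHash, tophash stores it unchanged.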

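How map key collisions are resolved

Go maps resolve collisions by separate chaining. When a key is inserted and its bucket already holds 8 entries, a new overflow bucket is created and appended to the tail of that bucket's linked list. The unit that gets chained is a whole bucket, not a single entry.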
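The following toy model sketches bucket chaining and the tophash-first comparison. It is deliberately simplified and is not the runtime's code: toyBucket uses fixed string/int types, treats a tophash of 0 as "empty", and assumes keys are never inserted twice.

```go
package main

import "fmt"

const bucketCnt = 8

// toyBucket is a simplified stand-in for bmap with string keys and int values.
type toyBucket struct {
	tophash  [bucketCnt]uint8 // 0 means "empty" in this toy model
	keys     [bucketCnt]string
	vals     [bucketCnt]int
	overflow *toyBucket
}

// insert stores key/val in the first free cell, chaining a new overflow
// bucket at the tail when every bucket in the chain is full.
// It assumes top is nonzero and key is not already present.
func (b *toyBucket) insert(top uint8, key string, val int) {
	for ; ; b = b.overflow {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] == 0 {
				b.tophash[i], b.keys[i], b.vals[i] = top, key, val
				return
			}
		}
		if b.overflow == nil {
			b.overflow = &toyBucket{}
		}
	}
}

// lookup compares the cheap tophash byte first and the full key only on a
// match, walking the overflow chain until the key is found or the chain ends.
func (b *toyBucket) lookup(top uint8, key string) (int, bool) {
	for ; b != nil; b = b.overflow {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] == top && b.keys[i] == key {
				return b.vals[i], true
			}
		}
	}
	return 0, false
}

// chainLen counts the buckets in the chain, including the first one.
func (b *toyBucket) chainLen() int {
	n := 0
	for ; b != nil; b = b.overflow {
		n++
	}
	return n
}

func main() {
	head := &toyBucket{}
	for i := 0; i < 9; i++ { // the 9th entry no longer fits in the first bucket
		head.insert(uint8(10+i), fmt.Sprintf("k%d", i), i)
	}
	v, ok := head.lookup(18, "k8")
	fmt.Println(v, ok, head.chainLen()) // 8 true 2
}
```

Long overflow chains turn a lookup into a linear scan, which is one reason the runtime grows the table rather than letting chains get long.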

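2. Unordered map iteration

Iterating over the same map with range several times can yield keys and values in different orders. There are two main reasons:

Iteration does not start from bucket 0. Each iteration starts at a randomly chosen bucket and, within it, at a randomly chosen cell.
Iteration visits buckets in order and, within each bucket and its overflow buckets, cells in order. After the map grows, however, keys are moved, so keys that used to share a bucket may end up in different buckets, and the iteration order can no longer be expected to match the earlier one.

To iterate over a map in order, sort its keys first and then visit the map in key order.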
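A minimal sketch of ordered iteration, using sort.Strings from the standard library; the helper name sortedKeys and the sample data are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m { // range order is unspecified, so collect first
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	m := map[string]int{"banana": 2, "apple": 1, "cherry": 3}
	for _, k := range sortedKeys(m) {
		fmt.Println(k, m[k]) // always apple, banana, cherry
	}
}
```

The extra slice costs O(n) memory and the sort O(n log n) time per iteration, so it is worth doing only where a stable order actually matters, for example in output that is compared or logged.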

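3. Thread safety

Maps are not safe for concurrent use by default. When a map is read and written concurrently, the write-protection check in the runtime can detect it and abort the program with a fatal error.
There are two ways to make map access concurrency-safe:

map + lock (sync.RWMutex)
The built-in sync.Map, which is safe for concurrent reads and writes. sync.Map trades space for time by keeping two redundant data structures, read and dirty: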

type Map struct {
   mu Mutex
   read atomic.Value // readOnly
   dirty map[interface{}]*entry
   misses int
}

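Compared with a plain map guarded by an RWMutex, this reduces the cost of locking. The read map can be accessed without a lock and is always tried first; if an operation can be satisfied by the read map alone, the dirty map is never touched. In such workloads lock contention is far less frequent than with map + RWMutex.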

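Advantages:

Well suited to read-heavy, write-light workloads

Disadvantages:

Under write-heavy workloads the read map keeps missing, so operations fall back to the locked dirty map, contention rises, and performance drops sharply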
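Both approaches can be sketched in a few lines. SafeMap and countConcurrently are hypothetical names for the map + sync.RWMutex variant; the sync.Map part uses only its Store and Load methods.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeMap is a hypothetical wrapper: a plain map guarded by a sync.RWMutex.
type SafeMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func NewSafeMap() *SafeMap { return &SafeMap{m: make(map[string]int)} }

// Load takes the read lock, so many readers can proceed at once.
func (s *SafeMap) Load(k string) (int, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// Add takes the write lock, which excludes all readers and writers.
func (s *SafeMap) Add(k string, delta int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] += delta
}

// countConcurrently increments one key from n goroutines and returns the final value.
func countConcurrently(n int) int {
	sm := NewSafeMap()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sm.Add("hits", 1)
		}()
	}
	wg.Wait()
	v, _ := sm.Load("hits")
	return v
}

func main() {
	fmt.Println(countConcurrently(100)) // 100

	// sync.Map needs no external lock; values come back as interface values.
	var cache sync.Map
	cache.Store("config", "v1")
	if v, ok := cache.Load("config"); ok {
		fmt.Println(v.(string)) // v1
	}
}
```

The wrapper keeps static typing and works well when writes are frequent, while sync.Map tends to pay off when keys are written once and read many times.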