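Image deduplication, filtering, and duplicate search for 100K to 10M images: a complete implementation with lookups under 10 ms, about 0.5 ms on average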

Image-repeat-image-filter

abstract

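Each image is described by a hash value (phash, whash, dhash, etc.), which is converted into a decimal hash index. The conversion works on the string representation and is reversible, so the original hash can be recovered. The hash indices form an image information base that is used both to query for duplicates and to insert new image records. Hash extraction and duplicate queries run on multiple threads, positions are found by binary search, and the base is built and queried in the same pass. As a result, duplicate lookup time is largely independent of the size of the information base: a query takes 0 to 10 ms, dominated by the resize step during hash extraction, and averages about 0.5 ms.
The storage format of the information base could be optimized later. It is currently a JSON file; records for 500,000 images take about 25 MB, and file-based storage is not recommended as the data grows.
keywords: hash, duplicate images, binary search, multithreading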

Algorithm outline

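1. Extract the hash value of each image using multiple threads;
2. Convert the hash into an index value 32 to 64 digits long and look up its position in the ordered information base; an equal index marks a duplicate image;
3. If there is no duplicate, insert the image's record into the information base at the position that keeps the indices in order;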
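Step 1 can be sketched with the standard library's concurrent.futures. This is only an illustration under stated assumptions: toy_hash is a hypothetical stand-in that hashes raw bytes with SHA-256 so the example runs without image libraries; it is not a perceptual hash and not the author's pHash.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def toy_hash(data: bytes) -> int:
    """Hypothetical stand-in for a perceptual hash: a deterministic integer per input."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def extract_hashes(blobs, max_workers=4):
    """Hash every blob on a thread pool; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(toy_hash, blobs))


# Placeholder inputs; in the real pipeline these would be image files.
blobs = [b"image-a", b"image-b", b"image-a"]
hashes = extract_hashes(blobs)
print(hashes[0] == hashes[2])  # identical inputs give identical keys
```

Because pool.map preserves input order, each hash can be paired with its image path by position before the lookup in step 2.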
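  • Requirements:
cv2 (the opencv-python package)
numpy (used by the phash code below)
PIL (the Pillow package)
concurrent (concurrent.futures, standard library)
imagehash (optional; the phash below can be used on its own)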
  • Information base format
image_infobase.json
{"info_base": [[hash_index, image_path], ...]}
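The size of the information base affects how long the file takes to read and write, but has almost no effect on lookup time. The detailed format is: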
{
	"info_base": 
		[
			[3010302070400050200020702050207030301, "E:/2019/image_rm-dui/train250_silimar/376-19_50325220.jpg"], 
			[3010302070400050200020702050207030301, "E:/2019/image_rm-dui/train250_silimar/377-19_50393399.jpg"], 
			[5040305050303020703050105020602060600, "E:/2019/image_rm-dui/train250_silimar/952-69_50475587.jpg"], 
			[106000103040107020705050602010605020001, "E:/2019/image_rm-dui/train250_silimar/800-4_47983604.jpg"], 
			[106020300060505070507050101030504010701, "E:/2019/image_rm-dui/train250_silimar/30-53_48706252.jpg"], 
			[106020300060505070507050101030504010701, "E:/2019/image_rm-dui/train250_silimar/31-53_49151624.jpg"], 
			[107040005020001010304020606070106020301, "E:/2019/image_rm-dui/train250_silimar/1228-22_50197080.jpg"], 
			[107040005020001010304020606070106020301, "E:/2019/image_rm-dui/train250_silimar/1229-22_50248780.jpg"], 
			[202070502070004060200010603070007050400, "E:/2019/image_rm-dui/train250_silimar/1920-40_49714099.jpg"]
		]
}
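To make the format concrete, here is a small sketch that parses such a file's contents, checks that the entries are ordered by hash index, and groups paths that share an index. The helper names and the short paths are placeholders chosen for illustration; only the "info_base" layout comes from the format above.

```python
import json

# A shortened information base with placeholder paths.
raw = (
    '{"info_base": ['
    '[3010302070400050200020702050207030301, "a.jpg"], '
    '[3010302070400050200020702050207030301, "b.jpg"], '
    '[5040305050303020703050105020602060600, "c.jpg"]]}'
)


def load_info_base(text):
    """Parse the JSON text and return the list of [hash_index, image_path] pairs."""
    return json.loads(text)["info_base"]


def is_sorted(entries):
    """True when the hash indices are in non-decreasing order."""
    keys = [entry[0] for entry in entries]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def duplicate_groups(entries):
    """Map each hash index that occurs more than once to its list of paths."""
    groups = {}
    for key, path in entries:
        groups.setdefault(key, []).append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}


entries = load_info_base(raw)
print(is_sorted(entries), duplicate_groups(entries))
```

Python's json module reads and writes these long integers exactly. JSON parsers that store numbers as 64-bit floats would lose digits, so storing the index as a string may be safer if the file is shared with other tools.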

Code walkthrough

1. Extracting information from the image to be checked
Compute the image's phash. A hand-written phash is used here (a library such as imagehash would also work); it samples different regions from the different RGB channels, as in the following code:

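import cv2
import numpy as np

def pHash(img, read=False):
    # img is a file path when read=True; otherwise an image array, presumably already resized to 32x32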
    if read:
        # img = io.imread(img)
        # img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.imread(img)
        img = cv2.resize(img, (32, 32), interpolation=cv2.INTER_CUBIC)
    vis = np.zeros_like(img)
    if len(img.shape)==3:
        vis[..., 0] = cv2.dct(cv2.dct(np.float32(img[..., 0])))
        vis[..., 1] = cv2.dct(cv2.dct(np.float32(img[..., 1])))
        vis[..., 2] = cv2.dct(cv2.dct(np.float32(img[..., 2])))
        vis1 = (vis[6:14, 6:14, 0] + vis[6:14, 18:26, 1] + vis[18:26, 6:14, 2] + vis[18:26, 18:26, 0] + vis[12:20, 12:20, 0] + vis[12:20, 12:20, 1] + vis[12:20, 12:20, 2])/7.0
    else:
        # Grayscale input is a single 2-D plane, so index it without a channel axis
        vis[...] = cv2.dct(cv2.dct(np.float32(img)))
        vis1 = (vis[6:14, 6:14] + vis[6:14, 18:26] + vis[18:26, 6:14] + vis[18:26, 18:26] + vis[12:20, 12:20]) / 5.0
    img_list = vis1.reshape(-1).tolist()

    avg = np.mean(img_list)
    avg_list = ['0' if i < avg else '1' for i in img_list]

    phash = ['%d' % int("".join(avg_list[x:x + 3]), 2) for x in range(0, 8 * 8, 3)]
    int_hash = ''
    for i in phash:
        if int(i) < 10:
            # Single-digit value: pad with a leading zero to make two digits
            int_hash += '0' + str(i)
        else:
            int_hash += str(i)
    int_hash = int(int_hash)
    return int_hash

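The extracted hash is first converted bit by bit into a string of binary characters. The bits are then grouped, each group is converted to a decimal number, and single-digit numbers are zero-padded to two digits to form the hash index sequence. In the example below the values go up to 15, which corresponds to 4-bit groups, whereas the code above groups by 3 bits: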

['12', '13', '12', '4', '1', '15', '12', '5', '14', '7', '6', '5', '2', '0', '4', '5']
--> 12131204011512051407060502000405
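The conversion and its inverse can be sketched as follows. The function names and the n_groups parameter are hypothetical; n_groups is needed because int() drops a leading zero, so the decoder has to know how many two-digit fields to expect.

```python
def encode_groups(groups):
    """Concatenate group values as two-digit decimal fields and return one integer."""
    return int("".join("%02d" % g for g in groups))


def decode_index(index, n_groups):
    """Invert encode_groups; zfill restores leading zeros that int() dropped."""
    digits = str(index).zfill(2 * n_groups)
    return [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]


groups = [12, 13, 12, 4, 1, 15, 12, 5, 14, 7, 6, 5, 2, 0, 4, 5]
index = encode_groups(groups)
print(index)  # 12131204011512051407060502000405
print(decode_index(index, len(groups)) == groups)  # True
```

Two-digit fields work for any group value below 100, so the same scheme covers both 3-bit groups (0 to 7) and 4-bit groups (0 to 15).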

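2. Finding the image's position in the information base
The position of the image's hash index in the information base is found by binary search, which locates target in List (the information base).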

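def find_Sorted_Position(List, target):
    """
    Binary search; lookups in lists of up to 100,000,000 entries take under 1 ms.
    :param List: the sorted list to search
    :param target: target number
    :return: a position at which target can be inserted while keeping List sorted;
             if target is already present, this is the index of an equal entry
             or the index immediately after one
    """
    # An empty list has no first element; the only valid position is 0
    if not List or target < List[0]:
        return 0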
    elif target > List[-1]:
        return len(List)
    else:
        low = 0
        high = len(List) - 1
        while low <= high:
            mid = (high + low) // 2
            if high - low <= 1:
                return mid + 1
            elif target == List[mid]:
                return mid
            elif target < List[mid]:
                high = mid
            else:
                low = mid
        return low + 1
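For comparison, the same lookup-then-insert step can be written with the standard library's bisect module. This is an alternative sketch, not the author's function: lookup_or_insert and the two parallel lists keys and paths are names chosen for illustration.

```python
import bisect


def lookup_or_insert(keys, paths, key, path):
    """Return the stored path of an equal key, or insert the new entry and return None."""
    pos = bisect.bisect_left(keys, key)  # leftmost position where key could go
    if pos < len(keys) and keys[pos] == key:
        return paths[pos]  # duplicate found
    keys.insert(pos, key)  # keep both lists ordered by key
    paths.insert(pos, path)
    return None


keys, paths, hits = [], [], []
for key, path in [(30, "a.jpg"), (10, "b.jpg"), (30, "c.jpg"), (20, "d.jpg")]:
    hit = lookup_or_insert(keys, paths, key, path)
    if hit is not None:
        hits.append((path, hit))

print(keys)  # [10, 20, 30]
print(hits)  # [('c.jpg', 'a.jpg')]
```

bisect_left always returns the leftmost matching position, so a single comparison at pos is enough to detect a duplicate. The search itself is O(log n); list.insert still shifts the elements after pos.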

3. Updating the image information base

    def info_update(self, update=False):
        import json
        import time
        # tm_min is the minute field of the local time
        tm_min = time.localtime().tm_min
        # Write the information base when the current minute is 0 or 31 (about every 30 minutes), or when update is set
        if tm_min % 31 == 0 or update:
            time_start = time.time()
            with open(self.infobase, 'w') as write_f:
                img_info = {"info_base": self.img_dict}
                json.dump(img_info, write_f)
            print("Time to write/update the image information base: {:0.4f} s".format(time.time() - time_start))
            # Append the duplicate pairs found so far, one pair per line
            with open("similar_imgs.txt", 'a') as file_similar:
                for similar_img in self.similar:
                    file_similar.write(similar_img[0] + ' ' + similar_img[1] + '\n')
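One possible variation on the periodic write, shown only as a sketch: track elapsed time with time.monotonic instead of checking the wall-clock minute, and write to a temporary file that is then renamed over the target, so a crash in the middle of a write does not leave a truncated JSON file. The class PeriodicWriter and its parameters are hypothetical and not part of the author's code.

```python
import json
import os
import tempfile
import time


class PeriodicWriter:
    """Write the information base at most once per interval, via temp file plus rename."""

    def __init__(self, path, interval_s=1800.0):
        self.path = path
        self.interval_s = interval_s
        self._last = time.monotonic()

    def maybe_write(self, entries, force=False):
        now = time.monotonic()
        if not force and now - self._last < self.interval_s:
            return False  # interval not elapsed, nothing written
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump({"info_base": entries}, f)
        os.replace(tmp_path, self.path)  # swap the finished file into place
        self._last = now
        return True


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "image_infobase.json")
    writer = PeriodicWriter(target)
    entries = [[10, "a.jpg"], [20, "b.jpg"]]
    skipped = writer.maybe_write(entries)  # interval not elapsed yet
    written = writer.maybe_write(entries, force=True)
    with open(target) as f:
        restored = json.load(f)["info_base"]

print(skipped, written, restored == entries)
```

The force flag plays the same role as update=True in info_update.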
  • Local test results
Time to read the image information base: 0.6860 s
Images currently in the information base:  436776
10-57_50157089.jpg is similar to E:/2019/image_rm-dui/train0/57_50157089.jpg !
100-57_49992915.jpg is similar to E:/2019/image_rm-dui/train0/57_49992915.jpg !
0-69_50002503.jpg is similar to E:/2019/image_rm-dui/train0/69_50002503.jpg !
1-69_50362572.jpg is similar to E:/2019/image_rm-dui/train0/69_50002503.jpg !
1002-69_49593728.jpg is similar to E:/2019/image_rm-dui/train0/69_49593728.jpg !
1000-69_49924978.jpg is similar to E:/2019/image_rm-dui/train0/69_49924978.jpg !
1001-69_49926153.jpg is similar to E:/2019/image_rm-dui/train0/69_49924978.jpg !
1003-69_49614853.jpg is similar to E:/2019/image_rm-dui/train0/69_49593728.jpg !
Average duplicate lookup time: 0.2556 ms
Time to write/update the image information base: 1.3470 s
  • If you find this useful, give it a like or a star on GitHub. Questions are welcome in the comments. The complete demo is on GitHub:
https://github.com/tao-ht/Image-repeat-image-filter

Copyright hengtao tao. All Rights Reserved.
