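Extract a hash that describes each image (phash, whash, dhash, etc.) and convert it into a decimal hash index. The conversion is string-based and reversible, so the original hash can be recovered. The image info base is made up of these hash indices and serves both duplicate lookup and the insertion of new image information. Hash extraction and duplicate lookup run in multiple threads, positions are found by binary search, and the info base is built and queried in the same pass, so duplicate lookup time is essentially independent of the size of the info base. Query time stays within 0~10 ms (it is affected by the resize step during information extraction) and averages about 0.5 ms.
The storage format of the info base could be optimized later. It is currently stored as a JSON file; the information for 500,000 images takes about 25 MB, and file-based storage is not recommended as the data volume grows.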
keywords: Hash, duplicate images, binary search, multithreading
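1. Extract the image hashes using multiple threads;
2. Convert each hash into a 32~64-digit index value and look up its position in the ordered info base; an identical index means a duplicate image;
3. If there is no duplicate, insert the extracted image information into the info base at its sorted position;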
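The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the repository's code: `toy_hash` is a hypothetical stand-in for the real image hash (it hashes the file name so the example runs without image files), and `find_or_insert` and `deduplicate` are assumed helper names.

```python
import bisect
import zlib
from concurrent.futures import ThreadPoolExecutor


def toy_hash(path):
    """Hypothetical stand-in for an image hash: derived from the file name only."""
    name = path.rsplit("/", 1)[-1]
    return zlib.crc32(name.encode("utf-8"))


def find_or_insert(indices, paths, index, path):
    """Return the stored path if index is already present, else insert and return None."""
    pos = bisect.bisect_left(indices, index)
    if pos < len(indices) and indices[pos] == index:
        return paths[pos]
    indices.insert(pos, index)
    paths.insert(pos, path)
    return None


def deduplicate(image_paths, workers=4):
    indices, paths, duplicates = [], [], []
    # Step 1: extract hashes in parallel; map() preserves the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(toy_hash, image_paths))
    # Steps 2 and 3: look up each index, record duplicates, insert new entries.
    for path, index in zip(image_paths, hashes):
        original = find_or_insert(indices, paths, index, path)
        if original is not None:
            duplicates.append((path, original))
    return indices, paths, duplicates
```

Keeping the indices and paths in two parallel sorted lists lets the lookup compare plain integers; the real project stores them as `[hash_index, image_path]` pairs.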
cv2
PIL
concurrent
imagehash (optional; the custom pHash in this project can be used instead)
image_infobase.json
{"info_base": [[hash_index, image_path], ...]}
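The size of the info base affects read and write speed but has essentially no effect on lookup time. The detailed format of the info base is as follows: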
{
"info_base":
[
[3010302070400050200020702050207030301, "E:/2019/image_rm-dui/train250_silimar/376-19_50325220.jpg"],
[3010302070400050200020702050207030301, "E:/2019/image_rm-dui/train250_silimar/377-19_50393399.jpg"],
[5040305050303020703050105020602060600, "E:/2019/image_rm-dui/train250_silimar/952-69_50475587.jpg"],
[106000103040107020705050602010605020001, "E:/2019/image_rm-dui/train250_silimar/800-4_47983604.jpg"],
[106020300060505070507050101030504010701, "E:/2019/image_rm-dui/train250_silimar/30-53_48706252.jpg"],
[106020300060505070507050101030504010701, "E:/2019/image_rm-dui/train250_silimar/31-53_49151624.jpg"],
[107040005020001010304020606070106020301, "E:/2019/image_rm-dui/train250_silimar/1228-22_50197080.jpg"],
[107040005020001010304020606070106020301, "E:/2019/image_rm-dui/train250_silimar/1229-22_50248780.jpg"],
[202070502070004060200010603070007050400, "E:/2019/image_rm-dui/train250_silimar/1920-40_49714099.jpg"]
]
}
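Reading a file in this format needs only the standard library. The sketch below is one assumed way to load it (the helper name `load_info_base` is hypothetical): it parses the JSON text, splits the pairs into two parallel lists, and checks that the indices are in non-decreasing order, which binary search relies on. Python's `json` module parses these long integers exactly.

```python
import json


def load_info_base(text):
    """Parse info-base JSON text into parallel lists of indices and paths."""
    entries = json.loads(text)["info_base"]
    indices = [entry[0] for entry in entries]
    paths = [entry[1] for entry in entries]
    # Binary search is only valid on a sorted list, so fail early otherwise.
    if any(a > b for a, b in zip(indices, indices[1:])):
        raise ValueError("info_base is not sorted by hash index")
    return indices, paths


# Shortened, made-up paths; the indices are taken from the sample above.
sample = ('{"info_base": ['
          '[3010302070400050200020702050207030301, "a.jpg"], '
          '[5040305050303020703050105020602060600, "b.jpg"]]}')
indices, paths = load_info_base(sample)
```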
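1. Extract the information of the image to be checked
Extract the pHash of the image. A custom pHash is used here (a library such as imagehash would also work); it samples regions at different positions from the image's different colour channels, as in the code below: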
import cv2
import numpy as np


def pHash(img, read=False):
    """Return a decimal hash index for an image array (or for a path if read=True)."""
    if read:
        img = cv2.imread(img)
    img = cv2.resize(img, (32, 32), interpolation=cv2.INTER_CUBIC)
    # NOTE: vis inherits the image dtype (uint8 for images loaded by cv2.imread),
    # so the DCT coefficients are cast on assignment. This is left unchanged here
    # so that the computed indices do not change.
    vis = np.zeros_like(img)
    if len(img.shape) == 3:
        # Apply the DCT twice to each colour channel
        for c in range(3):
            vis[..., c] = cv2.dct(cv2.dct(np.float32(img[..., c])))
        # Average seven 8x8 blocks taken from different positions and channels
        vis1 = (vis[6:14, 6:14, 0] + vis[6:14, 18:26, 1] + vis[18:26, 6:14, 2]
                + vis[18:26, 18:26, 0] + vis[12:20, 12:20, 0]
                + vis[12:20, 12:20, 1] + vis[12:20, 12:20, 2]) / 7.0
    else:
        # A grayscale input is 2-D, so it is indexed without a channel axis
        vis[...] = cv2.dct(cv2.dct(np.float32(img)))
        vis1 = (vis[6:14, 6:14] + vis[6:14, 18:26] + vis[18:26, 6:14]
                + vis[18:26, 18:26] + vis[12:20, 12:20]) / 5.0
    img_list = vis1.reshape(-1).tolist()
    avg = np.mean(img_list)
    avg_list = ['0' if i < avg else '1' for i in img_list]
    # Take the 64 bits three at a time and read each group as a binary number
    phash = ['%d' % int("".join(avg_list[x:x + 3]), 2) for x in range(0, 8 * 8, 3)]
    int_hash = ''
    for i in phash:
        if int(i) < 10:
            # Pad single-digit values with a leading zero
            int_hash += '0' + str(i)
        else:
            int_hash += str(i)
    int_hash = int(int_hash)
    return int_hash
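After extraction, the hash is first converted bit by bit into a binary string. The bit groups are then converted, as strings, into two-digit decimal numbers (single-digit values are padded with a leading 0) and concatenated into the hash index, for example: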
['12', '13', '12', '4', '1', '15', '12', '5', '14', '7', '6', '5', '2', '0', '4', '5']
--> 12131204011512051407060502000405
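Because every group occupies exactly two decimal digits, the conversion can be reversed as long as the number of groups is known (leading zeros are lost when the string becomes an integer). A minimal sketch, with hypothetical helper names `encode_index` and `decode_index`:

```python
def encode_index(groups):
    """Join integer groups (each 0..99) into one decimal index, two digits per group."""
    return int("".join("%02d" % g for g in groups))


def decode_index(index, n_groups):
    """Recover the groups; n_groups restores any leading zeros dropped by int()."""
    digits = str(index).zfill(2 * n_groups)
    return [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]


groups = [12, 13, 12, 4, 1, 15, 12, 5, 14, 7, 6, 5, 2, 0, 4, 5]
index = encode_index(groups)  # 12131204011512051407060502000405
assert decode_index(index, len(groups)) == groups
```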
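2. Find the image's position in the info base
Look up the position of the image's hash index in the info base, using binary search to find the position of target in List (the info base):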
def find_Sorted_Position(List, target):
    """
    Binary search; on lists of up to 100,000,000 entries a lookup takes under 1 ms.
    :param List: the sorted list to search
    :param target: target number
    :return: leftmost position at which target can be inserted to keep List
             sorted; target is already present if List[position] == target
    """
    low = 0
    high = len(List)
    while low < high:
        mid = (high + low) // 2
        if List[mid] < target:
            # target belongs to the right of mid
            low = mid + 1
        else:
            # target belongs at mid or to its left
            high = mid
    return low
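To see why lookup time barely depends on the size of the info base, it helps to count how many elements a binary search inspects. The sketch below is illustrative only: `count_probes` is a hypothetical helper and the data are synthetic even numbers, not real hash indices. It also checks the returned position against the standard library's `bisect.bisect_left`.

```python
import bisect


def count_probes(sorted_list, target):
    """Lower-bound binary search that also reports how many elements it inspected."""
    low, high, probes = 0, len(sorted_list), 0
    while low < high:
        mid = (low + high) // 2
        probes += 1
        if sorted_list[mid] < target:
            low = mid + 1
        else:
            high = mid
    return low, probes


for size in (1_000, 500_000):
    data = list(range(0, 2 * size, 2))      # synthetic sorted "indices"
    pos, probes = count_probes(data, size)  # a value inside the range
    assert pos == bisect.bisect_left(data, size)
    print(size, probes)
```

With 1,000 entries the loop inspects at most 10 elements, and with 500,000 entries at most 19, because the count grows with the base-2 logarithm of the list size.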
3. Update the image info base
import json
import time


def info_update(self, update=False):
    # Index 4 of the struct_time is tm_min (index 5 is tm_sec)
    tm_min = time.localtime(time.time())[4]
    # Update the image info base every 30 min
    if tm_min % 30 == 0 or update:
        time_start = time.time()
        with open(self.infobase, 'w') as write_f:
            img_info = {"info_base": self.img_dict}
            json.dump(img_info, write_f)
        print("Time to write/update the image info base: {:0.4f} s".format(time.time() - time_start))
        # Append the duplicate pairs found so far to a text file
        with open("similar_imgs.txt", 'a') as file_similar:
            for similar_img in self.similar:
                file_similar.write(similar_img[0] + ' ' + similar_img[1] + '\n')
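A check against the wall clock can fire repeatedly while the clock value matches, and not at all if no call happens to land on it. One alternative, shown purely as a sketch and not as the repository's behaviour, is to remember when the last write happened and compare the elapsed time. The class name `InfoBaseWriter` and its parameters are hypothetical.

```python
import json
import time


class InfoBaseWriter:
    """Writes the info base at most once per `interval` seconds (hypothetical helper)."""

    def __init__(self, path, interval=30 * 60):
        self.path = path
        self.interval = interval
        # Start the timer now, so the first periodic write comes one interval later.
        self._last_write = time.monotonic()

    def maybe_write(self, entries, force=False):
        """Write entries if the interval has elapsed or force is set; report whether it wrote."""
        now = time.monotonic()
        if not force and now - self._last_write < self.interval:
            return False
        with open(self.path, "w") as f:
            json.dump({"info_base": entries}, f)
        self._last_write = now
        return True
```

`time.monotonic()` is used instead of `time.time()` so that adjustments to the system clock do not trigger or suppress a write.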
Time to load the image info base dictionary: 0.6860 s
Number of images currently in the info base: 436776
10-57_50157089.jpg is similar to E:/2019/image_rm-dui/train0/57_50157089.jpg !
100-57_49992915.jpg is similar to E:/2019/image_rm-dui/train0/57_49992915.jpg !
0-69_50002503.jpg is similar to E:/2019/image_rm-dui/train0/69_50002503.jpg !
1-69_50362572.jpg is similar to E:/2019/image_rm-dui/train0/69_50002503.jpg !
1002-69_49593728.jpg is similar to E:/2019/image_rm-dui/train0/69_49593728.jpg !
1000-69_49924978.jpg is similar to E:/2019/image_rm-dui/train0/69_49924978.jpg !
1001-69_49926153.jpg is similar to E:/2019/image_rm-dui/train0/69_49924978.jpg !
1003-69_49614853.jpg is similar to E:/2019/image_rm-dui/train0/69_49593728.jpg !
Average duplicate lookup time: 0.2556 ms
Time to write/update the image info base: 1.3470 s
https://github.com/tao-ht/Image-repeat-image-filter
Copyright hengtao tao. All Rights Reserved.