Searching for a string in a large text file - profiling various methods in Python


I have a 600 MB file with 6 million lines of strings (Category paths from DMOZ project).

The entry on each line is unique.

I want to load the file once and then keep searching for matches in the data.


1) set :

(i) data = set(

(ii) result = search_str in data

Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB
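For illustration, here is a minimal sketch of the set approach. The helper name `load_lines_as_set`, the sample file, and its three lines are made up for the example and are not from my actual setup; it only shows why the lookup itself is effectively instant once the load has been paid for.

```python
import os
import tempfile

def load_lines_as_set(path):
    """Read every line into a set, dropping the trailing newline characters."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\r\n") for line in f}

# Tiny stand-in file (hypothetical data); the real input would be the 600 MB dump.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Top/Arts/Music\nTop/Science/Physics\nTop/Sports/Tennis\n")
    sample_path = tmp.name

data = load_lines_as_set(sample_path)

# Membership in a set is a hash lookup, so it does not scan the other entries.
found = "Top/Science/Physics" in data
missing = "Top/Science" in data  # only whole lines match, not prefixes

os.remove(sample_path)
```

The cost is memory: every line becomes its own Python string object plus a slot in the hash table, which is consistent with the in-memory size being roughly double the file size.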

2) list :

(i) data =

(ii) result = search_str in data

Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB

3) mmap :

(i) data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

(ii) result = data.find(search_str)

Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA
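One thing to watch with `find` on an mmap is that it matches substrings anywhere in the file, not whole lines. A rough sketch of a whole-line check follows; the helper `line_in_mmap` and the sample data are assumptions for the example, and it is still a linear scan, so it would not be faster than the timing above.

```python
import mmap
import os
import tempfile

def line_in_mmap(mm, needle):
    """Return True if `needle` (bytes, without newline) is a complete line in `mm`."""
    start = 0
    while True:
        pos = mm.find(needle, start)
        if pos == -1:
            return False
        end = pos + len(needle)
        # The match must begin at a line start and stop at a line end.
        at_line_start = pos == 0 or mm[pos - 1:pos] == b"\n"
        at_line_end = end == len(mm) or mm[end:end + 1] in (b"\n", b"\r")
        if at_line_start and at_line_end:
            return True
        start = pos + 1  # keep scanning after a partial match

# Tiny stand-in file (hypothetical data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"Top/Arts/Music\nTop/Science/Physics\n")
    sample_path = tmp.name

with open(sample_path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    whole_line = line_in_mmap(mm, b"Top/Science/Physics")
    partial = line_in_mmap(mm, b"Top/Science")
    mm.close()

os.remove(sample_path)
```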

4) Hash lookup (using code from @alienhard below):

Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB
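The code from @alienhard is not reproduced here. The following is only my own rough sketch of the general idea of a hash lookup, not that code: keep a small hash per line plus its byte offset in memory, and read the line back from disk to confirm a match. The function names, the use of `zlib.crc32`, and the sample data are all assumptions made for the example.

```python
import os
import tempfile
import zlib

def build_offset_index(path):
    """Map crc32(line) -> list of byte offsets where such a line starts."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            key = zlib.crc32(line.rstrip(b"\r\n"))
            index.setdefault(key, []).append(offset)
            offset += len(line)  # raw byte length, including the newline
    return index

def contains(path, index, needle):
    """Check the index first, then verify on disk to rule out hash collisions."""
    with open(path, "rb") as f:
        for offset in index.get(zlib.crc32(needle), ()):
            f.seek(offset)
            if f.readline().rstrip(b"\r\n") == needle:
                return True
    return False

# Tiny stand-in file (hypothetical data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"Top/Arts/Music\nTop/Science/Physics\n")
    sample_path = tmp.name

index = build_offset_index(sample_path)
hit = contains(sample_path, index, b"Top/Science/Physics")
miss = contains(sample_path, index, b"Top/Science")

os.remove(sample_path)
```

Only integers are held in memory instead of the full strings, which is the kind of trade-off that would explain a slower load with a much smaller footprint.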

5) File search (using code from @EOL below):

with open('input.txt') as f:
    print(search_str in f)  # search_str must end with '\n' or '\r\n', as in the file

Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA

6) sqlite (with primary index on url):

Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
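A minimal sketch of the sqlite variant using the standard `sqlite3` module. The table name `paths` and the sample rows are made up for the example, and an in-memory database stands in for what would be a database file on disk; the `url` column as primary key follows the setup described above.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a real run would pass a file path.
conn = sqlite3.connect(":memory:")

# The PRIMARY KEY creates the index that makes exact-match lookups fast.
conn.execute("CREATE TABLE paths (url TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO paths (url) VALUES (?)",
    [("Top/Arts/Music",), ("Top/Science/Physics",), ("Top/Sports/Tennis",)],
)
conn.commit()

def contains(conn, search_str):
    """Return True if search_str is stored as a url (exact match via the index)."""
    row = conn.execute(
        "SELECT 1 FROM paths WHERE url = ? LIMIT 1", (search_str,)
    ).fetchone()
    return row is not None

hit = contains(conn, "Top/Science/Physics")
miss = contains(conn, "Top/Science")
conn.close()
```

The one-time import into the database is not counted in the load time above; after that, opening the database file is close to free and each lookup goes through the index.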


Is there a better alternative, e.g. sqlite?

Are there ways to improve the search time using mmap? I have a 64-bit setup.

[edit] e.g. bloom filters

As the file size grows to a couple of GB, is there any way I can keep using 'set', e.g. by splitting it into batches?
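To make the bloom filter idea from the questions above concrete, here is a small sketch. The class, its parameters, and the sample strings are assumptions for the example, not tuned values. A bloom filter can say "definitely not present" or "possibly present", so a positive answer would still need confirming against the file or a database.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with num_hashes probe positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the probe positions from two 64-bit halves of one digest
        # (double hashing), instead of computing num_hashes separate hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for line in ("Top/Arts/Music", "Top/Science/Physics"):
    bf.add(line)

possibly_present = "Top/Science/Physics" in bf
```

The memory use depends on the number of bits chosen, not on the length of the strings, which is what makes it attractive once the file no longer fits in a set.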





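Solution: Based on all the profiling and feedback, I think I'll go with sqlite. The second choice is method 4. One downside of sqlite is that the database is more than twice the size of the original CSV file with the URLs. This is due to the primary index on url.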
