This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously…
I have a 600 MB file with 6 million lines of strings (category paths from the DMOZ project).
The entry on each line is unique.
I want to load the file once and keep searching for matches in the data.
The approaches I tried are listed below with the time taken to load the file, the search time for a negative match, and the memory usage as reported in Task Manager.
1) set :
(i) data = set(f.read().splitlines())
(ii) result = search_str in data
Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB
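A minimal, self-contained sketch of the set approach, in case anyone wants to reproduce the timings (file name and search string are placeholders):

    import time

    start = time.time()
    with open('input.txt') as f:                 # placeholder file name
        data = set(f.read().splitlines())        # one entry per line, newlines stripped
    print('Load time: %.1fs' % (time.time() - start))

    search_str = 'Top/Some/Nonexistent/Path'     # negative match; no trailing newline,
                                                 # since splitlines() already removed them
    start = time.time()
    result = search_str in data                  # average O(1) hash lookup
    print('Search time: %.4fs, found: %s' % (time.time() - start, result))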
2) list :
(i) data = f.read().splitlines()
(ii) result = search_str in data
Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB
3) mmap :
(i) data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
(ii) result = data.find(search_str)
Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA
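One mmap detail worth spelling out: the file has to be opened in binary mode, and on Python 3 the search string must be bytes. A sketch with placeholder names:

    import mmap

    with open('input.txt', 'rb') as f:           # mmap needs a binary-mode file object
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # mmap.find() works on bytes, so the needle is encoded on Python 3.  Anchoring it
    # with '\n' on both sides avoids matching a substring of a longer path (the very
    # first line of the file would need a special case).
    needle = b'\n' + 'Top/Some/Nonexistent/Path'.encode('utf-8') + b'\n'
    result = data.find(needle) != -1             # linear scan over the whole mapping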
4) Hash lookup (using code from @alienhard below):
Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB
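No code is shown for this one, so here is my rough paraphrase of the idea: keep only a hash of each line (plus its file offset) in memory, and re-read the line from disk to confirm a hit. An illustrative sketch, not @alienhard's exact code:

    # Store hash -> file offset instead of the strings themselves; on a candidate hit,
    # seek back into the file and compare the real line to rule out hash collisions.
    # (Colliding hashes overwrite each other here; a real version would keep a list
    # of offsets per hash.)
    def build_index(path):
        index = {}
        with open(path) as f:
            offset = f.tell()
            for line in iter(f.readline, ''):
                index[hash(line.rstrip('\n'))] = offset
                offset = f.tell()
        return index

    def lookup(path, index, search_str):
        offset = index.get(hash(search_str))
        if offset is None:
            return False                          # definitely not in the file
        with open(path) as f:
            f.seek(offset)
            return f.readline().rstrip('\n') == search_str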
5) File search (using code from @EOL below):
with open('input.txt') as f:
print(search_str in f)  # search_str must end with '\n' (or '\r\n'), matching the line terminator used in the file
Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA
6) sqlite (with primary index on url):
Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
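A minimal sqlite3 sketch of this kind of setup (schema and file names are illustrative, not necessarily the exact code I ran):

    import sqlite3

    conn = sqlite3.connect('urls.db')
    conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)')

    # One-time load; the PRIMARY KEY builds the index that makes lookups fast.
    with open('input.txt') as f, conn:
        conn.executemany('INSERT OR IGNORE INTO urls VALUES (?)',
                         ((line.rstrip('\n'),) for line in f))

    def exists(url):
        cur = conn.execute('SELECT 1 FROM urls WHERE url = ? LIMIT 1', (url,))
        return cur.fetchone() is not None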
For my use case, going with the set looks like the best option, as long as I have enough memory available. I was hoping to get some comments on these questions:
Is there a better alternative, e.g. sqlite?
Ways to improve the search time using mmap. I have a 64-bit setup.
[edit] e.g. bloom filters (a rough sketch of what I mean follows this list)
As the file size grows to a couple of GB, is there any way to keep using 'set', e.g. by splitting it into batches?
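To make the bloom-filter idea concrete, here is an illustrative sketch (not tied to any particular library): a fixed bit array plus k hash positions per entry. A lookup answers either "definitely not in the file" or "probably in the file", so it could sit in front of the slow mmap/file/sqlite check and skip it for most negative matches, while using only a few MB of memory.

    import hashlib

    class BloomFilter(object):
        """Tiny Bloom filter sketch: num_bits bits, num_hashes positions per item."""
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            digest = hashlib.md5(item.encode('utf-8')).hexdigest()
            h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
            # double hashing: h1 + i*h2 yields num_hashes different bit positions
            return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # ~10 bits per entry with 7 hashes gives roughly a 1% false-positive rate,
    # i.e. about 7.5 MB of bits for 6 million lines.
    bf = BloomFilter(num_bits=6 * 10**6 * 10, num_hashes=7)
    with open('input.txt') as f:
        for line in f:
            bf.add(line.rstrip('\n'))

    if 'Top/Some/Nonexistent/Path' not in bf:
        print('definitely not in the file, no need to run the slow search')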
[Edit 1] P.S. I need to search frequently and add/remove values, so I cannot use a hash table alone, because I also need to retrieve the modified values later.
Any comments/suggestions are welcome!
[Edit 2] Updated the results with the approaches suggested in the answers.
[Edit 3] Updated with the sqlite results.
Solution: Based on all the profiling and feedback, I think I'll go with sqlite, with method 4 as the second choice. One downside of sqlite is that the database ends up more than twice the size of the original csv file of urls, due to the primary index on url.