题目:
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
刚开始就想到用map进行子串的存储,结果Memory Limit Exceeded
我刚开始这样写的:
class Solution { public: vector<string> findRepeatedDnaSequences(string s) { map<string, int> dnamap; vector<string> result; string::size_type length = s.length(); string sequence; for (size_t i = 0; i + 10 < length; i++) { sequence = s.substr(i, 10); dnamap[sequence] += 1; } for (std::map<string, int>::iterator it = dnamap.begin(); it != dnamap.end(); ++it) { if (it->second > 1) { result.push_back(it->first); } } return result; } };
后来看到https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c的讨论,原来可以将子字符串转成数字,存储在map中,这样就不会超出限制的内存空间了。
原文是这样说的:The main idea is to store the substring as int in map to bypass the memory limits.
下面是我写的C++代码:
class Solution { public: int str2int(string s) { int num = 0; for (int i = 0; i < s.size(); ++i) //s[i]&7取出s[i]的后三个字节,因为用三个字节可以完全表示A,C,G,T num = (num << 3) + (s[i] & 7); return num; } vector<string> findRepeatedDnaSequences(string s) { vector<string> result; unordered_map<int, int> dictionary; string sequence; for (int i = 0; i + 10 <= s.size(); ++i) { sequence = s.substr(i, 10); if (dictionary[str2int(sequence)]++ == 1) result.push_back(sequence); } return result; } };
看到很多朋友这道题都使用的是unordered_map,而不是map,然后我查阅了资料,下面是unordered_map和map的区别:
unordered_map原来是boost库中的容器,在C++11标准中被引入到STL中。unordered_map和map的区别就是,map是按照operator<比较判断元素是否相同,以及比较元素的大小,然后选择合适的位置插入到树中。所以,如果对map进行遍历(中序遍历)的话,输出的结果是有序的。顺序就是按照operator< 定义的大小排序。而unordered_map是计算元素的Hash值,根据Hash值判断元素是否相同。所以,对unordered_map进行遍历,结果是无序的。
用法的区别就是,map的key需要定义operator<。 而boost::unordered_map需要定义hash_value函数并且重载operator==。对于内置类型,如string,这些都不用操心。对于自定义的类型做key,就需要自己重载operator<或者hash_value()了。 当不需要结果排好序时,最好unordered_map。
其实,C++中map对于与Java中的TreeMap,而unordered_map对应于Java中的HashMap。
C#代码:
貌似C#中Dictionary不支持对不存在的key的直接索引,所以要先判断key存不存在。C++中如果不存在会自动添加要索引的key。
public class Solution { private int StrToInt(string s) { int num = 0; for (int i = 0; i < s.Length; ++i) { num = (num << 3) + (s[i] & 7); } return num; } public IList<string> FindRepeatedDnaSequences(string s) { IList<string> result = new List<string>(); IDictionary<int, int> map = new Dictionary<int, int>(); string sequence = string.Empty; int key = 0; for (int i = 0; i + 10 <= s.Length; ++i) { sequence = s.Substring(i, 10); key = StrToInt(sequence); if (map.ContainsKey(key)) { if (map[key]++ == 1) { result.Add(sequence); } } else { map.Add(key, 1); } } return result; } }
Python代码:
注意内置函数ord将单个字符串即字符转成对于的ASCII(Return the integer ordinal of a one-character string)
class Solution: # @param s, a string # @return a list of strings def str2int(self, s): num = 0 for i in range(len(s)): num = (num << 3) + (ord(s[i]) & 7) return num def findRepeatedDnaSequences(self, s): length = len(s) result = [] if length <= 10: return result map = {} i = 0 while (i + 10) <= length: sequence = s[i : i + 10] num = self.str2int(sequence) if map.has_key(num): map[num] = map[num] + 1 if map[num] == 1: result.append(sequence) else: map[num] = 0 i = i + 1 return result