LeetCode: Repeated DNA Sequences, a detailed walkthrough

Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

 

Original problem: https://oj.leetcode.com/problems/repeated-dna-sequences/

 

Straightforward method (TLE)

Algorithm analysis

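Direct string matching: build a next array that stores, for each character of the string, the position where the same character next occurs; during the traversal, matching starts from the positions recorded in next.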

 

To keep the examples short, consider substrings of length 4.

 

case1:

src A C G T A C G T

next [4] [5] [6] [7] [-1] [-1] [-1] [-1]

 

To match the string ACGT, it is then enough to compare the 3 characters that follow position next[0].
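The next array of case 1 can be built as in the minimal sketch below. The helper name buildNext is hypothetical (it does not appear in the original solution), and std::string::npos plays the role of the -1 entries in the table.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: next[i] is the index of the next occurrence of s[i]
// after position i, or std::string::npos (the "-1" in the table above).
std::vector<std::size_t> buildNext(const std::string& s) {
    std::vector<std::size_t> next(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        // std::string::find returns npos when the character does not occur again
        next[i] = s.find(s[i], i + 1);
    }
    return next;
}
```

For the string ACGTACGT this yields 4, 5, 6, 7 for the first four positions and npos for the rest, which agrees with case 1.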

 

case2:

src A C G T A A C G T

next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]

 

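When the character A has several later occurrences, every one of them has to be tried: if the match at next[0] fails, the match at next[next[0]] must be attempted as well.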
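Following the chain of later occurrences could look like the sketch below. The function name firstRepeat and its parameters are assumptions made for illustration; the original solution inlines this logic with a fixed length of 10.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: starting from position pos, follow the chain of later
// occurrences of s[pos] and return the first one at which the len-character
// substring equals the one starting at pos; return npos if there is none.
std::size_t firstRepeat(const std::string& s, std::size_t pos, std::size_t len) {
    // next[i]: index of the next occurrence of s[i] after i, or npos
    std::vector<std::size_t> next(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        next[i] = s.find(s[i], i + 1);
    }

    for (std::size_t cand = next[pos]; cand != std::string::npos; cand = next[cand]) {
        // skip candidates too close to the end, then compare len characters
        if (cand + len <= s.size() && s.compare(pos, len, s, cand, len) == 0) {
            return cand;
        }
    }
    return std::string::npos;
}
```

For case 2, firstRepeat("ACGTAACGT", 0, 4) first tries position 4, fails on AACG, then follows the chain to position 5 and matches there.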

 

case3:

src A A A A A A

next [1] [2] [3] [4] [5] [-1]

 

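Repeats: when the match at next[0] succeeds, next[next[0]] can be set to -1, because the length-4 string starting at next[0] has already been matched and does not need to be matched again. This only reduces the number of duplicates and does not eliminate them, so a set is still needed to store the matched results and remove duplicates.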

 

Time complexity

Building the next array is O(n^2) and the traversal is O(n^2), so the total time complexity is O(n^2).

 

Implementation

#include <cstddef>
#include <set>
#include <string>
#include <vector>

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    const std::size_t n = s.length();

    // a string of length 10 or less cannot contain a repeated 10-letter substring
    if (n <= 10) {
        return rel;
    }

    // next[pos]: position of the next occurrence of s[pos] after pos, or npos.
    // A local vector replaces the raw member pointer, which was deleted
    // uninitialized on the early return and leaked on repeated calls.
    std::vector<std::size_t> next(n);
    for (std::size_t pos = 0; pos < n; ++pos) {
        next[pos] = s.find(s[pos], pos + 1);
    }

    // the set removes the duplicates that the next-array pruning misses
    std::set<std::string> tmpRel;

    for (std::size_t pos = 0; pos + 10 <= n; ++pos) {
        std::size_t nextPos = next[pos];
        while (nextPos != std::string::npos) {
            // compare the 9 characters that follow pos and nextPos
            std::size_t count = 0;
            while (count < 9 && nextPos + count + 1 < n &&
                   s[pos + count + 1] == s[nextPos + count + 1]) {
                ++count;
            }
            if (count == 9) {
                tmpRel.insert(s.substr(pos, 10));
                // the substring starting at nextPos is already recorded
                next[nextPos] = std::string::npos;
            }
            nextPos = next[nextPos];
        }
    }

    rel.assign(tmpRel.begin(), tmpRel.end());
    return rel;
}

 

Hash table plus bit manipulation method

(Hinted at by the problem's Show Tags; runtime 10 ms.)

Algorithm analysis

First, encode A, C, G and T in binary:

A -> 00

C -> 01

G -> 10

T -> 11

 

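Under this encoding, every 10-character string corresponds to a number, and a 10-character string takes 20 bits. An int is typically 4 bytes (32 bits), so it can hold one 10-character string. For example: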

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
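The encoding can be sketched as follows. The helper names codeOf and encode are hypothetical, and the sketch assumes the input contains only the characters A, C, G and T.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code of one nucleotide (A=00, C=01, G=10, T=11).
// Assumes c is one of 'A', 'C', 'G', 'T'.
inline std::uint32_t codeOf(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Hypothetical helper: pack a string of at most 16 nucleotides into a 32-bit
// integer, with the first character in the highest used bits.
inline std::uint32_t encode(const std::string& s) {
    std::uint32_t value = 0;
    for (char c : s) {
        value = (value << 2) | codeOf(c);
    }
    return value;
}
```

With these helpers, encode("ACGTACGTAC") gives 0x1B1B1, which is the 20-bit pattern 00011011000110110001 shown above, and encode("AAAAAAAAAA") gives 0.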

 

A 20-bit binary number has at most 2^20 combinations, so the hash table has size 2^20, i.e. 1024 * 1024; it is declared as bool hashTable[1024 * 1024];

 

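Traversing the string

Moving one character to the right corresponds to shifting the string's int value left by 2 bits, setting its lowest 2 bits to the code of the new character, and finally clearing the 2 bits that were shifted out above the 20-bit window (bits 20 and 21). For example: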

src CAAAAAAAAAC

 

subStr CAAAAAAAAA

int 01000000000000000000

 

subStr AAAAAAAAAC

int 00000000000000000001
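One step of this sliding update can be sketched as below. The function name roll is hypothetical; it masks with the low 20 bits, which has the same effect as clearing bits 20 and 21 after the shift.

```cpp
#include <cstdint>

// Hypothetical helper: slide a 10-character window one position to the right.
// hash is the 20-bit code of the current window, code is the 2-bit code of
// the incoming character (A=0, C=1, G=2, T=3).
inline std::uint32_t roll(std::uint32_t hash, std::uint32_t code) {
    const std::uint32_t mask = (1u << 20) - 1;  // keep only the low 20 bits
    return ((hash << 2) | code) & mask;
}
```

In the example above, the window CAAAAAAAAA has the value 1 << 18; calling roll with the code of C (1) drops the leading C and yields 1, the value of AAAAAAAAAC.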

 

Time complexity

Traversing the string is O(n) and each hash table operation is O(1), so the total time complexity is O(n).

 

Implementation

#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// one flag per possible 20-bit code: has this 10-letter sequence been seen?
bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    if (s.length() <= 10) {
        return rel;
    }

    // map char to 2-bit code, indexed by (char - 'A')
    unsigned char convert[26] = {0};
    convert['A' - 'A'] = 0; // 00
    convert['C' - 'A'] = 1; // 01
    convert['G' - 'A'] = 2; // 10
    convert['T' - 'A'] = 3; // 11

    // reset the table so that repeated calls start clean
    std::memset(hashMap, 0, sizeof(hashMap));

    // encode the first 10 characters
    int hashValue = 0;
    for (std::size_t pos = 0; pos < 10; ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
    }
    hashMap[hashValue] = true;

    // codes already added to the result, so each sequence is reported once
    std::unordered_set<int> strHashValue;

    // slide the window one character at a time
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
        hashValue &= 0xFFFFF; // keep only the low 20 bits

        if (hashMap[hashValue]) {
            if (strHashValue.find(hashValue) == strHashValue.end()) {
                rel.push_back(s.substr(pos - 9, 10));
                strHashValue.insert(hashValue);
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}

 
