187. Repeated DNA Sequences

问题

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

例子

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].

分析

首先要明确,一个长度为n的字符串有且仅有n-9个长度为10的子字符串。
定义两个unordered_set: allSet, findSet. allSet用来保存所有遍历过的长度为10的子字符串,
findSet用来保存已经找到的出现次数不小于2次的长度为10的子字符串。findSet是用来防止结果列表里出现重复的子字符串。

要点

  • 防止检测到重复的子字符串;
  • vector的构造函数可以传入unordered_set的首尾迭代器(迭代器实在是太强大了!)。

时间复杂度

O(n)

空间复杂度

O(n)

代码

class Solution {
public:
    vector findRepeatedDnaSequences(string s) {
        if (s.length() < 11) return vector();

        unordered_set allSet, findSet;
        for (int i = 0; i <= s.length() - 10; i++) {
            string substring = s.substr(i, 10);
            if (allSet.find(substring) != allSet.end() && 
                findSet.find(substring) == findSet.end())
                findSet.insert(substring);
            else
                allSet.insert(substring);
        }
        
        return vector(findSet.begin(), findSet.end());
    }
};

你可能感兴趣的:(187. Repeated DNA Sequences)