The DNA sequence is composed of a series of nucleotides abbreviated as 'A'
, 'C'
, 'G'
, and 'T'
.
"ACGAATTCCG"
is a DNA sequence.When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s
that represents a DNA sequence, return all the 10
-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 105
s[i]
is either 'A'
, 'C'
, 'G'
, or 'T'
.This problem asks us to find all 10-letter-long substrings that appear more than once in a given DNA sequence string. The solutions presented leverage hashing techniques for efficient substring identification and counting.
This approach uses a hash table (or dictionary in Python) to store each 10-letter substring and its count. We iterate through the input string, extracting each 10-letter substring. For each substring, we check if it's already in the hash table. If it is, we increment its count; otherwise, we add it to the table with a count of 1. If a substring's count reaches 2, we add it to the result list.
Time Complexity: O(NK), where N is the length of the string and K is the length of the substring (10 in this case). The nested loops (implicit in string slicing) dominate the time complexity. Space Complexity: O(NK) in the worst case, as the hash table could potentially store all possible 10-letter substrings.
Code Examples: The provided code examples (Python, Java, C++, Go, TypeScript, Rust, Javascript, C#) all follow this approach, differing only in syntax and library usage. They all efficiently use hash table data structures to count occurrences of substrings.
This approach uses a more sophisticated hashing technique – a rolling hash – to optimize the substring comparison. Instead of generating a new hash for each substring, it updates the hash value efficiently by removing the leftmost character and adding the new rightmost character as we slide the window. This significantly reduces redundant calculations.
Time Complexity: O(N), where N is the length of the input string. The algorithm iterates through the string only once. Hashing and comparison are considered O(1) on average. Space Complexity: O(N) in the worst case, as the hash table might store all distinct 10-letter substrings.
Code Example (Go): The Go code example demonstrates this approach. It uses a custom hash function based on the base-4 representation of the DNA sequence (A=0, C=1, G=2, T=3). The rolling hash efficiently updates the hash value as the window slides. Note that this approach requires understanding of number theory and modulo arithmetic to avoid hash collisions effectively. Implementation details can vary based on the choice of hash function and collision handling.
For this problem, the Rabin-Karp approach (Approach 2) offers a superior time complexity of O(N) compared to the brute force O(N*K) of Approach 1. While both have a worst-case space complexity of O(N), the constant factors may make the Rabin-Karp approach slightly less space efficient in practice. However, the significant improvement in time complexity makes it generally preferable for larger input strings. Unless memory is extremely constrained, Approach 2 is recommended.