{x}
blog image

Minimum Genetic Mutation

A gene string can be represented by an 8-character long string, with choices from 'A', 'C', 'G', and 'T'.

Suppose we need to investigate a mutation from a gene string startGene to a gene string endGene where one mutation is defined as one single character changed in the gene string.

  • For example, "AACCGGTT" --> "AACCGGTA" is one mutation.

There is also a gene bank bank that records all the valid gene mutations. A gene must be in bank to make it a valid gene string.

Given the two gene strings startGene and endGene and the gene bank bank, return the minimum number of mutations needed to mutate from startGene to endGene. If there is no such a mutation, return -1.

Note that the starting point is assumed to be valid, so it might not be included in the bank.

 

Example 1:

Input: startGene = "AACCGGTT", endGene = "AACCGGTA", bank = ["AACCGGTA"]
Output: 1

Example 2:

Input: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]
Output: 2

 

Constraints:

  • 0 <= bank.length <= 10
  • startGene.length == endGene.length == bank[i].length == 8
  • startGene, endGene, and bank[i] consist of only the characters ['A', 'C', 'G', 'T'].

Minimum Genetic Mutation

This problem asks for the minimum number of mutations required to transform a starting gene string into an ending gene string, given a bank of valid gene strings. Each mutation involves changing a single character in the gene string.

Approach: Breadth-First Search (BFS)

The most efficient approach is using Breadth-First Search (BFS). BFS guarantees finding the shortest path (minimum mutations) in an unweighted graph. We can represent the problem as a graph where:

  • Nodes: Each valid gene string (including the startGene and endGene) is a node.
  • Edges: An edge connects two nodes if they differ by only one character.

The algorithm proceeds as follows:

  1. Initialization:

    • Create a queue q to store gene strings to be processed. Initially, it contains the startGene along with its mutation distance (0).
    • Create a set visited to track visited gene strings, preventing cycles and redundant exploration. Initially, it contains the startGene.
  2. Iteration:

    • While the queue is not empty:
      • Dequeue a gene string currentGene and its mutation distance distance.
      • If currentGene equals endGene, return distance.
      • Iterate through the bank of valid gene strings:
        • For each neighborGene:
          • Calculate the difference (number of differing characters) between currentGene and neighborGene.
          • If the difference is exactly 1 (one mutation) and neighborGene hasn't been visited:
            • Enqueue neighborGene with a distance of distance + 1.
            • Mark neighborGene as visited.
  3. No Path:

    • If the queue becomes empty without finding endGene, it means no valid mutation path exists. Return -1.

Time and Space Complexity

  • Time Complexity: O(N * L), where N is the number of genes in the bank and L is the length of the genes (8 in this case). In the worst case, we might explore all genes in the bank, and for each gene, we compare it with other genes (at most N times). The character comparison takes constant time (O(L)).

  • Space Complexity: O(N), dominated by the visited set which stores at most all the genes from the bank. The queue can also hold at most all the genes in the worst case.

Code Examples

The code implementations below use BFS as described above. They differ slightly in syntax but follow the same core logic.

Python:

from collections import deque
 
def minMutation(startGene, endGene, bank):
    q = deque([(startGene, 0)])
    visited = {startGene}
    while q:
        gene, dist = q.popleft()
        if gene == endGene:
            return dist
        for nextGene in bank:
            diff = sum(a != b for a, b in zip(gene, nextGene))
            if diff == 1 and nextGene not in visited:
                q.append((nextGene, dist + 1))
                visited.add(nextGene)
    return -1

Java:

import java.util.*;
 
class Solution {
    public int minMutation(String startGene, String endGene, String[] bank) {
        Queue<Pair<String, Integer>> q = new LinkedList<>();
        q.offer(new Pair<>(startGene, 0));
        Set<String> visited = new HashSet<>();
        visited.add(startGene);
 
        while (!q.isEmpty()) {
            Pair<String, Integer> current = q.poll();
            String gene = current.getKey();
            int dist = current.getValue();
 
            if (gene.equals(endGene)) return dist;
 
            for (String nextGene : bank) {
                if (!visited.contains(nextGene) && diff(gene, nextGene) == 1) {
                    q.offer(new Pair<>(nextGene, dist + 1));
                    visited.add(nextGene);
                }
            }
        }
        return -1;
    }
 
    private int diff(String s1, String s2) {
        int count = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (s1.charAt(i) != s2.charAt(i)) count++;
        }
        return count;
    }
    
    private static class Pair<K, V> {
        private K key;
        private V value;
        
        public Pair(K key, V value){
            this.key = key;
            this.value = value;
        }
        
        public K getKey(){
            return key;
        }
        
        public V getValue(){
            return value;
        }
    }
}

//Other languages (C++, Go, TypeScript, Rust) would follow a similar structure, using their respective queue and set data structures. The core algorithm remains the same.