Rosalind — Genome Assembly as Shortest Superstring

This is the first problem in Rosalind, which I could not solve by myself, and needed external help.

“Half of their Length” is the key

First of all, Shortest Common Superstring (SCS) is an NP-hard problem — you might have learned it by now. Usually, it requires generating a directed graph and finding the best path (although still be hard to solve). So, there must be a catch in the problem statement — which is:

The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.

Generating all permutations is not feasible

Okay. I can check up to half of the length for each string. Good. But don’t I still have to check all possible permutations? Just because I got an overlap of more than half of the string length for a string — does it necessarily mean it is the best possible match?

I was not sure — so I wrote my code generating all possible permutations, evaluating the shortest superstring for each, and then finally taking the shortest superstring among all of them. It worked for the given case. But it was taking unrealistic time during the submission case. ☹

Cheat

Then I start Googling for the solution leaving my ego and found one.

Here is the code making it a bit more descriptive:

Just to check if it works, I submitted my solution to Rosalind using this code and it worked!

Why this lazy solution works!!! 😠

It made me rather furious than happy — why it worked at all! It does not check all permutations. Kind of linearly checking overlap till half of the string length, and if match found — just pop and go ahead. Not even caring what if there was another better match down the path!

Is there a flaw in Rosalind system? 🤔

Then I started looking for a reason why it is not necessary to generate all permutations — why I can consider an overlap is the best overlap if it is found within half of the string length, and no need to check for a better match.

I tried to come up with a case to fail the code (which was already working according to Rosalind):

Case:1

AAAAAA
AAAAGT
AAAAAC

Okay! Now I think, for the first string, the code will mistakenly take the second string as best overlap although 3rd string is a better match. The code must fail, and I will email Rosalind complaining an issue in their judging system. Gotcha!

Invalid

The code failed alright— returning no result. But I found this case is invalid! The actual shortest superstring (AAAAAACAAAAGT) has length 13, which does not meet the “half of length” catch in the problem statement — making Case-1 an invalid case.

Valid Case

Then I started looking for a possible value for the 3rd string, which would produce a solution within half of the length — still being a better match than the second string. One such possible case is:

Case: 2

AAAAAA
AAAAGT
AAAAAA

It is a valid test case alright. But the 3rd string in this case is actually a substring in the cumulative shortest superstring (AAAAAAGT) after we finished processing up to the second string.

Oh — That’s it 😄

Then I realized, considering half of the length “catch” if you find a match within the half of length overlap, that has to be the best match. If any third string appears with better overlap with the first one or second one, that would be a substring in the cumulative shortest superstring. If not (Case: 1), the case would be invalid — not “overlapping more than half their length”.

Jack of a few trades, master of none.