Conversation

@Koeng101
Contributor

This pull adds a function called "UniqueSequence", which uses a de Bruijn sequence to generate sequences that share no subsequence of a chosen length. It is useful for creating barcodes on primers and for creating CDSs with maximally distant coding sequences.

When you call the UniqueSequence function, you ask for sequences of a certain length and give a maximum subsequence length; no two output sequences will share any subsequence of that maximum length. In addition, you may pass in strings, or functions on strings, to ban certain sequences (such as a restriction enzyme site or high GC content). Any sequence that fails those checks is removed, and the function still outputs as many passing sequences as possible.
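As a rough illustration of the behaviour described above, here is a minimal Python sketch. It is not the code in this pull request. The name `unique_windows`, its parameters, the convention that a banned function returns `True` to reject a candidate, and the step size between accepted windows are all assumptions made for the example. The input is a hard-coded de Bruijn sequence of order 2 over `ATGC`.

```python
from typing import Callable, List, Sequence


def unique_windows(
    de_bruijn_sequence: str,
    length: int,
    max_subsequence: int,
    banned: Sequence[str] = (),
    banned_functions: Sequence[Callable[[str], bool]] = (),
) -> List[str]:
    """Cut windows out of a de Bruijn sequence of order `max_subsequence`.

    Hypothetical helper: accepted windows overlap by at most
    max_subsequence - 1 characters, so no two of them can share a
    subsequence of length max_subsequence.
    """
    outputs: List[str] = []
    start = 0
    while start + length <= len(de_bruijn_sequence):
        candidate = de_bruijn_sequence[start : start + length]
        rejected = any(site in candidate for site in banned) or any(
            check(candidate) for check in banned_functions
        )
        if rejected:
            # Slide one character and try again.
            start += 1
        else:
            outputs.append(candidate)
            # Assumed step: keep only max_subsequence - 1 characters of overlap.
            start += length - max_subsequence + 1
    return outputs


# Order-2 de Bruijn sequence over ATGC, hard-coded for the example.
ORDER_2_DNA = "AATAGACTTGTCGGCC"

print(unique_windows(ORDER_2_DNA, length=4, max_subsequence=2, banned=["GA"]))
```

Banning "GA" skips the two windows that contain it and resumes one character later, which is the "as many as possible" behaviour in miniature.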

To accomplish this, we create a de Bruijn sequence. To quote: "a de Bruijn sequence of order n on a size-k alphabet A is a cyclic sequence in which every possible length-n string on A occurs exactly once as a substring".

This exact matching is important: it means that any substring of length n in a de Bruijn sequence is unique. Because of that property, we can iterate over a de Bruijn sequence, and every iteration will yield a unique sequence, without any need to check it against the other sequences.
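The uniqueness property can be checked by hand on a small case. The sketch below uses the well-known binary de Bruijn sequence of order 3, `00010111`; the helper name `cyclic_windows` is made up for the example.

```python
from typing import List


def cyclic_windows(sequence: str, size: int) -> List[str]:
    """Return every window of length `size`, wrapping around the end."""
    wrapped = sequence + sequence[: size - 1]
    return [wrapped[i : i + size] for i in range(len(sequence))]


binary_de_bruijn = "00010111"  # order 3 on the alphabet {0, 1}
windows = cyclic_windows(binary_de_bruijn, 3)

print(windows)
# Every one of the 2**3 possible 3-mers appears, and none appears twice.
print(len(windows) == len(set(windows)) == 2 ** 3)
```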

This function is specifically useful for creating barcodes, since it ensures that you will have uniquely generated sequences, while being non-random and enabling easy sorting of those barcodes. This function could also be applied to making unique CDS sequences.

I need this function to generate unique barcodes for Nanopore sequencing of materials.

@Koeng101
Contributor Author

A note on efficiency and purity - the "best" way to implement this would be approximately a traveling salesman problem in which each subsequence below the minimal subsequence is checked between each possible combination of Xmer.

IMO that is kind of overkill, and this function should be more than enough for barcoding purposes.

Contributor Author

@Koeng101 Koeng101 left a comment

Resolved problems + some new code that should make things clearer

@qlty-cloud-legacy

qlty-cloud-legacy bot commented Apr 14, 2021

Code Climate has analyzed commit 9601e30 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

The test coverage on the diff in this pull request is 100.0% (95% is the threshold).

This pull request will bring the total coverage in the repository to 97.1% (0.0% change).

@bebop bebop deleted a comment from codecov bot Apr 22, 2021
@TimothyStiles
Collaborator

A note on efficiency and purity - the "best" way to implement this would be approximately a traveling salesman problem in which each subsequence below the minimal subsequence is checked between each possible combination of Xmer.

IMO that is kind of overkill, and this function should be more than enough for barcoding purposes.

@Koeng101 how annoying would it be to upgrade to the "best" way in the future?

@Koeng101
Contributor Author

Ya know, I might be wrong about that being most efficient. Here is the logic:

  • In the ideal case, you could take the intersection of two strings of length X. The length of this intersection would be the max length shared between those two strings. A naive n^2 algorithm would compute this intersection for every pair of strings to find optimal pairings.
  • However, by the definition of de Bruijn sequences, every subsequence of a certain length Y will be contained within DeBruijn(Y). Therefore, len(DeBruijn(Y)) is the longest sequence from which you can draw sequences whose shared subsequences are at most Y-1 long.
  • So, if you want to minimize the length of identical sequence between strings of length X in DeBruijn(Y), you would actually just iterate on DeBruijn(Y-n) until you can't find enough sequences. Your optimal DeBruijn will occur right before that, I guess at DeBruijn(Y-n+1).

Of course, that changes when there are other functions you need to constrain on (for example, when you're not actually looking for unique sequences, but for sequences under certain constraints). The change therefore would be to add in something that automatically iterates down DeBruijn(Y-n), which shouldn't be too hard.
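To make the "iterate down" idea concrete, here is a small Python sketch of the arithmetic, assuming no banned sequences, a 4-letter alphabet, and windows cut from the linear sequence so that accepted windows overlap by at most order - 1 characters. The function names and the overlap rule are assumptions for illustration, not part of this pull request.

```python
from typing import Optional


def window_capacity(order: int, length: int, alphabet_size: int = 4) -> int:
    """How many windows of `length` fit in a de Bruijn sequence of `order`.

    Assumes the sequence is used linearly (alphabet_size ** order characters)
    and consecutive windows start length - order + 1 characters apart.
    """
    sequence_length = alphabet_size ** order
    if order > length or sequence_length < length:
        return 0
    step = length - order + 1
    return (sequence_length - length) // step + 1


def lowest_sufficient_order(
    start_order: int, needed: int, length: int, alphabet_size: int = 4
) -> Optional[int]:
    """Walk the order downward until there are no longer enough windows."""
    best = None
    for order in range(start_order, 0, -1):
        if window_capacity(order, length, alphabet_size) < needed:
            break
        best = order
    return best


# 96 barcodes of length 20: order 5 gives 63 windows, order 6 gives 272.
print(window_capacity(5, 20), window_capacity(6, 20))
print(lowest_sufficient_order(8, needed=96, length=20))
```

With banned strings or functions in play the capacity is no longer a closed formula, so the loop would have to generate and filter the windows at each order instead.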

@Koeng101
Contributor Author

I fixed everything Code Climate flagged on this pull request except the "complexity" issues. A review would be appreciated @TimothyStiles

Koeng101 and others added 2 commits May 22, 2021 09:47
Co-authored-by: jkh <jonathan@expo.io>
Co-authored-by: jkh <jonathan@expo.io>

@jkhales jkhales left a comment

This is looking pretty good to me. Left a few last suggestions inline.

@Koeng101
Contributor Author

Alright, I do not know what t and p really mean. I've been searching Rosetta Code, looking at Wikipedia, etc. Thoughts, @TimothyStiles?
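One possible reading of the recursive generator shown on Wikipedia, sketched in Python with longer names. Treating `t` as the position currently being filled and `p` as the length of the repeating prefix (the period) is an interpretation, not something confirmed in this thread, and the names `de_bruijn`, `extend`, `position` and `period` are chosen for the example.

```python
from typing import List


def de_bruijn(alphabet: str, order: int) -> str:
    """Build a cyclic de Bruijn sequence by concatenating Lyndon words."""
    alphabet_size = len(alphabet)
    work: List[int] = [0] * (order + 1)  # word under construction, 1-indexed
    emitted: List[int] = []              # output, as indices into alphabet

    def extend(position: int, period: int) -> None:
        # position: slot being filled (often written t)
        # period: length of the prefix that the word repeats (often written p)
        if position > order:
            # Keep the prefix only when its length divides the order.
            if order % period == 0:
                emitted.extend(work[1 : period + 1])
            return
        # Option 1: continue repeating the current prefix.
        work[position] = work[position - period]
        extend(position + 1, period)
        # Option 2: place a larger symbol, which starts a longer prefix.
        for symbol in range(work[position - period] + 1, alphabet_size):
            work[position] = symbol
            extend(position + 1, position)

    extend(1, 1)
    return "".join(alphabet[index] for index in emitted)


print(de_bruijn("01", 3))    # 00010111
print(de_bruijn("ATGC", 2))  # AATAGACTTGTCGGCC
```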

Collaborator

@TimothyStiles TimothyStiles left a comment

More comments and fixing of single variable names are needed.

Should barcodes have their own file/package?

@TimothyStiles
Collaborator

@Koeng101 for single letter variables that you don't understand, it'd be sufficient to give them a longer name related to what you think they do, as long as a reference to the Rosetta Code function exists and you leave as many comments as possible in the control flow.

Single letter variables are literally too small a symbol to make the code visually easy to read. Just this improvement could help a ton with future debugging and help future developers figure out what's going on. Also testing will have to be a little more thorough than normal because from this perspective it's somewhat a black-ish box.

I ran into similar problems with Booth Least Rotation, so for its test I took a plasmid sequence, rotated it for each letter, then hashed it and checked the hash against the hash from the original rotation. Not sure what the equivalent would be here, but it should have a similar-ish feel.
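One candidate for an equivalent black-box check, sketched in Python: treat the generator as opaque and assert that no two output sequences share a subsequence of the chosen length. The helper names below are hypothetical, and the barcode list is a hand-written example rather than real output from this pull request.

```python
from itertools import combinations
from typing import List, Set, Tuple


def kmers(sequence: str, size: int) -> Set[str]:
    """All subsequences of length `size` in a linear sequence."""
    return {sequence[i : i + size] for i in range(len(sequence) - size + 1)}


def shared_subsequences(
    sequences: List[str], size: int
) -> List[Tuple[str, str, List[str]]]:
    """List every pair of sequences that shares a subsequence of `size`."""
    clashes = []
    for first, second in combinations(sequences, 2):
        common = kmers(first, size) & kmers(second, size)
        if common:
            clashes.append((first, second, sorted(common)))
    return clashes


barcodes = ["AATA", "ACTT", "TGTC", "CGGC"]
# A passing test expects an empty list of clashes.
print(shared_subsequences(barcodes, 2))
```

The check is quadratic in the number of sequences, which should be fine for a test even though the generator itself avoids pairwise comparison.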

@TimothyStiles TimothyStiles merged commit f6e7dec into prime May 25, 2021
@delete-merged-branch delete-merged-branch bot deleted the deBruijnSequences branch May 25, 2021 20:17
6 participants