UniqueSequence function #87
A note on efficiency and purity: the "best" way to implement this would be approximately a traveling salesman problem in which each subsequence below the minimal subsequence is checked between each possible combination of Xmer. IMO that is kind of overkill, and this function should be more than enough for barcoding purposes.
Koeng101
left a comment
Resolved problems + some new code that should make things clearer
Code Climate has analyzed commit 9601e30 and detected 1 issue on this pull request. Here's the issue category breakdown:
The test coverage on the diff in this pull request is 100.0% (95% is the threshold). This pull request will bring the total coverage in the repository to 97.1% (0.0% change). View more on Code Climate.
@Koeng101 how annoying would it be to upgrade to the "best" way in the future?
Ya know, I might be wrong about that being most efficient. Here is the logic:
Of course, that changes when there are other functions you need to constrain on (for example, when you're not actually looking for unique sequences, but for sequences under certain constraints). The change therefore would be to add in something that automatically iterates down
I fixed everything Code Climate flagged that is specific to this pull request, except the "complexity" issues. A review would be appreciated, @TimothyStiles.
Co-authored-by: jkh <jonathan@expo.io>
…on CreateBarcodes
jkhales
left a comment
This is looking pretty good to me. Left a few last suggestions inline.
Alright, I do not know what
TimothyStiles
left a comment
More comments are needed, and the single-letter variable names need fixing.
Should barcodes have their own file/package?
@Koeng101 for single-letter variables that you don't understand, it'd be sufficient to give them a longer name related to what you think they do, as long as the reference to the Rosetta Code function exists and you leave as many comments as possible in the control flow. Single-letter variables are literally too small a symbol to make the code visually easy to read. Just this improvement could help a ton with future debugging and help future developers figure out what's going on. Also, testing will have to be a little more thorough than normal, because from this perspective it's somewhat of a black-ish box. I ran into similar problems with Booth Least Rotation, so for its test I took a plasmid sequence, rotated it for each letter, then hashed it and checked the hash against the hash from the original rotation. Not sure what the equivalent would be here, but it should have a similar-ish feel.
This pull request adds a function called "UniqueSequence", an interesting little function built on de Bruijn sequences. It is useful for creating barcodes on primers and for creating CDSs with maximally distant coding sequences.
When you call the UniqueSequence function, you ask for sequences of a certain length along with a maximum subsequence length, and no two output sequences share a subsequence of that maximum length. In addition, you may pass in strings, or functions on strings, to ban certain sequences (such as a restriction enzyme site or high GC content). Any sequence that fails those checks is removed, while the function still outputs as many passing sequences as possible.
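To make that contract concrete, here is a minimal Python sketch of the two checks described above: dropping candidates that hit a banned string or a banned function, and testing whether two sequences share a subsequence of a given length. It is not the code in this pull request; the names `filter_candidates`, `share_kmer`, and `gc_fraction` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def filter_candidates(
    candidates: Iterable[str],
    banned_substrings: Iterable[str] = (),
    banned_checks: Iterable[Callable[[str], bool]] = (),
) -> List[str]:
    """Keep only candidates that contain no banned substring and for
    which every banned check returns False (hypothetical helper)."""
    banned_substrings = list(banned_substrings)
    banned_checks = list(banned_checks)
    kept = []
    for seq in candidates:
        if any(banned in seq for banned in banned_substrings):
            continue  # e.g. contains a restriction enzyme site
        if any(check(seq) for check in banned_checks):
            continue  # e.g. GC content too high
        kept.append(seq)
    return kept


def share_kmer(a: str, b: str, k: int) -> bool:
    """True if a and b have at least one length-k substring in common."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_a for i in range(len(b) - k + 1))
```

For example, `filter_candidates(seqs, ["GAATTC"], [lambda s: gc_fraction(s) > 0.8])` would drop anything containing an EcoRI site or with more than 80% GC, and `share_kmer` can serve as a brute-force check in tests that no two outputs overlap.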
To accomplish this, we create a de Bruijn sequence. To quote: "a de Bruijn sequence of order n on a size-k alphabet A is a cyclic sequence in which every possible length-n string on A occurs exactly once as a substring".
This exact matching is important: it means that every substring of length n in a de Bruijn sequence is unique. Because of that property, we can iterate over a de Bruijn sequence, and every iteration yields a unique sequence, without any need to check it against the other sequences.
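The property can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation in this pull request: `de_bruijn` uses the standard Lyndon-word construction, and `unique_chunks` is a hypothetical helper that slices the sequence with a step of `length - n + 1`, so that consecutive chunks overlap by at most `n - 1` characters and therefore never share a length-n substring.

```python
from typing import List


def de_bruijn(alphabet: str, n: int) -> str:
    """Return a de Bruijn sequence of order n over alphabet, built by
    concatenating Lyndon words whose length divides n."""
    k = len(alphabet)
    a = [0] * (n + 1)  # working buffer of alphabet indices
    sequence: List[int] = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


def unique_chunks(alphabet: str, length: int, n: int) -> List[str]:
    """Slice a de Bruijn sequence of order n into chunks of the given
    length such that no two chunks share a length-n substring.
    (Hypothetical helper; the step size is an assumption of this sketch.)"""
    if length < n:
        raise ValueError("length must be at least n")
    seq = de_bruijn(alphabet, n)
    step = length - n + 1  # consecutive chunks overlap by n - 1 characters
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]
```

Calling `unique_chunks("ATGC", 8, 3)` slices the 64-character order-3 sequence into 8-mers. Because each length-3 window of the sequence appears only once, no comparison between chunks is needed to guarantee uniqueness.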
This function is specifically useful for creating barcodes, since it ensures that you will have uniquely generated sequences, while being non-random and enabling easy sorting of those barcodes. This function could also be applied to making unique CDS sequences.
I need this function to generate unique barcodes for Nanopore sequencing of materials.