UniqueSequence function #87

Koeng101 · 2020-12-19T19:12:38Z

This pull request adds a function called "UniqueSequence", an interesting little function that uses a de Bruijn sequence to accomplish its goal. It is useful for creating barcodes on primers and for creating CDSs with maximally distant coding sequences.

When you call UniqueSequence, you ask for sequences of a certain length and give a maximum subsequence length; no two output sequences share a subsequence of that maximum length. In addition, you may pass in strings, or functions on strings, to ban certain sequences (such as a restriction enzyme site or high GC content). Any sequence that fails those checks is removed, while as many passing sequences as possible are still output.
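To make the ban mechanism concrete, here is a minimal sketch of filtering a list of candidate sequences. The names `filterCandidates` and `gcContentOK`, the 75% GC threshold, and the convention that a check returns true for an acceptable sequence are all assumptions for illustration, not the signatures in this pull request.

```go
package main

import (
	"fmt"
	"strings"
)

// gcContentOK is a hypothetical check: it passes sequences whose GC
// fraction is at most 75%.
func gcContentOK(sequence string) bool {
	gc := strings.Count(sequence, "G") + strings.Count(sequence, "C")
	return float64(gc)/float64(len(sequence)) <= 0.75
}

// filterCandidates keeps every candidate that contains no banned sequence
// and passes every check. Input order is preserved.
func filterCandidates(candidates, bannedSequences []string, checks []func(string) bool) []string {
	var kept []string
candidateLoop:
	for _, candidate := range candidates {
		for _, banned := range bannedSequences {
			if strings.Contains(candidate, banned) {
				continue candidateLoop
			}
		}
		for _, check := range checks {
			if !check(candidate) {
				continue candidateLoop
			}
		}
		kept = append(kept, candidate)
	}
	return kept
}

func main() {
	candidates := []string{"AACAGATC", "GGTCTCAA", "GCGCGCGC", "ATGCATTA"}
	// GGTCTC is the BsaI recognition site.
	kept := filterCandidates(candidates, []string{"GGTCTC"}, []func(string) bool{gcContentOK})
	fmt.Println(kept)
}
```

In this example the second candidate is dropped for containing the banned site and the third for its GC content, leaving the other two.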

To accomplish this, we create a de Bruijn sequence. To quote: "a de Bruijn sequence of order n on a size-k alphabet A is a cyclic sequence in which every possible length-n string on A occurs exactly once as a substring".

This exact matching is important: it means that every length-n window of a de Bruijn sequence is unique. Because of that property, we can iterate over a de Bruijn sequence, and every window we take will be a unique sequence, without any need to check it against the other sequences.
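As a rough sketch of that idea (not necessarily the code in this pull request), the following builds a de Bruijn sequence by concatenating Lyndon words, the approach commonly listed on Wikipedia and Rosetta Code, and then cuts it into windows. The names `deBruijn` and `tile` are hypothetical, and the overlap of order-1 characters between consecutive windows is an assumption chosen so that no two windows contain the same order-length substring.

```go
package main

import "fmt"

// deBruijn returns a de Bruijn sequence of the given order over alphabet.
// position and period correspond to the variables often written as t and p
// in reference listings: position is the index being filled in the current
// word, and period is the length of its current Lyndon prefix.
func deBruijn(alphabet string, order int) string {
	word := make([]int, order+1)
	var out []byte
	var extend func(position, period int)
	extend = func(position, period int) {
		if position > order {
			// Emit the prefix only when its length divides the order.
			if order%period == 0 {
				for _, index := range word[1 : period+1] {
					out = append(out, alphabet[index])
				}
			}
			return
		}
		// First repeat the existing pattern, then try every larger symbol.
		word[position] = word[position-period]
		extend(position+1, period)
		for next := word[position-period] + 1; next < len(alphabet); next++ {
			word[position] = next
			extend(position+1, position)
		}
	}
	extend(1, 1)
	return string(out)
}

// tile cuts windows of the given length from sequence. Consecutive windows
// overlap by order-1 characters, so they never share an order-length substring.
func tile(sequence string, length, order int) []string {
	stride := length - (order - 1)
	if stride < 1 {
		return nil
	}
	var windows []string
	for start := 0; start+length <= len(sequence); start += stride {
		windows = append(windows, sequence[start:start+length])
	}
	return windows
}

func main() {
	sequence := deBruijn("ACGT", 2)
	fmt.Println(sequence)             // AACAGATCCGCTGGTT
	fmt.Println(tile(sequence, 4, 2)) // [AACA AGAT TCCG GCTG GGTT]
}
```

The five windows in the example share no 2-mer with each other, and no pairwise comparison was needed to guarantee that.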

This function is specifically useful for creating barcodes, since it ensures that you will have uniquely generated sequences, while being non-random and enabling easy sorting of those barcodes. This function could also be applied to making unique CDS sequences.

I need this function to generate unique barcodes for Nanopore sequencing of materials.

Koeng101 · 2020-12-20T15:22:44Z

A note on efficiency and purity - the "best" way to implement this would be approximately a traveling salesman problem in which each subsequence below the minimal subsequence is checked between each possible combination of Xmer.

IMO that is kind of overkill, and this function should be more than enough for barcoding purposes.

Koeng101

Resolved problems + some new code that should make things clearer

qlty-cloud-legacy · 2021-04-14T16:23:06Z

Code Climate has analyzed commit 9601e30 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category	Count
Complexity	1

The test coverage on the diff in this pull request is 100.0% (95% is the threshold).

This pull request will bring the total coverage in the repository to 97.1% (0.0% change).

TimothyStiles · 2021-04-22T08:16:50Z

> A note on efficiency and purity - the "best" way to implement this would be approximately a traveling salesman problem in which each subsequence below the minimal subsequence is checked between each possible combination of Xmer.
>
> IMO that is kind of overkill, and this function should be more than enough for barcoding purposes.

@Koeng101 how annoying would it be to upgrade to the "best" way in the future?

Koeng101 · 2021-04-22T14:17:06Z

Ya know, I might be wrong about that being most efficient. Here is the logic:

In the ideal case, you could take the intersection of two strings of length X. The length of this intersection would be the max length shared between those two strings. A naive n^2 algorithm would compare every string against every other string to find optimal pairings.
However, by the definition of de Bruijn sequences, every subsequence of a certain length Y is contained within DeBruijn(Y). Therefore, len(DeBruijn(Y)) is the maximum total length from which you can draw sequences whose shared subsequences are at most Y-1 long.
So, if you want to minimize the length of identical sequence between strings of length X in DeBruijn(Y), you would actually just iterate on DeBruijn(Y-n) until you can't find enough sequences. Your optimal DeBruijn will occur right before that, I guess at DeBruijn(Y-n+1).

Of course, that changes when there are other functions you need to constrain on (for example, when you're not actually looking for unique sequences, but for sequences under certain constraints). The change therefore would be to add in something that automatically iterates down DeBruijn(Y-n), which shouldn't be too hard.
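For the unconstrained case, the search over orders can be sketched with arithmetic alone. This assumes barcodes are cut from a linear de Bruijn sequence of length k^Y with consecutive barcodes overlapping by Y-1 characters; `barcodeCount` and `smallestOrder` are hypothetical names, and the search here counts upward to the first order that yields enough barcodes, which lands on the same order as iterating downward until there are too few. Inputs are assumed small enough that k^Y fits in an int.

```go
package main

import "fmt"

// barcodeCount returns how many barcodes of the given length fit into a
// linear de Bruijn sequence of the given order when consecutive barcodes
// overlap by order-1 characters.
func barcodeCount(alphabetSize, order, length int) int {
	if order < 1 || length < order {
		return 0
	}
	sequenceLength := 1
	for i := 0; i < order; i++ {
		sequenceLength *= alphabetSize
	}
	if sequenceLength < length {
		return 0
	}
	stride := length - (order - 1)
	return (sequenceLength-length)/stride + 1
}

// smallestOrder returns the lowest order that yields at least needed
// barcodes, or 0 if no order up to length does.
func smallestOrder(alphabetSize, length, needed int) int {
	for order := 1; order <= length; order++ {
		if barcodeCount(alphabetSize, order, length) >= needed {
			return order
		}
	}
	return 0
}

func main() {
	// 96 barcodes of length 8 over ACGT: orders 1 to 4 give 0, 2, 10 and 50
	// barcodes, and order 5 gives 255.
	fmt.Println(smallestOrder(4, 8, 96)) // 5
}
```

Once banned sequences or check functions are involved, the count is no longer closed-form, and each order would have to be generated and filtered before counting.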

Koeng101 · 2021-04-25T14:42:51Z

I fixed everything Code Climate flagged for this pull request except the "complexity" issues. A review would be appreciated @TimothyStiles

Co-authored-by: jkh <jonathan@expo.io>

jkhales

This is looking pretty good to me. Left a few last suggestions inline.

Koeng101 · 2021-05-23T15:30:29Z

Alright, I do not know what t and p really mean. I've been searching Rosetta Code, looking at Wikipedia, etc. Thoughts, @TimothyStiles?

TimothyStiles

More comments and fixing of single variable names are needed.

Should barcodes have their own file/package?

TimothyStiles · 2021-05-23T16:23:25Z

@Koeng101 for single-letter variables that you don't understand, it'd be sufficient to give them a longer name related to what you think they do, as long as a reference to the Rosetta Code function exists and you leave as many comments as possible in the control flow.

Single letter variables are literally too small a symbol to make the code visually easy to read. Just this improvement could help a ton with future debugging and help future developers figure out what's going on. Also testing will have to be a little more thorough than normal because from this perspective it's somewhat a black-ish box.

I ran into similar problems with Booth Least Rotation, so for its test I took a plasmid sequence, rotated it for each letter, then hashed it and checked the hash against the hash from the original rotation. Not sure what the equivalent would be here, but it should have a similar-ish feel.
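One possible equivalent here, offered only as a sketch, is to test the defining property directly instead of the internals: a candidate of length k^n is a de Bruijn sequence if all of its cyclic length-n windows are distinct. `isDeBruijn` is a hypothetical helper name, and it treats the generator as a black box.

```go
package main

import (
	"fmt"
	"strings"
)

// isDeBruijn reports whether sequence is a cyclic de Bruijn sequence of the
// given order over alphabet: it must have length k^order, use only alphabet
// symbols, and contain no repeated cyclic window of length order.
func isDeBruijn(sequence, alphabet string, order int) bool {
	if order < 1 || len(alphabet) < 2 {
		return false
	}
	expected := 1
	for i := 0; i < order; i++ {
		expected *= len(alphabet)
	}
	if len(sequence) != expected {
		return false
	}
	for _, symbol := range sequence {
		if !strings.ContainsRune(alphabet, symbol) {
			return false
		}
	}
	// Append the first order-1 symbols so windows can wrap around the end.
	cyclic := sequence + sequence[:order-1]
	seen := make(map[string]bool, expected)
	for start := 0; start < len(sequence); start++ {
		window := cyclic[start : start+order]
		if seen[window] {
			return false
		}
		seen[window] = true
	}
	return true
}

func main() {
	fmt.Println(isDeBruijn("00010111", "01", 3))           // true
	fmt.Println(isDeBruijn("AACAGATCCGCTGGTT", "ACGT", 2)) // true
	fmt.Println(isDeBruijn("AACAGATCCGCTGGTA", "ACGT", 2)) // false: AA repeats
}
```

Because k^n distinct windows over a k-letter alphabet must cover every possible length-n string, the uniqueness check alone is enough once the length and symbols are confirmed.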

Keoni Gandall added 4 commits December 19, 2020 10:32

Add UniqueSequence function

fe12446

Added functionality for banned sequences

1fc0d78

Added functionality to allow for banned functions, like GC content functions

dff4323

Increased to 100% test coverage

b697b23

TimothyStiles previously requested changes Mar 24, 2021

NickNolan reviewed Mar 24, 2021

dexho reviewed Mar 24, 2021

Added updates for code review

87ab5f6

Koeng101 commented Apr 14, 2021

qlty-cloud-legacy bot reviewed Apr 14, 2021

bebop deleted a comment from codecov bot Apr 22, 2021

Fixed style issues

ca3417f

Keoni Gandall added 4 commits April 25, 2021 16:15

Merge branch 'prime' into deBruijnSequences

9601e30

Added comment in response to @dexho comment on PR

a94de35

Merge branch 'prime' into deBruijnSequences

6a272d1

Fixed typo

773281c

TimothyStiles requested changes May 20, 2021
