-
-
Notifications
You must be signed in to change notification settings - Fork 73
All iupac variants #106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All iupac variants #106
Conversation
sequence.go
Outdated
| } | ||
| } | ||
|
|
||
| func allVariantsIUPAC(seq string) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make public (capitalize allVariantsIUPAC -> AllVariantsIUPAC ) and create export comment starting as // AllVariantsIUPAC (insert description here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyStiles Does the export comment go on the same line as the function declaration?
sequence.go
Outdated
|
|
||
| // the following functions Iter and iter, derive from github.com/schwarmco/go-cartesian-product | ||
| // which uses interfaces, so I modified it to use runes | ||
| func Iter(params ...[]rune) chan []rune { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make private.
|
This is awesome @tijeco! I've added a couple things and will review more later today when I get the chance. Commenting looks good. Variables could have more descriptive names. Are there any standard test cases we could use to test it? |
|
@TimothyStiles thanks! From easy_dna the sample was as follows >>> all_iupac_variants('ATN')
>>> ['ATA', 'ATC', 'ATG', 'ATT']The output from the port I wrote won't be sorted, and I think because it uses go routines the order could be different each time. So a test would either need to sort the output, or the function would need to return sorted output. I don't know which is preferred. |
|
@tijeco In this case I think we can just sort the output for the test. There may be something we can do about it being returned from the function sorted but we don't have to do that right now. Do you think we could find a stronger test case? Something like "The quick brown fox jumped over the lazy dog", but for degenerate base pairs? N, B, V, etc. |
|
@TimothyStiles Sounds good! I will work on a test set and put it in a separate pull request for sequence_test.go |
|
@tijeco could you put it in this pull request? Would be easier for me to manage. All you have to do is push commits to the PR in the branch and it should auto update. |
|
@TimothyStiles Will do! Thanks! I don't know if I'll have time to work on it today/this weekend, but I definitely will early next week. |
|
So here is a potential problem: RAM issues. Each goroutine takes about 2Kb of memory https://stackoverflow.com/questions/22326765/go-memory-consumption-with-many-goroutines If you have 10N, for example, this would spawn (potentially) 1,048,576 goroutines, taking up nearly 2GB of memory. Meanwhile, if you did an iterative function without goroutines, you could possibly reduce that amount of RAM by a large amount because a new goroutine wouldn't be spawned for every sequence branch. Since append functions do not take much CPU, you might get equivalent performance with a much simpler stack that takes far less RAM. Could you try some harder benchmarks to see where the concurrent method breaks down? |
|
@Koeng101 thanks for pointing that out! I'll be honest, go routines and concurrency is still a new thing to me, so I'm glad to have this opportunity to work on this. I'll try to make a non-concurrent version of the function as well and do some benchmarks to see how memory/cpu usage varies between the two. |
|
@tijeco can you pull down the latest from |
|
@TimothyStiles I think I did the thing? |
| allVariants := make([][]rune, possibleVariants) // this is the 2D slice where all variants will be stored | ||
| variantHolders := make([]rune, possibleVariants*len(inList)) // this is an empty slice with a length totaling the size of all input characters | ||
| variantChoices := make([]int, len(inList)) // these will be all the possible variants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these allocations necessary? Can you run them without?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These allocations are helping to reduce the total memory footprint. Full disclosure: this is a retool of the cartesian product from Rosetta code https://rosettacode.org/wiki/Cartesian_product_of_two_or_more_lists
This is for #92, a port of all_iupac_variants from easy_dna.
IUPAC codes are stored as a map of rune slices. The cartesian products are then generated using a retooled version of the function found in github.com/schwarmco/go-cartesian-product, which uses recursion and go routines, so it should scale quite nicely to large sequences.